# BioConceptVec Tutorial

This tutorial provides a fundamental introduction to our BioConceptVec models. It illustrates (1) how to load the model, (2) how to get concept vectors, and (3) how to find the top K similar concepts. [Source](https://github.com/ncbi/BioConceptVec/blob/master/bioconcept_tutorial.ipynb)

## 1. Prerequisites

Install gensim (for example, with `pip install gensim`); it is used to load BioConceptVec.

In [1]:
from gensim.models import KeyedVectors
import json
import os
import sys
import numpy as np

## 2. Load BioConceptVec

Follow data.ipynb to download the data. Then set the paths below to the embedding file and the JSON file.

In [2]:
YOUR_BIOCONCEPTVEC_PATH = '../embeddings/bioconceptvec_glove.bin'
YOUR_JSON_PATH = '../embeddings/concept_glove.json'

Let's create a function using gensim to load BioConceptVec.

In [3]:
def load_embedding(path, binary):
    embedding = KeyedVectors.load_word2vec_format(path, binary=binary)
    print('embedding loaded from', path)
    return embedding

Let's load any one version of BioConceptVec. Loading might take a few minutes.

In [4]:
model = load_embedding(YOUR_BIOCONCEPTVEC_PATH, binary=True)

embedding loaded from ../embeddings/bioconceptvec_glove.bin


## 3. Load concept vectors only (an alternative approach)

If you only need the concept vectors and not the vectors for other common words, you can instead load the JSON file, which contains concept vectors only.

In [5]:
with open(YOUR_JSON_PATH) as json_file:  
    concept_vectors = json.load(json_file)
print('load', len(concept_vectors), 'concepts')

load 402712 concepts


In [None]:
import pandas as pd
keys = concept_vectors.keys()
# Turn keys into a dataframe
df = pd.DataFrame(list(keys)[:500], columns=['concept'])
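
The concept IDs shown in this tutorial start with a type prefix (for example `Gene_`, `Disease_`, `Chemical_`, `SNP_`). As a minimal sketch, assuming the type is the text before the first underscore, the keys could be summarised per type as follows. The helper `count_concept_types` and the list `toy_ids` are hypothetical names used only for illustration; in practice you would pass `concept_vectors.keys()`.

```python
from collections import Counter

def count_concept_types(concept_ids):
    """Count concept IDs by the text before the first underscore (assumed to be the type)."""
    return Counter(cid.split('_', 1)[0] for cid in concept_ids)

# Toy stand-in for concept_vectors.keys(); the IDs are copied from this tutorial.
toy_ids = [
    'Gene_2997', 'Gene_3586', 'Disease_MESH_D001845',
    'Chemical_MESH_C511970', 'SNP_rs11944405', 'ProteinMutation_p_I123V',
]
type_counts = count_concept_types(toy_ids)
print(type_counts.most_common())
```

Splitting on the first underscore only keeps multi-part IDs such as `Disease_MESH_D001845` under a single `Disease` type.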

## 4. Get concept vectors

Now you can specify a concept ID to get its vector.

For the complete BioConceptVec model, you can use:

In [6]:
concept_vec = model['Gene_2997']
concept_vec

array([-0.205002, -0.858642,  0.293454,  0.420697,  0.819198, -0.014607,
        0.648112,  0.375288,  0.402923, -0.486146,  0.061995, -0.597637,
       -0.18606 , -0.030133,  0.057708,  0.0928  , -0.231601, -0.102536,
        0.113617,  0.096256, -0.716546,  0.129387,  0.097503,  0.516359,
        0.150461, -0.128564,  0.617396, -0.70626 , -0.214449, -0.609342,
       -0.610461,  0.202288,  0.509137,  0.601489,  0.49775 ,  0.046024,
        0.477214,  0.170123,  0.8417  ,  0.35232 ,  0.084167,  0.006656,
        0.052083,  0.27484 , -0.222268,  0.738473, -0.021782,  0.917502,
        0.208649,  0.21989 , -0.262444,  0.110046,  0.838382,  0.243271,
        0.584187, -0.17198 , -0.35021 ,  0.255658,  0.495955, -0.438681,
       -0.480301, -0.998345, -0.303257,  0.60441 , -0.235623,  0.19677 ,
       -0.311972,  0.406385,  0.067994,  0.487211, -0.016605, -0.42778 ,
       -1.023635,  0.342992, -0.103875,  0.076305,  0.890983,  0.101807,
       -0.019141, -0.469659, -0.081776, -0.074963, 

The JSON file loads as a dictionary, so the lookup uses the same key. The values are plain lists, so wrap the result in `np.array` to get a NumPy array:

In [7]:
np.array(concept_vectors['Gene_2997'])

array([-0.20500199, -0.85864198,  0.29345399,  0.420697  ,  0.81919801,
       -0.014607  ,  0.648112  ,  0.37528801,  0.40292299, -0.486146  ,
        0.061995  , -0.597637  , -0.18606   , -0.030133  ,  0.057708  ,
        0.0928    , -0.231601  , -0.102536  ,  0.113617  ,  0.096256  ,
       -0.716546  ,  0.12938701,  0.097503  ,  0.51635897,  0.150461  ,
       -0.128564  ,  0.617396  , -0.70626003, -0.214449  , -0.60934198,
       -0.610461  ,  0.202288  ,  0.50913697,  0.60148901,  0.49775001,
        0.046024  ,  0.47721401,  0.170123  ,  0.84170002,  0.35231999,
        0.084167  ,  0.006656  ,  0.052083  ,  0.27484   , -0.222268  ,
        0.738473  , -0.021782  ,  0.91750199,  0.20864899,  0.21989   ,
       -0.26244399,  0.110046  ,  0.83838201,  0.24327099,  0.58418697,
       -0.17197999, -0.35021001,  0.255658  ,  0.49595499, -0.43868101,
       -0.48030099, -0.99834502, -0.30325699,  0.60440999, -0.235623  ,
        0.19677   , -0.31197199,  0.406385  ,  0.067994  ,  0.48

Similarly, you can use the other concept IDs provided in the JSON file to get more concept vectors.
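
Looking up an ID that is not present raises a `KeyError`, so when IDs come from an external list it can help to guard the lookup. The sketch below is one possible helper, not part of BioConceptVec; `get_vector` and `toy_vectors` are hypothetical names. It only relies on `in` and `[]`, so it works with the dictionary loaded from the JSON file and should also work with the gensim model.

```python
import numpy as np

def get_vector(store, concept_id):
    """Return the vector for concept_id as a NumPy array, or None if the ID is absent.

    `store` can be any object that supports `in` and `[]`, such as the
    dictionary loaded from the JSON file.
    """
    if concept_id not in store:
        return None
    return np.asarray(store[concept_id], dtype=np.float32)

# Toy stand-in for concept_vectors; real vectors are much longer.
toy_vectors = {'Gene_2997': [0.1, -0.2, 0.3]}
found = get_vector(toy_vectors, 'Gene_2997')
missing = get_vector(toy_vectors, 'Gene_0000000')  # made-up ID that is not in the toy dict
print(found, missing)
```

Returning `None` rather than raising lets a caller filter out unknown IDs in a single pass.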

## 5. Compute the similarity between concepts

Now we can use the concept vectors to find similar concepts.

First, let's create a function to calculate the cosine similarity:

In [8]:
def cosine(a, b):
    norm1 = np.linalg.norm(a)
    norm2 = np.linalg.norm(b)
    return np.dot(a, b) / (norm1 * norm2)

For example, take the interleukin 10 gene (Gene_3586). Which gene is more similar to it: the interleukin 4 gene (Gene_3565) or HUWE1 (Gene_10075)?

In [9]:
cosine(model['Gene_3586'], model['Gene_3565'])

0.88254434

In [10]:
cosine(model['Gene_3586'], model['Gene_10075'])

0.1340888

The results show that the interleukin 4 gene is more similar to the interleukin 10 gene (0.88 versus 0.13). Indeed, they share GO terms.
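
To build intuition for these scores, the `cosine` function can be sanity-checked on toy vectors. This is a small sketch with made-up two- and three-dimensional vectors (not BioConceptVec data): vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their lengths.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    norm1 = np.linalg.norm(a)
    norm2 = np.linalg.norm(b)
    return np.dot(a, b) / (norm1 * norm2)

# Made-up vectors chosen so the expected behaviour is easy to verify by hand.
same_direction = cosine(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(np.array([1.0, 0.0]), np.array([-3.0, 0.0]))
print(same_direction, orthogonal, opposite)
```

Note that the function divides by the norms, so it is undefined for an all-zero vector.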

## 6. Find top K similar terms via BioConceptVec

You can also use the embedding to find the top K most similar terms:

In [9]:
model.most_similar(positive=['Gene_432248'], topn=3)

[('Gene_399342', 0.6710575222969055),
 ('Gene_399275', 0.6615089774131775),
 ('Gene_378586', 0.6413488984107971)]
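
If you loaded only the JSON dictionary from Section 3, there is no `most_similar` method available. As a rough sketch, a brute-force top K search by cosine similarity can be written with NumPy. The function `top_k_similar`, the dictionary `toy_vectors`, and the IDs `Gene_A` to `Gene_D` are hypothetical; they are not part of BioConceptVec or gensim.

```python
import numpy as np

def top_k_similar(vectors, query_id, k=3):
    """Return the k concepts most cosine-similar to query_id, excluding the query itself."""
    ids = [cid for cid in vectors if cid != query_id]
    matrix = np.array([vectors[cid] for cid in ids], dtype=np.float64)
    query = np.asarray(vectors[query_id], dtype=np.float64)
    # Cosine similarity of the query against every row at once.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(ids[i], float(sims[i])) for i in best]

# Toy stand-in for concept_vectors with made-up IDs and 2-dimensional vectors.
toy_vectors = {
    'Gene_A': [1.0, 0.0],
    'Gene_B': [0.9, 0.1],
    'Gene_C': [0.0, 1.0],
    'Gene_D': [-1.0, 0.0],
}
neighbours = top_k_similar(toy_vectors, 'Gene_A', k=2)
print(neighbours)
```

This builds a dense matrix of all vectors on every call, which is fine for a toy dictionary but may be slow and memory-hungry for the full set of concepts; building the matrix once and reusing it would be a natural refinement.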

## Read in equations

### Read in mappings

In [4]:
import pickle   
# Load from pickle file
with open('/Users/danielgeorge/Documents/work/ml/bioconceptvec-explorer/bioconceptvec-explorer/mappings/concept_descriptions.pkl', 'rb') as f:
    concept_descriptions = pickle.load(f)
concept_descriptions

{'Disease_MESH_D001845': 'Bone Cysts',
 'Gene_2799940': 'matrix protein M2-1;matrix protein M2-2',
 'Gene_3726_54751': 'JunB proto-oncogene, AP-1 transcription factor subunit',
 'Disease_MESH_D000652': 'Amniotic Band Syndrome',
 'ProteinMutation_p_R450H_RS_189261858': ['uncharacterized LOC101928462',
  'thyroid stimulating hormone receptor'],
 'ProteinMutation_p_I123V': 'ProteinMutation I123V',
 'SNP_rs11944405': ['late endosomal/lysosomal adaptor, MAPK and MTOR activator 3'],
 'Chemical_MESH_C511970': 'M58373',
 'ProteinMutation_p_Q106E': 'ProteinMutation Q106E',
 'Gene_56941_116255': '5-hydroxymethylcytosine binding, ES cell specific',
 'SNP_rs12536544': ['Rac family small GTPase 1'],
 'Chemical_MESH_C558709': '16-(4-(3-(imidazol-1-yl)propoxy)-3-methoxybenzylidene)-5-androstene-3b,17b-diol',
 'Gene_100513670': 'non-SMC condensin I complex subunit G',
 'Gene_30244': 'engrailed homeobox 1a',
 'Chemical_MESH_C543977': 'wogonin-5-0-beta-D-glucuronide methyl ester',
 'Chemical_MESH_C02344

In [1]:
import pandas as pd

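def process_line(line):
    """Parse one ' | '-separated line into its equation parts, solution, and similarity."""
    # Each of the three fields looks like '<label>: <value>'; keep only the value.
    parts = line.strip().split(' | ')

    equation = parts[0].split(': ')[1]
    solution = parts[1].split(': ')[1]
    similarity = float(parts[2].split(': ')[1])

    # The equation has seven space-separated tokens, e.g. 'A - B + C = x'.
    equation_parts = equation.split(' ')
    part1, operation1, part2, operation2, part3, equals, result = equation_parts

    return part1, operation1, part2, operation2, part3, result, solution, similarity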

data = []

with open('../equations.txt', 'r') as file:
    for line in file:
        data.append(process_line(line))

df = pd.DataFrame(data, columns=['Part1', 'Operation1', 'Part2', 'Operation2', 'Part3', 'Result', 'Solution', 'Similarity'])
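
Each parsed row has the form `Part1 - Part2 + Part3 = x`, together with a `Solution` and a `Similarity`. How `equations.txt` was generated is not shown here. One common way to evaluate such an equation is to do the arithmetic on the vectors and then pick the concept whose vector is closest by cosine similarity. The sketch below works under that assumption; `solve_equation`, `toy_vectors`, and the `Concept_*` IDs are hypothetical.

```python
import numpy as np

def solve_equation(vectors, part1, part2, part3):
    """Return (concept_id, cosine) for the concept nearest to part1 - part2 + part3."""
    target = (np.asarray(vectors[part1], dtype=np.float64)
              - np.asarray(vectors[part2], dtype=np.float64)
              + np.asarray(vectors[part3], dtype=np.float64))
    best_id, best_sim = None, -np.inf
    # Assumption: the operands themselves stay in the candidate pool.
    for cid, vec in vectors.items():
        vec = np.asarray(vec, dtype=np.float64)
        sim = float(np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if sim > best_sim:
            best_id, best_sim = cid, sim
    return best_id, best_sim

# Toy vectors with made-up IDs; B and C are identical, so A - B + C equals A.
toy_vectors = {
    'Concept_A': [1.0, 0.0],
    'Concept_B': [0.0, 1.0],
    'Concept_C': [0.0, 1.0],
    'Concept_D': [1.0, 0.1],
}
solution, similarity = solve_equation(toy_vectors, 'Concept_A', 'Concept_B', 'Concept_C')
print(solution, similarity)
```

Because the operands are not excluded, the nearest concept can be one of the inputs, as it is in this toy case. Excluding them is a design choice that changes which solutions are reported.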

In [2]:
df

| | Part1 | Operation1 | Part2 | Operation2 | Part3 | Result | Solution | Similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | Gene_91624 | - | Gene_341346 | + | Gene_4316_4318_4322 | x | Gene_91624 | 0.970477 |
| 1 | Chemical_MESH_C106874 | - | DNAMutation_c_54C_TAG | + | DNAMutation_c_1615delG_RS_797044819 | x | Chemical_MESH_C106874 | 0.971954 |
| 2 | DNAMutation_c_5191C_A | - | Chemical_MESH_C526999 | + | Chemical_MESH_C012737 | x | Chemical_MESH_C012737 | 0.992852 |
| 3 | Chemical_MESH_C517756 | - | ProteinMutation_p_Y428F | + | Disease_MESH_C562832 | x | Disease_MESH_C562832 | 0.976333 |
| 4 | Chemical_MESH_D017638 | - | ProteinMutation_p_M274I | + | Chemical_MESH_C033333 | x | Chemical_MESH_D017638 | 0.990920 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 131 | Chemical_MESH_C059394 | - | Chemical_MESH_C587979 | + | Chemical__4_AP20187 | x | Chemical_MESH_C059394 | 0.967140 |
| 132 | Chemical_MESH_C039995 | - | Chemical_MESH_C541201 | + | Chemical_MESH_C541190 | x | Chemical_MESH_C039995 | 0.957338 |
| 133 | Species_72446 | - | Chemical_MESH_C014863 | + | Chemical_MESH_C106129 | x | Species_72446 | 0.979094 |
| 134 | Chemical_MESH_C071262 | - | Chemical_MESH_C000588636 | + | Gene_819320 | x | Chemical_MESH_C071262 | 0.976970 |


In [5]:
# Map Part1, Part2, Part3, and Solution to their human-readable descriptions.
# Keep both the original ID and the mapped description in the dataframe.

def get_best_match(concept_descriptions, concept):
    # Exact dictionary lookup; fall back to the raw concept ID when no description exists.
    return concept_descriptions.get(concept, concept)

df['Part1_mapped'] = df['Part1'].apply(lambda x: get_best_match(concept_descriptions, x))
df['Part2_mapped'] = df['Part2'].apply(lambda x: get_best_match(concept_descriptions, x))
df['Part3_mapped'] = df['Part3'].apply(lambda x: get_best_match(concept_descriptions, x))
df['Solution_mapped'] = df['Solution'].apply(lambda x: get_best_match(concept_descriptions, x))
# Change the order of the columns to have the mapped ones next to the original ones
df = df[['Part1', 'Part1_mapped', 'Operation1', 'Part2', 'Part2_mapped', 'Operation2', 'Part3', 'Part3_mapped', 'Result', 'Solution', 'Solution_mapped', 'Similarity']]
df

| | Part1 | Part1_mapped | Operation1 | Part2 | Part2_mapped | Operation2 | Part3 | Part3_mapped | Result | Solution | Solution_mapped | Similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gene_91624 | nexilin F-actin binding protein | - | Gene_341346 | single-pass membrane protein with coiled-coil ... | + | Gene_4316_4318_4322 | matrix metallopeptidase 7 | x | Gene_91624 | nexilin F-actin binding protein | 0.970477 |
| 1 | Chemical_MESH_C106874 | N-(2-(1-naphthalenyl)ethyl)cyclobutanecarboxamide | - | DNAMutation_c_54C_TAG | DNAMutation c 54C TAG | + | DNAMutation_c_1615delG_RS_797044819 | DNAMutation c 1615delG RS 797044819 | x | Chemical_MESH_C106874 | N-(2-(1-naphthalenyl)ethyl)cyclobutanecarboxamide | 0.971954 |
| 2 | DNAMutation_c_5191C_A | DNAMutation c 5191C A | - | Chemical_MESH_C526999 | N-(2-(1-((4-fluorophenyl)sulfonyl)-1H-indol-4-... | + | Chemical_MESH_C012737 | anatabine | x | Chemical_MESH_C012737 | anatabine | 0.992852 |
| 3 | Chemical_MESH_C517756 | 1-cyclopentyl-2-(5-fluoro-2-methoxy-phenyl)-1-... | - | ProteinMutation_p_Y428F | ProteinMutation Y428F | + | Disease_MESH_C562832 | Pulmonary Atresia with Intact Ventricular Septum | x | Disease_MESH_C562832 | Pulmonary Atresia with Intact Ventricular Septum | 0.976333 |
| 4 | Chemical_MESH_D017638 | Asbestos, Crocidolite | - | ProteinMutation_p_M274I | ProteinMutation M274I | + | Chemical_MESH_C033333 | N-acetyl-S-(1-phenyl-3-hydroxypropyl)cysteine | x | Chemical_MESH_D017638 | Asbestos, Crocidolite | 0.990920 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 131 | Chemical_MESH_C059394 | tyrosyl-arginyl-phenylalanyl-alanine | - | Chemical_MESH_C587979 | pyranonigrin E | + | Chemical__4_AP20187 | Chemical 4 AP20187 | x | Chemical_MESH_C059394 | tyrosyl-arginyl-phenylalanyl-alanine | 0.967140 |
| 132 | Chemical_MESH_C039995 | 3-aminopyridine-1,N(6)-ethenoadenine dinucleot... | - | Chemical_MESH_C541201 | sclerienone A | + | Chemical_MESH_C541190 | 6-(1-hydroxyethyl)-4-methoxy-3-methyl-2H-pyran... | x | Chemical_MESH_C039995 | 3-aminopyridine-1,N(6)-ethenoadenine dinucleot... | 0.957338 |
| 133 | Species_72446 | Labeo catla | - | Chemical_MESH_C014863 | 1,4,8-trichlorodibenzofuran | + | Chemical_MESH_C106129 | NNC 191228 | x | Species_72446 | Labeo catla | 0.979094 |
| 134 | Chemical_MESH_C071262 | SK&F 82958 | - | Chemical_MESH_C000588636 | tortuosene A | + | Gene_819320 | Protein kinase superfamily protein | x | Chemical_MESH_C071262 | SK&F 82958 | 0.976970 |


In [6]:
# Turn df into an excel file
df.to_excel('equations.xlsx')
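
The `concept_descriptions` output above shows that some values are lists of names rather than single strings (for example the `SNP_` and some `ProteinMutation_` entries). A single string per cell is usually easier to read and filter in a spreadsheet, so one option is to flatten such values before mapping and exporting. The helper `describe` and the dictionary `toy_descriptions` below are hypothetical and only sketch the idea; the entries are copied from the output shown earlier.

```python
def describe(concept_descriptions, concept_id, sep='; '):
    """Return one display string for concept_id, falling back to the ID itself."""
    description = concept_descriptions.get(concept_id, concept_id)
    # Join list-valued descriptions into a single string.
    if isinstance(description, (list, tuple)):
        return sep.join(str(item) for item in description)
    return str(description)

# Toy stand-in for concept_descriptions, using entries shown in this tutorial.
toy_descriptions = {
    'Disease_MESH_D001845': 'Bone Cysts',
    'ProteinMutation_p_R450H_RS_189261858': [
        'uncharacterized LOC101928462',
        'thyroid stimulating hormone receptor',
    ],
}
single = describe(toy_descriptions, 'Disease_MESH_D001845')
joined = describe(toy_descriptions, 'ProteinMutation_p_R450H_RS_189261858')
unknown = describe(toy_descriptions, 'Gene_2997')  # not in the toy dict, so the ID is returned
print(single, joined, unknown, sep='\n')
```

It could be used in place of `get_best_match` in the `apply` calls, for example `df['Part1'].apply(lambda x: describe(concept_descriptions, x))`.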