## BioSentVec

This notebook provides a basic introduction to our BioSentVec models. It illustrates  
(1) how to load the model,  
(2) an example function to preprocess sentences,  
(3) an example application that uses the model and  
(4) further resources for using the model more broadly.

### 1. Prerequisites

Please download the BioSentVec model and install the required Python libraries (`sent2vec`, `nltk`, and `scipy`).

In [1]:
import sent2vec
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from scipy.spatial import distance

### 2. Load BioSentVec model

Download the model from [this location](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioSentVec_PubMed_MIMICIII-bigram_d700.bin) and set `model_path` to the path where you saved the file. Loading the model for the first time may take a while.



In [None]:
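model_path = 'YOUR_MODEL_LOCATION'  # replace with the path to the downloaded .bin file
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
else:
    # Only report success when load_model did not raise
    print('model successfully loaded')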
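
Because the model file is large and loading can take a while, it may be worth confirming that `model_path` points to an existing file before calling `load_model`. The helper below is a minimal sketch using only the standard library; `check_model_path` is a hypothetical name introduced for illustration and is not part of `sent2vec`.

```python
import os


def check_model_path(path):
    """Return the file size in bytes, or raise if the path is not a regular file.

    Hypothetical helper for illustration; not part of sent2vec.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No model file found at: {path}")
    return os.path.getsize(path)
```

Calling `check_model_path(model_path)` before loading surfaces a mistyped path immediately instead of after a long wait.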

In [None]:
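import nltk
nltk.download('punkt')      # tokenizer data used by word_tokenize
nltk.download('stopwords')  # stop-word list used by stopwords.words('english')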

### 3. Preprocess sentences
There is no one-size-fits-all way to preprocess sentences. The code below is a representative example, consistent with the preprocessing approach we used when training the BioSentVec models.

In [4]:
stop_words = set(stopwords.words('english'))
def preprocess_sentence(text):
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.lower()

    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]

    return ' '.join(tokens)

An example of using the preprocess_sentence function:

In [5]:
sentence = preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')
print(sentence)

breast cancers her2 amplification higher risk cns metastasis poorer prognosis
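
To see what the character-level replacements in `preprocess_sentence` do on their own, the sketch below applies the same four `replace` calls and lowercasing without NLTK. `pad_punctuation` is a hypothetical name used only for this illustration, and plain whitespace splitting stands in for `word_tokenize`, so the tokens may differ from what the full function produces.

```python
def pad_punctuation(text):
    """Apply the same replacements as preprocess_sentence, without tokenizing.

    Hypothetical helper for illustration only.
    """
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace("'", " ' ")
    return text.lower()


# Whitespace splitting is an assumption here; the notebook uses word_tokenize.
tokens = pad_punctuation("HER2/neu-positive tumours.").split()
print(tokens)  # ['her2', '/', 'neu-positive', 'tumours', '.']
```

Padding the slash and the period with spaces turns them into separate tokens, which is what allows the punctuation filter in `preprocess_sentence` to drop them.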


### 4. Retrieve a sentence vector
Once a sentence is preprocessed, we can pass it to the BioSentVec model to retrieve a vector representation of the sentence.

In [None]:
sentence_vector = model.embed_sentence(sentence)
print(sentence_vector)

### 5. Compute sentence similarity
In this section, we demonstrate how to compute the similarity between a pair of sentences using the BioSentVec model. We first use the code examples above to get vector representations of the sentences. Then we compute the cosine similarity between the two vectors.

In [None]:
sentence_vector1 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.'))
sentence_vector2 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification are more aggressive, have a higher risk of CNS metastasis, and poorer prognosis.'))

cosine_sim = 1 - distance.cosine(sentence_vector1.flatten(), sentence_vector2.flatten())
print('cosine similarity:', cosine_sim)

Here is an example of a sentence that is less similar to the first one.

In [None]:
sentence_vector3 = model.embed_sentence(preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.'))
cosine_sim = 1 - distance.cosine(sentence_vector1.flatten(), sentence_vector3.flatten())
print('cosine similarity:', cosine_sim)

Similarity and distance are inversely related: the more similar two sentences are, the smaller the distance between their vectors, and vice versa. For the cosine measure used here, the relationship is simply similarity = 1 - distance.
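
The relationship between cosine similarity and cosine distance can be checked on small toy vectors without loading the model. The following is a minimal pure-Python sketch; `cosine_similarity` and `cosine_distance` are names introduced here for illustration, and the three short vectors are made-up stand-ins for sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

print('a vs b:', cosine_similarity(a, b), cosine_distance(a, b))
print('a vs c:', cosine_similarity(a, c), cosine_distance(a, c))
```

Vectors pointing in the same direction have a similarity close to 1 and a distance close to 0, while orthogonal vectors have a similarity of 0 and a distance of 1.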