<a href="https://colab.research.google.com/github/Yohnjparra/CosineSimilarity-HealthDisparities/blob/main/Generative_AI_Health_Disparities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Domain-Specific BERT Models




# 1. Introduction

A number of researchers have created their own domain-specific language models. These models are typically created by training the BERT architecture *from scratch* on a domain-specific corpus rather than the general-purpose English text corpus used to train the original BERT model (some, such as BioBERT, instead start from the original BERT weights and continue pre-training on the domain corpus). This leads to a model with vocabulary and word embeddings better suited than the original BERT model to domain-specific NLP problems. Some examples include:

- SciBERT (biomedical and computer science literature corpus)
- FinBERT (financial services corpus)
- BioBERT (biomedical literature corpus)
- ClinicalBERT (clinical notes corpus)
- mBERT (corpora from multiple languages)
- patentBERT (patent corpus)






## 1.1 Why not do my own pre-training?

**1. Pre-training BERT requires a huge corpus**

BERT-base is a 12-layer neural network with roughly 110 million weights. This enormous size is key to BERT's impressive performance. Training such a complex model (and expecting it to work) requires an enormous dataset, though, on the order of 1B words. Wikipedia is a suitable corpus, for example, with its millions of articles. For the majority of applications, I assume you won't have a dataset with that many documents.
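
To see where the "roughly 110 million" figure comes from, here is a back-of-the-envelope sketch that adds up the weights of a BERT-base encoder. The sizes used (30,522-token vocabulary, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions) are assumptions taken from the published BERT-base configuration, and `bert_param_count` is a hypothetical helper written for this illustration, not part of any library.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Approximate weight count of a BERT encoder (no task-specific head)."""
    layer_norm = 2 * hidden  # one scale and one bias vector

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections
    # (weights + biases), followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: expand to `intermediate`, project back to `hidden`,
    # followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: a single dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count()
print('{:,} weights'.format(total))
```

Under these assumptions the total comes to about 109.5 million, most of it in the 12 transformer layers, with the embedding table contributing roughly another 23 million.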

**2. Huge Model + Huge Corpus = Lots of GPUs**

Pre-training BERT is expensive. The cost of pre-training is a whole subject of discussion, and there's been a lot of work done on bringing the cost down, but a *single* pre-training experiment could easily cost you thousands of dollars in GPU or TPU time. 

That's why these domain-specific pre-trained models are so interesting. Other organizations have footed the bill to produce and share these models which, while not pre-trained on your specific dataset, may at least be much closer to yours than "generic" BERT.


# 2. Using a Community-Submitted Model

## 2.1. Library of Models

In [None]:
!pip install transformers



The `transformers` library includes classes for different model architectures (e.g., `BertModel`, `AlbertModel`, `RobertaModel`, ...). With whatever model you're using, it needs to be loaded with the correct class (based on its architecture), which may not be immediately apparent. 

Luckily, the `transformers` library has a solution for this, demonstrated in the following cell. These "Auto" classes will choose the correct architecture for you! 

That's a nice feature, but I'd still prefer to know what I'm working with, so I'm printing out the class names (which show that SciBERT uses the original BERT classes).
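
To make the idea behind the "Auto" classes concrete, here is a toy sketch of dispatching on an architecture name stored in a checkpoint's configuration. Everything in it (`ToyBertModel`, `MODEL_REGISTRY`, `toy_auto_model`) is hypothetical and written only for illustration; it is not how the `transformers` library is actually implemented.

```python
# Toy stand-ins for real model classes (hypothetical, for illustration only).
class ToyBertModel:
    pass


class ToyAlbertModel:
    pass


class ToyRobertaModel:
    pass


# Map an architecture name to the class that knows how to build it.
MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "albert": ToyAlbertModel,
    "roberta": ToyRobertaModel,
}


def toy_auto_model(config):
    """Pick the model class named by the config and instantiate it."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError("Unknown architecture: {}".format(model_type))
    return MODEL_REGISTRY[model_type]()


# A SciBERT-like checkpoint declares the plain BERT architecture,
# so the dispatcher hands back the BERT class.
model = toy_auto_model({"model_type": "bert"})
print(type(model).__name__)
```

The point of the sketch is only that the caller never names the class; the checkpoint's own metadata does.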


In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert_model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

print('scibert_tokenizer is type:', type(scibert_tokenizer))
print('    scibert_model is type:', type(scibert_model))


# 3. Comparing SciBERT and BERT

## 3.1. Comparing Vocabularies

The most apparent difference between SciBERT and the original BERT should be the model's vocabulary, since they were trained on such different corpora.

Both tokenizers have a vocabulary of roughly 30,000 tokens that was automatically built based on the most frequently seen words and subword units in their respective corpora. 

The authors of SciBERT note:

> "The resulting token overlap between [BERT vocabulary] and
[SciBERT vocabulary] is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts."

Let's load the original BERT as well and do some of our own comparisons.
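
As a rough sketch of what a token-overlap figure like this measures, the helper below computes the share of one vocabulary that also appears in another. The exact definition used in the SciBERT paper isn't given here, so dividing by the size of the first vocabulary is an assumption, and the two tiny vocabularies are made up for illustration.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Fraction of the tokens in `vocab_a` that also appear in `vocab_b`."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    shared = set_a & set_b
    return len(shared) / len(set_a)


# Toy vocabularies (illustrative only, not the real token lists).
toy_bert_vocab = ["the", "patient", "health", "di", "##spar", "##ities", "1973"]
toy_scibert_vocab = ["the", "patient", "health", "disparities",
                     "determinants", "##<0.05", "(2.5"]

overlap = vocab_overlap(toy_bert_vocab, toy_scibert_vocab)
print('Toy overlap: {:.0%}'.format(overlap))
```

Once both tokenizers are loaded, the same function could be applied to `bert_tokenizer.vocab` and `scibert_tokenizer.vocab`, since iterating over those dictionaries yields the token strings.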

*Side note: BERT used a "WordPiece" model for tokenization, whereas SciBERT employs a newer approach called "SentencePiece", but the difference is mostly cosmetic. I cover SentencePiece in more detail in our [ALBERT eBook](https://www.chrismccormick.ai/offers/HaABTJQH).*


In [None]:
from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

Let's apply both tokenizers to health disparities text and see how they compare. 

I took the sentence below from ChatGPT-4. 

In [None]:
text = "During a routine check-up, a patient from a historically marginalized community discussed their struggles with accessing affordable healthcare, highlighting the ongoing issue of health disparities in the doctor's room"

# Split the sentence into tokens, with both BERT and SciBERT.
bert_tokens = bert_tokenizer.tokenize(text)
scibert_tokens = scibert_tokenizer.tokenize(text)

# Pad out the scibert list to be the same length.
while len(scibert_tokens) < len(bert_tokens):
    scibert_tokens.append("")

# Label the columns.
print('{:<12} {:<12}'.format("BERT", "SciBERT"))
print('{:<12} {:<12}'.format("----", "-------"))

# Display the tokens.
for tup in zip(bert_tokens, scibert_tokens):
    print('{:<12} {:<12}'.format(tup[0], tup[1]))


BERT         SciBERT     
----         -------     
during       during      
a            a           
routine      routine     
check        check       
-            -           
up           up          
,            ,           
a            a           
patient      patient     
from         from        
a            a           
historically historically
marginal     marginal    
##ized       ##ized      
community    community   
discussed    discussed   
their        their       
struggles    struggle    
with         ##s         
access       with        
##ing        accessing   
affordable   affordable  
healthcare   healthcare  
,            ,           
highlighting highlighting
the          the         
ongoing      ongoing     
issue        issue       
of           of          
health       health      
di           disparities 
##spar       in          
##ities      the         
in           doctor      
the          '           
doctor       s           
'           

SciBERT apparently has embeddings for the words 'accessing' and 'disparities', whereas BERT had to break these down into two and three subwords, respectively. (Remember that the '##' in a token is just a way to flag it as a subword that is not the first subword.) Going the other way, BERT keeps 'struggles' whole while SciBERT splits it into 'struggle' and '##s'. 
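
The '##' convention can be made concrete with a small sketch that glues subword pieces back onto the word they continue. `merge_wordpieces` is a hypothetical helper written for this illustration; the sample tokens are taken from the BERT column above.

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens into whole words using the '##' marker."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # A continuation piece: append it to the previous word.
            words[-1] += token[2:]
        else:
            # A piece that starts a new word.
            words.append(token)
    return words


bert_pieces = ["health", "di", "##spar", "##ities", "in", "the", "doctor"]
merged = merge_wordpieces(bert_pieces)
print(merged)
```

Here the three pieces 'di', '##spar', and '##ities' collapse back into the single word 'disparities'.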

I asked ChatGPT-4 for some terms used in health disparities research--check out the different numbers of tokens required by each model.

In [None]:
# Use pandas just for table formatting.
import pandas as pd

# Some terms used in health disparities.
words = ['Food desert', 
         'Medical redlining',
         'Social determinants of health',
         'Health equity',
         'Intersectionality',
         'Infant mortality rate',
         'Life expectancy',
         'Poverty rate',
         'Health insurance coverage rate',
         'Disparity ratio',
         'non-compliant',
         'Difficult'
         ]

# For each term...
for word in words:
    
    # Print it out
    print('\n\n', word, '\n')

    # Start a list of tokens for each model, with the first one being the model name.
    list_a = ["BERT:"]
    list_b = ["SciBERT:"]

    # Run both tokenizers.
    list_a.extend(bert_tokenizer.tokenize(word))
    list_b.extend(scibert_tokenizer.tokenize(word))

    # Pad the lists to the same length.
    while len(list_a) < len(list_b):
        list_a.append("")
    while len(list_b) < len(list_a):
        list_b.append("")

    # Wrap them in a DataFrame to display a pretty table.
    df = pd.DataFrame([list_a, list_b])
    
    display(df)




Food desert

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | BERT: | food | desert |
| 1 | SciBERT: | food | desert |

Medical redlining

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | BERT: | medical | red | ##lining | |
| 1 | SciBERT: | medical | red | ##lin | ##ing |

Social determinants of health

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | BERT: | social | deter | ##mina | ##nts | of | health |
| 1 | SciBERT: | social | determinants | of | health | | |

Health equity

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | BERT: | health | equity |
| 1 | SciBERT: | health | equity |

Intersectionality

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | BERT: | intersection | ##ality |
| 1 | SciBERT: | intersection | ##ality |

Infant mortality rate

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | BERT: | infant | mortality | rate |
| 1 | SciBERT: | infant | mortality | rate |

Life expectancy

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | BERT: | life | expect | ##ancy |
| 1 | SciBERT: | life | expectancy | |

Poverty rate

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | BERT: | poverty | rate |
| 1 | SciBERT: | poverty | rate |

Health insurance coverage rate

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | BERT: | health | insurance | coverage | rate |
| 1 | SciBERT: | health | insurance | coverage | rate |

Disparity ratio

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | BERT: | di | ##spar | ##ity | ratio |
| 1 | SciBERT: | disparity | ratio | | |

non-compliant

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | BERT: | non | - | compliant |
| 1 | SciBERT: | non | - | compliant |

Difficult

| | 0 | 1 |
|---|---|---|
| 0 | BERT: | difficult |
| 1 | SciBERT: | difficult |


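SciBERT represents several of these terms ("determinants", "expectancy", "disparity") in fewer tokens, which seems like a good sign! Most of the terms tokenize identically in both models, though, and "redlining" actually takes one more token in SciBERT.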
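
One way to summarize the tables above is to count how many tokens each model needs per term. The sketch below does this for a handful of the terms, using the tokenizations recorded in the output above rather than re-running the tokenizers; `token_savings` is a hypothetical helper written for this illustration.

```python
# Tokenizations copied from the tables above: (BERT tokens, SciBERT tokens).
recorded = {
    "Medical redlining": (["medical", "red", "##lining"],
                          ["medical", "red", "##lin", "##ing"]),
    "Social determinants of health": (
        ["social", "deter", "##mina", "##nts", "of", "health"],
        ["social", "determinants", "of", "health"]),
    "Life expectancy": (["life", "expect", "##ancy"],
                        ["life", "expectancy"]),
    "Disparity ratio": (["di", "##spar", "##ity", "ratio"],
                        ["disparity", "ratio"]),
    "Health equity": (["health", "equity"], ["health", "equity"]),
}


def token_savings(tokenizations):
    """Tokens saved by SciBERT per term (negative means SciBERT needs more)."""
    return {term: len(bert) - len(scibert)
            for term, (bert, scibert) in tokenizations.items()}


savings = token_savings(recorded)
for term, saved in savings.items():
    print('{:<32} {:+d}'.format(term, saved))
```

A positive number means SciBERT covers the term in fewer tokens, zero means both models split it the same way, and a negative number means SciBERT needs more.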

In [None]:
import random

# ======== BERT ========
bert_examples = []

count = 0

# For each token in the vocab...
for token in bert_tokenizer.vocab:
    
    # If there's a digit in the token...
    # (But don't count those reserved tokens, e.g. "[unused59]")
    if any(i.isdigit() for i in token) and not ('unused' in token):
        # Count it.
        count += 1

        # Keep ~1% as examples to print.
        if random.randint(0, 100) == 1:
            bert_examples.append(token)

# Calculate the count as a percentage of the total vocab.
prcnt = float(count) / len(bert_tokenizer.vocab)

# Print the result.
print('In BERT:    {:>5,} tokens ({:.2%}) include a digit.'.format(count, prcnt))

# ======== SciBERT ========
scibert_examples = []
count = 0

# For each token in the vocab...
for token in scibert_tokenizer.vocab:

    # If there's a digit in the token...
    # (But don't count those reserved tokens, e.g. "[unused59]")
    if any(i.isdigit() for i in token) and not ('unused' in token):
        # Count it.
        count += 1

        # Keep ~1% as examples to print.
        if random.randint(0, 100) == 1:
            scibert_examples.append(token)
   

# Calculate the count as a percentage of the total vocab.
prcnt = float(count) / len(scibert_tokenizer.vocab)

# Print the result.
print('In SciBERT: {:>5,} tokens ({:.2%}) include a digit.'.format(count, prcnt))

print('')
print('Examples from BERT:', bert_examples)
print('Examples from SciBERT:', scibert_examples)

In BERT:    1,109 tokens (3.63%) include a digit.
In SciBERT: 3,345 tokens (10.76%) include a digit.

Examples from BERT: ['1973', '1904', '1870', '113', '1811', '1758', '420', '176', '159', '412', '1649', '930']
Examples from SciBERT: ['##–80', '##004', '(0.8', '##–28', ')2', '2012),', '3),', '##37', '##-13', '(2.5', '##(6)', '##301', '77.', '51', '##95', '##2,', '##4;', '1976', '2005)', '70%', '10.', '98,', '306', '##-3-3', '##22,', '##–30', '##–18', '##<0.05', '17%', '2011),']


So it looks like:
- SciBERT has about 3x as many tokens with digits. 
- BERT's tokens are whole integers, and many look like they could be dates. 
- SciBERT's number tokens are much more diverse. They are often subwords, and many include decimal places or other symbols like `%` or `(`.
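
These observations can be checked a little more systematically by sorting digit-bearing tokens into rough categories. The sketch below is one possible way to do that; `classify_digit_token` and its category names are assumptions made for this illustration, and the sample tokens are a few of the examples printed above.

```python
from collections import Counter


def classify_digit_token(token):
    """Label a token as whole/subword and as a pure integer or not."""
    is_subword = token.startswith("##")
    core = token[2:] if is_subword else token
    kind = "integer" if core.isdigit() else "digits+symbols"
    return ("subword " if is_subword else "whole ") + kind


# A few of the example tokens printed above.
bert_samples = ['1973', '1904', '113']
scibert_samples = ['##004', '(0.8', '##-13', '70%', '1976', '##<0.05']

bert_counts = Counter(classify_digit_token(t) for t in bert_samples)
scibert_counts = Counter(classify_digit_token(t) for t in scibert_samples)

print('BERT:   ', dict(bert_counts))
print('SciBERT:', dict(scibert_counts))
```

Running the same classifier over every digit-bearing token in each vocabulary, instead of these few samples, would give the full breakdown.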

## 3.2. Comparing Embeddings


**Semantic Similarity on Health Disparities Text**

To create a simple demonstration of SciBERT's value, Nick and I figured we could create a semantic similarity example where we show that SciBERT is better able to recognize similarities and differences within some scientific text than generic BERT. 

We implemented this idea, but the examples we tried don't appear to show SciBERT as being better! 


**Our Approach**

In our semantic similarity task, we have three pieces of text--call them "query", "A", and "B"--that are all on health disparities topics. We pick these such that the query text is always more similar to A than to B. 

Here's an example:

* query: "During a routine check-up, a patient from a historically marginalized community discussed their struggles with accessing affordable healthcare, highlighting the ongoing issue of health disparities in the doctor's room."
* A: ""
* B: ""

The idea is that `query` and `A` address the same aspect of the topic, whereas `B` addresses a different one. However, recognizing the similarity between `query` and `A` should require some domain knowledge rather than just overlapping words.  

Our intuition was that SciBERT, being trained on biomedical text, would better distinguish the similarities than BERT. 





**Interpreting Cosine Similarities**

When comparing two different models for semantic similarity, it's best to look at how well they *rank* the similarities, and not to compare the specific cosine similarity *values* across the two models.

It's for this reason that we've structured our example as "is `query` more similar to `A` or to `B`?"
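
A small sketch of what "comparing by rank" could look like in practice: for each model, count how often `query` comes out closer to `A` than to `B`. The scores below are the rounded values printed in the sentence comparison examples later in this notebook; `ranking_accuracy` is a hypothetical helper, and treating a tie as a miss is an assumption.

```python
# (sim(query, A), sim(query, B)) for the two sentence examples in this notebook.
recorded_scores = {
    "SciBERT": [(0.96, 0.96), (0.95, 0.96)],
    "BERT":    [(0.91, 0.90), (0.85, 0.91)],
}


def ranking_accuracy(pairs):
    """Fraction of examples where A is ranked strictly above B."""
    correct = sum(1 for sim_a, sim_b in pairs if sim_a > sim_b)
    return correct / len(pairs)


accuracy = {model: ranking_accuracy(pairs)
            for model, pairs in recorded_scores.items()}

for model, acc in accuracy.items():
    print('{:<8} ranked A above B in {:.0%} of examples'.format(model, acc))
```

Note that SciBERT's higher raw scores never enter the comparison; only the ordering within each model does. Two examples are far too few to draw conclusions from, so this is only meant to show the bookkeeping.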


**Embedding Functions**

In order to try out different examples, we've defined a `get_embedding` function below. It takes the average of the embeddings from the second-to-last layer of the model to use as a sentence embedding.

`get_embedding` also supports calculating an embedding for a specific word or sequence of words within the sentence. 

To locate the indices of the tokens for these words, we've also defined the `get_word_indeces` helper function below. 

To calculate the word embedding, we again take the average of its token embeddings from the second-to-last layer of the model.
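
The averaging itself is simple. Here is a minimal NumPy sketch of the pooling step, using random numbers in place of real hidden states and a hidden size of 8 instead of 768 to keep it small. The shapes mirror the layout the function below works with (13 layers x 1 sentence x tokens x hidden size); the token positions chosen for the "word" are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model's hidden states:
# [13 layers (embeddings + 12) x 1 sentence x 6 tokens x 8 features]
hidden_states = rng.normal(size=(13, 1, 6, 8))

# Select the second-to-last layer for the only sentence in the batch.
token_vecs = hidden_states[-2][0]          # shape [6 x 8]

# Sentence embedding: average over all token vectors.
sentence_embedding = token_vecs.mean(axis=0)

# Word embedding: average over just the tokens that make up the word
# (positions 2 and 3 here, chosen arbitrarily for the example).
word_indices = np.array([2, 3])
word_embedding = token_vecs[word_indices].mean(axis=0)

print(sentence_embedding.shape, word_embedding.shape)
```

Both results have one value per hidden feature, so a word embedding and a sentence embedding can be compared directly.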


#### get_word_indeces


In [None]:
import numpy as np

def get_word_indeces(tokenizer, text, word):
    '''
    Determines the index or indices of the tokens corresponding to `word`
    within `text`. `word` can consist of multiple words, e.g., "cell biology".
    
    Determining the indices is tricky because words can be broken into multiple
    tokens. I've solved this with a rather roundabout approach--I replace `word`
    with the correct number of `[MASK]` tokens, and then find these in the 
    tokenized result. Note that `str.replace` masks every occurrence of `word`.
    '''
    # Tokenize the 'word'--it may be broken into multiple tokens or subwords.
    word_tokens = tokenizer.tokenize(word)

    # Create a sequence of `[MASK]` tokens to put in place of `word`.
    masks_str = ' '.join(['[MASK]']*len(word_tokens))

    # Replace the word with mask tokens.
    text_masked = text.replace(word, masks_str)

    # `encode` performs multiple functions:
    #   1. Tokenizes the text
    #   2. Maps the tokens to their IDs
    #   3. Adds the special [CLS] and [SEP] tokens.
    input_ids = tokenizer.encode(text_masked)

    # Use numpy's `where` function to find all indices of the [MASK] token.
    mask_token_indeces = np.where(np.array(input_ids) == tokenizer.mask_token_id)[0]

    return mask_token_indeces
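
For comparison, here is a sketch of a different way to locate a word: search the sentence's token list for the word's token sequence. This is an alternative technique, not what `get_word_indeces` does, and `find_token_span` is a hypothetical helper. It assumes the word is split into the same tokens in isolation as it is inside the sentence, and it only returns the first match.

```python
def find_token_span(sentence_tokens, word_tokens):
    """Return the positions of the first run of `word_tokens` in the sentence."""
    n = len(word_tokens)
    for start in range(len(sentence_tokens) - n + 1):
        if sentence_tokens[start:start + n] == word_tokens:
            return list(range(start, start + n))
    # The word's tokens were not found as a contiguous run.
    return []


# Token lists written out by hand, including the special tokens.
sentence_tokens = ["[CLS]", "health", "di", "##spar", "##ities",
                   "persist", "[SEP]"]
word_tokens = ["di", "##spar", "##ities"]

span = find_token_span(sentence_tokens, word_tokens)
print(span)
```

The mask-based approach above sidesteps the search entirely, at the cost of depending on plain string replacement in the raw text.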


#### get_embedding

In [None]:
def get_embedding(b_model, b_tokenizer, text, word=''):
    '''
    Uses the provided model and tokenizer to produce an embedding for the
    provided `text`, and a "contextualized" embedding for `word`, if provided.
    '''

    # If a word is provided, figure out which tokens correspond to it.
    if not word == '':
        word_indeces = get_word_indeces(b_tokenizer, text, word)

    # Encode the text, adding the (required!) special tokens, and converting to
    # PyTorch tensors.
    encoded_dict = b_tokenizer.encode_plus(
                        text,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        return_tensors = 'pt',     # Return pytorch tensors.
                )

    input_ids = encoded_dict['input_ids']
    
    b_model.eval()

    # Run the text through BERT, and collect all of the hidden states produced
    # by the embedding layer and all 12 transformer layers. `torch.no_grad()`
    # skips gradient tracking, which isn't needed for inference.
    with torch.no_grad():

        outputs = b_model(input_ids)

        # Evaluating the model will return a different number of objects based on 
        # how it's configured in the `from_pretrained` call. In this case, 
        # because we set `output_hidden_states = True`, the third item will be the 
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x <sentence length> x 768]

    # Select the embeddings from the second to last layer.
    # `token_vecs` is a tensor with shape [<sent length> x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    # Convert to numpy array.
    sentence_embedding = sentence_embedding.detach().numpy()

    # If `word` was provided, compute an embedding for those tokens.
    if not word == '':
        # Take the average of the embeddings for the tokens in `word`.
        word_embedding = torch.mean(token_vecs[word_indeces], dim=0)

        # Convert to numpy array.
        word_embedding = word_embedding.detach().numpy()
    
        return (sentence_embedding, word_embedding)
    else:
        return sentence_embedding


Retrieve the models and tokenizers for both BERT and SciBERT

In [None]:
from transformers import BertModel, BertTokenizer

# Retrieve SciBERT.
scibert_model = BertModel.from_pretrained("allenai/scibert_scivocab_uncased",
                                  output_hidden_states=True)
scibert_tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

scibert_model.eval()

# Retrieve generic BERT.
bert_model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True) 
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

bert_model.eval()

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.s

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

Test out the function.

In [None]:
text = "During a routine check-up, a patient from a historically marginalized community discussed their struggles with accessing affordable healthcare, highlighting the ongoing issue of health disparities in the doctor's room."
word = 'marginalized'

# Get the embedding for the sentence, as well as an embedding for 'marginalized'.
(sen_emb, word_emb) = get_embedding(scibert_model, scibert_tokenizer, text, word)

print('Embedding sizes:')
print(sen_emb.shape)
print(word_emb.shape)

Embedding sizes:
(768,)
(768,)


Here's the code for calculating cosine similarity. We'll test it by comparing the word embedding with the sentence embedding--not a very interesting comparison, but a good sanity check.

In [None]:
from scipy.spatial.distance import cosine

# Calculate the cosine similarity of the two embeddings.
sim = 1 - cosine(sen_emb, word_emb)

print('Cosine similarity: {:.2}'.format(sim))

Cosine similarity: 0.84
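
For reference, `1 - cosine(u, v)` is the cosine similarity: the dot product of the two vectors divided by the product of their lengths. Below is a minimal NumPy sketch of the same calculation; `cosine_similarity` is a hypothetical helper written for this illustration, and it assumes neither vector is all zeros.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between `u` and `v` (both must be non-zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])

print(same_direction, orthogonal, opposite)
```

Because the lengths are divided out, only the direction of the embeddings matters, which is why a long sentence and a single word can still be compared.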


#### Sentence Comparison Examples

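In this example, `query` and `A` both describe practices that can perpetuate health disparities (relying on race alone in clinical decisions, and biased AI algorithms), while `B` describes how AI could help reduce them.

At two significant digits, SciBERT scores `A` and `B` identically, while generic BERT ranks `A` slightly higher...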

In [None]:
# Three sentences; query is more similar to A than B.
text_query = "While race can be an indicator of certain health risks, such as sickle cell anemia in individuals of African descent or Tay-Sachs disease in individuals of Ashkenazi Jewish descent, it is crucial to consider each patient's unique medical history and genetic makeup. Using race as the only factor in making health decisions can perpetuate stereotypes and lead to inequities in healthcare. "
text_A = "AI has the potential to perpetuate health disparities if not developed and used responsibly. One example of AI creating health disparities is the use of biased algorithms in healthcare decision-making. If an algorithm is trained on data that is biased towards certain racial or ethnic groups, it may perpetuate those biases and result in disparities in health outcomes. For example, if an algorithm is trained to recommend more aggressive treatments for patients with higher income levels, it may result in patients from lower income backgrounds receiving suboptimal care."
text_B = "AI has the potential to help reduce health disparities by improving access to quality healthcare and promoting equitable health outcomes. For example, AI can be used to develop personalized treatment plans that take into account an individual's unique medical history, genetic makeup, and environmental factors. This approach can help address health disparities by providing tailored care that meets the specific needs of each patient. "

# Get embeddings for each.
emb_query = get_embedding(scibert_model, scibert_tokenizer, text_query)
emb_A = get_embedding(scibert_model, scibert_tokenizer, text_A)
emb_B = get_embedding(scibert_model, scibert_tokenizer, text_B)

# Compare query to A and B with cosine similarity.
sim_query_A = 1 - cosine(emb_query, emb_A)
sim_query_B = 1 - cosine(emb_query, emb_B)

print("'query' should be more similar to 'A' than to 'B'...\n")

print('SciBERT:')
print('  sim(query, A): {:.2}'.format(sim_query_A))
print('  sim(query, B): {:.2}'.format(sim_query_B))

# Repeat with BERT.
emb_query = get_embedding(bert_model, bert_tokenizer, text_query)
emb_A = get_embedding(bert_model, bert_tokenizer, text_A)
emb_B = get_embedding(bert_model, bert_tokenizer, text_B)

# Compare query to A and B with cosine similarity.
sim_query_A = 1 - cosine(emb_query, emb_A)
sim_query_B = 1 - cosine(emb_query, emb_B)

print('')
print('BERT:')
print('  sim(query, A): {:.2}'.format(sim_query_A))
print('  sim(query, B): {:.2}'.format(sim_query_B))


'query' should be more similar to 'A' than to 'B'...

SciBERT:
  sim(query, A): 0.96
  sim(query, B): 0.96

BERT:
  sim(query, A): 0.91
  sim(query, B): 0.9
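
A note on reading these numbers: the `{:.2}` format used above keeps two *significant* digits, which is why `0.9` appears without a trailing zero and why two close scores can print identically. The sketch below shows the effect with two illustrative values (not model outputs) and a fixed-point format that would keep more detail.

```python
# Illustrative similarity values (not produced by either model).
sim_a, sim_b = 0.9049, 0.8951

# Two significant digits: both values collapse to the same string.
two_sig_a = '{:.2}'.format(sim_a)
two_sig_b = '{:.2}'.format(sim_b)

# Four decimal places: the difference becomes visible.
fixed_a = '{:.4f}'.format(sim_a)
fixed_b = '{:.4f}'.format(sim_b)

print(two_sig_a, two_sig_b)
print(fixed_a, fixed_b)
```

When two scores tie at this precision, as SciBERT's do above, printing with something like `{:.4f}` would show whether the model actually separates `A` from `B`.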


In this example, `query` and `A` are both about considering each patient's individual circumstances, while `B` is about biased AI algorithms. 

Neither model seems to recognize the distinction: both score `B` higher than `A`!



In [None]:
# Three sentences; query is more similar to A than B.
text_query = "While race can be an indicator of certain health risks, such as sickle cell anemia in individuals of African descent or Tay-Sachs disease in individuals of Ashkenazi Jewish descent, it is crucial to consider each patient's unique medical history and genetic makeup. Using race as the only factor in making health decisions can perpetuate stereotypes and lead to inequities in healthcare"
text_A = "As healthcare providers, we must strive to provide individualized care that takes into account each patient's unique circumstances and works towards addressing any health disparities that may exist. By using a holistic approach that considers all aspects of a patient's health, we can ensure that our decisions are grounded in evidence-based medicine and prioritize the well-being of our patients." 
text_B = "AI has the potential to perpetuate health disparities if not developed and used responsibly. One example of AI creating health disparities is the use of biased algorithms in healthcare decision-making. If an algorithm is trained on data that is biased towards certain racial or ethnic groups, it may perpetuate those biases and result in disparities in health outcomes. For example, if an algorithm is trained to recommend more aggressive treatments for patients with higher income levels, it may result in patients from lower income backgrounds receiving suboptimal care."
#text_B = "Molecular biology deals with the structure and function of the macromolecules (e.g. proteins and nucleic acids) essential to life." 

# Get embeddings for each.
emb_query = get_embedding(scibert_model, scibert_tokenizer, text_query)
emb_A = get_embedding(scibert_model, scibert_tokenizer, text_A)
emb_B = get_embedding(scibert_model, scibert_tokenizer, text_B)

# Compare query to A and B with cosine similarity.
sim_query_A = 1 - cosine(emb_query, emb_A)
sim_query_B = 1 - cosine(emb_query, emb_B)

print("'query' should be more similar to 'A' than to 'B'...\n")

print('SciBERT:')
print('  sim(query, A): {:.2}'.format(sim_query_A))
print('  sim(query, B): {:.2}'.format(sim_query_B))

# Repeat with BERT.
emb_query = get_embedding(bert_model, bert_tokenizer, text_query)
emb_A = get_embedding(bert_model, bert_tokenizer, text_A)
emb_B = get_embedding(bert_model, bert_tokenizer, text_B)

# Compare query to A and B with cosine similarity.
sim_query_A = 1 - cosine(emb_query, emb_A)
sim_query_B = 1 - cosine(emb_query, emb_B)

print('')
print('BERT:')
print('  sim(query, A): {:.2}'.format(sim_query_A))
print('  sim(query, B): {:.2}'.format(sim_query_B))


'query' should be more similar to 'A' than to 'B'...

SciBERT:
  sim(query, A): 0.95
  sim(query, B): 0.96

BERT:
  sim(query, A): 0.85
  sim(query, B): 0.91


#### Word Comparison Examples



We also played with comparing words that have both a domain-specific and an everyday meaning. For example, the word "desert" usually refers to an arid landscape, but in public health a "food desert" is an area with limited access to affordable, nutritious food.

In this example we'll use the term in a sentence alongside the word "health":

> "food deserts have been linked to health disparities."

Both BERT and SciBERT output "contextualized" embeddings, meaning that the representation of each word in a sentence will change depending on the words that occur around it.

Our intuition, as before, was that SciBERT would better capture the domain-specific sense of the term than generic BERT. The result below doesn't clearly confirm this: within each model, the two similarity scores are nearly identical.

In [None]:
text = "food deserts have been linked to health disparities."

print('"' + text + '"\n')

# ======== SciBERT ========

# Get contextualized embeddings for "food", "deserts", and "health"
(emb_sen, emb_food) = get_embedding(scibert_model, scibert_tokenizer, text, word="food")
(emb_sen, emb_deserts) = get_embedding(scibert_model, scibert_tokenizer, text, word="deserts")
(emb_sen, emb_health) = get_embedding(scibert_model, scibert_tokenizer, text, word="health")

# Compare the embeddings. Note that the second comparison is between
# "food" and "health", so it is labeled accordingly.
print('SciBERT:')
print('  sim(deserts, food): {:.2}'.format((1 - cosine(emb_food, emb_deserts))))
print('  sim(food, health): {:.2}'.format(1 - cosine(emb_food, emb_health)))

print('')

# ======== BERT ========

# Get contextualized embeddings for "food", "deserts", and "health"
(emb_sen, emb_food) = get_embedding(bert_model, bert_tokenizer, text, word="food")
(emb_sen, emb_deserts) = get_embedding(bert_model, bert_tokenizer, text, word="deserts")
(emb_sen, emb_health) = get_embedding(bert_model, bert_tokenizer, text, word="health")

# Compare the embeddings. Note that the second comparison is between
# "food" and "health", so it is labeled accordingly.
print('BERT:')
print('  sim(deserts, food): {:.2}'.format((1 - cosine(emb_food, emb_deserts))))
print('  sim(food, health): {:.2}'.format(1 - cosine(emb_food, emb_health)))


"food deserts have been linked to health disparities."

SciBERT:
  sim(deserts, food): 0.88
  sim(food, health): 0.87

BERT:
  sim(deserts, food): 0.53
  sim(food, health): 0.55


Let us know if you find some more interesting examples to try!

# Appendix: BioBERT vs. SciBERT

I don't have much insight into the merits of BioBERT versus SciBERT, but I thought I would at least share what I do know.

**Publish Dates & Authors**

* *BioBERT*
    * First submitted to arXiv: `Jan 25th, 2019`
        * [link](https://arxiv.org/abs/1901.08746)
    * First Author: Jinhyuk Lee
    * Organization: Korea University, Clova AI (also Korean)

* *SciBERT*
    * First submitted to arXiv: `Mar 26, 2019`
        * [arXiv](https://arxiv.org/abs/1903.10676), [pdf](https://arxiv.org/pdf/1903.10676.pdf)
    * First Author: Iz Beltagy
    * Organization: Allen AI

**Differences**

* BioBERT used the same tokens as the original BERT, rather than choosing a new vocabulary of tokens based on their corpus. Their justification was "to maintain compatibility", which I don't entirely understand.
* SciBERT learned a new vocabulary of tokens, but they also found that this detail is less important--it's training on the specialized corpus that really makes the difference.
* SciBERT is more recent, and outperforms BioBERT on many, but not all, scientific NLP benchmarks.
* The difference in naming seems unfortunate--SciBERT is also trained primarily on biomedical research papers, but the name "BioBERT" was already taken, so....

**huggingface transformers**

* Allen AI published their SciBERT models for the transformers library, and you can see their popularity:
    * [SciBERT uncased](https://huggingface.co/allenai/scibert_scivocab_uncased): ~16.7K downloads (from 5/22/20 - 6/22/20)
        * `allenai/scibert_scivocab_uncased`
    * [SciBERT cased](https://huggingface.co/allenai/scibert_scivocab_cased): ~3.8K downloads (from 5/22/20 - 6/22/20)
        * `allenai/scibert_scivocab_cased`
* The BioBERT team has published their models, but not for the `transformers` library, as far as I can tell. 
    * The most popular BioBERT model in the huggingface community appears to be [this one](https://huggingface.co/monologg/biobert_v1.1_pubmed): `monologg/biobert_v1.1_pubmed`, with ~8.6K downloads (from 5/22/20 - 6/22/20)
        * You could also download BioBERT's pre-trained weights yourself from https://github.com/naver/biobert-pretrained, but I'm not sure exactly what it would take to pull these into the `transformers` library. 

