# Examples and questions for NLP using BERT

In [2]:
from IPython.display import HTML, IFrame

## 1. Start by watching the following two videos that introduce Neural Networks (optional) and BERT (run the cells to embed the videos in this notebook), then answer the questions

In [3]:
## OPTIONAL Intro to NN
IFrame(src="https://www.youtube.com/embed/fkqZyYo_ebs?rel=0&amp;controls=0&amp;showinfo=0", width="560", height="315")

In [4]:
## REQUIRED Intro to BERT
IFrame("https://www.youtube.com/embed/xI0HHN5XKDo?rel=0&amp;controls=0&amp;showinfo=0", width="560", height="315")

#### Question 1: Name two issues with the LSTM recurrent neural networks that Transformer networks address

#### Answer: 

#### Question 2: How do the Transformer networks address the two issues?

#### Answer: 

#### Question 3: What are the two primary components of Transformer networks?

#### Answer: 

#### Question 4: Which component of the Transformer network is stacked to create a BERT network?

#### Answer: 

#### Question 5: What are 4 problems that the video mentions BERT networks can address?

#### Answer: 

#### Question 6: What are the 2 steps to solving problems with BERT? What is the primary goal of each of the steps?

#### Answer: 

#### Question 7: How does the Masked Language Model task help BERT to understand language?

#### Answer: 

#### Question 8: How does the Next Sentence Prediction task help BERT to understand language?

#### Answer: 

#### Question 9: What change to the BERT network architecture is needed for fine-tuning?

#### Answer: 

#### Question 10: True/False - each input token to BERT is a human readable token representing a single word?

#### Answer: 

#### Question 11: In your own words, what is a WordPiece model? (HINT: see [this blog post](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#21-special-tokens).)

#### Answer: 

#### Question 12: True/False - the output word vectors of a BERT model are generated sequentially?

#### Answer: 

#### Question 13 (BONUS): True/False - the loss function for BERT training is [cross-entropy](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html)?

#### Answer: 

#### Question 14 (BONUS): True/False - the loss function of the model is calculated for masked words only in order to increase the network's attention to context?

#### Answer: 

#### Question 15 (BONUS): True/False - the BERT base model has more than 300 million parameters?

#### Answer: 

## 2. Exploring Transformer models using the Hugging Face Transformers package

In this vignette/assignment we are using Transformer models from the Transformers package (https://github.com/huggingface/transformers). I recommend that you visit the GitHub page for the project and check out the various demos in the "Online demos" section. I have set up this Jupyter notebook so that you can import the transformers library and PyTorch. This allows you to run several of the test cases from the library's detailed documentation located at https://huggingface.co/transformers.

#### 2a. Masked language modeling example - this example is explained at https://huggingface.co/transformers/task_summary.html#masked-language-modeling.

First we import a helpful utility class called 'pipeline'. This class implements several common NLP workflows - see https://huggingface.co/transformers/main_classes/pipelines.html

In [1]:
from transformers import pipeline

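We use the fill-mask pipeline with its default model. As the message below shows, the model loaded here is `distilroberta-base`, a distilled RoBERTa model rather than the original BERT.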

In [2]:
nlp = pipeline("fill-mask")

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from pprint import pprint
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

[{'score': 0.17927460372447968,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to '
              'solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.1134939044713974,
  'sequence': '<s>HuggingFace is creating a framework that the community uses '
              'to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243545398116112,
  'sequence': '<s>HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493543714284897,
  'sequence': '<s>HuggingFace is creating a database that the community uses '
              'to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860247902572155,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses '
              'to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]
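
The pipeline returns a list of dictionaries, one per candidate token, and in the output above they appear in descending order of score. The leading `Ġ` in `token_str` is how the byte-level BPE vocabulary used by RoBERTa models marks a token that is preceded by a space. The following is a minimal sketch of one way to summarize such a result; the helper name `summarize_fill_mask` is hypothetical (it is not part of transformers), and it assumes only the `score` and `token_str` keys shown above.

```python
def summarize_fill_mask(results):
    """Return (token, score) pairs with the 'Ġ' space marker removed from each token."""
    summary = []
    for r in results:
        token = r["token_str"].replace("Ġ", "").strip()
        summary.append((token, round(r["score"], 3)))
    return summary

# Two entries copied from the output above (the 'sequence' text is omitted for brevity)
sample_results = [
    {"score": 0.17927460372447968, "token": 3944, "token_str": "Ġtool"},
    {"score": 0.1134939044713974, "token": 7208, "token_str": "Ġframework"},
]

for token, score in summarize_fill_mask(sample_results):
    print(f"{token}: {score}")
```

With a live pipeline, the same helper could be called as `summarize_fill_mask(nlp(...))`.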


##### Question 17: FOR YOU TO DO: Replace the '...' in the following Python statement with a sentence of your own that uses `{nlp.tokenizer.mask_token}`, then run the cell:

In [None]:
pprint(nlp(f"... {nlp.tokenizer.mask_token} ..."))

#### Question 18. True/False - Since all BERT models are trained using word masking, this model should perform as well on clinical statements as on statements about any other topic.

#### Answer: 

#### 2b. An example that doesn't use the pipeline helper - this follows the example at https://huggingface.co/transformers/task_summary.html#named-entity-recognition 

This next example uses a model and a tokenizer. Note that the tokenizer is loaded from a specific BERT checkpoint called 'bert-base-cased', while the model is loaded from 'dbmdz/bert-large-cased-finetuned-conll03-english', a BERT model already fine-tuned for named entity recognition. You can learn more about bert-base-cased here: https://huggingface.co/bert-base-cased

After reading more about bert-base-cased, take note that there are many other models available on the same site: https://huggingface.co/models

#### Question 19. Search the [models posted by the Hugging Face community](https://huggingface.co/models). Use the tags dropdown and the search form on the website. What are two models that you found that are of possible interest to you? Why?

#### Answer: 

Back to the example: as the [transformers website explains](https://huggingface.co/transformers/task_summary.html#named-entity-recognition), the process of using a transformer for NER is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.

2. Define the label list that the model was trained on.

3. Define a sequence with known entities, such as “Hugging Face” as an organisation and “New York City” as a location.

4. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely encoding and decoding the sequence, so that we’re left with a string that contains the special tokens.

5. Encode that sequence into IDs (special tokens are added automatically).

6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for each token.

7. Zip together each token with its prediction and print it.

In [4]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

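# Note: the two string literals below are joined without a space, so the model sees "veryclose"
# (this is why '##c' and '##lose' appear in the output)
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."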

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
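
The output above labels every WordPiece token separately, so a single name such as "Hugging Face Inc" is spread over several tuples. The sketch below shows one possible way to merge the pieces back into entity spans. The function `group_entities` is a hypothetical helper written for this notebook, not a transformers API, and it makes simplifying assumptions: a word takes the label of its first piece, and adjacent words with the same entity type are joined unless the label starts with `B-`.

```python
def group_entities(token_label_pairs):
    """Merge '##' word pieces into words, then join adjacent words of the same entity type."""
    # Step 1: rebuild whole words; a word keeps the label of its first piece.
    words = []
    for token, label in token_label_pairs:
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens carry no entity information
        if token.startswith("##") and words:
            words[-1] = (words[-1][0] + token[2:], words[-1][1])
        else:
            words.append((token, label))

    # Step 2: join neighbouring words that share an entity type (e.g. I-LOC, I-LOC).
    entities = []
    previous_type = None
    for word, label in words:
        entity_type = None if label == "O" else label.split("-")[-1]
        continues_span = entity_type == previous_type and not label.startswith("B-")
        if entity_type is not None and continues_span:
            entities[-1] = (entities[-1][0] + " " + word, entity_type)
        elif entity_type is not None:
            entities.append((word, entity_type))
        previous_type = entity_type
    return entities

# A shortened copy of the (token, label) pairs printed above
pairs = [("[CLS]", "O"), ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"),
         ("Inc", "I-ORG"), (".", "O"), ("in", "O"), ("New", "I-LOC"), ("York", "I-LOC"),
         ("City", "I-LOC"), (".", "O"), ("in", "O"), ("D", "I-LOC"), ("##UM", "I-LOC"),
         ("##BO", "I-LOC"), (",", "O"), ("[SEP]", "O")]
print(group_entities(pairs))
```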


To understand how this example works, you need to understand how textual data is preprocessed and the role of a `tokenizer`. Read these two pages and then answer the questions:

a. [Preprocessing](https://huggingface.co/transformers/preprocessing.html)

b. [Tokenizer](https://huggingface.co/transformers/tokenizer_summary.html)
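
As a companion to the reading, here is a minimal sketch of greedy longest-match-first splitting, the matching strategy BERT's WordPiece tokenizer applies to each word at tokenization time. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and exist only for illustration; a real BERT vocabulary is far larger, and the real tokenizer also handles casing, punctuation, and length limits.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed with ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary chosen to mirror splits seen in the NER output above
toy_vocab = {"Hu", "##gging", "Face", "D", "##UM", "##BO"}
print(wordpiece_tokenize("Hugging", toy_vocab))
print(wordpiece_tokenize("DUMBO", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```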

#### Question 20: Why do you need to preprocess the sentences you want to pass to a BERT model using a tokenizer?

#### Answer: 

#### Question 21: What problem do [sub-word tokenization algorithms](https://huggingface.co/transformers/tokenizer_summary.html) solve?

#### Answer: 

## 3. Comparing fine-tuned clinicalBERT and fine-tuned generic BERT for Medical Natural Language Inference

As you learned from the video, BERT models have a general understanding of language and should be fine-tuned to address specific NLP problems. In general, this process involves the following:

1. Developing or acquiring a training/test set so that the model can be fine-tuned using supervised learning.

2. Transforming the training/test set data to the format required by the BERT Transformer encoder.

3. Configuring the hyperparameters for how the BERT model will learn during fine tuning.

4. Training the model.

5. Testing the model.
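
Step 2 can be illustrated with a minimal sketch of the sentence-pair layout that BERT expects for inference tasks: `[CLS] premise [SEP] hypothesis [SEP]`, with segment ids that tell the model which sentence each token belongs to. The helper `build_pair_input` is hypothetical, and whitespace splitting stands in for real WordPiece tokenization; in practice, calling a transformers tokenizer on two sentences produces these fields for you.

```python
def build_pair_input(premise_tokens, hypothesis_tokens):
    """Lay out a sentence pair as [CLS] premise [SEP] hypothesis [SEP] with segment ids."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    # Segment (token type) id 0 covers [CLS], the premise, and the first [SEP];
    # segment id 1 covers the hypothesis and the final [SEP].
    token_type_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    attention_mask = [1] * len(tokens)  # 1 marks a real token (no padding in this example)
    return tokens, token_type_ids, attention_mask

# Whitespace splitting is an assumption made for illustration only
premise = "Labs were notable for Cr 1.7".split()
hypothesis = "Patient has elevated Cr".split()

tokens, token_type_ids, attention_mask = build_pair_input(premise, hypothesis)
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```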

I have gone through these steps for you for the Medical Natural Language Inference NLP problem. This problem involves inferring a clinical fact from the text of a clinical note. The training/test set I used was the [MedNLI](https://physionet.org/content/mednli/1.0.0/)[[1.]] dataset created using MIMIC III notes [[2.]]. That dataset has a number of sentences from clinical notes that have been labeled by clinicians for specific inferences about the patient's clinical status. Here are some examples:

```
[
 ('Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.', ' Patient has elevated Cr', 'entailment'), 
 ('Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.', ' Patient has normal Cr', 'contradiction'), 
 ('Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.', ' Patient has elevated BUN', 'neutral')
]
```
I fine tuned two different BERT models:

1. [clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) - read about it [here](https://www.aclweb.org/anthology/W19-1909/)[[3.]]

2. [bert-base-cased](https://huggingface.co/bert-base-cased) - read about it [here](https://arxiv.org/abs/1810.04805)[[4.]]

The code I used to fine tune the models was downloaded from here: https://github.com/EmilyAlsentzer/clinicalBERT/tree/master/downstream_tasks

Fine tuning took less than 20 minutes for both BERT models using a server with an NVIDIA RTX 2080 Ti GPU.

The cells below allow you to run both models on sample sentences. **Run the cells and then answer the questions that follow.**


*References*

[[1.]] Shivade, C. (2019). MedNLI - A Natural Language Inference Dataset For The Clinical Domain (version 1.0.0). PhysioNet. https://doi.org/10.13026/C2RS98.

[[2.]] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

[[3.]] Alsentzer, Emily, et al. "Publicly available clinical BERT embeddings." arXiv preprint arXiv:1904.03323 (2019).

[[4.]] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

In [None]:
## Sample sentences
stplL = [
    ("In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."," The patient is hemodynamically stable"),
    ("Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4."," Patient has elevated Cr"),
    ("Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4."," Patient has normal Cr"),
    ("Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4."," Patient has elevated BUN"),
    ('No history of blood clots or DVTs, has never had chest pain prior to one week ago.', ' Patient has angina'), 
    ('No history of blood clots or DVTs, has never had chest pain prior to one week ago.', ' Patient has had multiple PEs'), 
    ('No history of blood clots or DVTs, has never had chest pain prior to one week ago.', ' Patient has CAD'),
    ('In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.', ' The patient is hemodynamically stable '),
    ('In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.', ' The patient is hemodynamically unstable'),
    ('In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.', ' The patient is in pain.')      
]

In [None]:
## First test clinicalBERT
from transformers import BertTokenizer, BertForSequenceClassification
import torch

In [None]:
## Loads the tokenizer (which holds the vocabulary) and the fine tuned model
tokenizer = BertTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = BertForSequenceClassification.from_pretrained("/home/ubuntu/clinBertFineTunedMedNLI/",num_labels=3,return_dict=True)

In [None]:
## The labels used during fine tuning
label_list = ["contradiction","entailment","neutral"]

In [None]:
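## Iterates through the sentence pairs, tokenizes the sentences, passes the sentences 
## through the model, and obtains the highest scoring label (see label_list above) from the 
## logits output by the model.
## Note: `labels` is fixed to index 1 ("entailment") for every pair, so the printed loss
## is measured against "entailment" regardless of the pair's true label.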
for tpl in stplL:
    (s1,s2) = tpl
    inputs = tokenizer(s1, s2, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    logits = outputs.logits
    predictions = torch.argmax(logits[0], dim=0)
    
    print("sentence 1: " + s1)
    print("sentence 2: " + s2)
    print("prediction: " + label_list[predictions])
    print("loss: " + str(loss))
    print()
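
To make the relationship between the logits, the predicted label, and the printed loss more concrete, here is a minimal pure-Python sketch. The logit values are made up for illustration and are not real model output. It assumes the loss for a single-label, three-class head is cross-entropy, in which case the loss for one example equals the negative log of the probability assigned to the supplied label (index 1, "entailment", in the loop above).

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

label_list = ["contradiction", "entailment", "neutral"]
example_logits = [-1.2, 3.1, 0.4]  # made-up values, not real model output

probabilities = softmax(example_logits)

# argmax: the index of the largest probability is the predicted label
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print("prediction:", label_list[best])

# Cross-entropy against the fixed label index 1 ("entailment")
loss_vs_entailment = -math.log(probabilities[1])
print("loss vs. entailment:", round(loss_vs_entailment, 4))
```

A low loss here only means the model assigned high probability to "entailment"; it says nothing about correctness for pairs whose true label is different.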
    

In [None]:
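## now, rerun all of the same steps above using the generic cased BERT that has not been fine tuned
## (num_labels=3 matches label_list; without it the classification head defaults to 2 labels,
## and the head is randomly initialized either way)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=3, return_dict=True)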

In [None]:
## The labels used during fine tuning
label_list = ["contradiction","entailment","neutral"]

In [None]:
for tpl in stplL:
    (s1,s2) = tpl
    inputs = tokenizer(s1, s2, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    logits = outputs.logits
    predictions = torch.argmax(logits[0], dim=0)
    
    print("sentence 1: " + s1)
    print("sentence 2: " + s2)
    print("prediction: " + label_list[predictions])
    print("loss: " + str(loss))
    print()
    

In [None]:
## Finally, rerun all of the same steps above using the generic cased BERT that HAS been fine tuned
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForSequenceClassification.from_pretrained("/home/ubuntu/BertCasedFineTunedMedNLI/",num_labels=3,return_dict=True)

In [None]:
## The labels used during fine tuning
label_list = ["contradiction","entailment","neutral"]

In [None]:
for tpl in stplL:
    (s1,s2) = tpl
    inputs = tokenizer(s1, s2, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    logits = outputs.logits
    predictions = torch.argmax(logits[0], dim=0)
    
    print("sentence 1: " + s1)
    print("sentence 2: " + s2)
    print("prediction: " + label_list[predictions])
    print("loss: " + str(loss))
    print()
    

#### Question 22. What is your impression about which model(s) do better at clinical natural language inference? Can you offer a possible reason why?

#### Answer:

#### Question 23. Please describe how you would formally compare the performance of the three models using the MedNLI training/test set. Be specific about the metrics you would use and the criteria you would use to determine that one model is better than another.

#### Answer:

#### Question 24. What role, if any, would inter-rater agreement statistics such as Kappa play in a formal evaluation?

#### Answer:

#### Question 25: In your opinion, what kinds of biomedical NLP problems might BERT not be a good fit for, and why?

#### Answer:

### This notebook was created by Rich Boyce and Billy Reynolds with help from Sanya B. Taneja and NLP expert Denis R Newman-Griffis <dnewmangriffis@pitt.edu>. Contact Denis if you are interested in research in biomedical NLP.