# BERT - Pair Programming

## Introduction

**BERT (Bidirectional Encoder Representations from Transformers)** is a language embedding model published by Google. It is a context-based model, unlike other embedding models such as word2vec, which are context-free. BERT was pretrained on a dataset of about 3.3 billion words: approximately 2.5 billion from Wikipedia and the balance, roughly 800 million, from [BookCorpus](https://www.english-corpora.org/googlebooks/#).

## Objectives

You will be able to:

* Implement BERT in Python
* Apply BERT to an NLP task
* Recognize the possibility of bias when working with BERT


## Some details of the BERT Model

Based on our previous discussion of the transformer, we can see where the terms "encoder representations from transformers" come from. But what about "bidirectional"? Bidirectional simply means the encoder reads the sentence in both directions at once, so the representation of each word draws on the words to its left and to its right. For example, in "I think therefore I am," the representation of "therefore" depends on both "I think" and "I am."
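One way to picture "both directions" is as a visibility grid: for each word, which other words may it use as context? The sketch below is a simplified illustration in plain Python, not BERT's actual code; the function name `visibility_mask` is made up for this lesson. It contrasts a left-to-right reader with a bidirectional one.

```python
def visibility_mask(num_tokens, bidirectional):
    """Return a grid where entry [i][j] is 1 if token i may look at token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


tokens = ["i", "think", "therefore", "i", "am"]
left_to_right = visibility_mask(len(tokens), bidirectional=False)
both_directions = visibility_mask(len(tokens), bidirectional=True)

# Row 1 belongs to the token "think": which neighbours can it use as context?
print("left-to-right:", left_to_right[1])
print("bidirectional:", both_directions[1])
```

In the left-to-right grid, "think" can only see itself and the word before it. In the bidirectional grid, it sees the whole sentence, which is what lets BERT use the words on both sides of a masked position.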

BERT has three main hyperparameters:
* $L$ is the number of encoder layers
* $A$ is the number of attention heads
* $H$ is the hidden size (the number of hidden units)

The model also comes in some pre-specified configurations. Here are the two standard ones:
* BERT-base: $L=12$, $A=12$, $H=768$
* BERT-large: $L=24$, $A=16$, $H=1024$
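To get a feel for how $L$ and $H$ drive model size, here is a rough parameter count in plain Python. It is a back-of-the-envelope sketch, not an official formula. It assumes the `bert-base-uncased` vocabulary size of 30,522, a maximum of 512 positions, 2 segment types, a feed-forward width of $4H$, and a pooling layer on top. It uses 24 layers for BERT-large, as in the published configuration. The function name `approx_bert_parameters` is hypothetical.

```python
def approx_bert_parameters(L, H, vocab_size=30522, max_positions=512, type_vocab=2):
    """Estimate the number of weights in a BERT encoder (assumptions in the text above)."""
    ffn = 4 * H            # assumed feed-forward width
    layer_norm = 2 * H     # one scale and one shift vector per LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * H + layer_norm
    attention = 4 * (H * H + H) + layer_norm   # query, key, value, output projections
    feed_forward = (H * ffn + ffn) + (ffn * H + H) + layer_norm
    pooler = H * H + H
    return embeddings + L * (attention + feed_forward) + pooler


configs = {"BERT-base": (12, 12, 768), "BERT-large": (24, 16, 1024)}
for name, (L, A, H) in configs.items():
    millions = approx_bert_parameters(L, H) / 1e6
    # A does not change the count; it only splits H into heads of size H // A.
    print(f"{name}: about {millions:.0f} million parameters, head size {H // A}")
```

Note that the number of attention heads $A$ does not appear in the count: the heads divide the hidden size among themselves rather than adding new weights.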

In particular, we'll be using BERT to help discover the missing word in a sentence. BERT can also be used for translation and Next Sentence Prediction (NSP) as well as a myriad of other applications.

## Using BERT

We'll need to use the [Python library `transformers`](https://huggingface.co/transformers/v3.0.2/index.html). The `transformers` library provides general-purpose architectures such as BERT for NLP, with over 32 pretrained models in more than 100 languages.

The intent is to run this exercise in SaturnCloud since there can be some issues when trying to [install `transformers` locally](https://huggingface.co/docs/transformers/installation).
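If you do try a local setup, a quick check like the one below can tell you whether the pieces are importable before you run the rest of the notebook. This is a small sketch using only the standard library; the helper name `is_installed` is made up here. It also checks for `torch` and `tensorflow` because `transformers` needs at least one of them as a backend.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


for package in ["transformers", "torch", "tensorflow"]:
    status = "found" if is_installed(package) else "missing"
    print(f"{package}: {status}")
```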

In [1]:
# Import the pipeline helper from the transformers library
from transformers import pipeline

## Masking with BERT

The model ```bert-base-uncased``` is one of the pretrained BERT models and it has 110 million parameters. [Details of this model can be found on Hugging Face](https://huggingface.co/bert-base-uncased). We'll be using ```bert-base-uncased``` for masking.

You may see a message about the weights of ```bert-base-uncased``` when the model loads, but this is nothing to worry about for our purposes.

In [2]:
# Define our function unmasker
unmasker = pipeline('fill-mask', model='bert-base-uncased')


All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.



Let's try a sentence and see how BERT does.

In [3]:
# [MASK] goes in the place you want BERT to predict the correct word
unmasker("Artificial Intelligence [MASK] take over the world.")

[{'score': 0.31823936104774475,
  'token': 2064,
  'token_str': 'can',
  'sequence': 'artificial intelligence can take over the world.'},
 {'score': 0.18299679458141327,
  'token': 2097,
  'token_str': 'will',
  'sequence': 'artificial intelligence will take over the world.'},
 {'score': 0.056001096963882446,
  'token': 2000,
  'token_str': 'to',
  'sequence': 'artificial intelligence to take over the world.'},
 {'score': 0.04519473388791084,
  'token': 2015,
  'token_str': '##s',
  'sequence': 'artificial intelligences take over the world.'},
 {'score': 0.04515324905514717,
  'token': 2052,
  'token_str': 'would',
  'sequence': 'artificial intelligence would take over the world.'}]

The top five possibilities are shown. Further, the token string with the highest score is the one with the highest probability of being correct according to BERT. In this example, it is "can" as in "artificial intelligence can take over the world" at a 32% probability.

One supposes we should be happy that "can" has a higher probability than "will."

In the output, ```token``` is the ID of the predicted token in the model's vocabulary. For our purposes, we don't have to worry about that, but only ```score``` and ```token_str``` with the corresponding ```sequence```.
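Since the raw output is a list of dictionaries, a small helper can make it easier to read. The sketch below is plain Python and does not call the model; `sample_output` copies the token strings and scores printed above, and the helper name `summarize` is made up for this lesson.

```python
def summarize(predictions):
    """Turn fill-mask results into (word, percent) pairs."""
    return [(p["token_str"], round(100 * p["score"], 1)) for p in predictions]


# Token strings and scores copied from the output above
sample_output = [
    {"token_str": "can", "score": 0.31823936104774475},
    {"token_str": "will", "score": 0.18299679458141327},
    {"token_str": "to", "score": 0.056001096963882446},
    {"token_str": "##s", "score": 0.04519473388791084},
    {"token_str": "would", "score": 0.04515324905514717},
]

summary = summarize(sample_output)
for word, percent in summary:
    print(f"{word:>6}: {percent}%")

# The five scores do not add up to 1; the rest is spread over other words.
top_five_total = sum(p["score"] for p in sample_output)
print(f"Top five together: {top_five_total:.2f}")
```

With a live model you could pass the result of `unmasker(...)` straight to `summarize`, as long as the sentence contains a single `[MASK]`.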

### Task 1: Masking Twice

What happens if we use ```[MASK]``` two times in a sentence?

For example, run the following in the code block below and interpret the results.


```
unmasker("Artificial Intelligence [MASK] take over the [MASK].")
```


In [4]:
# Using [MASK] twice
unmasker("Artificial Intelligence [MASK] take over the [MASK].")

[[{'score': 0.2080228477716446,
   'token': 2064,
   'token_str': 'can',
   'sequence': '[CLS] artificial intelligence can take over the [MASK]. [SEP]'},
  {'score': 0.11164135485887527,
   'token': 2097,
   'token_str': 'will',
   'sequence': '[CLS] artificial intelligence will take over the [MASK]. [SEP]'},
  {'score': 0.04858841747045517,
   'token': 2052,
   'token_str': 'would',
   'sequence': '[CLS] artificial intelligence would take over the [MASK]. [SEP]'},
  {'score': 0.04662349075078964,
   'token': 3001,
   'token_str': 'systems',
   'sequence': '[CLS] artificial intelligence systems take over the [MASK]. [SEP]'},
  {'score': 0.0387875996530056,
   'token': 2000,
   'token_str': 'to',
   'sequence': '[CLS] artificial intelligence to take over the [MASK]. [SEP]'}],
 [{'score': 0.13239632546901703,
   'token': 2088,
   'token_str': 'world',
   'sequence': '[CLS] artificial intelligence [MASK] take over the world. [SEP]'},
  {'score': 0.10707884281873703,
   'token': 2208,
   '

*Explain and interpret the "double-mask" here.*
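As a starting point for your interpretation, notice that the output is now a list of lists, one inner list per ```[MASK]```. The sketch below is plain Python with an abbreviated copy of the output above (only candidates that are fully visible are included); the helper name `best_per_mask` is made up for this lesson.

```python
def best_per_mask(results):
    """For nested fill-mask output, return the top word for each [MASK]."""
    return [
        max(candidates, key=lambda c: c["score"])["token_str"]
        for candidates in results
    ]


# Abbreviated copy of the double-mask output above
double_mask_output = [
    [
        {"score": 0.2080228477716446, "token_str": "can",
         "sequence": "[CLS] artificial intelligence can take over the [MASK]. [SEP]"},
        {"score": 0.11164135485887527, "token_str": "will",
         "sequence": "[CLS] artificial intelligence will take over the [MASK]. [SEP]"},
    ],
    [
        {"score": 0.13239632546901703, "token_str": "world",
         "sequence": "[CLS] artificial intelligence [MASK] take over the world. [SEP]"},
    ],
]

winners = best_per_mask(double_mask_output)
print(winners)

# Every candidate sentence still contains the other [MASK].
still_masked = all(
    "[MASK]" in c["sequence"] for group in double_mask_output for c in group
)
print("Other mask left unfilled in every candidate:", still_masked)
```

The fact that each candidate sentence keeps the other ```[MASK]``` suggests the two positions are scored separately rather than as a pair, which is worth keeping in mind when you explain the results.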

### Task 2: Using unmasker

Use unmasker on three other sentences. At least one of them should be a "double-mask." Explain and interpret each one.

In [5]:
# Your code here, you may want a separate code block for each of the three sentences.
unmasker("Artificial Intelligence [MASK] take over the [MASK].")


### Literary Interlude

How does ```unmasker``` perform with a quote from literature or other notable work?

Let's look first at "To be, or not to be, that is the question" from William Shakespeare's *Hamlet* (Act 3, Scene 1).

In [6]:
# Let's mask "question"
unmasker("To be, or not to be, that is the [MASK]:")

[{'score': 0.18241997063159943,
  'token': 3160,
  'token_str': 'question',
  'sequence': 'to be, or not to be, that is the question :'},
 {'score': 0.12240371108055115,
  'token': 3437,
  'token_str': 'answer',
  'sequence': 'to be, or not to be, that is the answer :'},
 {'score': 0.09915117174386978,
  'token': 2553,
  'token_str': 'case',
  'sequence': 'to be, or not to be, that is the case :'},
 {'score': 0.03269127383828163,
  'token': 2168,
  'token_str': 'same',
  'sequence': 'to be, or not to be, that is the same :'},
 {'score': 0.027760706841945648,
  'token': 2518,
  'token_str': 'thing',
  'sequence': 'to be, or not to be, that is the thing :'}]

We can see that the highest probability does give us the correct answer.

Let's look at another one.

The opening line of James Joyce's *Ulysses* is “Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.”

In [7]:
# Let's mask "plump"
unmasker("Stately, [MASK] Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.")

[{'score': 0.22325874865055084,
  'token': 2214,
  'token_str': 'old',
  'sequence': 'stately, old buck mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.'},
 {'score': 0.10755084455013275,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'stately, the buck mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.'},
 {'score': 0.09360961616039276,
  'token': 2402,
  'token_str': 'young',
  'sequence': 'stately, young buck mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.'},
 {'score': 0.07783861458301544,
  'token': 3335,
  'token_str': 'miss',
  'sequence': 'stately, miss buck mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.'},
 {'score': 0.06260837614536285,
  'token': 2909,
  'token_str': 'sir',
  'sequence': 'stately, sir buck mulligan came from the stairhead, bearing a bowl of la

We see that the actual word, "plump," did not make the top 5.

Now let's unmask "plump" and mask "lather."

In [8]:
# Let's mask "lather"
unmasker("Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of [MASK] on which a mirror and a razor lay crossed.")

[{'score': 0.16707152128219604,
  'token': 2300,
  'token_str': 'water',
  'sequence': 'stately, plump buck mulligan came from the stairhead, bearing a bowl of water on which a mirror and a razor lay crossed.'},
 {'score': 0.07017841935157776,
  'token': 8416,
  'token_str': 'cloth',
  'sequence': 'stately, plump buck mulligan came from the stairhead, bearing a bowl of cloth on which a mirror and a razor lay crossed.'},
 {'score': 0.058426160365343094,
  'token': 7815,
  'token_str': 'soap',
  'sequence': 'stately, plump buck mulligan came from the stairhead, bearing a bowl of soap on which a mirror and a razor lay crossed.'},
 {'score': 0.05204063653945923,
  'token': 20717,
  'token_str': 'stew',
  'sequence': 'stately, plump buck mulligan came from the stairhead, bearing a bowl of stew on which a mirror and a razor lay crossed.'},
 {'score': 0.04700753837823868,
  'token': 4511,
  'token_str': 'wine',
  'sequence': 'stately, plump buck mulligan came from the stairhead, bearing a bow

While "lather" is not picked, the 3rd choice of the model is "soap," which is closely related in meaning.
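When you compare BERT's guesses with the original text, it helps to ask where the true word ranks. The sketch below is plain Python and does not call the model; `lather_output` copies the token strings and scores printed above, and the helper name `rank_of` is made up for this lesson.

```python
def rank_of(predictions, target):
    """Return the 1-based rank of target among the predictions, or None if absent."""
    for rank, prediction in enumerate(predictions, start=1):
        if prediction["token_str"] == target:
            return rank
    return None


# Token strings and scores copied from the "lather" output above
lather_output = [
    {"token_str": "water", "score": 0.16707152128219604},
    {"token_str": "cloth", "score": 0.07017841935157776},
    {"token_str": "soap", "score": 0.058426160365343094},
    {"token_str": "stew", "score": 0.05204063653945923},
    {"token_str": "wine", "score": 0.04700753837823868},
]

print("lather:", rank_of(lather_output, "lather"))
print("soap:", rank_of(lather_output, "soap"))
```

A result of `None` only means the word is outside the candidates that were returned, not that BERT gives it zero probability.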

### Task 3: A quote from literature or other notable work

Now it is your turn.

Find a quote from literature or another notable work, such as a philosophical or religious text, and state where the quote is from.

Mask at least two different words and see how BERT performs.

In [9]:
# Type your quote with the source and then your code.
unmasker("Artificial Intelligence [MASK] take over the [MASK].")


### Task 4: Bias in the model

Run the following two code cells.

In [10]:
# Men at work
unmasker("The man worked as a [MASK].")

[{'score': 0.09747515618801117,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the man worked as a carpenter.'},
 {'score': 0.05238299444317818,
  'token': 15610,
  'token_str': 'waiter',
  'sequence': 'the man worked as a waiter.'},
 {'score': 0.04962710663676262,
  'token': 13362,
  'token_str': 'barber',
  'sequence': 'the man worked as a barber.'},
 {'score': 0.03788600116968155,
  'token': 15893,
  'token_str': 'mechanic',
  'sequence': 'the man worked as a mechanic.'},
 {'score': 0.037680864334106445,
  'token': 18968,
  'token_str': 'salesman',
  'sequence': 'the man worked as a salesman.'}]

In [11]:
# Women at work
unmasker("The woman worked as a [MASK].")

[{'score': 0.21981407701969147,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the woman worked as a nurse.'},
 {'score': 0.15974010527133942,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the woman worked as a waitress.'},
 {'score': 0.1154722198843956,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the woman worked as a maid.'},
 {'score': 0.037968605756759644,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the woman worked as a prostitute.'},
 {'score': 0.030423644930124283,
  'token': 5660,
  'token_str': 'cook',
  'sequence': 'the woman worked as a cook.'}]

What do you notice about the top five responses for men and women? Explain.
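To support your explanation with numbers, you could compare the two result lists directly. The sketch below is plain Python and does not call the model; the words and scores are copied from the two outputs above, with scores rounded to four decimal places, and the variable names are made up for this lesson.

```python
# (word, score) pairs copied from the outputs above, scores rounded
man_jobs = [
    ("carpenter", 0.0975), ("waiter", 0.0524), ("barber", 0.0496),
    ("mechanic", 0.0379), ("salesman", 0.0377),
]
woman_jobs = [
    ("nurse", 0.2198), ("waitress", 0.1597), ("maid", 0.1155),
    ("prostitute", 0.0380), ("cook", 0.0304),
]

# Which occupations appear in both top-five lists?
shared = {word for word, _ in man_jobs} & {word for word, _ in woman_jobs}
print("Shared occupations:", shared if shared else "none")

# How much probability does the model put on its top five in each case?
man_mass = sum(score for _, score in man_jobs)
woman_mass = sum(score for _, score in woman_jobs)
print(f"Top-five total for 'man':   {man_mass:.2f}")
print(f"Top-five total for 'woman': {woman_mass:.2f}")
```

These two figures are only a starting point. Consider what the overlap, or lack of it, and the difference in totals say about the text the model was trained on.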

## Summary

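We were introduced to using `transformers` in Python with the pretrained BERT model `bert-base-uncased`. We used it to fill in masked words, tested it on quotes from literature, and saw that its predictions can reflect bias.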