# Bert tutorial

Erik Tjong Kim Sang, e.tjongkimsang@esciencecenter.nl, Netherlands eScience Center

[Bert](https://en.wikipedia.org/wiki/BERT_(language_model)) is a program for processing natural language text. This Jupyter notebook provides a few examples of how you can run Bert. The notebook consists of explanatory text and code snippets. The code snippets can be identified by the two square brackets and the colon on their left: `[ ]:`. You can run a code snippet by clicking on it and then pressing the Shift and Enter keys at the same time. Alternatively, you can click on "Run" in the top menu and then select "Run Selected Cells".

## 1. An example of Bert

Bert is a machine learning model of the [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) type. The transformer architecture was introduced by Google in 2017, and Bert followed in 2018. To run Bert, we first need to load the transformers library. This is not part of the Python standard library but a widely used third-party Python library developed by Hugging Face, which we will use in this Jupyter notebook.

In [2]:
import transformers

Running Bert results in some warning messages. We can turn these off with the next command. If you want to see the warning messages, skip the next command. If you have already run it and want the warnings back, you can turn them on again with the command `transformers.utils.logging.set_verbosity_warning()`. Other available message levels are `info` and `debug`.

In [3]:
transformers.utils.logging.set_verbosity_error()

Bert has been trained on large volumes of English text. Its task was to predict a word (in Bert terms called a *token*) given the surrounding words. This task cannot be performed perfectly. However, by training on this task, Bert learned relations between words: certain words can appear in a given context while others cannot. This is valuable information when processing language.

Transformers can perform different tasks based on different models. To run Bert we first need to specify which task we want to perform and which model we want to use. Bert's training task is called `fill-mask`, which involves guessing one missing word in a sentence, where the missing word has been replaced by the special token `[MASK]`. For the model we choose Bert. The other words in the model name will be explained later.

In [6]:
run_bert_pipeline = transformers.pipeline(task='fill-mask', model='bert-base-cased')

The command `transformers.pipeline` creates the pipeline `run_bert_pipeline`. A pipeline is a sequence of software modules which together perform a task. Our pipeline `run_bert_pipeline` takes a sentence with a masked word as input and generates suggestions for the masked word as output. It contains three software modules: 1. a tokenizer, which converts the sentence to numbers; 2. the Bert model, which only processes numbers; 3. post-processing code which, among other things, converts the Bert output back to text.
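
The three stages can be illustrated with a toy sketch in plain Python. This is not how the transformers library is implemented: the vocabulary, the function names and the scores below are all made up for illustration, and the "model" is a stand-in that returns fixed numbers.

```python
# Hypothetical toy vocabulary; a real Bert vocabulary holds tens of thousands of tokens.
VOCAB = ["[MASK]", "John", "travelled", "to", "Paris", "London"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def tokenize(sentence):
    """Stage 1: convert the text to a list of token ids (numbers)."""
    return [TOKEN_TO_ID[word] for word in sentence.split()]


def toy_model(token_ids):
    """Stage 2: stand-in for Bert. It only sees numbers and returns
    made-up scores (not real Bert output) for candidate token ids."""
    assert TOKEN_TO_ID["[MASK]"] in token_ids, "input needs a masked token"
    return {TOKEN_TO_ID["Paris"]: 0.7, TOKEN_TO_ID["London"]: 0.3}


def postprocess(scores, top_k=2):
    """Stage 3: rank the candidates and convert the ids back to text."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [{"token_str": VOCAB[i], "score": s} for i, s in ranked[:top_k]]


def run_toy_pipeline(sentence):
    """Chain the three stages: text -> numbers -> scores -> text."""
    return postprocess(toy_model(tokenize(sentence)))


token_ids = tokenize("John travelled to [MASK]")
output = run_toy_pipeline("John travelled to [MASK]")
print(token_ids)
print(output)
```

The point of the sketch is only the data flow: text goes in, the middle stage works purely on numbers, and the last stage turns numbers back into readable suggestions.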

Next we can apply the Bert pipeline to a sentence. We store the results in the variable `results`.

In [9]:
results = run_bert_pipeline("John travelled to [MASK] to visit the Eiffel Tower")

The output of Bert can be inspected with the following command: 

In [10]:
results

[{'score': 0.533401608467102,
  'token': 2123,
  'token_str': 'Paris',
  'sequence': 'John travelled to Paris to visit the Eiffel Tower'},
 {'score': 0.18619701266288757,
  'token': 1498,
  'token_str': 'London',
  'sequence': 'John travelled to London to visit the Eiffel Tower'},
 {'score': 0.08536867052316666,
  'token': 1699,
  'token_str': 'France',
  'sequence': 'John travelled to France to visit the Eiffel Tower'},
 {'score': 0.041358355432748795,
  'token': 9062,
  'token_str': 'Brussels',
  'sequence': 'John travelled to Brussels to visit the Eiffel Tower'},
 {'score': 0.021977808326482773,
  'token': 1980,
  'token_str': 'Europe',
  'sequence': 'John travelled to Europe to visit the Eiffel Tower'}]

Bert has provided five suggestions for the masked word. These have been ranked in the output by their score. The highest scoring suggestion is *Paris*, which is a good one. *France* and *Europe* are also good suggestions. However, *London* and *Brussels* are incorrect in this context.

As you can see, Bert has suggested only locations for the missing word. This shows the strength of the Bert language model. There are many words that do not fit at the position of the missing word, like `the` and `knitting`. Bert has learned this while reading many texts, and it does not suggest such words for the missing word of this sentence.
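
The output is an ordinary Python list of dictionaries, so it can be processed with standard Python. The sketch below is one possible way to turn it into a compact ranking. The data is copied from the output above with the scores rounded to four decimals; the function name `summarize` and the threshold `min_score` are illustrative choices, not part of the transformers library.

```python
# Fill-mask output copied from above (scores rounded to four decimals).
results = [
    {"score": 0.5334, "token": 2123, "token_str": "Paris"},
    {"score": 0.1862, "token": 1498, "token_str": "London"},
    {"score": 0.0854, "token": 1699, "token_str": "France"},
    {"score": 0.0414, "token": 9062, "token_str": "Brussels"},
    {"score": 0.0220, "token": 1980, "token_str": "Europe"},
]


def summarize(results, min_score=0.05):
    """Return (word, score) pairs, best first, for suggestions
    that score at least min_score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked if r["score"] >= min_score]


summary = summarize(results)
for word, score in summary:
    # Print each score as a percentage.
    print(f"{word:10s} {100 * score:5.1f}%")
```

In the notebook itself you can skip the copied data and call `summarize(results)` directly on the variable `results` created above.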

### Assignment 1

Make five alternative sentences with a missing word and process them with Bert by replacing the Eiffel Tower sentence above and rerunning the code. Does Bert generate plausible suggestions for the missing words?

## 2. Different variants of Bert

There are many different variants of the Bert language model. They differ in the size of the model, the type of training data and the language of the training data. Here are a few frequent keywords of variants:
- **base**: the smaller model size (about 110 million parameters in the original Bert)
- **large**: the larger model size (about 340 million parameters in the original Bert), which is slower to run but often more accurate
- **cased**: the training data contains upper case characters and lower case characters
- **uncased**: the training data was converted to only lower case characters (useful for processing social media text)
- **multilingual**: the training data contains text in different languages

The website [huggingface.co](https://huggingface.co) contains a large collection of transformer models, including many Bert models.

### Assignment 2

Look for an alternative Bert model on the website [huggingface.co](https://huggingface.co), for example a model for a language other than English. Create a pipeline for this model by replacing `bert-base-cased` above with the name of this model. Then create five sentences with a missing word and test the model with your sentences. What are your observations? Does Bert generate plausible suggestions for the missing words? If you chose an alternative English model: how does it compare with `bert-base-cased`? Are the missing word suggestions better? How do the processing times compare with `bert-base-cased`?
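
For the processing-time comparison, a small timing helper may be useful. The sketch below uses only the Python standard library. The names `time_pipeline` and `fake_pipeline` are made up for this example, and the stand-in function is there only so that the code runs without downloading a model.

```python
import time


def time_pipeline(run, sentences, repeats=3):
    """Return the average number of seconds `run` needs per sentence."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sentence in sentences:
            run(sentence)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(sentences))


def fake_pipeline(sentence):
    """Stand-in for a real pipeline; returns a fixed, made-up result."""
    return [{"token_str": "Paris", "score": 1.0}]


sentences = ["John travelled to [MASK] to visit the Eiffel Tower"]
average = time_pipeline(fake_pipeline, sentences)
print(f"{average:.6f} seconds per sentence")
```

In the notebook you would pass your own pipelines instead of `fake_pipeline`, for example `time_pipeline(run_bert_pipeline, sentences)`, once for each model you want to compare.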

## 3. Other Bert tasks

In natural language processing we work on other tasks than predicting a missing word in a sentence. Examples of such tasks are part-of-speech tagging (finding the syntactic classes of words) and named entity recognition (finding names in text). A challenge for these tasks is that we need human-made examples. These require time and effort to create and we need many of them to train a model that performs well.

This is where a large general language model like Bert is of great value. Instead of starting the training process for a natural language processing task from zero, we can start from a Bert model. Then we can benefit from the vocabulary and word relations already present in the Bert model. We do not need many examples of our task because Bert will infer that related words unmentioned in our training data need to be treated the same as the words present in our training data.

Take for example the task of named entity recognition. If we provide Bert with the sentence *John travelled to Paris to visit the Eiffel Tower* and tell it that *Paris* is a location, it will infer that *London*, *France*, *Brussels*, *Europe* and many other similar words are also locations. Bert can draw this conclusion because internally it represents all words as numeric vectors and after its initial training process all words used in similar contexts will have obtained a similar word vector.
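
A common way to measure how similar two word vectors are is cosine similarity. The sketch below illustrates the idea with made-up three-dimensional vectors. Real Bert vectors are computed by the model, depend on the surrounding sentence and have hundreds of dimensions, so the numbers here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; these are not real Bert vectors.
vectors = {
    "Paris": [0.9, 0.8, 0.1],
    "London": [0.8, 0.9, 0.2],
    "knitting": [0.1, 0.2, 0.9],
}

sim_city = cosine_similarity(vectors["Paris"], vectors["London"])
sim_other = cosine_similarity(vectors["Paris"], vectors["knitting"])
print(f"Paris vs London:   {sim_city:.2f}")
print(f"Paris vs knitting: {sim_other:.2f}")
```

With these toy vectors the two city names point in almost the same direction while *knitting* points elsewhere, which is the kind of pattern that lets a model treat unseen location words like the ones it saw in training.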

The website [huggingface.co](https://huggingface.co) also contains pretrained Bert models for several natural language tasks. Let's test named entity recognition for English. We can use the same pipeline command as above to load the model but this time we need to specify `ner` as task.

In [11]:
run_bert_pipeline = transformers.pipeline(task='ner', model='dslim/bert-base-NER')

In [12]:
results = run_bert_pipeline("John travelled to Paris to see the Eiffel Tower")

In [13]:
results

[{'entity': 'B-PER',
  'score': 0.99826694,
  'index': 1,
  'word': 'John',
  'start': 0,
  'end': 4},
 {'entity': 'B-LOC',
  'score': 0.99977285,
  'index': 4,
  'word': 'Paris',
  'start': 18,
  'end': 23},
 {'entity': 'B-LOC',
  'score': 0.98286414,
  'index': 8,
  'word': 'E',
  'start': 35,
  'end': 36},
 {'entity': 'I-LOC',
  'score': 0.7802712,
  'index': 9,
  'word': '##iff',
  'start': 36,
  'end': 39},
 {'entity': 'B-LOC',
  'score': 0.72092605,
  'index': 10,
  'word': '##el',
  'start': 39,
  'end': 41},
 {'entity': 'I-LOC',
  'score': 0.9886309,
  'index': 11,
  'word': 'Tower',
  'start': 42,
  'end': 47}]

The output of Bert consists of a list of the names found with their type and score: *John* (person name) and *Paris* and *Eiffel Tower* (locations). We see that *Eiffel Tower* has been split into four parts. From its initial training, Bert has a fixed vocabulary of tokens. Whenever it encounters a word in a text that is not in this vocabulary (like *Eiffel* in this sentence), it splits this word into tokens that it does know. This is why *Eiffel* is split into *E*, *iff* and *el*. To make clear that a word was split, the non-initial parts are preceded by two hashes (##), which indicate that they were attached to the previous token(s) in the input. We need to do extra work to put them together from the Bert output.
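
Both directions, splitting a word into known pieces and gluing the pieces back together, can be sketched in a few lines of Python. The splitting function follows the greedy longest-match-first idea used by WordPiece tokenizers, but the tiny vocabulary is hypothetical and chosen so that the result matches the output above. The token list is reconstructed from the `index` values in that output, so treat it as an assumption.

```python
def split_word(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`.
    Non-initial pieces are looked up with a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No known piece fits here: the whole word is unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def merge_word_pieces(tokens):
    """Glue '##' pieces back onto the token that precedes them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Hypothetical toy vocabulary; the real Bert vocabulary is far larger.
toy_vocab = {"E", "##iff", "##el", "Tower"}
pieces = split_word("Eiffel", toy_vocab)
print(pieces)

# Token sequence assumed from the index values in the output above.
tokens = ["John", "travelled", "to", "Paris", "to", "see", "the",
          "E", "##iff", "##el", "Tower"]
words = merge_word_pieces(tokens)
print(words)
```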

The labels of the tokens in the `entity` fields show another problem of the Bert output. There are two label types in the output: `PER` (person name) and `LOC` (location). The first letter of the label indicates whether the token is the first word of the entity (`B`) or not (`I`). *John*, *Paris* and *E* are all correctly labelled as entity-initial. But *el* is also labelled as entity-initial, and this is incorrect. This is a typical error of Bert-based entity recognizers. In this case it can be corrected with extra post-processing which does not allow entities to start in the middle of a word.
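
One possible form of such post-processing is sketched below. It is a minimal example, not the method used by the transformers library: a token continues the current entity if it is a `##` piece directly attached to it (whatever its label says) or if it carries an `I` label of the same type and follows within one character. The data is copied from the output above, without the `score` and `index` fields; the function name `group_entities` is an illustrative choice.

```python
# NER output copied from above (score and index fields left out).
ner_results = [
    {"entity": "B-PER", "word": "John", "start": 0, "end": 4},
    {"entity": "B-LOC", "word": "Paris", "start": 18, "end": 23},
    {"entity": "B-LOC", "word": "E", "start": 35, "end": 36},
    {"entity": "I-LOC", "word": "##iff", "start": 36, "end": 39},
    {"entity": "B-LOC", "word": "##el", "start": 39, "end": 41},
    {"entity": "I-LOC", "word": "Tower", "start": 42, "end": 47},
]
sentence = "John travelled to Paris to see the Eiffel Tower"


def group_entities(tokens, text):
    """Merge token-level labels into (label, text) entities.
    An entity is never allowed to start in the middle of a word."""
    entities = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        continues = False
        if entities:
            last = entities[-1]
            mid_word = tok["word"].startswith("##") and tok["start"] == last["end"]
            inside = (prefix == "I" and label == last["label"]
                      and tok["start"] - last["end"] <= 1)
            continues = mid_word or inside
        if continues:
            # Extend the current entity to cover this token.
            entities[-1]["end"] = tok["end"]
        else:
            entities.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Read the entity text from the original sentence via character offsets.
    return [(e["label"], text[e["start"]:e["end"]]) for e in entities]


entities = group_entities(ner_results, sentence)
print(entities)
```

Recent versions of the transformers library also accept an `aggregation_strategy` argument when creating a `ner` pipeline, which performs a similar grouping inside the pipeline; check the documentation of your installed version for the available values.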

### Assignment 3

Look on [huggingface.co](https://huggingface.co) for a Bert natural language processing model for your favourite language. Other task definitions than `fill-mask` and `ner` can be found in the [pipeline documentation](https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html). Create five example sentences for this task and test the model with your sentences. What are your observations? Does Bert generate correct output? Do you observe any weaknesses?

## 4. Training a Bert model for your own NLP task (aka fine-tuning Bert)

Training a Bert model for your own natural language task goes beyond the scope of this tutorial. But if you are interested in this, talk to us to see what the possibilities are. Or take a look at this [Huggingface tutorial](https://huggingface.co/docs/transformers/training).