# Week 3: Transformers, BERT and Transfer Learning with Language Models

### 1. Subword tokenization

**(a) Compare the tokenizations of the mBERT tokenizer of texts from two different language(-varietie)s you are able to understand/read. Use English, the dominant language in mBERT, with a lower-resource language variety (for example Danish). If you only know 1 language, try to use a different variety of the language (for example for English, use social media abbreviations or typos, e.g.: c u tmrw). You can collect data from any source, or make up your own sentences.**

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

english_text = "My name is Andreas, and I am a guest student at ITU"
danish_text = "Mit navn er Andreas, og jeg er en gæstestuderende ved ITU"

print(tokenizer.tokenize(english_text))
print(tokenizer.tokenize(danish_text))

['My', 'name', 'is', 'Andreas', ',', 'and', 'I', 'am', 'a', 'guest', 'student', 'at', 'ITU']
['Mit', 'navn', 'er', 'Andreas', ',', 'og', 'jeg', 'er', 'en', 'g', '##æste', '##stu', '##deren', '##de', 'ved', 'ITU']




**(b) Now test the tokenizer of a language model that is trained for your target language. You can find language models on https://huggingface.co/search/full-text?type=model. Can you observe any differences in the results (in amount/length of subwords)? Do the results match your intuition of separating mostly short meaning-carrying subwords?**

In [6]:
danish_tokenizer = AutoTokenizer.from_pretrained("Maltehb/danish-bert-botxo")

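print(danish_tokenizer.tokenize(english_text)) # English input: the Danish vocabulary splits some English words ('guest', 'ITU') and lowercases everything
print(danish_tokenizer.tokenize(danish_text)) # Danish input: 'gæstestuderende' is now split into two meaningful pieces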

['my', 'name', 'is', 'andreas', ',', 'and', 'i', 'am', 'a', 'gu', '##est', 'student', 'at', 'it', '##u']
['mit', 'navn', 'er', 'andreas', ',', 'og', 'jeg', 'er', 'en', 'gæste', '##studerende', 'ved', 'it', '##u']
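The difference between the two tokenizers on the Danish sentence can be quantified by counting subwords per word. The following is a minimal sketch that reuses the token lists printed above; the helper name `subword_stats` is made up for illustration, and punctuation tokens are counted as words for simplicity.

```python
# Token lists copied from the outputs above (Danish sentence only).
mbert_danish = ['Mit', 'navn', 'er', 'Andreas', ',', 'og', 'jeg', 'er', 'en',
                'g', '##æste', '##stu', '##deren', '##de', 'ved', 'ITU']
danish_bert_danish = ['mit', 'navn', 'er', 'andreas', ',', 'og', 'jeg', 'er', 'en',
                      'gæste', '##studerende', 'ved', 'it', '##u']


def subword_stats(tokens):
    """Return (number of words, number of subword tokens, subwords per word).

    A token starting with '##' continues the previous word (WordPiece convention).
    Punctuation tokens are counted as words here.
    """
    n_tokens = len(tokens)
    n_words = sum(1 for tok in tokens if not tok.startswith("##"))
    return n_words, n_tokens, n_tokens / n_words


for name, toks in [("mBERT", mbert_danish), ("Danish BERT", danish_bert_danish)]:
    n_words, n_tokens, ratio = subword_stats(toks)
    print(f"{name}: {n_words} words, {n_tokens} subwords, {ratio:.2f} subwords per word")
```

A lower ratio means fewer splits per word. On this one sentence the Danish tokenizer needs fewer pieces, but a single sentence is of course too little data to generalise from.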


**(c) Think of two example inputs where the tokenizer might struggle to find a meaningful segmentation (for example by introducing typos). Why are these cases difficult? Did the tokenizer do something sensible?**

In [7]:
# try examples with spelling errors
bad_english_text = "The students arenlt doring a good  job with the ai resourses they have"
bad_danish_text = "De studerndnfe gør det ringe med den ksunstige inteligens de har"

print(tokenizer.tokenize(bad_english_text))
print(danish_tokenizer.tokenize(bad_danish_text))

# The segmentation is worse around the typos, but it still recovers many meaningful subwords (e.g. 'aren', '##ing', 'studer', 'intel')

['The', 'students', 'aren', '##lt', 'dor', '##ing', 'a', 'good', 'job', 'with', 'the', 'ai', 'reso', '##urs', '##es', 'they', 'have']
['de', 'studer', '##nd', '##n', '##fe', 'gør', 'det', 'ringe', 'med', 'den', 'k', '##su', '##ns', '##tige', 'intel', '##igen', '##s', 'de', 'har']


_______

### 2. Cross-domain transfer

**(a) Train a sentiment analysis model with BERT (bert-base-cased) on the English SST data. Evaluate it on the SST data as well as on the English Twitter data from SemEval2013. Is there a similar performance drop as in assignment 1?**

In [9]:
!python3 sentiment/bert-classification.py sentiment/sst.train sentiment/sst.dev

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


reading data...
tokenizing...
tokenizer_config.json: 100%|█████████████████| 49.0/49.0 [00:00<00:00, 63.1kB/s]
config.json: 100%|█████████████████████████████| 570/570 [00:00<00:00, 3.02MB/s]
vocab.txt: 100%|█████████████████████████████| 213k/213k [00:00<00:00, 1.16MB/s]
tokenizer.json: 100%|████████████████████████| 436k/436k [00:00<00:00, 4.21MB/s]
converting to batches...
initializing model...
model.safetensors: 100%|█████████████████████| 436M/436M [00:48<00:00, 9.00MB/s]
training...
starting epoch 0
Loss: 3969.92
Acc(dev): 89.24

starting epoch 1
Loss: 2006.38
Acc(dev): 90.74

starting epoch 2
Loss: 1250.86
Acc(dev): 89.70

starting epoch 3
Loss: 776.82
Acc(dev): 91.20

starting epoch 4
Loss: 501.08
Acc(dev): 91.20



In [10]:
!python3 sentiment/bert-classification.py sentiment/sst.train sentiment/semeval2013.dev

reading data...
tokenizing...
converting to batches...
initializing model...
training...
starting epoch 0
Loss: 3969.92
Acc(dev): 87.10

starting epoch 1
Loss: 2006.38
Acc(dev): 87.10

starting epoch 2
Loss: 1250.86
Acc(dev): 86.70

starting epoch 3
Loss: 776.82
Acc(dev): 85.37

starting epoch 4
Loss: 501.08
Acc(dev): 84.31



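The final accuracies drop from 91.20% to 84.31% (sst.dev vs. semeval2013.dev), a drop of 6.89 points.

In assignment 1, the accuracy drop was from 77.18% to 64.74% (sst.dev vs. semeval2013.dev), a drop of 12.44 points. So there is still a drop with BERT, but it is roughly half as large as in assignment 1.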
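The two drops quoted above can be compared both in absolute points and relative to the in-domain accuracy. This is a small arithmetic sketch using only the four accuracies reported here; the function name `accuracy_drop` is made up for illustration.

```python
def accuracy_drop(in_domain, out_of_domain):
    """Return (absolute drop in points, relative drop in percent of in-domain accuracy)."""
    absolute = in_domain - out_of_domain
    relative = 100 * absolute / in_domain
    return round(absolute, 2), round(relative, 2)


# Accuracies copied from the text above (sst.dev vs. semeval2013.dev).
bert_drop = accuracy_drop(91.20, 84.31)
assignment1_drop = accuracy_drop(77.18, 64.74)

print("BERT:         drop of", bert_drop[0], "points,", bert_drop[1], "% relative")
print("Assignment 1: drop of", assignment1_drop[0], "points,", assignment1_drop[1], "% relative")
```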

**(b) Inspect the code and try to understand the steps of the inference and the training procedure. What is the shape of the output scores variable of the forward function? What do the dimensions represent?**

*Dimensions:*
- The shape of the output_scores variable is (batch_size, num_labels).
- batch_size is the number of sentences in the batch (set to 16 in the code).
- num_labels is the number of classes in the dataset (in the code it is set to the length of the list of unique labels).

*Inference:*
1. BERT gets tensors of wordpiece indices and a mask (both of shape (batch_size, max_sent_len)) as input and produces hidden states.
2. We extract the hidden state of the [CLS] token, which works as a sort of sentence summary.
3. A linear layer transforms the [CLS] representation into a vector of output scores, one per label. The label with the highest score is the prediction; the scores themselves are unnormalized, and a softmax turns them into the probability of the sentence having each label.
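To make the last step concrete, here is a minimal pure-Python sketch of how a (batch_size, num_labels) matrix of output scores maps to probabilities and predictions. The score values and the label order are made up for illustration, and this is not the code from bert-classification.py.

```python
import math


def softmax(scores):
    """Turn a list of unnormalized scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output scores with batch_size=2 and num_labels=2 (values are made up).
output_scores = [[2.0, -1.0],
                 [0.5, 0.5]]
labels = ["negative", "positive"]  # assumed label order, for illustration only

for row in output_scores:
    probs = softmax(row)
    predicted = labels[probs.index(max(probs))]
    print([round(p, 3) for p in probs], "->", predicted)
```

Since softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same prediction.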

**(c) Now train a sentiment model with the twitter embeddings (cardiffnlp/twitter-xlm-roberta-base) with the SST train data. Does it transfer better to the English Twitter data compared to the mBERT model?**

In [11]:
!python3 sentiment/bert-classification2.py sentiment/sst.train sentiment/semeval2013.dev

reading data...
tokenizing...
config.json: 100%|█████████████████████████████| 652/652 [00:00<00:00, 1.36MB/s]
sentencepiece.bpe.model: 100%|█████████████| 5.07M/5.07M [00:00<00:00, 10.2MB/s]
tokenizer.json: 100%|██████████████████████| 9.10M/9.10M [00:01<00:00, 7.09MB/s]
converting to batches...
initializing model...
pytorch_model.bin: 100%|███████████████████| 1.11G/1.11G [01:53<00:00, 9.82MB/s]
Some weights of XLMRobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-xlm-roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
training...
starting epoch 0
Traceback (most recent call last):
  File "/Users/andreasalkemade/Desktop/NLP/assignment2/sentiment/bert-classification2.py", line 170, in <module>
    optimizer.step()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packa

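The run could not be completed: training crashed at the first optimizer step of epoch 0 (due to lack of memory, I think), so there is no result to compare for this model.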

____

### 3. Cross-lingual transfer

**(a) Train an English BERT model (bert-base-cased on huggingface) on the reviews data from SST, and evaluate it on the SST data as well as the Danish Twitter data. Is there a performance drop when going to the Danish data? How does the performance on the Danish data compare to the majority baseline?**

In [17]:
!python3 sentiment/bert-classification3.py train sentiment/sst.train

reading data...
tokenizing...
converting to batches...
initializing model...
training...
starting epoch 0
Loss: 3969.92

starting epoch 1
Loss: 2006.38

starting epoch 2
Loss: 1250.86

starting epoch 3
Loss: 776.82

starting epoch 4
Loss: 501.08

Model saved to sentiment_model.pth


In [19]:
!python3 sentiment/bert-classification3.py evaluate sentiment/sst.dev

reading test data...
tokenizing...
converting to batches...
Acc(test): 91.20


In [18]:
!python3 sentiment/bert-classification3.py evaluate sentiment/twitter-da.dev

reading test data...
tokenizing...
converting to batches...
Acc(test): 63.59


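There is a big drop in performance: accuracy falls from 91.20% on sst.dev to 63.59% on twitter-da.dev, a drop of 27.61 points.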
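The question also asks for a comparison with the majority baseline, i.e. the accuracy of always predicting the most frequent label. Below is a minimal sketch of how that could be computed. It assumes the gold labels have already been read into a list; the toy labels are made up, and the file format of twitter-da.dev is not shown here.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy (in %) of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, 100 * count / len(labels)


# Made-up labels for illustration; the real ones would come from sentiment/twitter-da.dev.
toy_labels = ["negative", "positive", "negative", "negative", "positive"]

label, accuracy = majority_baseline(toy_labels)
print(f"Always predicting '{label}' gives {accuracy:.2f}% accuracy")
```

If the resulting number on the real Danish labels is close to 63.59%, the English model has learned little that transfers to Danish.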

**(b) Train an mBERT model (bert-base-multilingual-cased on huggingface) on the reviews data from SST, and evaluate it on the Danish development split. What is the performance? How does it compare to the English BERT model?**

___

# Week 4: Autoregressive language models

You can use the code in qa.py as a starting point. For evaluation, it should be noted that
the evaluation metric is a custom metric “designed” by Rob: it checks whether at least half of the gold words
are in the predicted output.
1. Add 5 common-knowledge questions and answers in the corresponding python lists.

I used these questions and answers for this task:

questions = [
"What is the most populated country in the world?",
"What is a boy to his mom?",
"Which country lost second world war?",
"What city is called 'The big apple'?",
"What country was Chistopher Columbus looking for when he discovered America?"]

answers = [
"India.",
"Her son.",
"Germany.",
"New York.",
"India.",
]

2. What is the performance of FLAN-t5 base on your questions?

In [1]:
!python3 qa/qa.py

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 What is the most populated country in the world? 
sweden

 What is a boy to his mom? 
a sailor

 Which country lost second world war? 
poland

 What city is called 'The big apple'? 
san francisco

 What country was Chistopher Columbus looking for when he discovered America? 
el reino de espaa

0 out of 5 correct


It answered all 5 questions incorrectly (0 out of 5 correct), so the performance is poor.

3. What are possible pitfalls of the evaluation metric?

It is very binary: an answer counts as either correct or incorrect, so information about partially correct answers is lost. Furthermore, some answers might include gold words just by chance (for example because they are common words), and the metric does not account for this.
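These pitfalls are easy to reproduce with a small re-implementation of the metric as it is described in this assignment (at least half of the gold words must occur in the prediction). This is a sketch under that description, not the actual code from qa.py; in particular the lowercasing and the `\w+` word splitting are assumptions.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens (assumed normalisation)."""
    return re.findall(r"\w+", text.lower())


def is_correct(gold, predicted):
    """True if at least half of the gold words occur in the predicted output."""
    gold_words = words(gold)
    predicted_words = set(words(predicted))
    found = sum(1 for w in gold_words if w in predicted_words)
    return found >= len(gold_words) / 2


print(is_correct("India.", "sweden"))               # False: no gold word is found
print(is_correct("New York.", "new york city"))     # True: both gold words are found
print(is_correct("India.", "india china sweden"))   # True, although the model just lists guesses
print(is_correct("Her son.", "her daughter"))       # True, although the answer is wrong
```

The last two calls show the weakness: listing several candidates, or sharing a function word with the gold answer, is enough to be counted as correct.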

4. If the model has made some errors: why is this the case?

It doesn't seem to understand which words are the important ones. For example, in the last question it knows that the answer has something to do with a ship, but it doesn't recognise that it is specifically Boba Fett's ship. The correct answer was not probable enough, maybe because Boba Fett is a rare word.

5. Experiment with at least 2 different prefixes and postfixes; do they improve performance?

In [4]:
# First set of prefix and postfix

!python3 qa/qa.py


The following is a question you need to answer: What is the most populated country in the world? The answer to that question is..
China

The following is a question you need to answer: What language do they speak in Brazil? The answer to that question is..
Portuguese Language

The following is a question you need to answer: Which country lost second world war? The answer to that question is..
poland

The following is a question you need to answer: What city is called 'The big apple'? The answer to that question is..
san francisco

T

The performance is better for both sets, especially for my last set of prefix and postfix. The difference could be that I added the adjectives "simple" and "short", which might push the model towards a more intuitive answer.
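For reference, the prompts in the output above can be assembled by wrapping each question in a prefix and a postfix. The sketch below uses the first prefix/postfix pair visible in that output; `build_prompt` is a hypothetical helper, and qa.py may join the parts differently.

```python
def build_prompt(question, prefix="", postfix=""):
    """Join prefix, question and postfix with single spaces, skipping empty parts."""
    parts = [prefix, question, postfix]
    return " ".join(part for part in parts if part)


# Prefix and postfix copied from the first experiment's output above.
prefix = "The following is a question you need to answer:"
postfix = "The answer to that question is.."

prompt = build_prompt("What is the most populated country in the world?", prefix, postfix)
print(prompt)
```

Keeping the wrapping in one function makes it easy to loop over several prefix/postfix pairs and compare the scores.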

### 4. Question answering with FLAN-T5

Now we will evaluate the FLAN-T5 model on the Star Wars domain. For this, I have scraped 66 Star Wars
trivia questions from https://parade.com/1161189/alexandra-hurtado/star-wars-trivia/ they have
been pre-processed and are available in questions.txt and answers.txt.
1. What is the performance of the FLAN-T5 model out-of-the-box?
2. Can you improve performance with your pre- and post- fixes? Why?
3. We also provide you with the raw text from Wookieepedia, this is a fandom wiki with information about the Star Wars universe written in English. It has been scraped using the procedure described on https://robvanderg.github.io/datasets/wikia/, and is available in starwarsfandomcom-20200223.txt.cleaned.tok.uniq.txt.gz. Use the words from the questions to find the 5 sentences with the highest word overlap (the raw sentences with the largest coverage of words with respect to the question). Add these sentences as a prefix, separated with newlines. Does performance increase?
4. Bonus: experiment with better variants of data selection, what is the highest score you can obtain?
Note that ChatGPT achieved a score of 51; feel free to also use larger language models, for example
google/flan-t5-large.

1. What is the performance of the FLAN-T5 model out-of-the-box?

In [38]:
# Pass the questions and answers to the QA model
!python3 qa/qa.py

 Where did Obi-Wan take Luke after his birth? 
san diego

 Who is Palpatine's granddaughter? 
elizabeth

 Who was Anakin Skywalker's Padawan? 
taiwan

 Where is Jabba the Hutt's Palace located? 
san francisco

 What's the name of Boba Fett's ship? 
sailor s sailor

 Who are Kylo Ren's parents? 
evan ron

 Who killed Qui-Gon Jinn? 
taekwondo player

 According to Yoda, there are always how many Sith Lords...no more, no less? 
ten

 Who built C-3PO? 
samuel h. savage

 What is the name of Han Solo's ship? 
sailor of saigon

 Who acted

2. Can you improve performance with your pre- and post- fixes? Why?

In [43]:
# I'm using this:
# Prefix: 'The following is a question inspired by the Star Wars series. The question is:'
# Postfix: 'A short and precise answer is:'

!python3 qa/qa.py 

The following question is about Star Wars: Where did Obi-Wan take Luke after his birth? The answer to the question is:
the Star Wars galaxy

The following question is about Star Wars: Who is Palpatine's granddaughter? The answer to the question is:
Princess Leia

The following question is about Star Wars: Who was Anakin Skywalker's Padawan? The answer to the question is:
Padawan

The following question is about Star Wars: Where is Jabba the Hutt's Palace located? The answer to the question is:
in the middle of the Star Wars galaxy



It performs slightly better, but still not great. When the prefix reveals that the questions are about Star Wars, the model clearly has a better idea of where in the space of words to look, but it lacks precision. This is probably because the questions are very specific and there are too many candidate answers with nearly equal probability. The model has simply not been trained enough on this domain to distinguish between them. Star Wars also has a large cast of characters, and the questions contain a lot of names, which makes it harder to predict the right answer.


3. We also provide you with the raw text from Wookieepedia, this is a fandom wiki with information about the Star Wars universe written in English. It has been scraped using the procedure described on https://robvanderg.github.io/datasets/wikia/, and is available in starwarsfandomcom-20200223.txt.cleaned.tok.uniq.txt.gz. Use the words from the questions to find the 5 sentences with the highest word overlap (the raw sentences with the largest coverage of words with respect to the question). Add these sentences as a prefix, separated with newlines. Does performance increase?
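A minimal sketch of the retrieval step described in this question: score every sentence by how many distinct question words it covers, keep the 5 best, and put them in front of the question separated by newlines. The function names, the `\w+` tokenization and the toy sentences are assumptions made for illustration; on the real data the sentences would be read from the gzipped Wookieepedia file.

```python
import gzip
import heapq
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def load_sentences(path):
    """Read one sentence per line from a gzipped text file (format is assumed)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def top_overlap_sentences(question, sentences, k=5):
    """Return the k sentences that cover the most distinct question words."""
    question_words = tokenize(question)
    return heapq.nlargest(k, sentences, key=lambda s: len(question_words & tokenize(s)))


def build_context_prompt(question, sentences, k=5):
    """Prefix the question with the retrieved sentences, separated by newlines."""
    return "\n".join(top_overlap_sentences(question, sentences, k) + [question])


# Toy corpus, made up for illustration (not taken from the provided file).
toy_sentences = [
    "Boba Fett's ship is called Slave I .",
    "Han Solo flies the Millennium Falcon .",
    "Tatooine is a desert planet .",
]
question = "What's the name of Boba Fett's ship?"
print(build_context_prompt(question, toy_sentences, k=2))
```

Plain word overlap rewards sentences that share frequent words such as "the" or "of", so removing stopwords or weighting rare words more heavily are natural variants to try for the bonus question.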


4. Bonus: experiment with better variants of data selection, what is the highest score you can obtain?

### 5. Domain Adaptation through Retrieval Augmented Generation

### 6. Domain adaptation through fine tuning