<a href="https://colab.research.google.com/github/dgromann/MCMLR_2023W/blob/main/Bonus_Exercise2_MCMLR_2023W.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 2: Multilingual and Crosslingual Methods and Language Resources**

This notebook represents the second bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2023W 340168-1). For each of the bonus exercises you can obtain a maximum of 3 points that are added to the points of your final exam. The sections where you need to complete the code are marked with 👋 ⚒.


## **Explore LMs' Internal Vocabularies**

This exercise loads pretrained language models and lets you explore their vocabularies. As always, we first need to install the transformers library to access the transformer models on HuggingFace.


In [3]:
!pip install transformers
!pip install sentencepiece



Next, we load the tokenizer of the first model used in this tutorial, the multilingual XLM-RoBERTa-base model.

In [4]:
import torch
from transformers import XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")


Now we will explore the vocabulary of the model by using the function `tokenizer.get_vocab()`. Keep in mind that the tokenizer stores tokens and identifiers. Here we are mainly interested in exploring the tokens.

👋 ⚒ Explore the following aspects about the vocabulary of XLM-R:


*   What datatype does the function `tokenizer.get_vocab()` return?
*   How many tokens are in the vocabulary of XLM-R?
*   When representing all tokens as a list, which token is stored at position 300?
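
The relationship between tokens and identifiers can be illustrated on a toy example. The sketch below uses a hand-made dictionary, `toy_vocab`, which is purely hypothetical and only mimics a token-to-ID mapping. It is meant as a hint at how such a mapping can be inspected, not as the answer for XLM-R.

```python
# Hypothetical miniature vocabulary: token -> ID (not the real XLM-R vocabulary).
toy_vocab = {"<s>": 0, "<pad>": 1, "▁de": 3, "s": 2}

# The container type and the number of entries.
print(type(toy_vocab).__name__, len(toy_vocab))

# Order the tokens by ID so that list position and ID coincide.
tokens_by_id = sorted(toy_vocab, key=toy_vocab.get)
print(tokens_by_id[2])
```

Sorting by ID only matters if the dictionary is not already ordered by ID; whether that holds for the real tokenizer is worth checking rather than assuming.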






In [5]:
#Your code should go here
xlmr_tokenizer.get_vocab()

{'<s>': 0,
 '<pad>': 1,
 '</s>': 2,
 '<unk>': 3,
 ',': 4,
 '.': 5,
 '▁': 6,
 's': 7,
 '▁de': 8,
 '-': 9,
 '▁a': 10,
 'a': 11,
 ':': 12,
 'e': 13,
 'i': 14,
 '▁(': 15,
 ')': 16,
 '▁i': 17,
 't': 18,
 'n': 19,
 '▁-': 20,
 '▁la': 21,
 '▁en': 22,
 '▁in': 23,
 '▁na': 24,
 "'": 25,
 '’': 26,
 '...': 27,
 '▁e': 28,
 '▁на': 29,
 '。': 30,
 'o': 31,
 '?': 32,
 'en': 33,
 'u': 34,
 '▁и': 35,
 '▁o': 36,
 '、': 37,
 '!': 38,
 'm': 39,
 '▁se': 40,
 '▁que': 41,
 'r': 42,
 '的': 43,
 '▁"': 44,
 '▁di': 45,
 '▁–': 46,
 '▁to': 47,
 '▁da': 48,
 '▁в': 49,
 '،': 50,
 '▁un': 51,
 '▁“': 52,
 'y': 53,
 '▁do': 54,
 '▁je': 55,
 'er': 56,
 '▁sa': 57,
 '"': 58,
 'а': 59,
 '▁og': 60,
 '▁за': 61,
 '▁A': 62,
 '”': 63,
 '/': 64,
 '▁و': 65,
 'an': 66,
 'te': 67,
 '▁die': 68,
 '▁да': 69,
 '▁the': 70,
 'd': 71,
 '▁er': 72,
 'in': 73,
 ';': 74,
 '▁u': 75,
 'na': 76,
 '▁не': 77,
 '▁si': 78,
 '▁ja': 79,
 '▁za': 80,
 '▁v': 81,
 '▁et': 82,
 '▁is': 83,
 '▁у': 84,
 'da': 85,
 'ne': 86,
 '▁I': 87,
 '▁el': 88,
 'и': 89,
 'es': 90,


As a next step we want to store the entire vocabulary.

👋 ⚒ Run through the vocabulary and write each token on a separate line in a local file, i.e., store all line-separated tokens in a file.  
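
As a hint, the following minimal sketch writes a small, made-up token list to a file, one token per line. The names `toy_tokens` and `toy_vocab.txt` are assumptions for illustration, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

# Made-up tokens for illustration, including non-ASCII characters.
toy_tokens = ["<s>", "▁de", "的", "▁на"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocab.txt")

    # An explicit UTF-8 encoding keeps non-ASCII tokens intact.
    with open(path, "w", encoding="utf-8") as f:
        for token in toy_tokens:
            f.write(token + "\n")

    # Read the file back to confirm that there is one token per line.
    with open(path, encoding="utf-8") as f:
        read_back = f.read().splitlines()

print(read_back == toy_tokens)
```

Setting the encoding explicitly is a deliberate choice here, since a multilingual vocabulary contains many characters outside ASCII and the platform default encoding may differ.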

In [None]:
#Your code to write tokens separated by line to a file should go here

## **SentencePiece vs. WordPiece Tokenization**

We explored WordPiece tokenization before when investigating the vocabulary of BERT. In contrast to BERT, XLM-R uses SentencePiece tokenization. Here we will explore the difference between the two. Both split the input into subword tokens.

When looking at the tokenization of the word "philosphy", the output of the two tokenizers is:  

```
WordPiece:      phil   ##os  ##phy
SentencePiece:  ▁phil  os    phy
```

In WordPiece, every subword except the *first* one in a word is prefixed with two hash characters (`##`).

Instead of splitting words on spaces, SentencePiece treats the input as a raw stream and represents each space explicitly with the character "▁" (U+2581). Decoding with SentencePiece is therefore very easy: all tokens are concatenated and every "▁" is replaced by a space.
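
The decoding rule just described can be sketched in plain Python without loading any model. The helper `decode_sentencepiece` is a hypothetical name, and the second token list is hand-written for illustration rather than produced by a tokenizer.

```python
SPACE_MARKER = "\u2581"  # the "▁" character that stands for a space

def decode_sentencepiece(tokens):
    """Concatenate the tokens and turn every space marker into a space."""
    text = "".join(tokens).replace(SPACE_MARKER, " ")
    # The marker in front of the first word yields a leading space; drop it.
    return text.lstrip(" ")

print(decode_sentencepiece(["▁phil", "os", "phy"]))
print(decode_sentencepiece(["▁a", "▁phil", "os", "phy"]))
```

Because the spaces are part of the tokens themselves, no language-specific rules are needed to put the text back together.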

For a more detailed description of the two tokenizers, please see the [HuggingFace Summary of Tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece).

To explore the difference with some examples, we need to load the BERT-base model as well.


In [None]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

👋 ⚒ Compare the two tokenizers `bert_tokenizer` and `xlmr_tokenizer` by tokenizing the following `example_sentence` and printing the results.

In [None]:
example_sentence = "I don't even need a GPU for this!"

#Your code should go here


👋 ⚒ What difference can you observe between the way these two tokenizers tokenize the example sentence?

Next, we will try to improve the output format of the tokenizer comparison.

👋 ⚒ Write a function that allows us to directly compare the two outputs by assigning an ID to each element in the lists and printing the tokens of both tokenizers per index position, i.e., ID. You can do this by creating a print format yourself or by using other libraries, e.g., pandas DataFrames.
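
One difficulty is that the two token lists usually differ in length. As a hint, the sketch below pairs two lists by index with `itertools.zip_longest` and pads the shorter one. The helper name `side_by_side`, the fill value, and both token lists are assumptions made for illustration; the lists are hand-written and not real tokenizer output.

```python
from itertools import zip_longest

def side_by_side(left_tokens, right_tokens, fill="-"):
    """Return (index, left, right) rows; the shorter list is padded with `fill`."""
    pairs = zip_longest(left_tokens, right_tokens, fillvalue=fill)
    return [(i, left, right) for i, (left, right) in enumerate(pairs)]

# Hand-written token lists for illustration only.
left = ["▁phil", "os", "phy", "▁!"]
right = ["phil", "##os", "##phy"]

# Print one row per index position with left-aligned columns.
for index, l_tok, r_tok in side_by_side(left, right):
    print(f"{index:<4}{l_tok:<10}{r_tok:<10}")
```

Returning the rows instead of printing them directly keeps the pairing logic separate from the formatting, so the same rows could also be handed to a table library.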

In [32]:
def compare_tokenizer_outputs(xlmr_tokens, bert_tokens):
    pass  # Your code should go here

## **Single Character Tokens**

Single-character tokens in the vocabulary are not preceded by the "▁" marker.

👋 ⚒ Write a function that finds all the single-character tokens in the XLM-R vocabulary and then prints them with 30 characters per line.
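
The two steps, filtering by length and printing in fixed-size rows, can be sketched on a toy list. The helper names `single_char_tokens` and `rows_of` and the token list are hypothetical, and the row size is set to 3 here only to keep the output short.

```python
def single_char_tokens(tokens):
    """Keep the tokens that consist of exactly one character."""
    return [token for token in tokens if len(token) == 1]

def rows_of(items, per_line=30):
    """Split `items` into consecutive rows of at most `per_line` elements."""
    return [items[i:i + per_line] for i in range(0, len(items), per_line)]

# Made-up tokens for illustration only.
toy_tokens = ["a", "▁de", "s", ",", "er", "的", "▁"]

singles = single_char_tokens(toy_tokens)
for row in rows_of(singles, per_line=3):
    print(" ".join(row))
```

Note that `len` counts Unicode code points, so the marker "▁" on its own also counts as a single-character token in this sketch; whether it should be included is a design decision.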


In [None]:
# Your code goes here

## **Explore Token Length**

In this section, we will explore the distribution of token length.

👋 ⚒ Create a list called `token_lengths` that contains the length of each token. The list is then passed to a pandas DataFrame to visualize the length distribution with a seaborn count plot.
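
To see what such a list of lengths looks like before plotting, the following sketch computes lengths for a few made-up tokens and counts how often each length occurs. The names `toy_tokens`, `toy_lengths`, and `length_counts` are assumptions for illustration.

```python
from collections import Counter

# Made-up tokens for illustration only.
toy_tokens = ["▁de", "s", "▁the", "er", "的", "▁que"]

# One length per token; the "▁" marker is counted as a character here.
toy_lengths = [len(token) for token in toy_tokens]

# Count how many tokens share each length, a text version of a count plot.
length_counts = Counter(toy_lengths)
for length in sorted(length_counts):
    print(length, length_counts[length])
```

Whether the "▁" marker should contribute to a token's length is a choice to make consciously, since it shifts the distribution by one for all word-initial tokens.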

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

sns.set(style='darkgrid')

# Set the plot and font size
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (10,5)

# Create your list of token_lengths here

df = pd.DataFrame(token_lengths, columns=["Token Length"])

# Plot the number of tokens of each length
sns.countplot(data=df, x="Token Length")
plt.title('Vocab Token Lengths')
plt.xlabel('Token Length')
plt.ylabel('# of Tokens')

print('Maximum token length:', max(token_lengths), '\n\n')

plt.show()

👋 ⚒ Write a few lines of code or a function that outputs all tokens of the XLM-R model that have a length of exactly 16.

In [None]:
# Your code here

## **English Words**
In this section, we will use WordNet to identify all the tokens that represent English words.

In [1]:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

👋 ⚒ Write a function that runs through all the tokens and checks if they are in WordNet or not. If they are in WordNet, append them to a list that you return as a result of your function. Then output the length of the list of English words, the percentage they make up compared to the total number of tokens, and print the first ten words in the list.
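
The general pattern, filter by a membership test and then report count, share, and a sample, can be sketched without downloading anything. In the sketch below a small set, `toy_lexicon`, stands in for WordNet, and `collect_known_words` is a hypothetical helper name. With NLTK, a comparable membership test would be `lambda w: bool(wordnet.synsets(w))`, since `wordnet.synsets` returns an empty list for unknown words.

```python
def collect_known_words(tokens, is_known):
    """Return the tokens for which the lookup function `is_known` is true."""
    return [token for token in tokens if is_known(token)]

# Stand-in lexicon and made-up tokens for illustration only.
toy_lexicon = {"dog", "house", "run"}
toy_tokens = ["dog", "▁house", "##x", "run", "zzq"]

known = collect_known_words(toy_tokens, lambda token: token in toy_lexicon)

# Share of known words relative to all tokens, in percent.
share = 100 * len(known) / len(toy_tokens)
print(len(known), f"{share:.1f}%", known[:10])
```

In this toy run "▁house" is not matched because of its space marker, which suggests thinking about whether tokens should be normalized before the lookup.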

In [None]:
# Your code here