# Content:
1. Check SweBERT Model Accessibility
2. Simple Model Application (Masked Token Prediction)

#### Note: Make sure to run this notebook in a virtual environment with the libraries listed in requirements.txt installed

In [1]:
import torch
import tensorflow as tf
from transformers import BertTokenizer, BertModel, TFBertModel, BertForMaskedLM 
from tokenizers import BertWordPieceTokenizer

import warnings; warnings.filterwarnings('ignore')

To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html


# 0. Choose SweBERT model

We have to choose one of the pretrained SweBERT models:

In [2]:
pretrained_model_name = 'af-ai-center/bert-base-swedish-uncased'
# pretrained_model_name = 'af-ai-center/bert-large-swedish-uncased'

# 1. Check SweBERT Model Accessibility

First, we are going to check that the chosen pretrained SweBERT model is accessible through the transformers library.
If it is, we should be able to instantiate a tokenizer and a (PyTorch/TensorFlow) model from it. 

Note that this may take a while the first time you run it as the model needs to be downloaded. 

### a. Tokenizer

In [3]:
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)

### b. Model PyTorch

In [4]:
model = BertModel.from_pretrained(pretrained_model_name)

### c. Model TensorFlow

In [5]:
model = TFBertModel.from_pretrained(pretrained_model_name)
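
The three cells above succeed silently when the model is accessible and raise an exception otherwise. As a minimal sketch, the check can be wrapped in a small helper that reports the outcome instead of stopping the notebook. The helper name `is_loadable` and the stand-in loaders are hypothetical and only serve to keep the example free of downloads; in the notebook one would pass `BertTokenizer.from_pretrained`, `BertModel.from_pretrained` or `TFBertModel.from_pretrained` as the loader.

```python
def is_loadable(loader, model_name):
    """Return True if loader(model_name) succeeds, False if it raises."""
    try:
        loader(model_name)
    except Exception as error:  # broad on purpose: any failure means "not accessible"
        print(f"could not load {model_name!r}: {error}")
        return False
    return True


# Stand-in loaders (hypothetical) so that the sketch runs without network access.
def fake_loader_ok(model_name):
    return object()


def fake_loader_missing(model_name):
    raise OSError("model not found")


name = "af-ai-center/bert-base-swedish-uncased"
accessible = is_loadable(fake_loader_ok, name)
inaccessible = is_loadable(fake_loader_missing, name)
print(accessible, inaccessible)
```

In the notebook, the call would look like `is_loadable(BertTokenizer.from_pretrained, pretrained_model_name)`.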

# 2. Simple Model Application (Masked Token Prediction)

We are now going to apply the (PyTorch) SweBERT model to an example sentence, loosely following https://huggingface.co/transformers/quickstart.html#quick-tour-usage

We will
1. Tokenize the example using BertTokenizer
2. Tokenize the example using BertWordPieceTokenizer
3. Mask one of the tokens
4. Use SweBERT to predict back the masked token

In [6]:
example = 'Jag är ett barn, och det här är mitt hem. Alltså är det ett barnhem!'
example

'Jag är ett barn, och det här är mitt hem. Alltså är det ett barnhem!'

### 1. Tokenize the example using BertTokenizer

The pretrained SweBERT models are uncased. 

In principle, we could account for this by instantiating the BertTokenizer (https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with the parameter `do_lower_case=True`.
However, with this setting the BertTokenizer also strips accents, so the Swedish letters `å, ä & ö` are not preserved (they get replaced by `a & o`).

To avoid this problem, we manually lowercase all text before tokenization instead.
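
The effect can be reproduced with the standard library alone. The following is a sketch of the mechanism, assuming the tokenizer's accent stripping behaves like a Unicode NFD decomposition followed by the removal of combining marks; the function name `strip_accents` is chosen here for illustration. It shows that accent stripping destroys `å, ä & ö`, whereas plain `str.lower()` keeps them.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


example = "Jag är ett barn, och det här är mitt hem. Alltså är det ett barnhem!"

# Manual lowercasing keeps the Swedish letters intact.
example_uncased = example.lower()
assert "ä" in example_uncased and "å" in example_uncased

# Accent stripping maps å/ä to a and ö to o, which changes the words.
example_stripped = strip_accents(example_uncased)
print(example_stripped)
```

After stripping, `är` and `alltså` become `ar` and `alltsa`, which are different strings from the ones the tokenizer would otherwise look up in the vocabulary.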

In [7]:
bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)

#### a. lowercase 

In [8]:
example_uncased = example.lower()
example_uncased

'jag är ett barn, och det här är mitt hem. alltså är det ett barnhem!'

#### b. special tokens 

BERT models expect the input to start with the special token '[CLS]' and end with the special token '[SEP]':

In [9]:
example_preprocessed = f'[CLS] {example_uncased} [SEP]'
example_preprocessed

'[CLS] jag är ett barn, och det här är mitt hem. alltså är det ett barnhem! [SEP]'

#### c. Tokenize the preprocessed example

In [10]:
tokens = bert_tokenizer.tokenize(example_preprocessed)

print(f'{len(tokens)} tokens:')
print(tokens)

21 tokens:
['[CLS]', 'jag', 'är', 'ett', 'barn', ',', 'och', 'det', 'här', 'är', 'mitt', 'hem', '.', 'alltså', 'är', 'det', 'ett', 'barn', '##hem', '!', '[SEP]']
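
Note how `barnhem` is split into `barn` and `##hem`, where the `##` prefix marks a piece that continues a word. The following sketch illustrates the greedy longest-match-first idea behind WordPiece tokenization for a single word. It is a simplified illustration with a hypothetical toy vocabulary, not the library's implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical); the real SweBERT vocabulary is much larger.
toy_vocab = {"barn", "hem", "##hem", "mitt"}

pieces_barnhem = wordpiece("barnhem", toy_vocab)
pieces_hem = wordpiece("hem", toy_vocab)
pieces_unknown = wordpiece("xyz", toy_vocab)
print(pieces_barnhem, pieces_hem, pieces_unknown)
```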


#### d. Convert the tokens to token ids

In [11]:
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokens)
print(indexed_tokens)

[101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352, 1345, 1012, 1492, 1100, 1102, 1115, 1255, 2760, 999, 102]
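
Conceptually, this conversion is a dictionary lookup: a BERT vocabulary file lists one token per line, and the line index is the token id. The sketch below uses a hypothetical toy vocabulary, so its ids differ from the SweBERT ids shown above; the helper names `load_vocab` and `tokens_to_ids` are chosen for illustration.

```python
import io


def load_vocab(lines):
    """Map each token to its line index (one token per line)."""
    return {line.rstrip("\n"): index for index, line in enumerate(lines)}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token, falling back to the id of the unknown token."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# Toy vocabulary file (hypothetical), held in memory instead of on disk.
toy_vocab_file = io.StringIO("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nbarn\n##hem\n")
vocab = load_vocab(toy_vocab_file)

toy_ids = tokens_to_ids(["[CLS]", "barn", "##hem", "xyz", "[SEP]"], vocab)
print(toy_ids)
```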


### 2. Tokenize the example using BertWordPieceTokenizer

An alternative is to use the BertWordPieceTokenizer from the tokenizers library (https://github.com/huggingface/tokenizers).
It handles the special Swedish letters properly if the parameters `lowercase=True` & `strip_accents=False` are used.
It also lowercases the text and adds the special tokens itself, so the raw example can be passed to it directly.

In [12]:
bert_word_piece_tokenizer = BertWordPieceTokenizer("vocab_swebert.txt", lowercase=True, strip_accents=False)

#### c. Tokenize the example & d. Convert the tokens to token ids

In [13]:
output = bert_word_piece_tokenizer.encode(example)  # attributes: output.ids, output.tokens, output.offsets

In [14]:
tokens_2 = output.tokens

print(f'{len(tokens_2)} tokens:')
print(tokens_2)

21 tokens:
['[CLS]', 'jag', 'är', 'ett', 'barn', ',', 'och', 'det', 'här', 'är', 'mitt', 'hem', '.', 'alltså', 'är', 'det', 'ett', 'barn', '##hem', '!', '[SEP]']


In [15]:
indexed_tokens_2 = output.ids
print(indexed_tokens_2)

[101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352, 1345, 1012, 1492, 1100, 1102, 1115, 1255, 2760, 999, 102]


In [16]:
# check that BertTokenizer & BertWordPieceTokenizer lead to the same results
assert tokens == tokens_2
assert indexed_tokens == indexed_tokens_2

### 3. Mask one of the tokens

In [17]:
masked_index = 17  # 'barn'

In [18]:
tokens[masked_index] = '[MASK]'
print(tokens)

['[CLS]', 'jag', 'är', 'ett', 'barn', ',', 'och', 'det', 'här', 'är', 'mitt', 'hem', '.', 'alltså', 'är', 'det', 'ett', '[MASK]', '##hem', '!', '[SEP]']


In [19]:
# Mask token with BertTokenizer
indexed_tokens[masked_index] = bert_tokenizer.convert_tokens_to_ids('[MASK]')
print(indexed_tokens)

[101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352, 1345, 1012, 1492, 1100, 1102, 1115, 103, 2760, 999, 102]


In [20]:
# Mask token with BertWordPieceTokenizer
indexed_tokens[masked_index] = bert_word_piece_tokenizer.token_to_id('[MASK]')
print(indexed_tokens)

[101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352, 1345, 1012, 1492, 1100, 1102, 1115, 103, 2760, 999, 102]
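
The cells above overwrite `tokens` and `indexed_tokens` in place, so the unmasked versions are lost afterwards. As a small sketch, masking can also be written as a function that returns a copy. The helper name `mask_token` is hypothetical; the token ids and the `[MASK]` id 103 are taken from the outputs above.

```python
def mask_token(token_ids, index, mask_id):
    """Return a copy of token_ids with the entry at index replaced by mask_id."""
    masked = list(token_ids)
    masked[index] = mask_id
    return masked


# Token ids of the example sentence before masking (from the output of In [11]).
original_ids = [101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352,
                1345, 1012, 1492, 1100, 1102, 1115, 1255, 2760, 999, 102]

masked_ids = mask_token(original_ids, index=17, mask_id=103)
print(masked_ids)
print(original_ids[17])  # the original list is left untouched
```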


### 4. Use SweBERT to predict back the masked token

In [21]:
# convert indexed_tokens to torch tensor
indexed_tokens_tensor = torch.tensor([indexed_tokens])
print(indexed_tokens_tensor)

tensor([[ 101, 1112, 1100, 1115, 1255, 1010, 1095, 1102, 1174, 1100, 1352, 1345,
         1012, 1492, 1100, 1102, 1115,  103, 2760,  999,  102]])


In [22]:
# instantiate model
model = BertForMaskedLM.from_pretrained(pretrained_model_name)
_ = model.eval()

In [23]:
# predict all tokens
with torch.no_grad():
    outputs = model(indexed_tokens_tensor)

predictions = outputs[0]
print(predictions.shape)  # 1 example, 21 tokens, 30522 vocab size

torch.Size([1, 21, 30522])


In [24]:
# show prediction for masked token index
predicted_index = torch.argmax(predictions[0, masked_index])
print(predicted_index)

tensor(1255)


In [25]:
# show prediction for masked token
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

barn


In [26]:
assert predicted_token == 'barn'

#### Appendix

Instead of looking only at the top model prediction, we can also consider the top 5 predictions.

In [27]:
# show top5 predictions for masked token index
predicted_index_top5 = torch.argsort(predictions[0, masked_index], descending=True)[:5]
predicted_index_top5

tensor([ 1255,  8032, 14829,  1251,  1264])

In [28]:
# show top5 predictions for masked token
for predicted_index in predicted_index_top5:
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    print(predicted_token)

barn
foster
barndoms
dock
dag
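
The values in `predictions` are unnormalized scores, so they only rank the candidates. To see how confident the model is, the scores can be turned into probabilities with a softmax. The sketch below does this in plain Python on hypothetical toy scores rather than on the real model output; the helper names `softmax` and `top_k` are chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(scores, k):
    """Return (index, probability) pairs for the k highest scores."""
    probabilities = softmax(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, probabilities[i]) for i in ranked[:k]]


# Toy scores (hypothetical) over a vocabulary of five tokens.
toy_scores = [0.5, 3.0, -1.0, 2.0, 1.0]

top3 = top_k(toy_scores, 3)
for index, probability in top3:
    print(index, round(probability, 3))
```

With the real model output, `torch.softmax(predictions[0, masked_index], dim=-1)` would play the role of `softmax` here.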


# Conclusions

- We have checked the accessibility of the SweBERT models through the transformers library. 
- We have demonstrated a very simple model application, where the SweBERT model successfully predicts a masked token.

For additional use cases and information, we refer to the documentation of the transformers library. 