## BertTagger

BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a method for pre-training language representations. 
[Devlin et al. (2019)](https://arxiv.org/pdf/1810.04805.pdf) introduced the BERT model, and [Tanvir et al. (2021)](https://aclanthology.org/2021.nodalida-main.2) proposed EstBERT, an Estonian-specific model.

We can use pre-trained BERT models to get contextual embeddings for sentences. 

*Note: you need to install the [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural) package to use this functionality.*

`estnltk_neural` provides BertTagger for computing contextual embeddings for tokens and words. The model required by the tagger needs to be downloaded separately. There are two ways to download the model:

* If you create a new instance of BertTagger and the model has not been downloaded yet, you will be asked for permission to download it;
* Alternatively, you can download the model beforehand with the `download` function:

In [1]:
from estnltk import download
download('berttagger')

Resource 'estbert_from_tartunlp_hf_2022-03-10' has already been downloaded.


True

Now, let's use the model with BertTagger:

In [2]:
from estnltk import Text
from estnltk_neural.taggers import BertTagger



In [3]:
bert_tagger = BertTagger()

Some weights of BertModel were not initialized from the model checkpoint at C:\Programmid\Miniconda3\envs\py39_devel\lib\site-packages\estnltk-1.7.2-py3.9-win-amd64.egg\estnltk\estnltk_resources\estbert\model_hf_tartunlp_2022-03-10\ and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Providing the model manually.** Alternatively, we can download the EstBERT model manually from [Hugging Face](https://huggingface.co/tartuNLP/EstBERT#). 
Once the model is downloaded, the download folder should contain the following files: <br>
<ul>
 <li>  bert_config.json </li>
 <li>  config.json </li>
 <li>  pytorch_model.bin </li>
 <li>  special_tokens_map.json </li>
 <li>  tokenizer_config.json </li>
 <li>  model.ckpt.data-00000-of-00001 </li>
 <li>  model.ckpt.index </li>
 <li>  model.ckpt.meta </li>
 <li>  vocab.txt </li>
</ul>

Then we can use the `bert_location` parameter to specify the location of the model directory:

```python
from estnltk_neural.taggers import BertTagger
# Specify the location of the model manually
bert_tagger = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert')
```
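
Before passing a folder to `bert_location`, it can help to check that the download is complete. The following is a minimal sketch, not part of EstNLTK: `missing_model_files` and `EXPECTED_FILES` are hypothetical names, and the file list is simply copied from the list above (whether BertTagger strictly needs every one of these files is an assumption).

```python
from pathlib import Path

# File names copied from the list above. Whether every one of them is
# strictly required by BertTagger is an assumption of this sketch.
EXPECTED_FILES = [
    "bert_config.json",
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.txt",
]


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir.

    Hypothetical helper for illustration; it is not an EstNLTK function.
    """
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
```

An empty list returned by `missing_model_files(your_model_dir)` means that all listed files were found in the folder.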

BertTagger expects the `sentences` layer to be present, so we need to tag it first: 

In [4]:
text = Text("Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale.")
text.tag_layer('sentences')

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,15
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,15


In [5]:
bert_tagger.tag(text)

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,15
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,15
bert_embeddings,"token, bert_embedding",,,False,20


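BertTagger created a new layer called `bert_embeddings`. By default, it annotates the subword tokens produced by BertTokenizer rather than full words: for example, _muusikamaailmalt_ is split into `muusika`, `##maailma` and `##lt`, where the `##` prefix marks a continuation of the previous token.  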

In [6]:
text.bert_embeddings

layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings,"token, bert_embedding",,,False,20

text,token,bert_embedding
Aga,aga,"[0.3278198838233948, -0.16953659057617188, 0.20027776062488556, -0.0736533999443 ..., type: <class 'list'>, length: 3072"
mulle,mulle,"[-0.5129477381706238, -0.1304159015417099, -0.514543354511261, -0.15106612443923 ..., type: <class 'list'>, length: 3072"
tundub,tundub,"[-0.7856529951095581, 0.8352845907211304, 0.172186478972435, -0.1075660660862922 ..., type: <class 'list'>, length: 3072"
",",",","[0.10949549078941345, 0.3002486824989319, 0.36521056294441223, 0.043865583837032 ..., type: <class 'list'>, length: 3072"
et,et,"[0.5369061827659607, 0.1513473242521286, 0.24274933338165283, -0.299760907888412 ..., type: <class 'list'>, length: 3072"
kogu,kogu,"[-1.4076721668243408, 0.30222129821777344, -0.2031559944152832, -0.4452161788940 ..., type: <class 'list'>, length: 3072"
maailm,maailm,"[-0.25198614597320557, 1.330830693244934, -0.27605557441711426, 0.09248819202184 ..., type: <class 'list'>, length: 3072"
ootab,ootab,"[0.9260630011558533, 0.3899831771850586, 0.29357632994651794, -0.313877910375595 ..., type: <class 'list'>, length: 3072"
muusika,muusika,"[-0.3184152841567993, 4.209578037261963e-06, -0.7241317629814148, -0.46743878722 ..., type: <class 'list'>, length: 3072"
maailma,##maailma,"[-0.12080671638250351, 0.8332154154777527, -0.24726587533950806, -0.157874032855 ..., type: <class 'list'>, length: 3072"


### BERT layers

There are two versions of BERT models:

   * The BASE model - number of transformer blocks: 12, Hidden layer size: 768
   * The LARGE model - number of transformer blocks: 24, Hidden layer size: 1024

Usually, the base model is good enough, and BertTagger's [default model](https://huggingface.co/tartuNLP/EstBERT#) is also a base model.

The BERT base model has 12 transformer encoder layers and the BERT large model has 24; the output of any of these layers can be used as embeddings. 
[The creators of BERT](https://arxiv.org/pdf/1810.04805.pdf) reported that, among the feature-based approaches they compared, concatenating the last 4 layers gave the best results. 
With a base model, this gives an embedding vector of size 4 × 768 = 3072 for each token.
Concatenating the last 4 layers is also the default setting of BertTagger. 

If we want to use a different selection of layers, we need to initialize BertTagger with the `bert_layers` parameter:
```python
bert_tagger = BertTagger(bert_layers=[-3,-2,-1])
```
This gives us the last three layers from the model.
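
To make the effect of the layer selection on the embedding size concrete, here is a minimal sketch in plain Python. It does not call BertTagger; `concatenate_layers` and the toy `layer_outputs` are hypothetical stand-ins for the per-layer vectors of a single token in a base model.

```python
HIDDEN_SIZE = 768   # hidden layer size of a BERT base model
NUM_LAYERS = 12     # number of transformer blocks in a BERT base model

# Toy per-layer outputs for ONE token: layer i is filled with the value i,
# so it is easy to see which layers end up in the result.
layer_outputs = [[float(i)] * HIDDEN_SIZE for i in range(1, NUM_LAYERS + 1)]


def concatenate_layers(layer_outputs, layer_indices):
    """Concatenate the vectors of the selected layers into one flat list."""
    embedding = []
    for index in layer_indices:
        embedding.extend(layer_outputs[index])
    return embedding


last_four = concatenate_layers(layer_outputs, [-4, -3, -2, -1])
last_three = concatenate_layers(layer_outputs, [-3, -2, -1])

print(len(last_four))   # 4 * 768 = 3072
print(len(last_three))  # 3 * 768 = 2304
```

Negative indices count from the end of the list, so `[-3, -2, -1]` picks the last three layers in their original order.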

Let's try this out with BertTagger. 
We also have to change the name of the output layer, because a `bert_embeddings` layer is already attached to our Text object.

In [7]:
bert_tagger_3 = BertTagger(bert_layers=[-3, -2, -1], output_layer='bert_embeddings_3')



In [8]:
bert_tagger_3.tag(text)

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,15
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,15
bert_embeddings,"token, bert_embedding",,,False,20
bert_embeddings_3,"token, bert_embedding",,,False,20


In [9]:
text.bert_embeddings_3

layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings_3,"token, bert_embedding",,,False,20

text,token,bert_embedding
Aga,aga,"[0.2625231444835663, -0.1883794665336609, 0.08827835321426392, -0.03921275958418 ..., type: <class 'list'>, length: 2304"
mulle,mulle,"[-0.43733400106430054, -0.4143901765346527, -0.4378586411476135, -0.100203357636 ..., type: <class 'list'>, length: 2304"
tundub,tundub,"[-0.87529057264328, 0.5557077527046204, 0.14822815358638763, 0.1134534478187561, ..., type: <class 'list'>, length: 2304"
",",",","[-0.1359112411737442, 0.15923729538917542, 0.11842511594295502, 0.01949885487556 ..., type: <class 'list'>, length: 2304"
et,et,"[0.43496760725975037, 0.21906685829162598, 0.1792183518409729, -0.36516264081001 ..., type: <class 'list'>, length: 2304"
kogu,kogu,"[-1.450974464416504, -0.028160540387034416, 0.0961918756365776, -0.0531213544309 ..., type: <class 'list'>, length: 2304"
maailm,maailm,"[-0.3651334047317505, 1.2177222967147827, 0.12997516989707947, 0.108812123537063 ..., type: <class 'list'>, length: 2304"
ootab,ootab,"[0.7885776162147522, 0.16598431766033173, 0.38945138454437256, -0.33931100368499 ..., type: <class 'list'>, length: 2304"
muusika,muusika,"[-0.22488510608673096, -0.09191486239433289, -0.2897561192512512, -0.06728636473 ..., type: <class 'list'>, length: 2304"
maailma,##maailma,"[-0.1504383236169815, 0.45006293058395386, -0.1899731457233429, 0.04855503886938 ..., type: <class 'list'>, length: 2304"


We can also sum the embeddings of the selected layers instead of concatenating them. To do that, we set `method='add'` when creating BertTagger. 

In [10]:
bert_tagger_sum = BertTagger(method='add', bert_layers=[-3, -2, -1], output_layer='bert_embeddings_add_3')
bert_tagger_sum.tag(text)
text.bert_embeddings_add_3



layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings_add_3,"token, bert_embedding",,,False,20

text,token,bert_embedding
Aga,aga,"[0.855210542678833, -1.0155982971191406, -0.13039588928222656, 0.319050461053848 ..., type: <class 'list'>, length: 768"
mulle,mulle,"[-1.4259059429168701, -1.4321746826171875, -1.7620819807052612, 0.29196101427078 ..., type: <class 'list'>, length: 768"
tundub,tundub,"[-2.7143688201904297, 1.2798110246658325, -0.9171760082244873, 0.478315949440002 ..., type: <class 'list'>, length: 768"
",",",","[-0.05343534052371979, 0.8557734489440918, -0.42028823494911194, 2.9934716224670 ..., type: <class 'list'>, length: 768"
et,et,"[1.4572105407714844, 0.07563519477844238, 0.15774992108345032, -0.81109833717346 ..., type: <class 'list'>, length: 768"
kogu,kogu,"[-4.441354751586914, -0.604876697063446, -0.805764377117157, 0.13433724641799927 ..., type: <class 'list'>, length: 768"
maailm,maailm,"[-1.747345209121704, 3.923570156097412, -0.10644489526748657, 1.6448873281478882 ..., type: <class 'list'>, length: 768"
ootab,ootab,"[3.1839985847473145, 0.22699567675590515, -0.03589129447937012, -0.9233179092407 ..., type: <class 'list'>, length: 768"
muusika,muusika,"[-0.6032030582427979, -0.10156792402267456, -0.49635764956474304, 0.666948795318 ..., type: <class 'list'>, length: 768"
maailma,##maailma,"[-0.9806268215179443, 1.0891016721725464, -0.23461979627609253, 1.01726484298706 ..., type: <class 'list'>, length: 768"


We can see that adding the embeddings results in vectors of size 768, whereas the concatenated ones are larger: 

In [13]:
# size of summed up embeddings
len(text.bert_embeddings_add_3[0].bert_embedding)

768

In [12]:
# size of concatenated embeddings
len(text.bert_embeddings[0].bert_embedding)

3072
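
The size difference follows directly from how the layer vectors are combined. Below is a minimal sketch in plain Python with toy vectors of size 4 instead of 768; `add_layers` and `concat_vectors` are hypothetical helper names, not BertTagger functions.

```python
def add_layers(layer_vectors):
    """Sum equally sized vectors element-wise; the size stays the same."""
    return [sum(values) for values in zip(*layer_vectors)]


def concat_vectors(layer_vectors):
    """Join the vectors end to end; the sizes add up."""
    return [value for vector in layer_vectors for value in vector]


# Three toy "layer" vectors of size 4, standing in for vectors of size 768
layers = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
    [-1.0, 0.0, 1.0, 2.0],
]

summed = add_layers(layers)
concatenated = concat_vectors(layers)

print(summed)             # [0.5, 2.5, 4.5, 6.5]
print(len(summed))        # 4
print(len(concatenated))  # 12
```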

We can also return the embeddings of all chosen layers separately by setting `method='all'`. The output for each token is then a list containing 3 embeddings, one per layer: 

In [14]:
bert_tagger_all = BertTagger(method='all', bert_layers=[-3, -2, -1], output_layer='bert_embeddings_all_3')
bert_tagger_all.tag(text)
text.bert_embeddings_all_3



layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings_all_3,"token, bert_embedding",,,False,20

text,token,bert_embedding
Aga,aga,"[[0.2625231444835663, -0.1883794665336609, 0.08827835321426392, -0.0392127595841 ..., type: <class 'list'>, length: 3"
mulle,mulle,"[[-0.43733400106430054, -0.4143901765346527, -0.4378586411476135, -0.10020335763 ..., type: <class 'list'>, length: 3"
tundub,tundub,"[[-0.87529057264328, 0.5557077527046204, 0.14822815358638763, 0.1134534478187561 ..., type: <class 'list'>, length: 3"
",",",","[[-0.1359112411737442, 0.15923729538917542, 0.11842511594295502, 0.0194988548755 ..., type: <class 'list'>, length: 3"
et,et,"[[0.43496760725975037, 0.21906685829162598, 0.1792183518409729, -0.3651626408100 ..., type: <class 'list'>, length: 3"
kogu,kogu,"[[-1.450974464416504, -0.028160540387034416, 0.0961918756365776, -0.053121354430 ..., type: <class 'list'>, length: 3"
maailm,maailm,"[[-0.3651334047317505, 1.2177222967147827, 0.12997516989707947, 0.10881212353706 ..., type: <class 'list'>, length: 3"
ootab,ootab,"[[0.7885776162147522, 0.16598431766033173, 0.38945138454437256, -0.3393110036849 ..., type: <class 'list'>, length: 3"
muusika,muusika,"[[-0.22488510608673096, -0.09191486239433289, -0.2897561192512512, -0.0672863647 ..., type: <class 'list'>, length: 3"
maailma,##maailma,"[[-0.1504383236169815, 0.45006293058395386, -0.1899731457233429, 0.0485550388693 ..., type: <class 'list'>, length: 3"


### Word level embeddings

As BERT uses a tokenizer that splits words into subwords, we get embeddings for those subwords and not for whole words. If we want word-level embeddings, we can _add up all the subword embeddings_ that constitute a word. The word boundaries follow the EstNLTK default tokenization. 

To get word-level embeddings, we need to tell BertTagger that we do not want token-level embeddings by setting `token_level=False`. 

In [15]:
bert_tagger_word = BertTagger(bert_layers=[-3, -2, -1], output_layer='bert_embeddings_word', token_level=False)
bert_tagger_word.tag(text)
text.bert_embeddings_word



layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings_word,"token, bert_embedding",,,False,15

text,token,bert_embedding
Aga,['aga'],"[0.2625231444835663, -0.1883794665336609, 0.08827835321426392, -0.03921275958418 ..., type: <class 'list'>, length: 2304"
mulle,['mulle'],"[-0.43733400106430054, -0.4143901765346527, -0.4378586411476135, -0.100203357636 ..., type: <class 'list'>, length: 2304"
tundub,['tundub'],"[-0.87529057264328, 0.5557077527046204, 0.14822815358638763, 0.1134534478187561, ..., type: <class 'list'>, length: 2304"
",","[',']","[-0.1359112411737442, 0.15923729538917542, 0.11842511594295502, 0.01949885487556 ..., type: <class 'list'>, length: 2304"
et,['et'],"[0.43496760725975037, 0.21906685829162598, 0.1792183518409729, -0.36516264081001 ..., type: <class 'list'>, length: 2304"
kogu,['kogu'],"[-1.450974464416504, -0.028160540387034416, 0.0961918756365776, -0.0531213544309 ..., type: <class 'list'>, length: 2304"
maailm,['maailm'],"[-0.3651334047317505, 1.2177222967147827, 0.12997516989707947, 0.108812123537063 ..., type: <class 'list'>, length: 2304"
ootab,['ootab'],"[0.7885776162147522, 0.16598431766033173, 0.38945138454437256, -0.33931100368499 ..., type: <class 'list'>, length: 2304"
muusikamaailmalt,"['muusika', '##maailma', '##lt']","[-0.3695043921470642, 0.3853032886981964, -0.25838983058929443, -0.1448982357978 ..., type: <class 'list'>, length: 2304"
midagi,['midagi'],"[-0.2821528911590576, 0.5402743220329285, 0.20932498574256897, 0.196935504674911 ..., type: <class 'list'>, length: 2304"


The `token` attribute shows which BERT tokens were combined into each word. 
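
The idea of adding up subword embeddings can be sketched in plain Python. This is an illustration under simplifying assumptions, not BertTagger's implementation: the hypothetical `words_from_wordpieces` groups tokens by the `##` continuation prefix, whereas the tagger follows EstNLTK's word boundaries, and the vectors are toy 2-dimensional values.

```python
def words_from_wordpieces(tokens, embeddings):
    """Merge '##' continuation pieces into the preceding word and sum their vectors."""
    words = []
    for token, vector in zip(tokens, embeddings):
        if token.startswith("##") and words:
            pieces, total = words[-1]
            pieces.append(token)
            # Element-wise sum of the running total and the new piece
            words[-1] = (pieces, [a + b for a, b in zip(total, vector)])
        else:
            # A token without the prefix starts a new word
            words.append(([token], list(vector)))
    return words


tokens = ["ootab", "muusika", "##maailma", "##lt", "midagi"]
# Toy 2-dimensional vectors; real embeddings are much longer
embeddings = [[1.0, 0.0], [0.5, 1.0], [0.25, 2.0], [0.25, -1.0], [3.0, 3.0]]

word_embeddings = words_from_wordpieces(tokens, embeddings)
for pieces, vector in word_embeddings:
    print(pieces, vector)
# ['ootab'] [1.0, 0.0]
# ['muusika', '##maailma', '##lt'] [1.0, 2.0]
# ['midagi'] [3.0, 3.0]
```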

If word-level embeddings are used with the option `method='all'` (return all the embeddings), BertTagger creates an ambiguous layer in which a word corresponding to multiple BERT tokens has a separate embedding annotation for each of those tokens:

In [16]:
bert_tagger_word_all = BertTagger(method='all', bert_layers=[-3, -2, -1], 
                                  output_layer='bert_embeddings_word_all', token_level=False)
bert_tagger_word_all.tag(text)
text.bert_embeddings_word_all



layer name,attributes,parent,enveloping,ambiguous,span count
bert_embeddings_word_all,"token, bert_embedding",,,True,15

text,token,bert_embedding
Aga,aga,"[[0.2625231444835663, -0.1883794665336609, 0.08827835321426392, -0.0392127595841 ..., type: <class 'list'>, length: 3"
mulle,mulle,"[[-0.43733400106430054, -0.4143901765346527, -0.4378586411476135, -0.10020335763 ..., type: <class 'list'>, length: 3"
tundub,tundub,"[[-0.87529057264328, 0.5557077527046204, 0.14822815358638763, 0.1134534478187561 ..., type: <class 'list'>, length: 3"
",",",","[[-0.1359112411737442, 0.15923729538917542, 0.11842511594295502, 0.0194988548755 ..., type: <class 'list'>, length: 3"
et,et,"[[0.43496760725975037, 0.21906685829162598, 0.1792183518409729, -0.3651626408100 ..., type: <class 'list'>, length: 3"
kogu,kogu,"[[-1.450974464416504, -0.028160540387034416, 0.0961918756365776, -0.053121354430 ..., type: <class 'list'>, length: 3"
maailm,maailm,"[[-0.3651334047317505, 1.2177222967147827, 0.12997516989707947, 0.10881212353706 ..., type: <class 'list'>, length: 3"
ootab,ootab,"[[0.7885776162147522, 0.16598431766033173, 0.38945138454437256, -0.3393110036849 ..., type: <class 'list'>, length: 3"
muusikamaailmalt,muusika,"[[-0.22488510608673096, -0.09191486239433289, -0.2897561192512512, -0.0672863647 ..., type: <class 'list'>, length: 3"
,##maailma,"[[-0.1504383236169815, 0.45006293058395386, -0.1899731457233429, 0.0485550388693 ..., type: <class 'list'>, length: 3"


## RobertaTagger

`estnltk_neural` also contains RobertaTagger, which computes embeddings with the [Est-RoBERTa](https://huggingface.co/EMBEDDIA/est-roberta) model [(Ulčar & Robnik-Šikonja 2022)](https://link.springer.com/chapter/10.1007/978-3-031-16500-9_14).
The interface of RobertaTagger is analogous to that of BertTagger. 

The model required by RobertaTagger can be pre-downloaded via `snapshot_download`:

```python
from huggingface_hub import snapshot_download
snapshot_download('EMBEDDIA/est-roberta')
```

Alternatively, if you create a new instance of RobertaTagger and the model has not been downloaded yet, it will be downloaded automatically.

Usage example:
```python
from estnltk import Text
from estnltk_neural.taggers import RobertaTagger
roberta_tagger = RobertaTagger()
# Create input text and add required layers
text = Text("Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale.")
text.tag_layer(['words', 'sentences'])
# Tag embeddings
roberta_tagger.tag(text)
# Browse results
text.roberta_embeddings
```