## BertTagger

BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a method for pre-training language representations. We can use pre-trained BERT models to get contextual embeddings for sentences. 

### Installation 

Before we can use BertTagger, we need to install the libraries that the tagger depends on. The installation instructions can be found here: 

1. [transformers](https://github.com/huggingface/transformers) <br>
2. [PyTorch](https://pytorch.org/) <br>
3. [tensorflow]()

We also need a pre-trained BERT model. For Estonian, we can use the EstBERT model, which can be downloaded [here](https://huggingface.co/tartuNLP/EstBERT#).

Once the model is downloaded, we should have the following files in the downloaded folder: <br>
<ul>
 <li>  bert_config.json </li>
 <li>  config.json </li>
 <li>  pytorch_model.bin </li>
 <li>  special_tokens_map.json </li>
 <li>  tokenizer_config.json </li>
 <li>  model.ckpt.data-00000-of-00001 </li>
 <li>  model.ckpt.index </li>
 <li>  model.ckpt.meta </li>
 <li>  vocab.txt </li>
</ul>
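
Before pointing BertTagger at the folder, it can help to confirm that the download is complete. The following is a minimal sketch that uses only the Python standard library. The helper name `missing_model_files` is hypothetical, and treating every file in the list above as required is an assumption; BertTagger itself may need only some of them.

```python
from pathlib import Path
import tempfile

# File names copied from the list above. Treating all of them as required
# is an assumption made for this sketch.
EXPECTED_FILES = [
    "bert_config.json",
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.txt",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


# Demonstration on a temporary directory that holds only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "vocab.txt"):
        (Path(tmp) / name).touch()
    missing = missing_model_files(tmp)

print(missing)
```

In practice we would call `missing_model_files` with the path of the downloaded model folder and expect an empty list.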


Now we can import BertTagger. We need to specify the location of the directory that contains the model. 

In [1]:
from estnltk import Text
from estnltk.taggers.embeddings.bert.bert_tagger import BertTagger

In [2]:
bert_tagger = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert')

BertTagger expects the sentences layer to be present, so we first need to add segmentation analysis: 

In [3]:
text = Text("Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale.")
text.analyse('segmentation')

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | | | sentences | False | 1 |
| sentences | | | words | False | 1 |
| words | normalized_form | | | True | 15 |


In [4]:
bert_tagger.tag(text)

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | | | sentences | False | 1 |
| sentences | | | words | False | 1 |
| words | normalized_form | | | True | 15 |
| bert_embeddings | token, bert_embedding | | | True | 31 |


BertTagger created a new layer called bert_embeddings. By default, it annotates the subword tokens produced by BertTokenizer rather than full words.  

In [5]:
text.bert_embeddings

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| bert_embeddings | token, bert_embedding | | | True | 31 |

| text | token | bert_embedding |
|---|---|---|
| Aga | aga | `[-0.56507105 -0.50884926 0.23410653 ... 1.8962842 0.85748005 0.0204182 ]` |
| mu | mu | `[-0.01081562 0.20636141 -0.15503098 ... 1.0888627 0.43207562 -0.1550504 ]` |
| lle | ##lle | `[ 0.6407423 -0.08208487 -0.02856553 ... 1.5318996 0.6025036 -0.13246572]` |
| tun | tun | `[ 0.36771452 0.01013857 -0.40088403 ... 0.94661087 0.7602785 -0.6301877 ]` |
| dub | ##dub | `[ 0.18848763 0.9540447 -0.05482637 ... 1.533448 0.6384699 -0.4229727 ]` |
| , | , | `[0.12251471 0.2885881 0.50010693 ... 1.7256243 0.4238516 0.06740805]` |
| et | et | `[-0.2964522 0.4062226 0.7311516 ... 1.8742213 0.7969719 0.17700657]` |
| kogu | kogu | `[-0.06668212 -0.18681507 0.8217494 ... 0.6215325 0.6152663 0.12196311]` |
| maa | maa | `[ 0.7036054 -0.6845788 0.37784645 ... 0.25743842 0.6815776 -0.07652788]` |
| ilm | ##ilm | `[ 1.165508 -0.24385548 0.12746628 ... 0.75258493 0.831807 -0.2675484 ]` |
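
The `##` prefix in the token column marks a piece that continues the previous one. BERT tokenizers produce such pieces with WordPiece, a greedy longest-match-first lookup in the vocabulary. The sketch below illustrates the idea in plain Python. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and chosen only to reproduce the splits shown above; the real vocabulary lives in vocab.txt and the real tokenizer also handles lowercasing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical), just large enough for the examples.
toy_vocab = {"aga", "mu", "##lle", "tun", "##dub", "et", "kogu", "maa", "##ilm"}

for word in ["aga", "mulle", "tundub", "maailm"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, `mulle` is split into `mu` and `##lle`, which matches the rows of the table.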


### BERT layers

There are two versions of BERT models: <br>
<ul>
    <li> The BASE model - number of transformer blocks: 12, Hidden layer size: 768</li>
    <li> The LARGE model - number of transformer blocks: 24, Hidden layer size: 1024</li>
</ul>
Usually, the base model is good enough. <br>
The BERT base model has 12 transformer encoder layers and the BERT large model has 24, and the output of each of these layers can be used as an embedding. According to the creators of BERT, it is best to concatenate the last 4 layers and use the result as the embedding. This is also the default setting of BertTagger, so no additional configuration is needed. 
<br>
If we still want to change the layers, we need to pass them as a list like this: <br>
bert_layers=[-3,-2,-1] <br>
Here we select the last three layers of the model. <br>
Let's try it out. 
We have to change the output layer as well, because a bert_embeddings layer is already attached to our text object.

In [6]:
bert_tagger_3 = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert', bert_layers=[-3, -2, -1], output_layer='bert_embeddings_3')

In [7]:
bert_tagger_3.tag(text)

text
"Aga mulle tundub, et kogu maailm ootab muusikamaailmalt midagi erutavalt uut minimalismi kõrvale."

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | | | sentences | False | 1 |
| sentences | | | words | False | 1 |
| words | normalized_form | | | True | 15 |
| bert_embeddings | token, bert_embedding | | | True | 31 |
| bert_embeddings_3 | token, bert_embedding | | | True | 31 |


In [8]:
text.bert_embeddings_3

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| bert_embeddings_3 | token, bert_embedding | | | True | 31 |

| text | token | bert_embedding |
|---|---|---|
| Aga | aga | `[-0.84229594 -0.72933775 0.4524976 ... 1.8962842 0.85748005 0.0204182 ]` |
| mu | mu | `[-0.49052215 -0.48088354 0.15140066 ... 1.0888627 0.43207562 -0.1550504 ]` |
| lle | ##lle | `[ 0.3192304 -0.7496845 0.25810474 ... 1.5318996 0.6025036 -0.13246572]` |
| tun | tun | `[ 0.28671557 -0.06923172 0.1254334 ... 0.94661087 0.7602785 -0.6301877 ]` |
| dub | ##dub | `[ 0.11578339 0.6409123 0.40151653 ... 1.533448 0.6384699 -0.4229727 ]` |
| , | , | `[-0.21661654 0.318155 0.8258475 ... 1.7256243 0.4238516 0.06740805]` |
| et | et | `[-0.50128955 0.51604974 1.1569327 ... 1.8742213 0.7969719 0.17700657]` |
| kogu | kogu | `[ 0.11237316 -0.36568075 1.2336712 ... 0.6215325 0.6152663 0.12196311]` |
| maa | maa | `[ 0.21957126 -0.7018786 0.4476072 ... 0.25743842 0.6815776 -0.07652788]` |
| ilm | ##ilm | `[ 0.46605277 -0.31966358 0.6227435 ... 0.75258493 0.831807 -0.2675484 ]` |
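
To make the negative indices in bert_layers concrete, here is a small sketch of how they pick layers from the end of the stack. It assumes that the model exposes 13 outputs per token for a BASE model (the embedding output followed by the 12 encoder layers), which is how Hugging Face transformers returns hidden states; whether BertTagger indexes its layers in exactly this way is an assumption. The labels in `layer_outputs` stand in for the real vectors.

```python
# Assumption: 13 outputs per token, the embedding output plus 12 encoder layers.
# Short labels replace the real 768-dimensional vectors to keep the output readable.
layer_outputs = ["embeddings"] + [f"encoder_{i}" for i in range(1, 13)]

bert_layers = [-3, -2, -1]

# Negative indices count from the end, exactly as with ordinary Python lists.
selected = [layer_outputs[i] for i in bert_layers]
print(selected)

# Concatenating the selected layers multiplies the vector length accordingly.
hidden_size = 768
concatenated_size = len(bert_layers) * hidden_size
print(concatenated_size)
```

Under these assumptions, the three selected layers are the last three encoder layers, and their concatenation has 3 * 768 = 2304 values per token.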


We can also sum the embeddings of the chosen layers instead of concatenating them. To do that, we set method='add' when creating the BertTagger. 

In [9]:
bert_tagger_sum = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert', method='add', bert_layers=[-3, -2, -1], output_layer='bert_embeddings_add_3')
bert_tagger_sum.tag(text)
text.bert_embeddings_add_3

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| bert_embeddings_add_3 | token, bert_embedding | | | True | 31 |

| text | token | bert_embedding |
|---|---|---|
| Aga | aga | `[-2.15782952e+00 -2.16185927e+00 2.20844221e+00 3.18383276e-01 1.34746146e+0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| mu | mu | `[-1.31272113e+00 -1.83470631e+00 1.67795300e+00 1.86497152e+00 -4.26746905e-0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| lle | ##lle | `[ 3.01882565e-01 -2.48939705e+00 2.55384779e+00 1.47929657e+00 1.43865240e+0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| tun | tun | `[ 7.5338298e-01 -3.1799060e-01 1.0457667e+00 2.3616266e+00 2.8496487e+00 1. ..., type: <class 'numpy.ndarray'>, length: 768` |
| dub | ##dub | `[ 6.63526833e-01 1.39409125e+00 2.15573502e+00 1.80749440e+00 1.83501768e+0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| , | , | `[-1.48790240e+00 7.73387909e-01 2.26397276e+00 2.56439257e+00 2.67729044e+0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| et | et | `[-2.0268567e+00 1.0270715e+00 4.1144314e+00 1.2934051e+00 7.9481196e-01 -9. ..., type: <class 'numpy.ndarray'>, length: 768` |
| kogu | kogu | `[ 5.46933115e-02 -1.44821286e+00 4.35730362e+00 1.00094569e+00 -1.37542737e+0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| maa | maa | `[ 1.70376554e-01 -2.03198910e+00 2.58545780e+00 1.75018990e+00 5.32155454e-0 ..., type: <class 'numpy.ndarray'>, length: 768` |
| ilm | ##ilm | `[ 7.1914423e-01 -1.2868505e+00 2.8574309e+00 1.1519217e+00 2.9510667e+00 7. ..., type: <class 'numpy.ndarray'>, length: 768` |


We can see that adding the embeddings results in vectors of shape (768,), whereas the default concatenation of the last 4 layers results in vectors of shape (3072,): 

In [10]:
text.bert_embeddings[0].bert_embedding[0].shape, text.bert_embeddings_add_3[0].bert_embedding[0].shape

((3072,), (768,))

We can also return the embeddings of all the chosen layers separately by setting method='all'. The output for each token is then a list containing 3 embeddings: 

In [11]:
bert_tagger_all = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert', method='all', bert_layers=[-3, -2, -1], output_layer='bert_embeddings_all_3')
bert_tagger_all.tag(text)
text.bert_embeddings_all_3

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| bert_embeddings_all_3 | token, bert_embedding | | | True | 31 |

| text | token | bert_embedding |
|---|---|---|
| Aga | aga | `[array([-8.42295945e-01, -7.29337752e-01, 4.52497602e-01, 4.88240778e-01, ..., type: <class 'list'>, length: 3` |
| mu | mu | `[array([-4.90522146e-01, -4.80883539e-01, 1.51400656e-01, 1.05536187e+00, ..., type: <class 'list'>, length: 3` |
| lle | ##lle | `[array([ 3.19230407e-01, -7.49684513e-01, 2.58104742e-01, 1.06968212e+00, ..., type: <class 'list'>, length: 3` |
| tun | tun | `[array([ 2.86715567e-01, -6.92317188e-02, 1.25433400e-01, 1.21642137e+00, ..., type: <class 'list'>, length: 3` |
| dub | ##dub | `[array([ 1.15783393e-01, 6.40912294e-01, 4.01516527e-01, 1.21943367e+00, ..., type: <class 'list'>, length: 3` |
| , | , | `[array([-2.16616541e-01, 3.18154991e-01, 8.25847507e-01, 1.22031486e+00, ..., type: <class 'list'>, length: 3` |
| et | et | `[array([-5.01289546e-01, 5.16049743e-01, 1.15693271e+00, 8.90543103e-01, ..., type: <class 'list'>, length: 3` |
| kogu | kogu | `[array([ 1.12373158e-01, -3.65680754e-01, 1.23367119e+00, 7.54094660e-01, ..., type: <class 'list'>, length: 3` |
| maa | maa | `[array([ 2.19571263e-01, -7.01878607e-01, 4.47607189e-01, 7.75421262e-01, ..., type: <class 'list'>, length: 3` |
| ilm | ##ilm | `[array([ 4.66052771e-01, -3.19663584e-01, 6.22743487e-01, 5.81399322e-01, ..., type: <class 'list'>, length: 3` |
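
The three ways of combining layers can be compared side by side on toy data. The sketch below is plain Python and does not call BertTagger. The function `combine_layers` is hypothetical; the names 'add' and 'all' come from the examples above, while 'concatenate' as the name of the default method is an assumption. The toy vectors have 4 values instead of 768.

```python
def combine_layers(layer_vectors, method="concatenate"):
    """Combine the per-layer vectors of one token in one of three ways."""
    if method == "concatenate":
        # Join the vectors end to end: length = number of layers * hidden size.
        return [value for vector in layer_vectors for value in vector]
    if method == "add":
        # Element-wise sum: the length stays equal to the hidden size.
        return [sum(values) for values in zip(*layer_vectors)]
    if method == "all":
        # Keep every layer separately: a list with one vector per layer.
        return [list(vector) for vector in layer_vectors]
    raise ValueError(f"unknown method: {method}")


# Three toy "layers" for a single token.
layers = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]

print(len(combine_layers(layers, "concatenate")))  # 12
print(combine_layers(layers, "add"))               # [111.0, 222.0, 333.0, 444.0]
print(len(combine_layers(layers, "all")))          # 3
```

The same pattern explains the shapes seen earlier: summing keeps the hidden size of 768, concatenating 4 layers gives 3072 values, and 'all' returns as many vectors as there are chosen layers.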


### Word level embeddings

Because BERT uses a tokenizer that splits words into subwords, we get embeddings for those subwords and not for whole words. If we want word-level embeddings, we can sum the embeddings of all the subwords that make up a word. Word boundaries follow the EstNLTK default tokenizer. To get word-level embeddings, we need to tell BertTagger that we do not want token-level embeddings, so we set token_level=False. 

In [12]:
bert_tagger_word = BertTagger(bert_location='C:/Users/kittask/Documents/multilingual_bert',  bert_layers=[-3, -2, -1], output_layer='bert_embeddings_word', token_level=False)
bert_tagger_word.tag(text)
text.bert_embeddings_word

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| bert_embeddings_word | token, bert_embedding | | | True | 15 |

| text | token | bert_embedding |
|---|---|---|
| Aga | `['aga']` | `[-0.84229594 -0.72933775 0.4524976 ... 1.8962842 0.85748005 0.0204182 ]` |
| mulle | `['mu', '##lle']` | `[-0.17129174 -1.230568 0.4095054 ... 2.6207623 1.0345793 -0.28751612]` |
| tundub | `['tun', '##dub']` | `[ 0.40249896 0.57168055 0.52694994 ... 2.480059 1.3987484 -1.0531604 ]` |
| , | `[',']` | `[-0.21661654 0.318155 0.8258475 ... 1.7256243 0.4238516 0.06740805]` |
| et | `['et']` | `[-0.50128955 0.51604974 1.1569327 ... 1.8742213 0.7969719 0.17700657]` |
| kogu | `['kogu']` | `[ 0.11237316 -0.36568075 1.2336712 ... 0.6215325 0.6152663 0.12196311]` |
| maailm | `['maa', '##ilm']` | `[ 0.685624 -1.0215422 1.0703506 ... 1.0100234 1.5133846 -0.34407628]` |
| ootab | `['o', '##ota', '##b']` | `[-1.9043984 0.98316085 1.9329679 ... 2.5574958 0.95062137 -1.3202546 ]` |
| muusikamaailmalt | `['muu', '##sika', '##maa', '##ilm', '##alt']` | `[-2.899208 -1.2364726 5.422554 ... 3.2912736 4.559246 -2.062479 ]` |
| midagi | `['mida', '##gi']` | `[-0.5580274 -0.13899168 -0.2758302 ... 2.8380585 1.452673 -0.55280805]` |


The token attribute shows which subword tokens were combined into each word. 
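
The summing of subword embeddings into word embeddings can be sketched as follows. This is a simplified illustration, not BertTagger's implementation: the hypothetical helper `merge_subwords` groups pieces by the `##` continuation prefix, whereas BertTagger follows the EstNLTK word tokenization. The two-dimensional vectors are made-up toy values.

```python
def merge_subwords(tokens, vectors):
    """Sum subword vectors into word vectors, using the '##' continuation prefix."""
    words = []  # each entry is (word, pieces, summed_vector)
    for token, vector in zip(tokens, vectors):
        if token.startswith("##") and words:
            # Continuation piece: extend the previous word and add its vector.
            word, pieces, total = words[-1]
            words[-1] = (
                word + token[2:],
                pieces + [token],
                [a + b for a, b in zip(total, vector)],
            )
        else:
            # A piece without the prefix starts a new word.
            words.append((token, [token], list(vector)))
    return words


tokens = ["aga", "mu", "##lle", "tun", "##dub"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.25, 1.0], [2.0, 2.0], [1.0, -1.0]]

merged = merge_subwords(tokens, vectors)
for word, pieces, vector in merged:
    print(word, pieces, vector)
```

Each resulting entry keeps the list of pieces next to the summed vector, which mirrors the token attribute of the word-level layer.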