## BertAugmentor

BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations.

BERT is pre-trained on two tasks:  <br>
1. Masked Language Modeling (MLM)  
2. Next-Sentence Prediction (NSP)
    
We can use BERT's ability to fill in a masked word from its context to create new, augmented sentences. <br>
For example, in the following sentence the word `book` has been replaced with the `[MASK]` token: <br>
"She borrowed the [MASK] from him many years ago and hasn't yet returned it." <br>
BERT (in this example, [bert-base-cased](https://huggingface.co/bert-base-cased?text=She+borrowed+the+%5BMASK%5D+from+him+many+years+ago+and+hasn%27t+yet+returned+it.)) proposes the following words in place of the `[MASK]` token: <br>
`necklace`<br>
`ring` <br>
`painting` <br>
`key` <br>
`book`<br>

*Note: you need to install the [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural) package to use this component.*

For augmenting sentences, we can use `BertAugmentor`.

In [1]:
from estnltk import Text
from estnltk_neural.tools.bert.augmentation.bert_sentence_augmentor import BertAugmentor

By default, BertAugmentor uses the pre-trained [EstBERT](https://huggingface.co/tartuNLP/EstBERT_512) model for the augmentations. To use another model, specify `model_name` when you initialize the augmentor. It can be either the path to a directory that contains the model or the name of a model that is available in the Hugging Face [transformers](https://huggingface.co/models) library.

In [2]:
augmentor = BertAugmentor()

# example with different model name and sentences layer
# augmentor = BertAugmentor(model_name='bert-base-multilingual-cased', sentences='sents')

To augment a sentence, we first need to segment the text into sentences. The sentences layer to use can be changed by passing its name as `sentences_layer`; by default it is `sentences`.

In [3]:
text = Text("Aas näeb selle taga üldist moraaliküsimust.")
text.tag_layer('paragraphs')

text
Aas näeb selle taga üldist moraaliküsimust.

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| paragraphs | | | sentences | False | 1 |
| sentences | | | words | False | 1 |
| words | normalized_form | | | True | 7 |


There are two ways of augmenting a sentence:
1. Mask several words in the sentence at once; the augmentor fills in all of the masked tokens together.
2. Mask one word at a time; the augmentor fills in a single position per augmented sentence.

Both are done with the `augment` method of `BertAugmentor`. Let's go over both possibilities.

### 1. Many masks in a sentence

First, we specify the `mask` argument, a list of ones and zeros with one entry per word. A one means that the word at that index should be augmented; a zero means that the word at that index should be left unchanged. In this example, we want to change the words `Aas`, `näeb` and `üldist`, so all three will be replaced by BertAugmentor. `how_many` specifies how many augmented sentences are returned. `method` specifies how the masks are filled in; in this case we need `method='many'`.

In [4]:
mask = [1 if word.text in {'Aas','üldist', 'näeb'} else 0 for word in text.words]
print(mask)
augmentations = augmentor.augment(text, mask=mask, how_many=4, method='many')
augmentations

[1, 1, 0, 0, 1, 0, 0]


['sageli on selle taga olevat moraaliküsimust .',
 'küllap võib selle taga ka moraaliküsimust .',
 'enamasti näeb selle taga olnud moraaliküsimust .',
 'ilmselt pole selle taga otsida moraaliküsimust .']

The `augment` method returns a list of augmented sentences. The list contains `how_many` sentences, in our case four. In the example, we wanted to replace the words `Aas`, `näeb` and `üldist`. In the first returned sentence, `Aas` is replaced with `sageli`, `näeb` with `on`, and `üldist` with `olevat`. The other sentences are augmented the same way, but with different words.
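
To make the role of the mask list concrete, here is a minimal plain-Python sketch of how a 0/1 mask lines up with the words of the sentence. The helper `apply_mask` and the literal `[MASK]` placeholder are illustrative assumptions and not part of the BertAugmentor API; the library's internal implementation may differ.

```python
# Illustrative sketch only: `apply_mask` is a hypothetical helper,
# not a function provided by estnltk or estnltk_neural.
from typing import List


def apply_mask(words: List[str], mask: List[int], mask_token: str = "[MASK]") -> str:
    """Replace every word whose mask entry is 1 with `mask_token`."""
    if len(words) != len(mask):
        raise ValueError("mask must have exactly one entry per word")
    return " ".join(
        mask_token if flag == 1 else word for word, flag in zip(words, mask)
    )


# The words of the example sentence, written out by hand.
words = ["Aas", "näeb", "selle", "taga", "üldist", "moraaliküsimust", "."]
targets = {"Aas", "näeb", "üldist"}

# Same construction as in the tutorial cell, but over plain strings.
mask = [1 if word in targets else 0 for word in words]

masked_sentence = apply_mask(words, mask)
print(mask)
print(masked_sentence)
```

With `method='many'`, all marked positions are filled in together, which is why each returned sentence can differ from the original in up to three words.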

### 2. One mask in a sentence

In this case, we also need to specify the `mask` list. This time, each one in the mask marks a word that is replaced on its own, while the rest of the sentence stays unchanged. For every such word, `how_many` augmented sentences are produced.

In [5]:
mask = [1 if word.text in {'Aas','üldist', 'näeb'} else 0 for word in text.words]
print(mask)
augmentations = augmentor.augment(text, mask=mask, how_many=4, method='one')
augmentations

[1, 1, 0, 0, 1, 0, 0]


[['küllap näeb selle taga üldist moraaliküsimust .',
  'ta näeb selle taga üldist moraaliküsimust .',
  'eks näeb selle taga üldist moraaliküsimust .',
  'ometi näeb selle taga üldist moraaliküsimust .'],
 ['Aas näeb selle taga üldist moraaliküsimust .',
  'Aas nägi selle taga üldist moraaliküsimust .',
  'Aas otsib selle taga üldist moraaliküsimust .',
  'Aas tunneb selle taga üldist moraaliküsimust .'],
 ['Aas näeb selle taga oma moraaliküsimust .',
  'Aas näeb selle taga olevat moraaliküsimust .',
  'Aas näeb selle taga seisvat moraaliküsimust .',
  'Aas näeb selle taga ka moraaliküsimust .']]

As we can see, the output is a list of lists. The number of sublists equals the number of ones in the mask list, and each sublist contains `how_many` sentences in which one word has been augmented. First, the word `Aas` was replaced with `küllap`, `ta`, `eks` and `ometi`. Second, the word `näeb` was replaced with `näeb`, `nägi`, `otsib` and `tunneb`; note that the first suggestion is the original word itself. Third, the word `üldist` was replaced with `oma`, `olevat`, `seisvat` and `ka`.
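
Because the sublists follow the order of the ones in the mask, they can be paired with the words they replace. The sketch below does this in plain Python, using the output printed above as literal data. The helper names are hypothetical, and the assumption that sublists are ordered by word position is read off this example rather than taken from the library documentation.

```python
# Illustrative sketch only: these helpers are hypothetical and operate on
# the nested list printed above, not on a live BertAugmentor instance.
from typing import Dict, List

words = ["Aas", "näeb", "selle", "taga", "üldist", "moraaliküsimust", "."]
mask = [1, 1, 0, 0, 1, 0, 0]
augmentations = [
    ["küllap näeb selle taga üldist moraaliküsimust .",
     "ta näeb selle taga üldist moraaliküsimust .",
     "eks näeb selle taga üldist moraaliküsimust .",
     "ometi näeb selle taga üldist moraaliküsimust ."],
    ["Aas näeb selle taga üldist moraaliküsimust .",
     "Aas nägi selle taga üldist moraaliküsimust .",
     "Aas otsib selle taga üldist moraaliküsimust .",
     "Aas tunneb selle taga üldist moraaliküsimust ."],
    ["Aas näeb selle taga oma moraaliküsimust .",
     "Aas näeb selle taga olevat moraaliküsimust .",
     "Aas näeb selle taga seisvat moraaliküsimust .",
     "Aas näeb selle taga ka moraaliküsimust ."],
]


def group_by_masked_word(
    words: List[str], mask: List[int], augmentations: List[List[str]]
) -> Dict[str, List[str]]:
    """Pair each masked word with its sublist (assumes masked words are distinct)."""
    masked_words = [word for word, flag in zip(words, mask) if flag == 1]
    if len(masked_words) != len(augmentations):
        raise ValueError("expected one sublist per 1 in the mask")
    return dict(zip(masked_words, augmentations))


def drop_unchanged(sentences: List[str], original: str) -> List[str]:
    """Remove candidates that are identical to the original sentence."""
    return [sentence for sentence in sentences if sentence != original]


original = " ".join(words)
by_word = group_by_masked_word(words, mask, augmentations)
new_sentences = {
    word: drop_unchanged(sentences, original) for word, sentences in by_word.items()
}

for word, sentences in new_sentences.items():
    print(word, len(sentences))
```

Dropping unchanged candidates is a design choice for this sketch: if the goal is new training data, a sentence identical to the input adds nothing.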

## Specifying how many augmentations will be outputted

BERT makes its predictions from its vocabulary, which contains subwords, single characters, full words, numbers, etc. This means that BERT can predict a subword rather than a full word for a masked position. BertAugmentor will filter some of those subwords out.

For example, when we have a sentence: 

`Aas siiski leidis , et defineerimine on formaalsus .`

BertTokenizer will tokenize this sentence like this: 

`['aas', 'siiski', 'leidis', ',', 'et', 'defineeri', '##mine', 'on', 'formaal', '##sus', '.']`

As we can see, the words `defineerimine` and `formaalsus` have been split into `defineeri + mine` and `formaal + sus`. When BERT outputs tokens that start with `##`, BertAugmentor filters them out. However, BertAugmentor cannot filter out subwords that are not full words but lack the `##` prefix, for example `formaal`.
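
The filtering described here can be sketched in plain Python. This illustrates the idea and is not BertAugmentor's actual code: the function names and the candidate list are hypothetical, and only the token list is taken from the example above.

```python
# Illustrative sketch only: hypothetical helpers, not part of estnltk_neural.
from typing import List


def is_continuation_piece(token: str) -> bool:
    """WordPiece marks tokens that continue a word with a leading '##'."""
    return token.startswith("##")


def filter_candidates(candidates: List[str]) -> List[str]:
    """Keep only candidates that are not '##' continuation pieces."""
    return [token for token in candidates if not is_continuation_piece(token)]


def merge_wordpieces(tokens: List[str]) -> List[str]:
    """Glue '##' pieces back onto the preceding token to recover full words."""
    merged: List[str] = []
    for token in tokens:
        if is_continuation_piece(token) and merged:
            merged[-1] += token[2:]  # strip the '##' prefix and append
        else:
            merged.append(token)
    return merged


# Token list from the example above.
tokens = ["aas", "siiski", "leidis", ",", "et",
          "defineeri", "##mine", "on", "formaal", "##sus", "."]

# Hypothetical predictions for a masked position.
candidates = ["formaalsus", "##sus", "formaal", "##mine"]

kept = filter_candidates(candidates)
merged = merge_wordpieces(tokens)
print(kept)
print(merged)
```

Note that `formaal` survives the filter: it carries no `##` prefix, so a prefix check alone cannot tell that it is only the beginning of a word.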