# "Distilling BERT Pre-Training"
> "A step-by-step walkthrough of how it works :)"

- toc: true
- branch: master
- author: Andre Barbosa
- badges: true
- hide_binder_badge: false
- hide_colab_badge: false
- comments: true
- categories: [masters, nlp, knowledge-distill]
- hide: true
- search_exclude: true

> Note: An **English** version of this post is available [here](https://abarbosa94.github.io/personal_blog/masters/nlp/knowledge-distill/2020/09/19/Distilling-BERT.html)

# A quick review

I remember one day in 2016, when I was at the beginning of my career, I stumbled upon [Chris McCormick's blog about Word2Vec](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/). Honestly, I think the [paper written by Tomas Mikolov](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) was one of the most interesting ideas I have come across in my journey as a data scientist {% fn 1 %} :)

{{ 'Fun Fact: the [LinkedIn profile of Mikolov](https://www.linkedin.com/in/tomas-mikolov-59831188/?originalSubdomain=cz) shows that he has worked at Microsoft, Google and Facebook; another W2V author, [Ilya Sutskever](http://www.cs.toronto.edu/~ilya/), has had the opportunity to work with some of the greatest researchers in modern AI, such as [Geoffrey Hinton](https://www.cs.toronto.edu/~hinton/) and [Andrew Ng](https://www.andrewng.org/). Moreover, he is one of the founders of [Open AI](https://openai.com/)! ' | fndetail: 1 }}

## What are Word Embeddings


According to the [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) documentation, an **Embedding** can be defined as follows:

   >A simple lookup table that stores embeddings of a fixed _dictionary_ and size.

We can interpret embeddings as a way of converting _indices_ into _vectors_ of a given size. Hence, **word embeddings** can be understood as words that are converted to integers, and **these** numbers serve as indices into the rows of a matrix that represents the vector space.

I wrote some code using [manim](https://github.com/3b1b/manim) that illustrates this:

![](images/media/videos/scene/720p30/EmbeddingExample.gif "In this example, the embedding dimension is NxM, where N is the vocabulary size (8) and M is 4.")
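
As a rough sketch of the same lookup idea in code (plain NumPy rather than `torch.nn.Embedding`; the toy vocabulary, the random values and the names `vocab`, `embedding_matrix` and `embed` are made up for illustration), a word is mapped to an integer index and that index selects a row of the matrix:

```python
import numpy as np

# Hypothetical toy vocabulary with N = 8 entries (made up for this sketch)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_index = {word: index for index, word in enumerate(vocab)}

# N x M embedding matrix with M = 4; random values stand in for trained weights
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocab), 4))


def embed(words):
    """Convert words to integer indices, then use them to pick matrix rows."""
    indices = [word_to_index[word] for word in words]
    return embedding_matrix[indices]


sentence_vectors = embed(["the", "lazy", "dog"])
print(sentence_vectors.shape)  # one 4-dimensional vector per word
```

Note that the lookup depends only on the index: the same word always gets the same row, no matter which sentence it appears in.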

We can interpret each dimension as a single neuron of a hidden layer, and then **the values of these embeddings can be updated** by a neural network. That is basically the idea behind algorithms such as [Word2Vec](https://patents.google.com/patent/US9037464B1/en) and [fastText](https://fasttext.cc/) {% fn 2 %}

Some libraries already provide pre-trained vectors. For example, consider the [spaCy code](https://spacy.io/models) below:

{{ 'I will not cover Word2Vec in this blog post. If you are not familiar with it, [check here](http://jalammar.github.io/illustrated-word2vec/); [here](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [here](https://www.youtube.com/watch?v=ASn7ExxLZws). Unfortunately, all of these links are in English. If you would like me to write a post about Word2Vec in Portuguese, send me a message on my [linkedin](https://www.linkedin.com/in/barbosaandre/) :)' | fndetail: 2 }}

In [1]:
#collapse-hide
import spacy
nlp = spacy.load("en_core_web_md")

print("Consider the sentence 'The quick brown fox jumps over the lazy dog!!'")
text = nlp("The quick brown fox jumps over the lazy dog!!")
for word in text:
    print(
        f"'{word.text}' vector representation with size {word.vector.shape[0]}. Its first 5 elements are: {word.vector[:5].round(2)}"
    )

Consider the sentence 'The quick brown fox jumps over the lazy dog!!'
'The' vector representation with size 300. Its first 5 elements are: [ 0.27 -0.06 -0.19  0.02 -0.02]
'quick' vector representation with size 300. Its first 5 elements are: [-0.45  0.19 -0.25  0.47  0.16]
'brown' vector representation with size 300. Its first 5 elements are: [-0.37 -0.08  0.11  0.19  0.03]
'fox' vector representation with size 300. Its first 5 elements are: [-0.35 -0.08  0.18 -0.09 -0.45]
'jumps' vector representation with size 300. Its first 5 elements are: [-0.33  0.22 -0.35 -0.26  0.41]
'over' vector representation with size 300. Its first 5 elements are: [-0.3   0.01  0.04  0.1   0.12]
'the' vector representation with size 300. Its first 5 elements are: [ 0.27 -0.06 -0.19  0.02 -0.02]
'lazy' vector representation with size 300. Its first 5 elements are: [-0.35 -0.3  -0.18 -0.32 -0.39]
'dog' vector representation with size 30

These word representations were trained on [Common Crawl data using the GloVe algorithm](https://github.com/explosion/spacy-models/releases/tag/en_core_web_md-3.0.0). Unlike the example used at the beginning of this post, the token '!' also gets a vector representation. Another interesting point is that GloVe probably went through a preprocessing step, since the words '_The_' and '_the_' have the same vector representation.

In [2]:
#collapse-hide
print(f"First 5 values of the word 'The': {nlp('The').vector[:5].round(2)}")
print(f"First 5 values of the word 'the': {nlp('the').vector[:5].round(2)}")

First 5 values of the word 'The': [ 0.27 -0.06 -0.19  0.02 -0.02]
First 5 values of the word 'the': [ 0.27 -0.06 -0.19  0.02 -0.02]


To build sentences, we can combine word embeddings in different ways. According to the [spaCy documentation](https://spacy.io/usage/vectors-similarity#_title):

> Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an **average** of their token vectors.


Hence, the sentence we are using as an example has the following vector representation:

In [3]:
#hide_input
print(f"First 5 values of 'The quick brown fox jumps over the lazy dog!!': {text.vector[:5].round(2)}")

First 5 values of 'The quick brown fox jumps over the lazy dog!!': [-0.23  0.08 -0.03 -0.07 -0.02]


## Limitations of Word Embeddings

Even though Word Embeddings bring many benefits to the realm of computational linguistics, they have some limitations. One of them comes from a linguistic phenomenon called _polysemy_. According to [wikipedia](https://en.wikipedia.org/wiki/Polysemy#:~:text=English%20has%20many%20polysemous%20words,a%20subset%20of%20the%20other.):
> A polyseme is a word or phrase with different, but related senses.(...) English has many polysemous words. For example, the verb "to get" can mean "procure" (I'll get the drinks), "become" (she got scared), "understand" (I get it) etc.

So, considering the example above, even though the verb has **different meanings** depending on the context, **its word representation would always be the same**:

In [4]:
#hide_input
print(f"First 5 values of verb 'to get' vector: {nlp('to get').vector[:5].round(2)}")

First 5 values of verb 'to get' vector: [ 0.03  0.12 -0.32  0.13  0.12]


Then, if we pick two phrases, `He will get scared` and `She will get the drinks`, we will get the following vectors:

In [5]:
text1 = nlp("He will get scared")
text2 = nlp("She will get the drinks")

print(f"First 5 values of the '{text1}' vector: {text1.vector[:5].round(2)}")
print(f"First 5 values of the '{text2}' vector: {text2.vector[:5].round(2)}")

First 5 values of the 'He will get scared' vector: [-0.12  0.19 -0.21 -0.14  0.09]
First 5 values of the 'She will get the drinks' vector: [ 0.01  0.13 -0.04 -0.08  0.03]


Next, we compute the cosine similarity between the two sentence vectors (each one being the average of its word vectors):

In [6]:
# collapse-hide
from sklearn.metrics.pairwise import cosine_similarity

print(
    f"Similarity between:\n '{text1}' and '{text2}': "
    f"{cosine_similarity(text1.vector.reshape(1, -1),text2.vector.reshape(1, -1))[0][0]}"
)

Similarity between:
 'He will get scared' and 'She will get the drinks': 0.8653444051742554


This indicates that both vectors are very similar. However, the reason for that is the use of _similar_ words, even though they are applied in different contexts! This is the problem that BERT tries to solve.{% fn 3 %}
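
To make that computation concrete, here is a minimal sketch of what happens under the hood, assuming made-up 3-dimensional static vectors (the `toy_vectors` table and its values are invented for illustration; they are not spaCy's). Each sentence is the average of its word vectors, and the cosine similarity is the dot product divided by the product of the norms:

```python
import numpy as np

# Made-up static vectors: each word has ONE vector, whatever its context
toy_vectors = {
    "he": np.array([0.9, 0.1, 0.0]),
    "she": np.array([0.8, 0.2, 0.1]),
    "will": np.array([0.1, 0.9, 0.2]),
    "get": np.array([0.2, 0.8, 0.3]),
    "scared": np.array([0.0, 0.1, 0.9]),
    "the": np.array([0.3, 0.3, 0.3]),
    "drinks": np.array([0.7, 0.0, 0.6]),
}


def sentence_vector(sentence):
    """Average the static vectors of the words in the sentence."""
    return np.mean([toy_vectors[word] for word in sentence.lower().split()], axis=0)


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


vector_a = sentence_vector("He will get scared")
vector_b = sentence_vector("She will get the drinks")
similarity = cosine(vector_a, vector_b)
print(round(similarity, 2))
```

The word `get` contributes exactly the same vector to both sentences, so shared words push the similarity up regardless of what they mean in each sentence.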



{{ 'There are some BERT precursors such as [ELMo](https://allennlp.org/elmo); [ULMFit](https://arxiv.org/abs/1801.06146) and [Open AI Transformer](https://openai.com/blog/language-unsupervised/) that I am not going to cover here. Please refer to the [Illustrated BERT blog](http://jalammar.github.io/illustrated-bert/) to know more' | fndetail: 3 }}



# BERT Model

## Attention is all you need

The [Attention is all you need](https://arxiv.org/abs/1706.03762) paper introduced the Transformer architecture :) In essence, it can be summarized by the picture below:

![](images/transformer.png "The Transformer model architecture, taken from: https://arxiv.org/abs/1706.03762")

Strictly speaking, the motivation behind the paper is that _RNN_-like architectures process a sequence one step at a time, which makes them hard to parallelize and memory-expensive on long sequences. The idea behind Transformer models is that you can achieve similar results with more efficient use of compute by applying **just attention mechanisms** (and excluding CNN or RNN-like architectures)!{% fn 4 %} Although the Transformer model was proposed to deal with translation problems, it turns out that we can also use variations of it to achieve awesome results in different tasks. This is the **motivation behind BERT**!





{{ '[The NLP group from Harvard](http://nlp.seas.harvard.edu/2018/04/03/attention.html) has written a great blog post distilling the paper as well as implementing it in PyTorch. If you are interested in the details of the transformer architecture, I recommend looking at it! ' | fndetail: 4 }}

### Attention?

According to the [Transformer and Attention lecture from NYU foundations of Deep Learning Course](https://atcold.github.io/pytorch-Deep-Learning/en/week12/12-3/):

   > Transformers are made up of attention modules, which are mappings between sets, rather than sequences, which means we do not impose an ordering to our inputs/outputs.
   


When we look at the Transformer architecture, we can see that both the _Multi-Head Attention_ and the _Masked Multi-Head Attention_ boxes have 3 arrowheads pointing into them. Each one represents one of the following:

   - _Q_, which stands for the **query** vector, with dimension $d_k$
   - _K_, which stands for the **key** vector, also with dimension $d_k$
   - _V_, which stands for the **value** vector, with dimension $d_v$
   
These three can be understood as projections of the input embeddings.
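
A minimal NumPy sketch of these projections, assuming random matrices in place of learned weights (the names `W_q`, `W_k`, `W_v` and the toy sizes are my own choices for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

seq_len, d_model = 5, 16  # 5 input tokens, embedding size 16 (toy sizes)
d_k, d_v = 8, 4           # query/key size and value size (they need not match)

x = rng.normal(size=(seq_len, d_model))  # input embeddings, one row per token

# Random stand-ins for the learned projection matrices
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

Q = x @ W_q  # queries: (seq_len, d_k)
K = x @ W_k  # keys:    (seq_len, d_k)
V = x @ W_v  # values:  (seq_len, d_v)

print(Q.shape, K.shape, V.shape)
```

Queries and keys must share the same size because they are compared with a dot product; the values are only weighted and summed, so their size is free to differ.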

### Key-Value Store

Again, from the [Deep Learning Foundations Course from NYU](https://atcold.github.io/pytorch-Deep-Learning/en/week12/12-3/):

> A key-value store is a paradigm designed for storing (saving), retrieving (querying), and managing associative arrays (dictionaries/hash tables)

>For example, say we wanted to find a recipe to make lasagne. We have a recipe book and search for “lasagne” - this is the query. This query is checked against all possible keys in your dataset - in this case, this could be the titles of all the recipes in the book. We check how aligned the query is with each title to find the maximum matching score between the query and all the respective keys. If our output is the argmax function - we retrieve the single recipe with the highest score. Otherwise, if we use a soft argmax function, we would get a probability distribution and can retrieve in order from the most similar content to less and less relevant recipes matching the query.

> Basically, the query is the question. Given one query, we check this query against every key and retrieve all matching content.

Then, we can say that _K_, _Q_ and _V_ are specific rotations (linear transformations) of a given input vector _x_ (the embedding, for instance).
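
The recipe-book analogy quoted above can be sketched in a few lines. This is only an illustration under my own assumptions: the 2-dimensional key and value vectors are made up, and the scores are scaled by $\sqrt{d_k}$ as in the scaled dot-product attention of the Transformer paper:

```python
import numpy as np


def softmax(scores):
    """Soft argmax: turn scores into a probability distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Made-up 2-d "title" vectors (keys) and the content they point to (values)
recipe_titles = ["lasagne", "ravioli", "brownie"]
keys = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0]])
values = np.array([[10.0, 0.0], [8.0, 2.0], [0.0, 10.0]])

query = np.array([1.0, 0.1])  # what we are looking for: something lasagne-like

# How aligned is the query with each key? (scaled dot product)
scores = keys @ query / np.sqrt(keys.shape[1])

# Hard argmax: retrieve only the single best match
best_match = recipe_titles[int(np.argmax(scores))]

# Soft argmax: weight every value by how well its key matches the query
weights = softmax(scores)
blended_value = weights @ values

print(best_match)
print(weights.round(2))
```

Attention uses the soft version: instead of returning one recipe, it returns a weighted mix of all values, which keeps the whole operation differentiable.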

> Warning: I have decided not to cover attention concepts in depth in this post, giving only a high-level introduction. As you might have noticed, the NYU Deep Learning Foundations Course provides a really nice introduction to the topic that I recommend going through if you want to learn more :)

### Positional Encoding

This section is based on [The annotated transformer blog](http://nlp.seas.harvard.edu/2018/04/03/attention.html#prelims), where you can find a cool PyTorch implementation. The passage below is actually a quote from the Attention is all you need [paper](https://arxiv.org/pdf/1706.03762.pdf):

>Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.


![](images/positional_encoding.png "One example of a positional encoding that generates a sine wave based on position. Notice that each dimension generates a sine wave with a different frequency. Source: http://nlp.seas.harvard.edu/2018/04/03/attention.html")
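
For reference, here is a small NumPy sketch of the sinusoidal encoding defined in the Attention is all you need paper, where even dimensions use $\sin(pos / 10000^{2i/d_{model}})$ and odd dimensions use the matching cosine. The function name `positional_encoding`, the toy sizes and the zero-valued placeholder embeddings are assumptions of this sketch, and it expects an even `d_model`:

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding


d_model = 16
token_embeddings = np.zeros((10, d_model))  # placeholder embeddings for 10 tokens

# Same dimension as the embeddings, so the two can simply be summed
encoded = token_embeddings + positional_encoding(10, d_model)
print(encoded.shape)
```

Each column of the result is a wave with its own frequency, which is exactly what the plot above shows.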

## The BERT model

BERT itself consists only of the _encoder_ part of the Transformer model. Considering the models trained in the [paper](https://arxiv.org/pdf/1810.04805.pdf), the **base** model consists of 12 _stacked encoder_ layers and the **large** model consists of 24 _stacked encoder_ layers.

According to the [Attention is all you need paper](https://arxiv.org/pdf/1706.03762.pdf):

> The encoder is composed of a stack of $N = 6$ identical layers. Each layer has **two sub-layers**. The first is a **multi-head self-attention mechanism**, and the second is a simple, **position-wise fully connected feed-forward network**. We employ a [residual connection](https://arxiv.org/abs/1512.03385) around **each** of the two sub-layers, followed by [layer normalization](https://arxiv.org/abs/1607.06450).

![](images/sublayers.jpg "The encoder layer. Source: https://atcold.github.io/pytorch-Deep-Learning/en/week12/12-1/")

### The Multi-Head Attention

Multi-head attention is a _type_ of attention mechanism. It is essentially a _concatenation_ of several instances of another type of attention, the _scaled dot-product_ attention. Both mechanisms work together as represented in the following image:

![](images/attention_specific.png "(left) Scaled Dot-Product Attention. (right) Multi-Head Attention, which consists of several attention layers running in parallel. Source: https://atcold.github.io/pytorch-Deep-Learning/en/week12/12-1/")

Here, _h_, the number of attention heads (or parallel attention layers), is equal to $12$ in the case of $\text{BERT}_\text{base}$ and $16$ in the case of $\text{BERT}_\text{large}$.
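
The picture can be turned into a short NumPy sketch. It uses the BERT base sizes mentioned in this post (12 heads, hidden size 768) and assumes, as in the Transformer paper, that each head works with $d_{model} / h = 64$ dimensions; all weights are random stand-ins, and the helper names are mine:

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V


def multi_head_attention(x, heads, W_o):
    """Run every head on the same input, concatenate the results, project back."""
    head_outputs = [
        scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
        for W_q, W_k, W_v in heads
    ]
    return np.concatenate(head_outputs, axis=-1) @ W_o


rng = np.random.default_rng(seed=0)
seq_len, d_model, h = 6, 768, 12  # 6 tokens, BERT base hidden size and heads
d_head = d_model // h             # dimensions handled by each head

x = rng.normal(size=(seq_len, d_model))

# Random stand-ins for the learned per-head projections (W_q, W_k, W_v)
heads = [
    tuple(rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3))
    for _ in range(h)
]
W_o = rng.normal(size=(d_model, d_model)) * 0.02  # output projection

output = multi_head_attention(x, heads, W_o)
print(output.shape)
```

Because the head outputs are concatenated back to `d_model` columns, the layer returns one vector per token with the same size as its input.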

### Residual Connections


Each sublayer of the encoder stack contains a residual connection (the left curved arrow) that is added to the sublayer output before layer normalization. The [idea of Residual Connections](https://arxiv.org/pdf/1512.03385.pdf) came from the Computer Vision domain, and it is actually a relatively simple technique that can be summarized by the following image:

![](images/residual_connection.png "Residual Connection example. Source (https://arxiv.org/pdf/1512.03385.pdf)")

Considering the image above and the case of the Encoder stack, $\mathcal{F}(x)$ stands for either the _Multi-Head Attention_ or the _Feed Forward_ sub-layer. Therefore, quoting the paper:

> That is, the **output of each sub-layer is LayerNorm(x + Sublayer(x))**, where Sublayer(x) is the function implemented by the sub-layer itself. To _facilitate these residual connections_, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$ {% fn 5 %}.




{{ 'In the case of the [BERT model](https://arxiv.org/pdf/1810.04805.pdf), please keep in mind that $N$ is either $12$ (BERT<sub>base</sub>) or $24$ (BERT<sub>large</sub>), and _d<sub>model</sub>_ is 768 for BERT base and 1024 for BERT large' | fndetail: 5 }}
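
The formula LayerNorm(x + Sublayer(x)) is short enough to sketch directly. In the NumPy snippet below, `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer, and the layer normalization leaves out the learned scale and shift parameters to keep things small:

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance
    (the learned scale and shift of real layer normalization are left out)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def residual_block(x, sublayer):
    """LayerNorm(x + Sublayer(x))"""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(seed=0)
d_model = 768                      # BERT base hidden size
x = rng.normal(size=(6, d_model))  # 6 token vectors

# Random weights for the hypothetical stand-in sub-layer
W = rng.normal(size=(d_model, d_model)) * 0.02


def toy_sublayer(inputs):
    """Stand-in for Multi-Head Attention or Feed Forward."""
    return np.maximum(inputs @ W, 0.0)


output = residual_block(x, toy_sublayer)
print(output.shape)
```

The sum `x + sublayer(x)` only works when the sub-layer returns the same shape it receives, which is why every sub-layer keeps the $d_{model}$ dimension.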

Then, what, in fact, is being encoded?

## Embedding Representation

The authors wanted BERT to perform well in different downstream tasks such as _binary and multi-label classification_; _language modeling_; _question answering_; _named entity recognition_; _etc_. Therefore, they said the following:

> our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together

To create the input embeddings, [WordPiece tokenization is applied](https://arxiv.org/abs/1609.08144). Then, besides adding a [CLS] token at the start, pairs of sentences (e.g. sentences _A_ and _B_) are concatenated into a single sequence, separated by a special [SEP] token (e.g. _A_ [SEP] _B_). Finally, a learned embedding ([explained in NSP section](#Next-Sentence-Prediction)) that indicates whether the token belongs to A or B is also added to the token representation.

Then:

> For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

![](images/token_embeddings.png "BERT input representation. Source: https://arxiv.org/pdf/1810.04805.pdf")
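
Here is a small sketch of how such an input could be assembled. Everything in it is a toy assumption: the example sentences, the tiny vocabulary built on the fly (a real model uses a fixed WordPiece vocabulary), the hidden size of 8 and the random embedding tables:

```python
import numpy as np


def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] and mark each token as segment 0 (A) or 1 (B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(["my", "dog", "is", "cute"], ["he", "likes", "playing"])

# Toy vocabulary built from the tokens themselves
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]
position_ids = list(range(len(tokens)))

rng = np.random.default_rng(seed=0)
hidden_size = 8  # 768 in BERT base; kept small here
token_table = rng.normal(size=(len(vocab), hidden_size))
segment_table = rng.normal(size=(2, hidden_size))
position_table = rng.normal(size=(len(tokens), hidden_size))

# Input representation = token + segment + position embeddings
input_embeddings = (
    token_table[token_ids] + segment_table[segment_ids] + position_table[position_ids]
)

print(tokens)
print(segment_ids)
print(input_embeddings.shape)
```

All three tables share the same hidden size, so the three lookups can be added element-wise into a single vector per token.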

# BERT Pre Training

The first part of BERT is a pre-training procedure that involves two objective functions:

## Masked Language Model (MLM)

Since we feed the whole sentence into the model, the model is bidirectional. Hence, if we trained it to predict the next word in a sentence, it would already have access to that word! The idea behind this task is therefore pretty simple. We can quote directly from the [paper](https://arxiv.org/pdf/1810.04805.pdf):

> Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

> In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a _Cloze task_ in the [literature](https://journals.sagepub.com/doi/abs/10.1177/107769905303000401). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.

In the case of the BERT model, 15% of the tokens in each sequence are masked during training.

![](images/mlm.png "MLM task. Taken from here: http://jalammar.github.io/illustrated-bert/")
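
A simplified sketch of the masking step, using only the Python standard library. The function name `mask_tokens` and the example sentence are my own, and the sketch always swaps a selected token for [MASK]; the paper adds a further rule where some of the selected tokens are replaced by a random token or left unchanged, which is omitted here:

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Pick roughly 15% of the positions at random and hide them behind [MASK].

    Returns the corrupted tokens and a {position: original token} dict
    holding what the model has to predict.
    """
    rng = random.Random(seed)
    # Special tokens are never masked
    candidates = [i for i, token in enumerate(tokens) if token not in ("[CLS]", "[SEP]")]
    num_to_mask = max(1, round(len(candidates) * mask_rate))
    masked_positions = sorted(rng.sample(candidates, num_to_mask))

    corrupted = list(tokens)
    labels = {}
    for position in masked_positions:
        labels[position] = tokens[position]  # the target for this position
        corrupted[position] = "[MASK]"
    return corrupted, labels


tokens = "[CLS] the quick brown fox jumps over the lazy dog [SEP]".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)
```

The labels come straight from the original text, so no human annotation is involved.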

## Next Sentence Prediction (NSP)

In order to learn relationships between pairs of sentences (e.g. Question Answering tasks), the authors needed a different approach than plain Language Modeling. Then:

>  In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as `IsNext`), and 50% of the time it is a random sentence from the corpus (labeled as `NotNext`). 


Once defined, both objective functions are used together during BERT pre-training :)

![](images/nsp.png "Next Sentence Prediction. Taken from here: http://jalammar.github.io/illustrated-bert/")
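
To see how such training pairs can be generated from plain text, here is a minimal sketch using only the standard library. The helper `make_nsp_examples` and the four-sentence corpus are hypothetical, and the sketch assumes the corpus has at least three distinct sentences:

```python
import random


def make_nsp_examples(sentences, seed=0):
    """For each sentence A, pick B as the real next sentence half of the time
    (IsNext) and as a random sentence from the corpus otherwise (NotNext)."""
    rng = random.Random(seed)
    examples = []
    for index in range(len(sentences) - 1):
        sentence_a = sentences[index]
        if rng.random() < 0.5:
            sentence_b, label = sentences[index + 1], "IsNext"
        else:
            # Draw a random sentence that is neither A nor its true successor
            others = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
            sentence_b, label = rng.choice(others), "NotNext"
        examples.append((sentence_a, sentence_b, label))
    return examples


corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
examples = make_nsp_examples(corpus)
for sentence_a, sentence_b, label in examples:
    print(f"{label}: {sentence_a} [SEP] {sentence_b}")
```

The `IsNext` / `NotNext` labels are derived from the order of the sentences alone.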

> Note: The training loss is the sum of the mean masked LM (MLM) likelihood and the mean next sentence prediction (NSP) likelihood

> Important: You may have noticed that this training procedure **does not require labeling**. As we are using the raw text inputs to generate the _labels_ during training, we consider BERT pre-training a _self-supervised_ procedure!

# Putting it all together

As we are dealing with **sentence** embeddings rather than **word** embeddings, we need a clever way to, well, encode these sentences. Let's see how BERT does it:

   - We first take a text as input
   - We apply the WordPiece tokenizer
   - We feed the input into the Encoder stack
   - We train the network (Pre-Training step)
   - For those familiar with _CNNs_, we can say that the [CLS] embedding works as a "pooled" representation ([ref](https://arxiv.org/pdf/2002.08909.pdf)) of the sentence and can then be used as a **contextual embedding feature**. Hence, it can be fed into a Neural Net to solve classification tasks!
   - Depending on the downstream task (_Fine-tuning task_), other token embeddings can be used as well
   
 > Important: without the fine-tuning task, the CLS vector is not a meaningful sentence representation, since it was trained with NSP ([ref](https://arxiv.org/pdf/1810.04805.pdf))

I have tried to summarize a forward pass of BERT through the following gif:

![](images/media/videos/scene/720p30/TransformerEncoderExample.gif "Entire forward pass in BERT")

# Working in Practice

To show BERT sentence embeddings in action, I usually rely on [Hugging Face's Transformers library](https://huggingface.co/transformers/). Here, since the **BERT language model** has already been pre-trained, I will be using the bare BERT model without any specific head (e.g., `LanguageModeling head` or `Sentence Classification head`) on top of it!

In [7]:
#collapse
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [8]:
sequence_0 = "He will get scared"
sequence_1 = "She will get the drinks"

In [9]:
#collapse-hide
sequence_0_w2id = tokenizer.encode(sequence_0) # we need to map words to id's :)
sequence_1_w2id = tokenizer.encode(sequence_1)

print(f"Sequence 0 word2Id mapping: {sequence_0_w2id}")
print(f"Sequence 1 word2Id mapping: {sequence_1_w2id}")

Sequence 0 word2Id mapping: [101, 2002, 2097, 2131, 6015, 102]
Sequence 1 word2Id mapping: [101, 2016, 2097, 2131, 1996, 8974, 102]


In [10]:
#collapse-hide
# Wrap each list of token ids in a tensor and add a leading batch dimension (batch size 1)
sequence_0_ids = torch.tensor(sequence_0_w2id).unsqueeze(0)
sequence_1_ids = torch.tensor(sequence_1_w2id).unsqueeze(0)

# Inference only, so gradients are not needed
with torch.no_grad():
    sequence_0_embeddings = model(sequence_0_ids, return_dict=True)["last_hidden_state"].numpy()
    sequence_1_embeddings = model(sequence_1_ids, return_dict=True)["last_hidden_state"].numpy()

In [11]:
sequence_0_embeddings.shape, sequence_1_embeddings.shape

((1, 6, 768), (1, 7, 768))

Since the first dimension means the batch size, we can get rid of it!

In [12]:
#collapse-hide
sequence_0_embeddings=sequence_0_embeddings[0]
sequence_1_embeddings=sequence_1_embeddings[0]
sequence_0_embeddings.shape, sequence_1_embeddings.shape

((6, 768), (7, 768))

It turns out that this model generates one embedding for each word (WordPiece token) plus the `CLS` and `SEP` tokens. This explains why the id lists of sequence_0 and sequence_1 both start with the same id (101, for `CLS`) and end with the same id (102, for `SEP`)! Let's perform some cool math to analyze some patterns :)

First, let's analyze the similarity between the CLS embedding and the average of the word token embeddings:

In [13]:
# collapse-hide
# Row 0 is the [CLS] embedding; rows 1-4 are the four word tokens (row 5 is [SEP])
CLS_TOKEN_0 = sequence_0_embeddings[0]
CLS_TOKEN_WORDS_0 = np.mean(sequence_0_embeddings[[1, 2, 3, 4]], axis=0)
print(
    f"Cosine Similarity between CLS token and the average of\n'{sequence_0}'"
    f" tokens: {cosine_similarity(CLS_TOKEN_0.reshape(1, -1), CLS_TOKEN_WORDS_0.reshape(1, -1))[0][0]}"
)

Cosine Similarity between CLS token and the average of
'He will get scared' tokens: 0.29071152210235596


In [14]:
# collapse-hide
CLS_TOKEN_1 = sequence_1_embeddings[0]
# Caveat: rows 1-4 cover "she will get the" only; "drinks" (row 5) is left out of this average.
# sequence_1_embeddings[1:-1] would include every word token.
CLS_TOKEN_WORDS_1 = np.mean(sequence_1_embeddings[[1, 2, 3, 4]], axis=0)
print(
    f"Cosine Similarity between CLS token and the average of \n'{sequence_1}'"
    f" tokens: {cosine_similarity(CLS_TOKEN_1.reshape(1, -1), CLS_TOKEN_WORDS_1.reshape(1, -1))[0][0]}"
)

Cosine Similarity between CLS token and the average of 
'She will get the drinks' tokens: 0.32392317056655884


This is interesting: as stated by the paper, the CLS token _does not seem to be a meaningful_ summary of the sentence without fine-tuning. Next, let's analyze the similarity between the average token embeddings of each sentence:

In [15]:
#hide_input
print(
    f"Cosine Similarity between the average token embeddings of\n'{sequence_0}' and '{sequence_1}'"
    f": {cosine_similarity(CLS_TOKEN_WORDS_0.reshape(1, -1), CLS_TOKEN_WORDS_1.reshape(1, -1))[0][0]}"
)

Cosine Similarity between the average token embeddings of
'He will get scared' and 'She will get the drinks': 0.6591895222663879


**As expected**, despite the fact that _similar_ words were used, their contexts were totally different, and therefore the similarity between the BERT embeddings (about 0.66) is lower than the one between the plain word vectors (about 0.87) :)

# Conclusion

Congratulations! You have learned the main concepts behind the BERT model :) Please stay tuned for future blog posts :) I intend to distill BERT fine-tuning as well as dissect it from scratch!

However, if you want a higher-level view of how this works, I [highly recommend this blog post](https://huggingface.co/blog/how-to-train)!

# Resources that have inspired me

Besides all the papers that I have referenced throughout this post, I would like to emphasize the following:

- http://jalammar.github.io/illustrated-bert/
- https://jalammar.github.io/illustrated-transformer/
- http://nlp.seas.harvard.edu/2018/04/03/attention.html

# Acknowledgments

I really appreciate the effort made by some colleagues who provided a fantastic technical review of this blog post :)


In alphabetical order:

- [Alan Barzilay](https://www.linkedin.com/in/alan-barzilay-58754855/)
- [Alvaro Marques](https://www.linkedin.com/in/alvaro-marques-9a10aa131/)
- [Igor Hoelscher](https://www.linkedin.com/in/ighoelscher/)