# "BERT Introduction"
> "Inspired by Jay Alammar's Illustrated BERT blog post, I have decided to explain it in my own words :)"

- toc: true
- branch: master
- author: Andre Barbosa
- badges: true
- hide_binder_badge: true
- hide_colab_badge: true
- comments: true
- categories: [masters, nlp]
- hide: false
- search_exclude: false

# A quick review

I remember a day in 2016, while I was starting my career as a Data Scientist, when I stumbled upon [Chris McCormick's blog post about Word2Vec](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/). Honestly, I think that [Tomas Mikolov's paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) presents one of the simplest and most elegant ideas that I have found so far {% fn 1 %} :) 

{{ 'Fun Fact: Nowadays, the [Mikolov LinkedIn profile](https://www.linkedin.com/in/tomas-mikolov-59831188/?originalSubdomain=cz) points out that he has worked for Microsoft, Google and Facebook. Another of the W2V authors, [Ilya Sutskever](http://www.cs.toronto.edu/~ilya/), worked with some of the most prestigious researchers in the recent AI field, such as [Geoffrey Hinton](https://www.cs.toronto.edu/~hinton/) and [Andrew Ng](https://www.andrewng.org/). Moreover, he is one of the founders of [OpenAI](https://openai.com/)! ' | fndetail: 1 }}

## What are Word Embeddings


According to the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html), an **Embedding** can be defined as follows: 

   >A simple lookup table that stores embeddings of a fixed _dictionary_ and _size_.

So, we can interpret embeddings as a simple way to convert _integers_ into _vectors_ of a given size. For **word embeddings**, this means that words are first encoded as integers, and then _these integers serve as row indices into a table of vectors_.

I have written some code with [manim](https://github.com/3b1b/manim) to illustrate this process:

In [2]:
#hide
import jupyter_manim
from manimlib.imports import *
import torch

In [2]:
#hide
# %%manim -h
# pass

In [54]:
# hide
# %%manim EmbeddingText --low_quality

# initial_text = "The quick brown fox jumps over the lazy dog!!"
# post_process = initial_text.strip('!').lower().strip()
# tokenized = post_process.split()
# class EmbeddingText(Scene):
#     def construct(self):
#         title = TextMobject("Define some Text")
#         title.to_corner(UP + LEFT)
#         first_step = TextMobject(initial_text)
#         self.play(Write(title),
#                   FadeInFrom(first_step, DOWN),
#                  )
#         self.wait(1)
#         second_title = TextMobject("Preprocess it (optional)")
#         second_title.to_corner(UP + LEFT)
#         second_step = TextMobject(post_process, color=BLUE)
#         self.play(
#             Transform(title,second_title),
#             ReplacementTransform(first_step, second_step))
        
#         third_title = TextMobject("Tokenize it")
#         third_title.to_corner(UP + LEFT)
#         first_arrow = Arrow(DOWN,3*DOWN,color=BLUE)
#         first_arrow.next_to(second_step,DOWN)
#         third_step = TextMobject(str(tokenized))
#         third_step.next_to(first_arrow, DOWN)
        
#         self.play(GrowArrow(first_arrow))
#         self.play(Transform(second_title,third_title), FadeOut(title), FadeIn(third_step))
        

#         fourth_step = TextMobject(str(tokenized))
#         self.play(
#                   FadeOut(second_step),
#                   FadeOut(first_arrow),
#                   FadeOut(third_step),
#                   Transform(third_step, fourth_step))
        
#         fith_step_text = VGroup(
#             TextMobject("the"),
#             TextMobject("quick"),
#             TextMobject("brown"),
#             TextMobject("fox"),
#             TextMobject("jumps"),
#             TextMobject("over"),
#             TextMobject("the"),
#             TextMobject("lazy"),
#             TextMobject("dog"),
#         ).arrange(DOWN, aligned_edge=LEFT)
        
#         self.play(ReplacementTransform(fourth_step, fith_step_text))
        
#         second_arrow = Arrow(RIGHT,3*RIGHT)
#         second_arrow.next_to(fith_step_text, RIGHT)
        
#         fith_step_number = VGroup(
#             TextMobject("0"),
#             TextMobject("1"),
#             TextMobject("2"),
#             TextMobject("3"),
#             TextMobject("4"),
#             TextMobject("5"),
#             TextMobject("0"),
#             TextMobject("6"),
#             TextMobject("7"),
#         ).arrange(DOWN, aligned_edge=LEFT)
#         fith_step_number.next_to(second_arrow, RIGHT)
#         fourth_title = TextMobject("Map each word to an Integer*")
#         second_line = TextMobject("*notice that both words")
#         third_line = TextMobject("the", color=RED)
#         fourth_line = TextMobject("  were mapped to number 0")
#         second_line.scale(.6)
#         third_line.scale(.6)
#         fourth_line.scale(.6)
#         fourth_title.to_corner(UP + LEFT)
#         #Position text
#         second_line.next_to(fourth_title, DOWN)
#         second_line.to_edge(LEFT)
#         third_line.next_to(second_line, 0.8*RIGHT)
#         fourth_line.to_edge(LEFT)
#         fourth_line.next_to(second_line, 0.5*DOWN)
#         self.wait()
#         self.play(GrowArrow(second_arrow))
#         self.play(Transform(third_title,fourth_title),
#                   FadeOut(second_title),
#                   FadeInFrom(second_line,DOWN),
#                   FadeIn(third_line),
#                   FadeIn(fourth_line),
#                   FadeIn(fith_step_number))

        
#         sixth_step = VGroup(
#             TextMobject("0"),
#             TextMobject("1"),
#             TextMobject("2"),
#             TextMobject("3"),
#             TextMobject("4"),
#             TextMobject("5"),
#             TextMobject("0"),
#             TextMobject("6"),
#             TextMobject("7"),
#         ).arrange(DOWN, aligned_edge=LEFT)
        
#         self.wait(2)
#         self.play(FadeOut(fourth_line),
#                    FadeOut(third_line),
#                    FadeOut(second_line),
#                    FadeOut(fith_step_text),
#                   FadeOut(second_arrow),
#                   Transform(fith_step_number, sixth_step))
        
#         seventh_step = VGroup(
#             TextMobject("0"),
#             TextMobject("1"),
#             TextMobject("2"),
#             TextMobject("3"),
#             TextMobject("4"),
#             TextMobject("5"),
#             TextMobject("0"),
#             TextMobject("6"),
#             TextMobject("7"),
#         ).arrange(2*DOWN, aligned_edge=LEFT)
#         seventh_step.next_to(sixth_step, 5*LEFT)
        
#         self.play(
#           FadeOut(fith_step_number),
#           FadeOut(third_title),
#           FadeOut(fourth_title),
#           Transform(sixth_step, seventh_step)
#         )
        
#         third_arrow = Arrow(RIGHT,2*RIGHT)
#         third_arrow.next_to(seventh_step, RIGHT)
        
#         embedding = torch.nn.Embedding(8,4)
#         #single data 
#         input_data = torch.LongTensor([[0,1,2,3,4,5,0,6,7]])
#         # take the first (and only) sequence in the batch
#         emdedding = embedding(input_data).detach().numpy()[0].round(decimals=2).astype('str')
        
#         matrix_first = Matrix(emdedding)
#         matrix_first.next_to(third_arrow, RIGHT)
        
#         fifth_title = TextMobject("Each integer becomes the index of a matrix*")
#         second_line = TextMobject("*Again, notice that both words")
#         third_line = TextMobject("the", color=RED)
#         fourth_line = TextMobject(" were mapped to the same vector")
#         fifth_title.scale(.5)
#         second_line.scale(.5)
#         third_line.scale(.5)
#         fourth_line.scale(.5)
#         fifth_title.to_corner(UP + LEFT)
#         second_line.to_edge(LEFT)
#         second_line.next_to(fifth_title, DOWN)
#         third_line.next_to(second_line, 0.8*RIGHT)
#         fourth_line.to_edge(LEFT)
#         fourth_line.next_to(second_line, 0.5*DOWN)
#         self.play(GrowArrow(third_arrow))
#         self.play(FadeIn(matrix_first),
#                   FadeIn(fifth_title),
#                   FadeInFrom(second_line,DOWN),
#                   FadeIn(third_line),
#                   FadeIn(fourth_line),)
        
#         matrix_second=Matrix(emdedding)
#         self.wait(5)
#         self.play(FadeOut(sixth_step),
#                   FadeOut(third_arrow),
#                   FadeOut(seventh_step),
#                   FadeOut(fourth_line),
#                    FadeOut(third_line),
#                    FadeOut(second_line),
#                   FadeOut(fifth_title),
#                   Transform(matrix_first, matrix_second)
#                  )
#         self.wait(2)

![](images/EmbeddingText.gif "In this example, the embedding matrix has shape NxM, where N is the vocabulary size (8) and M is the embedding size (4). The row representing the word 'the' was duplicated for illustration purposes")
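
The steps shown in the animation can also be sketched in plain Python. This is only a minimal illustration, not what `torch.nn.Embedding` does internally: the names `vocab`, `table` and `embedding_dim` are my own, and the table is filled with random numbers.

```python
import random

# Toy pipeline: text -> tokens -> integer ids -> rows of a lookup table.
text = "The quick brown fox jumps over the lazy dog!!"
tokens = text.strip("!").lower().split()

# Give each distinct word the next free integer, in order of first appearance.
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))
ids = [vocab[token] for token in tokens]

# The "embedding" is just a table with one row per vocabulary entry.
random.seed(0)
embedding_dim = 4  # assumed size, matching the animation
table = [
    [round(random.uniform(-1, 1), 2) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

# Looking up an id returns the corresponding row.
vectors = [table[i] for i in ids]

print(ids)                       # [0, 1, 2, 3, 4, 5, 0, 6, 7]
print(vectors[0] == vectors[6])  # True: both occurrences of "the" share one row
```

Since both occurrences of "the" map to the integer 0, they necessarily retrieve the same row of the table.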

We can then interpret each dimension as a single neuron of a hidden layer, so **these embedding numbers can be modified** by a learning algorithm through a neural network. This is the main motivation behind word embedding algorithms such as [Word2Vec](https://patents.google.com/patent/US9037464B1/en), [GloVe](https://nlp.stanford.edu/projects/glove/) and [fastText](https://fasttext.cc/) {% fn 2 %} 
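
To make "can be modified by a learning algorithm" a bit more concrete, here is a toy sketch of a single gradient-descent step that pulls two word vectors closer together. It is **not** the Word2Vec, GloVe or fastText objective: the loss (squared distance), the words, the vectors and the learning rate are all assumptions chosen for illustration.

```python
# Hypothetical toy table: two 3-dimensional word vectors.
embeddings = {
    "cat": [0.9, -0.2, 0.4],
    "dog": [0.1, 0.6, -0.3],
}

def squared_distance(u, v):
    """Sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sgd_step(table, word_a, word_b, lr=0.1):
    """One gradient-descent step on loss = ||a - b||^2, updating both rows."""
    a, b = table[word_a], table[word_b]
    # d(loss)/da = 2 * (a - b), and d(loss)/db is its negative.
    grad_a = [2 * (x - y) for x, y in zip(a, b)]
    table[word_a] = [x - lr * g for x, g in zip(a, grad_a)]
    table[word_b] = [y + lr * g for y, g in zip(b, grad_a)]

before = squared_distance(embeddings["cat"], embeddings["dog"])
sgd_step(embeddings, "cat", "dog")
after = squared_distance(embeddings["cat"], embeddings["dog"])

print(after < before)  # True: the two rows moved closer together
```

Real algorithms use very different losses, but the mechanics are the same: the rows of the table are parameters, and each training step nudges them.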

Nowadays, some libraries provide pretrained vectors based on a fixed vocabulary. For instance, consider the following [spaCy](https://spacy.io/models) code:

{{ 'I am not going to cover word embeddings in depth in this blog post. If you are not familiar with them, I highly recommend [this](http://jalammar.github.io/illustrated-word2vec/), [this](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [this](https://www.youtube.com/watch?v=ASn7ExxLZws) as potential resources :)' | fndetail: 2 }}

In [5]:
#collapse-hide
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
print("Consider the sentence 'The quick brown fox jumps over the lazy dog!!'")
text = nlp("The quick brown fox jumps over the lazy dog!!")
# Each token exposes its pretrained vector through the `vector` attribute
for word in text:
    print(f"'{word.text}' vector representation has size of {word.vector.shape[0]}. Its first five elements are: {word.vector[:5].round(2)}")

Consider the sentence 'The quick brown fox jumps over the lazy dog!!'
'The' vector representation has size of 300. Its first five elements are: [ 0.27 -0.06 -0.19  0.02 -0.02]
'quick' vector representation has size of 300. Its first five elements are: [-0.45  0.19 -0.25  0.47  0.16]
'brown' vector representation has size of 300. Its first five elements are: [-0.37 -0.08  0.11  0.19  0.03]
'fox' vector representation has size of 300. Its first five elements are: [-0.35 -0.08  0.18 -0.09 -0.45]
'jumps' vector representation has size of 300. Its first five elements are: [-0.33  0.22 -0.35 -0.26  0.41]
'over' vector representation has size of 300. Its first five elements are: [-0.3   0.01  0.04  0.1   0.12]
'the' vector representation has size of 300. Its first five elements are: [ 0.27 -0.06 -0.19  0.02 -0.02]
'lazy' vector representation has size of 300. Its first five elements are: [-0.35 -0.3  -0.18 -0.32 -0.39]
'dog' vector representation has size of 300. Its first five elements are:

The `en_core_web_md` model contains word representations that were trained on [Common Crawl data using the GloVe algorithm](https://github.com/explosion/spacy-models/releases//tag/en_core_web_md-2.3.1). Unlike the example that I used at the beginning, the token '!' was encoded as well. Another interesting fact is that, since GloVe probably went through a preprocessing step, both '_The_' and '_the_' got the same representation. 



In [24]:
#collapse-hide
print(f"First 5 values of word 'The' vector: {nlp('The').vector[:5].round(2)}")
print(f"First 5 values of word 'the' vector: {nlp('the').vector[:5].round(2)}")

First 5 values of word 'The' vector: [ 0.27 -0.06 -0.19  0.02 -0.02]
First 5 values of word 'the' vector: [ 0.27 -0.06 -0.19  0.02 -0.02]


We can combine different words to form the embedding of a phrase. According to the [spaCy documentation](https://spacy.io/usage/vectors-similarity#_title):
> Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. 

Then, the phrase that we are using as an example has the following single representation:

In [23]:
#hide_input
print(f"First 5 values of 'The quick brown fox jumps over the lazy dog!!': {text.vector[:5].round(2)}")

First 5 values of 'The quick brown fox jumps over the lazy dog!!': [-0.23  0.08 -0.03 -0.07 -0.02]
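
The averaging described in the spaCy documentation can be sketched without spaCy at all. The helper `average_vectors` and the 3-dimensional token vectors below are made up for illustration; they are not spaCy's real values or API.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Made-up 3-dimensional token vectors (not spaCy's real values).
token_vectors = {
    "the": [0.2, 0.0, -0.2],
    "lazy": [-0.4, 0.6, 0.2],
    "dog": [0.8, 0.3, -0.6],
}

phrase = ["the", "lazy", "dog"]
phrase_vector = average_vectors([token_vectors[word] for word in phrase])

print([round(x, 2) for x in phrase_vector])  # [0.2, 0.3, -0.2]
```

Note that the average ignores word order: any permutation of the same tokens produces the same phrase vector.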


## Limitations of Word Embeddings

Despite the fact that word embeddings bring a lot of benefits to the realm of computational linguistics, they have some limitations. There is a linguistic phenomenon called _polysemy_. According to [Wikipedia](https://en.wikipedia.org/wiki/Polysemy#:~:text=English%20has%20many%20polysemous%20words,a%20subset%20of%20the%20other.):
> A polyseme is a word or phrase with different, but related senses.(...) English has many polysemous words. For example, the verb "to get" can mean "procure" (I'll get the drinks), "become" (she got scared), "understand" (I get it) etc.

So, considering the example above, despite the fact that the verb has **different meanings** depending on the context, **its word representation would always be the same**.

In [22]:
#hide_input
print(f"First 5 values of verb 'to get' vector: {nlp('to get').vector[:5].round(2)}")

First 5 values of verb 'to get' vector: [ 0.03  0.12 -0.32  0.13  0.12]


Then, if we pick two phrases, `She will get scared` and `She will get the drinks`, we will get the following vectors:

In [28]:
text1 = nlp("She will get scared")
text2 = nlp("She will get the drinks")

print(f"First 5 values of phrase '{text1}' vector: {text1.vector[:5].round(2)}")
print(f"First 5 values of phrase '{text2}' vector: {text2.vector[:5].round(2)}")

First 5 values of phrase 'She will get scared' vector: [-0.08  0.16 -0.22 -0.03  0.02]
First 5 values of phrase 'She will get the drinks' vector: [ 0.01  0.13 -0.04 -0.08  0.03]


Then, if we take the cosine similarity between the averaged word vectors of each phrase:

In [29]:
#collapse-hide
text1.similarity(text2)

0.88455245783805

This indicates that both vectors are very similar. However, the reason for that is the usage of _similar_ words, even though they were applied in different contexts! This is the problem that BERT tries to solve.{% fn 3 %} 
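
To see why shared words dominate, here is a small self-contained sketch of cosine similarity over averaged static vectors. The 3-dimensional vectors in `static` are invented for illustration (they are not spaCy's), so the resulting number only mimics the effect and does not reproduce the value above.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_vectors(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up static vectors: each word has exactly one vector, whatever the context.
static = {
    "she": [1.0, 0.0, 0.0],
    "will": [0.0, 1.0, 0.0],
    "get": [0.0, 0.0, 1.0],
    "scared": [1.0, 0.0, 1.0],
    "the": [1.0, 1.0, 0.0],
    "drinks": [0.0, 1.0, 1.0],
}

s1 = average_vectors([static[w] for w in "she will get scared".split()])
s2 = average_vectors([static[w] for w in "she will get the drinks".split()])
similarity = cosine_similarity(s1, s2)

print(round(similarity, 2))  # 0.89
```

The word "get" contributes the very same vector to both phrases, even though it means "become" in the first and "procure" in the second. A static lookup table has no way to tell the two senses apart.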



{{ 'There are some BERT precursors such as [ELMo](https://allennlp.org/elmo), [ULMFiT](https://arxiv.org/abs/1801.06146) and the [OpenAI Transformer](https://openai.com/blog/language-unsupervised/) that I am not going to cover here. Please refer to the [Illustrated BERT blog post](http://jalammar.github.io/illustrated-bert/) to know more' | fndetail: 3 }}



# TODO
Useful resources for continuing
- http://jalammar.github.io/illustrated-bert/
- https://jalammar.github.io/illustrated-transformer/
- http://nlp.seas.harvard.edu/2018/04/03/attention.html#decoder
- https://github.com/malhotra5/Manim-Tutorial#Text