## Embeddings

In our previous example, we operated on high-dimensional bag-of-words vectors of length `vocab_size`, explicitly converting low-dimensional positional representation vectors into a sparse one-hot representation. This one-hot representation is not memory-efficient, and each word is treated independently of every other: one-hot encoded vectors do not express any semantic similarity between words.

The idea of **embedding** is to represent words by lower-dimensional dense vectors that reflect the semantic meaning of a word. We will later discuss how to build meaningful word embeddings, but for now let's just think of embeddings as a way to lower the dimensionality of a word vector.

An embedding layer takes a word as input and produces an output vector of the specified `embedding_size`. In a sense, it is very similar to a `Linear` layer, but instead of taking a one-hot encoded vector, it takes a word number (the word's index in the vocabulary) as input.
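
The relationship to a linear layer can be checked by hand. The sketch below uses plain Python lists rather than a deep learning library, and the sizes and random weights are made up for illustration. It shows that multiplying a one-hot vector by a weight matrix gives the same result as simply picking one row of that matrix, which is the shortcut an embedding layer takes.

```python
import random

vocab_size, embedding_size = 5, 3  # toy sizes, chosen only for illustration
random.seed(0)

# Embedding table: one row of `embedding_size` weights per word index.
weights = [[random.uniform(-1, 1) for _ in range(embedding_size)]
           for _ in range(vocab_size)]


def one_hot(index, size):
    """Sparse representation: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear_on_one_hot(index):
    """Multiply a one-hot row vector by the weight matrix (no bias term)."""
    x = one_hot(index, vocab_size)
    return [sum(x[i] * weights[i][j] for i in range(vocab_size))
            for j in range(embedding_size)]


def embedding_lookup(index):
    """Pick the row directly; no multiplication is needed."""
    return weights[index]


word_index = 2
print(linear_on_one_hot(word_index) == embedding_lookup(word_index))
```

The lookup avoids building the `vocab_size`-long one-hot vector at all, which is why taking a word number as input is cheaper than taking a one-hot vector.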

By using an embedding layer as the first layer in our network, we can switch from the bag-of-words model to an **embedding bag** model, where we first convert each word in our text into the corresponding embedding and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.
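
As a rough illustration of the aggregation step, here is a minimal plain-Python sketch. The embedding values and the `embedding_bag` helper are hypothetical, invented for this example. The point is that texts of different lengths are reduced to a vector of the same size.

```python
# Toy embedding table: word index -> 2-dimensional vector (values are made up).
embeddings = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 1.0],
    [-1.0, 4.0],
]


def embedding_bag(word_indices, mode="mean"):
    """Look up each word's embedding, then aggregate per dimension."""
    vectors = [embeddings[i] for i in word_indices]
    columns = list(zip(*vectors))  # group values by embedding dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


short_text = [0, 2]        # a two-word text
long_text = [1, 2, 3, 3]   # a four-word text with a repeated word

print(embedding_bag(short_text, "sum"))   # [4.0, 1.0]
print(embedding_bag(long_text, "mean"))   # [0.25, 2.75]
print(embedding_bag(long_text, "max"))    # [3.0, 4.0]
```

Whatever the text length, the output has `embedding_size` entries (2 here), so the layers that follow can have a fixed input size.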

![Embedding Classifier](../images/embed-classifier.png)

As a result of this architecture, minibatches to our network would need to be created in a certain way. In the previous example, all BoW tensors in a minibatch had equal size `vocab_size`, regardless of the actual length of a sequence. In  



## Pretrained Embeddings: Word2Vec and Variants

As opposed to traditional distributional models, neural embeddings such as Word2Vec are learned by training a neural language model to minimize a loss function on tasks that map to language understanding. This process of training models on large collections of text to extract word representations is called pre-training.

One of the first successful neural pre-training techniques for text representation was Word2Vec.

There are two main architectures that are used to produce a distributed representation of words:

 - Continuous bag-of-words (CBOW): the model uses the surrounding window of context words to predict the current word.
 - Continuous skip-gram: the model uses the current word to predict the surrounding window of context words.
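
To make the difference concrete, the following sketch shows how the same sentence might be turned into training examples for each architecture: CBOW pairs a window of context words with the word in the middle, while skip-gram pairs the middle word with each of its context words. The helper names and the window size are assumptions made for this illustration, and real implementations add details such as subsampling and negative sampling.

```python
def context_window(tokens, position, window):
    """Words up to `window` positions to the left and right of `position`."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word) - predict the word from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: (center word, context word) - predict context from the word."""
    return [(tokens[i], context_word)
            for i in range(len(tokens))
            for context_word in context_window(tokens, i, window)]


tokens = "the quick brown fox jumps".split()

for context, target in cbow_examples(tokens, window=1):
    print("CBOW      ", context, "->", target)

for center, context_word in skipgram_examples(tokens, window=1):
    print("skip-gram ", center, "->", context_word)
```

With a window of 1, the five-word sentence yields five CBOW examples (one per word) but eight skip-gram pairs (one per word-context combination).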

CBOW trains faster, while skip-gram is slower but does a better job of representing infrequent words.

![word2vec image](../images/word2vec.png)

Both CBOW and skip-gram are “predictive” embeddings, in that they only take local contexts into account; Word2Vec does not take advantage of global context. FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word information.
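
The sub-word idea can be sketched in a few lines. The example below extracts character n-grams from a word wrapped in `<` and `>` boundary markers and averages one vector per piece. The `toy_vector` function is a hypothetical stand-in for learned vectors, so only the n-gram extraction and the averaging reflect the technique; the numbers themselves mean nothing.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


def toy_vector(piece):
    """Hypothetical stand-in for a learned vector (not real FastText weights)."""
    return [float(len(piece)), float(sum(map(ord, piece)) % 10)]


def word_representation(word):
    """Average the vectors of the whole word and of its character n-grams."""
    pieces = sorted(set(char_ngrams(word)) | {f"<{word}>"})
    vectors = [toy_vector(piece) for piece in pieces]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(char_ngrams("play", 3, 3))     # ['<pl', 'pla', 'lay', 'ay>']

# Related word forms share n-grams, so they share part of their representation.
shared = set(char_ngrams("play", 3, 3)) & set(char_ngrams("playing", 3, 3))
print(sorted(shared))                # ['<pl', 'lay', 'pla']

print(word_representation("play"))
```

Because a representation can be assembled from n-grams alone, a word that never appeared in the training text can still be given a vector as long as some of its n-grams did.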

Another method, GloVe, by contrast leverages the same intuition behind the co-occurrence matrix used by the traditional distributional embeddings above, but learns dense word vectors by factorizing that matrix: it is trained with gradient-based optimization so that the dot products of word vectors approximate the logarithm of the words' co-occurrence counts.
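
The global statistic that GloVe starts from is a word-word co-occurrence matrix. The sketch below counts how often two words appear within a fixed window of each other across a tiny made-up corpus. It uses plain unweighted counts and a hypothetical helper name; GloVe itself additionally down-weights distant pairs and then fits vectors to these counts, which is not shown here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbor) pairs that occur within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["we play a game", "we watch a play", "they play a game"]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("play", "a")])      # 3
print(counts[("a", "game")])      # 2
print(counts[("play", "game")])   # 0 (never adjacent with window=1)
```

Unlike the local prediction windows used by Word2Vec, these counts are accumulated over the entire corpus before any vectors are learned.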

Below we use the Gensim API to explore pretrained Word2Vec, FastText, and GloVe embeddings and find the words whose embeddings are most similar to that of the word 'play':


In [None]:
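import gensim.downloader as api
from gensim.models.fasttext import load_facebook_vectors

# Word2Vec vectors trained on Google News (downloaded on first use)
w2v = api.load('word2vec-google-news-300')
print(w2v.most_similar('play'))

# FastText vectors in Facebook's binary format. This assumes 'wiki.simple.bin'
# has already been downloaded to the working directory. The older
# gensim.models.wrappers module was removed in Gensim 4, so
# load_facebook_vectors is used here instead.
fast_text = load_facebook_vectors('wiki.simple.bin')
print(fast_text.most_similar('play'))

# GloVe vectors trained on Twitter data
glove = api.load("glove-twitter-25")
print(glove.most_similar('play'))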
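
Gensim's `most_similar` ranks words by the cosine similarity between their vectors. For readers who want to see the arithmetic without downloading a model, here is a minimal sketch over a hand-made table; the three-dimensional vectors are invented for illustration and are not taken from any pretrained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "play": [0.9, 0.8, 0.1],
    "game": [0.8, 0.9, 0.2],
    "theater": [0.7, 0.2, 0.9],
    "banana": [-0.5, 0.1, 0.3],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vector))
              for other, vector in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbor, score in most_similar("play", toy_vectors):
    print(f"{neighbor}: {score:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not their length, so a frequent word with a large-magnitude vector does not automatically dominate the ranking.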


One key limitation of traditional pretrained embedding representations such as Word2Vec is the problem of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as 'play', have different meanings depending on the context in which they are used.

For example, the word 'play' in the sentence
- I went to a [play] at the theater.

does not mean the same thing as the word 'play' in the sentence
- John wants to [play] with his friends.

The pretrained embeddings above represent both of these meanings of the word 'play' in the same embedding. Contextual embedding methods were developed to address this disambiguation challenge and contributed to a major leap forward in natural language processing applications.
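
The limitation is easy to reproduce with any static lookup table. In the sketch below, the table, its values, and the `embed` helper are all hypothetical; the point is only that a per-word lookup cannot see the rest of the sentence, so both senses of 'play' come back as the identical vector.

```python
# A static embedding table: exactly one vector per word (values are made up).
static_embeddings = {
    "play": [0.9, 0.8, 0.1],
    "theater": [0.7, 0.2, 0.9],
    "friends": [0.1, 0.9, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the toy table


def embed(sentence):
    """Tokenize naively and look up each token independently of its context."""
    tokens = sentence.lower().rstrip(".").split()
    return tokens, [static_embeddings.get(token, UNKNOWN) for token in tokens]


tokens_a, vectors_a = embed("I went to a play at the theater.")
tokens_b, vectors_b = embed("John wants to play with his friends.")

play_as_noun = vectors_a[tokens_a.index("play")]
play_as_verb = vectors_b[tokens_b.index("play")]

print(play_as_noun == play_as_verb)  # True: the context made no difference
```

A contextual model replaces the per-word lookup with a function of the whole sentence, so the two occurrences can receive different vectors.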



## Contextual Embeddings

To address the challenge of word sense disambiguation, a new approach emerged: pretraining models on large amounts of text and then using the pretrained models to generate contextual embeddings. It was spearheaded by models such as ULMFiT and ELMo, and later BERT.

![elmo](images/elmo.png)

Below we will use spaCy's transformer API to experiment with contextual embeddings.


In [None]:
# The model name below belongs to the spaCy 2.x-era spacy-transformers releases,
# so the install is pinned to a pre-1.0 version (an assumption about your setup).
!pip install "spacy-transformers<1.0"
!python -m spacy download en_trf_bertbaseuncased_lg

import spacy

nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc1 = nlp("I went to a play.")
doc2 = nlp("John wants to play a game.")
doc3 = nlp("John went to a show.")

# Token indices: doc1[4] is 'play' (noun), doc2[3] is 'play' (verb), doc3[4] is 'show'
print("Similarity between the two words 'play' in doc1 and doc2:", doc1[4].similarity(doc2[3]))
print("Similarity between doc1 'play' and doc3 'show':", doc1[4].similarity(doc3[4]))
print("Similarity between doc2 'play' and doc3 'show':", doc2[3].similarity(doc3[4]))

ULMFiT and ELMo generate an embedding for a word based on the context in which it appears, so each occurrence of the word gets a slightly different embedding. This allows a downstream model to better pick out the correct sense of a word such as 'play'. On their release, these models enabled near-instant state-of-the-art results on many downstream tasks, including tasks such as coreference resolution that were previously not as viable for practical use in NLP.

This has been called the ImageNet moment of NLP. More recent transformer-based models such as BERT build on this development, using attention-based transformers instead of bidirectional RNNs to encode context. If you are unfamiliar with terms such as transformers and RNNs, do not worry: in the next module we will walk through the progression of NLP models, culminating in the current state-of-the-art models in NLP with PyTorch.
