# Introduction to Representing Text With Contextual Embeddings

Before we dive into building our own question answering model, the first step is to understand how state-of-the-art (SOTA) PyTorch models represent text with contextual embeddings.

At this point you may be wondering what the term 'contextual embedding' even means. Don't worry: by the end of the module that will be abundantly clear. To understand the concept better, let's first take a step back and look at the problem of textual representation and at some of the different approaches that have culminated in the current state of the art.

At the end of this tour we will look at sample PyTorch code using the HuggingFace transformers library to generate contextual embeddings.

# What Is Text Representation?

If you are here, you probably learned at some point how to read, write, and process language. Computers represent textual characters as numbers, using encodings such as ASCII or UTF-8, and map those numbers to the glyphs of a font on your screen. You and I understand what these letters **represent** and how the characters come together to form the words of this sentence. By themselves, however, computers have no such understanding. In order to train language models to process text, we need a mechanism to represent **text features**, such as words and characters, in a numerical form that a model can learn from.
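As a small illustration of the point that characters are stored as plain numbers, the sketch below prints the Unicode code points and the UTF-8 bytes of a short string. The sample word is an arbitrary choice made for this example.

```python
# Each character has a Unicode code point; an encoding such as UTF-8
# turns those code points into the bytes the computer actually stores.
text = "café"  # arbitrary sample string chosen for illustration

code_points = [ord(ch) for ch in text]     # one integer per character
utf8_bytes = list(text.encode("utf-8"))    # one or more bytes per character

print(code_points)  # [99, 97, 102, 233]
print(utf8_bytes)   # [99, 97, 102, 195, 169]
```

Note that these numbers identify characters only. Nothing in them says what the word means, which is the gap that text representations try to fill.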

# How Do We Represent Text with Computers?

There are many different approaches to representing textual features in a format that can be modeled with machine learning. We've already mentioned that contextual embeddings are the state of the art, but before we explain how contextual embeddings work, let's take a brief tour of the more traditional and early neural approaches to representing text.

## Bag of Words Text Representation

Bag of Words (BoW) vector representations are the most commonly used traditional vector representation. Each word or n-gram is assigned a vector index, and that position is marked 0 or 1 depending on whether the word or n-gram occurs in a given document.

[Put bag of word image here]

BoW representations are often used in document classification, where the frequency of each word, bigram, or trigram is a useful feature for training classifiers. One challenge with bag-of-words representations is that they don't encode any information about the meaning of a given word.

In BoW, word occurrences are weighted evenly, regardless of how frequently or in what context they occur. However, in most NLP tasks some words are more relevant than others.
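The binary encoding described in this section can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the two toy documents, whitespace tokenization, and the helper name `bow_vector` are all assumptions made for the example.

```python
# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build a vocabulary: each distinct word gets a fixed vector index.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def bow_vector(doc):
    """Return a binary bag-of-words vector for one document."""
    vec = [0] * len(vocab)
    for word in doc.split():
        vec[index[word]] = 1  # presence only: repeats and word order are ignored
    return vec

vectors = [bow_vector(doc) for doc in docs]

print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 1]
print(vectors[1])  # [0, 1, 1, 0, 1, 1, 1]
```

Notice that "the" receives the same weight as "cat" or "dog", even though it says far less about what each document is about.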

## Term Frequency–Inverse Document Frequency (TF-IDF)

TF-IDF, short for term frequency–inverse document frequency, is a variation of bag of words in which the TF-IDF value, rather than a binary 0 or 1, is used to indicate the appearance of an n-gram in a document. The TF-IDF value is a numerical statistic that reflects how prominent a word or n-gram is in a document within a collection. It increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others in general.
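To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant: term frequency is the count of the term divided by the document length, and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term. The toy corpus and function names are assumptions for this example, and libraries often use smoothed variants of these formulas.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (made up for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this corpus "the" appears in every document, so its IDF is `log(3/3) = 0` and its TF-IDF weight vanishes, while "mat", which appears in only one document, gets the highest weight of the three terms.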

However, even though TF-IDF representations weight words by frequency, they are still unable to represent meaning or word order. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.”

## Traditional Distributional Embeddings with Mutual Information

Distributional embeddings enable word vectors to encapsulate context. Each embedding vector is built from the mutual information a word shares with the other words in a given corpus. Mutual information can be computed from global co-occurrence frequencies or restricted to a given window, defined either sequentially or by dependency edges.
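One way to sketch this idea is with positive pointwise mutual information (PPMI) over a sequential window. PPMI is a specific choice made here for illustration; the text above does not commit to a particular measure. The toy corpus, the window size, and the function names are likewise assumptions.

```python
import math
from collections import Counter

# Toy corpus and window size (both made up for illustration).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
WINDOW = 2

# Count how often each (word, context word) pair falls within the window.
pair_counts = Counter()
for sent in corpus:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(word, sent[j])] += 1

total = sum(pair_counts.values())
row = Counter()  # marginal count of each word over all of its pairs
for (w, _), n in pair_counts.items():
    row[w] += n

def ppmi(w, c):
    """max(0, log P(w, c) / (P(w) P(c))), estimated from window counts."""
    n = pair_counts[(w, c)]
    if n == 0:
        return 0.0
    return max(0.0, math.log(n * total / (row[w] * row[c])))

vocab = sorted(row)

def embedding(w):
    """A word's vector: its PPMI score with every word in the vocabulary."""
    return [ppmi(w, c) for c in vocab]

print(vocab)
print([round(x, 2) for x in embedding("cat")])
```

Words that appear in similar contexts end up with similar rows, which is what lets these vectors carry some notion of meaning that BoW and TF-IDF lack.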

## Pretrained Embeddings: Word2Vec and Variants

Predictive models learn their vectors by minimizing a prediction loss, such as the loss incurred when predicting a target word from the vectors of its surrounding context words.
Word2Vec is a predictive embedding model. There are two main Word2Vec architectures used to produce a distributed representation of words:

 - Continuous bag-of-words (CBOW): the model predicts the current word from a window of surrounding context words. The order of the context words does not influence the prediction (bag-of-words assumption).
 - Continuous skip-gram: the model uses the current word to predict the surrounding window of context words, and weighs nearby context words more heavily than more distant ones. While order is still not captured, each context vector is weighed and compared independently, whereas CBOW compares against the average of the context.
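To make the difference between the two architectures concrete, the sketch below generates the training examples each one would see for a short sentence: CBOW predicts the center word from its context, while skip-gram predicts each context word from the center word. It only builds the examples and does not train a model; the function names, the sample sentence, and the window size are assumptions for illustration.

```python
def cbow_examples(tokens, window=2):
    """(context words, target word): the context predicts the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """(center word, context word): the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()

print(cbow_examples(tokens, window=1)[1])
# (['the', 'brown'], 'quick')
print(skipgram_examples(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW yields one example per position with the whole context grouped together, while skip-gram yields one example per (center, neighbor) pair, which is part of why it has more updates to perform.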

CBOW is faster to train, while skip-gram is slower but does a better job on infrequent words.

Both CBOW and skip-gram are “predictive” embeddings, and both only take local context windows into account, so Word2Vec does not take advantage of global context. GloVe embeddings, by contrast, leverage the same intuition behind the co-occurrence matrix used by earlier distributional embeddings, but learn dense word vectors by fitting them to the global co-occurrence counts, in effect factorizing the co-occurrence matrix into more expressive word vectors.

FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of these representations are then averaged into one vector at each training step. While this adds a lot of computation to pre-training, it enables word embeddings to encode sub-word information.
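The character n-grams that FastText works with can be sketched as follows. The word is wrapped in `<` and `>` boundary markers so that prefixes and suffixes are distinguishable from n-grams inside the word. The function name and the n-gram range used here are assumptions for illustration, and the sketch covers only the n-gram extraction, not the training.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word`, including boundary markers."""
    marked = f"<{word}>"  # mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a rare or unseen word still shares n-grams with known words, its vector can be assembled from sub-word pieces instead of being missing entirely.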

## Contextual Embeddings

ELMO

BERT

ETC