# The Basics of Word Embeddings

In the last chapter we saw why NLP and representations of natural language concepts are difficult. Now let's take a look at some of the simple ways in which we can construct vector representations at the word level. The vector representation of a word in NLP is called a **word embedding**.

Word embeddings are a very useful representation of vocabulary for machine learning purposes. The goal of word embeddings is to capture contextual, semantic, and syntactic similarity, and to represent these aspects through geometric relationships between the embedding vectors. Formulating these embeddings is often hard because we need to isolate what we actually want to know about the words. Do we want words that share a part of speech to appear closer together? Or do we want words with similar meanings to appear geometrically closer than those without? How do we derive any of these relationships, and how do we represent these features using simple means?

Different models answer these questions in different ways, and the answers have shifted as the field has developed, changing what word embeddings actually do for the downstream task.


## Common types of word embeddings

By transforming textual data into numerical representations, we can train mathematical models that approximate an intelligent understanding of language. The process of turning text into numbers is commonly known as “vectorization” or “embedding”. These techniques are functions that map words onto vectors of real numbers. Together, these vectors form a vector space, an algebraic model in which the rules of vector addition and measures of similarity apply.

Using the word embedding technique word2vec, researchers at Google were able to quantify word relationships in an algebraic model. This notebook goes into depth on how word2vec was created and the basic mathematical principles behind each step of the model's creation.

Such a vector space is a very useful way of understanding language. We can find “similar words” in a vector space by finding inherent clusters of vectors. We can determine word relations using vector addition. We can measure how similar two words are by measuring the angle between their vectors or by examining their dot product.

There are more ways of vectorizing text than mapping every word to a vector. We can also map **documents**, characters, or groups of words to vectors.
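
As a minimal sketch of the angle-based comparison described above, the snippet below computes the dot product and cosine similarity of two vectors in plain Python. The three-dimensional vectors and the names `king`, `queen`, and `apple` are made-up toy values for illustration, not the output of any trained model.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy embeddings, chosen by hand for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: nearly the same direction
print(cosine_similarity(king, apple))  # noticeably smaller
```

Cosine similarity depends only on the angle between the vectors, not on their lengths, which is why it is often preferred over the raw dot product when vector magnitudes vary.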

“Document” is a term that gets thrown around a lot in the NLP field. It refers to an unbroken entity of text, usually one that is of interest to the analysis task. For example, if you are trying to create an algorithm to identify spam emails, each email would be its own document, and an analysis of the emails would be considered a document-level analysis. 

Vectorizing documents makes it possible to compare text at the document level, which is useful in many applications including topic modeling and text classification. Vectorizing groups of words helps us differentiate between the senses of a word with more than one meaning. For example, “crash” can refer to a “car crash”, a “stock market crash”, or crashing (intruding on) a party.

In addition, creating these document level vectors can help us create 

The underlying mechanism for creating these vectors is to examine the context in which words appear. We can examine how often a certain word appears in each document, or how often two words co-occur.
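
To make the co-occurrence idea concrete, here is a rough sketch that counts how often pairs of words appear in the same document. The tiny corpus and the choice to treat a whole document as the context are assumptions made for illustration; real systems typically use a sliding window over much larger text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string is one document.
corpus = [
    "the car crash blocked the road",
    "the stock market crash hurt investors",
]

cooccurrence = Counter()
for document in corpus:
    # Use the sorted set of distinct words so each pair is counted
    # once per document and always stored in alphabetical order.
    words = sorted(set(document.split()))
    for pair in combinations(words, 2):
        cooccurrence[pair] += 1

print(cooccurrence[("crash", "the")])  # pair appears in both documents
print(cooccurrence[("car", "crash")])  # pair appears in one document
```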

All of these embedding techniques rely on the **distributional hypothesis**, the assumption that “words which are used and occur in the same contexts tend to purport similar meaning.”

### Count Based Vectors

This is one of the simplest methods of embedding words into numerical vectors. It is not often used in practice because it oversimplifies language, but it is often the first embedding technique taught in the classroom.

Let’s consider the following documents. If it helps, you can imagine that they are text messages shared between friends.

- Document 1: High five!
- Document 2: I am old.
- Document 3: She is five.

The *vocabulary* we obtain from this set of documents is (High, five, I, am, old, she, is). We will ignore punctuation for now, although depending on our use case it can also make sense to incorporate it into our vocabulary.

We can create a matrix representing the relationship between each term from our vocabulary and the document. Each element in the matrix represents how many times that term appears in that particular document.

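| Term | Document 1 | Document 2 | Document 3 |
|------|:----------:|:----------:|:----------:|
| High |     1      |     0      |     0      |
| five |     1      |     0      |     1      |
| I    |     0      |     1      |     0      |
| am   |     0      |     1      |     0      |
| old  |     0      |     1      |     0      |
| she  |     0      |     0      |     1      |
| is   |     0      |     0      |     1      |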

Using this matrix, we can obtain a vector for each word as well as for each document. We can vectorize “five” as [1,0,1] (its counts across the three documents) and “Document 2” as [0,0,1,1,1,0,0] (its counts across the seven vocabulary terms).
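
The construction above can be sketched in a few lines of plain Python. The helper names (`tokenize`, `word_vector`, `document_vector`) are illustrative, and lowercasing every token is an assumption made here for simplicity, so “High” and “I” appear as `high` and `i`.

```python
documents = ["High five!", "I am old.", "She is five."]

def tokenize(text):
    """Lowercase, drop the punctuation used in these examples, split on spaces."""
    return text.lower().replace("!", "").replace(".", "").split()

tokenized = [tokenize(doc) for doc in documents]

# Vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# matrix[i][j] = number of times vocabulary[i] appears in documents[j].
matrix = [[tokens.count(term) for tokens in tokenized] for term in vocabulary]

def word_vector(term):
    """Row of the matrix: one count per document."""
    return matrix[vocabulary.index(term)]

def document_vector(doc_index):
    """Column of the matrix: one count per vocabulary term."""
    return [row[doc_index] for row in matrix]

print(vocabulary)           # ['high', 'five', 'i', 'am', 'old', 'she', 'is']
print(word_vector("five"))  # [1, 0, 1]
print(document_vector(1))   # Document 2: [0, 0, 1, 1, 1, 0, 0]
```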

This representation, known as bag of words, is not a good representation of language, especially when you have a small vocabulary. It ignores word order and word relationships, and it produces sparse vectors that are largely filled with zeros. We also see from our small example that the words “I”, “am”, and “old” are mapped to the same vector. This implies that these words are similar, something we know not to be true.

Sparse, vocabulary-sized inputs are also costly for neural networks. The weight matrices connecting our word-level inputs to the network's hidden layers would each be $v \times h$,
where $v$ is the size of the vocabulary and $h$ is the size of the hidden layer.
With 100,000 words feeding into an LSTM layer with $1000$ nodes, the model would need to learn
$4$ different weight matrices (one for each of the LSTM gates), each with 100 million weights, and thus 400 million parameters in total.
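
The arithmetic above can be checked with a few lines of Python. The variable names are illustrative, and the count covers only the input-to-hidden weight matrices described in the text, not recurrent weights or biases.

```python
vocabulary_size = 100_000  # v: one input dimension per word
hidden_size = 1_000        # h: number of LSTM nodes
num_gate_matrices = 4      # one input weight matrix per gate, as in the text

weights_per_matrix = vocabulary_size * hidden_size
total_weights = num_gate_matrices * weights_per_matrix

print(f"{weights_per_matrix:,} weights per matrix")  # 100,000,000
print(f"{total_weights:,} input weights in total")   # 400,000,000
```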

### TF/IDF

TF-IDF is another frequency-based method, but unlike count vectorization it takes into account the occurrence of a word not just in a single document but across the entire corpus. So, what is the rationale behind this?

Common words like ‘is’, ‘the’, and ‘a’ tend to appear far more frequently than the words that are important to a document. For example, a document A on guerrilla warfare is going to contain more occurrences of the word “guerrilla” than other documents do. But common words like “the” are going to be present at high frequency in almost every document.

Ideally, what we would want is to scale down the importance of certain common words occurring in most documents and scale up the importance of words that appear in a smaller subset of documents.

TF-IDF works exactly this way, penalising these common words by assigning them lower weights while giving importance to words like “guerrilla” in a particular document.

So, how does TF-IDF *actually* work?

Consider the term counts for the two documents below:

**Document 1**

| Term      | Count |
|-----------|:-----:|
| This      |   1   |
| is        |   1   |
| guerrilla |   4   |
| warfare   |   2   |

**Document 2**

| Term     | Count |
|----------|:-----:|
| This     |   1   |
| is       |   2   |
| about    |   1   |
| TF-IDF   |   1   |


Now, let us define a few terms related to TF-IDF.

TF stands for term frequency. This can be mathematically defined as


<center>$TF = \frac{\text{Number of times term t appears in a document}}{\text{Number of terms in the document}}$</center>

So, 


<center>$TF(This,Document1) = \frac{1}{8} $</center>


<center>$TF(This, Document2) = \frac{1}{5}$</center>

TF denotes the contribution of the word to the document, i.e. words relevant to the document should be frequent. For example, a document about guerrilla warfare should contain the word ‘guerrilla’ many times.


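IDF stands for inverse document frequency. It is defined as

<center>$IDF = \log_{10}\left(\frac{N}{n}\right)$</center>

where `N` is the number of documents in the corpus and `n` is the number of documents in which the term t appears. The logarithm is taken base 10, which matches the values computed below.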

So, 


<center>$IDF(This) = \log_{10}\left(\frac{2}{2}\right) = 0$</center>

So, how do we explain the reasoning behind IDF? If a word appears in all the documents, then that word is probably not relevant to any particular document. But if it appears in only a subset of documents, then the word is probably of some relevance to the documents it is present in.

Let us compute IDF for the word ‘guerrilla’.


<center>$IDF(guerrilla) = \log_{10}\left(\frac{2}{1}\right) = 0.301$</center>

Now, let us compare the TF-IDF for a common word, ‘This’, and a word, ‘guerrilla’, which seems to be of relevance to Document 1.


<center>$TFIDF(This,Document1) = (\frac{1}{8}) * (0) = 0$</center>


<center>$TFIDF(This, Document2) = (\frac{1}{5}) * (0) = 0$</center>


<center>$TFIDF(guerrilla, Document1) = (\frac{4}{8}) * 0.301 \approx 0.15$</center>

As you can see, for Document 1 the TF-IDF method heavily penalises the word ‘This’ but assigns greater weight to ‘guerrilla’. In other words, ‘guerrilla’ is an important word for Document 1 in the context of the entire corpus.
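
The worked example can be reproduced with a short Python sketch. The function names `tf`, `idf`, and `tf_idf` are illustrative, the counts are those of the two example documents, and the base-10 logarithm is used because it reproduces the 0.301 value above.

```python
import math

# Term counts for the two example documents.
doc1 = {"This": 1, "is": 1, "guerrilla": 4, "warfare": 2}
doc2 = {"This": 1, "is": 2, "about": 1, "TF-IDF": 1}
corpus = [doc1, doc2]

def tf(term, doc):
    """Term frequency: count of the term divided by total terms in the document."""
    return doc.get(term, 0) / sum(doc.values())

def idf(term, corpus):
    """Inverse document frequency, base 10.

    Assumes the term appears in at least one document of the corpus.
    """
    n = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n)

def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("This", doc1, corpus))                 # 0.0
print(tf_idf("This", doc2, corpus))                 # 0.0
print(f"{tf_idf('guerrilla', doc1, corpus):.2f}")   # 0.15
```

Library implementations often use a smoothed or natural-log variant of IDF, so their values will generally differ from this hand calculation.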