### Co-occurrence matrices with a fixed context window

The big idea: similar words tend to occur together and to appear in similar contexts. For example, take "Guava is a fruit" and "Grapefruit is a fruit."
Guava and grapefruit share a similar context, namely "fruit".

Before we dive into the details of how a co-occurrence matrix is constructed, there are two concepts that need to be clarified – what do Co-Occurrence and Context Window mean here?

Co-occurrence: For a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together in a context window.

Context window: A context window is specified by a number and a direction. So what does a context window of 2 (around) mean? Let us see an example below.

 

| **Fabulous**	| **Fairytale**	| _Fox_ | **Flew** | **Far** | From |	Five | Feet | Forwards |
|---|:--:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|

The words "Fabulous", "Fairytale", "Flew", and "Far" (bolded) are a 2 (around) context window for the word ‘Fox’ and for calculating the co-occurrence only these words will be counted. 
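As a minimal sketch of the idea, the window can be read off with list slicing. The helper name `context_window` is made up for this illustration, and the sentence is simply split on whitespace.

```python
def context_window(tokens, index, size=2):
    """Return the words within `size` positions on either side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

sentence = "Fabulous Fairytale Fox Flew Far From Five Feet Forwards".split()
fox_context = context_window(sentence, sentence.index("Fox"), size=2)
print(fox_context)  # ['Fabulous', 'Fairytale', 'Flew', 'Far']
```

Words near the start or end of the sentence simply get a shorter window on that side.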

Now, let us take an example corpus to calculate a co-occurrence matrix.

Corpus = He is not lazy. He is intelligent. He is smart.

 

| - | He | is | not | lazy | intelligent | smart |
|---|:--:|:--:|:---:|:----:|:-----------:|:-----:|
| He | 0 | 4 | 2 | 1 | 2 | 1 |
| is | 4 | 0 | 1 | 2 | 2 | 1 |
| not | 2 | 1 | 0 | 1 | 0 | 0 |
| lazy | 1 | 2 | 1 | 0 | 0 | 0 |
| intelligent | 2 | 2 | 0 | 0 | 0 | 0 |
| smart | 1 | 1 | 0 | 0 | 0 | 0 |


Let us understand this co-occurrence matrix by seeing two examples in the table above.

The cell in the row "He" and the column "is" holds the number of times "He" and "is" appear together within a context window of 2, which turns out to be 4. Three of these come from the adjacent pair "He is" at the start of each sentence. The fourth comes from "is intelligent. He", where the "is" of the second sentence sits two positions before the "He" of the third, so the window is allowed to span the sentence boundary.

In contrast, the word "lazy" never appears with "intelligent" inside the context window, so the cell in the row "lazy" and the column "intelligent" is 0.
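The counts in the matrix can be reproduced with a short sketch. It assumes that punctuation is dropped, that the window may cross sentence boundaries, and that each pair of positions is counted once; the function names are made up for this illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count pairs of positions that are at most `window` words apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only to the right so each pair of positions is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[frozenset((word, tokens[j]))] += 1
    return counts

def cooccurrence(counts, w1, w2):
    """Look up the symmetric count for the pair (w1, w2)."""
    return counts[frozenset((w1, w2))]

corpus = "He is not lazy. He is intelligent. He is smart."
tokens = corpus.replace(".", "").split()
counts = cooccurrence_counts(tokens, window=2)

vocab = ["He", "is", "not", "lazy", "intelligent", "smart"]
matrix = [[cooccurrence(counts, row, col) for col in vocab] for row in vocab]
for word, row in zip(vocab, matrix):
    print(f"{word:<12}", row)
```

Because the pair is stored as an unordered set, the resulting matrix is symmetric, just like the table.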

### So... what now — after you have vectorized the text?

Numerical representations allow us to use numerical classification methods to analyze our text. If we want to build a document classifier, we can obtain document-level vectors for each review and use them as feature vectors to predict the class label. After one-hot encoding the classes, we can plug this data into any type of classifier.

Textual data is rich and high-dimensional, which makes more complex classifiers such as neural networks well suited to NLP.

To give another simple example of one-hot encoding: each word is represented as a one-hot vector whose length is the number of unique words in the corpus vocabulary, and whose single 1 sits at the position where that word first entered the vocabulary. For example, suppose my vocabulary came from the following sentence:

> The boy liked the turtles.

The sentence is five words long, but the vocabulary has only four unique words: "the", "boy", "liked", and "turtles". So the embedding vector for each word has length four. For "the", the embedding would be `[1, 0, 0, 0]`; for "boy", `[0, 1, 0, 0]`; and for "liked", `[0, 0, 1, 0]`. The second "the" reuses the one-hot vector assigned earlier, `[1, 0, 0, 0]`, as does every later occurrence of "the". Lastly, the embedding for "turtles" would be `[0, 0, 0, 1]`.

But measured against the questions we posed earlier, these vectors capture nothing about syntactic or semantic similarity; they only record the most basic feature, whether or not a word occurred. In addition, computing with one-hot vectors becomes a problem: their inherent sparsity makes calculations increasingly inefficient as the vocabulary grows.
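The encoding described above can be sketched in a few lines. This is only an illustration: it assumes the sentence is lowercased and split on whitespace (so that "The" and "the" count as the same word), and the helper names are made up.

```python
def build_vocabulary(tokens):
    """Map each unique word to an index, in order of first occurrence."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

tokens = "The boy liked the turtles".lower().split()
vocab = build_vocabulary(tokens)
encoded = [one_hot(token, vocab) for token in tokens]
for token, vector in zip(tokens, encoded):
    print(f"{token:<8}", vector)
```

Both occurrences of "the" map to the same vector, and no two distinct words share any non-zero position, which is why these vectors carry no notion of similarity.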

Fortunately, it turns out that a number of efficient techniques
can quickly discover broadly useful word embeddings in an *unsupervised* manner.

These embeddings map each word onto a low-dimensional vector $w \in \mathbb{R}^d$ with $d$ commonly chosen to be roughly $100$.
Intuitively, these embeddings are chosen based on the contexts in which words appear.
Words that appear in similar contexts, like "tennis" and "racquet," should have similar embeddings
while words that are not alike, like "rat" and "gourmet," should have dissimilar embeddings.

We will explore the much more complex set of embeddings created using shallow neural networks by focusing on word2vec models. Trained over large corpora, word2vec uses unsupervised learning to determine semantic and syntactic meaning from word co-occurrence, which is used to construct vector representations for every word in the vocabulary.

Word2vec was developed at Google by a research team led by Tomas Mikolov. The research paper takes a while to read, but is worth the time and effort.

The model uses a shallow, two-layer neural network to find the vector mappings for each word in the corpus. The network is trained to predict known co-occurrences in the corpus, and the weights of its hidden layer are then used as the word vectors. Somewhat surprisingly, word vectors created using this method preserve many of the linear regularities found in language.

The *Efficient Estimation of Word Representations in Vector Space* paper shares the following result:
"Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen."
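To make the word offset technique concrete, here is a toy sketch. The two-dimensional vectors below are invented purely for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not real word2vec embeddings. The `analogy` helper is also hypothetical, and it excludes the three input words from the candidates, which is a common convention.

```python
import math

# Made-up 2-d vectors for illustration only: [royalty, gender].
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With real embeddings the arithmetic is the same; only the vectors come from a trained model instead of being written by hand.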

There are two model architectures used to train word2vec: Continuous Bag of Words and Skip Gram. These models determine how textual data is passed into the neural network. Both of these architectures use a context window to determine contextually similar words. A context window with a fixed size n means that all words within n units from the target word belong to its context.

Consider the following example with a fixed window size of 2:

> "The quick brown fox jumped over the lazy dog."

Fox is our target word and quick, brown, jumped, over belong to the context of fox. The assumption is that with enough examples of contextual similarity, the network is able to learn the correct associations between words.
This assumption is in line with the distributional hypothesis we presented earlier, which states that “words which are used and occur in the same contexts tend to purport similar meaning.”

The implementation of the context window in word2vec is dynamic.
A dynamic context window has a maximum window size. For each target word, the actual window size is sampled uniformly between 1 and the maximum, so a context word at distance d from the target is included with probability (L - d + 1)/L, where L is the maximum window size.
Consider the target word fox using a dynamic context window with a maximum window size of 2. (brown, jumped) have a 2/2 = 1 probability of being included in the context since they are one word away from fox. (quick, over) have a 1/2 probability of being included in the context since they are two words away from fox.
Using this concept, the Continuous Bag of Words and Skip Gram models separate the data into observations of target words and their context.
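A small simulation can illustrate the dynamic window. The sketch assumes the window size is drawn uniformly from 1 up to the maximum each time a target is visited, which reproduces the probabilities in the fox example; the function name and the seeded random generator are choices made for this illustration.

```python
import random

def dynamic_context(tokens, index, max_window=2, rng=random):
    """Sample a window size uniformly from 1..max_window, then take that window."""
    size = rng.randint(1, max_window)
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

tokens = "the quick brown fox jumped over the lazy dog".split()
fox = tokens.index("fox")

rng = random.Random(0)  # seeded so the run is repeatable
trials = 10_000
hits = {"brown": 0, "quick": 0}
for _ in range(trials):
    context = dynamic_context(tokens, fox, max_window=2, rng=rng)
    for word in hits:
        hits[word] += word in context

for word, count in hits.items():
    print(word, count / trials)
```

The word next to the target is always included, while the word two positions away shows up in roughly half of the samples, so nearer words end up weighted more heavily during training.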

### Continuous Bag of Words

We structure the data such that the context is used to predict the target word. For example, if our context is (quick, brown, jumped, over), we use those words as the features for predicting the class fox.
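As a rough sketch, the Continuous Bag of Words layout can be generated by sliding a fixed window of size 2 over the example sentence. The function name `cbow_examples` is made up for this illustration.

```python
def cbow_examples(tokens, size=2):
    """Build one (context words, target word) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumped over the lazy dog".split()
cbow = cbow_examples(tokens, size=2)
print(cbow[3])  # (['quick', 'brown', 'jumped', 'over'], 'fox')
```

Each example pairs a whole bag of context words with a single target, so the sentence yields one example per word.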

### Skip Gram

We structure the data such that the target word is used to predict the context. For example, we use the feature (fox) to predict the context (quick, brown, jumped, over).
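A comparable sketch for Skip Gram is shown here. It assumes the common formulation in which the context is broken into individual (target, context word) pairs, with a fixed window of size 2; the function name `skipgram_examples` is made up for this illustration.

```python
def skipgram_examples(tokens, size=2):
    """Build one (target word, context word) pair per context position."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
        pairs.extend((target, word) for word in context)
    return pairs

tokens = "the quick brown fox jumped over the lazy dog".split()
skipgram = skipgram_examples(tokens, size=2)
fox_pairs = [pair for pair in skipgram if pair[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumped'), ('fox', 'over')]
```

The same window therefore produces several training pairs per target word, one for each word in its context.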

### Building the Neural Network

Word2vec trains a shallow neural network over data structured according to either the Continuous Bag of Words or the Skip Gram architecture. Instead of using the model for prediction, we use the hidden-layer weights of the trained network as the word vectors.
Assuming a Continuous Bag of Words architecture with a fixed context window of 1 word, this is what the process would look like. First, the corpus:

> I like math
>
> I like programming
>
> Today is Friday
>
> Today is a good day

To make things even easier, we can require the context window to include only the word that follows the target. We can also assume that the context of the last word of a sentence is the first word of the next sentence. Under such rules:

- like is the context of target I
- math is the context of target like
- programming is also the context of target like

Even with such a simple corpus, we can begin to recognize some patterns. “Math” and “programming” are both context to “like”. While this might not be picked up by the model, both of these words can be understood as things that I like.
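The pairing rules above can be sketched as follows. The sketch assumes that the sentences are joined into one token stream, so that the last word of a sentence is followed by the first word of the next, and that each target's context is the single word that follows it. Variable names are made up for this illustration.

```python
from collections import defaultdict

sentences = [
    "I like math",
    "I like programming",
    "Today is Friday",
    "Today is a good day",
]
# Joining the sentences lets the context of a sentence-final word
# be the first word of the next sentence.
tokens = " ".join(sentences).split()

contexts = defaultdict(list)
for target, context in zip(tokens, tokens[1:]):
    if context not in contexts[target]:
        contexts[target].append(context)

for target, words in contexts.items():
    print(f"{target:<12}", words)
```

The entry for "like" collects both "math" and "programming", which is the pattern described in the text. The very last word of the corpus has no following word, so it never appears as a target.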

#### Step 1
The first step is to one-hot encode our classes (the words in our vocabulary), as we did above with the "The boy liked the turtles" example: I, like, math, programming, Today, is, Friday, a, good, day

#### Step 2
Create a feed-forward neural network with one hidden layer and an output layer that uses the softmax activation function. Each training example uses the one-hot encoded context vector as input to predict the one-hot encoded target vector.
The number of neurons in the hidden layer corresponds to the number of dimensions in the final word vectors.

#### Step 3
Obtain the weights of the hidden layer. Each row in the weight matrix is the vector for one word in the vocabulary.
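The three steps can be sketched end to end with NumPy on the toy corpus. This is a minimal illustration, not the actual word2vec implementation: it assumes a linear hidden layer, plain full-batch gradient descent, a context made of the single following word, and arbitrary choices for the embedding size `d`, the learning rate, and the number of iterations.

```python
import numpy as np

sentences = ["I like math", "I like programming",
             "Today is Friday", "Today is a good day"]
tokens = " ".join(sentences).split()
vocab = list(dict.fromkeys(tokens))            # unique words, first-occurrence order
index = {word: i for i, word in enumerate(vocab)}
V, d = len(vocab), 3                           # d: assumed embedding size

# Step 1: one-hot rows. The context is the next word, the target the current word.
eye = np.eye(V)
X = eye[[index[w] for w in tokens[1:]]]        # one-hot contexts (inputs)
Y = eye[[index[w] for w in tokens[:-1]]]       # one-hot targets (labels)

# Step 2: one linear hidden layer followed by a softmax output layer.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(V, d))
W_output = rng.normal(scale=0.1, size=(d, V))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W_hidden, W_output):
    probs = softmax(X @ W_hidden @ W_output)
    return -np.mean(np.sum(Y * np.log(probs), axis=1))

initial_loss = loss(W_hidden, W_output)
learning_rate = 0.5
for _ in range(500):
    hidden = X @ W_hidden
    probs = softmax(hidden @ W_output)
    grad_logits = (probs - Y) / len(X)         # gradient of the mean cross-entropy
    grad_output = hidden.T @ grad_logits
    grad_hidden = X.T @ (grad_logits @ W_output.T)
    W_output -= learning_rate * grad_output
    W_hidden -= learning_rate * grad_hidden
final_loss = loss(W_hidden, W_output)

# Step 3: each row of the hidden weight matrix is one word's vector.
word_vectors = {word: W_hidden[index[word]] for word in vocab}
print(initial_loss, final_loss)
print(word_vectors["math"])
```

Because the inputs are one-hot, multiplying by `W_hidden` just selects a row, which is why the rows of that matrix can be read directly as word vectors. A corpus this small is far too little data for the vectors to be meaningful.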

Realistically, this is not something that we do very often. Good word2vec models require a very large corpus, on the order of billions of words. Fortunately, pre-trained models are easy to find and use. You can download the word2vec model trained on the 100-billion-word Google News corpus from Google, or you can use GluonNLP to load a set of pre-trained word embeddings.

Here, we'll show you how to create and train the model but, in the end, we will use pre-built word embeddings that have been independently verified for accuracy for testing and understanding.