# Word Embeddings


## Description


Computers can process enormous amounts of data and produce useful output, but they cannot work on natural language directly. Machine learning and deep learning algorithms take their input as numbers (vectors) rather than as raw text. Before applying any such algorithm, the words have to be converted into a numerical representation that the algorithm can work with. So we need a way to convert text into sensible numerical data.

In this unit we will learn how to implement word representations efficiently and cover the concepts of word2vec, negative sampling and GloVe.


## Overview

- Need for word representation
- Word2vec
- Negative sampling
- GloVe


## Pre-requisite

- Python (along with NumPy and pandas libraries)
- Basic statistics 


## Learning Outcomes

- After completing this unit you will be able to represent words in vector form, understand word2vec, GloVe and negative sampling, and develop a text classification model.

# Chapter 1: Word representation

## 1.1 Introduction

Consider the following sentence pairs:

- I feel great today
- I feel good today

Both of them have very similar meanings. Traditional natural language processing approaches (such as the bag-of-words model) treat words as discrete atomic symbols.

Let us create a one-hot vector for each word in the vocabulary V = {I, feel, good, great, today}. The length of each vector is 5 (equal to the size of V).

Therefore we will get I = [1,0,0,0,0], feel=[0,1,0,0,0], good=[0,0,1,0,0], great=[0,0,0,1,0], today=[0,0,0,0,1]

Based on the above positions, the bag-of-words vectors of the two sentences are:

Sentence 1: [1,1,0,1,1]

Sentence 2: [1,1,1,0,1]

If these encodings are visualised in a 5-dimensional space, where each word occupies one dimension, 'good' and 'great' are as far apart as 'today' and 'feel', which does not reflect their meanings.
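The point above can be checked numerically. The following is a minimal sketch using NumPy; the helper names `sentence_vector` and `distance` are illustrative and not part of any library.

```python
import numpy as np

# Vocabulary in the same order as the one-hot vectors above.
vocab = ["I", "feel", "good", "great", "today"]
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

def sentence_vector(sentence):
    """Bag-of-words vector: the sum of the one-hot vectors of the words."""
    return sum(one_hot[w] for w in sentence.split())

def distance(w1, w2):
    """Euclidean distance between the one-hot vectors of two words."""
    return float(np.linalg.norm(one_hot[w1] - one_hot[w2]))

s1 = sentence_vector("I feel great today")
s2 = sentence_vector("I feel good today")
print(s1, s2)

# Every pair of distinct words is exactly the same distance apart.
print(distance("good", "great"), distance("today", "feel"))
```

Both distances come out equal, so the encoding carries no notion of 'good' being closer in meaning to 'great' than to 'today'.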



Consider another set of sentences:

- An Apple a day keeps the doctor away
- Apple hit with securities-fraud lawsuit for hiding iPhone sales drop

In this case the word 'Apple' will be given a specific vector, or perhaps a unique id such as 'id25'. This is not useful, because such encodings are arbitrary and carry no information about whether 'Apple' is a fruit or a company. It also means that the model can leverage very little of what it has learned about 'apple' when it is processing data about, say, 'orange' (both are fruits, grown in winter, etc.). Representing words as unique, discrete values furthermore leads to data sparsity, and most neural network architectures do not work well with very high-dimensional, sparse vectors.

Humans can read raw text and understand what it says, but for a computer to process text, all the necessary information has to be encoded. To solve that, we need a representation for words that captures their meanings through their semantic relationships and the different contexts they are used in.

Enter 'Word Embeddings'

##  What are Word Embeddings?

Put simply, word embeddings are words (text) converted into numerical vectors. This approach was a game changer for natural language processing with deep learning.

The distributed representation is learned based on how the words are used. This ensures that words used in similar ways have similar representations.

Word embeddings can be broadly divided into two categories:

- Frequency based Embedding

- Prediction based Embedding

![](final_images/word_embeddings_type.jpg)

Let's look at both of them in detail.



### 1.2 Frequency based Embedding


These are word embeddings that use how frequently words occur to capture meaning.

The following three types of vectors are popular in this category:

- Count Vector
- TF-IDF Vector
- Co-occurrence Vector

**Count Vector**

Suppose a corpus C contains D documents {d1, d2, ..., dD} and N unique tokens. The N tokens form our vocabulary. The count vector representation is a matrix M in which row i contains the frequency of each token in document d(i). The size of M is D X N.

Let's look at an example to understand it better.

Consider the two documents:

D1: cat runs behind rat. rat flees

D2: dog runs behind cat. cat flees

The dictionary created will be the list of unique tokens = ['cat', 'dog', 'runs', 'behind', 'rat', 'flees']

Here, D=2, N=6

The count matrix M of size 2 X 6 will be represented as:

|-|cat|dog|runs|behind|rat|flees|
|---|---|---|---|---|---|---|
|D1|1|0|1|1|2|1|
|D2|2|1|1|1|0|1|


Each column of M is the word vector for the corresponding word. For example, the word vector for 'dog' is [0,1].


Because the method is so naive, it requires a very large amount of data to be effective. Also, in a count vector every occurrence is weighted equally, regardless of how common the word is or the context it appears in. However, in almost every NLP task some words are more relevant than others.
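The count matrix for D1 and D2 can be reproduced by hand with the standard library. This is a small sketch, assuming a simple tokenizer (lowercase, full stops removed, split on whitespace); the `tokenize` helper is illustrative only.

```python
from collections import Counter

docs = {
    "D1": "cat runs behind rat. rat flees",
    "D2": "dog runs behind cat. cat flees",
}
vocab = ["cat", "dog", "runs", "behind", "rat", "flees"]

def tokenize(text):
    """Lowercase, drop full stops and split on whitespace (illustrative tokenizer)."""
    return text.lower().replace(".", " ").split()

# One row per document, one column per vocabulary token.
count_matrix = []
for text in docs.values():
    counts = Counter(tokenize(text))
    count_matrix.append([counts[token] for token in vocab])

for name, row in zip(docs, count_matrix):
    print(name, row)

# The word vector of a token is its column in the matrix.
dog_vector = [row[vocab.index("dog")] for row in count_matrix]
print("dog:", dog_vector)
```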



### TF-IDF vectorization

We have already encountered this technique while studying `Introduction to NLP`. 

To refresh, TF-IDF is used to gain information about a word from a corpus by calculating its term frequency (TF) and inverse document frequency (IDF).

For our 'word features', we want to give less importance to common words that occur in almost all documents and more importance to words that appear in only a subset of documents. TF-IDF weighs a keyword by its importance within a document and, more importantly, across the entire corpus of documents.

Each word in the corpus has its own TF and IDF scores. The product of TF and IDF is called the TF-IDF weight of that term.


#### TF-IDF calculation:

The tf-idf weight is composed of two terms:

- The first is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

- The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.


IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

Consider the following example:

There is a document containing 200 words in which the word cat appears 3 times. The term frequency for cat is then (3 / 200) = 0.015. Assume we have 10K documents and the word cat appears in 200 of these. The inverse document frequency is calculated as log_e(10000 / 200) = log_e(50) ≈ 3.91.

Therefore the tf-idf score of cat = 0.015 * 3.91 ≈ 0.0587.
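The arithmetic of this example can be checked in a few lines. The sketch below uses the natural logarithm, as in the IDF formula above (a base-10 logarithm would give a smaller IDF); the function names are illustrative.

```python
import math

def term_frequency(term_count, doc_length):
    """TF(t): occurrences of the term divided by the number of terms in the document."""
    return term_count / doc_length

def inverse_document_frequency(n_docs, n_docs_with_term):
    """IDF(t): natural log of (total documents / documents containing the term)."""
    return math.log(n_docs / n_docs_with_term)

tf = term_frequency(3, 200)                    # 'cat' appears 3 times in 200 words
idf = inverse_document_frequency(10_000, 200)  # 'cat' appears in 200 of 10K documents
tf_idf = tf * idf

print(round(tf, 3), round(idf, 3), round(tf_idf, 4))
```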
 

TF-IDF scores are somewhat better than count vectors at weighting different words, though they are still unable to capture word meaning.




**Co-occurrence matrix**

For a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window.


Now, let us take an example corpus to calculate a co-occurrence matrix.

Corpus = Dave is smart. Dave is not lazy. Dave is hardworking.

This is what the co-occurrence matrix of the above corpus looks like:

![](final_images/com.jpg)

For example, in the above corpus, `is` appears near `Dave` a total of 3 times, and `lazy` appears near `not` a total of 1 time.

The above matrix was for 6 unique words.

Let's say there are V unique words in the corpus. The co-occurrence matrix will then be of size V X V, which for a realistic vocabulary is very large and difficult to handle. So the co-occurrence matrix itself is not used as the word vector representation. Instead, it is decomposed using techniques like PCA or SVD to form the word vectors.


The matrix has to be computed only once and can be reused afterwards, which makes it fast in comparison to the other methods discussed. On the other hand, it requires a huge amount of memory to store.
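The counting and decomposition steps can be sketched with NumPy. The assumptions here are mine: a window of 1 word on each side, no co-occurrence across sentence boundaries, and a truncated SVD keeping `k = 2` dimensions. The matrix in the figure may have been built with different settings.

```python
import numpy as np

corpus = "Dave is smart. Dave is not lazy. Dave is hardworking."
window = 1  # assumption: one neighbour on each side counts as "near"

# Split into sentences, then into lowercase tokens.
sentences = [s.lower().split() for s in corpus.split(".") if s.strip()]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = number of times word j appears within the window of word i.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for words in sentences:
    for i, word in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[word], index[words[j]]] += 1

print(vocab)
print(cooc)

# Decompose the V x V matrix and keep the top-k components as word vectors.
U, S, Vt = np.linalg.svd(cooc.astype(float))
k = 2
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)
```

With these settings the count for (`is`, `dave`) is 3 and the count for (`lazy`, `not`) is 1, matching the two examples quoted above.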

Let's now try to implement the same on a corpus.


# TASK 1

Let's implement the count vectorizer and the TF-IDF vectorizer on a small text corpus (the corpus contains lines from the famous children's novel Alice in Wonderland).

- The corpus, stored in a dataframe (`corpus_df`), is already given to you along with the preprocessing function (`preprocess_document()`)

- Call the function `preprocess_document()` on the sentences of the corpus (the `Sentences` column of `corpus_df`) and store the result in `clean_corpus`


- Implement count vectorizer on the `clean_corpus`
       
       - Initialize a "CountVectorizer()" object
       
       - Fit and transform the 'clean_corpus' using the above created object.
       
       - Convert the above transformed corpus into array and store it in a variable called 'cv_matrix'
       
       - Get the vocabulary(word list) using "get_feature_names_out()" of the "CountVectorizer()" object(that you previously  
         created; older scikit-learn versions call this method "get_feature_names()") and store that in a variable called 'cv_vocab'
         
       - Create a dataframe with 'cv_matrix' and set its columns as 'cv_vocab' using pandas. Store the dataframe in 'cv_df'  
       
       - Print the dataframe to check the vectorization results

- Implement TF-IDF vectorizer on the `clean_corpus`
       
       - Initialize a "TfidfVectorizer()" object
       
       - Fit and transform the 'clean_corpus' using the above created object.
       
       - Convert the above transformed corpus into array and store it in a variable called 'tv_matrix'
       
       - Get the vocabulary(word list) using "get_feature_names_out()" of the "TfidfVectorizer()" object(that you previously  
         created; older scikit-learn versions call this method "get_feature_names()") and store that in a variable called 'tv_vocab'
         
       - Create a dataframe with 'tv_matrix' and set its columns as 'tv_vocab' using pandas. Store the dataframe in 'tv_df'  

       - Print the dataframe to check the vectorization results
       

# Hint

You can create the dataframe `'cv_df'` by writing code similar to:

```python
cv_df=pd.DataFrame(cv_matrix, columns=cv_vocab)
```
Similarly for `'tv_df'`


# Test Cases

#cv_df


(cv_df['alice'][0]==1) &(cv_df.shape==(8,34))

#tv_df


(tv_df['alice'][1]==0.22) & (tv_df.shape==(8,34))

```python
import re
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Make sure the NLTK stopword list is available locally
nltk.download('stopwords', quiet=True)


corpus = ['Alice was beginning to get very tired of sitting by her sister on the bank', 
          'What is the use of a book, thought Alice `without pictures or conversation',
          'There was nothing so very remarkable in that',
          'The Rabbit actually took a watch out its waist',
          'Alice started to her feet',
          'Alice opened the door and found that it led into a small passage',
          'And she went on planning to herself how she would manage it',
          'Alice took up the fan and gloves']


stop_words = nltk.corpus.stopwords.words('english')

def preprocess_document(corpus):
     
    # lower the string and strip spaces    
    corpus = corpus.lower()
    corpus = corpus.strip()
    
    # tokenize the words in document
    word_tokens = nltk.WordPunctTokenizer().tokenize(corpus)
    
    # remove stopwords
    filtered_tokens = [token for token in word_tokens if token not in stop_words]
    
    # join document from the tokens
    corpus = ' '.join(filtered_tokens)
    
    return corpus


# Loading the data 
corpus_df = pd.DataFrame({'Sentences': corpus})

# Vectorizing the function so that it is applied to every sentence of the corpus
preprocess_document = np.vectorize(preprocess_document)


# Code starts here

# Calling the function
clean_corpus = preprocess_document(corpus)


# Implementing countvectorizer

cv = CountVectorizer()
cv_matrix = cv.fit_transform(clean_corpus)
cv_matrix = cv_matrix.toarray()

# Getting the feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cv_vocab = cv.get_feature_names_out()

# Add the column names to features
cv_df=pd.DataFrame(cv_matrix, columns=cv_vocab)


# Implementing TFIDF vectorizer

tv = TfidfVectorizer()
tv_matrix = tv.fit_transform(clean_corpus)
tv_matrix = tv_matrix.toarray()

# Getting the feature names
tv_vocab = tv.get_feature_names_out()

# Add the column names to features
tv_df=pd.DataFrame(np.round(tv_matrix, 2), columns=tv_vocab)
```

# Success Message

Congrats! You have successfully implemented vectorization(count and tf-idf) on the corpus

# Chapter 2: Word2Vec

## 2.1 Introduction to Word2Vec



**Limitations of Frequency based representations**

The frequency based methods are easy to understand and easy to implement with standard machine learning libraries. However, when dealing with large datasets, they suffer from two problems:

- Arbitrariness: No useful information regarding the relationships that may exist between the words is captured

- Sparsity: The vectors are very high-dimensional and mostly zero, so more data is needed in order to successfully train statistical models

On large vocabularies these methods are also computationally expensive and often produce mediocre results.


However, with the advent of neural networks, this changed.


**Word rules problem**

Consider how images and audio are processed by computers. Their datasets are usually rich and high-dimensional, with encodings that are not comprehensible to humans. Still, machines (deep learning models) are able to use them for prediction (e.g. image recognition) because all the information is encoded in the data, so the relations between entities can be learned from it.


Following that inspiration, wouldn't using dense vector representations for words be a possible solution to overcome the arbitrariness and sparsity?
Yes, it would.

Word2Vec groups the vectors of similar words together in vector space.  
These vectors are distributed numerical representations of word features, such as the context or usage of individual words.

Put simply, it tries to detect similarities mathematically, and it does that without human intervention. These similarities help find word associations and topic clusters, which can be used for NLP tasks like sentiment analysis or recommendation systems.

**Neural Word Embeddings**

The vectors we use to represent words are called neural word embeddings. Rather than training against the input words themselves (through reconstruction, as an autoencoder does), word2vec trains each word against the words that neighbour it in the input corpus.

There are two popular word2vec methods:

- CBOW

- Skipgram

Let's understand them one by one.


### Continuous Bag of words (CBOW)

Before we explain CBOW, let's understand three important terms:

- Target word
- Context word
- Context window

Consider the sentence: 

"The dog jumped over the fence"

Instead of encoding each word independently, as the BOW (bag-of-words) model does, CBOW encodes a word with respect to the words around it. It does that by predicting a word from its surrounding words.

For instance, if you were trying to predict 'dog', the word 'dog' would be the `target` word and the remaining words 'The', 'jumped', 'over', 'the', 'fence' would be the `context` words.

The input to the CBOW model is the `context` words and the output is the `target` word.

`Context window` is the number of context words on each side of the target that we want to use to predict it.

![](final_images/context_window.jpg)


Window size=1 means that each word is a context word for the words immediately adjacent to it.

Window size=2 means that each word is a context word for the words up to two positions away on either side.

Notice that when the context window size is 0, it is simply the BOW architecture.


|-|The|dog|jumped|over|the|fence|
|---|---|---|---|---|---|---|
|BOW |1|1|1|1|1|1|


For simplicity, let's consider the above sentence with a context window of size 1. The training pairs then look like this:

![](final_images/cbow1.jpg)


For example, 'jumped' is the context for 'dog' and 'over'.
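The pairing of context and target words can be generated with a few lines of plain Python. This is a sketch; `context_target_pairs` is an illustrative helper, and it lowercases the sentence, which the figures above may not do.

```python
def context_target_pairs(sentence, window_size):
    """Return (context_words, target_word) for every position in the sentence."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window_size)
        # Up to window_size words to the left and to the right of the target.
        context = words[start:i] + words[i + 1:i + 1 + window_size]
        pairs.append((context, target))
    return pairs

pairs = context_target_pairs("The dog jumped over the fence", window_size=1)
for context, target in pairs:
    print(context, "->", target)
```

Words at the edges of the sentence simply get fewer context words; the Keras implementation later in this chapter pads such cases instead.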


Let's represent each word as a vector by simply one-hot encoding it. The input will then be a vector of size 1x6:

![](final_images/cbow2.jpg)


The output, similarly, will be a one-hot vector with 6 entries.

![](final_images/cbow_output.jpg)


This is what the model architecture looks like when the context window is 1 (in other words, the input is just one context word):


![](final_images/layers.jpg)

Following is the flow:

- The input layer and the output layer are both one-hot encoded with size [1 X V] (V=6 in our case).


- The input-hidden layer weight matrix is of size [V X N], and the hidden-output layer weight matrix is of size [N X V].

*Note:* Here `N` is arbitrary: it is the number of dimensions we choose to represent each word in. (N is also the number of neurons in the hidden layer; a good value can only be found by trying different model configurations.)


- CBOW differs from a standard MLP in that there is no non-linear `activation function` in the hidden layer (in other words, just linear activation). The input is simply multiplied by the `input-hidden layer` weights, and the result is multiplied by the `hidden layer-output` weights; a softmax over these output scores gives the predicted probability of each word.


- The error between the output and the target is calculated, and backpropagation is used to readjust the weights.

****
**The weight matrix between the input layer and the hidden layer is what is taken as the word vector representation** 

![](final_images/cbow3.jpg)

The goal of the whole CBOW process is simply to learn this input-hidden weight matrix (the output layer is needed only during training, to compute the prediction error)
****
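The flow for a single context word can be sketched in NumPy. This is an untrained toy with random weights, assuming N = 3 hidden dimensions and a softmax on the output scores (as in the Keras model later in this chapter); the names `W_in`, `W_out`, `one_hot` and `softmax` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["The", "dog", "jumped", "over", "the", "fence"]
V, N = len(vocab), 3              # vocabulary size, embedding size (assumed)

W_in = rng.normal(size=(V, N))    # input-hidden weights, [V X N]
W_out = rng.normal(size=(N, V))   # hidden-output weights, [N X V]

def one_hot(word):
    """Return the [1 X V] one-hot row vector of a word."""
    x = np.zeros((1, V))
    x[0, vocab.index(word)] = 1.0
    return x

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

x = one_hot("jumped")             # the single context word
hidden = x @ W_in                 # [1 X N], no non-linearity applied
scores = hidden @ W_out           # [1 X V]
probs = softmax(scores)           # predicted distribution over target words

# Multiplying a one-hot vector by W_in just selects one row of W_in,
# which is why that row is used as the word's vector.
print(np.allclose(hidden[0], W_in[vocab.index("jumped")]))
print(probs.round(3))
```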



What we learned just now was for a `single context` word. The following is the architecture for `multiple context` words:



![](final_images/layer3.jpg)

The modification in the above figure consists of replicating the input-to-hidden layer connections C times (where C is the number of context words) and adding a divide-by-C operation in the hidden layer neurons. The resulting `average vector` becomes the hidden activation.

So if we have, say, four context words for a single target word, we will have four initial hidden activations (one per context word, all computed with the same `input-hidden layer weights`), which are then `averaged` element-wise to obtain the final activation. The hidden layer-output multiplication and the output are then calculated the same way as for a `one context word` input.
 
**Note:** The figure above might lead one to think that CBOW uses `multiple input matrices`. That is not the case. It is the same matrix that is `receiving` multiple input vectors representing different context words.
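A short NumPy sketch of the averaging step, under the assumptions of a toy vocabulary of 6 words, N = 3, random weights, and four hypothetical context word indices:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 6, 3
W_in = rng.normal(size=(V, N))        # the single shared input-hidden matrix

context_ids = [0, 1, 3, 4]            # hypothetical indices of C = 4 context words
X = np.eye(V)[context_ids]            # [C X V], one one-hot row per context word

hidden_each = X @ W_in                # [C X N], one hidden activation per context word
hidden = hidden_each.mean(axis=0)     # element-wise average, the "divide by C" step

# Same result as averaging the corresponding rows of the shared matrix.
print(np.allclose(hidden, W_in[context_ids].mean(axis=0)))
```

All four context words pass through the same `W_in`, which is the point the note above makes.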


**Deep Dive(Optional)**

You can understand the mathematics of CBOW in more detail by checking out  `Unit 2` of [How exactly does word2vec work?](http://www.1-4-5.net/~dmm/ml/how_does_word2vec_work.pdf) by David Meyer


##### Advantages of CBOW:

- Works better than deterministic methods like TF-IDF, count vector.

 

##### Disadvantages of CBOW:

- As we have seen, with a context window larger than 1, CBOW takes the average of the context words and therefore loses quite some context (pun intended) while calculating the output.
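One concrete way to see what averaging discards is word order. The sketch below uses a made-up four-word vocabulary and random weights, and shows that two different orderings of the same context words produce the same hidden activation.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["dog", "bites", "man", "the"]      # toy vocabulary (assumption)
W_in = rng.normal(size=(len(vocab), 3))     # random, untrained input-hidden weights

def hidden_activation(context_words):
    """Average the input-hidden rows of the context words, as CBOW does."""
    ids = [vocab.index(w) for w in context_words]
    return W_in[ids].mean(axis=0)

a = hidden_activation(["dog", "bites", "man"])
b = hidden_activation(["man", "bites", "dog"])

# The order of the context words makes no difference to the average.
print(np.allclose(a, b))
```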




# Python implementation

Let's see how we can implement this in Python.

You can also refer to the [original code](https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html).



### Step 1:

Load the data and clean it.

```python

from keras.preprocessing import text
from keras.utils import np_utils
from keras.preprocessing import sequence
import pandas as pd
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from string import punctuation

# Loading the data
corpus=[nltk.corpus.gutenberg.words('shakespeare-caesar.txt')]

print("Sample Corpus(Julius Caesar- Shakespeare):\n")
print(corpus[0][16:])
print(corpus[0][150:])


# Terms to remove
r_terms = punctuation + '1234567890'

# Removing the above defined terms
norm_corpus = [[word.lower() for word in sent if word not in r_terms] for sent in corpus]

# Joining the sentence back    
norm_corpus = [' '.join(tok_sent) for tok_sent in norm_corpus]


# Keeping only documents with more than two tokens, capped at the first 1500 documents
norm_corpus = [tok_sent for tok_sent in norm_corpus if len(tok_sent.split()) > 2]
norm_corpus=norm_corpus[:1500]

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_corpus)
word2id = tokenizer.word_index

# Building the vocabulary of unique words
word2id['PAD'] = 0
id2word = {v:k for k, v in word2id.items()}
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_corpus]
vocab_size = len(word2id)

# Choosing the word embedding dimension and the context window size
embed_size = 100
window_size = 2 # context window size

print('\nVocabulary Size:', vocab_size)
print('\nEmbedding Size:',embed_size)
print('\nVocabulary Sample:\n', list(word2id.items())[:10])
```

**Output:**

```python
Sample Corpus(Julius Caesar- Shakespeare):

['.', 'Enter', 'Flauius', ',', 'Murellus', ',', 'and', ...]
['A', 'Trade', 'Sir', ',', 'that', 'I', 'hope', 'I', ...]

Vocabulary Size: 3016

Embedding Size: 100

Vocabulary Sample:
 [('and', 1), ('the', 2), ('i', 3), ('to', 4), ('you', 5), ('of', 6), ('that', 7), ('a', 8), ('not', 9), ('is', 10)]
```

### Step 2:

Create a function to encode the words to ids and separate context, target words

```python
import numpy as np

# Function for generating context word pairs

def gcp(corpus, vocab_size, window_size):
    
    # We take context length as double the window size to include both the left and right parts
    
    context_len = window_size*2
    
    for words in corpus:
        
        sentence_len = len(words)
        
        for index, word in enumerate(words):
            context_words = []
            label_word   = []            
            start = index - window_size
            end = index + window_size + 1
            

            context_words.append([words[i] 
                                 for i in range(start, end) 
                                 if 0 <= i < sentence_len 
                                 and i != index])
            label_word.append(word)

            x = sequence.pad_sequences(context_words, maxlen=context_len)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)
            
            
# Sample test: print the first 6 (context, target) pairs that need no padding
now = 0
for x, y in gcp(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('X (Context Words):', [id2word[w] for w in x[0]], '; Y(Target Word):', id2word[np.argwhere(y[0])[0][0]])
    
        if now == 5:
            break
        now += 1

```

**Output:**

```python
X (Context Words): ['the', 'tragedie', 'julius', 'caesar'] ; Y(Target Word): of
X (Context Words): ['tragedie', 'of', 'caesar', 'by'] ; Y(Target Word): julius
X (Context Words): ['of', 'julius', 'by', 'william'] ; Y(Target Word): caesar
X (Context Words): ['julius', 'caesar', 'william', 'shakespeare'] ; Y(Target Word): by
X (Context Words): ['caesar', 'by', 'shakespeare', '1599'] ; Y(Target Word): william
X (Context Words): ['by', 'william', '1599', 'actus'] ; Y(Target Word): shakespeare

```

### Step 3:

Build the CBOW (deep learning) model

```python
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

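# build CBOW architecture
cbow = Sequential()
# Embedding layer: maps each of the 2*window_size context word ids to an embed_size-dimensional vector
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
# Averaging layer: the mean of the context word vectors becomes the hidden activation
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Output layer: softmax over the vocabulary to predict the target word
cbow.add(Dense(vocab_size, activation='softmax'))
# The target is one-hot encoded, hence categorical cross-entropy
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')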

# view model summary
print(cbow.summary())

# visualize model structure
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False, 
                  rankdir='TB').create(prog='dot', format='svg'))

```

**Output**

```python
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 4, 100)            301600    
_________________________________________________________________
lambda_2 (Lambda)            (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 3016)              304616    
=================================================================
Total params: 606,216
Trainable params: 606,216
Non-trainable params: 0
_________________________________________________________________


```
![](images2/cbow_arch.svg)



### Step 4:

Train the model and store the learned word vectors in a dataframe

```python
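# Train for 5 epochs, feeding one (context, target) pair per batch
for epoch in range(1, 6):
    loss = 0.
    i = 0
    for x, y in gcp(corpus=wids, window_size=window_size, vocab_size=vocab_size):
        i += 1
        # train_on_batch returns the loss for this pair; accumulate it over the epoch
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
 

# The weights of the Embedding layer are the word vectors; drop row 0 (the PAD token)
weights = cbow.get_weights()[0]
weights = weights[1:]
print(weights.shape)

# Row k of `weights` now belongs to word id k+1, so build the index from the ids
# (stored in cbow_df so that the trained model in `cbow` is not overwritten)
cbow_df = pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)])

print(cbow_df.head(5))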

    
```
**Output**
```python
Epoch: 1 	Loss: 142144.08271932602

Epoch: 2 	Loss: 168401.8597151637

Epoch: 3 	Loss: 172909.3102901429

Epoch: 4 	Loss: 171930.0592066273

Epoch: 5 	Loss: 172467.75245435536
        
        
        

```



You can also inspect the result (which words the model places closest to a given word) by writing code similar to the following:

```python

from sklearn.metrics.pairwise import euclidean_distances

# compute pairwise distance matrix
distance_matrix = euclidean_distances(weights)
# print(distance_matrix.shape)

# view contextually similar words
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                   for search_term in ['caesar', 'brutus', 'rome','ambitious']}

print(similar_words)

```

**Output:**

```python
{'caesar': ['brutus', 'most', 'octauius', 'death', 'blood'],
 'brutus': ['octauius', 'titinius', 'cassi', 'yours', 'most'],
 'rome': ['hoe', 'himselfe', 'pompeyes', 'many', 'whose'],
 'ambitious': ['loues', 'speak', 'peeces', 'wherein', 'greeke']}

```

You can find and implement the entire code [here](https://colab.research.google.com/drive/1PkccGxv9FKKfkYCsNntp0ouqHvZ8p2xV)


### 2.2 Skip – Gram model

Another method of neural word embedding is the Skip-gram model.

It follows a very similar architecture to CBOW. Instead of predicting a word given its context, Skip-gram tries to predict the context given a word.

***

***The input to CBOW model is the `context` word and output is the `target` word.***


***The input to Skipgram model is the `target` word and output is the `context` word.***

***

The neural network is given an `input word (target)` and outputs, for every word in the vocabulary, the probability of it appearing within the `context window` (whose size we specify) of the input word.

The output probabilities tell us how likely each word is to appear around the input word.

For example, if the input word is "Germany", a well-trained model will output higher probabilities for words like "Munich" and "Europe" than for unrelated words like "orange" and "arm".


Again similar to CBOW, we feed the network word pairs in our training documents. 

Let's use the same sentence as before "The dog jumped over the fence." and take a context window size=2

Following is the visual representation for the sentence with window size 2:

![](final_images/skipgram2.jpg)


In the above diagram, `dark blue` represents the `input(word)` and `light orange` represents the `output(context)`.


The training pairs then look like this:

![](final_images/skipgram1.jpg)


The network then learns the statistics from the number of times each pairing shows up in the training data.
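The (input, output) pairs for the example sentence can be generated as follows. This is a plain-Python sketch; `skipgram_pairs` is an illustrative helper (not the Keras function), and it lowercases the sentence, so 'The' and 'the' are treated as the same word.

```python
from collections import Counter

def skipgram_pairs(sentence, window_size):
    """Return every (target, context) pair within the given window."""
    words = sentence.lower().replace(".", "").split()
    pairs = []
    for i, target in enumerate(words):
        lo, hi = max(0, i - window_size), min(len(words), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = skipgram_pairs("The dog jumped over the fence.", window_size=2)
print(pairs[:5])

# How often each pairing shows up in the training data.
pair_counts = Counter(pairs)
print(pair_counts.most_common(3))
```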


The input vector for skip-gram is the same as for a `1-context CBOW` model, because there is only one input: the target word.


The difference between skip-gram and CBOW lies in how the output and its error are calculated.


The following is the model architecture for our problem:


![](final_images/skip_gram_net_arch.jpg)


Since we have defined a context window of 2, there will be "two" corresponding outputs for the same input (by two corresponding outputs we mean two output neurons that should be activated for a single input, while all the other neurons stay inactive).

Two separate errors are then calculated with respect to the two target variables. The two error vectors obtained are added element-wise to obtain a final error vector, after which backpropagation is done to update the weights.
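The error summation can be sketched in NumPy for one input word with two context words. This is a toy example under my own assumptions: a five-word vocabulary, N = 3, random initial weights, a softmax output, and a single plain gradient-descent step with learning rate 0.1.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["the", "dog", "jumped", "over", "fence"]
V, N = len(vocab), 3
W_in = rng.normal(scale=0.1, size=(V, N))    # input-hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target, context = "dog", ["the", "jumped"]   # one input, two expected outputs

hidden = W_in[vocab.index(target)].copy()    # hidden activation of the input word
probs = softmax(hidden @ W_out)              # predicted distribution over the vocabulary

# One error vector per context word, added element-wise.
total_error = np.zeros(V)
for word in context:
    y = np.zeros(V)
    y[vocab.index(word)] = 1.0
    total_error += probs - y

# Backpropagate the summed error through both weight matrices.
lr = 0.1
grad_W_out = np.outer(hidden, total_error)   # [N X V]
grad_hidden = W_out @ total_error            # [N]
W_out -= lr * grad_W_out
W_in[vocab.index(target)] -= lr * grad_hidden

print(total_error.round(3))
```

Only the row of `W_in` belonging to the input word is updated, while the whole of `W_out` receives a gradient.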


The skip-gram architecture for n-context window is shown below:

![](final_images/skipgrams2.jpg)

 
**Note:** The figure above might lead one to think that skip-gram uses `multiple output matrices`. That is not the case. It is the same matrix that is used to produce the `multiple output vectors` representing the different context words.


**Deep Dive(Optional)**

You can understand the mathematics of Skip-Gram in more detail by checking out  `Unit 4` of [How exactly does word2vec work?](http://www.1-4-5.net/~dmm/ml/how_does_word2vec_work.pdf) by David Meyer


#### Advantage of Skip-Gram Model

- Each (target, context) pair is a separate training example, so Skip-gram does not average the context away as CBOW does. Note that standard word2vec still learns a single vector per word: a word with several meanings, such as 'Windows' (the Microsoft operating system versus the windows of a flat), gets one vector that blends the contexts of both senses rather than two separate vectors.

#### Disadvantage of Skip-Gram Model

- Depending on the context window size, backpropagation can get both time and resource intensive.


## Python implementation


Let's see how we can implement this in Python.

#### Step 1:

Load the data and clean it (note: this is the same as it was for CBOW)

```python


from keras.preprocessing import text
from keras.utils import np_utils
from keras.preprocessing import sequence
import pandas as pd
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from string import punctuation
from sklearn.metrics.pairwise import euclidean_distances

# Loading the data
corpus=[nltk.corpus.gutenberg.words('shakespeare-caesar.txt')]

print("Sample Corpus(Julius Caesar- Shakespeare):\n")
print(corpus[0][16:])
print(corpus[0][150:])


# Terms to remove
r_terms = punctuation + '1234567890'

# Removing the above defined terms
norm_corpus = [[word.lower() for word in sent if word not in r_terms] for sent in corpus]

# Joining the sentence back    
norm_corpus = [' '.join(tok_sent) for tok_sent in norm_corpus]


# Keeping only documents with more than two tokens, capped at the first 1500 documents
norm_corpus = [tok_sent for tok_sent in norm_corpus if len(tok_sent.split()) > 2]
norm_corpus=norm_corpus[:1500]

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(norm_corpus)
word2id = tokenizer.word_index

# Building the vocabulary of unique words
word2id['PAD'] = 0
id2word = {v:k for k, v in word2id.items()}
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in norm_corpus]
vocab_size = len(word2id)

# Choosing the word embedding dimension and the context window size
embed_size = 100
window_size = 2 # context window size

print('\nVocabulary Size:', vocab_size)
print('\nEmbedding Size:',embed_size)
print('\nVocabulary Sample:\n', list(word2id.items())[:10])
```

**Output:**

```python
Sample Corpus(Julius Caesar- Shakespeare):

['.', 'Enter', 'Flauius', ',', 'Murellus', ',', 'and', ...]
['A', 'Trade', 'Sir', ',', 'that', 'I', 'hope', 'I', ...]

Vocabulary Size: 3016

Embedding Size: 100

Vocabulary Sample:
 [('and', 1), ('the', 2), ('i', 3), ('to', 4), ('you', 5), ('of', 6), ('that', 7), ('a', 8), ('not', 9), ('is', 10)]
```
#### Step 2:

Create a function to encode the words to ids and separate context, target words(Note: It's the same as it was for CBOW)


```python
import numpy as np

# Function for generating context word pairs

def gcp(corpus, vocab_size, window_size):
    
    # We take context length as double the window size to include both the left and right parts
    
    context_len = window_size*2
    
    for words in corpus:
        
        sentence_len = len(words)
        
        for index, word in enumerate(words):
            context_words = []
            label_word   = []            
            start = index - window_size
            end = index + window_size + 1
            

            context_words.append([words[i] 
                                 for i in range(start, end) 
                                 if 0 <= i < sentence_len 
                                 and i != index])
            label_word.append(word)

            x = sequence.pad_sequences(context_words, maxlen=context_len)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)
            
            
# Sample test for 10 samples
now = 0
for x, y in gcp(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('X (Context Words):', [id2word[w] for w in x[0]], '; Y(Target Word):', id2word[np.argwhere(y[0])[0][0]])
    
        if now == 5:
            break
        now += 1

```

**Output:**

```python
X (Context Words): ['the', 'tragedie', 'julius', 'caesar'] ; Y(Target Word): of
X (Context Words): ['tragedie', 'of', 'caesar', 'by'] ; Y(Target Word): julius
X (Context Words): ['of', 'julius', 'by', 'william'] ; Y(Target Word): caesar
X (Context Words): ['julius', 'caesar', 'william', 'shakespeare'] ; Y(Target Word): by
X (Context Words): ['caesar', 'by', 'shakespeare', '1599'] ; Y(Target Word): william
X (Context Words): ['by', 'william', '1599', 'actus'] ; Y(Target Word): shakespeare

```


#### Step 3:

Create skip-gram pairs using the `skipgrams` function of Keras.

```python
from keras.preprocessing.sequence import skipgrams

# generate skip-grams
skip_grams = [skipgrams(wid, vocabulary_size=vocab_size, window_size=2) for wid in wids]

# view sample skip-grams
pairs, labels = skip_grams[0][0], skip_grams[0][1]
for i in range(10):
    print("X(Target word) : {:s} ({:d}), Y(Context word):{:s} ({:d}) = {:d}".format(
          id2word[pairs[i][0]], pairs[i][0], 
          id2word[pairs[i][1]], pairs[i][1], 
          labels[i]))
    
```

**Output:**

```python
X(Target word) : your (26), Y(Context word):break (1317) = 0
X(Target word) : him (20), Y(Context word):haile (842) = 0
X(Target word) : what (32), Y(Context word):swim (1509) = 0
X(Target word) : any (128), Y(Context word):vs (60) = 1
X(Target word) : health (659), Y(Context word):he (17) = 1
X(Target word) : stand (104), Y(Context word):purpos (2112) = 0
X(Target word) : reasons (350), Y(Context word):shall (33) = 1
X(Target word) : to (4), Y(Context word):foam (1626) = 0
X(Target word) : see (99), Y(Context word):lucius (133) = 0
X(Target word) : you (5), Y(Context word):home (237) = 1
```
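
In the output above, a label of 1 marks a pair whose context word really occurs inside the window of the target word, while a label of 0 marks a pair whose second word was drawn at random. The following pure-Python sketch mimics that idea. `toy_skipgrams` is a hypothetical helper, not the Keras implementation, and it assumes negatives are sampled uniformly.

```python
import random

def toy_skipgrams(ids, vocab_size, window_size=2, seed=0):
    """Return (pairs, labels): genuine window pairs get 1, random pairs get 0."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for i, target in enumerate(ids):
        lo = max(0, i - window_size)
        hi = min(len(ids), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # positive example: a word that truly appears in the window
            pairs.append((target, ids[j]))
            labels.append(1)
            # negative example: a random word id (id 0 is reserved for padding)
            pairs.append((target, rng.randint(1, vocab_size - 1)))
            labels.append(0)
    return pairs, labels

pairs_toy, labels_toy = toy_skipgrams([1, 2, 3, 4], vocab_size=10)
for pair, label in list(zip(pairs_toy, labels_toy))[:4]:
    print(pair, label)
```

A randomly drawn word can occasionally coincide with a genuine context word; with a realistic vocabulary size this is rare enough to be ignored in a sketch like this.
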

#### Step 4:


Build the skip-gram (deep learning) model

```python

from keras.layers import Merge
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Sequential
from keras.layers import Input
# build skip-gram architecture
word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))
word_model.add(Reshape((embed_size, )))

context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size,
                  embeddings_initializer="glorot_uniform",
                  input_length=1))
context_model.add(Reshape((embed_size,)))
input_1 = Input(shape=(1,))
input_2 = Input(shape=(1,))

model = Sequential()
model.add(Merge([word_model, context_model], mode="dot"))
# model.add(Dot(axes=1)[word_model, context_model])
model.add(Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid"))
model.compile(loss="mean_squared_error", optimizer="rmsprop")

# view model summary
print(model.summary())

# visualize model structure
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model, show_shapes=True, show_layer_names=False, 
                 rankdir='TB').create(prog='dot', format='svg'))

```

**Output:**

```python
Layer (type)                 Output Shape              Param #   
=================================================================
merge_1 (Merge)              (None, 1)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2         
=================================================================
Total params: 603,202
Trainable params: 603,202
Non-trainable params: 0
_________________________________________________________________
None
```

![](final_images/skipgram_svg.png)
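
The parameter count in the summary can be read off the architecture: two embedding tables of `vocab_size x embed_size` values each, plus one weight and one bias in the final dense unit. A small NumPy sketch of the forward pass follows. The parameters are randomly initialised toy values and the function name `predict` is illustrative, so this shows the computation rather than the trained Keras model.

```python
import numpy as np

vocab_size, embed_size = 3016, 100
rng = np.random.default_rng(0)

# Two separate embedding tables: one for target words, one for context words
word_emb = rng.normal(scale=0.05, size=(vocab_size, embed_size))
context_emb = rng.normal(scale=0.05, size=(vocab_size, embed_size))
dense_w, dense_b = 1.0, 0.0  # the single Dense unit: one weight and one bias

def predict(target_ids, context_ids):
    """Score each (target, context) pair: dot product followed by a sigmoid."""
    dots = np.sum(word_emb[target_ids] * context_emb[context_ids], axis=1)
    return 1.0 / (1.0 + np.exp(-(dense_w * dots + dense_b)))

n_params = word_emb.size + context_emb.size + 2
print(n_params)  # 603202
print(predict(np.array([26, 20]), np.array([1317, 842])))
```
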




#### Step 5:

Run the model and store the result in a dataframe


```python
for epoch in range(1, 6):
    loss = 0
    for i, elem in enumerate(skip_grams):
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        if i % 10000 == 0:
            print('Processed {} (skip_first, skip_second, relevance) pairs'.format(i))
        loss += model.train_on_batch(X,Y)  

    print('Epoch:', epoch, 'Loss:', loss)

merge_layer = model.layers[0]
word_model = merge_layer.layers[0]
word_embed_layer = word_model.layers[0]
# Row 0 belongs to the 'PAD' id, so skip it; row i-1 then holds the word with id i
weights = word_embed_layer.get_weights()[0][1:]

print(weights.shape)
sg = pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)])

print(sg.head())
```




**Output:**

```python

Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 1 Loss: 0.25000664591789246
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 2 Loss: 0.24970348179340363
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 3 Loss: 0.24942149221897125
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 4 Loss: 0.24911095201969147
Processed 0 (skip_first, skip_second, relevance) pairs
Epoch: 5 Loss: 0.24874989688396454

        
0         1         2         3         4         5         6   \
and  0.026597  0.001879 -0.043389 -0.043488 -0.018410  0.038676 -0.003460   
the  0.014068 -0.032061  0.000684  0.027724 -0.002988 -0.006454  0.030715   
i   -0.002288 -0.021885  0.031956  0.006131  0.028143 -0.040348  0.025684   
to   0.008286 -0.030030  0.004438  0.045821  0.042157  0.015617 -0.011603   
you -0.017092 -0.049372  0.035414 -0.052552  0.036532  0.000042  0.030852   

           7         8         9     ...           90        91        92  \
and -0.007456 -0.032164 -0.022693    ...    -0.029225 -0.014139  0.029264   
the  0.029273  0.010004 -0.011260    ...     0.021498 -0.007501  0.012237   
i    0.031671  0.025718  0.024700    ...    -0.001088  0.014545  0.034693   
to  -0.024210  0.003525 -0.056557    ...    -0.004863  0.009977 -0.007356   
you -0.020037  0.044469 -0.041131    ...     0.030758  0.010160  0.048979   

           93        94        95        96        97        98        99  
and -0.003137 -0.007429  0.028457  0.023394  0.032370 -0.007107 -0.024838  
the  0.043908  0.009744 -0.050923 -0.045509 -0.010050  0.000221  0.027214  
i   -0.000955 -0.037693 -0.009268 -0.048223  0.000058 -0.011583  0.005775  
to  -0.002735 -0.038393  0.041381  0.003286 -0.049478 -0.015810  0.034801  
you  0.018299 -0.024840 -0.022634 -0.008918  0.024879  0.011316 -0.008067        

```




You can also inspect the result (which words the model places close to a given word) by writing code similar to the following:

```python
from sklearn.metrics.pairwise import euclidean_distances

# compute pairwise distance matrix
distance_matrix = euclidean_distances(weights)
# print(distance_matrix.shape)

# view contextually similar words
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                   for search_term in ['caesar', 'brutus', 'rome','ambitious']}

similar_words
```

**Output:**
```python
{'caesar': ['vpper', 'infirmities', 'traitor', 'stab', 'factious'],
 'brutus': ['crowne', 'affayres', 'treasure', 'staine', 'laught'],
 'rome': ['healthfull', 'ho', 'foot', 'yawn', 'eares'],
 'ambitious': ['knee', 'griefe', 'peoples', 'cride', 'veyl']}

```

You can find and implement the code [here](https://colab.research.google.com/drive/17BAYfMEoDyPTQv2_J3ckjzxOq2lZ5Ss8)

Both the code and the theory above show that Word2Vec models, though effective, usually involve a huge neural network, because the input and output layers grow with the vocabulary (a vocabulary of around 10K words is common)!

The authors of Word2Vec addressed this issue with a method called `negative sampling`, in which each training sample modifies only a small percentage of the weights rather than all of them.

Let's try to understand it in detail in the next topic.


# 2.3 Negative Sampling


Suppose we had a corpus with 10,000 unique words. If we pass it through a neural network with an input layer (10,000 neurons), a hidden layer (say, 200 neurons) and an output layer (10,000 neurons as well), the weight matrices of the hidden layer and the output layer will each hold 200 x 10,000 = 2 million weights!

Running gradient descent on this neural network is going to be slow, not to mention that you need a large amount of training data in order to avoid over-fitting. 
This humongous weight matrix, coupled with a large number of training samples, means that training this model does not scale easily.

To resolve that we use something called `negative sampling`. 

In negative sampling, we select a small number of "negative" words and update only their weights (along with those of the correct word). Here "negative" simply refers to words for which our model should output a zero (that is, words which should not be the output words given the input).

The motivation behind negative sampling is that instead of changing all the weights each time, we update only the weights of 'K' negative words, which increases computational efficiency.


For example, suppose we have the word pair (“dog”, “jumped”) to train our network. Ideally, the output neuron corresponding to “jumped” should output a 1, and all of the other output neurons should output a 0.

With negative sampling, we instead select just 5 to 20 words (for smaller datasets) or 2 to 5 words (for larger datasets) that should output a 0.


So if we take the same example as before, with word vectors of 200 components and a vocabulary of 10,000 words, we will only update the weights for our positive word, “jumped”, along with the weights for 5 negative words that should output 0. That’s a total of 6 output neurons, and 1,200 weight values in total. 

**That’s only 0.06% of the 2M weights.**
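
The arithmetic behind this figure can be checked in a few lines. The variable names are illustrative; the numbers are the ones used in the example above.

```python
vocab_size = 10_000   # unique words in the corpus
hidden_size = 200     # components per word vector
num_negative = 5      # negative words sampled per training pair

total_output_weights = hidden_size * vocab_size   # 2,000,000
updated_neurons = 1 + num_negative                # the positive word plus the negatives
updated_weights = updated_neurons * hidden_size   # 1,200
fraction = updated_weights / total_output_weights

print(total_output_weights, updated_weights, f"{fraction:.2%}")
```
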


Q: How do we choose those negative sample words?

A: They are drawn from a unigram distribution.

The probability of selecting a word as a negative sample is related to its frequency.
Therefore more frequent words are more likely to be selected as negative samples.

Each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words.

\begin{align}
P(w_i) = \frac{  {f(w_i)}^{3/4}  }{\sum_{j=0}^{n}\left(  {f(w_j)}^{3/4} \right) }
\end{align}

**Note:** The decision to raise the frequency to the 3/4 power is just empirical.
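
As a rough illustration of the formula, the sketch below compares the plain unigram distribution with the 3/4-power version on a handful of made-up word counts (the counts and variable names are assumptions for this example only) and then draws a few negative samples.

```python
import numpy as np

# Toy word counts, invented for illustration
word_counts = {"the": 1000, "caesar": 100, "brutus": 50, "ides": 5}

words = list(word_counts)
freqs = np.array([word_counts[w] for w in words], dtype=float)

unigram = freqs / freqs.sum()          # plain frequency-based probabilities
weights = freqs ** 0.75                # frequency raised to the 3/4 power
neg_probs = weights / weights.sum()    # P(w_i) from the formula above

for w, p_raw, p_neg in zip(words, unigram, neg_probs):
    print(f"{w:8s} unigram={p_raw:.3f}  negative-sampling={p_neg:.3f}")

# Draw 5 negative words according to the smoothed distribution
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=neg_probs)
print(negatives)
```

Raising the counts to the 3/4 power lowers the share of very frequent words such as "the" and raises the share of rare words, while still keeping frequent words the most likely to be picked.
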

Skip-gram trained with negative sampling is one of the most widely used Word2Vec variants and performs well in practice.

****

Before we go any further, let's shift our focus back to the 'embeddings' coming out from neural embeddings.

Following is the flow

The output context matrix W′ encodes the meanings of words as context, different from the embedding matrix W


The questions then one starts thinking about are these:

*Q :* How do you know if you have got the right embeddings?

*A :* There's no simple way to know exactly what an embedding means. The training process just forces words into useful positions relative to each other. Owing to the training, the individual values of the embedding do carry some semantic meaning, but they are not simple to interpret.
In a nutshell, the definition of the right embeddings changes with each application, and the way to judge them is to observe the evaluation metrics of that application.


*Q :* What is the correct embedding size? 

*A :*  As a continuation of the above answer, the correct embedding size will vary according to application.


Finally following is a summary of all the methods we discussed:

![](final_images/comp.jpg)


# Task 2

We will try to directly create word embeddings using the Word2Vec model of gensim.

- Julius Caesar corpus is already loaded and tokenized, and stored in `'corpus'`

- Print `'corpus'` to see how the data looks

- Create a CBOW model on corpus using `"gensim.models.Word2Vec()"` and passing the following parameters `"sentences=corpus"`, `"size = 100"` & `"window = 2"`(`size` refers to the embedding matrix size you want, `window` refers to the window size of context)

- Find similarity between `caesar` and `rome` using the `"similarity()"` method of the above created model. (For eg: if you saved your Word2Vec model as `'model'`, you can find the similarity by writing `model.similarity('caesar','rome')`). Save the similarity score in `'cbow_sim'`

- Create a Skip-Gram model on corpus using `"gensim.models.Word2Vec()"` and passing the following parameters `"sentences=corpus"`, `"size = 100"`,`"window = 2"` & `"sg = 1"`(sg = 1 implies we want skip gram modeling)

- Find similarity between `caesar` and `rome` using the `"similarity()"` method of the above created model. (For eg: if you saved your Word2Vec model as `'model'`, you can find the similarity by writing `model.similarity('caesar','rome')`). Save the similarity score in `'sg_sim'`


**Extra Task**

Word2Vec in gensim automatically implements `negative sampling`. You can play around with it using the `negative` parameter of the gensim model (the value passed to `negative` specifies how many “noise words” should be drawn, usually between 5 and 20; the default is 5, and if it is set to 0, no negative sampling is used).

Why don't you try to change the value of `negative` and see how the similarity scores change?



# Hints

You can find the CBOW similarity using code similar to the following:
```python
model1 = gensim.models.Word2Vec(sentences=corpus, size = 100, window = 2) 
  
cbow_sim=model1.similarity('caesar', 'rome')    
```
# Test Cases

#cbow_sim
Variable Declaration
np.round(cbow_sim,2)== np.round(0.999789,2)

#sg_sim
Variable Declaration
np.round(sg_sim,2)== np.round(0.9993608,2)

```python
# importing all necessary modules
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize

# Reading the 'julius_caesar.txt' file
with open("julius_caesar.txt", "r") as sample:
    s = sample.read()

corpus = []

# iterate through each sentence in the file
for i in sent_tokenize(s):
    # tokenize the sentence into lower-cased words
    corpus.append([j.lower() for j in word_tokenize(i)])


# Code starts here

# Printing the cleaned corpus
print(corpus[100:105])

# Note: this follows the older gensim API used throughout this unit
# (gensim >= 4.0 renames `size` to `vector_size` and exposes similarity via `model.wv.similarity`)

# Create CBOW model
model1 = gensim.models.Word2Vec(sentences=corpus, size = 100, window = 2)

cbow_sim = model1.similarity('caesar', 'rome')
# Print results
print("\nCBOW similarity between 'Caesar' and 'Rome': ", cbow_sim)


# Create Skip Gram model
model2 = gensim.models.Word2Vec(sentences=corpus, size = 100, window = 2, sg = 1)

sg_sim = model2.similarity('caesar', 'rome')
# Print results
print("Skip-Gram similarity between 'Caesar' and 'Rome': ", sg_sim)
```
    




# Success Message

Congrats! You have successfully implemented CBOW and Skip-Gram using gensim Word2Vec

# Chapter 3: GloVe



## What is GLOVE

Although word embeddings have become the SOTA in many NLP tasks, they do have some drawbacks. 

**Limitation with Word2Vec**


- Inability to handle unknown vocabulary words

If the model encounters a new, unknown word, it will not be able to interpret it or build a vector for it. This problem is usually worked around by assigning a random vector to the unknown word (which is far from ideal).

This is a major issue in domains like news articles, Twitter, etc., where there is a lot of noisy and sparse data (with words that appear only once or twice in a very large corpus).


- Sub-word relationships are not explicitly modelled. 

There are no shared representations at sub-word levels with word2vec. 

For example, if a new word starts with "dis", from our knowledge of words, we can infer that it’s probably an adjective indicating the opposite of something, like disloyal, dissimilar or dishonest

Word2vec unfortunately represents every word as an independent vector, even though many words are morphologically similar(just like our example above)



This leads us to another popular model, one that approaches the problem somewhat differently from the word2vec models.

**GLOVE**

Global Vectors (or GloVe) is an unsupervised learning algorithm for obtaining vector representations for words. The authors of the [GloVe paper](https://www.aclweb.org/anthology/D14-1162) found that `context window-based` methods (like skip-gram and CBOW) suffer from a disadvantage: they take into account only local statistics (e.g. 'dog' appears near the word 'jumped' in a sentence) and not the global corpus statistics (repetition and large-scale patterns; e.g. 'dog' and 'cat' both appear near 'jumped' across multiple sentences). 


GLOVE's objective is to capture the meaning generated from these statistics, and represent that meaning in the resulting word vectors. Let's try to understand how it does that.


Consider the words man and woman. In a standard corpus, the Euclidean distance between the two word vectors will be small owing to their similarity (i.e. both man and woman appear in similarly structured sentences). Finding the nearest neighbours therefore provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. 

Unfortunately, this measure is too simple: two given words can exhibit more intricate relationships than can be captured by a single number. 'Man' is similar to 'woman' in that both words describe humans; on the other hand, the two words are often considered opposites since they represent two different biological variations of the same species.
This means that in order to capture quantitatively what distinguishes man from woman, a model must associate more than a single number with the word pair. 

One natural candidate is the vector difference between the two word vectors. GloVe is designed so that these vector differences capture as much as possible of the meaning specified by the juxtaposition of two words.


How does this help us? 

Suppose you have the word vectors “rome,” “italy,” and “france” and perform the following operation, vector("rome") − vector("italy") + vector("france"). The resulting vector will be close to the vector for “paris.”

The underlying concept that distinguishes male from female, i.e. gender, may be equivalently specified by other word pairs like father and mother or uncle and aunt. The vector differences for all these pairs are roughly equal. 
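
The analogy arithmetic can be sketched with a few hand-made vectors. The three-dimensional vectors below are invented so that the example works out cleanly (real embeddings are learned and have many more dimensions), and the helper `cosine` is an illustrative name.

```python
import numpy as np

# Hand-made toy vectors: [is-a-capital, italy-ness, france-ness]
vectors = {
    "italy":  np.array([0.0, 1.0, 0.0]),
    "rome":   np.array([1.0, 1.0, 0.0]),
    "france": np.array([0.0, 0.0, 1.0]),
    "paris":  np.array([1.0, 0.0, 1.0]),
    "spain":  np.array([0.0, 0.7, 0.7]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# vector("rome") - vector("italy") + vector("france")
query = vectors["rome"] - vectors["italy"] + vectors["france"]

# Look for the closest word, excluding the three words used in the query
candidates = {w: v for w, v in vectors.items() if w not in {"rome", "italy", "france"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```
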



**Working**

How does GLOVE come up with these vectors?

The simple idea behind the model is that ratios of word-word co-occurrence probabilities have the potential to capture some form of meaning which can then be encoded as 'vector differences'

Let's understand this using a very popular example:

Consider the co-occurrence probabilities for target words ice and steam with various words from the vocabulary. 

We know,
- 'ice' will co-occur more frequently with 'solid' than it will with 'gas'
- 'steam' will co-occur more frequently with gas than it will with solid. 
- both words will co-occur frequently with 'water'(both are scientifically different states of water) 
- both words will co-occur infrequently with perhaps 'fashion'(a completely unrelated word).


Let's take P(k|w) as the probability that a word k appears in the context (i.e. nearby) of word w.

Based on the above points, P(solid|ice) will be relatively high and P(solid | steam) will be relatively low. 

Thus, the ratio of 'P(solid | ice) / P(solid | steam)' will be large. 
In the same way, ratio of P(gas | ice) / P(gas | steam) will be small. 

For a word that is related to both ice and steam (such as water) or to neither (such as fashion), the ratios P(water|ice)/P(water|steam) and P(fashion|ice)/P(fashion|steam) will be close to one.

By using the ratio of probabilities, the noise from non-discriminative words like water and fashion therefore cancels out.

![](final_images/glove_table.jpg)

So if, for a word k, 'P(k | ice) / P(k | steam)' gives a large value (much greater than 1), the word correlates well with properties specific to ice. Similarly, small values (much less than 1) mean that the word correlates well with properties specific to steam. 

**Note:** In our specific example, these ratios of probabilities have therefore encoded some meaning associated with the concept of thermodynamic phase. 
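
To see the ratios at work, here is a small sketch with hypothetical co-occurrence counts (the numbers are invented for illustration and are not taken from any real corpus). The helper `cond_prob` is an illustrative name for P(k|w).

```python
# Hypothetical counts: how often each context word appears near the target word
counts = {
    "ice":   {"solid": 190, "gas": 7,   "water": 300, "fashion": 2},
    "steam": {"solid": 4,   "gas": 140, "water": 280, "fashion": 2},
}

def cond_prob(context, target):
    """P(context | target): the count for the pair divided by all counts for the target."""
    row = counts[target]
    return row[context] / sum(row.values())

# Ratio P(k | ice) / P(k | steam) for every context word k
ratios = {k: cond_prob(k, "ice") / cond_prob(k, "steam") for k in counts["ice"]}
for k, r in ratios.items():
    print(f"P({k}|ice) / P({k}|steam) = {r:.2f}")
```

With these toy counts the ratio is far above 1 for 'solid', far below 1 for 'gas', and near 1 for both 'water' and 'fashion'.
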


The training objective of GloVe is therefore to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. The logarithm is used because the logarithm of a ratio equals the difference of logarithms, so this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. 

Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. 

Too much? Let's break it down into simple steps:

Step 1:

For two input words i and j, and context word k, Let's create a function F similar to the following:

![](final_images/model2.jpg)

So now we have two arguments; the context word vector and the vector difference of the two words we're comparing

Step 2:

Now, we want to create a relation(linear) between $ w_i - w_j  \text{ and } \tilde{w_k}$ . This can be accomplished by using the dot product:

![](final_images/glove_1.jpg)

Step 3:

Adding biases (to capture the fact that some words occur more often than others) and writing the R.H.S. as a difference of logarithms, we get:


$\textrm{dot}(w_i - w_j, \tilde{w_k}) + b_i - b_j = \log(P_{ik}) - \log(P_{jk}) $

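We can convert this equation into an equation over a single entry (only i, no j) of the co-occurrence matrix $X$, where $X_{ik}$ counts how often word k appears in the context of word i and $X_i = \sum_k X_{ik}$, so that $P_{ik} = X_{ik} / X_i$:

$\textrm{dot}(w_i, \tilde{w_k}) + b_i = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$

Adding an output bias $\tilde{b}_k$ for symmetry and absorbing the term $- \log(X_i)$ into the bias $b_i$, we get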

$\textrm{dot}(w_i, \tilde{w_k}) + b_i + \tilde{b}_k = \log(X_{ik})$ 

The above equation is the GLOVE equation for a single term.

Step 4:

Unfortunately, the equation above weights all co-occurrences equally, and we know that not all co-occurrences provide the same amount of information. 

- Case 1: We want to weight infrequent co-occurrences less heavily (they are noisy and unreliable). 
- Case 2: At the same time, very frequent co-occurrences should not be weighted too heavily either (e.g. trivial pairs like (“and”, “the”) shouldn't dominate because of their counts).

Through experimentation, the authors of the GloVe paper found that the following weighting function works well:

$\textrm{weight}(x) = \textrm{min}(1, (x / x_{max})^\frac{3}{4})$  

Following is its plot:

![](final_images/glove_2.jpg)

Put simply, the function gradually increases with x but never becomes larger than 1
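
Putting the steps together, the single-term equation and the weighting function combine into a weighted least-squares objective. The NumPy sketch below is a minimal illustration under stated assumptions: the function names `glove_weight` and `glove_loss`, the toy co-occurrence matrix and the value `x_max=100` are choices made for this example, and no training loop is shown.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: min(1, (x / x_max) ** alpha)."""
    return np.minimum(1.0, (np.asarray(x, dtype=float) / x_max) ** alpha)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error of dot(w_i, w~_k) + b_i + b~_k against log(X_ik)."""
    i_idx, k_idx = np.nonzero(X)          # only pairs that actually co-occur
    x = X[i_idx, k_idx]
    pred = np.sum(W[i_idx] * W_ctx[k_idx], axis=1) + b[i_idx] + b_ctx[k_idx]
    return float(np.sum(glove_weight(x) * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(0)
# Toy symmetric co-occurrence counts for a 3-word vocabulary
X = np.array([[0, 12, 3], [12, 0, 150], [3, 150, 0]], dtype=float)
dim = 4
W = rng.normal(scale=0.1, size=(3, dim))       # word vectors
W_ctx = rng.normal(scale=0.1, size=(3, dim))   # context word vectors
b, b_ctx = np.zeros(3), np.zeros(3)

print(glove_weight([1, 50, 100, 1000]))
print(glove_loss(X, W, W_ctx, b, b_ctx))
```

Training would then adjust `W`, `W_ctx`, `b` and `b_ctx` to reduce this loss; rare pairs contribute little because of their small weight, and very frequent pairs are capped at a weight of 1.
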


### Comparison with Word2Vec Models

GloVe and Word2Vec are mathematically similar, though GloVe has been reported to train faster.

That's because if some words occur very frequently (which is usually the case), it’s faster to optimize over the aggregated co-occurrence statistics than to iterate over the entire corpus repeatedly. The authors also report that GloVe consistently produces better embeddings faster than word2vec. 

In practice, however, people nowadays mostly use pre-trained word embeddings, so the training time is not much of an advantage. 



## GLOVE in python

Pretrained word embeddings are word vectors obtained from large corpora (like Wikipedia, Twitter, etc.), and such vectors trained on massive web datasets are now freely available. These large corpora capture far more data than a problem-specific corpus (the data you are using in your DL problem) and therefore often lead to better model performance.


Pretrained GloVe embeddings are also publicly available [here](https://github.com/stanfordnlp/GloVe). 
Let's see how we can use these pretrained embedding vectors.
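
Before loading the real file, it can help to see what the data looks like. GloVe's plain-text files store one word per line followed by its vector components, and the word2vec text format is the same apart from an extra header line with the vocabulary size and the dimension. The sketch below parses a tiny made-up sample; the vectors and the helper `load_glove` are hypothetical and only for illustration.

```python
import io
import numpy as np

# Three made-up 4-dimensional vectors in GloVe's plain-text layout
glove_text = """rome 0.1 0.2 0.3 0.4
italy 0.0 0.1 0.3 0.5
paris 0.2 0.2 0.1 0.4
"""

def load_glove(handle):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

embeddings = load_glove(io.StringIO(glove_text))
dim = len(next(iter(embeddings.values())))

# The header that the word2vec text format would place on the first line
print(f"{len(embeddings)} {dim}")
```

For a real file, `io.StringIO(glove_text)` would be replaced by an opened file handle.
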



# Task 3

- A vector file built from the GloVe Wikipedia corpus has already been created and stored in `path` (the code that created it is commented out)

- Load the corpus using `load_word2vec_format()` method of `KeyedVectors` and passing the parameters:`fname=path` & `binary=False`

- Use the `"most_similar()"` method of the above created model and pass the parameters `positive=['rome', 'france']`, `negative=['italy']`, `topn=1` to it. Store the result in a variable called `'result'`

Note: The above task is similar  to performing the following:
 vector("rome") − vector("italy") + vector("france")
 
 
- After submitting the correct code, experiment with different word combinations (like positive=['woman', 'father'], negative=['man']) and see what the model returns 

# Hint
You can load the vector by writing code similar to:
```python
model = KeyedVectors.load_word2vec_format(fname=path, binary=False)

```


# Test Cases

#result
variable declaration
(result[0][0]=='paris') &  (np.round(result[0][1],2)==np.round(0.8440366387367249,2))



```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# glove_input_file = 'glove.6B.100d.txt'
# word2vec_output_file = 'glove.6B.100d.txt.word2vec'
# glove2word2vec(glove_input_file, word2vec_output_file)


path= 'glove.6B.100d.txt.word2vec'

# Code starts here

model = KeyedVectors.load_word2vec_format(fname=path, binary=False)

# result = model.most_similar(positive=['woman', 'father'], negative=['man'], topn=1)
result = model.most_similar(positive=['rome', 'france'], negative=['italy'], topn=1)
print(result)
```

# 3.2 Developments in word embeddings


Owing to the tremendous applicability of word embeddings, their development has been quite rapid.

Before we look at that, let's look at some of the wonderful word relations that word2vec models have achieved.

Following is the list of words that a Word2vec model provides when given the first three elements:
Note:  `:` means 'is to' and `::` means `as`; For e.g. "Paris is to France as Rome is to Italy" = Paris:France :: Rome:Italy 

    house:roof :: castle:[dome, bell_tower, spire, crenellations, turrets]

    monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]

    knee:leg::elbow:[forearm, arm, ulna_bone]

    love:indifference::fear:[apathy, callousness, timidity, helplessness, inaction]

Seeing the above relations, you can understand why word embeddings are so powerful. The Word2vec algorithm has never been taught any English syntax and has been fed no information about the world, yet it can compute analogies that make sense to humans.




### Recent advances in Word Embeddings:

The most popular models are word2vec and GloVe which are both based on the distributional hypothesis (i.e. similar context tends to result in similar meanings).


Other than that, both ELMo and FastText have been gaining a lot of traction recently.


1. FastText

You can read more about it [here](https://github.com/facebookresearch/fastText)


2. ELMO 

You can read more about it [here](https://allennlp.org/elmo)



If you want to check out other recent developments, you can visit a fantastic summary blog [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)