<a href="https://colab.research.google.com/github/bucuram/foundations-of-NLP-labs/blob/main/Lab4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Word2vec


Word representation methods from the last lab

- Bag of Words
- TF-IDF

Limitations of these representations

- High-dimensional
- Sparse
- No information about word meaning: similar words (e.g. *kid* and *child*) get unrelated vectors

Word2vec paper: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

Word2Vec is a shallow, two-layer neural network which is trained to reconstruct linguistic contexts of words.

It takes as its input a large corpus of words and produces a vector space, with each unique word in the corpus being assigned a corresponding vector in the space.


Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Example:    
The **kid** studies mathematics.

The **child** studies mathematics.

![embedding](https://miro.medium.com/max/1400/1*sAJdxEsDjsPMioHyzlN3_A.png)

###Methods for building the Word2vec model

![cbow-skip-gram](https://miro.medium.com/max/1400/1*cuOmGT7NevP9oJFJfVpRKA.png)

###Continuous Bag-of-Words (CBOW)



CBOW predicts target words from the surrounding context words.

![cbow](https://1.bp.blogspot.com/-nZFc7P6o3Yc/XQo2cYPM_ZI/AAAAAAAABxM/XBqYSa06oyQ_sxQzPcgnUxb5msRwDrJrQCLcBGAs/s1600/image001.png)
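As a rough illustration of what "predicting the target from its context" means in terms of training data, the sketch below builds (context, target) pairs from a tokenized sentence. The function name `cbow_pairs` and the `window` parameter are hypothetical names chosen for this example, not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


sentence = "the kid studies mathematics".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each pair asks the model to predict the single word on the right from the words on the left.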

###Skip-gram

Skip-gram predicts surrounding context words from the target words.

![skip-gram](https://i.stack.imgur.com/fYhXF.png)
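A minimal sketch of the training pairs Skip-gram would see: each target word is paired with every word inside its context window. The function name `skipgram_pairs` and the `window` parameter are hypothetical, chosen only for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for skip-gram-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the kid studies mathematics".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

Here the direction is reversed compared to CBOW: one input word, one context word to predict per pair.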


##Architecture

The words are fed in as one-hot vectors: each is a vector of the same length as the vocabulary, filled with zeros except for a 1 at the index of the word it represents.
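To make the one-hot encoding concrete, here is a small sketch with a five-word toy vocabulary. The vocabulary and the helper name `one_hot` are assumptions made for this example.

```python
def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


vocab = ["the", "kid", "child", "studies", "mathematics"]
print(one_hot("child", vocab))
```

With a realistic vocabulary the vector would have tens of thousands of entries, all zero except one.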

The hidden layer is a standard fully-connected (Dense) layer whose weights are the word embeddings.

The output layer outputs probabilities for the target words from the vocabulary.

The goal of this neural network is to learn the weights for the hidden layer matrix.

![model](https://miro.medium.com/max/1400/1*tmyks7pjdwxODh5-gL3FHQ.png)

High-level illustration of the architecture

![model2](https://i.imgur.com/CBuZay5.png)
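The forward pass through this architecture can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights are random rather than trained, the vocabulary and sizes are toy values, and a plain softmax is assumed for the output layer.

```python
import math
import random

random.seed(0)
vocab = ["the", "kid", "child", "studies", "mathematics"]
V, D = len(vocab), 3  # vocabulary size and embedding size (toy values)

# Hidden-layer weights (V x D): one row per word, i.e. the embeddings.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
# Output-layer weights (D x V): map a hidden vector to one score per word.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word):
    """Input word -> hidden vector -> probability for every vocabulary word."""
    hidden = W_in[vocab.index(word)]  # a one-hot input selects a single row
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    return softmax(scores)


probs = forward("kid")
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Training would adjust `W_in` and `W_out` so that words actually seen near the input receive higher probability; afterwards only `W_in` is kept as the embedding table.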

The rows of the hidden layer weight matrix are the word vectors (word embeddings).


![hidden-layer](https://i.imgur.com/v6VqHad.png)

The hidden layer operates as a lookup table. The output of the hidden layer is just the “word vector” for the input word.

More concretely, if you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the ‘1’.

![vector](https://i.imgur.com/EYhcA5S.png)
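The lookup behaviour can be checked directly on a small example. The sketch below uses a 5 x 3 matrix with made-up values in place of the 10,000 x 300 one, and multiplies it by a one-hot vector by hand.

```python
vocab = ["the", "kid", "child", "studies", "mathematics"]

# Toy hidden-layer weight matrix: 5 words x 3 dimensions (made-up values).
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

index = vocab.index("child")
one_hot = [1 if i == index else 0 for i in range(len(vocab))]

# (1 x 5) one-hot vector times (5 x 3) matrix gives a (1 x 3) vector.
product = [
    sum(one_hot[i] * W[i][d] for i in range(len(vocab)))
    for d in range(len(W[0]))
]

print(product)
print(product == W[index])  # the product is exactly the row for "child"
```

Because the multiplication only ever selects one row, implementations typically skip it and index into the matrix directly.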

###Semantic and syntactic relationships

If two words appear in similar contexts, Word2Vec should produce similar outputs when they are passed as inputs. To produce similar outputs, the word vectors computed in the hidden layer for these words have to be similar. Word2Vec is therefore pushed to learn similar word vectors for words that occur in similar contexts.

Word2Vec is able to capture multiple different degrees of similarity between words, such that semantic and syntactic patterns can be reproduced using vector arithmetic.

![w2vec](https://i.imgur.com/I66L7No.png)

![w2vec2](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/linear-relationships.png)
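The idea of reproducing a relationship with vector arithmetic can be shown on hand-made vectors. In the sketch below the 2-d vectors are invented for illustration (first axis roughly "royalty", second axis roughly "gender"); real embeddings are learned and have hundreds of dimensions. The helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors, not learned embeddings.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy("king", "man", "woman"))
```

Subtracting `man` removes the "male" component of `king` and adding `woman` puts the "female" component in its place, which lands on the toy vector for `queen`.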

**Skip-gram** - works well with a small amount of training data and represents even rare words or phrases well.

**CBOW** - several times faster to train than skip-gram, with slightly better accuracy for frequent words.

###Word2vec embeddings in Gensim

In [2]:
from gensim.models import Word2Vec
import gensim.downloader

Gensim provides several pretrained word-vector models through its downloader, including word2vec, fastText, and GloVe.

In [3]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Downloading the word2vec model

In [4]:
word2vec = gensim.downloader.load('word2vec-google-news-300')

In [5]:
word2vec['cat'][:20]

array([ 0.0123291 ,  0.20410156, -0.28515625,  0.21679688,  0.11816406,
        0.08300781,  0.04980469, -0.00952148,  0.22070312, -0.12597656,
        0.08056641, -0.5859375 , -0.00445557, -0.296875  , -0.01312256,
       -0.08349609,  0.05053711,  0.15136719, -0.44921875, -0.0135498 ],
      dtype=float32)

In [6]:
word2vec.similarity('dog', 'house')

0.25689757

In [7]:
word2vec.similarity('dog', 'puppy')

0.81064284
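The scores returned by `similarity` are cosine similarities between the two word vectors: close to 1 when the vectors point in nearly the same direction, near 0 when they are unrelated. A minimal sketch of that computation on made-up vectors (the helper name `cosine_similarity` is chosen for this example):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))   # 45 degrees apart
```

Only the direction of the vectors matters here, not their length.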

In [8]:
word2vec.most_similar('cat')

[('cats', 0.8099379539489746),
 ('dog', 0.7609456777572632),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326233983039856),
 ('beagle', 0.7150583267211914),
 ('puppy', 0.7075453996658325),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931377410889),
 ('chihuahua', 0.6709762215614319)]


Vector arithmetic: (king - man) + woman ≈ queen

In [9]:
word2vec.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]

##Assignment

To be uploaded here by December 1st: https://forms.gle/KuR71xiA2rR6tukz8

1. Play around with the word2vec model and see if there are any interesting or counterintuitive similarity results using ```word2vec.similarity``` and ```word2vec.most_similar```.

2. Use word2vec embeddings to encode the data from the sentiment analysis task from Lab 3 and train a classification model.
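For the second task, one common way to turn a sentence into a fixed-size feature vector is to average the vectors of its words. This is only one possible approach, not one prescribed by the lab. The sketch below uses a toy dictionary with made-up 2-d vectors as a stand-in for the pretrained model, and the helper name `sentence_vector` is hypothetical.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Toy stand-in for the pretrained model (made-up 2-d vectors).
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

print(sentence_vector(["a", "good", "movie"], toy_embeddings, dim=2))
print(sentence_vector(["unknown", "words"], toy_embeddings, dim=2))
```

With the model loaded earlier, `word2vec[word]` would play the role of the toy dictionary lookup, and the resulting vectors can be passed to any classifier as features.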

Adapted from https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

Further Reading

- [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://arxiv.org/pdf/1607.06520.pdf)