## Word2vec


Word representation methods from the last lab

- Bag of Words
- TF-IDF

Limitations of these representations

- High-dimensional
- Sparse
- No information about the meaning of words or the relationships between them

Word2vec paper: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

Word2Vec is a shallow, two-layer neural network which is trained to reconstruct linguistic contexts of words.

It takes as its input a large corpus of words and produces a vector space, with each unique word in the corpus being assigned a corresponding vector in the space.


Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Example:    
The **kid** studies mathematics.

The **child** studies mathematics.

![embedding](https://miro.medium.com/max/1400/1*sAJdxEsDjsPMioHyzlN3_A.png)

### Methods for building the Word2vec model

![cbow-skip-gram](https://miro.medium.com/max/1400/1*cuOmGT7NevP9oJFJfVpRKA.png)

### Continuous Bag-of-Words (CBOW)

CBOW predicts target words from the surrounding context words.

![cbow](https://1.bp.blogspot.com/-nZFc7P6o3Yc/XQo2cYPM_ZI/AAAAAAAABxM/XBqYSa06oyQ_sxQzPcgnUxb5msRwDrJrQCLcBGAs/s1600/image001.png)

### Skip-gram

Skip-gram predicts surrounding context words from the target words.

![skip-gram](https://i.stack.imgur.com/fYhXF.png)
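
The difference between the two methods is easiest to see in the training examples they generate. The following is a minimal sketch in plain Python; the helper `context_target_pairs` and the window size of 1 are illustrative assumptions, not part of any library.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


tokens = "the kid studies mathematics".split()

# CBOW: all context words together are used to predict one target word.
cbow_examples = list(context_target_pairs(tokens, window=1))

# Skip-gram: one target word is used to predict each context word separately.
skipgram_examples = [
    (target, ctx) for context, target in cbow_examples for ctx in context
]

print(cbow_examples)
print(skipgram_examples)
```

CBOW produces one example per position, while skip-gram produces one example per (target, context word) pair, so skip-gram sees more training examples for the same text.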


## Architecture

The words are fed as one-hot vectors (a vector of the same length as the vocabulary, filled with zeros except at the index of the word we want to represent, which is set to "1").

The hidden layer is a standard fully-connected (Dense) layer whose weights are the word embeddings.

The output layer outputs probabilities for the target words from the vocabulary.

The goal of this neural network is to learn the weights for the hidden layer matrix.
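
A forward pass through this architecture can be sketched in a few lines of plain Python. The five-word vocabulary, the embedding size of 3 and the random weights are assumptions made for illustration only; the names `W_in`, `W_out` and `forward` are hypothetical.

```python
import math
import random

vocab = ["the", "kid", "child", "studies", "mathematics"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3  # vocabulary size and embedding size (tiny on purpose)

random.seed(0)
# Hidden-layer weights (V x D): one row per word; these become the embeddings.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
# Output-layer weights (D x V): map the hidden vector to one score per word.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word):
    """Return a probability for every vocabulary word, given one input word."""
    hidden = W_in[word_to_id[word]]  # the one-hot input selects a single row
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    return softmax(scores)


probs = forward("kid")
print(dict(zip(vocab, probs)))
```

Training adjusts `W_in` and `W_out` so that the words observed in the context receive high probability. Practical implementations usually avoid computing the full softmax over the vocabulary, for example by using hierarchical softmax or negative sampling.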

![model](https://miro.medium.com/max/1400/1*tmyks7pjdwxODh5-gL3FHQ.png)

High-level illustration of the architecture

![model2](https://i.imgur.com/CBuZay5.png)

The rows of the hidden layer weight matrix are the word vectors (word embeddings).


![hidden-layer](https://i.imgur.com/v6VqHad.png)

The hidden layer operates as a lookup table. The output of the hidden layer is just the "word vector" for the input word.

More concretely, if you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the "1".

![vector](https://i.imgur.com/EYhcA5S.png)
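
The lookup behaviour can be checked on a small scale. In this sketch a 5 x 3 matrix stands in for the 10,000 x 300 one, and the weights are made-up numbers chosen so that each row is easy to recognise; `one_hot` and `vec_mat` are hypothetical helpers.

```python
vocab_size, embedding_dim = 5, 3  # stand-ins for 10,000 and 300

# Made-up weight matrix: row i holds the vector of word i.
W = [[float(10 * i + j) for j in range(embedding_dim)] for i in range(vocab_size)]


def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(v, M):
    """Multiply a 1 x n row vector by an n x m matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


word_index = 3
hidden = vec_mat(one_hot(word_index, vocab_size), W)

print(hidden)         # [30.0, 31.0, 32.0]
print(W[word_index])  # [30.0, 31.0, 32.0]
```

Because the product always equals one row of the matrix, implementations can skip the multiplication and index the row directly.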

### Semantic and syntactic relationships

If different words appear in similar contexts, Word2Vec should produce similar outputs when they are passed as inputs. To produce similar outputs, the word vectors computed in the hidden layer for these words have to be similar. Word2Vec is therefore pushed to learn similar word vectors for words that occur in similar contexts.

Word2Vec is able to capture multiple different degrees of similarity between words, such that semantic and syntactic patterns can be reproduced using vector arithmetic.

![w2vec](https://i.imgur.com/I66L7No.png)

![w2vec2](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/linear-relationships.png)

**Skip-gram** - works well with a small amount of training data and represents even rare words or phrases well.

**CBOW** - several times faster to train than skip-gram, with slightly better accuracy for frequent words.

### Word2vec embeddings in Gensim

In [None]:
from gensim.models import Word2Vec
import gensim.downloader

Gensim provides several sets of pretrained word vectors: word2vec, fastText and GloVe.

In [None]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Downloading the word2vec model

In [None]:
word2vec = gensim.downloader.load('word2vec-google-news-300')

In [None]:
word2vec['cat'][:20]

array([ 0.0123291 ,  0.20410156, -0.28515625,  0.21679688,  0.11816406,
        0.08300781,  0.04980469, -0.00952148,  0.22070312, -0.12597656,
        0.08056641, -0.5859375 , -0.00445557, -0.296875  , -0.01312256,
       -0.08349609,  0.05053711,  0.15136719, -0.44921875, -0.0135498 ],
      dtype=float32)

In [None]:
word2vec.similarity('dog', 'house')

0.25689757

In [None]:
word2vec.similarity('dog', 'puppy')

0.81064284
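
`similarity` returns the cosine similarity between the two word vectors. The computation itself can be sketched in plain Python; the 3-dimensional vectors below are made up for illustration and are not values from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, NOT taken from word2vec-google-news-300.
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
house = [0.1, 0.2, 0.9]

sim_dog_puppy = cosine_similarity(dog, puppy)
sim_dog_house = cosine_similarity(dog, house)

print(sim_dog_puppy, sim_dog_house)
```

Vectors pointing in nearly the same direction score close to 1, regardless of their length.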

In [None]:
word2vec.most_similar('cat')

[('cats', 0.8099379539489746),
 ('dog', 0.7609456777572632),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326233983039856),
 ('beagle', 0.7150583267211914),
 ('puppy', 0.7075453996658325),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931377410889),
 ('chihuahua', 0.6709762215614319)]


(king - man) + woman = queen

In [None]:
word2vec.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
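
What happens behind an analogy query can be sketched as follows: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. This is a simplified version under stated assumptions: the vectors are hand-made with axes (royalty, male, female), and the helper `analogy` is hypothetical. Gensim additionally normalises the input vectors before combining them.

```python
import math

# Hand-made vectors with axes (royalty, male, female); purely illustrative.
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
    "princess": [0.7, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the results.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return [(w, cosine(query, vectors[w])) for w in ranked[:topn]]


result = analogy(positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy vectors, `king - man + woman` lands exactly on `queen`; with real embeddings the match is only approximate, which is why the scores in the output above are well below 1.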

In [None]:
word2vec.most_similar(['woman', 'officer'], negative = ['man'])

[('Officer', 0.5694271326065063),
 ('officers', 0.538264274597168),
 ('offi_cer', 0.5283650159835815),
 ('chief', 0.48523107171058655),
 ('deputy', 0.47100305557250977),
 ('patrolwoman', 0.4685642719268799),
 ('policewoman', 0.46202757954597473),
 ('vice_president', 0.461116224527359),
 ('supervisor', 0.4552857577800751),
 ('oficer', 0.4532422721385956)]

## Using word2vec embeddings

We use the pretrained word2vec embeddings from Gensim as features for a sentiment classifier trained on the NLTK `twitter_samples` corpus.

In [None]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [None]:
from nltk.corpus import twitter_samples

pos_tweets = twitter_samples.strings('positive_tweets.json')
print(len(pos_tweets))

neg_tweets = twitter_samples.strings('negative_tweets.json')
print(len(neg_tweets))

5000
5000


In [None]:
import pandas as pd
pos_df = pd.DataFrame(pos_tweets, columns = ['tweet'])
pos_df['label'] = 1

In [None]:
neg_df = pd.DataFrame(neg_tweets, columns = ['tweet'])
neg_df['label'] = 0

In [None]:
data_df = pd.concat([pos_df, neg_df], ignore_index=True)
# data_df = data_df[:20]

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data_df, test_size=0.2, shuffle = True)
print(train_df)
print(test_df)

                                                  tweet  label
6745                 @sweetbabecake yea i guess so :(((      0
8859  I fucking hate when I wake up like at this tim...      0
644         .@sajidislam honored to have you here ! :-)      1
6013                    o otp :( http://t.co/EVislmNp5V      0
2283  @effinoreos HAPPY 15th BIRTHDAY VIANEY!!! (a.k...      1
...                                                 ...    ...
4840  @scousebabe888 Nice Holiday Honey!!!!!!!!!!!!!...      1
3976  @planetjedward GoodMorning ! What's coming nex...      1
7151  I feel like I'm a weird person for shipping Be...      0
6902  I met a new kinds of people, new classmate, ne...      0
5326        @hamzaabasiali exactly but unfortunately :(      0

[8000 rows x 2 columns]
                                                  tweet  label
5648  @jenxmish @wittykrushnic you are the only thin...      0
7425                                   Omg no Amber :((      0
255   @AvinPera  follow @jnlaz

In [None]:
import numpy as np
import tqdm

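def compute_embeddings(df):
    """Represent each tweet as the mean of its in-vocabulary word vectors."""
    doc_embeddings = []
    for _, row in tqdm.tqdm(df.iterrows(), total=len(df.index)):
        words = row['tweet'].split(' ')
        # Keep only words the model knows. The `in` test works across gensim
        # versions, whereas the `.vocab` attribute was removed in gensim 4.
        text_emb = [word2vec[word] for word in words if word in word2vec]

        if len(text_emb) == 0:
            # No known words: fall back to a zero vector of the embedding size.
            doc_embeddings.append(np.zeros(word2vec.vector_size))
            continue

        # Average the word vectors to get one fixed-size vector per tweet.
        doc_embeddings.append(np.mean(text_emb, axis=0))
    return np.array(doc_embeddings)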

In [None]:
X_train_emb = compute_embeddings(train_df)
y_train = train_df['label']

X_test_emb = compute_embeddings(test_df)
y_test = test_df['label']

100%|██████████| 8000/8000 [00:01<00:00, 4282.18it/s]
100%|██████████| 2000/2000 [00:00<00:00, 4114.12it/s]


In [None]:
from sklearn.svm import SVC

svm = SVC(verbose = 2)
svm.fit(X_train_emb, y_train)

[LibSVM]

SVC(verbose=2)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score

y_test_pred = svm.predict(X_test_emb)

print('Accuracy', accuracy_score(y_test, y_test_pred))
print('Precision',precision_score(y_test, y_test_pred))
print('F1 score',f1_score(y_test, y_test_pred))

Accuracy 0.8815
Precision 0.9507803121248499
F1 score 0.869851729818781


# Global Vectors (GloVe)

While Word2Vec is based only on local statistics (the occurrence of words within a local context), [GloVe](https://nlp.stanford.edu/projects/glove/) also incorporates global statistics computed over the whole corpus. This can make it better suited for smaller datasets, as it does not need as much training data.

The model counts every pair of words that occur together within a context window (for a window of size x, two words form a pair if they are at most x positions apart) and stores the counts in a co-occurrence matrix:

<center><img src='https://drive.google.com/uc?export=view&id=1pnX1lPdQItUauHp9W8xJlx8q2lgTe4cJ' width=500></center>

From this matrix, it then computes the probability that word j appears in the context of word i:
$$P(j \mid i) = \frac{X_{ij}}{X_i}$$
where:

- $P(j \mid i)$ is the probability of word $j$ given word $i$
- $X_{ij}$ is the number of times word $j$ appears in the context of word $i$
- $X_i = \sum_k X_{ik}$ is the total number of times any word appears in the context of word $i$
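
The counting step and the formula for P(j | i) can be sketched on a toy corpus. The three sentences, the window size of 2 and the helper `cooccurrence_probability` are assumptions made for illustration; this is plain counting, not the GloVe training procedure itself.

```python
from collections import defaultdict

# Toy corpus, made up for illustration.
corpus = [
    "ice is solid and cold".split(),
    "steam is gas and hot".split(),
    "ice and steam are water".split(),
]
window = 2

# X[i][j] = number of times word j appears within `window` positions of word i.
X = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for pos, word in enumerate(sentence):
        lo = max(0, pos - window)
        hi = min(len(sentence), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word][sentence[ctx_pos]] += 1


def cooccurrence_probability(j, i):
    """P(j | i) = X_ij / X_i, where X_i is the sum of row i."""
    row = X.get(i, {})
    X_i = sum(row.values())
    return row.get(j, 0) / X_i if X_i else 0.0


p_solid_ice = cooccurrence_probability("solid", "ice")
p_gas_ice = cooccurrence_probability("gas", "ice")

print(p_solid_ice)  # 0.25
print(p_gas_ice)    # 0.0
```

Here *ice* has four context words in total (*is*, *solid*, *and*, *steam*), one of which is *solid*, so P(solid | ice) = 1/4, while *gas* never appears near *ice*.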

Based on this we should be able to infer relations between words:

<center><img src='https://nlp.stanford.edu/projects/glove/images/table.png' width=500></center>

Notice how _solid_ is related to _ice_ but not to _steam_, while _gas_ is related to _steam_ but not to _ice_ (the ratio of the two probabilities is very large in the first case and very small in the second). _Water_ and _fashion_, on the other hand, are either related to both words or to neither, so their ratio is close to 1.

Some further derivation leads to the weighted least-squares regression objective that GloVe is trained with. If you want to learn more, you can check [the paper](https://aclanthology.org/D14-1162.pdf).

## Using GloVe

We can load a pretrained GloVe model using the gensim library (or other resources):

In [None]:
import gensim.downloader as api

model = api.load("glove-twitter-100")



We can then use it to look up word embeddings (or call any of the similarity functions that we saw for Word2Vec):

In [None]:
model['system']

array([ 0.43887 ,  0.32601 , -0.28524 , -0.08248 ,  0.43643 ,  0.75065 ,
        0.093945, -0.72626 ,  0.32297 , -0.37128 , -0.23306 ,  0.35499 ,
       -3.1764  ,  0.015004,  0.69725 , -0.15256 ,  0.025449, -0.058944,
        0.20002 , -0.61298 , -0.79661 ,  0.53051 ,  0.64765 ,  0.90153 ,
       -0.27407 ,  0.52871 ,  0.39344 ,  0.56076 ,  0.31942 ,  0.83347 ,
       -0.53268 , -1.0166  , -0.25328 , -0.17347 ,  0.68794 ,  0.25902 ,
        0.42864 ,  0.3844  , -0.071415, -0.026013, -0.42733 ,  0.58874 ,
       -0.30061 , -0.18357 ,  0.21158 , -0.72648 , -0.48477 ,  0.43527 ,
       -0.37412 , -0.48493 ,  0.26264 ,  0.21684 , -0.8822  ,  0.57925 ,
       -0.54    ,  0.7147  , -0.33133 , -0.44715 , -0.40713 , -0.014364,
       -0.083808,  0.45569 , -0.094374,  0.56057 ,  0.65446 , -0.45768 ,
        0.2522  ,  0.34328 , -0.061001, -0.4899  ,  0.3342  ,  0.41277 ,
       -0.55403 ,  0.30807 ,  0.22867 , -0.53921 ,  0.16439 ,  0.021561,
        0.15131 , -0.70287 ,  1.4152  ,  0.83387 , 

Or you can train your own model from scratch:

In [None]:
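# `glove` is assumed to be the third-party glove-python package, and
# `common_texts` is assumed to be the small toy corpus shipped with gensim.
from gensim.test.utils import common_texts
from glove import Corpus, Glove

# Build the co-occurrence matrix with a context window of 4.
corpus = Corpus()
corpus.fit(common_texts, window=4)

# Fit 4-dimensional embeddings on the co-occurrence matrix.
glove = Glove(no_components=4, learning_rate=0.1)
glove.fit(corpus.matrix, epochs=10, no_threads=8, verbose=True)
glove.add_dictionary(corpus.dictionary)  # attach the word-to-id mapping
glove.save('glove.model.txt')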

# FastText

The last embedding technique that we will talk about is FastText. Like Word2Vec, FastText uses Skip-gram and CBOW, but instead of learning a single vector for each whole word, it splits words into sequences of characters (character n-grams). This helps the model generalize better, especially with rare words, as it learns prefixes and suffixes along with other short sequences that convey information.

If we split the word _artificial_ into n-grams of size 3, after padding it with the boundary symbols `<` and `>`, the representation will be: `<ar`, `art`, `rti`, `tif`, `ifi`, `fic`, `ici`, `cia`, `ial`, `al>`. Training then continues much as with word2vec. The full explanation is in [the paper](https://aclanthology.org/E17-2068.pdf) and code snippets are in the [documentation](https://fasttext.cc).
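
The splitting step can be sketched in a couple of lines. The helper `char_ngrams` is a hypothetical name, and a single n-gram size is assumed for simplicity; FastText itself works with a range of n-gram sizes and also keeps the whole word.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, padded with < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


ngrams = char_ngrams("artificial")
print(ngrams)
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
```

The boundary symbols let the model tell a prefix such as `<ar` apart from the same letters in the middle of another word.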

# Principal Component Analysis (PCA)

PCA is a dimensionality reduction algorithm, which means we can use it to project high-dimensional data such as embeddings down to 2D or 3D for visualisation. Here is an example of how you can use it to see the distance between embeddings in 2D:

In [None]:
from sklearn.decomposition import PCA

text = ['system', 'graph', 'trees', 'user']
embeddings = [model[word] for word in text]

pca = PCA(n_components=2)
pca.fit(embeddings)
vectors_2d = pca.transform(embeddings)

PCA is fitted the same way as any other scikit-learn model (`fit`, then `transform`). The projected points can then be visualized with a plotting library such as matplotlib:

In [None]:
import matplotlib.pyplot as plt

x = [v[0] for v in vectors_2d]
y = [v[1] for v in vectors_2d]

fig, ax = plt.subplots()
ax.scatter(x, y)

for i, txt in enumerate(text):
    ax.annotate(txt, (x[i], y[i]))

plt.show()

## Exercises

1. Play around with the word2vec model and see if there are any interesting or counterintuitive similarity results using `word2vec.similarity` and `word2vec.most_similar`.

2. Use other embeddings (glove, fasttext) to encode the data from the sentiment analysis task and train the classification model.

3. Write your own implementation for Bag of Words from scratch. You should be able to set whether the representation will be binary or frequency-based.

4. Implement your own TF-IDF from scratch. You can use as many helper functions as you want.

5. Create the (context, target) pairs and train a neural network for either skip-gram or continuous bag of words. You should quantify each word with a unique id and use padding at the beginning and end of the text for training on the marginal terms.

6. Visualise the distance between a few words in 2D using PCA (or another dimensionality reduction technique).

7. Compare these embeddings using any means (e.g. training time, most similar word to X, distances in a 2D space, accuracy with an SVM, etc.). Also compare the library versions with your own implementations.


Notebook adapted from https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

Further Reading

- [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://arxiv.org/pdf/1607.06520.pdf)
- [Debiaswe: try to make word embeddings less sexist](https://github.com/tolga-b/debiaswe)
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
- [Fasttext Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)
- [Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/)