# Word Embeddings 

In this notebook we walk through word embeddings produced by deep learning models. We will not train a new model; instead we will use pre-trained ones, because training from scratch is expensive.

We will use `spacy` in this tutorial to demonstrate word embeddings.

In [1]:
# Update pip tooling and install spaCy
# pip install -U pip setuptools wheel
# pip install -U spacy

# Download the English model
# python -m spacy download en_core_web_md

In [3]:
import spacy
import pandas as pd
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

cm = sns.light_palette("blue", as_cmap=True)
nlp = spacy.load('en_core_web_md')

In [4]:
words = ['cat', 'dog', 'car', 'bird', 'eagle','tiger','lion']
vectors = [nlp(word).vector for word in words]

In [5]:
similarities = cosine_similarity(vectors, vectors)
pd.DataFrame(similarities, columns=words, index=words).style.background_gradient(cmap=cm)

| | cat | dog | car | bird | eagle | tiger | lion |
|---|---|---|---|---|---|---|---|
| cat | 1.0 | 0.801686 | 0.319075 | 0.523687 | 0.324779 | 0.541339 | 0.526544 |
| dog | 0.801686 | 1.0 | 0.356292 | 0.478755 | 0.289382 | 0.436547 | 0.474245 |
| car | 0.319075 | 0.356292 | 1.0 | 0.223812 | 0.22869 | 0.166127 | 0.175249 |
| bird | 0.523687 | 0.478755 | 0.223812 | 1.0 | 0.572219 | 0.493906 | 0.492987 |
| eagle | 0.324779 | 0.289382 | 0.22869 | 0.572219 | 1.0 | 0.545475 | 0.591164 |
| tiger | 0.541339 | 0.436547 | 0.166127 | 0.493906 | 0.545475 | 1.0 | 0.735983 |
| lion | 0.526544 | 0.474245 | 0.175249 | 0.492987 | 0.591164 | 0.735983 | 1.0 |
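The table above is a cosine-similarity matrix: each entry is the dot product of two word vectors divided by the product of their lengths, so 1.0 means the vectors point in the same direction and values near 0 mean they are nearly orthogonal. Below is a minimal NumPy sketch of the same computation. It uses made-up 3-dimensional vectors rather than real spaCy embeddings, and `cosine_similarity_matrix` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    X = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the dot product of two unit
    # vectors is the cosine of the angle between them.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms
    return X_unit @ X_unit.T

# Toy vectors (illustrative values only, not real embeddings)
toy = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

This sketch does not guard against all-zero rows, which would cause a division by zero; real code should handle that case explicitly.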


# Vectors

Each vector produced by the `spacy` model is 300-dimensional and comes from a pre-trained GloVe model.

In [8]:
vector = nlp("Bank").vector
print(vector.shape)
print(vector[:5])

(300,)
[-0.60877  0.30253 -0.12351 -0.23647  0.2665 ]


## Embeddings as features

We can use word embeddings as features of a text and build a classifier on top of them.

In [9]:
import numpy as np
from tqdm.auto import tqdm
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
x_train, y_train = fetch_20newsgroups(categories=categories, 
                          remove=('headers', 'footers', 'quotes'), return_X_y=True)
x_test, y_test = fetch_20newsgroups(categories=categories, 
                          remove=('headers', 'footers', 'quotes'), return_X_y=True, subset='test')

In [11]:
x_train_v = np.zeros((len(x_train), 300))
x_test_v = np.zeros((len(x_test), 300))

for i, doc in tqdm(enumerate(nlp.pipe(x_train)), total=len(x_train)):
    x_train_v[i, :] = doc.vector

for i, doc in tqdm(enumerate(nlp.pipe(x_test)), total=len(x_test)):
    x_test_v[i, :] = doc.vector
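`doc.vector` gives one fixed-size vector per document; by default spaCy computes it as the average of the document's token vectors. Below is a rough sketch of that pooling step with a hypothetical toy lookup table. `toy_vectors` and `mean_pool` are illustrative names only; the real pipeline uses spaCy's tokenizer and 300-dimensional vectors.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, for illustration only
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.0, 0.2]),
    "cat": np.array([0.9, 0.3, 0.1, 0.0]),
    "sat": np.array([0.2, 0.8, 0.4, 0.1]),
}
DIM = 4

def mean_pool(text):
    """Average the vectors of the whitespace-separated tokens in `text`."""
    # Unknown tokens get a zero vector in this sketch
    rows = [toy_vectors.get(tok, np.zeros(DIM)) for tok in text.lower().split()]
    if not rows:
        return np.zeros(DIM)
    return np.mean(rows, axis=0)

doc_vec = mean_pool("The cat sat")
print(doc_vec.shape)
```

Averaging discards word order, so two texts containing the same words in a different order map to the same vector.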


# Train a classifier

In [42]:
clf = LinearSVC()
clf.fit(x_train_v, y_train)
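# fetch_20newsgroups numbers the targets in alphabetical order of category name,
# so the display names must be sorted to line up with the integer labels.
print(classification_report(y_test, clf.predict(x_test_v), target_names=sorted(categories)))

                        precision    recall  f1-score   support

           alt.atheism       0.72      0.65      0.68       319
         comp.graphics       0.91      0.90      0.91       389
               sci.med       0.84      0.88      0.86       396
soc.religion.christian       0.80      0.83      0.82       398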

              accuracy                           0.83      1502
             macro avg       0.82      0.82      0.82      1502
          weighted avg       0.82      0.83      0.82      1502
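The last two rows of the report summarize the per-class scores in two ways: the macro average is the plain mean over classes, while the weighted average weights each class by its support. As a quick check, both can be recomputed from the rounded F1 scores and supports printed above. Because the inputs are already rounded, the results only agree with the report to two decimals.

```python
import numpy as np

# Per-class F1 scores and supports, copied from the report rows in order
f1 = np.array([0.68, 0.91, 0.86, 0.82])
support = np.array([319, 389, 396, 398])

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # larger classes count more

print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```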



# Get the most similar documents

In [43]:
import random
from termcolor import colored

# Targets are numbered in alphabetical order of category name
label_names = sorted(categories)

for i in random.choices(range(len(x_test_v)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(label_names[y_test[i]], 'green'))
    # Cosine similarity between this test document and every training document
    similarities = cosine_similarity([x_test_v[i]], x_train_v).flatten()
    # Training indices ordered from most to least similar
    indices = np.argsort(similarities)[::-1]
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is",
              f"{colored(label_names[y_train[j]], 'green' if y_train[j]==y_test[i] else 'red')}",
              f"similarity score: {colored(round(similarities[j], 3), 'yellow')}")

ID: 1252
True label: comp.graphics
0 nearest label is comp.graphics similarity score: 0.978
1 nearest label is comp.graphics similarity score: 0.975
2 nearest label is comp.graphics similarity score: 0.974
ID: 1466
True label: sci.med
0 nearest label is alt.atheism similarity score: 0.992
1 nearest label is alt.atheism similarity score: 0.991
2 nearest label is soc.religion.christian similarity score: 0.99
ID: 867
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity score: 0.984
1 nearest label is alt.atheism similarity score: 0.983
2 nearest label is soc.religion.christian similarity score: 0.983
ID: 1053
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity score: 0.98
1 nearest label is alt.atheism similarity score: 0.98
2 nearest l
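The loop above ranks every training document by cosine similarity to a test document and keeps the three best matches. Here is the same retrieval step as a small standalone function, shown on random vectors standing in for the document embeddings. `top_k_similar` is a hypothetical helper, not a library call.

```python
import numpy as np

def top_k_similar(query, corpus, k=3):
    """Return (indices, scores) of the k rows of `corpus` most similar to `query`."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Cosine similarity = dot product of L2-normalized vectors
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    # argsort is ascending, so reverse it and keep the first k
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8))               # stand-in for x_train_v
query = corpus[42] + 0.01 * rng.normal(size=8)   # a near-duplicate of row 42

indices, scores = top_k_similar(query, corpus, k=3)
print(indices, np.round(scores, 3))
```

With the notebook's arrays this would be called as `top_k_similar(x_test_v[i], x_train_v)`. The scores in the output above are all close to 1, so the ranking is more informative than the absolute values.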

# Conclusion

- Word embeddings are powerful features, especially when you have little training data: your model builds on what the pre-trained embedding model has already learned and can therefore make better predictions.
- Word2vec and GloVe don't account for the different meanings the same word can have in different sentences; each word gets a single, fixed vector.
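
The second point can be illustrated with a toy static lookup table (hypothetical values, not real word2vec or GloVe vectors): because the vector is looked up by word alone, "bank" gets the same representation in a financial sentence and in a river sentence.

```python
import numpy as np

# Hypothetical static embedding table: one fixed vector per word
static_table = {
    "bank": np.array([0.5, -0.2, 0.7]),
    "river": np.array([0.1, 0.9, 0.3]),
    "money": np.array([0.8, -0.4, 0.2]),
}

def embed_word(word, sentence):
    """Look up `word`; the surrounding `sentence` is deliberately ignored."""
    return static_table[word]

v_finance = embed_word("bank", "she deposited money at the bank")
v_river = embed_word("bank", "they sat on the river bank")

print(np.array_equal(v_finance, v_river))  # True: context never enters the lookup
```

Contextual embedding models instead compute each token's vector from its surrounding sentence.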