---

# Skip Gram Model


 - Python's Gensim module: https://radimrehurek.com/gensim/ (install using pip)

    Note: The most important hyperparameters of skip-gram/CBOW are vector size and window size.
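
To make the window-size hyperparameter concrete, here is a minimal sketch of how skip-gram training pairs can be generated from a sentence. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of Gensim; real implementations typically also vary the effective window per target word and subsample frequent words.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context spans up to `window_size` positions on each side of the target.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sentence, used only for illustration.
sentence = ["the", "quick", "brown", "fox", "jumps"]
pairs_w1 = skipgram_pairs(sentence, window_size=1)
pairs_w2 = skipgram_pairs(sentence, window_size=2)

print(len(pairs_w1), len(pairs_w2))  # a wider window yields more training pairs
print(pairs_w1[:3])
```

A larger window gives each target word more (and more distant) context words, which tends to shift the learned similarity from syntactic toward topical relatedness.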


In [2]:
!pip install gensim
import pandas as pd
import numpy as np
import gensim



In [3]:
import gensim.downloader as api

model = api.load('word2vec-google-news-300') # this step might take ~10-15 minutes



Finding the cosine similarity between the following word pairs:
- (President, Election)
- (Hot, Warm)
- (England, London)
- (France, ball)
- (small, smaller)

In [4]:
# Replace 0 with the code / value; do not delete this cell
from scipy.spatial.distance import cosine

word_pairs = [('President', 'Election'), ('Hot', 'Warm'), ('England', 'London'),
              ('France', 'ball'), ('small', 'smaller')]

def pair_similarity(w1, w2):
    # scipy's cosine() returns the cosine distance, so similarity = 1 - distance.
    # Fall back to 0 when either word is missing from the vocabulary.
    return 1 - cosine(model[w1], model[w2]) if w1 in model and w2 in model else 0

similarity_pair1 = pair_similarity(*word_pairs[0])
similarity_pair2 = pair_similarity(*word_pairs[1])
similarity_pair3 = pair_similarity(*word_pairs[2])
similarity_pair4 = pair_similarity(*word_pairs[3])
similarity_pair5 = pair_similarity(*word_pairs[4])
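
For reference, the quantity computed in the cell above can be written out directly. This is a minimal NumPy sketch of cosine similarity on toy 2-d vectors (the vectors are made up for illustration and are not taken from the model).

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors standing in for word embeddings.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

sim_ab = cosine_similarity(a, b)      # vectors 45 degrees apart
sim_ac = cosine_similarity(a, c)      # orthogonal vectors
sim_bb = cosine_similarity(b, 3 * b)  # same direction, different length
print(sim_ab, sim_ac, sim_bb)
```

Because cosine similarity ignores vector length, `b` and `3 * b` are treated as identical. Gensim's `KeyedVectors` also exposes `similarity(w1, w2)`, which should return the same value as `1 - cosine(...)` for two in-vocabulary words.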

Writing expressions to extract the vector representations of the following words (only the first five components of each vector are kept):

- France
- England
- smaller
- bigger
- rocket
- big

In [6]:
vector_1 = model['France'][:5] if 'France' in model else 0
vector_2 = model['England'][:5] if 'England' in model else 0
vector_3 = model['smaller'][:5] if 'smaller' in model else 0
vector_4 = model['bigger'][:5] if 'bigger' in model else 0
vector_5 = model['rocket'][:5] if 'rocket' in model else 0
vector_6 = model['big'][:5] if 'big' in model else 0

Finding the Euclidean distances between the word pairs:

- (France, England)
- (smaller, bigger)
- (England, London)
- (France, Rocket)
- (big, bigger)


In [8]:
import numpy as np
eu_dist1 = np.linalg.norm(model['France'] - model['England']) if 'France' in model and 'England' in model else 0
eu_dist2 = np.linalg.norm(model['smaller'] - model['bigger']) if 'smaller' in model and 'bigger' in model else 0
eu_dist3 = np.linalg.norm(model['England'] - model['London']) if 'England' in model and 'London' in model else 0
eu_dist4 = np.linalg.norm(model['France'] - model['Rocket']) if 'France' in model and 'Rocket' in model else 0
eu_dist5 = np.linalg.norm(model['big'] - model['bigger']) if 'big' in model and 'bigger' in model else 0

In [9]:
print(eu_dist1)
print(eu_dist2)
print(eu_dist3)
print(eu_dist4)
print(eu_dist5)


3.0151067
1.8618743
2.8752837
3.892071
1.9586496
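
Several of the distances above exceed 2, the largest possible distance between two unit-length vectors, so the raw embeddings are not normalized and their lengths contribute to the result. The sketch below, using random toy vectors rather than model output, checks the identity that links Euclidean distance to cosine similarity once both vectors are scaled to unit length.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 300-dimensional vectors standing in for two word embeddings.
u = rng.normal(size=300)
v = rng.normal(size=300)

# Scale both to unit length so that only direction matters.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cos_sim = float(np.dot(u_hat, v_hat))
dist_sq = float(np.linalg.norm(u_hat - v_hat) ** 2)

# For unit vectors: |u - v|^2 = |u|^2 + |v|^2 - 2 u.v = 2 - 2 cos(theta)
print(dist_sq, 2 - 2 * cos_sim)
```

After normalization the two measures rank neighbours identically; on raw vectors they can disagree.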


Using Word2Vec to find the 2 closest words:
- (King - Man + Queen)
- (bigger - big + small)
- (waiting - wait + run)
- (Texas + Wisconsin - Milwaukee)

In [10]:
closest1 = model.most_similar(positive=['King', 'Queen'], negative=['Man'], topn=2)
closest2 = model.most_similar(positive=['bigger', 'small'], negative=['big'], topn=2)
closest3 = model.most_similar(positive=['waiting', 'run'], negative=['wait'], topn=2)
closest4 = model.most_similar(positive=['Texas', 'Wisconsin'], negative=['Milwaukee'], topn=2)

In [11]:
print(closest1)
print(closest2)
print(closest3)
print(closest4)

[('Queen_Elizabeth', 0.5257916450500488), ('monarch', 0.5004087090492249)]
[('larger', 0.7402471303939819), ('smaller', 0.7329993844032288)]
[('running', 0.5654535889625549), ('runs', 0.49639999866485596)]
[('Nebraska', 0.6184834241867065), ('Arkansas', 0.5827385783195496)]
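
The idea behind these analogy queries can be sketched on a toy vocabulary: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The embeddings and the `analogy` helper below are hypothetical and hand-picked for illustration; this is not Gensim's implementation.

```python
import numpy as np

# Hypothetical 3-d embeddings, chosen by hand for illustration only.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def analogy(positive, negative, topn=2):
    # Add the unit vectors of positive words, subtract those of negative words.
    query = sum(unit(toy[w]) for w in positive) - sum(unit(toy[w]) for w in negative)
    query = unit(query)
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in toy.items() if w not in exclude]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]


result = analogy(positive=["king", "woman"], negative=["man"], topn=2)
print(result)  # 'queen' ranks first for these toy vectors
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs.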


***Applying K-means clustering to a sample of the Google News vocabulary to find the most representative words***

In [12]:
from sklearn.cluster import KMeans
import random
np.random.seed(42)
random.seed(42)
words = list(model.index_to_key)
sample1 = random.sample(words, 25000)
vectors = np.array([model[i] for i in sample1])
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(vectors)
cluster_centers = kmeans.cluster_centers_
most_rep_cluster1 = model.similar_by_vector(cluster_centers[0], topn=5)
most_rep_cluster2 = model.similar_by_vector(cluster_centers[1], topn=5)

In [13]:
print(most_rep_cluster1)
print(most_rep_cluster2)

[('http_dol##.net_index###.html_http', 0.9178717732429504), ('dol##.net_index####.html_http_dol##.net', 0.907823920249939), ('index###.html_http_dol##.net_index###.html', 0.906944751739502), ('Deltagen_undertakes', 0.9038156270980835), ('By_TRICIA_SCRUGGS', 0.9010686874389648)]
[('Emil_Protalinski_Published', 0.9201086163520813), ('By_HuDie_####-##-##', 0.9168179035186768), ('By_QianMian_####-##-##', 0.9161686301231384), ('BY_GEOFF_KOHL', 0.914146363735199), ('By_XiaoBing_####-##-##', 0.9127659797668457)]
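
`similar_by_vector` ranks the model's entire vocabulary, not only the 25,000 sampled words, which may be one reason the neighbours above are byline and URL tokens. One possible alternative, sketched here on toy data, is to rank only the sampled words against each centre. The helper `closest_members` and the toy arrays are assumptions for illustration; note that it ranks by cosine similarity, whereas K-means itself assigns points by Euclidean distance.

```python
import numpy as np


def closest_members(vectors, words, center, topn=3):
    """Rank the sampled words by cosine similarity to one cluster centre."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_center = center / np.linalg.norm(center)
    sims = unit_vectors @ unit_center
    order = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in order]


# Toy stand-ins for `sample1`, `vectors` and `kmeans.cluster_centers_`.
toy_words = ["cat", "dog", "horse", "car", "bus", "train"]
toy_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.3],   # one direction in the toy space
    [0.1, 1.0], [0.2, 0.9], [0.3, 0.8],   # a second, distinct direction
])
toy_centers = np.array([toy_vectors[:3].mean(axis=0),
                        toy_vectors[3:].mean(axis=0)])

rep_cluster1 = closest_members(toy_vectors, toy_words, toy_centers[0], topn=2)
rep_cluster2 = closest_members(toy_vectors, toy_words, toy_centers[1], topn=2)
print(rep_cluster1)
print(rep_cluster2)
```

With the real data, the same call would take `vectors`, `sample1`, and each row of `cluster_centers` as arguments.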


Categorical cross-entropy is the primary loss function for the skip-gram model. Skip-gram's goal is to predict the context words of a given target word. Categorical cross-entropy is paired with a softmax output, which produces a probability distribution over every word in the vocabulary. Minimizing this loss pushes the probability assigned to the observed context words toward 1, which makes it a natural objective for optimization.
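
As a small worked example of the loss described above, the sketch below applies softmax to hypothetical scores over a four-word vocabulary and computes the cross-entropy for two different "true" context words. The vocabulary and scores are invented for illustration.

```python
import numpy as np


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def cross_entropy(probs, true_index):
    """Categorical cross-entropy for a single one-hot target."""
    return float(-np.log(probs[true_index]))


# Hypothetical scores: the target word's vector dotted with each output vector.
vocab = ["cat", "sat", "mat", "rocket"]
scores = np.array([2.0, 1.0, 0.5, -1.0])
probs = softmax(scores)

# Loss when the true context word received the highest score ...
loss_likely = cross_entropy(probs, vocab.index("cat"))
# ... versus when it received the lowest score.
loss_unlikely = cross_entropy(probs, vocab.index("rocket"))
print(probs.round(3), loss_likely, loss_unlikely)
```

The loss is small when the model already assigns high probability to the observed context word and grows as that probability shrinks, which is what gradient descent then corrects.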

Finding at least 2 interesting word-vector combinations:

In [15]:
result1 = model.most_similar(positive=['Paris', "Italy"], negative=['France'], topn=2)
result2 = model.most_similar(positive=['teacher', 'man'], negative=['woman'], topn=2)
result3 = model.most_similar(positive=['teacher', 'woman'], negative=['man'], topn=2)
result4 = model.most_similar(positive=['summer', 'sweater'], negative=['winter'], topn=2)

print(result1)
print(result2)
print(result3)
print(result4)

[('Milan', 0.7222141623497009), ('Rome', 0.702830970287323)]
[('teachers', 0.5810958743095398), ('PE_teacher', 0.556725800037384)]
[('teachers', 0.6448071002960205), ('guidance_counselor', 0.6279474496841431)]
[('shirt', 0.6057167053222656), ('blazer', 0.5627408027648926)]
