# Exploring Semantic Relationships with Word Embeddings

Types of embeddings:
- word2vec
- GloVe
- FastText
- Deep contextualized embeddings

### Blueprint: Using Similarity Queries on Pretrained Models


**Loading a pretrained model**

In [1]:
import os

# Store the Gensim models locally.
os.environ["GENSIM_DATA_DIR"] = "./models"

In [2]:
import gensim.downloader as api
import pandas as pd

info_df = pd.DataFrame.from_dict(api.info()["models"], orient="index")
info_df[["file_size", "base_dataset", "parameters"]].head(5)

| | file_size | base_dataset | parameters |
|---|---|---|---|
| fasttext-wiki-news-subwords-300 | 1005007000.0 | Wikipedia 2017, UMBC webbase corpus and statmt... | {'dimension': 300} |
| conceptnet-numberbatch-17-06-300 | 1225498000.0 | ConceptNet, word2vec, GloVe, and OpenSubtitles... | {'dimension': 300} |
| word2vec-ruscorpora-300 | 208427400.0 | Russian National Corpus (about 250M words) | {'dimension': 300, 'window_size': 10} |
| word2vec-google-news-300 | 1743564000.0 | Google News (about 100 billion words) | {'dimension': 300} |
| glove-wiki-gigaword-50 | 69182540.0 | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | {'dimension': 50} |


In [3]:
model = api.load("glove-wiki-gigaword-50")

## Similarity Queries

In [4]:
v_king = model["king"]
v_queen = model["queen"]

print("Vector size:", model.vector_size)
print("v_king =", v_king[:10])
print("v_queen =", v_queen[:10])
print("similarity:", model.similarity("king", "queen"))

Vector size: 50
v_king = [ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012 ]
v_queen = [ 0.37854   1.8233   -1.2648   -0.1043    0.35829   0.60029  -0.17538
  0.83767  -0.056798 -0.75795 ]
similarity: 0.7839043


In [5]:
model.most_similar("king", topn=3)

[('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825)]

In [6]:
v_lion = model["lion"]
v_nano = model["nanotechnology"]

model.cosine_similarities(v_king, [v_queen, v_lion, v_nano])

array([ 0.7839043 ,  0.47800118, -0.25490996], dtype=float32)
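The scores returned by `similarity`, `most_similar`, and `cosine_similarities` are cosine similarities. As a minimal sketch in plain Python, the computation looks roughly like this. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of Gensim and not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, made up for illustration.
v_a = [1.0, 2.0, 0.0]
v_b = [2.0, 4.0, 0.0]   # same direction as v_a, different length
v_c = [-2.0, 1.0, 0.0]  # orthogonal to v_a

sim_ab = cosine_similarity(v_a, v_b)
sim_ac = cosine_similarity(v_a, v_c)
print(sim_ab, sim_ac)
```

Because only the direction of the vectors matters, `v_a` and `v_b` are maximally similar even though their lengths differ, while the orthogonal pair scores approximately zero.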

In [7]:
# Note: this query uses the plural "women"; the classic analogy uses "woman".
model.most_similar(positive=["women", "king"], negative=["man"], topn=3)

[('kingdom', 0.701315701007843),
 ('queen', 0.6152784824371338),
 ('invited', 0.6111606359481812)]

In [8]:
model.most_similar(positive=["paris", "germany"], negative=["france"], topn=3)

[('berlin', 0.9203965663909912),
 ('frankfurt', 0.8201637268066406),
 ('vienna', 0.8182448744773865)]

In [9]:
model.most_similar(positive=["france", "capital"], topn=1)

[('paris', 0.7835100293159485)]
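The analogy queries above combine word vectors arithmetically: positive words are added, negative words are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result. The following is a simplified sketch of that idea, not Gensim's actual implementation. The helper names `unit` and `analogy` and the two-dimensional toy vectors are assumptions made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (hypothetical): first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])
```

In this toy space, subtracting "man" and adding "woman" flips the second axis of "king", so the nearest remaining word is "queen".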

## Blueprints for Training and Evaluating Your Own Embeddings

### Data Preparation

1. Remove unwanted tokens (symbols, tags, etc.) from the text.
2. Convert all words to lowercase.
3. Use lemmas.
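The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the `prepare` function and the `TOY_LEMMAS` lookup table are hypothetical stand-ins, and a real pipeline would use a proper lemmatizer instead of a hand-written dictionary.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"belts": "belt", "were": "be", "replaced": "replace"}


def prepare(text):
    """Strip tags and symbols, lowercase, and map tokens to lemmas."""
    text = re.sub(r"<[^>]+>", " ", text)           # Step 1: drop HTML-like tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # Step 1: drop symbols
    tokens = text.lower().split()                  # Step 2: lowercase
    return [TOY_LEMMAS.get(t, t) for t in tokens]  # Step 3: lemmas


tokens = prepare("<p>The timing BELTS were replaced!</p>")
print(tokens)
```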

In [10]:
import sqlite3

db_name = "data/reddit-selfposts.db"

con = sqlite3.connect(db_name)
df = pd.read_sql("select subreddit, lemmas, text from posts_nlp", con)
con.close()

df["lemmas"] = df["lemmas"].str.lower().str.split()  # Lower case tokens
sents = df["lemmas"]  # Our training sentences

### Phrases

In [11]:
from gensim.models.phrases import Phrases, npmi_scorer

phrases = Phrases(
    sents, min_count=10, threshold=0.3, delimiter="-", scoring=npmi_scorer
)

In [12]:
sent = "I had to replace the timing belt in my mercedes c300".split()
phrased = phrases[sent]
print(*phrased, sep="|")

I|had|to|replace|the|timing-belt|in|my|mercedes-c300


In [34]:
phrase_df = pd.DataFrame.from_dict(
    phrases.export_phrases(), orient="index", columns=["score"]
)
phrase_df.index.name = "phrase"
phrase_df = phrase_df.reset_index()
phrase_df = (
    phrase_df[["phrase", "score"]]
    .drop_duplicates()
    .sort_values(by="score", ascending=False)
    .reset_index(drop=True)
)

In [40]:
phrase_df[phrase_df["phrase"].str.contains("mercedes")]

| | phrase | score |
|---|---|---|
| 87 | mercedes-benz | 0.800502 |


Based on this result, the threshold should be larger than 0.5 and smaller than 0.8.
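To make the threshold easier to interpret, here is a minimal sketch of the normalized pointwise mutual information (NPMI) score that the threshold is compared against. The function name `npmi` and all counts are made up for illustration; Gensim's scorer additionally takes `min_count` into account.

```python
import math


def npmi(count_a, count_b, count_ab, total_words):
    """NPMI of the bigram (a, b); ranges from -1 to 1."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Two words that only ever occur together score (close to) 1.
always_together = npmi(10, 10, 10, 1000)
# Two words that co-occur exactly as often as chance predicts score about 0.
independent = npmi(100, 100, 10, 1000)

threshold = 0.7
print(always_together > threshold, independent > threshold)
```

A bigram is merged into a phrase only if its score lies above the threshold, so raising the threshold keeps strongly associated pairs and drops weaker ones.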

In [42]:
phrases = Phrases(
    sents, min_count=10, threshold=0.7, delimiter="-", scoring=npmi_scorer
)
df["phrased_lemmas"] = df["lemmas"].map(lambda s: phrases[s])
sents = df["phrased_lemmas"]

### Blueprint: Training Models with Gensim

In [46]:
from gensim.models import Word2Vec

model = Word2Vec(
    sents,  # Tokenized input sentences
    vector_size=100,  # Size of word vectors (default 100)
    window=2,  # Context window size (default 5)
    sg=1,  # Use skip-gram (default 0 = CBOW)
    negative=5,  # Number of negative samples (default 5)
    min_count=5,  # Ignore infrequent words (default 5)
    workers=4,  # Number of threads (default 3)
    epochs=5,  # Number of epochs (default 5)
)
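The `window` parameter controls how many neighbors on each side of a word count as its context. As a rough sketch, the (center, context) training pairs for skip-gram could be generated as follows. The function `skipgram_pairs` is a hypothetical helper, and the sketch ignores refinements of real implementations such as varying the effective window per position, subsampling, and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for all tokens within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


sentence = ["replace", "the", "timing-belt", "today"]
pairs = list(skipgram_pairs(sentence, window=1))
wide_pairs = list(skipgram_pairs(sentence, window=2))
print(len(pairs), len(wide_pairs))
```

A small window yields only pairs of directly adjacent words, while a larger window also pairs words that are further apart in the sentence.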

In [47]:
model.save("./models/autos_w2v_100_2_full.bin")

In [49]:
# !pip install fasttext

In [54]:
from gensim.models import FastText, Word2Vec

model_path = "./models"
model_prefix = "autos"

param_grid = {
    "w2v": {"variant": ["cbow", "sg"], "window": [2, 5, 30]},
    "ft": {"variant": ["sg"], "window": [5]},
}
size = 100

for algo, params in param_grid.items():
    for variant in params["variant"]:
        sg = 1 if variant == "sg" else 0
        for window in params["window"]:
            if algo == "w2v":
                model = Word2Vec(sents, vector_size=size, window=window, sg=sg)
            else:
                model = FastText(sents, vector_size=size, window=window, sg=sg)
            file_name = f"{model_path}/{model_prefix}_{algo}_{variant}_{window}"
            model.wv.save_word2vec_format(file_name + ".bin", binary=True)
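To see which files the grid above writes without training anything, the same naming scheme can be replayed on its own. This sketch only reuses `param_grid` and the prefix from the cell above; the flattened loop with `itertools.product` is an illustrative rewrite, not the original code.

```python
from itertools import product

param_grid = {
    "w2v": {"variant": ["cbow", "sg"], "window": [2, 5, 30]},
    "ft": {"variant": ["sg"], "window": [5]},
}
model_prefix = "autos"

# One file per (algorithm, variant, window) combination.
file_names = [
    f"{model_prefix}_{algo}_{variant}_{window}.bin"
    for algo, params in param_grid.items()
    for variant, window in product(params["variant"], params["window"])
]
print(*file_names, sep="\n")
```

The grid produces seven models in total: six Word2Vec variants and one FastText model.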

### Blueprint: Evaluating Different Models

In [58]:
from gensim.models import KeyedVectors

names = [
    "autos_w2v_cbow_2",
    "autos_w2v_sg_2",
    "autos_w2v_sg_5",
    "autos_w2v_sg_30",
    "autos_ft_sg_5",
]
models = {}

for name in names:
    file_name = f"{model_path}/{name}.bin"
    models[name] = KeyedVectors.load_word2vec_format(file_name, binary=True)

In [59]:
def compare_models(models, **kwargs):
    df = pd.DataFrame()

    for name, model in models:
        df[name] = [
            f"{word} {score:.3f}" for word, score in model.most_similar(**kwargs)
        ]
    df.index = df.index + 1  # Let row index start at 1

    return df

In [60]:
compare_models([(n, models[n]) for n in names], positive="bmw", topn=10)

| | autos_w2v_cbow_2 | autos_w2v_sg_2 | autos_w2v_sg_5 | autos_w2v_sg_30 | autos_ft_sg_5 |
|---|---|---|---|---|---|
| 1 | mercedes 0.866 | mercedes 0.739 | 328i 0.746 | 328i 0.806 | bmws 0.844 |
| 2 | lexus 0.823 | 335i 0.698 | 335i 0.725 | xdrive 0.800 | bmwfs 0.801 |
| 3 | vw 0.803 | mercede 0.697 | benz 0.714 | 335i 0.776 | mercedes_benz 0.770 |
| 4 | mercede 0.795 | porsche 0.689 | mercedes 0.703 | 5-serie 0.763 | m135i 0.763 |
| 5 | subaru 0.792 | benz 0.686 | mercede 0.687 | bmws 0.759 | merc 0.761 |
| 6 | porsche 0.787 | merc 0.670 | merc 0.684 | 535i 0.749 | 525i 0.752 |
| 7 | audi 0.787 | e92 0.669 | 135i 0.679 | 340i 0.746 | 328i 0.750 |
| 8 | benz 0.776 | e39 0.663 | e39 0.678 | f10 0.737 | mercede 0.750 |
| 9 | volvo 0.771 | lexus 0.663 | x5 0.678 | e39 0.736 | mercs 0.749 |
| 10 | volkswagen 0.756 | audi 0.659 | e92 0.677 | x-drive 0.732 | mercedes-benz 0.745 |


**Analogy reasoning on our own model**

What is to "toyota" as "f150" is to "ford"?

In [61]:
compare_models(
    [(n, models[n]) for n in names],
    positive=["f150", "toyota"],
    negative=["ford"],
    topn=5,
).T

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| autos_w2v_cbow_2 | f-150 0.862 | camry 0.826 | s80 0.799 | civic-si 0.799 | e320 0.795 |
| autos_w2v_sg_2 | camry 0.712 | f-150 0.700 | sr5 0.691 | 89 0.673 | nissan-frontier 0.672 |
| autos_w2v_sg_5 | tacoma 0.705 | tundra 0.659 | highlander 0.644 | nissan-frontier 0.640 | f-150 0.638 |
| autos_w2v_sg_30 | 4runner 0.729 | tacoma 0.698 | tacomas 0.658 | 4wd 0.657 | 4x4 0.651 |
| autos_ft_sg_5 | f150s 0.759 | tacomas 0.746 | toyotas 0.737 | toyo 0.734 | tacoma 0.731 |


**Interpretation**

In reality, the Toyota Tacoma and the Toyota Tundra are both direct competitors to the Ford F-150.
The skip-gram model with window size 5 ranks exactly these two models first, so it gives the best result.



## Blueprint for Visualizing Embeddings

### Blueprint: Applying Dimensionality Reduction

In [121]:
# !pip install 'umap-learn[plot]'

In [122]:
model = models["autos_w2v_sg_30"]
words = model.key_to_index
wv = [model[word] for word in words]

In [125]:
import umap

reducer = umap.UMAP(n_components=2, metric="cosine", n_neighbors=15, min_dist=0.1)
reduced_wv = reducer.fit_transform(wv)

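AttributeError: module 'umap' has no attribute 'UMAP'

This error usually means that `import umap` resolved to something other than the `umap-learn` package, for example the unrelated `umap` package on PyPI or a local file named `umap.py`.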

### Blueprint: Using the TensorFlow Embedding