<a id='top'></a><a name='top'></a>
# Chapter 10: Exploring Semantic Relationships with Word Embeddings

**Blueprints for Text Analysis Using Python**

* [Introduction](#introduction)
* [10.0 Imports and Setup](#10.0)
* [10.1 The Case for Semantic Embeddings](#10.1)
    - [10.1.1 Word Embeddings](#10.1.1)
    - [10.1.2 Analogy Reasoning with Word Embeddings](#10.1.2)
    - [10.1.3 Types of Embeddings](#10.1.3)
* [10.2 Blueprint: Using Similarity Queries on Pretrained Models](#10.2)
    - [10.2.1 Loading a Pretrained Model](#10.2.1)
    - [10.2.2 Similarity Queries](#10.2.2)
* [10.3 Blueprints for Training and Evaluating Your Own Embeddings](#10.3)
    - [10.3.1 Data Preparation](#10.3.1)
    - [10.3.2 Blueprint: Training Models with Gensim](#10.3.2)
    - [10.3.3 Blueprint: Evaluating Different Models](#10.3.3)
* [10.4 Blueprints for Visualizing Embeddings](#10.4)
    - [10.4.1 Blueprint: Applying Dimensionality Reduction](#10.4.1)
    - [10.4.2 Blueprint: Using the TensorFlow Embedding Projector](#10.4.2)
    - [10.4.3 Blueprint: Constructing a Similarity Tree](#10.4.3)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* Reddit Self-Posts dataset (same as in Chapter 4)


* Reddit rspct_autos.tsv.gz: [script](#reddit-selfposts-ch10.db), [source](https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/reddit-selfposts/reddit-selfposts-ch10.db)


### Explore

* How to use pretrained embeddings.
* How to train your own embeddings.
* How to compare and visualize different models.

---
<a name='10.0'></a><a id='10.0'></a>
# 10.0 Imports and Setup
<a href="#top">[back to top]</a>

In [4]:
# Start with clean project
!rm -f *.py 
!rm -f *.txt 
!rm -f *.db 
!rm -f *.bin 

zsh:1: no matches found: *.py
zsh:1: no matches found: *.db
zsh:1: no matches found: *.bin


In [2]:
req_file = "requirements_10.txt"

In [3]:
%%writefile {req_file}
gensim
plotly
pydot
scikit-learn-intelex
tqdm
umap-learn
watermark

Writing requirements_10.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Installing packages

In [5]:
%%writefile imports.py
# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import csv
import locale
import logging
import os
import pprint
import re
import sqlite3
import warnings
from collections import deque

import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
from gensim.models import FastText, KeyedVectors, Word2Vec
from gensim.models.phrases import Phrases, npmi_scorer
from networkx.drawing.nx_pydot import graphviz_layout
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm
from umap import UMAP
from watermark import watermark

Writing imports.py


In [6]:
!isort imports.py
!cat imports.py

/bin/bash: isort: command not found
# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import csv
import locale
import logging
import os
import pprint
import re
import sqlite3
from collections import deque

import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
from gensim.models import FastText, KeyedVectors, Word2Vec
from gensim.models.phrases import Phrases, npmi_scorer
from networkx.drawing.nx_pydot import graphviz_layout
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm
from umap import UMAP
from watermark import watermark


In [7]:
# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import csv
import locale
import logging
import os
import pprint
import re
import sqlite3
import warnings
from collections import deque

import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
from gensim.models import FastText, KeyedVectors, Word2Vec
from gensim.models.phrases import Phrases, npmi_scorer
from networkx.drawing.nx_pydot import graphviz_layout
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm
from umap import UMAP
from watermark import watermark

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [8]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
BASE_DIR = '.'
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)

logging.getLogger().setLevel(logging.WARNING) # hide INFO messages; re-enabled later for Gensim training

print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 7.9.0

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.10.147+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

sqlite3   : 2.6.0
seaborn   : 0.12.2
pandas    : 1.4.4
csv       : 1.0
numpy     : 1.22.4
sys       : 3.9.16 (main, Dec  7 2022, 01:11:51) 
[GCC 9.4.0]
plotly    : 5.13.1
networkx  : 3.0
gensim    : 4.3.1
logging   : 0.5.1.2
matplotlib: 3.7.1
re        : 2.2.1



In [9]:
def plot_embeddings(model, search=[], topn=0, show_all=False, train_all=False, 
                    labels=False, colors=True, n_dims=2, algo='pca', **kwargs):

    def closest(word, model, search, topn):
        """Find the closest word in a given list of search words, if in top-n."""
        closest_word = model.most_similar_to_given(word, search)
        if word == closest_word or \
           word in [w for w, _ in model.most_similar(closest_word, topn=topn)]:
            return closest_word 
        else:
            return 'other'

    # eliminate kwargs of other methods if supplied
    if algo != 'tsne': ###
        kwargs.pop('perplexity', None) ###
    if algo != 'umap': ###
        kwargs.pop('n_neighbors', None) ###
        kwargs.pop('min_dist', None) ###
        kwargs.pop('spread', None) ###

    # Define the reducer
    if algo == 'umap':
        reducer = UMAP(n_components=n_dims, metric='cosine', **kwargs)
    elif algo == 'tsne':
        reducer = TSNE(n_components=n_dims, **kwargs)
    else:
        reducer = PCA(n_components=n_dims, **kwargs)

    if len(search) == 0: # no search words: show all
        show_all = True
    if show_all:  # to show all, all must be trained
        train_all = True
        
    # Identify words to plot
    if show_all:
        words = [w for w in model.index_to_key]
    else:
        words = search + [sim_word for w in search 
                         for sim_word, _ in model.most_similar(w, topn=topn)]
        words = list(set(words)) # make the word list unique for t-SNE

    # Reduce
    wv = [model[word] for word in words]
    print(f"Calculating {algo} for {len(words)} words ...", end="")
    if not train_all:
        reduced_wv = reducer.fit_transform(wv)
    else:
        reducer.fit(model.vectors)
        reduced_wv = reducer.transform(wv)
    print(" done.")

    # Create data frame for Plotly Express visualization
    # with x, y (, z) and meta data for styling
    if n_dims == 2:
        df = pd.DataFrame.from_records(reduced_wv, columns=['x', 'y'])
    else:
        df = pd.DataFrame.from_records(reduced_wv, columns=['x', 'y', 'z'])

    df['word']  = words
    params = {}

    if show_all:
        df['size'] = 1
        params.update({'size_max': 3, 'size': 'size' })
    else:
        df['size'] = df['word'].map(lambda w: 30 if w in search else 5)
        params.update({'size': 'size'})

    if len(search) > 0: # Colorize with closest search word
        df['label'] = df['word'].map(lambda w: w if labels or w in search else '')
        params.update({'text': 'label'})
        if colors:
            df['color'] = df['word'].progress_apply(closest, model=model, search=search, topn=topn)
            params.update({'color': 'color'})

    params.update({'hover_data': {c: False for c in df.columns}, 'hover_name': 'word'})

    # Generate scatter plot
    if n_dims == 2:
        params.update({'width': 800, 'height': 400})
        fig = px.scatter(df, x="x", y="y", opacity=0.3, **params)
        fig.update_xaxes(showticklabels=False, showgrid=True, title='', zeroline=False, visible=True)
        fig.update_yaxes(showticklabels=False, showgrid=True, title='', zeroline=False, visible=True)
    else:
        params.update({'width': 900, 'height': 700})
        df['z'] = df['z']*2/3 # scale 3d box
        fig = px.scatter_3d(df, x="x", y="y", z="z", opacity=0.5, **params)
        fig.update_layout(scene = dict(xaxis = go.layout.scene.XAxis(title = '', showticklabels=False),
                                       yaxis = go.layout.scene.YAxis(title = '', showticklabels=False),
                                       zaxis = go.layout.scene.ZAxis(title = '', showticklabels=False)))
    
    fig.update_traces(textposition='middle center', marker={'line': {'width': 0}})
    fig.update_layout(font=dict(family="Franklin Gothic", size=12, color="#000000"))
    fig.show()
    
    return fig

---
<a name='10.1'></a><a id='10.1'></a>
# 10.1 The Case for Semantic Embeddings
<a href="#top">[back to top]</a>

<a name='10.1.1'></a><a id='10.1.1'></a>
## 10.1.1 Word Embeddings
<a href="#top">[back to top]</a>

<a name='10.1.2'></a><a id='10.1.2'></a>
## 10.1.2 Analogy Reasoning with Word Embeddings
<a href="#top">[back to top]</a>

<a name='10.1.3'></a><a id='10.1.3'></a>
## 10.1.3 Types of Embeddings
<a href="#top">[back to top]</a>

---
<a name='10.2'></a><a id='10.2'></a>
# 10.2 Blueprint: Using Similarity Queries on Pretrained Models
<a href="#top">[back to top]</a>


<a name='10.2.1'></a><a id='10.2.1'></a>
## 10.2.1 Loading a Pretrained Model
<a href="#top">[back to top]</a>


In [10]:
os.environ['GENSIM_DATA_DIR'] = './models'

In [11]:
# pandas number format
pd.options.display.float_format = '{:.0f}'.format

In [12]:
info_df = pd.DataFrame.from_dict(api.info()['models'], orient='index')
info_df[['file_size', 'base_dataset', 'parameters']].head(5)

Unnamed: 0,file_size,base_dataset,parameters
fasttext-wiki-news-subwords-300,1005007116,"Wikipedia 2017, UMBC webbase corpus and statmt...",{'dimension': 300}
conceptnet-numberbatch-17-06-300,1225497562,"ConceptNet, word2vec, GloVe, and OpenSubtitles...",{'dimension': 300}
word2vec-ruscorpora-300,208427381,Russian National Corpus (about 250M words),"{'dimension': 300, 'window_size': 10}"
word2vec-google-news-300,1743563840,Google News (about 100 billion words),{'dimension': 300}
glove-wiki-gigaword-50,69182535,"Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)",{'dimension': 50}


In [13]:
# full list of columns
info_df.head(3)

Unnamed: 0,num_records,file_size,base_dataset,reader_code,license,parameters,description,read_more,checksum,file_name,parts,preprocessing
fasttext-wiki-news-subwords-300,999999,1005007116,"Wikipedia 2017, UMBC webbase corpus and statmt...",https://github.com/RaRe-Technologies/gensim-da...,https://creativecommons.org/licenses/by-sa/3.0/,{'dimension': 300},1 million word vectors trained on Wikipedia 20...,[https://fasttext.cc/docs/en/english-vectors.h...,de2bb3a20c46ce65c9c131e1ad9a77af,fasttext-wiki-news-subwords-300.gz,1,
conceptnet-numberbatch-17-06-300,1917247,1225497562,"ConceptNet, word2vec, GloVe, and OpenSubtitles...",https://github.com/RaRe-Technologies/gensim-da...,https://github.com/commonsense/conceptnet-numb...,{'dimension': 300},ConceptNet Numberbatch consists of state-of-th...,[http://aaai.org/ocs/index.php/AAAI/AAAI17/pap...,fd642d457adcd0ea94da0cd21b150847,conceptnet-numberbatch-17-06-300.gz,1,
word2vec-ruscorpora-300,184973,208427381,Russian National Corpus (about 250M words),https://github.com/RaRe-Technologies/gensim-da...,https://creativecommons.org/licenses/by/4.0/de...,"{'dimension': 300, 'window_size': 10}",Word2vec Continuous Skipgram vectors trained o...,[https://www.academia.edu/24306935/WebVectors_...,9bdebdc8ae6d17d20839dd9b5af10bc4,word2vec-ruscorpora-300.gz,1,The corpus was lemmatized and tagged with Univ...


In [14]:
pd.options.display.float_format = '{:.2f}'.format

We use the `glove-wiki-gigaword-50` model. Its word vectors have only 50 dimensions, so the model is small, yet it is comprehensive enough for our purposes. It was trained on 6 billion lowercased tokens.

In [15]:
model = api.load("glove-wiki-gigaword-50")



<a name='10.2.2'></a><a id='10.2.2'></a>
## 10.2.2 Similarity Queries
<a href="#top">[back to top]</a>

In [16]:
%precision 2

'%.2f'

In [17]:
v_king = model['king']
v_queen = model['queen']

print("Vector size:", model.vector_size)
print("v_king  =", v_king[:10])
print("v_queen =", v_queen[:10])
print("similarity:", model.similarity('king', 'queen'))

Vector size: 50
v_king  = [ 0.5   0.69 -0.6  -0.02  0.6  -0.13 -0.09  0.47 -0.62 -0.31]
v_queen = [ 0.38  1.82 -1.26 -0.1   0.36  0.6  -0.18  0.84 -0.06 -0.76]
similarity: 0.7839043


In [18]:
%precision 3

'%.3f'

In [19]:
model.most_similar('king', topn=3)

[('prince', 0.824), ('queen', 0.784), ('ii', 0.775)]

In [20]:
v_lion = model['lion']
v_nano = model['nanotechnology']

model.cosine_similarities(v_king, [v_queen, v_lion, v_nano])

array([ 0.784,  0.478, -0.255], dtype=float32)
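
The scores returned by `similarity` and `cosine_similarities` are cosine similarities. As a minimal sketch of that calculation, the pure-Python function below computes the same quantity for toy vectors. The name `cosine_similarity` and the 3-D vectors are illustrative assumptions, not part of Gensim or the GloVe model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-D vectors (made up for illustration, not taken from the model)
v_a = [1.0, 2.0, 0.0]
v_b = [2.0, 4.0, 0.0]   # same direction as v_a
v_c = [-2.0, 1.0, 0.0]  # orthogonal to v_a

print(round(cosine_similarity(v_a, v_b), 3))  # 1.0
print(round(cosine_similarity(v_a, v_c), 3))  # 0.0
```

Because only the angle matters, vectors of different length that point in the same direction get a similarity of 1.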

In [21]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

[('queen', 0.852), ('throne', 0.766), ('prince', 0.759)]

In [22]:
model.most_similar(positive=['paris', 'germany'], negative=['france'], topn=3)

[('berlin', 0.920), ('frankfurt', 0.820), ('vienna', 0.818)]

In [23]:
model.most_similar(positive=['france', 'capital'], topn=1)

[('paris', 0.784)]

In [24]:
model.most_similar(positive=['greece', 'capital'], topn=3)

[('central', 0.797), ('western', 0.757), ('region', 0.750)]
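
The analogy queries above combine word vectors by adding the `positive` and subtracting the `negative` terms, then look for the nearest neighbors of the result. The following is a sketch of that idea on a hypothetical 2-D toy embedding. It is not Gensim's implementation, and the function name `analogy` and all vector values are assumptions made for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            target[i] += sign * x / len(signed)
    target = unit(target)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [(w, sum(a * b for a, b in zip(unit(v), target)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

# Hypothetical toy embedding: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.1],
}
result = analogy(toy, positive=['woman', 'king'], negative=['man'])
print(result[0][0])  # queen
```

When only positive terms are given, as in the `greece` + `capital` query, the target is simply the average direction of the inputs, which need not point at the intended answer.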

---
<a name='10.3'></a><a id='10.3'></a>
# 10.3 Blueprints for Training and Evaluating Your Own Embeddings
<a href="#top">[back to top]</a>

<a name='10.3.1'></a><a id='10.3.1'></a>
## 10.3.1 Data Preparation
<a href="#top">[back to top]</a>

If there are not many training sentences, we should include these steps in the preprocessing:

1. Remove unwanted tokens (symbols, tags, etc.) from the text
2. Convert all words to lowercase
3. Use lemmas
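
The three steps above can be sketched as follows. This is a minimal illustration only: the `prepare` function and the tiny `LEMMAS` lookup table are hypothetical stand-ins, and a real pipeline would use an NLP library for lemmatization. The dataset loaded below already contains precomputed lemmas.

```python
import re

# Hypothetical mini lemma table; a real pipeline would use an NLP library
LEMMAS = {'brakes': 'brake', 'replaced': 'replace', 'tires': 'tire'}

def prepare(text):
    text = re.sub(r'<[^>]+>', ' ', text)        # 1. drop markup tags
    # 1. keep only alphanumeric tokens (hyphenated words stay intact)
    tokens = re.findall(r'[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*', text)
    tokens = [t.lower() for t in tokens]        # 2. lowercase
    return [LEMMAS.get(t, t) for t in tokens]   # 3. map to lemmas

print(prepare("I <b>replaced</b> the Brakes & tires!"))
# ['i', 'replace', 'the', 'brake', 'tire']
```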

<a id='reddit-selfposts-ch10.db'></a><a name='reddit-selfposts-ch10.db'></a>
### Dataset: reddit-selfposts-ch10.db
<a href="#top">[back to top]</a>

In [25]:
db_name = "reddit-selfposts-ch10.db"
!wget -nc -q 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/reddit-selfposts/reddit-selfposts-ch10.db'
!ls -l {db_name}

-rw-r--r-- 1 root root 30314496 Mar 21 11:04 reddit-selfposts-ch10.db


In [26]:
con = sqlite3.connect(db_name)
df = pd.read_sql("select subreddit, lemmas, text from posts_nlp", con)
con.close()

df['lemmas'] = df['lemmas'].str.lower().str.split() # lower case tokens
sents = df['lemmas'] # our training "sentences"

### Phrases


In [27]:
!pip list | grep gensim

gensim                        4.3.1


In [28]:
# Assume running Gensim 4.x
delim = '-'

phrases = Phrases(
    sents, 
    min_count=10, 
    threshold=0.3, 
    delimiter=delim, 
    scoring=npmi_scorer
)

phrases

<gensim.models.phrases.Phrases at 0x7ffbc03e27c0>

In [29]:
sent = "I had to replace the timing belt in my mercedes c300".split()
phrased = phrases[sent]
print('|'.join(phrased))

I|had|to|replace|the|timing-belt|in|my|mercedes-c300
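
The `threshold` values used with `npmi_scorer` refer to normalized pointwise mutual information (NPMI), which ranges from -1 to 1. As a rough sketch of how such a score is derived from raw counts, the function below computes NPMI for a word pair. The function name `npmi` and all counts are made up for illustration, and the `min_count` filter that `Phrases` also applies is left out.

```python
import math

def npmi(count_a, count_b, count_ab, total_words):
    """Normalized pointwise mutual information of a word pair."""
    p_ab = count_ab / total_words
    # p(a,b) / (p(a) * p(b)), rearranged to work on the raw counts
    ratio = count_ab * total_words / (count_a * count_b)
    return math.log(ratio) / -math.log(p_ab)

total = 1_000_000  # hypothetical corpus size in words

# Two words that only ever occur together
print(round(npmi(200, 200, 200, total), 2))  # 1.0

# Two words that co-occur exactly as often as chance predicts
print(round(npmi(1000, 1000, 1, total), 2))  # 0.0
```

A higher threshold therefore keeps only word pairs that occur together far more often than chance would suggest.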


In [30]:
# In Gensim 4.x, find_phrases() returns a dict mapping phrase -> score
phrase_df = pd.DataFrame.from_dict(
    phrases.find_phrases(sents), 
    orient='index').reset_index()

phrase_df.columns = ['phrase', 'score']

phrase_df = (
    phrase_df[['phrase', 'score']]
    .drop_duplicates()
    .sort_values(by='score', ascending=False)
    .reset_index(drop=True)
)

In [31]:
(
    phrase_df[phrase_df['phrase']
        .str.contains('mercedes')]
        .head(3)
)

Unnamed: 0,phrase,score
83,mercedes-benz,0.8
1416,mercedes-c300,0.47


In [32]:
# Show some additional phrases with score > 0.7
phrase_df.query('score > 0.7').sample(100)

Unnamed: 0,phrase,score
236,heater-cartridge,0.70
204,key-fob,0.72
88,college-student,0.79
201,sun-visor,0.72
235,kapton-tape,0.70
...,...,...
29,charcoal-canister,0.91
87,king-ranch,0.79
90,license-plate,0.78
42,carbon-buildup,0.86


In [33]:
sents = df['lemmas']

phrases = Phrases(
    sents, 
    min_count=10, 
    threshold=0.7, 
    delimiter=delim, 
    scoring=npmi_scorer
)

df['phrased_lemmas'] = df['lemmas'].progress_map(lambda s: phrases[s])

sents = df['phrased_lemmas']

progress-bar: 100%|██████████| 20000/20000 [00:05<00:00, 3453.08it/s]


<a name='10.3.2'></a><a id='10.3.2'></a>
## 10.3.2 Blueprint: Training Models with Gensim
<a href="#top">[back to top]</a>

In [34]:
# for Gensim training
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)

In [35]:
model = Word2Vec(
    sentences=sents, # tokenized input sentences
    vector_size=100, # size of word vectors (default 100)
    window=2,        # context window size (default 5)
    sg=1,            # use skip-gram (default 0 = CBOW)
    negative=5,      # number of negative samples (default 5)
    min_count=5,     # ignore infrequent words (default 5)
    workers=4,       # number of threads (default 3)
    epochs=1         # number of epochs (default 5; reduced to 1 here)
)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 999568 words, keeping 27208 word types
INFO:gensim.models.word2vec:collected 40172 word types from a corpus of 2009337 raw words and 20000 sentences
INFO:gensim.models.word2vec:Creating a fresh vocabulary
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 10457 unique words (26.03% of original 40172, drops 29715)', 'datetime': '2023-03-21T11:05:12.066002', 'gensim': '4.3.1', 'python': '3.9.16 (main, Dec  7 2022, 01:11:51) \n[GCC 9.4.0]', 'platform': 'Linux-5.10.147+-x86_64-with-glibc2.31', 'event': 'prepare_vocab'}
INFO:gensim.utils:Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1965556 word corpus (97.82% of original 2009337, drops 43781)', 'datetime': '2023-03-21T11:05:12.074980', 'gensim': '4.3.1', 'python': '3.9
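
With `sg=1`, the model is trained on pairs of a center word and the words within `window` positions around it. The sketch below shows only how such pairs could be enumerated. The helper `skipgram_pairs` is hypothetical, and refinements of the actual training, such as negative sampling, are omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sent = "replace the timing-belt soon".split()
pairs = list(skipgram_pairs(sent, window=1))
print(pairs)
# [('replace', 'the'), ('the', 'replace'), ('the', 'timing-belt'),
#  ('timing-belt', 'the'), ('timing-belt', 'soon'), ('soon', 'timing-belt')]
```

A small window such as 2 yields pairs of words that are direct neighbors, while a large window such as 30 pairs words that merely share a topic.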

In [36]:
logging.getLogger().setLevel(logging.ERROR)

In [37]:
model.save('./autos_w2v_100_2_full.bin')

In [38]:
model = Word2Vec.load('./autos_w2v_100_2_full.bin')

**This takes several minutes to run.** Please be patient: the models trained here are needed for the rest of the chapter.

In [39]:
%%time
print("Start")
# This takes a long time to run.

model_path = './'
model_prefix = 'autos'

param_grid = {
    'w2v': { 'variant': ['cbow', 'sg'], 'window': [2, 5, 30]}, 
    'ft': {'variant': ['sg'], 'window': [5]}
}

size = 100

for algo, params in param_grid.items(): 
    print(algo)
    for variant in params['variant']:
        sg = 1 if variant == 'sg' else 0
        for window in params['window']:
            print(f"\tVariant: {variant},\tWindow: {window},\tSize: {size}")
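            # Seeds NumPy's global RNG. Gensim seeds its models via their own
            # `seed` parameter (default 1); exact repeatability also needs workers=1.
            np.random.seed(1)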
            if algo == 'w2v':
                model = Word2Vec(sents, vector_size=size, window=window, sg=sg)
            else:
                model = FastText(sents, vector_size=size, window=window, sg=sg)

            file_name = f"{model_path}/{model_prefix}_{algo}_{variant}_{window}"
            model.wv.save_word2vec_format(file_name + '.bin', binary=True)
            
print("Done")

Start
w2v
	Variant: cbow,	Window: 2,	Size: 100
	Variant: cbow,	Window: 5,	Size: 100
	Variant: cbow,	Window: 30,	Size: 100
	Variant: sg,	Window: 2,	Size: 100
	Variant: sg,	Window: 5,	Size: 100
	Variant: sg,	Window: 30,	Size: 100
ft
	Variant: sg,	Window: 5,	Size: 100
Done
CPU times: user 12min 39s, sys: 2.34 s, total: 12min 42s
Wall time: 8min 7s


<a name='10.3.3'></a><a id='10.3.3'></a>
## 10.3.3 Blueprint: Evaluating Different Models
<a href="#top">[back to top]</a>

In [40]:
model_path = '.'

names = ['autos_w2v_cbow_2', 'autos_w2v_sg_2', 
         'autos_w2v_sg_5', 'autos_w2v_sg_30', 'autos_ft_sg_5']
models = {}

for name in names:
    file_name = f"{model_path}/{name}.bin"
    print(f"Loading {file_name}") ###
    models[name] = KeyedVectors.load_word2vec_format(file_name, binary=True)

Loading ./autos_w2v_cbow_2.bin
Loading ./autos_w2v_sg_2.bin
Loading ./autos_w2v_sg_5.bin
Loading ./autos_w2v_sg_30.bin
Loading ./autos_ft_sg_5.bin


In [41]:
def compare_models(models, **kwargs):

    df = pd.DataFrame()
    for name, model in models:
        df[name] = [f"{word} {score:.3f}" 
                    for word, score in model.most_similar(**kwargs)]
    df.index = df.index + 1 # let row index start at 1
    return df

In [42]:
compare_models([(n, models[n]) for n in names], positive='bmw', topn=10)

Unnamed: 0,autos_w2v_cbow_2,autos_w2v_sg_2,autos_w2v_sg_5,autos_w2v_sg_30,autos_ft_sg_5
1,mercedes 0.857,mercedes 0.788,328i 0.767,328i 0.836,bmws 0.836
2,lexus 0.815,benz 0.706,335i 0.765,xdrive 0.803,mercedes 0.800
3,volvo 0.806,merc 0.692,mercedes 0.756,bimmer 0.780,mercedes_benz 0.795
4,vw 0.796,porsche 0.689,mercede 0.720,335i 0.775,bmwfs 0.789
5,subaru 0.790,mercedes-benz 0.678,e92 0.712,non-m 0.754,mercede 0.788
6,porsche 0.782,mercede 0.676,benz 0.705,5-series 0.741,mercedes-benz 0.767
7,audi 0.773,mb 0.669,135i 0.700,n52 0.740,335i 0.766
8,volkswagen 0.764,m3 0.665,m3 0.694,x-drive 0.736,m135i 0.763
9,toyota 0.751,audis 0.661,merc 0.693,e92 0.733,merc 0.760
10,benz 0.746,135i 0.658,z4 0.689,bmws 0.733,328i 0.760


### Looking for Similar Concepts


### Analogy Reasoning on our own Models


**Note** that your results may differ slightly from the ones printed in the book because of random initialization.

In [43]:
compare_models(
    [(n, models[n]) for n in names], 
    positive=['f150', 'toyota'], negative=['ford'], topn=5).T

Unnamed: 0,1,2,3,4,5
autos_w2v_cbow_2,camry 0.837,f-150 0.837,ranger 0.820,is250 0.817,s70 0.816
autos_w2v_sg_2,f-150 0.738,camry 0.730,tacoma 0.728,ls430 0.706,f-250 0.701
autos_w2v_sg_5,nissan-frontier 0.673,tacoma 0.658,2wd 0.650,tundra 0.642,f-150 0.610
autos_w2v_sg_30,tacoma 0.684,4runner 0.672,4runners 0.631,tundra 0.617,tacomas 0.615
autos_ft_sg_5,toyotas 0.756,f150s 0.738,tacoma 0.729,toyo 0.729,tacomas 0.727


In [44]:
# Try a different analogy
compare_models(
    [(n, models[n]) for n in names], 
    positive=['x3', 'audi'], negative=['bmw'], topn=5).T

Unnamed: 0,1,2,3,4,5
autos_w2v_cbow_2,b8.5 0.820,b7 0.810,b9 0.808,tt 0.806,sportback 0.805
autos_w2v_sg_2,b8.5 0.756,cla-250 0.749,a7 0.746,sportback 0.738,avant 0.734
autos_w2v_sg_5,sportback 0.768,b8.5 0.753,avant 0.745,b9 0.745,a4 0.739
autos_w2v_sg_30,a5 0.669,sportback 0.663,a4 0.650,a4s 0.619,quattro 0.616
autos_ft_sg_5,a4 0.799,a7 0.775,b8.5 0.775,b9 0.770,a5 0.765


In [45]:
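# A plain similarity query for a phrase. The try/except catches the
# error raised if 'spark-plug' is missing from a model's vocabulary.
try:
    # display() is needed because the call is not the cell's last expression
    display(compare_models(
        [(n, models[n]) for n in names], 
        positive=['spark-plug'], negative=[], topn=5))
except Exception as e:
    print(f"Error: {e}")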

---
<a name='10.4'></a><a id='10.4'></a>
# 10.4 Blueprints for Visualizing Embeddings
<a href="#top">[back to top]</a>

<a name='10.4.1'></a><a id='10.4.1'></a>
## 10.4.1 Blueprint: Applying Dimensionality Reduction
<a href="#top">[back to top]</a>


In [46]:
model = models['autos_w2v_sg_30']
words = model.index_to_key # words in vocabulary
wv = [model[word] for word in words]

reducer = UMAP(n_components=2, metric='cosine', n_neighbors = 15, min_dist=0.1, random_state = 12)

reduced_wv = reducer.fit_transform(wv)

In [47]:
px.defaults.template = "plotly_white" ### plotly style

plot_df = pd.DataFrame.from_records(reduced_wv, columns=['x', 'y'])
plot_df['word'] = words
params = {'hover_data': {c: False for c in plot_df.columns}, 
          'hover_name': 'word'}
params.update({'width': 800, 'height': 600}) ###

fig = px.scatter(plot_df, x="x", y="y", opacity=0.3, size_max=3, **params)
fig.update_traces(marker={'line': {'width': 0}}) ###
fig.update_xaxes(showticklabels=False, showgrid=True, zeroline=False, visible=True) ###
fig.update_yaxes(showticklabels=False, showgrid=True, zeroline=False, visible=True) ###
fig.show()

In [48]:
model = models['autos_w2v_sg_30'] ###
search = ['ford', 'lexus', 'vw', 'hyundai', 
          'goodyear', 'spark-plug', 'florida', 'navigation']

try:
    _ = plot_embeddings(model, search, topn=50, show_all=True, labels=False, 
                    algo='umap', n_neighbors=15, min_dist=0.1, random_state=12)
except Exception as e:
    print(f"Error: {e}")

Calculating umap for 10457 words ... done.


progress-bar: 100%|██████████| 10457/10457 [00:13<00:00, 786.79it/s]


In [49]:
model = models['autos_w2v_sg_30'] ###
search = ['ford', 'bmw', 'toyota', 'tesla', 'audi', 'mercedes', 'hyundai']

_ = plot_embeddings(model, search, topn=10, show_all=False, labels=True, 
    algo='umap', n_neighbors=15, min_dist=10, spread=25, random_state=7)

Calculating umap for 77 words ... done.


progress-bar: 100%|██████████| 77/77 [00:00<00:00, 1264.48it/s]


In [50]:
_ = plot_embeddings(model, search, topn=30, n_dims=3, 
    algo='umap', n_neighbors=15, min_dist=.1, spread=40, random_state=23)

Calculating umap for 214 words ... done.


progress-bar: 100%|██████████| 214/214 [00:00<00:00, 1071.19it/s]


In [51]:
# PCA plot (not in the book) - better to explain analogies:
# difference vectors of pickup trucks "f150"-"ford", "tacoma"-"toyota" and 
# "frontier"-"nissan" are almost parallel. 
# "x5"-"bmw" is pointing to a somewhat different direction, but "x5" is not a pickup

model = models['autos_w2v_sg_5'] 
search = ['ford', 'f150', 'toyota', 'tacoma', 'nissan', 'frontier', 'bmw', 'x5']
_ = plot_embeddings(model, search, topn=0, algo='pca', labels=True, colors=False)

Calculating pca for 8 words ... done.


<a name='10.4.2'></a><a id='10.4.2'></a>
## 10.4.2 Blueprint: Using the TensorFlow Embedding Projector
<a href="#top">[back to top]</a>

In [52]:
model_path = '.' ###
name = 'autos_w2v_sg_30'
model = models[name]

with open(f'{model_path}/{name}_words.tsv', 'w', encoding='utf-8') as tsvfile:
    tsvfile.write('\n'.join(model.index_to_key))

with open(f'{model_path}/{name}_vecs.tsv', 'w', encoding='utf-8') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t', 
                        dialect=csv.unix_dialect, quoting=csv.QUOTE_MINIMAL)
    for w in model.index_to_key:
        _ = writer.writerow(model[w].tolist())

<a name='10.4.3'></a><a id='10.4.3'></a>
## 10.4.3 Blueprint: Constructing a Similarity Tree
<a href="#top">[back to top]</a>

In [53]:
def sim_tree(model, word, top_n, max_dist):

    graph = nx.Graph()
    graph.add_node(word, dist=0)

    to_visit = deque([word])
    while len(to_visit) > 0:
        source = to_visit.popleft() # visit next node
        dist = graph.nodes[source]['dist']+1

        if dist <= max_dist: # discover new nodes
            for target, sim in model.most_similar(source, topn=top_n):
                if target not in graph:
                    to_visit.append(target)
                    graph.add_node(target, dist=dist)
                    graph.add_edge(source, target, sim=sim, dist=dist)
    return graph
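
`sim_tree` is a breadth-first search over nearest neighbors: each word is attached to the first node that discovers it, and expansion stops at `max_dist`. The sketch below reproduces that traversal without NetworkX or a trained model. The `NEIGHBORS` table is a hypothetical stand-in for `model.most_similar()`, and `sim_tree_edges` is an illustrative name.

```python
from collections import deque

# Hypothetical nearest-neighbor lists standing in for model.most_similar()
NEIGHBORS = {
    'noise': ['sound', 'rattle'],
    'sound': ['noise', 'squeal'],
    'rattle': ['noise', 'clunk'],
    'squeal': ['sound'],
    'clunk': ['rattle'],
}

def sim_tree_edges(root, max_dist):
    """Breadth-first expansion; returns node distances and tree edges."""
    dist = {root: 0}
    edges = []
    to_visit = deque([root])
    while to_visit:
        source = to_visit.popleft()      # visit next node
        if dist[source] + 1 > max_dist:  # do not expand beyond max_dist
            continue
        for target in NEIGHBORS.get(source, []):
            if target not in dist:       # first discovery wins
                dist[target] = dist[source] + 1
                edges.append((source, target))
                to_visit.append(target)
    return dist, edges

dist, edges = sim_tree_edges('noise', max_dist=2)
print(dist)
# {'noise': 0, 'sound': 1, 'rattle': 1, 'squeal': 2, 'clunk': 2}
print(edges)
# [('noise', 'sound'), ('noise', 'rattle'), ('sound', 'squeal'), ('rattle', 'clunk')]
```

Because already known words are skipped, the result has no cycles and can be drawn as a tree.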

In [54]:
def plt_add_margin(pos, x_factor=0.1, y_factor=0.1):
    # rescales the image s.t. all captions fit onto the canvas
    x_values, y_values = zip(*pos.values())
    x_max = max(x_values)
    x_min = min(x_values)
    y_max = max(y_values)
    y_min = min(y_values)

    x_margin = (x_max - x_min) * x_factor
    y_margin = (y_max - y_min) * y_factor
    # return (x_min - x_margin, x_max + x_margin), (y_min - y_margin, y_max + y_margin)

    plt.xlim(x_min - x_margin, x_max + x_margin)
    plt.ylim(y_min - y_margin, y_max + y_margin)

def scale_weights(graph, minw=1, maxw=8):
    # rescale similarity to interval [minw, maxw] for display
    sims = [graph[s][t]['sim'] for (s, t) in graph.edges]
    min_sim, max_sim = min(sims), max(sims)

    for source, target in graph.edges:
        sim = graph[source][target]['sim']
        graph[source][target]['sim'] = (sim-min_sim)/(max_sim-min_sim)*(maxw-minw)+minw

    return graph

def solve_graphviz_problems(graph):
    # Graphviz has problems with unicode
    # this is to prevent errors during positioning
    def clean(n):
        n = n.replace(',', '')
        n = n.encode().decode('ascii', errors='ignore')
        n = re.sub(r'[{}\[\]]', '-', n)
        n = re.sub(r'^\-', '', n)
        return n
    
    node_map = {n: clean(n) for n in graph.nodes}
    # remove empty nodes
    for n, m in node_map.items(): 
        if len(m) == 0:
            graph.remove_node(n)
    
    return nx.relabel_nodes(graph, node_map)
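
The rescaling in `scale_weights` is a min-max normalization of the similarities to the interval `[minw, maxw]`. A minimal stand-alone sketch on a plain list follows. The function name `rescale` and the guard for identical values are additions made for this illustration and are not part of the code above.

```python
def rescale(values, minw=1, maxw=8):
    """Min-max scale values to [minw, maxw]; constant input maps to minw."""
    lo, hi = min(values), max(values)
    if hi == lo:  # guard added in this sketch to avoid division by zero
        return [float(minw) for _ in values]
    return [(v - lo) / (hi - lo) * (maxw - minw) + minw for v in values]

print(rescale([0.5, 0.75, 1.0]))  # [1.0, 4.5, 8.0]
```

The weakest edge in the graph is drawn with width `minw` and the strongest with width `maxw`.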

In [55]:
def plot_tree(graph, node_size=1000, font_size=12):
    graph = solve_graphviz_problems(graph) ###

    pos = graphviz_layout(graph, prog='twopi', root=list(graph.nodes)[0])
    plt.figure(figsize=(10, 4), dpi=200) ###
    
    plt.grid(False) ### hide grid (the `b` argument was renamed to `visible`)
    
    plt.box(False) ### hide box
    plt_add_margin(pos) ### just for layout

    colors = [graph.nodes[n]['dist'] for n in graph] # colorize by distance
    
    nx.draw_networkx_nodes(graph, pos, node_size=node_size, node_color=colors, 
                           cmap='Set1', alpha=0.4)
    
    nx.draw_networkx_labels(graph, pos, font_size=font_size)
    
    scale_weights(graph) ### not in book
    
    for (n1, n2, sim) in graph.edges(data='sim'):
         nx.draw_networkx_edges(graph, pos, [(n1, n2)], width=sim, alpha=0.2)

    plt.show()

In [56]:
model = models['autos_w2v_sg_2']
graph = sim_tree(model, 'noise', top_n=10, max_dist=3)

### The following takes too much time to execute on Colab.

In [None]:
# %%time
# plot_tree(graph, node_size=500, font_size=8)

In [None]:
# model = models['autos_w2v_sg_30']
# graph = sim_tree(model, 'spark-plug', top_n=8, max_dist=2)
# plot_tree(graph, node_size=500, font_size=8)