# Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book because some boilerplate was removed for print. For example, we frequently use pretty print (`pp.pprint`) instead of `print` and `tqdm`'s `progress_apply` instead of Pandas' `apply`.

Moreover, several layout and formatting commands, like `figsize` to control figure size or subplot commands, are removed in the book. Numbers in the book may have fewer decimal places than shown here in the notebook.

You may also find some lines marked with three hashes (`###`). Those are not in the book either, as they don't contribute to the concept.

All of this is done to simplify the code in the book and put the focus on the important parts.

# Setup<div class='tocSkip'/>

## Determine Environment<div class='tocSkip'/>

In [None]:
import sys
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    BASE_DIR = "/content"
    print("You are working on Google Colab.")
    print(f'Files will be downloaded to "{BASE_DIR}".')
    # adjust release
    GIT_ROOT = "https://github.com/blueprints-for-text-analytics-python/early-release/raw/master"
else:
    BASE_DIR = ".."
    print("You are working on a local system.")
    print(f'Files will be searched relative to "{BASE_DIR}".')

## Download data files<div class='tocSkip'/>

In [None]:
import os, subprocess
from subprocess import PIPE

required_files = [
                  'settings.py',
                  'packages/blueprints/__init__.py',
                  'packages/blueprints/embeddings.py',
                  'data/reddit-selfposts/reddit-selfposts.db.gz',
                  'ch11/colab_requirements.txt'
]

if ON_COLAB:
    print("Downloading required files ...")
    for file in required_files:
        cmd = ['wget', '-P', os.path.dirname(BASE_DIR+'/'+file), GIT_ROOT+'/'+file]
        print('!'+' '.join(cmd))
        stdout, stderr = subprocess.Popen(cmd, stdout=PIPE, stderr=PIPE).communicate()
        # print(stderr.decode()) # uncomment in case of problems

## Install required libraries and additional setup<div class='tocSkip'/>

It may take a moment to install the required Python libraries.

In [None]:
if ON_COLAB:
    print("\nAdditional setup ...")
    setup_cmds = ['pip install -r ch11/colab_requirements.txt',
                  'mkdir -p models',
                  f'gunzip -k {BASE_DIR}/data/reddit-selfposts/reddit-selfposts.db.gz']

    for cmd in setup_cmds:
        print('!'+cmd)
        if os.system(cmd) != 0:
            print('  --> ERROR')

## Common Imports<div class='tocSkip'/>

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'

%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr" # in this notebook not "all"

In [None]:
# to import blueprints package
import os, sys
sys.path.append(BASE_DIR + '/packages')
# sys.path

In [None]:
# for Gensim training

import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

logging.getLogger().setLevel(logging.WARNING)
logging.getLogger().info("Logging INFOS.")
logging.getLogger().warning("Logging WARNINGS.")
logging.getLogger().error("Logging ERRORS.")

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from gensim.models import Word2Vec, FastText, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

from gensim.models.phrases import Phrases, npmi_scorer
from gensim.models.word2vec import LineSentence

In [None]:
# set precision for similarity values
%precision 3
np.set_printoptions(suppress=True) # no scientific notation for small numbers

# for figure cropping and conversion
from PIL import Image

# Exploring Semantic Relationships with Word Embeddings
## What you will learn and what we will build


# The Case for Semantic Embeddings
## Word Embeddings


## Analogy Reasoning with Word Embeddings


## Types of Embeddings


### Word2Vec


### GloVe
### FastText
### Deep Contextualized Embeddings


# Blueprint: Similarity Queries on Pre-Trained Models
## Loading a Pretrained Model


In [None]:
import os ###
os.environ['GENSIM_DATA_DIR'] = './models'

In [None]:
# not in the book: display models as table
pd.options.display.float_format = '{:.0f}'.format ###

In [None]:
import gensim.downloader as api

info_df = pd.DataFrame.from_dict(api.info()['models'], orient='index')
info_df[['file_size', 'base_dataset', 'parameters']].head(5)

In [None]:
# full list of columns
info_df.head(3)

In [None]:
pd.options.display.float_format = '{:.2f}'.format ###

In [None]:
model = api.load("glove-wiki-gigaword-50")

## Similarity Queries


In [None]:
%precision 2

In [None]:
v_king = model['king']
v_queen = model['queen']

print("Vector size:", model.vector_size)
print("v_king  =", v_king[:10])
print("v_queen =", v_queen[:10])
print("similarity:", model.similarity('king', 'queen'))

In [None]:
%precision 3

In [None]:
model.most_similar('king', topn=3)

In [None]:
v_lion = model['lion']
v_nano = model['nanotechnology']

model.cosine_similarities(v_king, [v_queen, v_lion, v_nano])
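
The similarity values returned above are cosine similarities. The following is a minimal sketch of that computation, assuming plain Python lists in place of the model's NumPy vectors. The three toy vectors are made up for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical 3-dimensional "embeddings", for illustration only
toy_king = [1.0, 2.0, 0.0]
toy_queen = [2.0, 4.0, 0.0]   # points in the same direction as toy_king
toy_other = [0.0, 0.0, 3.0]   # orthogonal to toy_king

sim_same = cosine_similarity(toy_king, toy_queen)
sim_orth = cosine_similarity(toy_king, toy_other)
print(round(sim_same, 3), round(sim_orth, 3))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.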

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
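
The analogy query above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below illustrates that idea on a hand-made toy vocabulary. The helper names (`unit`, `cosine`, `analogy`) and all vectors are hypothetical, and the real `most_similar` implementation in Gensim may differ in details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # input words are not candidates
    scored = [(w, cosine(query, v)) for w, v in vectors.items()
              if w not in exclude]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]

# hypothetical toy embeddings: axis 1 ~ "female", axis 2 ~ "royal"
toy = {
    'man':   [1.0, 0.0, 0.2],
    'woman': [1.0, 1.0, 0.2],
    'king':  [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
    'apple': [0.1, 0.1, -1.0],
}

result = analogy(toy, positive=['woman', 'king'], negative=['man'])
print(result[0][0])
```

In this constructed example the best match is `queen`, because the toy vectors were designed so that the "female" and "royal" directions combine cleanly. Real embeddings are much noisier.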

In [None]:
model.most_similar(positive=['paris', 'germany'], negative=['france'], topn=3)

In [None]:
model.most_similar(positive=['france', 'capital'], topn=1)

In [None]:
model.most_similar(positive=['greece', 'capital'], topn=3)

# Blueprint: Training and Comparing your own Embeddings


## Data Preparation


In [None]:
db_path = f"{BASE_DIR}/data/reddit-selfposts/reddit-selfposts.db"
con = sqlite3.connect(db_path)
df = pd.read_sql("select subreddit, lemmas, text from posts_nlp", con)
con.close()

df['lemmas'] = df['lemmas'].str.lower().str.split() # lower case tokens
sents = df['lemmas'] # our training "sentences"

### Phrases


In [None]:
from gensim.models.phrases import Phrases, npmi_scorer

phrases = Phrases(sents, min_count=10, threshold=0.3, 
                  delimiter=b'-', scoring=npmi_scorer)
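
The `threshold` used with `npmi_scorer` is easier to interpret once the score itself is spelled out. As a sketch, assuming the standard definition of normalized pointwise mutual information with probabilities estimated from raw counts (the function name `npmi` and the counts below are hypothetical):

```python
import math

def npmi(count_a, count_b, count_ab, total):
    """Normalized PMI of the bigram (a, b), with values in [-1, 1]."""
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# two words that only ever occur together
always_together = npmi(count_a=50, count_b=50, count_ab=50, total=10_000)

# two words that co-occur exactly as often as chance predicts
independent = npmi(count_a=100, count_b=100, count_ab=1, total=10_000)

print(round(always_together, 3), round(independent, 3))
```

Scores near 1 indicate word pairs that almost always appear together, scores near 0 indicate independence. A threshold such as 0.3 or 0.7 therefore selects increasingly strong collocations.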

In [None]:
sent = "I had to replace the timing belt in my mercedes c300".split()
phrased = phrases[sent]
print('|'.join(phrased))
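
To make the effect of applying a phrase model concrete, here is a simplified sketch that joins adjacent tokens whenever the pair is in a known set of bigrams. The function `apply_phrases` and the set `known_bigrams` are hypothetical stand-ins. In the real model, the decision is based on the learned scores and the threshold, so the notebook's output depends on the corpus.

```python
def apply_phrases(tokens, known_bigrams, delimiter='-'):
    """Greedily merge adjacent tokens that form a known bigram."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_bigrams:
            out.append(delimiter.join(pair))
            i += 2  # skip the second token of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out

# assumed to have been detected as phrases, for illustration only
known_bigrams = {('timing', 'belt'), ('mercedes', 'c300')}

sent = "I had to replace the timing belt in my mercedes c300".split()
merged = apply_phrases(sent, known_bigrams)
print('|'.join(merged))
```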

In [None]:
phrase_df = pd.DataFrame(phrases.export_phrases(sents), 
                         columns =['phrase', 'score'])
phrase_df = phrase_df[['phrase', 'score']].drop_duplicates() \
            .sort_values(by='score', ascending=False).reset_index(drop=True)
phrase_df['phrase'] = phrase_df['phrase'].map(lambda p: p.decode('utf-8'))

In [None]:
phrase_df[phrase_df['phrase'].str.contains('mercedes')].head(3)

In [None]:
# show some additional phrases with score > 0.7
phrase_df.query('score > 0.7').sample(100)

In [None]:
logging.getLogger().setLevel(logging.WARNING) ###
phrases = Phrases(sents, min_count=10, threshold=0.7, 
                  delimiter=b'-', scoring=npmi_scorer)

df['phrased_lemmas'] = df['lemmas'].progress_map(lambda s: phrases[s])
sents = df['phrased_lemmas']

## Training


In [None]:
logging.getLogger().setLevel(logging.INFO) ###
model = Word2Vec(sents,       # tokenized input sentences
                 size=100,    # size of word vectors (default 100)
                 window=2,    # context window size (default 5)
                 sg=1,        # use skip-gram (default 0 = CBOW)
                 negative=5,  # number of negative samples (default 5)
                 min_count=5, # ignore infrequent words (default 5)
                 workers=4,   # number of threads (default 3)
                 iter=5)      # number of epochs (default 5)
logging.getLogger().setLevel(logging.ERROR) ###
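
The `window` parameter determines which neighboring tokens count as context for a word. The sketch below lists the (center, context) pairs a skip-gram model would see for a fixed window size. It is a simplification under stated assumptions: the function `skipgram_pairs` is hypothetical, and the actual training code applies additional sampling steps that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return all (center, context) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "replace the timing-belt".split()
pairs_w1 = skipgram_pairs(tokens, window=1)
pairs_w2 = skipgram_pairs(tokens, window=2)

print(len(pairs_w1), len(pairs_w2))
print(pairs_w1)
```

A larger window yields more pairs per word and mixes in more distant context. This is one reason the models trained with different window sizes later in this notebook behave differently.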

In [None]:
model.save('./models/autos_w2v_100_2_full.bin')

In [None]:
model = Word2Vec.load('./models/autos_w2v_100_2_full.bin')

**This takes several minutes on Colab.** Please be patient; you need these models to continue.

In [None]:
from gensim.models import Word2Vec, FastText

model_path = './models'
model_prefix = 'autos'

param_grid = {'w2v': {'variant': ['cbow', 'sg'], 'window': [2, 5, 30]},
              'ft': {'variant': ['sg'], 'window': [5]}}
size = 100

for algo, params in param_grid.items(): 
    print(algo) ###
    for variant in params['variant']:
        sg = 1 if variant == 'sg' else 0
        for window in params['window']:
            print(f"  Variant: {variant}, Window: {window}, Size: {size}") ###
            np.random.seed(1) ### to ensure repeatability
            if algo == 'w2v':
                model = Word2Vec(sents, size=size, window=window, sg=sg)
            else:
                model = FastText(sents, size=size, window=window, sg=sg)

            file_name = f"{model_path}/{model_prefix}_{algo}_{variant}_{window}"
            model.wv.save_word2vec_format(file_name + '.bin', binary=True) 

## Evaluating Different Models


In [None]:
### You can add the other computed models as well here.
### For the book we just selected five of them. 
from gensim.models import KeyedVectors
model_path = './models' ###

names = ['autos_w2v_cbow_2', 'autos_w2v_sg_2', 
         'autos_w2v_sg_5', 'autos_w2v_sg_30', 'autos_ft_sg_5']
models = {}

for name in names:
    file_name = f"{model_path}/{name}.bin"
    print(f"Loading {file_name}") ###
    models[name] = KeyedVectors.load_word2vec_format(file_name, binary=True)

In [None]:
def compare_models(models, **kwargs):

    df = pd.DataFrame()
    for name, model in models:
        df[name] = [f"{word} {score:.3f}" 
                    for word, score in model.most_similar(**kwargs)]
    df.index = df.index + 1 # let row index start at 1
    return df

In [None]:
compare_models([(n, models[n]) for n in names], positive='bmw', topn=10)

### Looking for Similar Concepts


### Analogy Reasoning on our own Models


In [None]:
compare_models([(n, models[n]) for n in names], 
               positive=['f150', 'toyota'], negative=['ford'], topn=5).T

In [None]:
# try a different analogy
compare_models([(n, models[n]) for n in names], 
               positive=['x3', 'mercedes'], negative=['bmw'], topn=5).T

In [None]:
# and another one
compare_models([(n, models[n]) for n in names], 
               positive=['spark-plug'], negative=[], topn=5)

# Visualizing Embeddings


## Blueprint: Applying Dimensionality Reduction


In [None]:
from umap import UMAP

model = models['autos_w2v_sg_30']
words = model.vocab
wv = [model[word] for word in words]

reducer = UMAP(n_components=2, metric='cosine', n_neighbors = 15, min_dist=0.1,
               random_state = 12)
reduced_wv = reducer.fit_transform(wv)

In [None]:
import plotly.express as px

df = pd.DataFrame.from_records(reduced_wv, columns=['x', 'y'])
df['word'] = words
params = {'hover_data': {c: False for c in df.columns}, 'hover_name': 'word'}
params.update({'width': 800, 'height': 600}) ###

fig = px.scatter(df, x="x", y="y", opacity=0.3, size_max=3, **params)
fig.update_traces(marker={'line': {'width': 0}}) ###
fig.update_xaxes(showticklabels=False, showgrid=True, zeroline=False, visible=True) ###
fig.update_yaxes(showticklabels=False, showgrid=True, zeroline=False, visible=True) ###
fig.show()

In [None]:
from blueprints.embeddings import plot_embeddings

model = models['autos_w2v_sg_30'] ###
search = ['ford', 'lexus', 'audi', 'vw', 'hyundai', 
          'goodyear', 'spark-plug', 'florida', 'navigation']

_ = plot_embeddings(model, search, topn=50, show_all=True, labels=False, 
                algo='umap', n_neighbors=15, min_dist=0.1, random_state=12)

In [None]:
model = models['autos_w2v_sg_30'] ###
search = ['ford', 'bmw', 'toyota', 'tesla', 'audi', 'mercedes', 'hyundai']

_ = plot_embeddings(model, search, topn=10, show_all=False, labels=True, 
    algo='umap', n_neighbors=15, min_dist=10, spread=20, random_state=5)

In [None]:
_ = plot_embeddings(model, search, topn=30, n_dims=3, 
    algo='umap', n_neighbors=15, min_dist=.1, spread=40, random_state=5)

In [None]:
# PCA plot - better suited to explain analogies:
# the difference vectors of the pickup trucks "f150"-"ford", "tacoma"-"toyota"
# and "frontier"-"nissan" are almost parallel.
# "x5"-"bmw" points in a somewhat different direction.

model = models['autos_w2v_sg_5'] 
search = ['ford', 'f150', 'toyota', 'tacoma', 'nissan', 'frontier', 'bmw', 'x5']
_ = plot_embeddings(model, search, topn=0, algo='pca', labels=True, colors=False)

## Blueprint: Using TensorFlow Embedding Projector


In [None]:
import csv

model_path = './models' ###
name = 'autos_w2v_sg_30' # set explicitly so the file names match the exported model
model = models[name]

with open(f'{model_path}/{name}_words.tsv', 'w', encoding='utf-8') as tsvfile:
    tsvfile.write('\n'.join(model.vocab))

with open(f'{model_path}/{name}_vecs.tsv', 'w', encoding='utf-8') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t', 
                        dialect=csv.unix_dialect, quoting=csv.QUOTE_MINIMAL)
    for w in model.vocab:
        _ = writer.writerow(model[w].tolist())
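
The export above produces two files that must stay aligned: line *i* of the words file labels line *i* of the vectors file. The following sketch shows the resulting layout using in-memory buffers instead of files. The dictionary `toy_vectors` is a hypothetical stand-in for the model's vocabulary and vectors.

```python
import csv
import io

# hypothetical 2-dimensional vectors, for illustration only
toy_vectors = {'ford': [0.1, 0.2], 'bmw': [0.3, 0.4]}

words_buf = io.StringIO()  # stands in for the *_words.tsv file
vecs_buf = io.StringIO()   # stands in for the *_vecs.tsv file

# one word per line, in vocabulary order
words_buf.write('\n'.join(toy_vectors))

# one tab-separated vector per line, in the same order
writer = csv.writer(vecs_buf, delimiter='\t', lineterminator='\n')
for word in toy_vectors:
    writer.writerow(toy_vectors[word])

word_lines = words_buf.getvalue().split('\n')
vec_lines = vecs_buf.getvalue().strip().split('\n')

for word, vec in zip(word_lines, vec_lines):
    print(word, '->', vec.split('\t'))
```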

## Blueprint: Constructing a Similarity Tree


In [None]:
import networkx as nx
from collections import deque

def sim_tree(model, word, top_n, max_dist):

    graph = nx.Graph()
    graph.add_node(word, dist=0)

    to_visit = deque([word])
    while len(to_visit) > 0:
        source = to_visit.popleft() # visit next node
        dist = graph.nodes[source]['dist']+1

        if dist <= max_dist: # discover new nodes
            for target, sim in model.most_similar(source, topn=top_n):
                if target not in graph:
                    to_visit.append(target)
                    graph.add_node(target, dist=dist)
                    graph.add_edge(source, target, sim=sim, dist=dist)
    return graph
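
The function above is a breadth-first search that stops expanding once `max_dist` is reached. To trace the traversal without a trained model or `networkx`, here is a sketch of the same logic using plain dictionaries. The neighbor lists in `TOY_NEIGHBOURS` and the helper names are hypothetical and replace `model.most_similar`.

```python
from collections import deque

# hypothetical neighbour lists standing in for model.most_similar()
TOY_NEIGHBOURS = {
    'noise':  [('sound', 0.9), ('rattle', 0.8)],
    'sound':  [('noise', 0.9), ('squeal', 0.7)],
    'rattle': [('noise', 0.8), ('clunk', 0.6)],
    'squeal': [('sound', 0.7)],
    'clunk':  [('rattle', 0.6)],
}

def toy_most_similar(word, topn):
    return TOY_NEIGHBOURS.get(word, [])[:topn]

def sim_tree_dict(word, top_n, max_dist):
    """Breadth-first search returning node distances and tree edges."""
    dist = {word: 0}
    edges = []
    to_visit = deque([word])
    while to_visit:
        source = to_visit.popleft()
        d = dist[source] + 1
        if d <= max_dist:
            for target, sim in toy_most_similar(source, top_n):
                if target not in dist:  # first discovery fixes the parent
                    dist[target] = d
                    edges.append((source, target, sim))
                    to_visit.append(target)
    return dist, edges

dist, edges = sim_tree_dict('noise', top_n=2, max_dist=2)
print(dist)
print(edges)
```

Because every node is added exactly once, together with the edge from the node that discovered it, the result is a tree with one edge fewer than nodes.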

In [None]:
def plt_add_margin(pos, x_factor=0.1, y_factor=0.1):
    # rescales the image s.t. all captions fit onto the canvas
    x_values, y_values = zip(*pos.values())
    x_max = max(x_values)
    x_min = min(x_values)
    y_max = max(y_values)
    y_min = min(y_values)

    x_margin = (x_max - x_min) * x_factor
    y_margin = (y_max - y_min) * y_factor
    # return (x_min - x_margin, x_max + x_margin), (y_min - y_margin, y_max + y_margin)

    plt.xlim(x_min - x_margin, x_max + x_margin)
    plt.ylim(y_min - y_margin, y_max + y_margin)

def scale_weights(graph, minw=1, maxw=8):
    # rescale similarity to interval [minw, maxw] for display
    sims = [graph[s][t]['sim'] for (s, t) in graph.edges]
    min_sim, max_sim = min(sims), max(sims)

    for source, target in graph.edges:
        sim = graph[source][target]['sim']
        graph[source][target]['sim'] = (sim-min_sim)/(max_sim-min_sim)*(maxw-minw)+minw

    return graph

def solve_graphviz_problems(graph):
    # Graphviz has problems with unicode
    # this is to prevent errors during positioning
    def clean(n):
        n = n.replace(',', '')
        n = n.encode().decode('ascii', errors='ignore')
        n = re.sub(r'[{}\[\]]', '-', n)
        n = re.sub(r'^\-', '', n)
        return n
    
    node_map = {n: clean(n) for n in graph.nodes}
    # remove empty nodes
    for n, m in node_map.items(): 
        if len(m) == 0:
            graph.remove_node(n)
    
    return nx.relabel_nodes(graph, node_map)
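
The rescaling in `scale_weights` is a min-max normalization of the similarities onto the interval `[minw, maxw]`. The sketch below applies the same formula to a plain list. The function name `rescale` is hypothetical, and the guard for identical values is an addition that the notebook's function does not have (it would divide by zero if all edges had the same similarity).

```python
def rescale(values, minw=1, maxw=8):
    """Map values linearly so that min -> minw and max -> maxw."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: fall back to the middle of the interval
        return [(minw + maxw) / 2 for _ in values]
    return [(v - lo) / (hi - lo) * (maxw - minw) + minw for v in values]

widths = rescale([0.5, 0.75, 1.0])
constant = rescale([0.6, 0.6])
print(widths, constant)
```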

In [None]:
from networkx.drawing.nx_pydot import graphviz_layout

def plot_tree(graph, node_size=1000, font_size=12):
    graph = solve_graphviz_problems(graph) ###

    pos = graphviz_layout(graph, prog='twopi', root=list(graph.nodes)[0])
    plt.figure(figsize=(10, 4), dpi=200) ###
    plt.grid(False) ### hide grid
    plt.box(False) ### hide box
    plt_add_margin(pos) ### just for layout

    colors = [graph.nodes[n]['dist'] for n in graph] # colorize by distance
    nx.draw_networkx_nodes(graph, pos, node_size=node_size, node_color=colors, 
                           cmap='Set1', alpha=0.4)
    nx.draw_networkx_labels(graph, pos, font_size=font_size)
    scale_weights(graph) ### not in book
    
    for (n1, n2, sim) in graph.edges(data='sim'):
         nx.draw_networkx_edges(graph, pos, [(n1, n2)], width=sim, alpha=0.2)

    plt.show()

In [None]:
### image may differ slightly from the book, as models initialized with random weights
### are not completely comparable
model = models['autos_w2v_sg_2']
graph = sim_tree(model, 'noise', top_n=10, max_dist=3)
plot_tree(graph, node_size=500, font_size=8)

In [None]:
### image may differ slightly from the book, as models initialized with random weights
### are not completely comparable
model = models['autos_w2v_sg_30']
graph = sim_tree(model, 'spark-plug', top_n=8, max_dist=2)
plot_tree(graph, node_size=500, font_size=8)

# Closing Remarks


# Further Reading
