# Word embeddings/vectors 
This notebook creates word vectors for a provided text. Word vectors are numerical representations of the words in a text and are used to determine semantic relationships between words. Two different methods can be used, and a number of parameters can be specified for generating the vectors. More importantly, instead of guessing which parameters are optimal, this notebook creates vectors for every combination of the parameter values you supply, so you can see what works best for your own dataset. The word vectors can be exported for visualization or used to augment other NLP tasks such as document similarity or classification.

Read more about word embeddings here: [https://www.coveo.com/blog/word2vec-explained/](https://www.coveo.com/blog/word2vec-explained/)

This notebook uses both Word2Vec and FastText. You can read more about them at [https://jalammar.github.io/illustrated-word2vec/](https://jalammar.github.io/illustrated-word2vec/) and [https://fasttext.cc/](https://fasttext.cc/), respectively.

Visualization of the word vectors can be done using [http://projector.tensorflow.org/](http://projector.tensorflow.org/) 

## A few notes before starting
Before you start generating word vectors you need to get your data into the right shape. On the most basic level this means turning your collection of text into a single text file and making sure that each sentence begins on its own line. 

In addition, doing any of the following will increase the accuracy of the conversion from text to vectors:
 - "normalizing" text, such as mapping semantically equivalent but lexically different characters to a single representation (Unicode to ASCII)
 - removing unwanted symbols
 - lowercasing all text
 - removing stopwords
 - removing numbers (depending on whether they are significant to your dataset)
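
As a minimal sketch of the preparation steps listed above, the following uses only the standard library. The function names, the tiny `STOPWORDS` set, and the choice to keep hyphens are assumptions made for illustration; adapt them to your own corpus.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; substitute one suited to your corpus.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def normalize_sentence(sentence, remove_numbers=True):
    """Return a cleaned, space-separated version of one sentence."""
    # Fold accented characters to their closest ASCII form (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", sentence)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    # Keep letters, digits, hyphens and whitespace; replace everything else with a space.
    text = re.sub(r"[^a-z0-9\-\s]", " ", text)
    # Drop stopwords and tokens that consist only of hyphens.
    tokens = [tok for tok in text.split()
              if tok.strip("-") and tok not in STOPWORDS]
    return " ".join(tokens)

def write_sentence_file(sentences, path):
    """Write one cleaned sentence per line, skipping sentences that end up empty."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            cleaned = normalize_sentence(sentence)
            if cleaned:
                out.write(cleaned + "\n")

print(normalize_sentence("The café opened in 1999, and it is WELL-educated!"))
# cafe opened it well-educated
```

Splitting raw text into sentences is left out here; `write_sentence_file` assumes you already have an iterable of sentences.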

## Run the following three cells to get started

In [2]:
import os 
import re
import random
import pandas as pd
import ipywidgets as widgets 
import multiprocessing
from itertools import product
from tqdm import tqdm
from glob import glob
from gensim.models import KeyedVectors
from gensim.models import Word2Vec, FastText 
from gensim.models.word2vec import LineSentence

pd.set_option('max_colwidth', 1600)
pd.set_option('display.max_columns', 500)
MODEL_PATH = './models'
models = None
sentences = None

In [12]:
def permutations(parameters):
    '''return all permutations of model parameters'''
    
    keys = parameters.keys()
    values = parameters.values()
    return [dict(zip(keys,tup)) for tup in product(*values)]

def gen_file_name(model_name, kwargs):
    '''generate a model's filename from its name and parameters'''
    
    filename = model_name + ''.join([f"_{k}_{v}" for k,v in kwargs.items()]) + '.bin'
    filename = filename.replace('fasttext','ft')\
                       .replace('word2vec','w2v')\
                       .replace('vector_size','vs')\
                       .replace('window','wn')
    # the worker count does not affect the vectors, so leave it out of the name
    filename = re.sub(r'_workers_\d+','',filename)
    return filename

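def gen_w2v_models(save_dir, model_classes, args, parameters, constant_kwargs):
    '''train one model per (algorithm, parameter set) pair and save its vectors to save_dir'''
    
    total_iterations = len(model_classes) * len(parameters)
    # use a separate name for the results so the list of model classes is not overwritten
    trained = {}

    with tqdm(total=total_iterations) as pbar:
        for name, model_cls in model_classes: 
            for params in parameters:
                # build a fresh dict so the caller's parameter dicts are not mutated
                kwargs = {**params, **constant_kwargs}
                filename = gen_file_name(name, kwargs)
                pbar.set_description(f"Generating {filename}")
                m = model_cls(*args, **kwargs)
                m.wv.save_word2vec_format(os.path.join(save_dir, filename), binary=True)
                # key by the name without '.bin' so it matches load_w2v_models
                trained[filename[:-len('.bin')]] = m.wv
                pbar.update(1)

    return trained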

def load_w2v_models(path):
    '''load every .bin file in path into a dict of {model name: KeyedVectors}'''
    return {os.path.basename(model).split('.bin')[0]: KeyedVectors.load_word2vec_format(model, binary=True) 
            for model in glob(os.path.join(path,'*.bin'))}

def return_similar(word, topn=5):
    '''create a dataframe that shows the topn similar words to 'word' across all models'''
    df = pd.DataFrame()
    sorted_models = sorted(models.items(), key=lambda tup: tup[0])
    for name, model in sorted_models:
        name = name.replace('fasttext','ft')\
                   .replace('word2vec','w2v')\
                   .replace('vector_size','vs')\
                   .replace('window','wn')
        # drop the worker count, whatever its value
        name = re.sub(r'_workers_\d+', '', name)
        # distinct loop names avoid shadowing the 'word' argument
        df[name] = [f"{similar} {score:.3f}" for similar, score in model.most_similar(topn=topn, positive=[word])]
    return df.T

def export_wv_tensor_ep(tensor_ep_dir, models):
    '''export word vectors from a model for visualization in http://projector.tensorflow.org/'''

    # create the export directory, including any missing parent directories
    os.makedirs(tensor_ep_dir, exist_ok=True)
    
    print('Exporting word vectors')
    for model_name in models:
        print(model_name)
        with open(os.path.join(tensor_ep_dir, model_name) + '_words.tsv', 'w') as metadata_f:
            vector_names = sorted(models[model_name].key_to_index.keys())
            metadata_f.write('\n'.join(vector_names))

        with open(os.path.join(tensor_ep_dir, model_name) + '_vecs.tsv', 'w') as vectors_f:
            vectors = ['\t'.join(map(str, models[model_name][vn])) + '\n' for vn in vector_names] 
            vectors_f.writelines(vectors)


Once the cell below is run it creates several widgets that allow you to set parameters and select the vector creation algorithms. 

Either 
1. "Load your data" and then "Generate models". 
2. Or "Load existing models" that you have generated previously.

A word of caution: generating models can take quite a bit of time (30 minutes or more).

Some notes on what the various parameters mean:

- [Window size](https://stackoverflow.com/questions/22272370/word2vec-effect-of-window-size-used/30447723#30447723): Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about the word itself: what other words are functionally similar?
- Vector size: The number of dimensions for each vector. The more dimensions there are, the more information there is for situating relationships between the vectors (within reason). 200 or 300 are common numbers.
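- Word2Vec/FastText: Two different methods of turning words into word vectors. FastText is the newer method; it also learns from the character n-grams inside each word, so words that share substrings tend to end up with related vectors.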
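
To make the window-size note concrete, here is a simplified sketch of which neighbouring words count as context for a target word at a given window size. It illustrates the idea only and is not gensim's training loop, which applies additional sampling; `context_pairs` and the example sentence are made up for this illustration.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for each token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the well-educated voters read the long report".split()
for window in (2, 5):
    contexts = [c for t, c in context_pairs(tokens, window) if t == "voters"]
    print(window, contexts)
# 2 ['the', 'well-educated', 'read', 'the']
# 5 ['the', 'well-educated', 'read', 'the', 'long', 'report']
```

With the larger window, "voters" is also paired with "long" and "report", which is how wider windows pull in more topical context.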

Enter values separated by commas to try more than one value for a parameter.
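
As a rough sketch of how comma-separated values expand into a grid of models, the snippet below mirrors the `permutations` helper defined earlier, using the default widget values. `parse_int_list` and `expand_grid` are hypothetical names introduced here for illustration.

```python
from itertools import product

def parse_int_list(text):
    """Parse a comma-separated string such as '50, 100,200' into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]

def expand_grid(parameters):
    """Return one dict per combination of the given parameter values."""
    keys = list(parameters)
    return [dict(zip(keys, combo)) for combo in product(*parameters.values())]

grid = expand_grid({
    "vector_size": parse_int_list("50,100,200"),
    "window": parse_int_list("2,5,8"),
    "sg": [0, 1],  # 0 = CBOW, 1 = skip-gram
})
algorithms = ["word2vec", "fasttext"]

print(len(grid), "parameter sets,", len(grid) * len(algorithms), "models in total")
# 18 parameter sets, 36 models in total
print(grid[0])
# {'vector_size': 50, 'window': 2, 'sg': 0}
```

Because the number of models is the product of the list lengths, each extra value multiplies the total training time.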

In [13]:
output = widgets.Output(layout={'border': '1px solid black'})

data_textbox = widgets.Text(
    value='doc_sent_file.txt',
    placeholder='Enter data filename here',
    description='Data filename:',
    disabled=False   
)
data_load_btn = widgets.Button(
    description='Load data',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)

@output.capture(clear_output=True)
def dlb_callback(_):
    global sentences
    #Load data here and change color of button to red if failure/green if success
    data_path = data_textbox.value
   
    try:
        with open(data_path) as dsf:
            max_sent_length = max([len(line) for line in dsf.readlines()])

        sentences = LineSentence(data_path, max_sentence_length=max_sent_length)
        data_load_btn.style.button_color = 'lightgreen'
        toggle_gm_widgets(False)
        print('data loaded successfully')
    except FileNotFoundError:
        data_load_btn.style.button_color = 'red'
        print('path not found')
    
data_load_btn.on_click(dlb_callback)
    
model_dir = widgets.Text(
    value=MODEL_PATH,
    placeholder=MODEL_PATH,
    description='Model directory:',
    disabled=False   
)
model_load_btn = widgets.Button(
    description='Load existing models',
    disabled=False,
)


@output.capture(clear_output=True)
def mlb_callback(_): 
    global models 
    print("Loading models")
    models = load_w2v_models(model_dir.value)
    if not models:
        model_load_btn.style.button_color = 'red'
        print('bad model directory')
    else:
        enable_model_functions()
        model_load_btn.style.button_color = 'lightgreen'
        for model in models:
            print(model)
        
model_load_btn.on_click(mlb_callback)

mgp_label = widgets.Label(value="Model generation parameters")
vs_param = widgets.Text(
    value='50,100,200',
    description='Vector size:',
    disabled=False   
)
ws_param = widgets.Text(
    value='2,5,8',
    description='Window size:',
    disabled=False   
)
cbow_param = widgets.Checkbox(
    value=True,
    description='CBOW',
    disabled=False,
    indent=False
)
sg_param = widgets.Checkbox(
    value=True,
    description='Skip Gram',
    disabled=False,
    indent=False
)
ag_params = widgets.SelectMultiple(
    options=['Word2Vec','FastText'],
    value=['Word2Vec','FastText'],
    description='Algorithms',
    disabled=False
)
save_model_dir = widgets.Text(
    value=MODEL_PATH,
    placeholder=MODEL_PATH,
    description='Model directory:',
    disabled=False   
)
gm_btn = widgets.Button(
    description='Generate models',
    disabled=False,
)

@output.capture(clear_output=True)
def gm_callback(_):
    global models
    
    try:
        vs = [int(num) for num in vs_param.value.split(',')]
        ws = [int(num) for num in ws_param.value.split(',')]
        
        sg = []
        if cbow_param.value: sg.append(0)
        if sg_param.value: sg.append(1)
        if not sg: 
            raise ValueError("One of CBOW or Skip Gram must be selected")
            
        ag = []
        if 'Word2Vec' in ag_params.value:
            ag.append(('word2vec',Word2Vec))
        if 'FastText' in ag_params.value:
            ag.append(('fasttext',FastText))

        if not os.path.isdir(save_model_dir.value):
            raise ValueError("Could not save models to non directory path " + save_model_dir.value)
            
        parameters = permutations({'vector_size':vs, 'window':ws, 'sg':sg})

        models = gen_w2v_models(save_model_dir.value,
                               ag,
                               [sentences],
                               parameters,
                               {"workers":multiprocessing.cpu_count()})
        
        enable_model_functions()
        gm_btn.style.button_color = 'lightgreen'
    except Exception as e:
        gm_btn.style.button_color = 'red'
        raise

gm_btn.on_click(gm_callback)

gen_models_widgets = [mgp_label, vs_param, ws_param, cbow_param, sg_param, ag_params, save_model_dir, gm_btn] 
gmw_area=widgets.VBox(gen_models_widgets)

def toggle_gm_widgets(off=True):
    for widget in gen_models_widgets:
        widget.disabled=off

toggle_gm_widgets(True)

compare_topn = widgets.Text(
    value='',
    placeholder='Enter word to compare here',
    description='Compare to...',
    disabled=True   
)

export_output = widgets.Output(layout={'border': '1px solid black'})

export_wv = widgets.Text(
    value='',
    placeholder='Enter directory to export word vectors',
    description='Export path',
    disabled=True   
)
export_wv_btn = widgets.Button(
    description='Export vectors',
    disabled=True,
)

@export_output.capture(clear_output=True)
def export_callback(_):
    try:
        export_wv_tensor_ep(export_wv.value, models)
        export_wv_btn.style.button_color = 'lightgreen'
    except Exception as e:
        export_wv_btn.style.button_color = 'red'
        # report the failure instead of swallowing it silently
        print(f'Export failed: {e}')
        
export_wv_btn.on_click(export_callback)

def enable_model_functions():
    compare_topn.disabled=False
    export_wv.disabled=False
    export_wv_btn.disabled=False
    
widgets.VBox([widgets.HBox([data_textbox, data_load_btn]), 
              widgets.HBox([model_dir, model_load_btn]),
              gmw_area,
              output])


This code cell exports vectors for visualization in [http://projector.tensorflow.org/](http://projector.tensorflow.org/). It is enabled once existing models have been loaded or generated.

In [25]:
display(widgets.VBox([widgets.HBox([export_wv, export_wv_btn]), export_output]))
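
The export writes two tab-separated files per model: one with a vector per row and one with the matching word per row, in the same order. The sketch below shows that layout with toy data; `toy_vectors` and `export_for_projector` are invented for illustration, and the assumption that the projector expects no header row in a single-column metadata file should be checked against its own documentation.

```python
import os
import tempfile

# Toy vectors standing in for a trained model; the values are made up.
toy_vectors = {
    "educated": [0.12, -0.40, 0.33],
    "uneducated": [0.10, -0.35, 0.30],
}

def export_for_projector(vectors, out_dir, name):
    """Write <name>_vecs.tsv and <name>_words.tsv with rows in matching order."""
    os.makedirs(out_dir, exist_ok=True)
    words = sorted(vectors)
    vecs_path = os.path.join(out_dir, f"{name}_vecs.tsv")
    words_path = os.path.join(out_dir, f"{name}_words.tsv")
    with open(vecs_path, "w", encoding="utf-8") as vecs_f:
        for word in words:
            # one vector per line, components separated by tabs
            vecs_f.write("\t".join(str(x) for x in vectors[word]) + "\n")
    with open(words_path, "w", encoding="utf-8") as words_f:
        # one label per line, no header row
        words_f.write("\n".join(words) + "\n")
    return vecs_path, words_path

out_dir = tempfile.mkdtemp()
vecs_path, words_path = export_for_projector(toy_vectors, out_dir, "toy")
with open(vecs_path, encoding="utf-8") as f:
    print(f.read())
```

Row *i* of the words file labels row *i* of the vectors file, so both must be written from the same sorted word list.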


Run the cell below and enter a word, then run the cell again to see which words each model ranks as most similar to it.

In [44]:
if compare_topn.value == '':
    #Take a random word in the vocab as a default
    compare_topn.value = random.choice(list(list(models.values())[0].key_to_index))
display(compare_topn)
return_similar(compare_topn.value)

Text(value='well-educated', description='Compare to...', placeholder='Enter word to compare here')

| Model | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| ft_vs_100_wn_2_sg_0 | well-ordered 0.914 | educated 0.903 | well-planned 0.897 | well-trained 0.896 | well-informed 0.895 |
| ft_vs_100_wn_2_sg_1 | educated 0.855 | uneducated 0.841 | well-ordered 0.810 | well-trained 0.802 | well-armed 0.776 |
| ft_vs_100_wn_5_sg_0 | educated 0.905 | uneducated 0.878 | well-trained 0.872 | well-planned 0.868 | well-regulated 0.858 |
| ft_vs_100_wn_5_sg_1 | educated 0.874 | uneducated 0.854 | well-ordered 0.792 | well-endowed 0.759 | well-informed 0.754 |
| ft_vs_100_wn_8_sg_0 | educated 0.905 | uneducated 0.875 | well-trained 0.860 | well-planned 0.852 | well-advised 0.835 |
| ft_vs_100_wn_8_sg_1 | educated 0.883 | uneducated 0.854 | well-to-do 0.779 | well-ordered 0.774 | educate 0.773 |
| ft_vs_200_wn_2_sg_0 | educated 0.893 | well-ordered 0.882 | uneducated 0.879 | well-advised 0.875 | well-trained 0.874 |
| ft_vs_200_wn_2_sg_1 | educated 0.856 | uneducated 0.832 | well-trained 0.756 | well-ordered 0.754 | well-informed 0.753 |
| ft_vs_200_wn_5_sg_0 | educated 0.895 | uneducated 0.868 | well-planned 0.848 | well-trained 0.847 | well-regulated 0.832 |
| ft_vs_200_wn_5_sg_1 | educated 0.854 | uneducated 0.828 | well-trained 0.748 | well-ordered 0.729 | educate 0.726 |
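
The number next to each word is the score returned by gensim's `most_similar`, which is the cosine similarity between the two word vectors. As a small illustration of that measure, here is a pure-Python version with invented three-dimensional vectors; real vectors have as many dimensions as the chosen vector size.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
educated = [1.0, 2.0, 2.0]
uneducated = [2.0, 4.0, 4.0]  # same direction as `educated`, twice the length
unrelated = [2.0, -1.0, 0.0]  # at a right angle to `educated`

print(round(cosine_similarity(educated, uneducated), 3))  # 1.0
print(round(cosine_similarity(educated, unrelated), 3))   # 0.0
```

Only the direction of the vectors matters, not their length, which is why scores from models with different vector sizes can be read on the same scale even though the vectors themselves are not comparable.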
