<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M2-training-word-vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training customized word embeddings

Word embeddings became big around 2013 and are linked to [this paper](https://arxiv.org/abs/1301.3781) with the beautiful title
*Efficient Estimation of Word Representations in Vector Space* by Tomas Mikolov et al. coming out of Google. This was the foundation of Word2Vec.

The idea behind it is most easily summarized by the following quote:


> *You shall know a word by the company it keeps (Firth, J. R. 1957:11)*

![](https://ruder.io/content/images/size/w2000/2016/04/word_embeddings_colah.png)

Let me start with a fascinating example of word embeddings in practice. Below, you can see a figure from the paper
*Dynamic Word Embeddings for Evolving Semantic Discovery*. Here (in simple terms) the researchers estimated word vectors from textual inputs in different time-frames. They picked out some terms and people that obviously changed *their company* over the years. Then they looked at the relative position of these terms compared to terms that did not change much (anchors). If you are interested in this kind of research, check out [this blog](https://blog.acolyer.org/2018/02/22/dynamic-word-embeddings-for-evolving-semantic-discovery/) that describes the paper briefly or the [original paper](https://arxiv.org/abs/1703.00607).

![alt text](https://adriancolyer.files.wordpress.com/2018/02/evolving-word-embeddings-fig-1.jpeg)

Word embeddings allow us to create term representations that "learn" meaning from semantic and syntactic features. These models take a sequence of sentences as input, collect every term that appears in the corpus, and learn from the contexts in which each term occurs. Such contextual learning seems to be able to pick up non-trivial conceptual details, and it is this class of models that today enables technologies such as chatbots, machine translation and much more.
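
To make "the company a word keeps" concrete, here is a minimal sketch of how (center, context) pairs can be read off a tokenized sentence with a fixed window. The function name `context_pairs` and the example sentence are made up for illustration; the actual Word2Vec training adds refinements on top of this basic idea.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# hypothetical toy sentence
sentence = ["sear", "the", "steak", "in", "butter"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Words that keep showing up with the same context words end up with similar vectors, which is the intuition behind the Firth quote above.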

The early word embedding models were Word2Vec and [GloVe](https://nlp.stanford.edu/projects/glove/).
In 2016 Facebook presented [fastText](https://fasttext.cc/) (by the way: by then Tomas Mikolov was working for Facebook and is one of the authors of the [paper](https://arxiv.org/abs/1607.04606) that introduces the research behind fastText). This model extends the idea of Word2Vec, enriching these vectors with information from sub-word elements. What does that mean? Words are not only defined by surrounding words but also by the character n-grams (short sub-word chunks) that make up the word. Why should that be a good idea? Well, now words such as *apple* and *apples* do not only get similar vectors because they often share context but also because they are composed of the same sub-word elements. This comes in particularly handy when we are dealing with languages that have a rich morphology, such as Turkish or Russian. This is also great when working with web text, which is often messy and misspelt.
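
The sub-word idea can be illustrated with a short sketch that splits a word into character n-grams, using `<` and `>` as word-boundary markers as described in the fastText paper. The helper `char_ngrams` is a hypothetical name; the real library additionally hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# n-grams that "apple" and "apples" have in common
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(sorted(shared))

# the tri-grams of "where", the example used in the fastText paper
where_trigrams = char_ngrams("where", 3, 3)
print(where_trigrams)
```

Because a word vector is built from the vectors of its n-grams, the many shared n-grams pull *apple* and *apples* together even if one of them is rare in the corpus.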

The current state-of-the-art transformer models go even further and implement context-specificity (a word may change meaning depending on the context in which it occurs).

Now the good news: You will find pre-trained vectors from all mentioned models online. They will do great in most cases. However, for specific tasks, such as working with less common languages and/or specialized technical jargon (a specific scientific field or industry, e.g. finance or insurance), it is nice to know how to train such word vectors yourself.


In this tutorial we will train the "classic" Word2Vec model, considering bi-grams. We will also look a bit into data-engineering issues in sequence-training. Finally, we will look at how we can use such models for text representation beyond individual words.

## Data

The data used here are 10k cooking-related posts from Reddit. They come in JSON-lines format and can either be downloaded first or opened via requests.

## Plan of attack
In this tutorial we will not be using spaCy, as it is not fast enough for preprocessing the large corpora used to train language models.
The intent is to understand training from disk, where the file is not loaded into memory as an object (with e.g. pandas) but streamed from disk.

In [None]:
# download data (optional when training from memory)
!wget https://raw.githubusercontent.com/aaubs/ds-master/main/data/reddit_r_cooking_sample.jsonl

In [None]:
# installs
!pip install --upgrade gensim

In [None]:
import pandas as pd
import numpy as np
import json

# we will use nltk for sentence tokenization
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'

# we will be using gensim for training
import gensim
from gensim import utils
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS


# Logging settings
import logging

for handler in logging.root.handlers[:]:
   logging.root.removeHandler(handler)

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Simple In-memory training

To better understand the training itself we start with simple model training in memory. All the data will be loaded with pandas.
Preprocessing results will also be kept in memory as Python lists. This is a viable approach up to a certain data size. When going beyond roughly 5M texts (depending on the hardware), that's probably not a good idea.

In [None]:
# load data
data = pd.read_json('https://raw.githubusercontent.com/aaubs/ds-master/main/data/reddit_r_cooking_sample.jsonl', lines=True)

In [None]:
data.head()

Word2Vec uses sentences to train, not paragraphs. Therefore we will need to sentence-tokenize.

In [None]:
# NLTK tokenizer:
sent_tokenize('this is a sentence. also that one.')

In [None]:
# Let's apply that to all texts
sentences = []
for i in data['text']:
  sentences.extend(sent_tokenize(i))

In [None]:
len(sentences)

Gensim has efficient simple preprocessing as part of its utility functions. It works well for most Latin-script texts. Check out the [Gensim docs](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html) for more info.
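
To get a feeling for what this preprocessing does, here is a rough pure-Python approximation: lowercase the text, keep alphabetic tokens, and drop tokens that are very short or very long. The function `rough_preprocess` is a made-up stand-in, not Gensim's implementation; it skips de-accenting and its tokenization rules may differ in details.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop very short and very long ones."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess("This is a sentence. Also that one, with 2 eggs!")
print(tokens)
```

Note how the single-letter token "a", the number and the punctuation disappear, while stopwords such as "is" and "that" stay in.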

In [None]:
# simple prepro (tokenization, lowercase, de-accent (optional))
sentences_prepro = [utils.simple_preprocess(line) for line in sentences]

We are not removing stopwords for Word2Vec, as the model actually cares about syntax. One thing that we can do is identify n-grams (phrases).

In [None]:
# training a model to identify n-grams
phrase_model = Phrases(sentences_prepro, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

In [None]:
# apply the model
sentences_phrased = [phrase_model[line] for line in sentences_prepro]

In [None]:
# quick check
sentences_phrased[:5]

Obviously, some hyperparameter tuning is needed.

In [None]:
# adjusting min_count and threshold (the threshold is compared against a score calculated within the model - see the docs)
phrase_model = Phrases(sentences_prepro, min_count=25, threshold=20, connector_words=ENGLISH_CONNECTOR_WORDS)
sentences_phrased = [phrase_model[line] for line in sentences_prepro]
sentences_phrased[:5]

In [None]:
# did we actually find anything?
for phrase, score in phrase_model.find_phrases(sentences_prepro).items():
    print(phrase, score)
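
The scores printed above can be made less mysterious with a small sketch of the default scoring rule as I understand it from the Gensim documentation: a bigram scores high when it occurs much more often than the frequencies of its two words would suggest. The function name `phrase_score` and all counts below are hypothetical and chosen only for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Bigram score: (count_ab - min_count) / (count_a * count_b) * vocab_size."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# made-up counts, not taken from the Reddit data
vocab_size = 20_000
score_dutch_oven = phrase_score(count_a=120, count_b=400, count_ab=100,
                                vocab_size=vocab_size, min_count=25)
score_the_oven = phrase_score(count_a=50_000, count_b=400, count_ab=150,
                              vocab_size=vocab_size, min_count=25)

threshold = 20
print(score_dutch_oven, score_dutch_oven > threshold)
print(score_the_oven, score_the_oven > threshold)
```

With these toy numbers the rare but tightly bound pair passes the threshold, while the frequent pair with a very common first word does not. This is why raising `threshold` and `min_count` removes the noisy bigrams from the first attempt.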

Once sentences are pre-processed (tokenized, list of lists) we can train the model.

In [None]:
model = gensim.models.Word2Vec(sentences=sentences_phrased, 
                               vector_size=300, 
                               window=5, 
                               min_count=5, 
                               workers=4, 
                               epochs=15)

In [None]:
# check most similar terms
model.wv.most_similar('dutch_oven')

In [None]:
# we can call the vector of each word
model.wv['kettle']

In [None]:
model.wv.vectors.shape

In [None]:
# key_to_index maps every word in the vocab to its row in the vector matrix
model.wv.key_to_index

## Training Word2Vec from disk

Let's assume you want to train a word-embedding model from disk: you downloaded all of Wikipedia or one of the large (multi-GB) datasets from Hugging Face.

In [None]:
# open file (not read yet) from disk
texts_reddit = open('/content/reddit_r_cooking_sample.jsonl','r')

In [None]:
# read single line (this will iterate over the lines)
texts_reddit.readline()

In [None]:
# Decode JSON
json.loads(texts_reddit.readline())

We need to turn our comments into sentences (tokenize) and preprocess. There is no need to repeat on-the-fly preprocessing in each of the 15 training epochs.
For that we create a new file (named `sentances.txt` in the code), tokenize our texts and write all sentences as lines into the new file. Using one sentence per line in TXT files is a common approach.

In [None]:
# we need to re-open the file to start from the top
texts_reddit = open('/content/reddit_r_cooking_sample.jsonl','r')

In [None]:
# write one sentence per line into a new file
with open('sentances.txt','w') as f:
  for line in texts_reddit: # iterate over the json-lines with comments (alternative to readline())
    line = json.loads(line) # decode json
    for sent in sent_tokenize(line['text']): # sent-tokenize
      f.write(' '.join(sent.split())) # collapse line breaks inside a sentence, then write it
      f.write('\n')
# no f.close() needed: the with-block closes the file

The next step is not easy but important and your first step to writing "real code".
We need to define something that allows us to retrieve our sentences from the stored file one by one (and start from the beginning after the last one).

A class with an `__iter__` method can help here. Its instances are iterables: every time training loops over the object, `__iter__` starts again at the top of the file and hands out the sentences one by one. `yield` is different from `return`. The latter ends the execution and returns the "overall" result of a function. `yield` hands back one value, pauses the function, and resumes from the same spot when the next value is requested.
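
The difference matters because training runs over the corpus several times (once per epoch). The following sketch, with made-up names and a tiny in-memory list standing in for the file, contrasts a plain generator, which is exhausted after one pass, with a class whose `__iter__` starts over on every pass.

```python
def read_once(lines):
    """Generator function: pauses at each `yield`, resumes on the next request."""
    for line in lines:
        yield line.split()

class ReadManyTimes:
    """Iterable: every loop calls __iter__ again and starts from the top."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

lines = ["salt the water", "boil the pasta"]  # stand-in for lines in a file

gen = read_once(lines)
first_pass = list(gen)
second_pass = list(gen)        # empty: the generator is already exhausted

corpus = ReadManyTimes(lines)
first_epoch = list(corpus)
second_epoch = list(corpus)    # full again: a fresh generator per pass

print(len(first_pass), len(second_pass))
print(len(first_epoch), len(second_epoch))
```

This is why a restartable iterable, rather than a one-shot generator, is the safer thing to hand to a model that trains for multiple epochs.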

In [None]:
path = "/content/sentances.txt"

In [None]:
class MyCorpus:
    """An iterable that yields sentences (lists of str)."""
    def __iter__(self):
        with open(path) as f:
            for line in f:
                # one sentence per line; simple_preprocess tokenizes and lowercases it
                yield utils.simple_preprocess(line)

Let's try out how that works.

In [None]:
# instantiate a corpus object
sentences_disk = MyCorpus()

In [None]:
# define a generator expression (like a list comprehension, but evaluated lazily, one item at a time)
test_gen = (a for a in sentences_disk)

In [None]:
# every time we call next, it runs one iteration
next(test_gen)

Let's train our Phrases model from the disk corpus.

In [None]:
sentences_disk = MyCorpus()

In [None]:
phrase_model = Phrases(sentences_disk, min_count=25, threshold=20, connector_words=ENGLISH_CONNECTOR_WORDS)

In [None]:
for phrase, score in phrase_model.find_phrases(sentences_disk).items():
    print(phrase, score)

🚀🚀🚀
**Efficiency** is key when working from disk.
Let's preprocess the inputs using simple-prepro and the phrases model.
Since we preprocess our sentences into lists, we need to store them as JSON so that we can load them back as Python lists rather than plain strings.
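
A small round trip shows why JSON is used here. The sketch writes two hypothetical token lists to a temporary file (the file name and tokens are invented for the example), reads the first line back, and decodes it into a list again.

```python
import json
import os
import tempfile

# made-up phrased sentences
phrased = [["dutch_oven", "is", "great"], ["add", "olive_oil"]]

tmp_path = os.path.join(tempfile.mkdtemp(), "phrased_demo.txt")
with open(tmp_path, "w") as f:
    for sent in phrased:
        f.write(json.dumps(sent) + "\n")   # one JSON list per line

with open(tmp_path) as f:
    raw_line = f.readline()                # what comes off the disk is a string

restored = json.loads(raw_line)            # decoding gives us the list back
print(type(raw_line).__name__, type(restored).__name__)
print(restored)
```

Reading a line only gives a string; `json.loads` is what turns it back into the list of tokens that the model expects.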

In [None]:
sentences_disk = MyCorpus()

In [None]:
# open new file (txt file with one JSON list per line)
with open('sentances_phrases.txt','w') as f:
  for sent in sentences_disk: # iterate over the preprocessed sentences streamed from disk
    f.write(json.dumps(phrase_model[sent])) # write the phrased token list as JSON
    f.write('\n')

In [None]:
path = '/content/sentances_phrases.txt'

In [None]:
class MyCorpus_processed:
    """An iterable that yields preprocessed sentences (lists of str)."""
    def __iter__(self):
        with open(path) as f:
            for line in f:
                # each line holds one sentence as a JSON list of tokens
                yield json.loads(line)

In [None]:
sentences_disk = MyCorpus_processed()

In [None]:
# train Word2Vec, streaming the preprocessed sentences from disk
model = gensim.models.Word2Vec(sentences=sentences_disk, 
                               vector_size=300, 
                               window=5, 
                               min_count=5, 
                               workers=4, 
                               epochs=15)

In [None]:
model.wv.most_similar('coriander')

### Bonus: Training FastText

Training FastText is syntax-wise the same.
There are a few other parameters that you can tune.

In [None]:
model_fasttext = FastText(sentences = sentences_disk, 
                          vector_size=300, 
                          window=8, 
                          min_count=5, 
                          workers=4, 
                          epochs=15)

In [None]:
model_fasttext.wv.most_similar('coriander')

In [None]:
model.wv['powder']

## Visualizing Word-Vectors

Now that we have our word vectors, we can reduce their dimensionality to explore them visually.

In [None]:
!pip install umap-learn -q

In [None]:
import random
import umap
import altair as alt

In [None]:
# picking up to 2000 random vectors from the W2V model (fewer if the vocab is smaller)
idx = random.sample(range(len(model.wv.vectors)), min(2000, len(model.wv.vectors)))

In [None]:
# creating 2D reduction
umap_reducer = umap.UMAP(random_state=42, n_components=2)
embeddings = umap_reducer.fit_transform(model.wv.vectors[idx])

In [None]:
# df for plot
df_plot = pd.DataFrame(embeddings, columns=['x','y'])

In [None]:
# vector-labels
labels = [model.wv.index_to_key[ix] for ix in idx]

In [None]:
df_plot['labels'] = labels

In [None]:
# plot
alt.Chart(df_plot).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['labels']
).properties(
    width=800,
    height=600
).interactive()

## Create sentence embeddings from our W2V model

The final aim is to use the custom W2V embeddings to vectorize sentences.
We will look at average vectors and TFIDF-weighted average embeddings.

In [None]:
test_sents = ['I love chicken super much with soy',
              'I enjoy asian food, especially chicken',
              'Give me cake', 'mexican food is amazing', 
              'I enjoy cuisine italian']

### Average W2V vectors

In [None]:
# tokenize
tokens = phrase_model[utils.simple_preprocess(test_sents[0])]

In [None]:
# filter out only those words that are part of the vocab
tokens = [t for t in tokens if t in model.wv.key_to_index.keys()]

In [None]:
# create average-vectors
avg_vec = np.average([model.wv[t] for t in tokens], axis=0)

Let's package this process up into a vectorizer function.

In [None]:
def w2v_vectorize(text):
  tokens = phrase_model[utils.simple_preprocess(text)] # preprocess just like the model inputs
  tokens = [t for t in tokens if t in model.wv.key_to_index] # keep only tokens that are in the vocab
  if not tokens: # no known tokens: return a zero vector instead of failing on an empty list
    return np.zeros(model.wv.vector_size)
  return np.average([model.wv[t] for t in tokens], axis=0) # calculate avg vector

In [None]:
# it's a good idea to stack them into a matrix using numpy
vecs = np.vstack([w2v_vectorize(s) for s in test_sents])

In [None]:
# quick look at the pairwise similarities of the vectors (not really part of the pipeline)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vecs)
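
What the similarity matrix reports can be reproduced by hand on a toy example. The sketch below uses three made-up 3-dimensional "word vectors" (not taken from the trained model) and two hypothetical helpers, `average_vector` and `cosine`, to show how averaged vectors are compared.

```python
import numpy as np

# invented toy vectors, one axis per word to keep the arithmetic obvious
toy_vectors = {
    "chicken": np.array([1.0, 0.0, 0.0]),
    "soy":     np.array([0.0, 1.0, 0.0]),
    "cake":    np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, vectors):
    """Average the vectors of all known tokens; zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if one of them is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

v_chicken_soy = average_vector(["chicken", "soy"], toy_vectors)
v_chicken = average_vector(["chicken"], toy_vectors)
v_cake = average_vector(["cake"], toy_vectors)

print(cosine(v_chicken_soy, v_chicken))  # share one word: clearly above zero
print(cosine(v_chicken_soy, v_cake))     # nothing in common
```

Cosine similarity only looks at the direction of the vectors, so the length of a sentence does not influence the score.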

### TFIDF weighted W2V Embeddings

Very similar to avg-embeddings; however, here we will use sklearn's TfidfVectorizer (that one we already know) to weight our vecs.
The approach is a bit "hacky" but efficient.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# function that does absolutely nothing...
# because we do prepro and tokenization in one step using gensim, we pass this as the tokenizer
def dummy_fun(doc):
    return doc

In [None]:
[phrase_model[utils.simple_preprocess(text)] for text in test_sents]

In [None]:
# we define a preprocessing function to pass into the TfidfVectorizer
def gensim_prepro(doc):
  return phrase_model[utils.simple_preprocess(doc)]

In [None]:
# we turn off any built-in preprocessing and align the vocabulary with the one
# used by our embeddings
# that will allow us to use TFIDF vectors to weight the embeddings

tfidf_new_text = TfidfVectorizer(
    vocabulary=model.wv.key_to_index.keys(), # here using the W2V vocab
    tokenizer=dummy_fun,
    preprocessor=gensim_prepro,
    token_pattern=None)  

In [None]:
# create TFIDF matrix (we could also just use that one for search)
new_tfidf = tfidf_new_text.fit_transform(test_sents)

In [None]:
new_tfidf

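This here is a cool little trick: Since the number of columns of the TFIDF matrix equals the number of rows of our word-embedding matrix (both are ordered by the W2V vocabulary), we can simply take a dot-product here. Each row of the result is the sum of a text's word vectors weighted by their TFIDF values; the overall scale does not matter for cosine similarity.
Another cool feature: this can be done sequentially, chunk by chunk, for large datasets (when they do not fit in RAM).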
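
Both points can be checked on a toy example. The matrices below are invented (4 texts, 3 vocabulary words, 2 embedding dimensions) and dense for readability; the sketch assumes that the weight columns and the embedding rows follow the same vocabulary order.

```python
import numpy as np

# rows = words in the vocab, columns = embedding dimensions (made-up values)
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])

# rows = texts, columns = words in the vocab (made-up weights)
weights = np.array([[0.5, 0.5, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.2, 0.0, 0.8],
                    [0.0, 1.0, 0.0]])

# all texts at once: (4 x 3) @ (3 x 2) -> (4 x 2)
all_at_once = weights @ embeddings

# the same result, computed two texts at a time
chunk_size = 2
chunks = [weights[i:i + chunk_size] @ embeddings
          for i in range(0, weights.shape[0], chunk_size)]
in_chunks = np.vstack(chunks)

print(all_at_once)
print(np.allclose(all_at_once, in_chunks))
```

Since every text is handled independently, the chunks could just as well be read from disk and written out one after another.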

In [None]:
# calculating TFIDF-weighted avg. embeddings
test_w2v_tfidf = new_tfidf @ model.wv.vectors

In [None]:
cosine_similarity(test_w2v_tfidf)

## Using these embeddings for semantic search
We can use such embeddings (and others) for semantic search (similarity maximization) and also downstream in unsupervised/supervised tasks.
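
In toy form, "similarity maximization" means scoring every document vector against the query vector and keeping the best ones. The helper `top_k` and the 2-dimensional vectors are hypothetical and only meant to show the ranking step.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return indices and cosine similarities of the k most similar documents."""
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    denom[denom == 0] = 1.0              # all-zero vectors get similarity 0
    sims = (doc_vecs @ query_vec) / denom
    order = np.argsort(sims)[::-1][:k]   # highest similarity first
    return order, sims[order]

# invented document vectors and query
doc_vecs = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [0.0, 0.0]])
query_vec = np.array([1.0, 0.1])

order, sims = top_k(query_vec, doc_vecs, k=2)
print(order, sims)
```

The document pointing in almost the same direction as the query comes first, followed by the one that shares part of that direction.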

In [None]:
# create TFIDF matrix for all
tfidf_all = tfidf_new_text.fit_transform(data['text'])

In [None]:
# get vecs by dot-product
tfidf_w2v_all = tfidf_all @ model.wv.vectors

In [None]:
# make query and transform it into same vector-space

query = 'Steak egg'

tfidf_q = tfidf_new_text.transform([query]) 
tfidf_w2v_q = tfidf_q @ model.wv.vectors

In [None]:
# calculate cos-sim between the query and all vecs

similarities = cosine_similarity(tfidf_w2v_q, tfidf_w2v_all)

In [None]:
# rank documents from most to least similar to the query
ids = np.argsort(similarities[0])[::-1]
ids

In [None]:
# print
for ix in ids[:10]:
  print(data['text'].values[ix])

### Serialization

Gensim models can be (and should be) saved to disk after training.

In [None]:
phrase_model.save('bigram_model.m')

In [None]:
model.save('w2v_food.m')

In [None]:
g = Word2Vec.load('/content/w2v_food.m')

In [None]:
g.wv.most_similar('garlic')