## Word2Vec embeddings

This notebook trains Word2Vec embeddings for each token or phrase in the OECD corpus. We compare models that differ in vector size and context window size; the models shown here use 100-dimensional vectors with context windows of 20 and 30 words.



### 1. Replace curated ngrams with single tokens in docs

A curated list of multi-word phrases, given in `ngram_replacements.json`, identifies terms that should each be represented by a single vector. We replace these phrases with single tokens in this step. The cell below loads the corpus from `processed_ngram_ner_data.json`.

In [1]:
import json
import os
import sys
from pathlib import Path

sys.path.append('../util/')  # make the preprocessing module in ../util importable
from preprocessing import preprocess_word2vec

# models/ and data-files/ are siblings of the current working directory
path = Path(os.getcwd())
models_dir = os.path.join(path.parents[0], "models")
data_dir = os.path.join(path.parents[0], "data-files")

with open(os.path.join(data_dir, "processed_ngram_ner_data.json")) as f:
    corpus = json.load(f)

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/kodymoodley/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
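
The cell above only loads the corpus; the replacement itself is not shown. As a minimal sketch, assuming `ngram_replacements.json` maps each phrase to its single-token form (the real file layout may differ), the substitution could look like the following. The function name `replace_ngrams` and the sample mapping are hypothetical.

```python
import re

def replace_ngrams(text, replacements):
    """Replace each multi-word phrase in `text` with its single-token form.

    `replacements` is assumed to map phrase -> token, e.g.
    {"blended finance": "blended_finance"}; the real JSON layout may differ.
    """
    # Longest phrases first, so a longer phrase wins over a phrase it contains.
    phrases = sorted(replacements, key=len, reverse=True)
    if not phrases:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    # Look up the matched phrase case-insensitively.
    lookup = {p.lower(): tok for p, tok in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

# Hypothetical sample mapping, for illustration only.
sample = {"blended finance": "blended_finance", "repayable finance": "repayable_finance"}
print(replace_ngrams("Blended finance can complement repayable finance.", sample))
# prints: blended_finance can complement repayable_finance.
```

Matching whole phrases on word boundaries avoids rewriting a phrase that only appears as part of a longer word.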


### 2. Preprocess text for word2vec algorithm

Prepare the text for training the embeddings. This step performs further preprocessing, such as removing stopwords and punctuation and lemmatizing the remaining tokens.

In [2]:
# combine all docs into one string, ending each doc with '. '
corpus_as_str = ''.join(doc + '. ' for doc in corpus.values())

processed_corpus = preprocess_word2vec(corpus_as_str, custom_stopwords=None)
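
`preprocess_word2vec` is imported from `../util/` and is not reproduced here. gensim's `Word2Vec` expects `sentences` to be an iterable of token lists, so `processed_corpus` should have roughly the shape produced by this simplified stand-in. The function `simple_preprocess` and its tiny stopword set are hypothetical, and lemmatization is omitted to keep the sketch dependency-free.

```python
import re

# Tiny illustrative stopword list; the real script presumably uses a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def simple_preprocess(text):
    """Split text into sentences, then into lowercase tokens without stopwords."""
    sentences = []
    for raw in re.split(r"[.!?]+\s+|[.!?]+$", text):
        # Keep word characters, underscores (ngram tokens) and inner hyphens.
        tokens = re.findall(r"[a-z0-9_]+(?:-[a-z0-9_]+)*", raw.lower())
        tokens = [t for t in tokens if t not in STOPWORDS]
        if tokens:
            sentences.append(tokens)
    return sentences

print(simple_preprocess("Blended_finance is a tool. The loan is non-concessional. "))
# prints: [['blended_finance', 'tool'], ['loan', 'non-concessional']]
```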

### 3. Train the embedding models

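Try different parameters for vector and window sizes. Both models below are skip-gram models (`sg=1`) with 100-dimensional vectors, trained for 25 epochs on tokens that occur at least twice; they differ only in window size (20 versus 30).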

In [3]:
import gensim.models

In [4]:
model_100_20 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=20, workers=4, min_count=2, epochs=25)

In [5]:
model_100_30 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=30, workers=4, min_count=2, epochs=25)
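
Each additional configuration repeats the same call with a different `vector_size` or `window`. One way to organise this, sketched here on the assumption that model files follow the `gensim-oecd-word2vec-<vector_size>-<window>.model` naming used when saving in this notebook, is to enumerate the combinations once. The helper names `model_filename` and `train_grid` are hypothetical, and the vector size of 200 is only an example value.

```python
from itertools import product

def model_filename(vector_size, window):
    """File name following the gensim-oecd-word2vec-<size>-<window>.model pattern."""
    return f"gensim-oecd-word2vec-{vector_size}-{window}.model"

def train_grid(sentences, vector_sizes, windows):
    """Train one skip-gram model per (vector_size, window) pair."""
    import gensim.models  # imported here so the helpers above work without gensim
    models = {}
    for vector_size, window in product(vector_sizes, windows):
        models[(vector_size, window)] = gensim.models.Word2Vec(
            sentences=sentences, sg=1, vector_size=vector_size, window=window,
            workers=4, min_count=2, epochs=25,
        )
    return models

# The two configurations trained above, plus a hypothetical 200-dimensional pair.
for size, win in product([100, 200], [20, 30]):
    print(size, win, model_filename(size, win))
```

Calling `train_grid(processed_corpus, [100], [20, 30])` would train the same two configurations as the cells above.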

### 4. Save the embedding models

Save both trained models to the `models` directory so they can be reloaded without retraining.

In [6]:
# write each model under models_dir; the directory is created if missing
os.makedirs(models_dir, exist_ok=True)
filepath_100_20 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-20.model')
filepath_100_30 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-30.model')
model_100_20.save(filepath_100_20)
model_100_30.save(filepath_100_30)

### 5. Test the models

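Explore the quality of the embeddings by reloading each saved model and listing the 20 nearest neighbours of `finance`, ranked by cosine similarity.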

In [7]:
c_testmodel = gensim.models.Word2Vec.load(filepath_100_20)

In [8]:
c_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.7480107545852661),
 ('investment', 0.7048020362854004),
 ('commercial', 0.7008841633796692),
 ('concessional', 0.6848236918449402),
 ('mobilising', 0.6835558414459229),
 ('mezzanine', 0.6796742081642151),
 ('mobilise', 0.6683375835418701),
 ('blended', 0.6679754257202148),
 ('attract', 0.6674732565879822),
 ('debt', 0.6533348560333252),
 ('leverage', 0.6512179374694824),
 ('non-concessional', 0.646682620048523),
 ('organi-sations', 0.6398669481277466),
 ('repayable', 0.6380059719085693),
 ('loan', 0.6368876099586487),
 ('blend', 0.6358164548873901),
 ('risk-management', 0.6299599409103394),
 ('repayable_finance', 0.6278873682022095),
 ('financier', 0.6156778931617737),
 ('investor', 0.6143190264701843)]
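
The scores returned by `most_similar` are cosine similarities between the query vector and each candidate vector. As a reminder of what that number measures, here is a small dependency-free sketch on made-up three-dimensional vectors (the vectors in this notebook have 100 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
finance = [1.0, 2.0, 0.0]
financing = [2.0, 4.0, 0.0]   # same direction as `finance`
unrelated = [0.0, 0.0, 3.0]   # orthogonal to `finance`

print(cosine_similarity(finance, financing))  # close to 1: same direction
print(cosine_similarity(finance, unrelated))  # 0: no shared direction
```

With a loaded model, `c_testmodel.wv.similarity('finance', 'financing')` returns the same kind of score for a single pair of words.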

In [9]:
e_testmodel = gensim.models.Word2Vec.load(filepath_100_30)

In [10]:
e_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.7811167240142822),
 ('commercial', 0.740609884262085),
 ('investment', 0.7234548926353455),
 ('blended', 0.6920834183692932),
 ('attract', 0.684935450553894),
 ('mobilise', 0.6698009967803955),
 ('investor', 0.6550965309143066),
 ('mezzanine', 0.6516242027282715),
 ('concessional', 0.6506146788597107),
 ('repayable', 0.6481335163116455),
 ('debt', 0.6457288861274719),
 ('repayable_finance', 0.6439858078956604),
 ('leverage', 0.6382477283477783),
 ('blend', 0.6378662586212158),
 ('loan', 0.6332045793533325),
 ('vipa', 0.6283848285675049),
 ('financier', 0.6255854368209839),
 ('organi-sations', 0.6253201365470886),
 ('blended_finance', 0.6248964667320251),
 ('mobilising', 0.6224984526634216)]
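
The two neighbour lists contain largely the same words in a different order. A quick way to quantify that, sketched here with the words copied from the two outputs above, is to compare the lists as sets.

```python
# Top-20 neighbours of 'finance', copied from the outputs above.
window_20 = [
    'financing', 'investment', 'commercial', 'concessional', 'mobilising',
    'mezzanine', 'mobilise', 'blended', 'attract', 'debt', 'leverage',
    'non-concessional', 'organi-sations', 'repayable', 'loan', 'blend',
    'risk-management', 'repayable_finance', 'financier', 'investor',
]
window_30 = [
    'financing', 'commercial', 'investment', 'blended', 'attract', 'mobilise',
    'investor', 'mezzanine', 'concessional', 'repayable', 'debt',
    'repayable_finance', 'leverage', 'blend', 'loan', 'vipa', 'financier',
    'organi-sations', 'blended_finance', 'mobilising',
]

shared = set(window_20) & set(window_30)
only_20 = sorted(set(window_20) - set(window_30))
only_30 = sorted(set(window_30) - set(window_20))
jaccard = len(shared) / len(set(window_20) | set(window_30))

print(len(shared))        # 18
print(only_20)            # ['non-concessional', 'risk-management']
print(only_30)            # ['blended_finance', 'vipa']
print(round(jaccard, 3))  # 0.818
```

With the models loaded, the lists can be built directly, for example `[w for w, _ in c_testmodel.wv.most_similar(positive=['finance'], topn=20)]`.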