## Word2Vec embeddings

This notebook trains Word2Vec embeddings for each token or phrase in the OECD corpus. We train several models with different vector sizes and context window sizes.



### 1. Replace curated ngrams with single tokens in docs

Some terms span more than one word but need a single vector. A curated list of these phrases is given in `ngram_replacements.json`. In this step we replace each occurrence of such a phrase in the documents with a single token.

In [1]:
import os
from pathlib import Path

from preprocessing import replace_ngrams_with_unigrams_curated_phrases, preprocess_word2vec

# models live in a "models" directory alongside the current working directory
path = Path(os.getcwd())
models_dir = os.path.join(path.parents[0], "models")

# replace curated ngrams with single tokens; the result maps each document key to its text
corpus = replace_ngrams_with_unigrams_curated_phrases()

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/kodymoodley/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
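
To make the replacement step concrete, here is a minimal sketch of how curated phrases could be collapsed into single tokens. It is not the implementation in `preprocessing`: the function `replace_phrases` and the phrase-to-token mapping are hypothetical, and the real `ngram_replacements.json` may be structured differently. The underscore-joined tokens mirror those that appear in the model outputs later in this notebook.

```python
import re

# Hypothetical mapping from lowercase phrase to single token.
# The actual contents and layout of ngram_replacements.json are not shown here.
replacements = {
    "repayable finance": "repayable_finance",
    "macroeconomic growth": "macroeconomic_growth",
    "development bank of the philippines": "development_bank_of_the_philippines",
}

def replace_phrases(text, mapping):
    """Replace each curated phrase in `text` with its single-token form."""
    if not mapping:
        return text
    # Try longer phrases first so a short phrase never pre-empts a longer one.
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    # Keys are lowercase, so normalise the matched text before the lookup.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

example = "The Development Bank of the Philippines offers repayable finance."
print(replace_phrases(example, replacements))
```

Matching is case-insensitive here, which is an assumption of this sketch rather than a documented property of the curated list.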


### 2. Preprocess text for word2vec algorithm

Prepare the text for training the embeddings. This step applies further preprocessing: removing stopwords and punctuation, and lemmatizing the tokens.

In [2]:
# combine all docs into one string, ending each doc with '. '
corpus_as_str = ''.join(corpus[key] + '. ' for key in corpus)

# preprocess for word2vec
processed_corpus = preprocess_word2vec(corpus_as_str, custom_stopwords=None)
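
gensim's `Word2Vec` expects `sentences` to be an iterable of token lists. The sketch below shows roughly what a preprocessing step of this kind could produce. It is an illustration only: `simple_preprocess` and the tiny stopword set are hypothetical, lemmatization is left out to keep the example dependency-free, and the real `preprocess_word2vec` may behave differently.

```python
import re

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def simple_preprocess(text, stopwords=STOPWORDS):
    """Split text into sentences and return one list of cleaned tokens per sentence."""
    processed = []
    for sentence in re.split(r"[.!?]+", text):
        # Keep letters, digits, underscores (curated phrases) and inner hyphens;
        # all other punctuation is dropped.
        tokens = re.findall(r"[a-z0-9_]+(?:-[a-z0-9_]+)*", sentence.lower())
        tokens = [t for t in tokens if t not in stopwords]
        if tokens:
            processed.append(tokens)
    return processed

sample = "Blended finance is a tool for de-risk investment. The repayable_finance grows."
print(simple_preprocess(sample))
```

Splitting on full stops is a simplification (it also splits decimal numbers), so this should be read as a shape example rather than a replacement for the real preprocessing.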

### 3. Train the embedding models

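Train skip-gram models (`sg=1`) with different vector sizes, window sizes and numbers of epochs. Model names encode the vector size, then the window size, then the number of epochs when it differs from 25 (for example, `model_200_40_30` uses 200 dimensions, a window of 40 and 30 epochs).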
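
To illustrate what the window size controls, the following sketch lists the (centre, context) pairs that a fixed symmetric window would produce for one sentence. The function `skipgram_pairs` and the example sentence are hypothetical, and gensim's actual training involves further details (such as negative sampling and window sampling) that this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Yield (centre, context) pairs for a fixed symmetric context window."""
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, tokens[j]

sentence = ["blended", "finance", "mobilise", "private", "investment"]
pairs_w1 = list(skipgram_pairs(sentence, window=1))
pairs_w2 = list(skipgram_pairs(sentence, window=2))

# A larger window pairs each word with more distant neighbours.
print(len(pairs_w1), len(pairs_w2))
```

With the large windows used in this notebook (10 to 40), most words in a sentence can end up as context for each other, which tends to favour topical over syntactic similarity.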

In [3]:
import gensim.models
model_200_10 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=200, window=10, workers=4, min_count=2, epochs=25)

In [4]:
model_200_20 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=200, window=20, workers=4, min_count=2, epochs=25)

In [5]:
model_200_30 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=200, window=30, workers=4, min_count=2, epochs=25)

In [6]:
model_100_10 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=10, workers=4, min_count=2, epochs=25)

In [7]:
model_100_20 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=20, workers=4, min_count=2, epochs=25)

In [8]:
model_200_40_30 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=200, window=40, workers=4, min_count=2, epochs=30)

In [9]:
model_100_30 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=30, workers=4, min_count=2, epochs=25)

In [10]:
model_100_40_30 = gensim.models.Word2Vec(sentences=processed_corpus, sg=1, vector_size=100, window=40, workers=4, min_count=2, epochs=30)

### 4. Save the embedding models

Save each model to the `models` directory.

In [12]:
# make sure the output directory exists before saving
os.makedirs(models_dir, exist_ok=True)

filepath_200_10 = os.path.join(models_dir, 'gensim-oecd-word2vec-200-10.model')
filepath_200_20 = os.path.join(models_dir, 'gensim-oecd-word2vec-200-20.model')
filepath_200_30 = os.path.join(models_dir, 'gensim-oecd-word2vec-200-30.model')
filepath_100_10 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-10.model')
filepath_100_20 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-20.model')
filepath_100_30 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-30.model')
filepath_200_40_30 = os.path.join(models_dir, 'gensim-oecd-word2vec-200-40-30.model')
filepath_100_40_30 = os.path.join(models_dir, 'gensim-oecd-word2vec-100-40-30.model')

model_200_10.save(filepath_200_10)
model_200_20.save(filepath_200_20)
model_200_30.save(filepath_200_30)
model_100_10.save(filepath_100_10)
model_100_20.save(filepath_100_20)
model_100_30.save(filepath_100_30)
model_200_40_30.save(filepath_200_40_30)
model_100_40_30.save(filepath_100_40_30)
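
The training and saving cells above repeat the same call once per model. As an alternative, the configurations could be kept in one list and looped over. The sketch below is one possible arrangement, not the notebook's actual code: `CONFIGS`, `model_filename` and `train_all` are hypothetical names, while the `Word2Vec` arguments and file name pattern are copied from the cells above.

```python
import os

# (vector_size, window, epochs) for the eight models trained in this notebook.
CONFIGS = [
    (200, 10, 25), (200, 20, 25), (200, 30, 25), (200, 40, 30),
    (100, 10, 25), (100, 20, 25), (100, 30, 25), (100, 40, 30),
]
DEFAULT_EPOCHS = 25

def model_filename(vector_size, window, epochs):
    """Build a file name; the epoch count is appended only when it is not 25."""
    suffix = "" if epochs == DEFAULT_EPOCHS else f"-{epochs}"
    return f"gensim-oecd-word2vec-{vector_size}-{window}{suffix}.model"

def train_all(sentences, models_dir):
    """Train and save one skip-gram model per configuration; return name -> model."""
    # Imported lazily so the helpers above can be used without gensim installed.
    from gensim.models import Word2Vec

    os.makedirs(models_dir, exist_ok=True)
    models = {}
    for vector_size, window, epochs in CONFIGS:
        model = Word2Vec(sentences=sentences, sg=1, vector_size=vector_size,
                         window=window, workers=4, min_count=2, epochs=epochs)
        name = model_filename(vector_size, window, epochs)
        model.save(os.path.join(models_dir, name))
        models[name] = model
    return models

for config in CONFIGS:
    print(model_filename(*config))
```

Calling `train_all(processed_corpus, models_dir)` would then train and save all eight models in one pass, at the cost of losing the per-cell checkpoints that separate cells provide.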

### 5. Test the models

Explore the quality of the embeddings by inspecting the nearest neighbours of selected words.

In [13]:
# vec_water = model.wv['water']

In [14]:
# print(vec_water)

In [15]:
# model.wv.most_similar(positive=['adaptation'], topn=10)

In [16]:
# for index, word in enumerate(model.wv.index_to_key):
#     if index == 10:
#         break
#     print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")

In [17]:
a_testmodel = gensim.models.Word2Vec.load(filepath_100_10)
b_testmodel = gensim.models.Word2Vec.load(filepath_200_10)

In [18]:
a_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.7322310209274292),
 ('leverage', 0.7108215093612671),
 ('non-concessional', 0.7107137441635132),
 ('mezzanine', 0.6873796582221985),
 ('blended', 0.6777296662330627),
 ('repayable_finance', 0.6750354170799255),
 ('crowd-in', 0.6733455061912537),
 ('infrastruc-ture', 0.6724364161491394),
 ('concessional', 0.6682285666465759),
 ('concessionary', 0.6642774343490601),
 ('commercial', 0.6617627143859863),
 ('repayable', 0.6491576433181763),
 ('private_infrastructure_development_group', 0.646681010723114),
 ('investment', 0.6420419812202454),
 ('mobilising', 0.6417251229286194),
 ('repay-ment', 0.6297632455825806),
 ('risk-management', 0.625847339630127),
 ('risk-mitigation', 0.6249263286590576),
 ('delmon', 0.6246047616004944),
 ('sswsp', 0.6230606436729431)]
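
The scores returned by `most_similar` are cosine similarities between word vectors. As a reminder of what that number measures, here is a small dependency-free sketch. The helper `cosine_similarity` and the toy vectors are illustrative only and are not taken from the trained models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: the second is a scaled copy of the first, the third is orthogonal to it.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because the measure ignores vector length, scores from models with different vector sizes are not directly comparable in absolute terms; the ranking of neighbours is usually the more informative signal.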

In [19]:
b_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.652104377746582),
 ('non-concessional', 0.5843335390090942),
 ('mezzanine', 0.5741749405860901),
 ('blend', 0.5659187436103821),
 ('blended', 0.5621514916419983),
 ('leverage', 0.561673104763031),
 ('crowd-in', 0.5472071170806885),
 ('repayable', 0.5443536639213562),
 ('investment', 0.5376092195510864),
 ('mobilize', 0.5322173833847046),
 ('repay-ment', 0.5277232527732849),
 ('organi-sations', 0.5272273421287537),
 ('repaid', 0.5250483155250549),
 ('de-risk', 0.5231276750564575),
 ('sswsp', 0.5227317214012146),
 ('non-earmarked', 0.5218403339385986),
 ('infrastruc-ture', 0.5181378722190857),
 ('concessionary', 0.5178442597389221),
 ('commercial', 0.5167564749717712),
 ('delmon', 0.5132759809494019)]

In [20]:
c_testmodel = gensim.models.Word2Vec.load(filepath_100_20)
d_testmodel = gensim.models.Word2Vec.load(filepath_200_20)

In [21]:
c_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.7665867209434509),
 ('commercial', 0.7176423668861389),
 ('mezzanine', 0.7036367654800415),
 ('blended', 0.6958461999893188),
 ('investment', 0.6906468272209167),
 ('non-concessional', 0.6781517267227173),
 ('blend', 0.6725146174430847),
 ('concessional', 0.6671544313430786),
 ('repayable', 0.6667763590812683),
 ('attract', 0.665846586227417),
 ('leverage', 0.6639302372932434),
 ('organi-sations', 0.6467216610908508),
 ('debt', 0.6463366150856018),
 ('mobilising', 0.6432472467422485),
 ('vipa', 0.6408852338790894),
 ('concessionary', 0.6321372985839844),
 ('risk-management', 0.6316962242126465),
 ('repay-ment', 0.6298185586929321),
 ('loan', 0.6263399720191956),
 ('repayable_finance', 0.6248188614845276)]

In [22]:
d_testmodel.wv.most_similar(positive=['finance'], topn=20)

[('financing', 0.6960784196853638),
 ('investment', 0.6266279816627502),
 ('commercial', 0.6079007387161255),
 ('mezzanine', 0.5823829770088196),
 ('blend', 0.58002769947052),
 ('blended', 0.5657579302787781),
 ('leverage', 0.5431210398674011),
 ('tri', 0.5420008897781372),
 ('repayable', 0.5408057570457458),
 ('non-concessional', 0.5402007102966309),
 ('concessional', 0.5383909344673157),
 ('concessionary', 0.5347957611083984),
 ('karana', 0.5289083123207092),
 ('repayable_finance', 0.5284245610237122),
 ('hita', 0.5270977020263672),
 ('investor', 0.5215573906898499),
 ('mobilise', 0.5163821578025818),
 ('risk-mitigation', 0.5159808397293091),
 ('de-risk', 0.5150316953659058),
 ('development_bank_of_the_philippines', 0.5147721767425537)]
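
Comparing neighbour lists by eye becomes tedious as the number of models grows. One possible way to quantify agreement between two models is the Jaccard overlap of their neighbour sets, sketched below. The helper `neighbour_overlap` is a hypothetical addition; the example data are the first five neighbours of 'finance' from the 100- and 200-dimensional window-10 models above, with scores rounded to four decimals.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between the words of two most_similar-style result lists."""
    words_a = {word for word, _ in neighbours_a}
    words_b = {word for word, _ in neighbours_b}
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

top5_100_10 = [('financing', 0.7322), ('leverage', 0.7108), ('non-concessional', 0.7107),
               ('mezzanine', 0.6874), ('blended', 0.6777)]
top5_200_10 = [('financing', 0.6521), ('non-concessional', 0.5843), ('mezzanine', 0.5742),
               ('blend', 0.5659), ('blended', 0.5622)]

# Four shared words out of six distinct words overall.
print(round(neighbour_overlap(top5_100_10, top5_200_10), 3))
```

The same function accepts the full output of `most_similar`, for example `neighbour_overlap(a_testmodel.wv.most_similar(positive=['finance'], topn=20), b_testmodel.wv.most_similar(positive=['finance'], topn=20))`.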

In [23]:
e_testmodel = gensim.models.Word2Vec.load(filepath_100_30)
f_testmodel = gensim.models.Word2Vec.load(filepath_200_30)

In [25]:
e_testmodel.wv.most_similar(positive=['rockefeller'], topn=20)

[('foundation/arup', 0.7402074337005615),
 ('centennial', 0.6189465522766113),
 ('kassel', 0.587670087814331),
 ('foundation_for_applied_water_research', 0.5441530346870422),
 ('gauvin', 0.5439850687980652),
 ('kase', 0.5407016277313232),
 ('grantham', 0.5347704291343689),
 ('mode-of-action-based', 0.5277999043464661),
 ('climate-resilient', 0.527142345905304),
 ('janssenal', 0.5230787396430969),
 ('the_smart_water_grid', 0.5219584703445435),
 ('craft', 0.5201272964477539),
 ('district_of_colombia', 0.518730878829956),
 ('simoni', 0.5153480172157288),
 ('oost', 0.5131468176841736),
 ('battery', 0.5124571323394775),
 ('letzel', 0.5098364353179932),
 ('surrogate', 0.5080912709236145),
 ('macroeconomic_growth', 0.5072043538093567),
 ('hilary_delage', 0.5061073899269104)]

In [27]:
f_testmodel.wv.most_similar(positive=['rockefeller'], topn=20)

[('foundation/arup', 0.6886305809020996),
 ('centennial', 0.6167237162590027),
 ('meeting/workshop', 0.46759840846061707),
 ('cri', 0.4624117314815521),
 ('circle_of_blue', 0.454412579536438),
 ('westhoek', 0.45238032937049866),
 ('kassel', 0.4514877498149872),
 ('iiasa', 0.45003455877304077),
 ('duhon', 0.44657039642333984),
 ('intergovernmental_panel_on', 0.4455540180206299),
 ('chartered', 0.44546106457710266),
 ('hargrove', 0.44524097442626953),
 ('macroeconomic_growth', 0.44167494773864746),
 ('kase', 0.4379238486289978),
 ('roy', 0.43525102734565735),
 ('srex', 0.4349420666694641),
 ('deltares_science_institute', 0.4344605803489685),
 ('siegel', 0.4343560039997101),
 ('non-targeted', 0.43401914834976196),
 ('foundation', 0.4324767589569092)]

In [28]:
g_testmodel = gensim.models.Word2Vec.load(filepath_200_40_30)
h_testmodel = gensim.models.Word2Vec.load(filepath_100_40_30)

In [None]:
g_testmodel.wv.most_similar(positive=['rockefeller'], topn=20)

In [None]:
h_testmodel.wv.most_similar(positive=['inei'], topn=20)