# Gensim Word2Vec Tutorial
- based on https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
- based on https://rare-technologies.com/word2vec-tutorial/

## File Loading

In [1]:
import gensim, logging
import multiprocessing
import os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Memory Efficient File Loading for Word2Vec Training

If our input is spread across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list we can process the input file by file, line by line:

In [None]:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream one tokenized sentence at a time; files are closed after reading.
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()


In [None]:
sentences = MySentences('/some/directory')
model = gensim.models.Word2Vec(sentences)  # simple Word2Vec training with default parameters

Or, for a single file:

In [2]:
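class SentenceReader:

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.filepath) as f:
            for line in f:
                # Note: split(' ') keeps empty strings and does not strip the trailing newline.
                yield line.split(' ')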

In [3]:
sentences = SentenceReader('NewsPaperEmbedding.txt')
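Both reader classes above are iterables that re-open their input each time `__iter__` is called, which matters because the corpus is read more than once (once to build the vocabulary, then once per training epoch). A plain generator would be exhausted after the first pass. The sketch below illustrates the difference on a throwaway temporary file; `RestartableReader` and `one_shot` are hypothetical names used only for this illustration.

```python
import os
import tempfile


class RestartableReader:
    """Hypothetical stand-in for SentenceReader: re-opens the file on every pass."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def one_shot(filepath):
    """A plain generator function: usable for a single pass only."""
    with open(filepath, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny two-sentence corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("a b c\nd e\n")
    corpus_path = tmp.name

reader = RestartableReader(corpus_path)
first_pass = list(reader)
second_pass = list(reader)   # the same sentences again

gen = one_shot(corpus_path)
gen_first = list(gen)
gen_second = list(gen)       # empty: the generator is already exhausted

os.remove(corpus_path)
print(first_pass == second_pass, gen_second)
```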

## Most Frequent Words:
This is mainly a sanity check of the effectiveness of the preprocessing: lemmatization, removal of stopwords, and addition of bigrams.

In [4]:
from collections import defaultdict

word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)

248644

In [5]:
sorted(word_freq, key=word_freq.get, reverse=True)[:100]

['',
 '\n',
 '있다',
 '등',
 '수',
 '있는',
 '일',
 '고',
 '년',
 '이',
 '한다',
 '것으로',
 '월',
 '및',
 '대한',
 '위해',
 '시',
 '며',
 '했다',
 '지난',
 '한',
 '또',
 '개',
 '위한',
 '도내',
 '등을',
 '통해',
 '경우',
 '지역',
 '것',
 '말했다',
 '도',
 '함께',
 '중',
 '것이',
 '만',
 '따라',
 '특히',
 '지난해',
 '할',
 '것은',
 '전',
 '이날',
 '현재',
 '대해',
 '관계자는',
 '이라고',
 '하지만',
 '하는',
 '최근',
 '이에',
 '를',
 '이번',
 '큰',
 '올해',
 '가장',
 '명',
 '더',
 '제',
 '같은',
 '밝혔다',
 '것이다',
 '등의',
 '등이',
 '는',
 '없다',
 '~',
 '모두',
 '을',
 '그',
 '된다',
 '있어',
 '이어',
 '하고',
 '것을',
 '그러나',
 '전국',
 '없는',
 '춘천',
 '때문에',
 '될',
 '에',
 '따르면',
 '가운데',
 '이상',
 '로',
 '다양한',
 '일부',
 '후',
 '이를',
 '있도록',
 '가',
 '때문이다',
 '많은',
 '이후',
 '대',
 '내년',
 '물론',
 '간',
 '관련']
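The first two entries in the list, `''` and `'\n'`, are not words. They are most likely a side effect of tokenizing with `split(' ')`, which produces an empty string for every doubled space and leaves the newline in place. The sketch below compares `split(' ')` with the argument-free `split()` on two made-up lines (not taken from the real corpus). It also uses `collections.Counter`, whose `most_common()` method is a compact alternative to sorting a `defaultdict`.

```python
from collections import Counter

# Toy lines with a doubled space and a space before the newline (assumed shapes).
raw_lines = ["강원도  춘천 \n", "서울 춘천\n"]

# split(' ') splits on every single space and keeps whatever is in between.
naive = Counter(tok for line in raw_lines for tok in line.split(' '))

# split() collapses runs of whitespace and drops the newline.
clean = Counter(tok for line in raw_lines for tok in line.split())

print(naive.most_common())
print(clean.most_common())
```

Switching the reader to `split()` would change the counts and the trained vocabulary, so the outputs shown in this notebook correspond to the `split(' ')` version.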

# Training the model
## Gensim Word2Vec Implementation:
We use the Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [6]:
import multiprocessing

from gensim.models import Word2Vec

## Why I separate the training of the model into 3 steps:
I prefer to separate the training into 3 distinct steps for clarity and monitoring.
1. `Word2Vec()`: 
>In this first step, I set up the parameters of the model one by one. <br>I purposefully do not supply the `sentences` parameter, which leaves the model uninitialized.
2. `.build_vocab()`: 
>This step builds the vocabulary from a sequence of sentences and thus initializes the model. <br>With the log output, I can follow the progress and, more importantly, the effect of `min_count` and `sample` on the word corpus. I noticed that these two parameters, and `sample` in particular, have a great influence on the performance of a model. Displaying both makes their influence easier to manage accurately.
3. `.train()`:
>Finally, this step trains the model.<br>
The log output here is mainly useful for monitoring, making sure that no threads are executed instantaneously.

In [7]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

## The parameters:

* `min_count` <font color='purple'>=</font> <font color='green'>int</font> - Ignores all words with total absolute frequency lower than this - (2, 100)


* `window` <font color='purple'>=</font> <font color='green'>int</font> - The maximum distance between the current and predicted word within a sentence. E.g. `window` words on the left and `window` words on the right of our target - (2, 10)


* `size` <font color='purple'>=</font> <font color='green'>int</font> - Dimensionality of the feature vectors (this parameter is named `vector_size` in Gensim 4.0 and later). - (50, 300)


* `sample` <font color='purple'>=</font> <font color='green'>float</font> - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential.  - (0, 1e-5)


* `alpha` <font color='purple'>=</font> <font color='green'>float</font> - The initial learning rate - (0.01, 0.05)


* `min_alpha` <font color='purple'>=</font> <font color='green'>float</font> - Learning rate will linearly drop to `min_alpha` as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00


* `negative` <font color='purple'>=</font> <font color='green'>int</font> - If > 0, negative sampling will be used; the value specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)


* `workers` <font color='purple'>=</font> <font color='green'>int</font> - Use this many worker threads to train the model (=faster training with multicore machines)
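To make the relationship between `alpha` and `min_alpha` concrete, here is a minimal sketch of a linear learning-rate schedule using the values chosen in this notebook (`alpha=0.03`, `min_alpha=0.0007`, 30 epochs). The helper `linear_alpha` is hypothetical and only illustrates the idea of a linear drop; it is not Gensim's internal code, which updates the rate per batch rather than per epoch.

```python
def linear_alpha(progress, alpha=0.03, min_alpha=0.0007):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training is done."""
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)


epochs = 30
# One value per epoch boundary, from the start of training to the end.
schedule = [linear_alpha(epoch / epochs) for epoch in range(epochs + 1)]

for epoch in (0, 10, 20, 30):
    print(epoch, round(schedule[epoch], 5))
```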

In [8]:
w2v_model = Word2Vec(min_count=10,
                     window=5,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

## Building the Vocabulary Table:
Word2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them):

In [9]:
from time import time
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

2021-04-08 10:04:21,352 : INFO : collecting all words and their counts
2021-04-08 10:04:21,354 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-08 10:04:21,470 : INFO : PROGRESS: at sentence #10000, processed 228937 words, keeping 55660 word types
2021-04-08 10:04:21,543 : INFO : PROGRESS: at sentence #20000, processed 464881 words, keeping 92629 word types
2021-04-08 10:04:21,611 : INFO : PROGRESS: at sentence #30000, processed 704593 words, keeping 124557 word types
2021-04-08 10:04:21,679 : INFO : PROGRESS: at sentence #40000, processed 932795 words, keeping 151017 word types
2021-04-08 10:04:21,755 : INFO : PROGRESS: at sentence #50000, processed 1170276 words, keeping 175705 word types
2021-04-08 10:04:21,822 : INFO : PROGRESS: at sentence #60000, processed 1406489 words, keeping 198737 word types
2021-04-08 10:04:21,888 : INFO : PROGRESS: at sentence #70000, processed 1639260 words, keeping 220788 word types
2021-04-08 10:04:21,955 : INFO : PROGR

Time to build vocab: 0.08 mins
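The effect of `min_count` is visible in the numbers in this notebook: the earlier frequency count found 248644 distinct tokens, while the training log further down reports a vocabulary of 18990. The sketch below shows the kind of filtering involved on a toy corpus. It is a simplified illustration under my own assumptions (made-up sentences, a plain dictionary as the vocabulary), not what `build_vocab()` does internally.

```python
from collections import Counter

# Toy corpus, not from the real data.
toy_sentences = [["a", "b", "a"], ["a", "c", "b"], ["a", "d"]]
min_count = 2

# Count every token, then keep only those seen at least `min_count` times.
raw_counts = Counter(word for sentence in toy_sentences for word in sentence)
vocab = {word: n for word, n in raw_counts.items() if n >= min_count}

print(len(raw_counts), "distinct tokens ->", len(vocab), "kept:", vocab)
```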


## Training of the model:
_Parameters of the training:_
* `total_examples` <font color='purple'>=</font> <font color='green'>int</font> - Count of sentences;
* `epochs` <font color='purple'>=</font> <font color='green'>int</font> - Number of iterations (epochs) over the corpus - [10, 20, 30]

In [10]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

2021-04-08 10:04:45,427 : INFO : training model with 39 workers on 18990 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=5
2021-04-08 10:04:46,481 : INFO : EPOCH 1 - PROGRESS: at 23.97% examples, 178036 words/s, in_qsize 76, out_qsize 1
2021-04-08 10:04:47,390 : INFO : worker thread finished; awaiting finish of 38 more threads
2021-04-08 10:04:47,410 : INFO : worker thread finished; awaiting finish of 37 more threads
2021-04-08 10:04:47,413 : INFO : worker thread finished; awaiting finish of 36 more threads
2021-04-08 10:04:47,420 : INFO : worker thread finished; awaiting finish of 35 more threads
2021-04-08 10:04:47,430 : INFO : worker thread finished; awaiting finish of 34 more threads
2021-04-08 10:04:47,438 : INFO : worker thread finished; awaiting finish of 33 more threads
2021-04-08 10:04:47,444 : INFO : worker thread finished; awaiting finish of 32 more threads
2021-04-08 10:04:47,447 : INFO : worker thread finished; awaiting finish of 31 more thread

Time to train the model: 1.03 mins


As we do not plan to train the model any further, we call `init_sims()`, which makes the model much more memory-efficient:

In [11]:
w2v_model.init_sims(replace=True)

2020-10-20 13:59:47,845 : INFO : precomputing L2-norms of word weight vectors


In [12]:
w2v_model.save("NewsDataVec")

2020-10-20 13:59:55,685 : INFO : saving Word2Vec object under NewsDataVec, separately None
2020-10-20 13:59:55,687 : INFO : not storing attribute vectors_norm
2020-10-20 13:59:55,689 : INFO : not storing attribute cum_table
2020-10-20 13:59:59,309 : INFO : saved NewsDataVec


# Exploring the model
## Most similar to:



In [13]:
from gensim.models import KeyedVectors
w2v_model = KeyedVectors.load("NewsDataVec", mmap='r')

2020-10-20 14:00:00,804 : INFO : loading Word2VecKeyedVectors object from NewsDataVec
2020-10-20 14:00:01,324 : INFO : loading wv recursively from NewsDataVec.wv.* with mmap=r
2020-10-20 14:00:01,326 : INFO : setting ignored attribute vectors_norm to None
2020-10-20 14:00:01,327 : INFO : loading vocabulary recursively from NewsDataVec.vocabulary.* with mmap=r
2020-10-20 14:00:01,327 : INFO : loading trainables recursively from NewsDataVec.trainables.* with mmap=r
2020-10-20 14:00:01,328 : INFO : setting ignored attribute cum_table to None
2020-10-20 14:00:01,329 : INFO : loaded NewsDataVec


In [14]:
w2v_model.wv.most_similar(positive=["강원도"])

2020-10-20 14:00:04,616 : INFO : precomputing L2-norms of word weight vectors


[('도', 0.5024597644805908),
 ('홀대', 0.49366965889930725),
 ('강원도의', 0.4537508189678192),
 ('`강원도', 0.4424399137496948),
 ('안중에도', 0.4412866234779358),
 ('국비확보', 0.4102475047111511),
 ('소외', 0.4083894193172455),
 ('`강원도의', 0.40002626180648804),
 ('선도산업인', 0.3917698264122009),
 ('문화체육관광부', 0.39128005504608154)]

In [15]:
w2v_model.wv.most_similar(positive=["춘천"])

[('서울', 0.5327453017234802),
 ('서울고속도로', 0.5239557027816772),
 ('번가', 0.5187969207763672),
 ('춘천과', 0.5147794485092163),
 ('브라운', 0.5087863802909851),
 ('서울~춘천', 0.47780299186706543),
 ('막국수', 0.473845511674881),
 ('`춘천', 0.46909213066101074),
 ('양구와', 0.46901997923851013),
 ('삼천동', 0.46867045760154724)]

## Similarities:
Here, we check how similar two words are to each other:

In [16]:
w2v_model.wv.similarity("강원도", '도내')

0.25325537

In [17]:
w2v_model.wv.similarity('서울', '춘천')

0.5327453
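The score returned by `similarity` is the cosine similarity of the two word vectors, which is why the value for `'서울'` and `'춘천'` matches the score listed for `'서울'` in the `most_similar` output above. A minimal pure-Python sketch of the computation follows; `cosine_similarity` is a hypothetical helper and the vectors are toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```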

## Doesn't Match
Here, we ask the model which word in a list does not belong with the others:

In [18]:
w2v_model.wv.doesnt_match(['서울', '춘천', '일본'])


'일본'

In [19]:
w2v_model.wv.doesnt_match(["대한", "서울", "강원도"])


'대한'
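As far as I understand it, `doesnt_match` normalizes the vectors of the given words, averages them, and returns the word whose vector is least similar to that average. The sketch below reproduces this idea in pure Python on made-up 2-D vectors; `unit` and `odd_one_out` are hypothetical helpers, and the toy words and values are assumptions for illustration only.

```python
import math


def unit(vec):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)


# Two vectors pointing roughly the same way and one pointing elsewhere.
toy = {"city_a": [1.0, 0.1], "city_b": [0.9, 0.2], "country": [0.1, 1.0]}
print(odd_one_out(toy))
```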

## Analogy difference:
Which words are closest to `춘천` + `서울` - `일본`?

In [20]:
w2v_model.wv.most_similar(positive=["춘천", "서울"], negative=["일본"], topn=3)

[('서울고속도로', 0.5442186594009399),
 ('서울~춘천', 0.5037895441055298),
 ('민자사업자인', 0.4909842312335968)]
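The `positive` and `negative` arguments can be read as vector arithmetic: the normalized vectors of the positive words are added, those of the negative words are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result, with the query words themselves left out. Below is a minimal sketch of that idea on made-up 2-D vectors; `unit` and `analogy` are hypothetical helpers, not Gensim functions.

```python
import math


def unit(vec):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    excluded = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors: "c" is a mix of the "a" and "b" directions.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
result = analogy(toy, positive=["c"], negative=["a"], topn=2)
print(result)
```

Removing the `a` direction from `c` leaves a query that points mostly along `b`, so `b` ranks first in this toy setup.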