# Natural Language Processing Assignment 1: word2vec

### Evan James Heetderks / 高修杰 / 2019280025

##### Disclaimer: Some source code was taken from https://github.com/Adoni/word2vec_pytorch

## PART I: Gradients Calculation

Please refer to "2019280025_Word2Vec\Report\Gradients_Calculation.pdf" for my gradient derivations.

## PART II: Word2vec Implementation

I first downloaded a 100-megabyte Wikipedia dump file, "enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2", from https://dumps.wikimedia.org/enwiki/latest/. It is not included in my deliverables folder because of its size.

I then implemented word2vec by combining existing code (see the disclaimer above) with my own methods for generating a Wikipedia corpus that excludes selected stopwords and words that appear fewer than a minimum number of times (`min_count`). We import the top-level source file here:

In [1]:
from word2vec import Word2Vec
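
The corpus filtering described above (dropping stopwords, then dropping rare words) can be sketched roughly as follows. This is a minimal illustration and not the code in `word2vec.py`: the `filter_corpus` function, the whitespace tokenizer, and the stopword list are all hypothetical, since the report does not say which stopwords were selected.

```python
from collections import Counter

# Hypothetical stopword list; the actual selection used in the report is not stated.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def filter_corpus(sentences, stopwords=STOPWORDS, min_count=3):
    """Drop stopwords, then drop words seen fewer than min_count times."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [[w for w in s.lower().split() if w not in stopwords]
                 for s in sentences]
    # Count the remaining words over the whole corpus.
    counts = Counter(w for sent in tokenized for w in sent)
    filtered = [[w for w in sent if counts[w] >= min_count]
                for sent in tokenized]
    # Discard sentences that became empty after filtering.
    return [sent for sent in filtered if sent]


sentences = [
    "the cat sat in the garden",
    "a cat and a dog",
    "the dog chased the cat",
    "a bird is in the garden",
]
print(filter_corpus(sentences, min_count=2))
```

Raising `min_count` shrinks the vocabulary, which reduces the size of the embedding tables at the cost of discarding rare words entirely.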

### Training a model with 100 embedding dimensions

We now initialize the first model (100 embedding dimensions) using the Wikipedia dump file above. All other parameters are kept the same for each model in Part II.

In [3]:
Model_1 = Word2Vec(wikidump_filename="enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2",
                   output_text_filename="corpus1.txt",
                   emb_dimension=100,
                   batch_size=50,
                   window_size=8,
                   iteration=1,
                   initial_lr=0.025,
                   min_count=3)

Corpus is ready...
Word Count: 41283
Sentence Length: 2172215


We now train the 1st model (100 embedding dimensions) here:

In [4]:
Model_1.train()

Loss: 133.50167847, lr: 0.000063: 100%|█████████████████████████████████████| 645635/645635 [1:11:14<00:00, 151.04it/s]


We now calculate the Spearman coefficient of the 1st model (100 embedding dimensions) on the WordSim-353 word pairs in "combined.csv" here:

In [5]:
spearman_coefficient = Model_1.wordsim353_spearman("combined.csv")
print('Spearman coefficient for 100 dimensions: ', spearman_coefficient)

Spearman coefficient for 100 dimensions:  0.2806071138329367
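
For readers unfamiliar with this evaluation, the sketch below shows one common way such a score is computed: take the cosine similarity of each word pair's embeddings, then compute the Spearman rank correlation between those similarities and the human ratings. It is an assumption about what `wordsim353_spearman` does, not its actual implementation; the helper names, the toy embeddings, and the toy ratings are all made up for illustration, and skipping out-of-vocabulary pairs is likewise an assumed policy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, scored_pairs):
    """Correlate human scores with model similarities, skipping unknown words."""
    human, model = [], []
    for w1, w2, score in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            human.append(score)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(human, model)


# Toy 2-dimensional embeddings and ratings, invented for this example.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
                  "car": [0.0, 1.0], "truck": [0.3, 0.9]}
toy_pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0),
             ("dog", "truck", 2.0), ("cat", "truck", 3.0), ("dog", "zebra", 5.0)]
print(evaluate(toy_embeddings, toy_pairs))
```

Because the coefficient depends only on rank order, it rewards a model for ordering word pairs the way human raters do, regardless of the absolute scale of its similarity values.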


### Training a model with 200 embedding dimensions

We now initialize the second model (200 embedding dimensions) using the Wikipedia dump file described at the beginning of this document.

In [6]:
Model_2 = Word2Vec(wikidump_filename="enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2",
                   output_text_filename="corpus2.txt",
                   emb_dimension=200,
                   batch_size=50,
                   window_size=8,
                   iteration=1,
                   initial_lr=0.025,
                   min_count=3)

Corpus is ready...
Word Count: 41283
Sentence Length: 2172215


We now train the 2nd model (200 embedding dimensions) here:

In [7]:
Model_2.train()

Loss: 131.42134094, lr: 0.000063: 100%|█████████████████████████████████████| 645635/645635 [1:03:23<00:00, 169.73it/s]


We now calculate the Spearman coefficient of the 2nd model (200 embedding dimensions) here:

In [11]:
spearman_coefficient = Model_2.wordsim353_spearman("combined.csv")
print('Spearman coefficient for 200 dimensions: ', spearman_coefficient)

Spearman coefficient for 200 dimensions:  0.2810809339989552


### Training a model with 300 embedding dimensions

We now initialize the third model (300 embedding dimensions) using the Wikipedia dump file described at the beginning of this document.

In [9]:
Model_3 = Word2Vec(wikidump_filename="enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2",
                   output_text_filename="corpus3.txt",
                   emb_dimension=300,
                   batch_size=50,
                   window_size=8,
                   iteration=1,
                   initial_lr=0.025,
                   min_count=3)

Corpus is ready...
Word Count: 41283
Sentence Length: 2172215


We now train the 3rd model (300 embedding dimensions) here:

In [10]:
Model_3.train()

Loss: 136.37429810, lr: 0.000063: 100%|███████████████████████████████████████| 645635/645635 [59:26<00:00, 181.01it/s]


We now calculate the Spearman coefficient of the 3rd model (300 embedding dimensions) here:

In [12]:
spearman_coefficient = Model_3.wordsim353_spearman("combined.csv")
print('Spearman coefficient for 300 dimensions: ', spearman_coefficient)

Spearman coefficient for 300 dimensions:  0.28626610843443434


### Conclusion:
#### Spearman coefficient (100 dimensions): 0.2806071138329367
#### Spearman coefficient (200 dimensions): 0.2810809339989552
#### Spearman coefficient (300 dimensions): 0.28626610843443434

#### These results suggest that the model performs slightly better with more embedding dimensions, although the gain is small (about 0.006 from 100 to 300 dimensions)

## PART III: Word2vec Improvement

To further improve the model, the window size was increased from 8 to 10. Each target word is then paired with more context words, which gives the model more training examples per word and should make the learned embeddings more reliable. The embedding dimension stays at 300, the best-performing setting in Part II.
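
To illustrate why a larger window produces more training examples, the sketch below enumerates (target, context) pairs for a symmetric window. It is only an illustration: the `context_pairs` function is hypothetical, and whether the actual implementation uses a fixed symmetric window like this one is an assumption.

```python
def context_pairs(tokens, window_size):
    """Return (target, context) pairs with up to window_size words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the start and end of the token sequence.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# A toy sequence of 30 distinct placeholder tokens.
tokens = [f"w{i}" for i in range(30)]
for window_size in (8, 10):
    print(window_size, len(context_pairs(tokens, window_size)))
```

With the batch size held fixed, more pairs means more batches per pass over the corpus, which is consistent with the larger step count shown in the progress bar for the window-10 run (816230 versus 645635 for window 8).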

In [13]:
Model_4 = Word2Vec(wikidump_filename="enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2",
                   output_text_filename="corpus4.txt",
                   emb_dimension=300,
                   batch_size=50,
                   window_size=10,
                   iteration=1,
                   initial_lr=0.025,
                   min_count=3)

Corpus is ready...
Word Count: 41283
Sentence Length: 2172215


We now train the improved model (300 embedding dimensions, window size 10) here:

In [14]:
Model_4.train()

Loss: 116.94097900, lr: 0.000007: 100%|█████████████████████████████████████| 816230/816230 [1:13:32<00:00, 184.99it/s]


We now calculate the Spearman coefficient of the improved model (300 embedding dimensions, window size 10) here:

In [15]:
spearman_coefficient = Model_4.wordsim353_spearman("combined.csv")
print('Spearman coefficient for 300 dimensions, window size 10: ', spearman_coefficient)

Spearman coefficient for 300 dimensions, window size 10:  0.31037887473152614


### Conclusion:
#### Spearman coefficient (300 dimensions, window size 10): 0.31037887473152614

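#### At 300 embedding dimensions, increasing the window size from 8 to 10 raises the Spearman coefficient from about 0.286 to about 0.310, so the model performs better with the larger window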