## GloVe implementation with Python (+glove-python)
- Note: This code is written in Python 3.6.1 (+Glove)
- glove-python: https://github.com/maciejkula/glove-python

### How to install glove-python (see https://github.com/maciejkula/glove-python/issues/42)
- git clone https://github.com/maciejkula/glove-python.git

- Go to the cloned directory, open setup.py, and remove 'stdc++' from the libraries=[] parameter. After the change, the extension definition should look like this:

<br> Extension("glove.corpus_cython", [glove_corpus],
<br>language='C++',
<br> libraries=[],
<br> extra_link_args=compile_args,
<br> extra_compile_args=compile_args)]

- conda install cython

- Open a command prompt in the directory that contains setup.py and run the command below
    <br>python setup.py install

In [50]:
import re
import numpy as np

from glove import Corpus, Glove
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial

### Import training dataset
- Import Shakespeare's Hamlet corpus from nltk library

In [11]:
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list

In [12]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']


### Preprocess data
- Use the re module to preprocess the data
- Convert all letters to lowercase
- Remove punctuation, numbers, and other non-alphabetic tokens

In [13]:
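for i in range(len(sentences)):
    # keep only purely alphabetic tokens, lowercased (fullmatch checks the whole token, not just its start)
    sentences[i] = [word.lower() for word in sentences[i] if re.fullmatch('[a-zA-Z]+', word)]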

In [14]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
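
The filtering step can be tried in isolation on the first sentence of the corpus. The sketch below is only an illustration: `clean_sentence` and `ALPHA` are hypothetical names that do not appear in the notebook. It also contrasts `re.match`, which only checks the start of a token, with `re.fullmatch`, which checks the whole token.

```python
import re

# Hypothetical helper (not part of the notebook): lowercase each token
# and keep only tokens made up entirely of ASCII letters.
ALPHA = re.compile(r"[a-zA-Z]+")

def clean_sentence(tokens):
    return [tok.lower() for tok in tokens if ALPHA.fullmatch(tok)]

raw = ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by',
       'William', 'Shakespeare', '1599', ']']
print(clean_sentence(raw))

# re.match only anchors at the start of the string, so a token such as
# "o'er" passes it, while fullmatch rejects the same token.
print(bool(re.match('^[a-zA-Z]+', "o'er")), bool(ALPHA.fullmatch("o'er")))
```

For tokens produced by the NLTK Gutenberg reader the two checks should rarely disagree, since punctuation is normally split into separate tokens, but `fullmatch` states the intent more directly.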


### Create Corpus instance
- Fit the tokenized sentences to a Corpus instance, which builds the word-id dictionary and the co-occurrence matrix
- Recall that GloVe combines global, count-based matrix factorization with local context-window methods

In [15]:
corpus = Corpus()

In [16]:
corpus.fit(sentences, window = 3)    # window: maximum distance (in tokens) between a word and its context words
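
To make the idea of a windowed co-occurrence matrix concrete, here is a minimal pure-Python sketch. It is not glove-python's implementation: the function name `cooccurrence`, the dict-based sparse storage, and the choice to skip pairs of identical words are assumptions made for illustration. The 1/distance weighting follows the scheme described in the GloVe paper.

```python
from collections import defaultdict

def cooccurrence(sentences, window=3):
    """Count word pairs that occur within `window` tokens of each other.

    Illustrative sketch only. Each pair contributes 1/distance, so
    adjacent words weigh more than words at the edge of the window.
    """
    word_to_id = {}
    counts = defaultdict(float)
    for sentence in sentences:
        # Assign ids in order of first appearance.
        ids = [word_to_id.setdefault(w, len(word_to_id)) for w in sentence]
        for pos, left in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                right = ids[pos + offset]
                if left == right:
                    continue  # assumption: ignore a word co-occurring with itself
                # Store each unordered pair once (smaller id first).
                pair = (min(left, right), max(left, right))
                counts[pair] += 1.0 / offset
    return word_to_id, dict(counts)

toy = [['to', 'be', 'or', 'not', 'to', 'be']]
vocab, counts = cooccurrence(toy, window=2)
print(vocab)
print(counts)
```

Storing only one triangle of the matrix keeps the structure small; a symmetric lookup can always be recovered by sorting the two ids before indexing.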

In [17]:
glove = Glove(no_components = 100, learning_rate = 0.05)

### Train model
- The GloVe model is trained on the corpus co-occurrence matrix (global word statistics)
- Key parameter description
    - **matrix**: co-occurrence matrix of the corpus
    - **epochs**: number of epochs (i.e., full training passes over the matrix)
    - **no_threads**: number of training threads
    - **verbose**: whether to print out the progress messages
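
What one training update does can be sketched for a single co-occurrence entry. The block below is a simplified illustration, not glove-python's Cython training loop: it uses plain SGD, a 0.5 factor in the loss so the gradient has no stray 2, and made-up initial vectors. The names `weight` and `sgd_step` are hypothetical; `max_count=100` and `alpha=0.75` are the values suggested in the GloVe paper.

```python
import math
import random

def weight(count, max_count=100.0, alpha=0.75):
    """GloVe weighting f(x): damp rare pairs, cap frequent ones at 1."""
    return (count / max_count) ** alpha if count < max_count else 1.0

def sgd_step(w_i, w_j, b_i, b_j, count, learning_rate=0.05):
    """One plain SGD update for a single pair (illustrative only).

    Loss: 0.5 * f(count) * (w_i . w_j + b_i + b_j - log(count))^2
    Updates w_i and w_j in place and returns (loss, new_b_i, new_b_j).
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(count)
    f = weight(count)
    loss = 0.5 * f * error ** 2
    grad = f * error
    for k in range(len(w_i)):
        g_i = grad * w_j[k]   # gradient with respect to w_i[k]
        g_j = grad * w_i[k]   # gradient with respect to w_j[k]
        w_i[k] -= learning_rate * g_i
        w_j[k] -= learning_rate * g_j
    return loss, b_i - learning_rate * grad, b_j - learning_rate * grad

random.seed(0)
w_i = [random.uniform(-0.1, 0.1) for _ in range(5)]   # toy word vector
w_j = [random.uniform(-0.1, 0.1) for _ in range(5)]   # toy context vector
b_i = b_j = 0.0

losses = []
for epoch in range(30):
    loss, b_i, b_j = sgd_step(w_i, w_j, b_i, b_j, count=20.0)
    losses.append(loss)

print(losses[0], losses[-1])   # the loss should shrink over the 30 updates
```

A real run repeats this update over every non-zero entry of the matrix in each epoch, which is why the number of epochs and threads dominates training time.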

In [27]:
from multiprocessing import cpu_count    # public API; avoids creating a Pool just to count cores
glove.fit(matrix = corpus.matrix, epochs = 30, no_threads = cpu_count(), verbose = True)

Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29


In [24]:
glove.add_dictionary(corpus.dictionary)    #  supply a word-id dictionary to allow similarity queries

### Save and load model
- The trained GloVe model can be saved to and loaded from local disk
- Loading a saved model avoids retraining from scratch

In [32]:
glove.save('glove_model')

In [40]:
glove = Glove.load('glove_model')    # load is a class method that returns a new Glove instance
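
In glove-python, `Glove.load` is a class method that builds and returns a new model, so its return value has to be kept. The pattern can be shown with a small stand-in class; `TinyModel` is hypothetical and only mimics a pickle-based save/load pair, it is not the library's code.

```python
import os
import pickle
import tempfile

class TinyModel:
    """Hypothetical stand-in for a model with pickle-based save/load."""

    def __init__(self, vectors=None):
        self.vectors = vectors

    def save(self, path):
        with open(path, 'wb') as handle:
            pickle.dump(self.__dict__, handle)

    @classmethod
    def load(cls, path):
        # Builds a NEW instance; the object it is called on is untouched.
        instance = cls()
        with open(path, 'rb') as handle:
            instance.__dict__ = pickle.load(handle)
        return instance

path = os.path.join(tempfile.mkdtemp(), 'tiny_model')
TinyModel(vectors=[[0.1, 0.2]]).save(path)

fresh = TinyModel()
fresh.load(path)                  # return value discarded: `fresh` is unchanged
print(fresh.vectors)              # None

restored = TinyModel.load(path)   # keep the returned instance
print(restored.vectors)           # [[0.1, 0.2]]
```

Calling a class method through an instance is legal Python, which is why discarding the result fails silently instead of raising an error.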

### Similarity calculation
- Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity
- For other metrics and comparisons between them, refer to: https://github.com/taki0112/Vector_Similarity

In [26]:
glove.most_similar('king', number = 10)    # the query word itself is dropped, so 9 neighbours are returned

[('queene', 0.99407553073822963),
 ('matter', 0.99349224230824584),
 ('players', 0.98878981880933492),
 ('the', 0.98819079663149711),
 ('world', 0.98768057684646038),
 ('against', 0.98706467981587631),
 ('winde', 0.98687851286064199),
 ('drinke', 0.98627319315331974),
 ('very', 0.98547774026678192)]
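
A nearest-neighbour query of this kind can be sketched as ranking every other word by cosine similarity to the query vector. The code below is an illustration under stated assumptions: `cosine` and `nearest` are hypothetical helpers, and the three-dimensional vectors are made up for the example, not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, number=3):
    """Rank all other words by cosine similarity to `word` (sketch only)."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]

# Made-up toy embeddings, for illustration only.
toy_vectors = {
    'king':   [0.9, 0.1, 0.0],
    'queene': [0.8, 0.2, 0.1],
    'winde':  [0.0, 0.1, 0.9],
    'the':    [0.5, 0.5, 0.5],
}

for neighbour, score in nearest('king', toy_vectors):
    print(neighbour, round(score, 4))
```

This brute-force scan is linear in the vocabulary size, which is fine for a corpus as small as Hamlet.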

In [47]:
# define a function that converts a word into its embedding vector
def vector_converter(word):
    idx = glove.dictionary[word]
    return glove.word_vectors[idx]

In [49]:
# define a function that computes cosine similarity between two words
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

In [52]:
v1 = vector_converter('king')
v2 = vector_converter('queen')    # note: 'queen' and 'queene' are separate tokens in this corpus

In [53]:
cosine_similarity(v1, v2)

0.30658440396162456