## GloVe introduction

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

More information can be found here: http://nlp.stanford.edu/projects/glove/

We will use word vectors that have been trained on the Twitter dataset (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d vectors). This file can be downloaded from the GloVe website. Extract the .txt file and move it to the working directory.

In [1]:
import numpy as np
import collections
from scipy import spatial

In [2]:
filename = 'glove.twitter.27B.200d.txt'

## Reading in the dataset

Let's read in the data and do some checks

In [3]:
with open(filename,'r') as f:
    lines = f.readlines()
    numWords = len(lines)
    numDimensions = len(lines[200].split(' ')[1:])
    print(numWords, numDimensions)

(1193514, 200)


It looks like there are slightly fewer words in the vocabulary than claimed (1,193,514 words vs. 1,200,000), and each word is mapped to a 200-dimensional vector. Let's have a look at what a word vector looks like.

In [4]:
# The with block in the previous cell already closed the file,
# so no explicit f.close() is needed
print(lines[200])
print(lines[201])
print(lines[202])

please 0.079204 0.38973 -0.15059 -0.010345 -0.43449 -1.0396 1.142 -0.12891 0.021345 -0.31301 0.67416 0.020708 -0.21758 -0.25822 -0.087623 -0.21197 0.19887 -0.18434 0.11543 -0.045039 -0.21852 -0.4629 -0.40147 0.88832 -0.28331 0.15793 0.43682 0.62241 0.29734 0.025521 0.04076 0.42191 -0.17571 0.38485 -0.222 -0.12087 0.53335 0.60102 -0.14619 -0.2134 0.33717 -0.46093 -0.31229 0.0040756 -0.11045 -0.26965 -0.64615 -0.66332 0.39245 0.10454 0.073493 0.54851 0.36091 -1.1031 0.25083 0.06513 0.046064 0.56705 -0.072345 -0.19426 0.17681 -0.13486 0.33334 -0.18167 0.11279 0.42252 -0.11612 -0.10706 0.1187 -0.044723 0.053748 0.064657 0.12535 0.04816 -0.29935 -0.10651 -0.29289 -0.18884 -0.4127 0.32664 -0.22715 0.67269 0.41074 -0.35499 0.38288 0.083866 -0.76714 0.29737 -0.27832 0.2076 0.015894 -0.074241 0.040225 0.46588 -0.73723 -0.18881 0.062039 0.27367 0.12206 -0.3957 -0.21934 -0.065962 -0.19748 -0.18956 0.027889 -0.126 0.037872 0.54629 0.37619 -0.26709 -0.27878 -0.12053 -0.66942 0.064615 1.2675 0.05705
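
Each line of the file is a token followed by its vector components, separated by single spaces. As a minimal sketch of that format, the hypothetical helper `parse_glove_line` below splits one line into a word and a NumPy array. The three-component sample line is made up for illustration and is not taken from the dataset.

```python
import numpy as np

def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into (word, vector). Hypothetical helper."""
    parts = line.rstrip('\n').split(' ')
    word = parts[0]
    # NumPy converts the numeric strings to floats
    vector = np.array(parts[1:], dtype=float)
    return word, vector

# Made-up sample line (real lines in the file carry 200 components)
word, vector = parse_glove_line('hello 0.1 -0.2 0.3\n')
print(word, vector.shape)
```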

## Create data structures

We will create a few data structures that will allow us to easily reference parts of the dataset later on

1. The list `wordList` will store the words indexed by the order in which the words appear in the input file

2. The matrix `wordVectorMatrix` will store the word vectors indexed by the order of `wordList`

3. The dictionary `wordVectorDictionary` will store the (word, word vector) pairs

More information about using defaultdict can be found [here](http://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work) and [here](http://stackoverflow.com/questions/19629682/ordereddict-vs-defaultdict-vs-dict)

We will limit the number of words to read in here because of the size of the dataset. The variable `readWords` will hold the indices we will read in. Let's try reading in the first 100,000 words, which takes about 3 minutes on my computer. To read in more of the dataset, or all of it, update the `readWords` variable. Note that the loop stops at the first line whose index is not in `readWords` and uses the line index as the matrix row, so `readWords` must be a range that starts at 0.

In [5]:
readWords = range(100000)

wordList = []
wordVectorMatrix = np.zeros((len(readWords),numDimensions))
wordVectorDictionary = collections.defaultdict(list)

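with open(filename,'r') as f:
    # index is both the line number in the file and the row in wordVectorMatrix,
    # so readWords must be a range that starts at 0
    index = 0
    for line in f:
        if index in readWords:
            split = line.split()
            word = split[0]
            wordList.append(word)
            # list(...) is needed on Python 3, where map returns an iterator
            listValues = list(map(float, split[1:]))
            wordVectorMatrix[index] = listValues
            wordVectorDictionary[word] = listValues
            index += 1
        else:
            # stop at the first line outside readWords
            break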
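
The loop above only ever reads a leading block of lines, so the same three structures could also be built with `itertools.islice`. The following is a sketch under that assumption, not the notebook's method: `load_first_vectors` and the tiny in-memory sample are hypothetical, and a plain `dict` is used because no default value is needed.

```python
import io
import itertools
import numpy as np

def load_first_vectors(handle, max_words, num_dimensions):
    """Read up to max_words lines of 'word v1 v2 ...' from an open text handle."""
    words = []
    matrix = np.zeros((max_words, num_dimensions))
    # islice stops after max_words lines without reading the rest of the file
    for row, line in enumerate(itertools.islice(handle, max_words)):
        parts = line.split()
        words.append(parts[0])
        matrix[row] = np.array(parts[1:], dtype=float)
    # Trim unused rows if the source had fewer lines than requested
    matrix = matrix[:len(words)]
    # Map each word to its matrix row (a view, so the vector is stored only once)
    lookup = {word: matrix[i] for i, word in enumerate(words)}
    return words, matrix, lookup

# Made-up three-line sample standing in for the GloVe file
sample = io.StringIO(u"a 1.0 2.0\nb 3.0 4.0\nc 5.0 6.0\n")
words, matrix, lookup = load_first_vectors(sample, max_words=2, num_dimensions=2)
print(words, matrix.shape)
```

With the real file, the handle would come from `open(filename, 'r')` and `num_dimensions` would be `numDimensions`.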

Let's check that we have successfully read in our data

In [6]:
print('The length of wordList is: %d, and the length of wordVectorDictionary is: %d' %(len(wordList),len(wordVectorDictionary))) 
print('The dimensions of wordVectorMatrix are: %s' %(wordVectorMatrix.shape,))

print('The %d-th word in our word list is: %s' %(len(wordList),wordList[-1]))
print('The first 5 dimensions of "%s" are: %s' %(wordList[-1], tuple(wordVectorMatrix[-1,:5])))
print('Does the vector in wordVectorDictionary for "%s" match the vector in wordVectorMatrix: %s' %(wordList[-1],all(wordVectorDictionary[wordList[-1]] == wordVectorMatrix[-1,:])))

The length of wordList is: 100000, and the length of wordVectorDictionary is: 100000
The dimensions of wordVectorMatrix are: (100000, 200)
The 100000-th word in our word list is: валентина
The first 5 dimensions of "валентина" are: (-0.13675999999999999, -0.43297999999999998, -0.47603000000000001, -0.48025000000000001, -0.46645999999999999)
Does the vector in wordVectorDictionary for "валентина" match the vector in wordVectorMatrix: True


## Nearest neighbors

The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. For example, here are the closest words to the word "please" from our `wordList` vocabulary, ranked by cosine similarity.

In [7]:
def findClosestWords(word, numWords):
    # Print the numWords words most similar to `word`, ranked by cosine similarity
    indexOfWord = wordList.index(word)
    wordVector = wordVectorMatrix[indexOfWord]
    similarityDictionary = {}
    for i in readWords:
        # Skip the query word itself
        if i == indexOfWord:
            continue
        # spatial.distance.cosine returns a distance, so 1 - distance is the similarity
        closeness = 1 - spatial.distance.cosine(wordVector, wordVectorMatrix[i,:])
        similarityDictionary[wordList[i]] = closeness
    for w in sorted(similarityDictionary, key=similarityDictionary.get, reverse=True)[:numWords]:
        print(w, similarityDictionary[w])
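
`spatial.distance.cosine` returns the cosine distance, which is why the function subtracts it from 1 to obtain a similarity. As a small check on made-up vectors (not from the dataset), the sketch below compares that value with the direct formula and also computes the Euclidean distance for the same pair.

```python
import numpy as np
from scipy import spatial

# Made-up vectors for illustration only
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.5])

# Cosine similarity via SciPy's cosine distance
cosine_scipy = 1 - spatial.distance.cosine(u, v)
# Cosine similarity from its definition: dot product over the product of norms
cosine_manual = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# Euclidean distance between the same two vectors
euclidean = spatial.distance.euclidean(u, v)
print(cosine_scipy, cosine_manual, euclidean)
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to both.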

In [8]:
# findClosestWords prints its results and returns None, so there is nothing to assign
findClosestWords('please', 10)

('pls', 0.86398293210897281)
('plz', 0.78704690310582259)
('help', 0.78122570098971478)
('pleasee', 0.77198784350021787)
('you', 0.73852868220176038)
('follow', 0.72495757262223237)
('need', 0.71108471050433231)
('guys', 0.70307052290636773)
('can', 0.70267652777623324)
('if', 0.70099452649437954)
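
The loop in `findClosestWords` calls SciPy once per vocabulary word. A possible vectorized variant, sketched here on a made-up four-word vocabulary, normalizes every row once and obtains all cosine similarities from a single matrix-vector product. `closest_words` is a hypothetical name, and the sketch assumes no row of the matrix is all zeros.

```python
import numpy as np

def closest_words(word, words, matrix, k):
    """Return the k (word, cosine similarity) pairs most similar to `word`."""
    query = words.index(word)
    # Scale each row to unit length (assumes no all-zero rows)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities
    similarities = unit.dot(unit[query])
    # Exclude the query word itself from the ranking
    similarities[query] = -np.inf
    top = np.argsort(-similarities)[:k]
    return [(words[i], float(similarities[i])) for i in top]

# Made-up vocabulary and 2-d vectors for illustration only
toy_words = ['please', 'pls', 'cat', 'dog']
toy_matrix = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [-1.0, 0.0]])
result = closest_words('please', toy_words, toy_matrix, 2)
print(result)
```

On the notebook's data the call would be `closest_words('please', wordList, wordVectorMatrix, 10)`.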
