### Using pre-trained GloVe Embedding

GloVe: Global Vectors for Word Representation.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. 

This dataset contains English word vectors pre-trained on the combined Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). 
All tokens are in lowercase. This dataset contains 50-dimensional, 100-dimensional and 200-dimensional pre-trained word vectors.
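
Because the vocabulary is all lowercase, input text generally needs to be lowercased before a lookup, otherwise capitalized words will not be found. Here is a minimal sketch of that idea; `normalize_token` and `toy_vocab` are hypothetical names used only for illustration and are not part of the dataset or of this notebook's later code.

```python
def normalize_token(token):
    """Lowercase a token so it can match an all-lowercase vocabulary."""
    return token.lower()

# Hypothetical toy vocabulary standing in for the real 400K-word one.
toy_vocab = {"today", "great"}

for raw in ["Today", "GREAT", "great"]:
    token = normalize_token(raw)
    # Membership is tested on the normalized form, not on the raw input.
    print(raw, "->", token, token in toy_vocab)
```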

Ref:
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. 
https://nlp.stanford.edu/pubs/glove.pdf


Files can be downloaded from https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation/data

For this notebook I have used **glove.6B.50d.txt**, which contains a 50-dimensional version of the embedding.

#### If you open the file above, you will see that each line contains a token (word) followed by its weights (50 numbers).
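
To make the line format concrete, here is a small sketch that parses one line the same way the loader does. The sample line is illustrative and shortened to four values, so it is an assumption about what a line looks like rather than a verbatim copy from the file.

```python
import numpy as np

# Illustrative line in the GloVe text format, shortened to 4 values
# (a real line in glove.6B.50d.txt has 50 values after the token).
sample_line = "where 0.69237 0.44971 -0.20293 -0.16783\n"

parts = sample_line.split()   # split on whitespace; also drops the newline
token = parts[0]              # first field is the word
vector = np.asarray(parts[1:], dtype='float32')  # remaining fields are the weights

print(token)
print(vector.shape)
```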

In [1]:
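import os
import sys
import numpy as np

def load_glove(fileName):
    """Load GloVe vectors from a text file in the current directory.

    Returns two dicts: word -> vector, and word -> line index in the file.
    """
    embeddings_map = {}  # word -> float32 vector
    word_map = {}        # word -> 0-based line index

    print('loading word embedding file')

    # Read as UTF-8 explicitly instead of relying on the platform default encoding.
    with open(os.path.join(".", fileName), encoding='utf-8') as f:
        for i, line in enumerate(f):
            values = line.split()
            word = values[0]  # first field is the token
            # remaining fields are the embedding weights
            embeddings_map[word] = np.asarray(values[1:], dtype='float32')
            word_map[word] = i

    print('Found %s word vectors.' % len(embeddings_map))
    return embeddings_map, word_map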

In [2]:
embeddings_map, word_map = load_glove('glove.6B.50d.txt')
print('Print details for word: {}'.format('where'))
print(embeddings_map['where'])
print(word_map['where'])

loading word embedding file
Found 400000 word vectors.
Print details for word: where
[  6.92369998e-01   4.49710011e-01  -2.02930003e-01  -1.67830005e-01
   3.05029988e-01  -4.87599999e-01  -6.90280020e-01   1.81630000e-01
  -1.62949994e-01  -4.74770010e-01  -3.30440002e-03  -6.52079999e-01
  -1.01480000e-01  -5.75100005e-01   3.01889986e-01   3.56389999e-01
   2.86289990e-01   4.73670006e-01  -7.14559972e-01  -1.88650005e-02
   1.70959994e-01   2.70969987e-01   1.90709993e-01   7.63260007e-01
  -1.75860003e-01  -1.79809999e+00  -3.37220013e-01   2.73250014e-01
   5.40950000e-02  -5.23500025e-01   3.49079990e+00  -2.76190005e-02
  -2.39490002e-01  -8.69759977e-01   2.66119987e-01   8.79559964e-02
  -1.98850006e-01   1.85340002e-01   4.32500005e-01   4.12079990e-01
  -3.91900003e-01   2.28569999e-01   7.34430030e-02   1.09010004e-01
  -2.34950006e-01   1.60820007e-01  -1.63640007e-02  -1.03470004e+00
  -2.41600007e-01  -4.86799985e-01]
111


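#### Now we have our word vectors.

As an example, let's take a sample input sentence and replace each word with its integer index in the GloVe vocabulary. This index sequence is the input from which the sentence's vector representation, aka the **Embedding Matrix**, is looked up.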

In [3]:
MAX_NUM_WORDS = 10 # just for a sample sentence
MAX_DIMENSION = 50 # we have used 50 dimension embedding of GloVe

def generate_embedding_matrix(sentence):
    words = sentence.split()  # list of tokens
    # One integer index per word position; unused positions stay 0 (padding).
    embedding_matrix = np.zeros((MAX_NUM_WORDS), dtype='int32')

    # Only the first MAX_NUM_WORDS tokens are used; the rest are truncated.
    for i, word in enumerate(words[:MAX_NUM_WORDS]):
        # .get() returns None for out-of-vocabulary words instead of raising KeyError.
        word_index = word_map.get(word)

        if word_index is not None:
            embedding_matrix[i] = word_index

    return embedding_matrix

In [4]:
sentence = 'today i am feeling great'
print('sentence vector obtained after replacing words with integer representation from word vector')
matrix = generate_embedding_matrix(sentence)
print(matrix)
print(matrix.shape)

sentence vector obtained after replacing words with integer representation from word vector
[ 373   41  913 2518  353    0    0    0    0    0]
(10,)
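
The integer indices printed above are not vectors yet. One common way to turn them into vectors is to stack all embeddings into a 2-D table, with row k holding the vector of the word whose index is k, and then index that table with the sequence. Here is a minimal sketch of that lookup using a hypothetical 3-dimensional toy table (`toy_table`, `indices` and `EMBEDDING_DIM` are illustrative names, not part of the notebook); with the real data the table would presumably be built from `embeddings_map` in `word_map` order.

```python
import numpy as np

EMBEDDING_DIM = 3  # toy size; the notebook's GloVe file uses 50

# Hypothetical toy table: row k holds the vector of the word with index k.
toy_table = np.array([
    [0.0, 0.0, 0.0],   # index 0
    [0.1, 0.2, 0.3],   # index 1
    [0.4, 0.5, 0.6],   # index 2
], dtype='float32')

# An index sequence like the one printed above, padded with zeros.
indices = np.array([2, 1, 0, 0], dtype='int32')

# Integer-array indexing selects one row of the table per index.
sentence_vectors = toy_table[indices]
print(sentence_vectors.shape)  # (4, 3)
```

Note that in `word_map` the index 0 belongs to a real word (whichever token is on the first line of the file), so zero padding is indistinguishable from that word. Reserving a separate padding index is one way around this.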
