## Most NLP tasks can be broken down into the following 5 steps

1. Training a word vector generation model (such as Word2Vec) or loading pretrained word vectors
2. Creating an IDs matrix for our training set (we'll discuss this a bit later)
3. RNN (with LSTM units) graph creation
4. Training 
5. Testing

## Loading Data

In [21]:
import numpy as np
words_list =  np.load('data/sentiment_analysis/wordsList.npy')
print('Loaded the word list')
words_list = words_list.tolist()  # Originally loaded as numpy array
words_list = [word.decode('UTF-8') for word in words_list]  # Decode the UTF-8 bytes into Python strings
for i in range(5):
    print(words_list[i])
word_vectors = np.load('data/sentiment_analysis/wordVectors.npy') 
print('Loaded the word vectors')
for i in range(5):
    print(word_vectors[i])

Loaded the word list
proposed
intelligence
giving
hotel
finally


Loaded the word vectors
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]
[ 0.013441  0.23682  -0.16899   0.40951   0.63812   0.47709  -0.42852
 -0.55641  -0.364    -0.23938   0.13001  -0.063734 -0.39575  -0.48162
  0.23291   0.090201 -0.13324   0.078639 -0.41634  -0.15428   0.10068
  0.48891   0.31226  -0.1252   -0.037512 -1.5179    0.12612  -0.02442
 -0.042961 -0.28351   3.5416   -0.11956  -0.014533 -0.1499    0.21864
 -0.33412  -0.13872   0.31806   0.70358   0.44858  -0.080262  0.63003
  0.32111  -0.46765   0.22786   0.36034  -0.37818  -0.56657   0.044691
  0.30392 ]
[ 1.5164e-01  3.0177e-01 -1.6763e-01  1.7684e-01  3.1719e-01  3.3973e-01
 -4.3478e-01 -3.1086e-01 -4.4999e-01 -2.9486e-01  1.6608e-01  1.1963e-01
 -4.1328e-01 -4.2353e-01  5.9868e-01  2.8825e-01 -1.1547e-01 -4.1848e-02
 -6.7989e-01 -2.5063e-01  1.8472e-01  8.6876e-02  4.6582e-01  1.5035e-02
  4.3474e-02 -1.4671e+00 -3

Just to make sure everything has been loaded correctly, we can look at the size of the word list and the number of axes of the embedding matrix.

In [16]:
print(len(words_list))
print(len(word_vectors.shape))  # Number of axes; 2 means a 2-D matrix

400000
2
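
Note that `len(word_vectors.shape)` reports the number of axes, not the size of each axis. A minimal sketch with a small stand-in array shows the difference; `toy_vectors` and its size are made up for illustration and are not the real embedding matrix.

```python
import numpy as np

# Hypothetical stand-in for the embedding matrix: 4 words, 50 dimensions each.
toy_vectors = np.zeros((4, 50), dtype='float32')

print(len(toy_vectors.shape))  # number of axes
print(toy_vectors.ndim)        # same value, written more directly
print(toy_vectors.shape)       # (rows, columns) = (words, dimensions)
```

Printing `.shape` directly is usually the more informative check, since it shows both the number of words and the length of each vector.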


We can also search our word list for a word like "baseball", and then access its corresponding vector through the embedding matrix.

In [22]:
baseball_index = words_list.index('baseball')
print(baseball_index)
print(word_vectors[baseball_index])


1444
[-1.9327    1.0421   -0.78515   0.91033   0.22711  -0.62158  -1.6493
  0.07686  -0.5868    0.058831  0.35628   0.68916  -0.50598   0.70473
  1.2664   -0.40031  -0.020687  0.80863  -0.90566  -0.074054 -0.87675
 -0.6291   -0.12685   0.11524  -0.55685  -1.6826   -0.26291   0.22632
  0.713    -1.0828    2.1231    0.49869   0.066711 -0.48226  -0.17897
  0.47699   0.16384   0.16537  -0.11506  -0.15962  -0.94926  -0.42833
 -0.59457   1.3566   -0.27506   0.19918  -0.36008   0.55667  -0.70315
  0.17157 ]
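
`list.index` scans the list from the start on every call, which can get slow when many words are looked up in a 400,000-word list. One possible alternative, sketched here with a made-up miniature word list (`toy_words` and `word_to_index` are illustrative names, not part of the notebook), is to build a dictionary once and reuse it.

```python
# Hypothetical miniature word list standing in for the full word list.
toy_words = ['the', 'movie', 'baseball', 'was']

# Build the mapping once; each later lookup is a dictionary access.
word_to_index = {}
for i, word in enumerate(toy_words):
    word_to_index.setdefault(word, i)  # keep the first occurrence, like list.index

print(word_to_index['baseball'])
print(word_to_index.get('cricket', -1))  # -1 marks a word that is not in the list
```

Unlike `list.index`, which raises `ValueError` for a missing word, `dict.get` lets the caller choose a fallback value.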


Let's take an input sentence and construct its vector representation.

In [24]:
import tensorflow as tf
max_seq_length = 10  # Maximum length of sentence
num_dimensions = 300  # Dimensions for each word vector (not used below; the loaded vectors have 50)
first_sentence = np.zeros((max_seq_length), dtype='int32')
first_sentence[0] = words_list.index('i')
first_sentence[1] = words_list.index('thought')
first_sentence[2] = words_list.index('the')
first_sentence[3] = words_list.index('movie')
first_sentence[4] = words_list.index('was')
first_sentence[5] = words_list.index('incredible')
first_sentence[6] = words_list.index('and')
first_sentence[7] = words_list.index('inspiring')

# first_sentence[8] and first_sentence[9] are going to be zero
print(first_sentence.shape)
print(first_sentence)

(10,)
[    41    804 201534   1005     15   7446      5  13767      0      0]
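
Filling the array one index at a time does not scale past a single sentence. A minimal sketch of a helper that does the same thing for any sentence follows. The function name `sentence_to_ids`, the whitespace tokenization, the toy vocabulary, and the choice of index 0 for unknown words are all assumptions made for illustration.

```python
import numpy as np

def sentence_to_ids(sentence, word_to_index, max_seq_length, unknown_index=0):
    """Map a whitespace-separated sentence to a fixed-length array of word IDs."""
    ids = np.zeros(max_seq_length, dtype='int32')  # unused positions stay 0 (padding)
    tokens = sentence.lower().split()[:max_seq_length]  # truncate long sentences
    for position, token in enumerate(tokens):
        # Words missing from the vocabulary fall back to unknown_index (an assumption).
        ids[position] = word_to_index.get(token, unknown_index)
    return ids

# Hypothetical toy vocabulary; index 0 is reserved for padding here.
toy_vocab = {'<pad>': 0, 'i': 1, 'thought': 2, 'the': 3, 'movie': 4, 'was': 5}
print(sentence_to_ids('I thought the movie was incredible', toy_vocab, 10))
```

In this toy vocabulary "incredible" is missing, so it maps to the fallback index along with the padding positions.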


In [26]:
with tf.Session() as sess:
    print(tf.nn.embedding_lookup(word_vectors, first_sentence).eval().shape)

(10, 50)


The 10 x 50 output contains the 50-dimensional word vector for each of the 10 positions in the sequence; the last two positions hold index 0, whose vector is all zeros.
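
For a single embedding matrix, the lookup amounts to picking one row per ID. The sketch below shows the same idea with plain NumPy integer-array indexing on a made-up miniature matrix (`toy_vectors` and `toy_sentence` are illustrative, not the real data); it is meant to show what the lookup does, not to replace the TensorFlow call inside a graph.

```python
import numpy as np

# Hypothetical miniature embedding matrix: 6 words, 3 dimensions each.
# Row 0 is all zeros, mirroring the first vector printed earlier.
toy_vectors = np.arange(18, dtype='float32').reshape(6, 3)
toy_vectors[0] = 0.0

toy_sentence = np.array([4, 2, 5, 0, 0], dtype='int32')  # last two IDs are padding

embedded = toy_vectors[toy_sentence]  # one row of toy_vectors per ID
print(embedded.shape)
print(embedded)
```

The result has one row per sequence position and one column per embedding dimension, which is the same layout as the 10 x 50 output above.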