# Natural Language Processing Project - Seq NLP

### Sentiment Classification

### Aim

1. Build a Sequential model using Keras for the sentiment classification task.
2. Report the accuracy of the model.
3. Retrieve the output of each layer in Keras for a single test sample from the trained model.


##### Generate Word Embeddings and retrieve outputs of each layer with Keras based on Classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
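
To make "similar representation" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; they are not learned embeddings and do not come from the model trained below.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.9, -0.2],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_close > sim_far)  # True: the related pair scores higher
```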

We will use the IMDB dataset to learn word embeddings as we train our model. The dataset contains 25,000 training and 25,000 test movie reviews from IMDB, each labeled with its sentiment (positive or negative).



##### Dataset

`from keras.datasets import imdb`

A dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To speed up training, we cap the vocabulary at 10,000 words and pad or truncate each review to 300 words (`maxlen = 300` in the code below).

As a convention, "0" does not stand for a specific word. With the default `load_data` settings, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.
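
The encoding scheme can be sketched in a few lines of plain Python. This is an illustration of the conventions only, not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, the corpus is made up, and the reserved ids follow the `<PAD>`/`<START>`/`<UNK>` convention used in this notebook.

```python
from collections import Counter

PAD, START, UNK = 0, 1, 2   # reserved ids
INDEX_FROM = 3              # offset added to every frequency rank

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words):
    """Encode a text as ids; rare or unseen words become UNK."""
    ids = [START]
    for word in text.split():
        rank = word_index.get(word)
        word_id = rank + INDEX_FROM if rank is not None else UNK
        ids.append(word_id if word_id < num_words else UNK)
    return ids

corpus = ["the movie was good", "the movie was bad", "the plot was thin"]
word_index = build_word_index(corpus)
print(encode("the movie was superb", word_index, num_words=8))  # [1, 4, 6, 5, 2]
print(encode("the plot was thin", word_index, num_words=8))     # [1, 4, 2, 5, 2]
```

In the second call, `plot` and `thin` are in the corpus but too rare to fit under `num_words`, so they are replaced by the unknown id, just like the unseen word `superb` in the first call.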

##### Step 1: Loading Dataset

In [0]:
from keras.datasets import imdb

vocab_size = 10000  # vocabulary size

# num_words keeps only the most frequent words; rarer words are mapped to the unknown id
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [0]:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words kept from each review
# load the dataset as lists of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# pad (with leading zeros) or truncate every review to exactly maxlen ids
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
x_train.shape  # (number of reviews, number of words per review)

(25000, 300)
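
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so short reviews get leading zeros and long reviews keep their last `maxlen` words. A minimal pure-Python sketch of that behavior, where `pad_pre` is a hypothetical helper name and not part of Keras:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)

print(pad_pre([1, 14, 22], maxlen=5))               # [0, 0, 1, 14, 22]
print(pad_pre([1, 14, 22, 16, 43, 530], maxlen=5))  # [14, 22, 16, 43, 530]
```

Note that truncating at the front drops the `<START>` id of long reviews, which is harmless for this classifier.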

To take a look at the review and sentiment:

In [0]:
(training_data, training_labels), (test_data, test_labels)= imdb.load_data(num_words=vocab_size, index_from=3)

In [0]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [0]:
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[word_id] for word_id in training_data[6]))  # avoid shadowing the built-in id()
print('The sentiment is:', training_labels[6])

<START> lavish production values and solid performances in this straightforward adaption of jane <UNK> satirical classic about the marriage game within and between the classes in <UNK> 18th century england northam and paltrow are a <UNK> mixture as friends who must pass through <UNK> and lies to discover that they love each other good humor is a <UNK> virtue which goes a long way towards explaining the <UNK> of the aged source material which has been toned down a bit in its harsh <UNK> i liked the look of the film and how shots were set up and i thought it didn't rely too much on <UNK> of head shots like most other films of the 80s and 90s do very good results
The sentiment is: 1


Review number 6 from the training set is a positive review (label 1).

##### Step 2: Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* Pre-trained word embeddings can be loaded into the embedding layer when we create our model.
* The embedding layer can be used to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done, but if it hadn't been, we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
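
The "dictionary" view can be sketched without Keras: an embedding table is a matrix with one row per word id, and a lookup selects rows by index. The helper names, table size, and random initial values below are assumptions made for illustration; in the real layer the rows are weights updated during training.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` small random values per word id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(ids, table):
    """Replace every word id with its row of the table."""
    return [table[i] for i in ids]

table = make_embedding_table(vocab_size=10, dim=3)
vectors = embed([0, 0, 1, 7], table)   # a padded toy "review" of 4 ids

print(len(vectors), len(vectors[0]))   # 4 3  (sequence length, embedding size)
print(vectors[0] == vectors[1])        # True: the same id always maps to the same vector
```

This is also why no one-hot encoding is needed: indexing row `i` gives the same result as multiplying a one-hot vector for `i` by the table, without building the sparse vector.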

In [0]:
import time
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
from keras.layers import LSTM, TimeDistributed
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=maxlen))  # vocab_size (10000) word ids, 100-dimensional embeddings
model.add(LSTM(64, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 300, 64)           42240     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 300, 100)          6500      
_________________________________________________________________
flatten_1 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               7500250   
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 251       
Total params: 8,549,241
Trainable params: 8,549,241
Non-trainable params: 0
_________________________________________________________________
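
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM formulation with four gates, each with input weights, recurrent weights, and a bias; the variable names are illustrative.

```python
vocab_size, embed_dim, maxlen = 10000, 100, 300
lstm_units, td_units, dense_units = 64, 100, 250

# Embedding: one embed_dim-sized vector per word id.
embedding = vocab_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# TimeDistributed(Dense): one weight matrix and bias shared by all time steps.
time_distributed = lstm_units * td_units + td_units
# Flatten has no weights; it only reshapes (300, 100) into 30000 values.
flatten_width = maxlen * td_units
dense_hidden = flatten_width * dense_units + dense_units
dense_out = dense_units * 1 + 1

total = embedding + lstm + time_distributed + dense_hidden + dense_out
print(embedding, lstm, time_distributed, dense_hidden, dense_out)
# 1000000 42240 6500 7500250 251
print(total)  # 8549241
```

Most of the weights sit in the first Dense layer, because flattening all 300 time steps produces a 30,000-wide input.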


In [0]:
x_train.shape  # (number of examples, number of words)

(25000, 300)

In [0]:
x_train[1]  # words are represented by numbers; leading zeros are padding

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    1,  194, 1153,  194, 8255,   78,  228,    5,    6, 1463,
       4369, 5012,  134,   26,    4,  715,    8,  118, 1634,   14,  394,
         20,   13,  119,  954,  189,  102,    5,  207,  110, 3103,   21,
         14,   69,  188,    8,   30,   23,    7,   

In [0]:
y_train.shape 

(25000,)

In [0]:
y_test.shape

(25000,)

In [0]:
x_test[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    1,   14,   22, 3443,
          6,  176,    7, 5063,   88,   12, 2679,   23, 1310,    5,  109,
        943,    4,  114,    9,   55,  606,    5,  111,    7,    4,  139,
        193,  273,   23,    4,  172,  270,   11, 7216,    2,    4, 8463,
       2801,  109, 1603,   21,    4,   22, 3861,    8,    6, 1193, 1330,
         10,   10,    4,  105,  987,   35,  841,    2,   19,  861, 1074,
          5, 1987,    2,   45,   55,  221,   15,  670, 5304,  526,   14,
       1069,    4,  405,    5, 2438,    7,   27,   85,  108,  131,    4,
       5045, 5304, 3884,  405,    9, 3523,  133,    5,   50,   13,  104,
         51,   66,  166,   14,   22,  157,    9,    4,  530,  239,   34,
       8463, 2801,   45,  407,   31,    7,   41, 37

##### Step 3: Training the Model

In [0]:
start = time.perf_counter()  # time.clock() was removed in Python 3.8
history = model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))
end = time.perf_counter()
print('Time spent:', end - start)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Time spent: 1262.687027


In [0]:
model.layers #get all layers from model

[<keras.layers.embeddings.Embedding at 0x7f34e9841f98>,
 <keras.layers.recurrent.LSTM at 0x7f34e9761c50>,
 <keras.layers.wrappers.TimeDistributed at 0x7f34e961ad68>,
 <keras.layers.core.Flatten at 0x7f34e975d898>,
 <keras.layers.core.Dense at 0x7f34e95ffeb8>,
 <keras.layers.core.Dense at 0x7f34e9630b70>]

##### Step 4: Evaluating Model

In [0]:
score = model.evaluate(x_test, y_test)



In [0]:
score  # [test loss, test accuracy]

[0.8053167874073982, 0.85976]

In [0]:
print('\nTest Acc: %.2f%%' %(score[1]*100))


Test Acc: 85.98%
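
The two numbers in `score` are the binary cross-entropy loss and the accuracy. As a rough sketch of how such values are derived from sigmoid outputs, the plain-Python version below uses made-up labels and probabilities and assumes a 0.5 decision threshold; it is not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]  # invented sigmoid outputs

print(accuracy(labels, probs))  # 0.75
print(round(binary_crossentropy(labels, probs), 4))
```

Accuracy only counts which side of the threshold a prediction falls on, while the loss also penalizes how confident a wrong prediction is, so the two metrics can move independently.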


In [0]:
model.layers[0].output

<tf.Tensor 'embedding_3/embedding_lookup/Identity:0' shape=(?, 300, 100) dtype=float32>

In [0]:
import numpy as np
model.predict(np.array([x_test[11],]))

array([[0.00787853]], dtype=float32)

In [0]:
model.layers[4].output

<tf.Tensor 'dense_3/Relu:0' shape=(?, 250) dtype=float32>

In [0]:
from keras import backend as K

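inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
# one evaluation function per layer; the second input selects the learning phase
functors = [K.function([inp, K.learning_phase()], [out]) for out in outputs]

# Testing on a single sample (a batch of size 1)
test = np.array([x_test[11]])
layer_outs = [func([test, 0.]) for func in functors]        # 0. = test phase, 1. = training phase
# indexing: [layer 0 = embedding][first output][first sample][first time step]
print(len(layer_outs[0][0][0][0]))                          # length of one embedding vector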

100
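
Conceptually, retrieving per-layer outputs means feeding the sample through the layers in order and recording every intermediate result. A framework-free sketch of that idea follows; `run_and_record` and the toy layers (`embedding`, `flatten`, `dense_sum`) are hypothetical stand-ins, not Keras APIs, and the toy table values are invented.

```python
def run_and_record(layers, x):
    """Feed x through each layer in order, keeping every intermediate output."""
    outputs = []
    for layer in layers:
        x = layer(x)
        outputs.append(x)
    return outputs

# Toy stand-ins for real layers.
toy_table = {0: [0.0, 0.0], 1: [0.5, -0.5], 2: [1.0, 2.0]}

def embedding(ids):
    return [toy_table[i] for i in ids]

def flatten(rows):
    return [value for row in rows for value in row]

def dense_sum(values):
    return [sum(values)]

outs = run_and_record([embedding, flatten, dense_sum], [0, 1, 2])
for name, out in zip(["embedding", "flatten", "dense"], outs):
    print(name, out)
# embedding [[0.0, 0.0], [0.5, -0.5], [1.0, 2.0]]
# flatten [0.0, 0.0, 0.5, -0.5, 1.0, 2.0]
# dense [3.0]
```

The list returned here plays the same role as `layer_outs` above: one entry per layer, in model order.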
