In [1]:
import pandas as pd
import glob

The input to an embedding layer is conceptually a one-hot vector; the layer returns the corresponding embedding, which we forward to the RNN. Training this layer together with the rest of the network amounts to learning our own embeddings.  
We can also use pretrained embeddings, which acts as a form of transfer learning.  
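
As a rough illustration of the pretrained option, the sketch below builds an embedding weight matrix from existing word vectors. The `word_index` mapping, the `pretrained` dictionary and the `build_embedding_matrix` helper are all hypothetical stand-ins (real vectors would be loaded from a file such as a GloVe download); words without a pretrained vector keep a zero row.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim):
    """Return a (vocab_size, dim) matrix whose row i is the vector of the word with index i."""
    vocab_size = max(word_index.values()) + 1  # +1 because index 0 is kept for padding
    matrix = np.zeros((vocab_size, dim), dtype="float32")
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:  # words missing from the pretrained set stay all-zero
            matrix[idx] = vector
    return matrix

# Toy data, invented for illustration only
word_index = {"good": 1, "bad": 2, "movie": 3}
pretrained = {"good": [0.1, 0.9], "bad": [0.8, -0.2]}

embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=2)
print(embedding_matrix.shape)  # (4, 2)
```

In the Keras 2 releases this notebook appears to target, such a matrix is commonly handed to the `Embedding` layer as initial weights (`weights=[embedding_matrix]`) and frozen with `trainable=False`; treat that as an assumption to check against the installed version.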

In [2]:
pd.set_option('display.max_colwidth', -1)

In [3]:
POS_PATH = '/home/archit/notebooks/datastories/deep_learning/aclImdb_v1/aclImdb/train/pos/*.txt'
NEG_PATH = '/home/archit/notebooks/datastories/deep_learning/aclImdb_v1/aclImdb/train/neg/*.txt'

In [4]:
pos_files = glob.glob(POS_PATH)
neg_files = glob.glob(NEG_PATH)

In [5]:
# Read every positive review into a list
pos_list = []

for ff in pos_files:
    with open(ff) as f:
        pos_list.append(f.read())

# Label positive reviews 1 and negative reviews 0, matching the Keras IMDB labels used later
pos_df = pd.DataFrame({'review': pos_list, 'sentiment': 1})

# Read every negative review into a list
neg_list = []

for ff in neg_files:
    with open(ff) as f:
        neg_list.append(f.read())

neg_df = pd.DataFrame({'review': neg_list, 'sentiment': 0})

In [6]:
train_df = pd.concat([pos_df, neg_df])

In [7]:
train_df = train_df.sample(frac=1)

In [8]:
train_df.shape

(25000, 2)

### First

In [9]:
from keras.datasets import imdb

vocab_size = 5000

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=vocab_size, # Keep the 5000 most frequent words in the reviews
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)

Using TensorFlow backend.
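
The index arguments above can be hard to picture, so here is a small sketch of the encoding they describe, based on my reading of `imdb.load_data`: every review starts with `start_char`, word ranks are shifted by `index_from`, and any shifted index that is not below `num_words` becomes `oov_char`. The `encode_review` helper and the `word_rank` values are hypothetical; the real ranks come from the dataset's word index.

```python
def encode_review(words, word_rank, num_words=5000,
                  start_char=1, oov_char=2, index_from=3):
    """Encode a list of words using the same index conventions as the load_data call above."""
    encoded = [start_char]                   # every review begins with the start marker
    for word in words:
        idx = word_rank[word] + index_from   # shift ranks to leave room for 0, 1 and 2
        encoded.append(idx if idx < num_words else oov_char)
    return encoded

# Invented frequency ranks (1 = most frequent word)
word_rank = {"the": 1, "movie": 17, "zyzzyva": 80000}

encoded = encode_review(["the", "movie", "zyzzyva"], word_rank)
print(encoded)  # [1, 4, 20, 2]
```

Index 0 never appears in an encoded review, which leaves it free to act as the padding value.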


https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e

We need to pad x_train and x_test so that every review has the same length.

In [10]:
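# max() and min() accept a key function; here the key is len.
# Caution: assuming load_data returned NumPy object arrays of lists, x_train+x_test adds
# them elementwise, so each item is the i-th training review joined to the i-th test review.
# The lengths below therefore describe paired reviews, not single reviews.
max_len = len(max(x_train+x_test, key=len))
min_len = len(min(x_train+x_test, key=len))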

In [11]:
print("{} {}".format(max_len, min_len))

2697 70
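
A caveat about `x_train+x_test`: if both are NumPy object arrays of lists (which is what Keras 2 appears to return), `+` works elementwise and concatenates the i-th review of one array with the i-th review of the other. The toy sketch below shows the difference between that and measuring each review on its own; `make_object_array` is a hypothetical helper and the data is invented.

```python
import numpy as np

def make_object_array(list_of_lists):
    """Build a 1-D object array whose elements are Python lists of varying length."""
    arr = np.empty(len(list_of_lists), dtype=object)
    for i, item in enumerate(list_of_lists):
        arr[i] = item
    return arr

a = make_object_array([[1, 2, 3], [4]])
b = make_object_array([[5], [6, 7]])

# Elementwise '+': [1, 2, 3] + [5] and [4] + [6, 7]
paired = a + b
paired_lengths = [len(review) for review in paired]

# Per-review lengths across both arrays
single_lengths = [len(review) for review in a] + [len(review) for review in b]

print(paired_lengths)                             # [4, 3]
print(max(single_lengths), min(single_lengths))   # 3 1
```

To get the longest and shortest individual reviews, the per-review form (lengths collected from each array separately) is the one to use.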


In [12]:
desired_len = 500

In [13]:
from keras.preprocessing import sequence

In [14]:
x_train = sequence.pad_sequences(x_train, maxlen = desired_len)
x_test = sequence.pad_sequences(x_test, maxlen = desired_len)

In [15]:
x_train[0].shape

(500,)
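
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings as I understand them (`padding='pre'`, `truncating='pre'`, `value=0`): short reviews get zeros added at the front, and long reviews keep only their last `maxlen` tokens. `pad_pre` is a hypothetical re-implementation for illustration, not the Keras code.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        trimmed = list(seq)[-maxlen:]          # 'pre' truncation drops tokens from the start
        if trimmed:
            out[i, -len(trimmed):] = trimmed   # 'pre' padding leaves the zeros on the left
    return out

padded = pad_pre([[1, 2, 3], [4]], maxlen=2)
print(padded)
# [[2 3]
#  [0 4]]
```

With `desired_len = 500`, any review longer than 500 tokens loses its beginning rather than its end under these defaults.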

In [16]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Activation

In [17]:
embedding_dim = 32

model = Sequential()

model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=desired_len))
model.add(LSTM(units=100))
model.add(Dense(units=10))
model.add(Dense(units=1))
model.add(Activation('sigmoid'))

In [18]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1010      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
=================================================================
Total params: 214,221
Trainable params: 214,221
Non-trainable params: 0
_________________________________________________________________


Each review is padded or truncated to 500 tokens, and the vocabulary holds 5000 words.  
Embedding input: each word is an integer index, which is conceptually a 5000-dimensional one-hot vector.  
Embedding output: each word is represented by a 32-dimensional vector.  
So each review is a sequence of 500 words, each represented by a 32-dimensional vector.  
We feed this sequence to the LSTM. It has 100 units, so its hidden state has 100 dimensions, and the layer outputs the final 100-dimensional hidden state.  

At this point each review is reduced to a 100-dimensional vector, which is further reduced to a 10-dimensional vector and then to a single value, to which we apply the sigmoid.
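
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are mine.

```python
vocab_size, embedding_dim, lstm_units = 5000, 32, 100

# Embedding: one 32-dimensional row per vocabulary word
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus biases
dense_10_params = lstm_units * 10 + 10
dense_1_params = 10 * 1 + 1

total_params = embedding_params + lstm_params + dense_10_params + dense_1_params
print(embedding_params, lstm_params, dense_10_params, dense_1_params, total_params)
# 160000 53200 1010 11 214221
```

These match the `Param #` column of `model.summary()`, including the total of 214,221.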

In [19]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [20]:
model.fit(x_train, y_train, validation_split=0.2, epochs=5)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efefb643590>

Here we used Keras's Embedding layer. It is equivalent to a fully connected (FC) layer without a bias applied to one-hot inputs, but it is implemented as a table lookup, so the representation is simpler. 
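
The equivalence between an embedding lookup and a bias-free FC layer on one-hot inputs can be verified numerically. This is a small NumPy sketch with made-up sizes and random weights, not the Keras implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embedding_dim = 6, 3
W = rng.randn(vocab_size, embedding_dim)   # plays the role of the embedding weights

tokens = np.array([4, 0, 2])               # a tiny "review" of word indices

# FC view: multiply one-hot rows by the weight matrix
one_hot = np.eye(vocab_size)[tokens]       # shape (3, 6)
via_matmul = one_hot.dot(W)                # shape (3, 3)

# Embedding view: pick the matching rows directly
via_lookup = W[tokens]

print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup avoids materialising the 5000-dimensional one-hot vectors, which is why the layer takes integer indices as input.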

In [21]:
y_pred = model.predict(x_test)

In [22]:
print("{} \n{}".format(y_pred[:3], y_test[:3]))

[[0.02762657]
 [0.99611044]
 [0.01141503]] 
[0 1 1]
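
The model outputs sigmoid probabilities, so they need a threshold before they can be compared with the 0/1 labels. The sketch below uses the three predictions printed above and a cutoff of 0.5, which is an assumption on my part about how the accuracy metric rounds.

```python
import numpy as np

# The first three predictions and labels shown above
y_pred_sample = np.array([[0.02762657], [0.99611044], [0.01141503]])
y_true_sample = np.array([0, 1, 1])

# Probabilities at or above 0.5 count as class 1
labels = (y_pred_sample.ravel() >= 0.5).astype(int)
manual_accuracy = float((labels == y_true_sample).mean())

print(labels)  # [0 1 0]
```

On this sample two of the three reviews are classified correctly; the third is a confident miss.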


In [24]:
_, accuracy = model.evaluate(x_test, y_test)



In [25]:
print("Accuracy = {:.2f}".format(accuracy))

Accuracy = 0.86
