# Neural network model

Since we are working with text, we train a recurrent neural network, specifically an LSTM.
The architecture is many-to-one: the network reads many words as input and produces a single label, 1 for positive and 0 for negative sentiment.

The architecture is described in detail further below.

In [1]:
import pandas as pd
import numpy as np
from tensorflow import keras

In [18]:
import os

In [13]:
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs --bind_all

In [3]:
# load data
train = pd.read_pickle('../data/train/comments_embed.pkl')
test = pd.read_pickle('../data/test/comments_embed.pkl')

In [4]:
# inspect the first rows of the training data
train.head(n=5)

Unnamed: 0,comment,sentiment,comment_ids,words_n,x
0,"[movi, get, respect, sure, lot, memor, quot, l...",1,"[1, 8, 615, 140, 67, 751, 1564, 716, 1145, 354...",29,"[1, 8, 615, 140, 67, 751, 1564, 716, 1145, 354..."
1,"[bizarr, horror, movi, fill, famou, face, stol...",1,"[966, 109, 1, 624, 701, 228, 2183, 6760, 1478,...",93,"[966, 109, 1, 624, 701, 228, 2183, 6760, 1478,..."
2,"[solid, unremark, film, matthau, einstein, won...",1,"[998, 7012, 2, 2525, 4637, 102, 379, 61, 33, 1...",24,"[998, 7012, 2, 2525, 4637, 102, 379, 61, 33, 1..."
3,"[strang, feel, sit, alon, theater, occupi, par...",1,"[473, 60, 424, 502, 503, 3788, 597, 13585, 137...",214,"[473, 60, 424, 502, 503, 3788, 597, 13585, 137..."
4,"[probabl, alreadi, know, addit, episod, never,...",1,"[156, 385, 35, 1006, 176, 48, 673, 229, 116, 1...",66,"[156, 385, 35, 1006, 176, 48, 673, 229, 116, 1..."


In [5]:
train.x[0]

array([    1,     8,   615,   140,    67,   751,  1564,   716,  1145,
         354,     1,   779, 10299,    63,    79,  5503, 10634,    16,
       12978, 12979,     9,   287,   783,    11,  1362, 12980,  6525,
         476,  5294,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0])

In [6]:
test.head(n=5)

Unnamed: 0,comment,sentiment,comment_ids,x
0,"[base, actual, stori, john, boorman, show, str...",1,"[332, 63, 13, 221, 9212, 18, 764, 190, 786, 54...","[332, 63, 13, 221, 9212, 18, 764, 190, 786, 54..."
1,"[gem, film, four, product, anticip, qualiti, i...",1,"[1145, 2, 619, 218, 2348, 367, 750, 518, 150, ...","[1145, 2, 619, 218, 2348, 367, 750, 518, 150, ..."
2,"[realli, like, show, drama, romanc, comedi, ro...",1,"[15, 4, 18, 373, 717, 106, 847, 3, 587, 344, 2...","[15, 4, 18, 373, 717, 106, 847, 3, 587, 344, 2..."
3,"[best, experi, disney, themepark, certainli, b...",1,"[51, 345, 723, 369, 55, 85, 2, 147, 2136, 55, ...","[51, 345, 723, 369, 55, 85, 2, 147, 2136, 55, ..."
4,"[korean, movi, ive, seen, three, realli, stuck...",1,"[2752, 1, 116, 43, 217, 15, 1382, 27, 207, 109...","[2752, 1, 116, 43, 217, 15, 1382, 27, 207, 109..."


From the previous script, we know that our vocabulary contains 15000 words and that the maximum length of a comment is 100 tokens.
We also set the embedding size to 100 for now; these are all hyper-parameters to experiment with later.

In [8]:
COMMENT_SIZE = 100
VOCAB_SIZE = 15000
EMBEDDING_SIZE = 100
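
The padding itself happens in the previous script, which is not shown here. As a rough sketch of what it presumably does, the hypothetical helper `pad_ids` below cuts a list of word ids to COMMENT_SIZE and fills the rest with zeros on the right, which matches the trailing zeros visible in `train.x[0]`. The truncation step and the helper name are assumptions, not the code that was actually used.

```python
import numpy as np

COMMENT_SIZE = 100  # same value as in the notebook


def pad_ids(ids, size=COMMENT_SIZE, pad_value=0):
    """Hypothetical helper: truncate to `size`, then right-pad with `pad_value`."""
    ids = list(ids)[:size]
    return np.array(ids + [pad_value] * (size - len(ids)), dtype=np.int64)


short = pad_ids([1, 8, 615, 140])   # shorter than COMMENT_SIZE: padded with zeros
long = pad_ids(range(1, 151))       # longer than COMMENT_SIZE: cut to the first 100 ids

print(short.shape, long.shape)
```

Both results have exactly COMMENT_SIZE entries, so every comment can be fed to the network as a row of the same width.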

Since the data comes from a pandas DataFrame, train.x.values is a one-dimensional object array whose elements are themselves np.arrays, not a single 2-D np.ndarray.
This might cause problems during training, so we need to explicitly convert it to a 2-D array. One way is to use np.stack:

In [102]:
# not good: we need shape (25000, 100)
train.x.values.shape

(25000,)

In [103]:
train_x = np.stack(train.x.values)

In [104]:
# ok
train_x.shape

(25000, 100)

In [106]:
test_x = np.stack(test.x.values)

In [111]:
# target (to make sure we have np arrays)
train_y = np.array(train.sentiment.values)
test_y = np.array(test.sentiment.values)
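
The shape problem above can be reproduced on toy data without the real dataset. The sketch below builds a small object array by hand (an assumption standing in for what `train.x.values` returns) and shows that np.stack turns it into a regular 2-D integer array.

```python
import numpy as np

# Object array holding one np.array per row, mimicking a DataFrame column of arrays.
raw = np.empty(2, dtype=object)
raw[0] = np.array([1, 2, 0])
raw[1] = np.array([3, 0, 0])

# One dimension only: NumPy sees 2 opaque Python objects, not a 2 x 3 matrix.
print(raw.dtype, raw.shape)

# np.stack joins the equally sized rows along a new first axis.
stacked = np.stack(raw)
print(stacked.dtype, stacked.shape)
```

np.stack only works here because every row has the same length, which is exactly what padding the comments to a fixed size guarantees.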

Our first neural network consists of the following layers:
- Embedding layer: learns a basic word embedding for the NN to work with. Later we will compare it with pretrained embeddings (or train our own embeddings separately).
- Bidirectional LSTM layer: we need a recurrent NN because text is sequential data, so we chose an LSTM. We made it bidirectional because we read that it captures context better when making predictions, although other architectures are worth trying too. The size of the LSTM layer is a parameter as well: we can try different sizes and compare the results, and we will start with 64.
- Dense layer: since we need a single number at the end, either 1 or 0 (positive or negative sentiment), we add a Dense layer to map the LSTM output to such a number. For the activation function we chose sigmoid. We considered softmax, but softmax is just the generalization of sigmoid to multiclass classification, so sigmoid is enough for our binary problem.
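
The remark that softmax generalizes sigmoid can be checked numerically. A softmax over two logits, where the second logit is fixed at 0, gives the same probability for the first class as a sigmoid applied to the first logit. The sketch below uses only the standard library; the function names are ours, not Keras APIs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


for z in (-2.0, 0.0, 3.5):
    two_class = softmax([z, 0.0])[0]  # probability of the first of two classes
    assert math.isclose(sigmoid(z), two_class)
    print(z, sigmoid(z), two_class)
```

This is why a single sigmoid unit is sufficient for two classes: a second softmax output would carry no extra information.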

Our first NN might be prone to overfitting. In the future we can add, for example, a Dropout layer to try to prevent it.

In [117]:
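# define NN architecture
class SentimentClassifier_v1(keras.Model):
    """Embedding -> Bidirectional LSTM -> Dense(sigmoid) binary classifier."""

    def __init__(self, vocab_size, embedding_size, comment_size, lstm_size):
        super().__init__()

        # trainable embedding; id 0 is treated as padding and masked out
        self.emb = Embedding(
            input_dim=vocab_size,
            output_dim=embedding_size,
            input_length=comment_size,
            mask_zero=True,
            trainable=True
        )

        # only the final output of each direction is returned (many-to-one)
        self.lstm_layer = Bidirectional(LSTM(lstm_size))
        # single sigmoid unit: probability of positive sentiment
        self.output_layer = Dense(1, activation="sigmoid")

    def call(self, x):
        x = self.emb(x)           # (batch, comment_size, embedding_size)
        x = self.lstm_layer(x)    # (batch, 2 * lstm_size)
        x = self.output_layer(x)  # (batch, 1)

        return x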

In [122]:
# create NN object; VOCAB_SIZE + 1 because id 0 is reserved for padding (mask_zero=True)
nn_v1 = SentimentClassifier_v1(VOCAB_SIZE + 1, EMBEDDING_SIZE, COMMENT_SIZE, 64)
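
To make the many-to-one data flow more concrete, here is a NumPy-only sketch of what happens to the shapes at the embedding step. It is not the Keras implementation: the random lookup table and the toy batch are made up for illustration, and the LSTM and Dense steps are only described in comments.

```python
import numpy as np

VOCAB_SIZE, EMBEDDING_SIZE, COMMENT_SIZE, LSTM_SIZE = 15000, 100, 100, 64
rng = np.random.default_rng(0)

# Embedding table: one row per word id, plus row 0 for the padding id.
table = rng.normal(size=(VOCAB_SIZE + 1, EMBEDDING_SIZE))

# Toy batch of two padded comments (ids followed by zeros).
batch = np.zeros((2, COMMENT_SIZE), dtype=np.int64)
batch[0, :4] = [1, 8, 615, 140]
batch[1, :3] = [966, 109, 1]

embedded = table[batch]      # lookup: (2, 100) ids -> (2, 100, 100) vectors
mask = batch != 0            # what mask_zero=True marks as real tokens
lengths = mask.sum(axis=1)   # number of unmasked steps per comment

print(embedded.shape, mask.shape, lengths)

# Bidirectional(LSTM(64)) with the default "concat" merge would then return
# one vector of size 2 * LSTM_SIZE per comment, and Dense(1) one number per comment.
```

The mask is what lets the LSTM skip the zero padding instead of treating it as real words.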

Before compiling our model, we need to choose an optimizer.

For the first try we will go with Adam. Later we can try others, such as SGD.
Our loss function is binary_crossentropy.

Our metric is accuracy. The dataset is balanced (the same number of positive and negative examples), and in that case we think accuracy is a reasonable metric.
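
For intuition about the loss and the metric, the sketch below computes both by hand on four made-up predictions. The function names are ours; the formulas are the standard binary cross-entropy and accuracy with a 0.5 threshold, which we assume is what Keras applies for a sigmoid output.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples where the thresholded probability equals the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))


y_true = np.array([1, 1, 0, 0])            # balanced toy labels
y_prob = np.array([0.9, 0.6, 0.2, 0.7])    # made-up sigmoid outputs
constant = np.full(4, 0.5)                 # a model that has learned nothing

print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
print(binary_crossentropy(y_true, constant), accuracy(y_true, constant))
```

On a balanced dataset the uninformed predictor scores 50% accuracy and a loss of ln 2, so these two numbers are a useful baseline when reading the training logs.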

In [123]:
# add callbacks - tensorboard and compile
callbacks = [
    keras.callbacks.TensorBoard(
        log_dir=os.path.join("logs", "sentiment_classifier_v1"),
        histogram_freq=1,
        profile_batch=0)
]

nn_v1.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'])

... aaand it is time for training!

In [None]:
nn_v1.fit(
    x=train_x,
    y=train_y,
    batch_size=32,
    epochs=10,
    validation_data=(test_x, test_y),
    callbacks=callbacks
)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10