# LSTM: Sentiment Analysis Using the IMDB Movie Review Data Set

- In this tutorial, we will build a model that classifies movie reviews as positive or negative.
- This tutorial is intended for readers who are comfortable with Python and have a basic understanding of the LSTM model.


## Prerequisites

You need to install:
    - Python 3.6: https://www.python.org/downloads/
    - MXNet (GPU version): https://mxnet.incubator.apache.org/get_started/install.html

You also need to understand the LSTM model.  
For this tutorial, the basic introductions available on many websites are sufficient. If you want a deeper understanding, please refer to the following article:
    - https://arxiv.org/pdf/1503.00185v5.pdf
    
Finally, you should be familiar with Python, MXNet, and its Gluon package.


## The Data

We will use the IMDB movie review data set. You can download it from the following link.  
It contains 25,000 labeled training reviews and 25,000 labeled test reviews.  
See the data set's README for more information.
    - http://ai.stanford.edu/~amaas/data/sentiment/
    
We will use Stanford's pretrained Global Vectors for Word Representation (GloVe) as word embeddings.  
Download glove.42B.300d.zip from the following link:
    - https://nlp.stanford.edu/projects/glove/
    
    
## Prepare the Data

##### 1. Read the data and label it. (1 means positive and 0 means negative.)

In [1]:
import os

# helper function to read files
def read_files(folder_name):
    sentiments =[]
    filenames = os.listdir(os.curdir+"/"+folder_name)

    for file in filenames:
        with open(folder_name+"/"+file, "r", encoding="utf8") as f:
            data = f.read().replace("\n", "")
            sentiments.append(data)

    return sentiments

data_path = '../data/aclImdb'

# prepare train data
train_pos_foldername = data_path + "/train/pos"
train_pos_sentiments = read_files(train_pos_foldername)

train_neg_foldername = data_path + "/train/neg"
train_neg_sentiments = read_files(train_neg_foldername)

train_pos_labels = [1 for _ in train_pos_sentiments]
train_neg_labels = [0 for _ in train_neg_sentiments]
train_all_labels = train_pos_labels + train_neg_labels

train_all_sentiments = train_pos_sentiments + train_neg_sentiments

# prepare test data
test_pos_foldername = data_path + "/test/pos"
test_pos_sentiments = read_files(test_pos_foldername)

test_neg_foldername = data_path + "/test/neg"
test_neg_sentiments = read_files(test_neg_foldername)

test_pos_labels = [1 for _ in test_pos_sentiments]
test_neg_labels = [0 for _ in test_neg_sentiments]
test_all_labels = test_pos_labels + test_neg_labels

# use the test reviews here, not the training reviews
test_all_sentiments = test_pos_sentiments + test_neg_sentiments
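
To check the one-folder-per-label layout without downloading anything, the following minimal sketch builds a tiny throwaway corpus in a temporary directory and reads it back. The helper name `read_reviews` and the two sample reviews are hypothetical; only the `pos`/`neg` folder convention comes from the data set.

```python
import os
import tempfile

def read_reviews(folder):
    """Return the text of every file in `folder`, in sorted filename order."""
    reviews = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), "r", encoding="utf8") as f:
            # Assumption: join lines with a space so words do not run together.
            reviews.append(f.read().replace("\n", " "))
    return reviews

with tempfile.TemporaryDirectory() as root:
    # Made-up sample reviews, one folder per label.
    samples = {"pos": ["A wonderful film."], "neg": ["Dull and far too long."]}
    for label, texts in samples.items():
        os.makedirs(os.path.join(root, label))
        for i, text in enumerate(texts):
            path = os.path.join(root, label, "%d.txt" % i)
            with open(path, "w", encoding="utf8") as f:
                f.write(text)

    pos = read_reviews(os.path.join(root, "pos"))
    neg = read_reviews(os.path.join(root, "neg"))

# Positive reviews first, then negative ones, with matching labels.
toy_reviews = pos + neg
toy_labels = [1] * len(pos) + [0] * len(neg)
print(list(zip(toy_labels, toy_reviews)))
```

The reviews and labels are two parallel lists, so any later reordering has to be applied to both at once.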

##### 2. We will make a dictionary from the words in the train data. Before that, let's define some helper functions.

In [2]:
# Delete various special characters from the review sentence.
def clear_str(sentiment):
    import re
    string = sentiment.lower().replace("<br />", " ")
    remove_spe_chars = re.compile("[^A-Za-z0-9 ]+")

    return re.sub(remove_spe_chars, "", string.lower())

# Count the frequency of occurrence of a word.
def count_word(sentiments):
    from collections import Counter
    
    word_counter = Counter()
    for sentiment in sentiments:
        # Counter.update adds one count for every word in the list
        word_counter.update(clear_str(sentiment).split())

    return word_counter

# make dictionary
def create_word_index(word_counter):
    idx = 1
    word_dict = {}

    for word in word_counter.most_common():
        word_dict[word[0]] = idx
        idx += 1

    return word_dict 

# Save an object to a file.
# Building the dictionary is slow because the data set is large.
# Therefore, it is better to build it only once, save it, and load it from the file afterwards.
import pickle as pkl
def save_file(file_obj, file_name):
    f = open(file_name, 'wb')
    pkl.dump(file_obj, f, -1)
    f.close()
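
To see what these helpers produce, here is a minimal sketch that runs the same three steps (clean, count, index) on two made-up reviews. The names `clean`, `toy_reviews`, and `toy_dict` are hypothetical and stand in for the functions above.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, turn HTML line breaks into spaces, drop other special characters."""
    text = text.lower().replace("<br />", " ")
    return re.sub(r"[^a-z0-9 ]+", "", text)

# Two made-up reviews for illustration.
toy_reviews = ["Great movie!<br />Great cast.", "Not a great movie..."]

counts = Counter()
for review in toy_reviews:
    counts.update(clean(review).split())

# Index 1 goes to the most frequent word; index 0 is never assigned,
# which leaves it free to be used as a padding value later.
toy_dict = {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

print(clean(toy_reviews[0]))
print(toy_dict)
```

Because indices follow frequency rank, keeping only small indices later is the same as keeping only the most common words.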

##### 3. Now let's make a dictionary!

In [3]:
def make_dictionary(dictionary_file_name, sentiments):
    # if the file already exists, load the dictionary from it
    if os.path.exists(dictionary_file_name):
        f = open(dictionary_file_name, 'rb')
        word_dict = pkl.load(f)
        f.close()
        return word_dict

    else:
        # count word
        word_counter = count_word(sentiments)
        word_dict = create_word_index(word_counter)
        save_file(word_dict, dictionary_file_name)
        return word_dict
    

dictionary_file_name = 'imdb.dict.pkl'
word_dict = make_dictionary( dictionary_file_name, train_all_sentiments )
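
The build-once, load-afterwards pattern used by `make_dictionary` can be sketched in isolation as follows. The helper `load_or_build` and the toy dictionary are hypothetical; the sketch writes to a temporary directory so it leaves no files behind.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`; build and save it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj

calls = []

def build_toy_dict():
    calls.append(1)  # record that the expensive step actually ran
    return {"great": 1, "movie": 2}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy.dict.pkl")
    first = load_or_build(cache, build_toy_dict)   # builds and saves
    second = load_or_build(cache, build_toy_dict)  # loads from disk

print(first == second, len(calls))
```

One consequence of this pattern is that a stale file wins: if the training data or the cleaning rules change, the saved file has to be deleted before the dictionary is rebuilt.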

##### 4. Encode each review as a list of word indices using the dictionary, then convert it to a NumPy array.

In [4]:
def encode_sentences(input_file, word_dict):
    output_string = []
    for line in input_file:
        output_line = []
        for word in clear_str(line).split():
            if word in word_dict:
                output_line.append(word_dict[word])
        output_string.append(output_line)

    return output_string

train_pos_encoded = encode_sentences(train_pos_sentiments, word_dict)
train_neg_encoded = encode_sentences(train_neg_sentiments, word_dict)
train_all_encoded = train_pos_encoded + train_neg_encoded

test_pos_encoded = encode_sentences(test_pos_sentiments, word_dict)
test_neg_encoded = encode_sentences(test_neg_sentiments, word_dict)
test_all_encoded = test_pos_encoded + test_neg_encoded


import numpy as np
voca_size = 10000 # number of vocabulary entries we will track
train_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in s]) for s in train_all_encoded]
test_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in s]) for s in test_all_encoded]
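
The encoding and the vocabulary cut-off can be traced by hand on a small example. This is a minimal sketch with a made-up dictionary and a hypothetical `toy_voca_size` of 5 (the tutorial uses 10000); it uses plain lists instead of NumPy arrays.

```python
# Made-up dictionary: a smaller index means a more frequent word.
toy_dict = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5, "sublime": 6}
toy_voca_size = 5

def encode(text, word_dict):
    """Map known words to their index; words missing from the dictionary are skipped."""
    return [word_dict[w] for w in text.split() if w in word_dict]

def clip(ids, voca_size):
    """Replace every index >= voca_size - 1 with the shared index voca_size - 1."""
    return [min(i, voca_size - 1) for i in ids]

ids = encode("the movie was sublime and boring", toy_dict)
print(ids)                       # "and" is not in the dictionary and is dropped
print(clip(ids, toy_voca_size))  # rare words collapse into index 4
```

Every word whose index is `voca_size - 1` or higher ends up sharing one index, so the model cannot tell these rare words apart.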

##### 5. Next, we need a vector representation of each word (a word embedding). Training one from scratch is a large task of its own, so we will use pretrained vectors instead: Stanford's Global Vectors for Word Representation (GloVe), which you downloaded in The Data section.

In [6]:
from mxnet import nd
def load_glove_index(path):
    import io
    f = io.open(path, encoding="utf8")
    embeddings_index = {}

    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype = 'float32')
        embeddings_index[word] = coefs
    f.close()
    return embeddings_index

def create_embed(filename, glove_path, num_embed):
    if os.path.exists(filename):
        f = open(filename, 'rb')
        embedding_matrix = pkl.load(f)
        f.close()
        return embedding_matrix

    embedding_index = load_glove_index(glove_path)
    
    embedding_matrix = np.zeros((voca_size, num_embed))
    for word, i in word_dict.items():
        if i >= voca_size:
            continue
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    embedding_matrix = nd.array(embedding_matrix)
    save_file(embedding_matrix, filename)
    return embedding_matrix

num_embed = 300
embed_matrix_filename = 'embed.metrix.pkl'
glove_path = '../data/glove.42B.300d.txt'
embedding_matrix = create_embed(embed_matrix_filename, glove_path, num_embed)
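
The GloVe file is plain text with one word per line, followed by its vector components. The following minimal sketch parses three made-up lines in that format and fills a small embedding matrix the same way `create_embed` does. All names and numbers here are hypothetical, and plain lists replace the NumPy and MXNet arrays.

```python
import io

# Three made-up lines in GloVe's text format: a word, then its vector.
toy_glove = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
    "zebra 0.7 0.8 0.9\n"
)

def parse_glove(lines):
    """Return {word: [float, ...]} from an iterable of GloVe-formatted lines."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

toy_dict = {"great": 1, "movie": 2, "unseen": 3}
toy_voca_size, toy_dim = 4, 3

vectors = parse_glove(toy_glove)

# Row i holds the vector of the word with index i. Row 0 (padding) and
# words that GloVe does not contain stay all-zero.
matrix = [[0.0] * toy_dim for _ in range(toy_voca_size)]
for word, i in toy_dict.items():
    if i < toy_voca_size and word in vectors:
        matrix[i] = vectors[word]

for row in matrix:
    print(row)
```

Words that appear in GloVe but not in the review dictionary ("zebra" here) are simply ignored.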


##### 6. Shuffle the training data. The training data is ordered with all positive reviews first, followed by all negative reviews. If the model were trained in this order, the early batches would contain only positive examples, and it would not learn properly.

In [7]:
from sklearn.utils import shuffle
# Shuffle reviews and labels together so that each pair stays aligned.
# (Recent scikit-learn versions reject train_test_split with test_size=0.)
x_train, y_train = shuffle(train_data, train_all_labels, random_state=42)

x_test = test_data
y_test = test_all_labels
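
The important property of this step is that reviews and labels are reordered in exactly the same way. A minimal standard-library sketch of that idea, with made-up data and without scikit-learn:

```python
import random

# Made-up data in the same order as the tutorial: positives first.
toy_reviews = ["pos_a", "pos_b", "pos_c", "neg_a", "neg_b", "neg_c"]
toy_labels = [1, 1, 1, 0, 0, 0]

# Shuffle one list of positions, then apply it to both lists.
order = list(range(len(toy_reviews)))
random.Random(42).shuffle(order)  # fixed seed for a repeatable order

shuffled_reviews = [toy_reviews[i] for i in order]
shuffled_labels = [toy_labels[i] for i in order]

for review, label in zip(shuffled_reviews, shuffled_labels):
    print(label, review)
```

Shuffling the two lists independently would break the pairing between a review and its label, which is why a single permutation is shared.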

##### 7. Pad or truncate every review to the same length and convert the data to MXNet NDArrays.


In [8]:
seq_len = 250 # set the max word length of each movie review

# if a sentence is longer than max_len, truncate it
# if it is shorter, pad it on the right with value
def pad_sequences(sentences, max_len=500, value = 0):
    padded_sentences = []
    for sentence in sentences:
        new_sentence = []
        if (len(sentence) > max_len):
            new_sentence = sentence[:max_len]
            padded_sentences.append(new_sentence)
        else:
            new_sentence = np.append(sentence, [value]*(max_len-len(sentence)))
            padded_sentences.append(new_sentence)

    return padded_sentences

import mxnet as mx
context = mx.gpu()
X_train = nd.array(pad_sequences(x_train, max_len=seq_len, value=0), context)
X_test = nd.array(pad_sequences(x_test, max_len=seq_len, value=0), context)
Y_train = nd.array(y_train, context)
Y_test = nd.array(y_test, context)
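
The effect of padding and truncation is easiest to see on short lists. This is a minimal sketch with a hypothetical helper `pad_or_truncate` and a maximum length of 5 instead of 250:

```python
def pad_or_truncate(ids, max_len, value=0):
    """Return exactly max_len items: cut the tail or pad on the right with value."""
    if len(ids) > max_len:
        return list(ids[:max_len])
    return list(ids) + [value] * (max_len - len(ids))

short = pad_or_truncate([7, 3, 9], 5)
long_ = pad_or_truncate([7, 3, 9, 4, 2, 8, 1], 5)

print(short)  # padded on the right with zeros
print(long_)  # only the first five indices are kept
```

Using 0 as the padding value is safe here because the dictionary indices start at 1, so 0 never stands for a real word.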

## Create the Model

As you already know, an LSTM unit has the following form, which carries its state from one time step to the next:

![Alt text](../data/lstm_cell.png)

We will build a model from this LSTM unit and put a sigmoid function on top of it, so that the output lies between 0 and 1 and can be read as the probability that a review is positive.

![Alt text](../data/lstm_diagram.png)
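
To make the role of the saved state concrete, here is a minimal sketch of one LSTM time step using the standard gate equations, reduced to scalars so that it runs with the standard library alone. The weights are arbitrary illustrative values, not trained ones, and the function `lstm_step` is hypothetical; Gluon's `rnn.LSTM` works on vectors and matrices instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell. p maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new candidate value

    c = f * c_prev + i * g            # updated cell state (the memory)
    h = o * math.tanh(c)              # updated hidden state (the output)
    return h, c

# Arbitrary illustrative weights for all four gates.
gates = ("input", "forget", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in gates}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print("x=%+.1f  h=%+.4f  c=%+.4f" % (x, h, c))
```

The pair `(h, c)` returned by one step is fed into the next one, which is how information from earlier words can influence the output for later words.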

In [9]:
num_classes = 1
num_hidden = 16
learning_rate = .01
epochs = 10
batch_size = 100

from mxnet.gluon import nn, rnn

model = nn.Sequential()
with model.name_scope():
    model.embed = nn.Embedding(voca_size, num_embed)
    model.add(rnn.LSTM(num_hidden, layout = 'NTC', dropout=0.5, bidirectional=False))
    model.add(nn.Dense(num_classes))
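
It can help to follow the tensor shapes through the three layers. The sketch below only does the arithmetic with the hyperparameters from this tutorial. It assumes that the LSTM with layout `'NTC'` returns one hidden state per time step and that `nn.Dense` flattens everything except the batch axis, which is its default behavior.

```python
batch_size, seq_len = 100, 250
num_embed, num_hidden, num_classes = 300, 16, 1

shape = (batch_size, seq_len)          # integer word indices
print("input     ", shape)

shape = shape + (num_embed,)           # Embedding: one vector per word
print("embedding ", shape)

shape = shape[:2] + (num_hidden,)      # LSTM: one hidden state per time step
print("lstm      ", shape)

flat = shape[1] * shape[2]             # Dense flattens time and hidden axes
shape = (shape[0], num_classes)
print("dense     ", shape)

dense_params = flat * num_classes + num_classes  # weights plus bias
print("dense parameters:", dense_params)
```

Under these assumptions the final layer sees the hidden states of all 250 time steps, not only the last one.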

## Train & Evaluate the Model

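1. We will initialize the parameters with the Xavier initializer, a widely used default that scales the initial weights according to the size of each layer.
2. We will use AdaDelta as the optimizer. It adapts the step size of each parameter automatically, which often makes it easier to use than classic SGD. Note that the `learning_rate` defined above is not passed to the trainer.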
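
Both choices can be illustrated with a few lines of plain Python. This is a minimal sketch, not the MXNet implementation: `xavier_uniform_bound` assumes the uniform Xavier variant that averages fan-in and fan-out, and `adadelta_minimize` applies the AdaDelta update rule to a single scalar parameter. Both function names and the values of `rho` and `eps` are hypothetical.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Bound a of the uniform range [-a, a] used to draw the initial weights."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def adadelta_minimize(grad, x, steps, rho=0.9, eps=1e-6):
    """Run AdaDelta on one scalar parameter and return its final value."""
    avg_sq_grad, avg_sq_delta = 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step size comes from two running averages, not a learning rate.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += delta
    return x

# Initial weights for a layer with 300 inputs and 16 outputs.
bound = xavier_uniform_bound(fan_in=300, fan_out=16)
rng = random.Random(0)
weights = [rng.uniform(-bound, bound) for _ in range(5)]
print("bound:", round(bound, 4))

# Minimize f(x) = x**2, whose gradient is 2x, starting from x = 1.
x_final = adadelta_minimize(lambda x: 2 * x, x=1.0, steps=200)
print("x after 200 steps:", x_final)
```

The update rule contains no learning rate, which is consistent with the trainer below being created without one.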

In [None]:
from mxnet import gluon, autograd

model.collect_params().initialize(mx.init.Xavier(), ctx=context)

model.embed.weight.set_data(embedding_matrix.as_in_context(context))

trainer = gluon.Trainer(model.collect_params(), 'adadelta')

sigmoid_cross_entropy = gluon.loss.SigmoidBCELoss()

# helper function: evaluate accuracy
def eval_accuracy(x, y, batch_size):
    accuracy = mx.metric.Accuracy()

    for i in range(x.shape[0] // batch_size):
        data = x[i*batch_size:(i*batch_size + batch_size), ]
        target = y[i*batch_size:(i*batch_size + batch_size), ]

        output = model(data)
        # The model returns raw scores (logits); apply the sigmoid before thresholding at 0.5.
        predictions = nd.array([( 1 if out >= 0.5 else 0 ) for out in nd.sigmoid(output) ] , context)
        accuracy.update(preds=predictions, labels=target)

    return accuracy.get()[1]

# train
for epoch in range(epochs):

    for b in range(X_train.shape[0] // batch_size):
        data = X_train[b*batch_size:(b*batch_size + batch_size),]
        target = Y_train[b*batch_size:(b*batch_size + batch_size),]

        data = data.as_in_context(context)
        target = target.as_in_context(context)
        
        with autograd.record():
            output = model(data)
            L = sigmoid_cross_entropy(output, target)
        L.backward()
        trainer.step(data.shape[0])

    # filename = "lstm_net.params_epoch" + str(epoch)
    # model.save_params(filename)

    test_accuracy = eval_accuracy(X_test, Y_test, batch_size)
    train_accuracy = eval_accuracy(X_train, Y_train, batch_size)
    print("Epoch %s. Train_acc %s, Test_acc %s" %
          (epoch, train_accuracy, test_accuracy))
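
The model's final `Dense` layer outputs a raw score (a logit), and `SigmoidBCELoss` applies the sigmoid itself by default. The relationship between the raw score, the probability, the loss, and the accuracy can be sketched in plain Python as follows. The helper names and the four example scores are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, given the raw (pre-sigmoid) score."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def accuracy(logits, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if sigmoid(z) >= threshold else 0 for z in logits]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

toy_logits = [2.0, -1.0, 0.3, -0.2]  # made-up raw model outputs
toy_labels = [1, 0, 0, 0]

losses = [bce_from_logit(z, y) for z, y in zip(toy_logits, toy_labels)]
print("mean loss:", sum(losses) / len(losses))
print("accuracy: ", accuracy(toy_logits, toy_labels))
```

A probability of 0.5 corresponds to a raw score of 0, so thresholding the probability at 0.5 is the same as checking whether the raw score is non-negative.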

## Summary

In this tutorial, we performed sentiment analysis with a model built on an LSTM.  
The accuracy of the code above is about 0.86, and it is difficult to raise it much further. Adjusting the batch size, the optimizer, or the number of hidden units may give better accuracy, but the improvement is likely to be small.   
This reflects the limitations of this simple LSTM model; other models achieve better accuracy on sentiment analysis.