## Movie reviews - LSTM sentiment analysis

In this mini project we implement a model for sentiment analysis: given movie reviews from IMDB, we predict the sentiment of each review (positive or negative).

In [1]:
import os
import tarfile
from six.moves import urllib
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [3]:
TAR_FILE_NAME = 'ImdbReviews.tar.gz'
DIR_NAME = "aclImdb"
URL_PATH = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
MAX_SEQUENCE_LENGTH = 100

### Data

First we will download and unpack the dataset provided by the Stanford Artificial Intelligence Laboratory.

In [4]:
def download_file(url):
    # Skip the download if the archive has already been extracted
    if not os.path.exists(DIR_NAME):
        urllib.request.urlretrieve(url, TAR_FILE_NAME)
        # The with-statement closes the archive automatically
        with tarfile.open(TAR_FILE_NAME) as tar:
            tar.extractall()

Now we will loop over each .txt file and collect its text. Positive reviews are in the directory "/train/pos/" and negative ones in "/train/neg/". Once the reviews are extracted and labeled, we will shuffle them so that positive and negative examples are mixed.
Let's define our functions for extraction and shuffling.

In [5]:
def get_reviews_from_files(dirname, positive = True):
    label = 1 if positive else 0
    
    reviews = []
    labels = []
    for filename in os.listdir(dirname):
        if filename.endswith(".txt"):
            # Read-only access is enough; os.path.join also works if dirname lacks a trailing slash
            with open(os.path.join(dirname, filename), 'r', encoding="utf8") as f:
                review = f.read()
                review = review.lower().replace("<br />", " ")
                
                reviews.append(review)
                labels.append(label)
    
    return reviews, labels

In [6]:
def shuffle(x, y):
    np.random.seed(1)
    shuffle_indices = np.random.permutation(np.arange(len(x)))
    
    x_shuffled = x[shuffle_indices]
    y_shuffled = y[shuffle_indices]
    
    return x_shuffled, y_shuffled
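
The key point of the shuffling step is that the same permutation is applied to both arrays, so every review keeps its own label. A minimal sketch on toy data (the names `features`, `targets` and `order` are illustrative only and do not appear in the notebook):

```python
import numpy as np

# Toy data: each label equals the first value of its feature row, so any
# misalignment after shuffling would be easy to spot.
features = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
targets = np.array([0, 1, 2, 3])

rng = np.random.RandomState(1)          # fixed seed for reproducibility
order = rng.permutation(len(features))  # one permutation shared by both arrays

features_shuffled = features[order]
targets_shuffled = targets[order]

# Every row still sits next to its original label.
pairs_aligned = bool(np.all(features_shuffled[:, 0] == targets_shuffled))
print(pairs_aligned)
```

Indexing with an array of indices works on NumPy arrays but not on plain Python lists, so the inputs have to be converted to arrays before they are shuffled this way.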

To feed text into an LSTM, we have to build a vocabulary and map each sequence of words to the sequence of ids of those words in the vocabulary.
To do this we will use VocabularyProcessor, which also trims or pads every review to MAX_SEQUENCE_LENGTH ids.

In [7]:
def get_data():
    
    pos_reviews, pos_labels = get_reviews_from_files(DIR_NAME + "/train/pos/", positive = True)
    neg_reviews, neg_labels = get_reviews_from_files(DIR_NAME + "/train/neg/", positive = False)
    
    labels = np.array(pos_labels + neg_labels)
    data = pos_reviews + neg_reviews
    
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_SEQUENCE_LENGTH)
    data = np.array(list(vocab_processor.fit_transform(data)))
    data, labels = shuffle(data, labels)
    return data, labels, len(vocab_processor.vocabulary_)
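
To make the word-to-id mapping concrete, here is a simplified stand-in written in plain Python. It is only a sketch: the helper names `build_vocabulary` and `texts_to_ids` are hypothetical, tokens are split on whitespace, and id 0 is assumed to serve both as padding and as the unknown-word id. The real VocabularyProcessor uses its own tokenizer, so the ids it produces will differ.

```python
import numpy as np

def build_vocabulary(texts):
    """Assign an integer id to every distinct whitespace-separated token.

    Id 0 is reserved for padding and unknown words (an assumption of this sketch).
    """
    vocabulary = {"<UNK>": 0}
    for text in texts:
        for token in text.split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def texts_to_ids(texts, vocabulary, max_length):
    """Map each text to a fixed-length row of ids: trim long texts, zero-pad short ones."""
    ids = np.zeros((len(texts), max_length), dtype=np.int64)
    for row, text in enumerate(texts):
        tokens = text.split()[:max_length]
        for col, token in enumerate(tokens):
            ids[row, col] = vocabulary.get(token, 0)
    return ids

toy_reviews = ["a great movie", "a dull movie that never ends"]
toy_vocab = build_vocabulary(toy_reviews)
toy_ids = texts_to_ids(toy_reviews, toy_vocab, max_length=4)
print(toy_ids)
```

The first review is shorter than `max_length` and gets padded with 0; the second is longer and loses its last two words.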


Now we will use our helper functions to prepare the dataset and split it into a training set and a test set.

In [9]:
download_file(URL_PATH)
x_train, y_train, vocabulary_size = get_data()
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size = 0.20, random_state = 0)

### Model

Now we will implement our LSTM model. It has 3 layers:
    - Embedding layer with vectors of length 64
    - LSTM layer with 64 memory units and a dropout keep probability of 0.7
    - Dense output layer with two units (one per class); the softmax is applied inside the loss function

In [16]:
num_epochs = 10
batch_size = 25
embedding_size = 64
max_label = 2

Define the placeholders that will hold the input and the target labels of our model.

In [9]:
tf.reset_default_graph()

x = tf.placeholder(tf.int32, [None, MAX_SEQUENCE_LENGTH])
y = tf.placeholder(tf.int32, [None])

Define the embedding matrix and look up the embedding vector for each word id in the input.

In [10]:
embedding_matrix = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embeddings = tf.nn.embedding_lookup(embedding_matrix, x)
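
An embedding lookup is simply row selection from the embedding matrix. The NumPy sketch below illustrates this with toy sizes; all `toy_` names are illustrative and the sizes are much smaller than those used in the notebook.

```python
import numpy as np

toy_vocabulary_size = 5
toy_embedding_size = 3

rng = np.random.RandomState(0)
# One row per word id, initialised uniformly in [-1, 1).
toy_embedding_matrix = rng.uniform(
    -1.0, 1.0, size=(toy_vocabulary_size, toy_embedding_size))

# A batch of 2 sequences, each 4 word ids long.
toy_ids = np.array([[1, 2, 3, 0],
                    [1, 4, 3, 0]])

# Integer-array indexing picks the matching matrix row for every id.
toy_embeddings = toy_embedding_matrix[toy_ids]
print(toy_embeddings.shape)  # (2, 4, 3)
```

The result has shape (batch size, sequence length, embedding size), which is the layout the recurrent layer consumes.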

Define the LSTM cell with a number of units equal to our embedding vector size, and add dropout with a keep probability of 0.7 on the LSTM output.

In [11]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(embedding_size)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob = 0.7)
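
For intuition, one time step of a textbook LSTM cell followed by output dropout can be sketched in NumPy. This is not the library's implementation: the gate ordering, the weight layout, and the helper names `lstm_step` and `dropout` are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell.

    W has shape (input_size + num_units, 4 * num_units), b has shape (4 * num_units,).
    """
    num_units = h_prev.shape[1]
    z = np.concatenate([x_t, h_prev], axis=1) @ W + b
    i = sigmoid(z[:, 0 * num_units:1 * num_units])  # input gate
    f = sigmoid(z[:, 1 * num_units:2 * num_units])  # forget gate
    o = sigmoid(z[:, 2 * num_units:3 * num_units])  # output gate
    g = np.tanh(z[:, 3 * num_units:4 * num_units])  # candidate values
    c_t = f * c_prev + i * g      # new cell state
    h_t = o * np.tanh(c_t)        # new hidden state (the cell's output)
    return h_t, c_t

def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each entry with probability 1 - keep_prob, rescale the rest."""
    mask = rng.uniform(size=values.shape) < keep_prob
    return values * mask / keep_prob

rng = np.random.RandomState(0)
batch, input_size, num_units = 2, 3, 4
W = rng.normal(scale=0.1, size=(input_size + num_units, 4 * num_units))
b = np.zeros(4 * num_units)
x_t = rng.normal(size=(batch, input_size))
h_prev = np.zeros((batch, num_units))
c_prev = np.zeros((batch, num_units))

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
h_dropped = dropout(h_t, keep_prob=0.7, rng=rng)
print(h_t.shape, c_t.shape)
```

Note that dropout here touches only the output `h_t`; the cell state `c_t` is passed on unchanged.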

Run the LSTM network with the defined cell and embeddings. As the output layer we define a dense layer with two units, one per label (positive/negative).

In [12]:
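# dynamic_rnn returns (outputs, final_state); for an LSTM the final state is the tuple (c, h),
# so `encoding` here is the final cell state c
_, (encoding, _) = tf.nn.dynamic_rnn(lstmCell, embeddings, dtype=tf.float32)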
logits = tf.layers.dense(encoding, max_label, activation = None)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits = logits, labels = y)
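
The loss applies a softmax to the logits and takes the negative log-probability of the true class, with labels given as integer class ids rather than one-hot vectors. A NumPy sketch of the same computation (the function name and toy values are illustrative only):

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy for integer class labels."""
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in every row.
    return -log_probs[np.arange(len(labels)), labels]

toy_logits = np.array([[2.0, 0.0],   # fairly confident in class 0
                       [0.0, 0.0]])  # undecided
toy_labels = np.array([0, 1])

toy_losses = sparse_softmax_cross_entropy(toy_logits, toy_labels)
print(toy_losses)
```

The undecided row gives a loss of log(2), the value a two-class model gets when it assigns probability 0.5 to the true class.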

Now we will define our loss function, an optimizer, and a train step that minimizes the loss.
Additionally we will define a per-example correctness check (`prediction`) and the accuracy, so that we can measure the performance of our model.

In [13]:
loss = tf.reduce_mean(cross_entropy)

prediction = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))

optimizer = tf.train.AdamOptimizer(0.001)
train_step = optimizer.minimize(loss)
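
The accuracy computation can be traced by hand on a few toy logits. The NumPy sketch below mirrors the steps: take the index of the largest logit, compare it with the label, and average the matches (all names and values are illustrative).

```python
import numpy as np

toy_logits = np.array([[2.0, -1.0],
                       [0.5, 1.5],
                       [0.3, 0.1],
                       [-2.0, 0.0]])
toy_labels = np.array([0, 1, 1, 1])

predicted_class = toy_logits.argmax(axis=1)    # index of the largest logit per row
is_correct = predicted_class == toy_labels     # boolean per example
toy_accuracy = is_correct.astype(np.float32).mean()
print(predicted_class, toy_accuracy)
```

Three of the four toy examples are classified correctly, so the accuracy is 0.75.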

### Training

Now we can implement our training session. In each epoch we run mini-batch training over the whole training set and then evaluate the model on the test set.

In [17]:
init = tf.global_variables_initializer()

with tf.Session() as session:
    init.run()
    
    for epoch in range(num_epochs):
        
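        # Ceiling division: avoids an empty final batch when len(x_train) is a multiple of batch_size
        num_batches = (len(x_train) + batch_size - 1) // batch_size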
        
        for batch in range(num_batches):
            
            min_batch_x = batch * batch_size
            max_batch_x = np.min([len(x_train), ((batch+1) * batch_size)])
           
            x_train_batch = x_train[min_batch_x:max_batch_x]
            y_train_batch = y_train[min_batch_x:max_batch_x]
            
            train_dict = {x: x_train_batch, y: y_train_batch}
            session.run(train_step, feed_dict = train_dict)
            
            train_loss, train_acc = session.run([loss, accuracy], feed_dict = train_dict)
            
            
        
        test_dict = {x: x_test, y: y_test}
        
        test_loss, test_acc = session.run([loss, accuracy], feed_dict=test_dict)
        print('Epoch: {}, Test Loss: {:.2}, Test Acc: {:.5}'.format(epoch + 1, test_loss, test_acc))
    

Epoch: 1, Test Loss: 0.62, Test Acc: 0.7378
Epoch: 2, Test Loss: 0.47, Test Acc: 0.7954
Epoch: 3, Test Loss: 0.5, Test Acc: 0.806
Epoch: 4, Test Loss: 0.66, Test Acc: 0.8054
Epoch: 5, Test Loss: 0.76, Test Acc: 0.7978
Epoch: 6, Test Loss: 0.8, Test Acc: 0.7944
Epoch: 7, Test Loss: 0.97, Test Acc: 0.8028
Epoch: 8, Test Loss: 0.93, Test Acc: 0.7996
Epoch: 9, Test Loss: 1.4, Test Acc: 0.799
Epoch: 10, Test Loss: 1.2, Test Acc: 0.7956
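
The batch boundaries used in a mini-batch loop like the one above can also be produced by a small generator. This is a sketch with a hypothetical helper name, `batch_slices`; stepping through the data with `range` never yields an empty batch, even when the number of examples is an exact multiple of the batch size.

```python
def batch_slices(num_examples, batch_size):
    """Yield (start, end) index pairs that cover all examples exactly once."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)

uneven = list(batch_slices(num_examples=7, batch_size=3))
even = list(batch_slices(num_examples=6, batch_size=3))
print(uneven)  # [(0, 3), (3, 6), (6, 7)]
print(even)    # [(0, 3), (3, 6)]
```

Each pair can be used directly as `x_train[start:end]` and `y_train[start:end]`.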
