# Sentiment analysis with TFLearn - LSTM
In this notebook, I adapt Andrew Trask's work and Siraj's Sentimental Network by building an LSTM network for sentiment analysis on movie review data. I use TFLearn, a high-level library built on top of TensorFlow. TFLearn makes it simpler to build networks: you define the layers, and it takes care of most of the details for you.
We'll start off by importing all the modules we'll need, then load and prepare the data.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn

from tflearn.data_utils import to_categorical, pad_sequences

### Read the reviews and labels

In [2]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)

### Count word frequencies

In [3]:
from collections import Counter
total_counts = Counter()
for _, row in reviews.iterrows():
    total_counts.update(row[0].split(' '))
print("Unique words in data set: ", len(total_counts))

Unique words in data set:  74074


In [4]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000]
print(vocab[:60])

['', 'the', '.', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you', 'on', 't', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who', 'so', 'from', 'like', 'there', 'her', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more', 'she', 'when', 'very', 'up', 'time', 'no']
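
To see what the counting and vocabulary cells do on a small scale, here is a minimal sketch on a made-up three-review corpus. The `toy_reviews` list and the vocabulary size of 4 are assumptions for illustration only; the notebook itself reads `reviews.txt` and keeps 10000 words.

```python
from collections import Counter

# Hypothetical toy corpus standing in for reviews.txt
toy_reviews = [
    "the movie was good",
    "the movie was bad",
    "the plot was good",
]

# Count how often each space-separated token appears
toy_counts = Counter()
for review in toy_reviews:
    toy_counts.update(review.split(' '))

# Keep only the 4 most frequent words, most frequent first
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)[:4]
print(toy_counts.most_common(2))
print(toy_vocab)
```

Words outside the kept vocabulary ("bad" and "plot" here) are simply dropped from the list, which is what happens to the rarest words in the real data set.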


### Create the word-to-index mapping

In [5]:
word2idx = {word: i for i, word in enumerate(vocab)}

### Transform sentences into vectors

In [7]:
def text_to_vector2(text):
    """Map each space-separated token of `text` to its index in word2idx."""
    words = text.split(' ')
    word_vector = np.zeros(len(words), dtype=np.int_)
    for j, word in enumerate(words):
        # Words outside the vocabulary keep the value 0,
        # the same value that is later used for padding
        word_vector[j] = word2idx.get(word, 0)
    return word_vector
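
As a quick illustration of the encoding, here is a simplified sketch that uses plain Python lists instead of NumPy. The miniature `toy_word2idx` dictionary and the `encode` helper are hypothetical names I made up for this example; they mirror the lookup logic above but are not part of the notebook.

```python
# Hypothetical miniature vocabulary; in the notebook, word2idx has 10000 entries
toy_word2idx = {'': 0, 'the': 1, 'movie': 2, 'was': 3, 'good': 4}

def encode(text, word2idx):
    """Return one index per space-separated token; unknown tokens map to 0."""
    return [word2idx.get(word, 0) for word in text.split(' ')]

encoded = encode("the movie was surprisingly good", toy_word2idx)
print(encoded)  # [1, 2, 3, 0, 4]
```

Note that "surprisingly" is not in the toy vocabulary, so it becomes 0 and is indistinguishable from the empty token at index 0.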

### Transform the sequence of arrays into a two-dimensional array

In [12]:
# Convert every review to a vector of word indices
word_vectors2 = [text_to_vector2(row[0]) for _, row in reviews.iterrows()]

# Pad or truncate every vector to exactly 100 entries
pad_word_vectors2 = pad_sequences(word_vectors2, maxlen=100, value=0.)
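
The padding step is easier to picture on tiny inputs. The sketch below is a plain-Python stand-in, not TFLearn's implementation: `pad_post` is a hypothetical helper, and it assumes that zeros are appended at the end and that long sequences are cut at the end, which I believe matches the defaults of `pad_sequences` but is worth checking against the TFLearn documentation.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen items, then append `value` up to maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncate long sequences
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad short ones
    return padded

toy_sequences = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)  # [[5, 8, 2, 0], [7, 1, 4, 9]]
```

With `maxlen=100`, any review longer than 100 tokens loses the rest of its text, so the network only ever sees a fixed-size window of each review.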

In [None]:
# Show the first 23 entries of the first 5 padded word vectors
pad_word_vectors2[:5, :23]

### Create the train and test sets

In [17]:
Y = (labels=='positive').astype(np.int_)
records = len(labels)

shuffle = np.arange(records)
np.random.shuffle(shuffle)
train_fraction = 0.9  # 90% of the records go to training, the remaining 10% to testing

split_idx = int(records * train_fraction)
train_split, test_split = shuffle[:split_idx], shuffle[split_idx:]
trainX, trainY = pad_word_vectors2[train_split,:], to_categorical(Y.values[train_split], 2)
testX, testY = pad_word_vectors2[test_split,:], to_categorical(Y.values[test_split], 2)
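
Here is a small sketch of the same shuffle-and-split idea on ten made-up labels, using only the standard library. The names `toy_labels`, `one_hot`, and `toy_train_share` are hypothetical, and `one_hot` is my own stand-in for what `to_categorical(..., 2)` produces, assuming label 1 (positive) lands in column 1.

```python
import random

def one_hot(label, n_classes=2):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    row = [0.0] * n_classes
    row[label] = 1.0
    return row

toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Shuffle the record indices, then cut them at the 90% mark
indices = list(range(len(toy_labels)))
random.seed(0)  # fixed seed so the sketch is repeatable
random.shuffle(indices)

toy_train_share = 0.9
cut = int(len(indices) * toy_train_share)
train_idx, test_idx = indices[:cut], indices[cut:]

toy_trainY = [one_hot(toy_labels[i]) for i in train_idx]
toy_testY = [one_hot(toy_labels[i]) for i in test_idx]
print(len(toy_trainY), len(toy_testY))  # 9 1
```

Because the indices are shuffled before the cut, the two sets never overlap and together cover every record.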

### Build the network

In [18]:
# Network building
net = tflearn.input_data([None, 100])
net = tflearn.embedding(net, input_dim=10000, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')
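
To get a feel for the size of this network, the sketch below estimates the number of trainable parameters per layer. It assumes a textbook LSTM cell (four gates, each with a weight matrix over the concatenated input and hidden state, plus a bias); TFLearn's actual variable layout may differ, so treat the numbers as an estimate rather than a readout of the model.

```python
vocab_size = 10000  # input_dim of the embedding layer
embed_dim = 128     # output_dim of the embedding layer
lstm_units = 128    # number of LSTM units
n_classes = 2       # size of the softmax output

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# Assumed standard LSTM: 4 gates, each with weights over
# [input, hidden state] and one bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Fully connected layer: weights plus biases
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions, the embedding table accounts for the large majority of the parameters.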

### Train the network

In [19]:
# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
          batch_size=32)

Training Step: 7039  | total loss: 0.06520 | time: 143.965s
| Adam | epoch: 010 | loss: 0.06520 - acc: 0.9843 -- iter: 22496/22500
Training Step: 7040  | total loss: 0.07942 | time: 149.196s
| Adam | epoch: 010 | loss: 0.07942 - acc: 0.9827 | val_loss: 0.79218 - val_acc: 0.8120 -- iter: 22500/22500
--
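
The step counter in the log can be checked with a little arithmetic. In this sketch I take the number of training samples and the epoch count from the log lines and the batch size from the `fit` call; the variable names are my own.

```python
import math

n_train = 22500   # from "iter: 22500/22500" in the log
batch_size = 32   # passed to model.fit
n_epochs = 10     # from "epoch: 010" in the log

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, last_batch_size, total_steps)  # 704 4 7040
```

This agrees with the final "Training Step: 7040" line, and the 4-sample last batch explains why the previous step stops at 22496.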


In [20]:
# Column 0 holds P(negative), so this compares "predicted negative" with "labeled negative"
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.812
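
The accuracy computation can be traced by hand on four made-up predictions. The `toy_probs` and `toy_onehot` values below are invented for illustration, with each row ordered as negative then positive.

```python
# Hypothetical softmax outputs, one row per review: [P(negative), P(positive)]
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
# Hypothetical one-hot labels in the same column order
toy_onehot = [[1, 0], [0, 1], [0, 1], [0, 1]]

# Threshold column 0 of the predictions and read column 0 of the labels
pred_negative = [int(p[0] >= 0.5) for p in toy_probs]
true_negative = [y[0] for y in toy_onehot]

matches = [p == t for p, t in zip(pred_negative, true_negative)]
toy_accuracy = sum(matches) / len(matches)
print(toy_accuracy)  # 0.75
```

Only the third review is misclassified (predicted negative, labeled positive), so three of four predictions are correct.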


In [31]:
# Helper function that uses the trained model to predict sentiment
def test_sentence(sentence):
    # Vectorize and pad the sentence the same way as the training reviews
    s = [text_to_vector2(sentence.lower())]
    sent_pad = pad_sequences(s, maxlen=100, value=0.)

    # Column 1 of the softmax output is the probability of the positive class
    positive_prob = model.predict(sent_pad)[0][1]
    print('Sentence: {}'.format(sentence))
    print('P(positive) = {:.3f} :'.format(positive_prob),
          'Positive' if positive_prob > 0.5 else 'Negative')

### Test the network

In [41]:
sentence = "Moonlight is by far the best great movie of 2016."
test_sentence(sentence)

sentence = "It's amazing anyone could be talented enough to make something this spectacularly awful"
test_sentence(sentence)

Sentence: Moonlight is by far the best great movie of 2016.
P(positive) = 0.997 : Positive
Sentence: It's amazing anyone could be talented enough to make something this spectacularly awful
P(positive) = 0.016 : Negative
