# Sentiment analysis with TFLearn

In this notebook, instead of building an artificial neural network (ANN) by hand with NumPy, we use [TFLearn](http://tflearn.org/), a high-level library built on top of TensorFlow. TFLearn makes it simpler to build networks: you define the layers, and it takes care of most of the details.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
from collections import Counter

First we load the data using pandas.

In [2]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
print(reviews.shape)
print(reviews.iloc[0])

(25000, 1)
0    bromwell high is a cartoon comedy . it ran at ...
Name: 0, dtype: object


In [3]:
total_counts = Counter()
for idx, row in reviews.iterrows():
    for l in row.str.split(' '):
        for word in l:
            total_counts[word.lower()] += 1
print(len(total_counts))

74074
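
The counting loop above can also be written without `iterrows`, by feeding each tokenized review to `Counter.update`. Here is a minimal sketch on a made-up two-review list (`sample_reviews` and `word_counts` are hypothetical names, standing in for the first column of `reviews` and for `total_counts`):

```python
from collections import Counter

# Hypothetical stand-in for the review texts (column 0 of `reviews` in the notebook).
sample_reviews = ["a great movie . a great cast", "not a great movie"]

word_counts = Counter()
for text in sample_reviews:
    # Same tokenization as above: lowercase, then split on single spaces.
    word_counts.update(text.lower().split(' '))

print(word_counts['great'])  # 3
print(len(word_counts))      # 6 distinct tokens
```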


In [4]:
n = len(total_counts)
print(n)

74074


Here we sort the words by their frequency in the reviews and keep the 10000 most frequent ones as the vocabulary.

In [5]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000]
print(vocab[:60])
print(vocab[-1], ': ', total_counts[vocab[-1]])

['', 'the', '.', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you', 'on', 't', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who', 'so', 'from', 'like', 'there', 'her', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more', 'she', 'when', 'very', 'up', 'time', 'no']
offspring :  30
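
`Counter` also provides `most_common`, which returns the highest-count `(word, count)` pairs directly. Here is a small sketch with made-up counts (`toy_counts` and `toy_vocab` are illustrative, not the notebook's data):

```python
from collections import Counter

toy_counts = Counter({'the': 5, 'movie': 3, 'good': 2, 'offspring': 1})

# Keep the 2 most frequent words, analogous to the [:10000] slice above.
toy_vocab = [word for word, _ in toy_counts.most_common(2)]
print(toy_vocab)  # ['the', 'movie']
```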


Here we map each word to its index in the vector.

In [6]:
word2idx = {}
for i, word in enumerate(vocab):
    word2idx[word] = i
print(word2idx['and'])

3


In [7]:
def text2vec(text):
    vec = np.zeros(len(vocab))
    for word in text.lower().split(' '):
        # Dictionary lookup; scanning the `vocab` list for every word is much slower.
        if word in word2idx:
            vec[word2idx[word]] += 1
    return vec

print(text2vec('The is on not have be one they'))
print(text2vec('The tea is for a party to celebrate the movie so she has no time for a cake'))

[ 0.  1.  0. ...,  0.  0.  0.]
[ 0.  2.  0. ...,  0.  0.  0.]
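
To see what `text2vec` computes, here is the same bag-of-words counting on a toy five-word vocabulary. The names `toy_vocab`, `toy_word2idx` and `toy_text2vec` are hypothetical and used only for this illustration. Words that are not in the vocabulary are skipped.

```python
import numpy as np

toy_vocab = ['the', 'movie', 'is', 'good', 'bad']
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}

def toy_text2vec(text):
    """Count how often each vocabulary word occurs in `text`."""
    vec = np.zeros(len(toy_vocab))
    for word in text.lower().split(' '):
        idx = toy_word2idx.get(word)  # None for out-of-vocabulary words
        if idx is not None:
            vec[idx] += 1
    return vec

# 'the' occurs twice, 'bad' never, and 'END' is not in the vocabulary.
print(toy_text2vec('The movie is good the END'))
```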


Convert the reviews into a two-dimensional array. Each row corresponds to one review.

In [8]:
wvec = np.zeros((len(reviews), len(vocab)))
for idx, row in reviews.iterrows():
    wvec[idx] = text2vec(row[0])

In [9]:
print(wvec[0:2,])

[[ 18.   9.  27. ...,   0.   0.   0.]
 [  5.   4.   8. ...,   0.   0.   0.]]


Construct the training and test sets.

In [23]:
n = len(labels)
Y = np.zeros(n)
for i, row in labels.iterrows():
    if row[0].lower() == 'positive':
        Y[i] = 1

In [73]:
np.sum(Y == 1)

12500

In [75]:
train = np.random.choice(n, size=int(0.8 * n), replace=False)
test = np.setdiff1d(np.arange(n), train)  # indices not chosen for training
train_reviews, test_reviews = wvec[train,:], wvec[test,:]
train_labels, test_labels = Y[train], Y[test]

In [24]:
shuffle = np.arange(n)
np.random.shuffle(shuffle)
train_fraction = 0.9  # fraction of the data used for training; the rest is the test set

split_at = int(n * train_fraction)
train_split, test_split = shuffle[:split_at], shuffle[split_at:]
trainX, trainY = wvec[train_split,:], to_categorical(Y[train_split], 2)
testX, testY = wvec[test_split,:], to_categorical(Y[test_split], 2)
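
The shuffle-and-slice split can be sketched on a small index array. The seed and the names `toy_n`, `toy_perm` and `cut` are assumptions for illustration; the notebook itself does not fix a seed.

```python
import numpy as np

rng = np.random.RandomState(0)     # seed chosen only to make the sketch repeatable
toy_n = 10
toy_perm = rng.permutation(toy_n)  # shuffled indices 0..9

cut = int(toy_n * 0.9)             # 90% for training, the rest for testing
toy_train_idx, toy_test_idx = toy_perm[:cut], toy_perm[cut:]

print(len(toy_train_idx), len(toy_test_idx))  # 9 1
# Every index lands in exactly one of the two sets.
print(sorted(np.concatenate([toy_train_idx, toy_test_idx]).tolist()) == list(range(toy_n)))  # True
```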

In [25]:
trainY

array([[ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       ..., 
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])
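
`to_categorical(Y, 2)` one-hot encodes the labels, so a label of 0 (negative) becomes `[1, 0]` and a label of 1 (positive) becomes `[0, 1]`. The same encoding can be sketched in plain NumPy. This illustrates the result and is not TFLearn's implementation; `toy_labels` and `one_hot` are made-up names.

```python
import numpy as np

toy_labels = np.array([1, 0, 1])   # 1 = positive, 0 = negative
one_hot = np.eye(2)[toy_labels]    # row i of the identity matrix encodes class i
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
```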

## Building the network

[TFLearn](http://tflearn.org/) lets you build the network by [defining the layers](http://tflearn.org/layers/core/). 

### Input layer

For the input layer, you just need to tell it how many units you have. For example, 

```
net = tflearn.input_data([None, 100])
```

would create a network with 100 input units. The first element in the list, `None` in this case, is the batch dimension. Setting it to `None` leaves the batch size unspecified, so the network accepts batches of any size.

The number of inputs to your network needs to match the size of your data. For this example, we encode each review as a 10000-element vector, so we need 10000 input units.


### Adding layers

To add new hidden layers, you use 

```
net = tflearn.fully_connected(net, n_units, activation='ReLU')
```

This adds a fully connected layer where every unit in the previous layer is connected to every unit in this layer. The first argument `net` is the network you created in the `tflearn.input_data` call. It tells the network to use the output of the previous layer as the input to this layer. You can set the number of units in the layer with `n_units`, and set the activation function with the `activation` keyword. You can keep adding layers to your network by repeatedly calling `net = tflearn.fully_connected(net, n_units)`.
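
Conceptually, a fully connected layer multiplies its input by a weight matrix, adds a bias, and applies the activation. Here is a NumPy sketch of that forward pass with made-up sizes and random weights (the names here are illustrative and not part of TFLearn):

```python
import numpy as np

rng = np.random.RandomState(0)

def dense_relu(x, W, b):
    """Forward pass of one fully connected layer with ReLU activation."""
    return np.maximum(0, x @ W + b)

x = rng.rand(4, 100)         # batch of 4 samples, 100 input units
W = rng.randn(100, 5) * 0.1  # weights connecting 100 inputs to 5 hidden units
b = np.zeros(5)              # one bias per hidden unit

hidden = dense_relu(x, W, b)
print(hidden.shape)          # (4, 5)
```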

### Output layer

The last layer you add is used as the output layer. Therefore, you need to set the number of units to match the target data. In this case we are predicting two classes, positive or negative sentiment. You also need to set the activation function so it's appropriate for your model. Again, we're trying to predict if some input data belongs to one of two classes, so we should use softmax.

```
net = tflearn.fully_connected(net, 2, activation='softmax')
```
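
Softmax turns the two output activations into probabilities that sum to 1. A minimal NumPy sketch (illustrative, not TFLearn's code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

probs = softmax(np.array([[2.0, 0.0], [0.0, 0.0]]))
print(probs.sum(axis=1))  # each row sums to 1
print(probs[1])           # equal activations give equal probabilities
```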

### Training
To set how you train the network, use 

```
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
```

Again, this is passing in the network you've been building. The keywords: 

* `optimizer` sets the training method, here stochastic gradient descent
* `learning_rate` is the learning rate
* `loss` determines how the network error is calculated, here with categorical cross-entropy.
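
To make these settings concrete, here is a NumPy sketch of the categorical cross-entropy and of a single gradient descent update. The function names and numbers are illustrative assumptions; TFLearn computes both internally.

```python
import numpy as np

def categorical_crossentropy(probs, targets, eps=1e-12):
    """Mean of -sum(target * log(prob)) over the batch."""
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=1))

def sgd_step(weights, grad, learning_rate=0.1):
    """Move the weights a small step against the gradient."""
    return weights - learning_rate * grad

targets = np.array([[0.0, 1.0], [1.0, 0.0]])   # one-hot labels
good = np.array([[0.1, 0.9], [0.8, 0.2]])      # mostly correct predictions
bad = np.array([[0.9, 0.1], [0.2, 0.8]])       # mostly wrong predictions

loss_good = categorical_crossentropy(good, targets)
loss_bad = categorical_crossentropy(bad, targets)
print(loss_good < loss_bad)  # True: better predictions give a lower loss

new_weights = sgd_step(np.array([1.0, -2.0]), np.array([0.5, -0.5]))
print(new_weights)           # each weight moved by 0.1 * gradient
```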

Finally you put all this together to create the model with `tflearn.DNN(net)`. So it ends up looking something like 

```
net = tflearn.input_data([None, 10])                          # Input
net = tflearn.fully_connected(net, 5, activation='ReLU')      # Hidden
net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
model = tflearn.DNN(net)
```

> **Exercise:** Below in the `build_model()` function, you'll put together the network using TFLearn. You get to choose how many layers to use, how many hidden units, etc.

In [29]:
# Network building
def build_model():
    # This resets all parameters and variables, leave this here
    tf.reset_default_graph()
    
    net = tflearn.input_data([None, len(vocab)])                # Input
    net = tflearn.fully_connected(net, 5, activation='ReLU')    # Hidden
    net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.01, loss='categorical_crossentropy')
    model = tflearn.DNN(net)
    return model

## Initializing the model

Next we need to call the `build_model()` function to actually build the model. In my solution I haven't included any arguments to the function, but you can add arguments so you can change parameters in the model if you want.

> **Note:** You might get a bunch of warnings here. TFLearn uses a lot of deprecated code in TensorFlow. Hopefully it gets updated to the new TensorFlow version soon.

In [30]:
model = build_model()

## Training the network

Now that we've constructed the network, saved as the variable `model`, we can fit it to the data with the `model.fit` method. You pass in the training features `trainX` and the training targets `trainY`. Below I set `validation_set=0.1`, which reserves 10% of the training data as the validation set. You can also set the batch size and number of epochs with the `batch_size` and `n_epoch` keywords, respectively. Below is the code to fit the network to our word vectors.

You can rerun `model.fit` to train the network further if you think you can increase the validation accuracy. Remember, all hyperparameter adjustments must be done using the validation set. **Only use the test set after you're completely done training the network.**

In [35]:
# Training
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=20)

Training Step: 4769  | total loss: 0.42440 | time: 2.073s
| SGD | epoch: 030 | loss: 0.42440 - acc: 0.8195 -- iter: 20224/20250
Training Step: 4770  | total loss: 0.43760 | time: 3.093s
| SGD | epoch: 030 | loss: 0.43760 - acc: 0.8126 | val_loss: 0.45457 - val_acc: 0.8053 -- iter: 20250/20250
--
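
The step counter in the log follows from the data sizes: 25000 reviews, 90% kept for training, 10% of those held out for validation, and batches of 128. The arithmetic below reproduces the numbers in the log. The log shows epoch 030 although the cell asks for 20 epochs, which is consistent with `model.fit` having been run more than once in the same session.

```python
import math

n_reviews = 25000
n_train = int(n_reviews * 0.9)             # 22500 rows in trainX
n_fit = n_train - int(n_train * 0.1)       # 20250 rows left after validation_set=0.1
steps_per_epoch = math.ceil(n_fit / 128)   # 159 batches, the last one smaller

print(n_fit, steps_per_epoch)              # 20250 159
print(4770 / steps_per_epoch)              # 30.0 epochs at training step 4770
```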


## Testing
After tuning the hyperparameters, run the network on the test set to measure its performance. Remember, only do this after finalizing the hyperparameters.

In [36]:
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8112
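
The accuracy above thresholds column 0 (the negative class) of both the predictions and the targets. With two classes this agrees with comparing the most probable class per row, a form that also works for more than two classes. Here is a NumPy sketch with made-up probabilities (`toy_probs` and `toy_targets` are hypothetical):

```python
import numpy as np

toy_probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]])
toy_targets = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])   # one-hot labels

# Thresholding column 0, as in the cell above.
acc_threshold = np.mean((toy_probs[:, 0] >= 0.5).astype(np.int_) == toy_targets[:, 0])
# Comparing the most probable class per row.
acc_argmax = np.mean(toy_probs.argmax(axis=1) == toy_targets.argmax(axis=1))

print(acc_threshold, acc_argmax)  # 0.75 0.75
```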


## Let us do some predictions

In [37]:
def test_sentence(sentence):
    positive_prob = model.predict([text2vec(sentence.lower())])[0][1]
    print('Sentence: {}'.format(sentence))
    print('P(positive) = {:.3f} :'.format(positive_prob), 
          'Positive' if positive_prob > 0.5 else 'Negative')

In [38]:
sentence = "Moonlight is by far the best movie of 2016."
test_sentence(sentence)

sentence = "It's amazing anyone could be talented enough to make something this spectacularly awful"
test_sentence(sentence)

Sentence: Moonlight is by far the best movie of 2016.
P(positive) = 0.598 : Positive
Sentence: It's amazing anyone could be talented enough to make something this spectacularly awful
P(positive) = 0.359 : Negative
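
Note that `text2vec` splits on spaces only, so tokens such as `2016.` or `It's` are looked up with their punctuation attached. The review text shown earlier has punctuation separated by spaces, and the vocabulary contains `s` and `t` as separate tokens, so normalizing new sentences in a similar way may help. The helper below is a hypothetical sketch. The exact preprocessing applied to the dataset is an assumption, not something this notebook documents.

```python
import re

def normalize(sentence):
    """Lowercase, put spaces around periods, and replace other punctuation with spaces."""
    text = sentence.lower()
    text = text.replace('.', ' . ')
    text = re.sub(r"[^a-z0-9. ]", ' ', text)   # e.g. "it's" -> "it s"
    return re.sub(r' +', ' ', text).strip()    # collapse repeated spaces

print(normalize("It's the best movie of 2016."))  # it s the best movie of 2016 .
```

The normalized string could then be passed to `test_sentence` in place of the raw sentence.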
