# Sentiment Analysis (Deep Learning, CNN)

In this tutorial, we perform sentiment analysis with deep learning, using a basic Convolutional Neural Network (CNN) architecture.

## Import required packages

In [2]:
import os
import numpy as np
import pandas as pd
import csv

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv1D, MaxPooling1D, Flatten, Embedding

from sklearn.preprocessing import LabelBinarizer

# The next imports are only needed for the preprocessing
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from utils.nlputil import preprocess_text

We need a tokenizer and a lemmatizer for the preprocessing.

In [3]:
tweet_tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()

Let's also define a set of parameters that we need later.

In [4]:
NUM_LABELS = 3       # We have 3 polarity classes
MAX_WORDS = 1000     # Tokenizer keeps the (MAX_WORDS-1) most frequent terms; index 0 is reserved
EMBEDDING_DIM = 50   # Size of the word vectors

## Data preparation

### Load data from files

In [5]:
df_tweets_train = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-training.csv')

# Print the first 5 lines
df_tweets_train.head()

| Unnamed: 0 | tweet | senti |
|---|---|---|
| 0 | @united UA5396 can wait for me. I'm on the gro... | 0 |
| 1 | I hate Time Warner! Soooo wish I had Vios. Can... | 0 |
| 2 | Tom Shanahan's latest column on SDSU and its N... | 2 |
| 3 | Found the self driving car!! /IWo3QSvdu2 | 2 |
| 4 | @united arrived in YYZ to take our flight to T... | 0 |


### Preprocess training and test data

In [6]:
train_tweets = df_tweets_train['tweet']
train_polarities = df_tweets_train['senti']

train_tweets_processed = [''] * len(train_tweets)

for idx, doc in enumerate(train_tweets):
    train_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)

In [7]:
df_tweets_test = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-test.csv')

test_tweets = df_tweets_test['tweet']
test_polarities = df_tweets_test['senti']  

test_tweets_processed = [''] * len(test_tweets)

for idx, doc in enumerate(test_tweets):
    test_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)  

### Prepare labels

In [8]:
encoder = LabelBinarizer()
encoder.fit(train_polarities)
y_train = encoder.transform(train_polarities)
y_test = encoder.transform(test_polarities)

print(y_test[:10])

[[1 0 0]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 0]]
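
To make the label encoding above concrete, here is a minimal pure-Python sketch of one-hot encoding for the multi-class case. It is not the scikit-learn implementation; the function name `binarize_labels` is hypothetical, and the sketch assumes the polarity labels are the integers 0, 1 and 2, as the sample rows and the printed output suggest.

```python
def binarize_labels(labels, classes=None):
    """One-hot encode class labels, one column per class in sorted order."""
    if classes is None:
        classes = sorted(set(labels))
    column = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1   # set the column of this label's class
        rows.append(row)
    return classes, rows

classes, encoded = binarize_labels([0, 2, 0, 1])
print(classes)   # [0, 1, 2]
print(encoded)   # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Under this assumption, the first column corresponds to label 0, the second to label 1, and the third to label 2, which is the column order seen in `y_test[:10]`.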


### Calculate maximum sequence length 

Most neural networks expect inputs of the same size. Since tweets are usually rather short, we can simply find the longest one (in terms of the number of words) and use its length as the maximum sequence length. For longer texts, e.g., reviews, the maximum sequence length is typically fixed a priori at a couple of hundred words.

In [9]:
longest_train_tweet = max([len(s.split()) for s in train_tweets_processed])
longest_test_tweet = max([len(s.split()) for s in test_tweets_processed])

max_seq_len = max(longest_train_tweet, longest_test_tweet)

print("Maximum sequence length: {}".format(max_seq_len))

Maximum sequence length: 29
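
For longer texts, where the maximum sequence length is fixed in advance, one possible way to pick the value is to look at the distribution of document lengths. The following is only a sketch under that assumption; the helper `length_cutoff` and its `coverage` parameter are hypothetical and not part of this tutorial's pipeline.

```python
import math

def length_cutoff(texts, coverage=0.95):
    """Smallest length (in whitespace tokens) that covers `coverage` of the texts."""
    lengths = sorted(len(t.split()) for t in texts)
    # Position of the first length that covers at least `coverage` of all texts.
    idx = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[idx]

docs = ["a b", "a b c", "a", "a b c d e f g h i j"]
print(length_cutoff(docs, coverage=0.75))  # 3
print(length_cutoff(docs, coverage=1.0))   # 10
```

With `coverage=1.0` this reduces to the longest document, which is what the cell above computes for the tweets.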


In [10]:
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_tweets_processed)

In [11]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 2709 unique tokens.


### Convert strings to sequences

The `tokenizer.word_index` dictionary maps each word in the vocabulary to an index. The method `texts_to_sequences` converts a string into a list of indexes representing the words in the string.

In [12]:
X_train = tokenizer.texts_to_sequences(train_tweets_processed)
X_test = tokenizer.texts_to_sequences(test_tweets_processed)

max_idx = max([ max(l) for l in X_train if len(l) > 0])

print(X_train[0])
print("Largest used index: {}".format(max_idx)) # This should be (MAX_WORDS-1)


[4, 838, 50, 8, 498, 174, 6, 200, 337, 499, 500]
Largest used index: 999
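
The following pure-Python sketch illustrates why the largest used index is `MAX_WORDS-1`: words are ranked by frequency starting at index 1, and only words whose index is below `num_words` are kept when converting texts. This is a simplified illustration, not the Keras code; the function names `build_word_index` and `to_sequences` are hypothetical, and text is split on whitespace only.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indexes, dropping unknown words and indexes >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

corpus = ["good flight good crew", "bad flight", "good food"]
index = build_word_index(corpus)
print(index)  # {'good': 1, 'flight': 2, 'crew': 3, 'bad': 4, 'food': 5}
print(to_sequences(corpus, index, num_words=4))  # [[1, 2, 1, 3], [2], [1]]
```

Note that the full word index still contains every word (2,709 tokens in the tutorial); the `num_words` limit only takes effect when texts are converted to sequences.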


### Sequence padding

We have to ensure that all inputs have the same length. Above, we calculated the maximum length to be 29. That means we have to "pad" all tweets that are shorter than that. Keras comes with a handy method for this. `padding='post'` specifies that the padding is added after the last word. `truncating='post'` is not required in this example; it would cut off words from the end of tweets that are too long (which cannot happen here).

In [13]:
X_train = pad_sequences(X_train, maxlen=max_seq_len, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_seq_len, padding='post', truncating='post')

print(X_train[0])
print("Sequence length: {}".format(len(X_train[0])))

[  4 838  50   8 498 174   6 200 337 499 500   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0]
Sequence length: 29
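
As a rough illustration of what `padding='post'` and `truncating='post'` do, here is a pure-Python sketch. The helper `pad_post` is hypothetical and only mimics this one configuration of `pad_sequences`; it returns plain lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut sequences at the end, then right-pad each one to `maxlen`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                             # truncating='post'
        padded.append(kept + [value] * (maxlen - len(kept)))  # padding='post'
    return padded

print(pad_post([[4, 838, 50], [7, 7, 7, 7, 7, 7]], maxlen=5))
# [[4, 838, 50, 0, 0], [7, 7, 7, 7, 7]]
```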


## Training the model (without word embeddings)

### Using "raw" sequences

Technically, we can train the network on the word indexes (e.g., `[  4 838  50   8 498 174   6 200 337 499 500   0   0 ...]`) without vectorizing the words. However, as you will see, the performance will be very poor.

In [14]:
X_train_raw = np.expand_dims(X_train, axis=2)
X_test_raw = np.expand_dims(X_test, axis=2)
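
`Conv1D` expects three-dimensional input of the form (samples, steps, channels), which is why the padded index matrix gets an extra axis here. A small sketch with made-up index values shows the effect on the shape:

```python
import numpy as np

batch = np.array([[4, 838, 50, 0],
                  [7, 12, 0, 0]])          # shape: (samples, steps)
batch_3d = np.expand_dims(batch, axis=2)   # shape: (samples, steps, 1 channel)

print(batch.shape)      # (2, 4)
print(batch_3d.shape)   # (2, 4, 1)
print(batch_3d[0, 1])   # [838]
```

Each word position now carries a single "channel" holding the raw word index, so the network treats the index as if it were a numeric magnitude.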

In [15]:
model_raw = Sequential()
model_raw.add(Conv1D(128, 5, activation='relu', input_shape=(max_seq_len, 1)))
#model_raw.add(MaxPooling1D(5))
model_raw.add(Flatten())
model_raw.add(Dense(128))
model_raw.add(Activation('relu'))
model_raw.add(Dense(64)) 
model_raw.add(Activation('relu'))
model_raw.add(Dense(NUM_LABELS))
model_raw.add(Activation('softmax'))

print(model_raw.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 25, 128)           768       
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               409728    
_________________________________________________________________
activation_1 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0         
___________________________________________________________
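
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (no padding, stride 1) and a bias term in every layer; the helper function names are hypothetical.

```python
def conv1d_output_steps(steps, kernel_size):
    """Output length of an unpadded, stride-1 convolution."""
    return steps - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

steps = conv1d_output_steps(29, 5)
print(steps)                      # 25
print(conv1d_params(5, 1, 128))   # 768
print(steps * 128)                # 3200 (size after Flatten)
print(dense_params(3200, 128))    # 409728
print(dense_params(128, 64))      # 8256
print(dense_params(64, 3))        # 195
```

Most of the parameters sit in the first dense layer, because flattening 25 steps with 128 filters each yields 3,200 inputs.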

In [16]:
model_raw.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [17]:
history_raw = model_raw.fit(X_train_raw, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
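
The 629/70 split follows from `validation_split=0.1` applied to the 699 training tweets. Keras takes the validation samples from the end of the training data, before any shuffling. The sketch below reproduces the arithmetic; the exact rounding rule is an assumption, and `split_point` is a hypothetical helper.

```python
import math

def split_point(num_samples, validation_split):
    """Index at which the trailing validation block starts (assumed rounding)."""
    return int(math.floor(num_samples * (1.0 - validation_split)))

n = 629 + 70                 # total number of training tweets
at = split_point(n, 0.1)
print(at, n - at)            # 629 70
```

Because the validation block is simply the tail of the data, it may be unrepresentative if the rows in the CSV file happen to be ordered in some way; shuffling the data beforehand would avoid that.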


In [18]:
score_raw = model_raw.evaluate(X_test_raw, y_test, batch_size=32, verbose=1)
print('Test score:', score_raw[0])
print('Test accuracy:', score_raw[1])

Test score: 1.420248966089031
Test accuracy: 0.39932885906040266


### Using one-hot word vectors

Here, we vectorize each word by converting it into a one-hot vector. Each vector has length 1,000 (`MAX_WORDS`), with one position per word index; index 0 is used for padding, so padded positions stay all-zero.

Instead of using the `Tokenizer` class of Keras, we do the conversion manually for illustration.

In [19]:
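def convert_to_word_onehot(X):
    # Start from zeros: np.empty would leave arbitrary values in the unset positions
    X_onehot = np.zeros(shape=(X.shape[0], X.shape[1], MAX_WORDS))
    for seq_idx, seq in enumerate(X):
        for word_idx, word in enumerate(seq):
            # Index 0 is padding and stays an all-zero vector
            if word > 0:
                X_onehot[seq_idx, word_idx, word] = 1
    return X_onehot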
        
X_train_onehot = convert_to_word_onehot(X_train)  
X_test_onehot = convert_to_word_onehot(X_test)  
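
The same conversion can also be written without explicit loops. The following is a sketch of one possible vectorized variant, not the tutorial's method; the function name `onehot_sequences` is hypothetical. The array must start out as zeros, since only the positions of real words are set to 1.

```python
import numpy as np

def onehot_sequences(X, vocab_size):
    """One-hot encode a (samples, steps) index array; padding (0) stays all-zero."""
    X = np.asarray(X)
    onehot = np.zeros((X.shape[0], X.shape[1], vocab_size), dtype=np.float32)
    rows, cols = np.nonzero(X)               # positions of real (non-padding) words
    onehot[rows, cols, X[rows, cols]] = 1.0  # set one entry per real word
    return onehot

demo = onehot_sequences([[2, 1, 0], [3, 0, 0]], vocab_size=4)
print(demo.shape)        # (2, 3, 4)
print(demo[0, 0])        # [0. 0. 1. 0.]
print(demo[0, 2].sum())  # 0.0
```

For the tutorial's data, the resulting array has 699 x 29 x 1,000 entries, almost all of them zero, which is one reason dense one-hot inputs do not scale well to larger vocabularies.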

In [20]:
model_onehot = Sequential()
model_onehot.add(Conv1D(128, 5, activation='relu', input_shape=(max_seq_len, MAX_WORDS)))
#model_onehot.add(MaxPooling1D(5))
model_onehot.add(Flatten())
model_onehot.add(Dense(128))
model_onehot.add(Activation('relu'))
model_onehot.add(Dense(64)) 
model_onehot.add(Activation('relu'))
model_onehot.add(Dense(NUM_LABELS))
model_onehot.add(Activation('softmax'))

print(model_onehot.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_2 (Conv1D)            (None, 25, 128)           640128    
_________________________________________________________________
flatten_2 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               409728    
_________________________________________________________________
activation_4 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_5 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 195       
__________

In [21]:
model_onehot.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [22]:
history_onehot = model_onehot.fit(X_train_onehot, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [23]:
score_onehot = model_onehot.evaluate(X_test_onehot, y_test, batch_size=32, verbose=1)
print('Test score:', score_onehot[0])
print('Test accuracy:', score_onehot[1])

Test score: 1.3819968148365918
Test accuracy: 0.6677852344992977


## Training the model (with word embeddings)

Finally, we use word embeddings.

### Define network model

This model now has an `Embedding` layer. 

In [24]:
model_embed = Sequential()
model_embed.add(Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=max_seq_len))
model_embed.add(Conv1D(128, 5, activation='relu'))
#model_embed.add(MaxPooling1D(5))
model_embed.add(Flatten())
model_embed.add(Dense(128))
model_embed.add(Activation('relu'))
model_embed.add(Dense(64)) 
model_embed.add(Activation('relu'))
model_embed.add(Dense(NUM_LABELS))
model_embed.add(Activation('softmax'))

print(model_embed.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 29, 50)            50000     
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 25, 128)           32128     
_________________________________________________________________
flatten_3 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 128)               409728    
_________________________________________________________________
activation_7 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_8 (Activation)    (None, 64)                0         
__________
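
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The NumPy sketch below uses random stand-in weights (not trained values) and a tiny made-up vocabulary to show that the lookup gives the same result as multiplying one-hot vectors with the weight matrix, and to re-derive the parameter counts from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3
weights = rng.normal(size=(vocab_size, embedding_dim))  # stand-in for trained weights

sequence = np.array([4, 1, 0])          # word indexes, 0 = padding
looked_up = weights[sequence]           # row lookup, one vector per word

onehot = np.eye(vocab_size)[sequence]   # shape: (steps, vocab_size)
via_matmul = onehot @ weights           # same vectors via one-hot x matrix

print(looked_up.shape)                      # (3, 3)
print(np.allclose(looked_up, via_matmul))   # True
print(1000 * 50)                            # 50000 embedding parameters
print((5 * 50 + 1) * 128)                   # 32128 Conv1D parameters
```

Compared with the one-hot model, the convolution now sees 50 input channels instead of 1,000, which shrinks its parameter count from 640,128 to 32,128.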

### Compile model

In [25]:
model_embed.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### Train model

In [26]:
history_embed = model_embed.fit(X_train, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Evaluate model

In [27]:
score_embed = model_embed.evaluate(X_test, y_test, batch_size=32, verbose=1)
print('Test score:', score_embed[0])
print('Test accuracy:', score_embed[1])

Test score: 1.7192326168085905
Test accuracy: 0.697986577181208
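
The test score is the categorical cross-entropy loss, so it is worth noting that the embedding model reaches the highest accuracy of the three models while also having the highest loss. One possible explanation, offered here only as an assumption, is that the model makes a few very confident wrong predictions. The sketch below uses made-up probabilities and hypothetical helper functions to show how two models with the same accuracy can have very different losses.

```python
import math

def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class is the true class."""
    hits = sum(p.index(max(p)) == t.index(1) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(max(p[t.index(1)], eps)) for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

y_true = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
# Both models get the last sample wrong, but with different confidence.
cautious = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]
confident = [[0.98, 0.01, 0.01], [0.01, 0.01, 0.98],
             [0.01, 0.98, 0.01], [0.001, 0.998, 0.001]]

for name, probs in [("cautious", cautious), ("confident", confident)]:
    print(name, accuracy(y_true, probs), round(cross_entropy(y_true, probs), 3))
```

Both toy models classify three of four samples correctly, yet the confident one is penalized much more heavily for its single mistake.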
