# Convolutional Neural Networks for Text Classification
CNNs are typically associated with computer vision, where they have delivered dramatic improvements in image classification on benchmarks such as ImageNet.

CNNs can also be used for some NLP tasks, particularly text classification: assigning each text to one of two or more categories. Although they haven't boosted performance in NLP as much as they have in computer vision, they can still be an effective approach. In this notebook, we will:

* Look at a popular (non-biomedical)* dataset and NLP task
* Train a CNN for sentiment analysis
* Compare it with a CNN that uses pretrained word embeddings and with a traditional machine learning baseline

\* Note: Deep learning models need lots of data. Since this is a supervised task, we need lots of *labeled* data. For this reason, we aren't going to use biomedical tasks as examples, but the concepts can be transferred to any field.

## 1. Sentiment Analysis - IMDB Dataset
*Sentiment Analysis* is a popular NLP classification task. In sentiment analysis, we are looking at a piece of text and trying to determine what emotion the text is expressing. It is often binary, which means a text can be either **positive** or **negative**.

Reviews are an excellent example of texts that can be used for this task. A popular dataset is the IMDB dataset, which has 50,000 movie reviews, split evenly between positive and negative. Our task will be to predict whether a review is positive (the reviewer liked the movie) or negative.

In [None]:
from keras import models

In [None]:
from keras.datasets import imdb

MAX_FEATURES = 10000 # Number of words to consider as features
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)

Let's take a look at our data. First, here's what it looks like when we load it from Keras:

In [None]:
print(x_train)

In [None]:
y_train

`y` is fairly straightforward: 0 means negative and 1 means positive. But what does `x` mean?

Let's consider the first data point:

In [None]:
x0 = x_train[0]
print(len(x0))
print(x0)

Each row of `x` is a list of integers, and the cell above prints the length of the first one. What do these integers mean?

Each number is the index of a particular word, so each text has been transformed from strings to integers. Remember how we limited our number of features to 10,000 words? That's the size of our vocabulary. Any word outside of that vocabulary is not kept as its own feature; it is replaced by a reserved "unknown" index (2 by default).

Each list of numbers is called a **sequence**, and sequences are primarily what we'll be dealing with.
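To make the encoding concrete, here is a minimal sketch with a made-up five-word vocabulary. The names `toy_word_index`, `encode`, and `decode` are hypothetical and not part of Keras; the sketch only assumes the layout described in this notebook, where indices 0, 1, and 2 are reserved and real words are offset by 3.

```python
# Toy vocabulary ranked by frequency (1 = most frequent).
# These words and ranks are made up for illustration.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "dreadful": 5}

PAD, START, OOV = 0, 1, 2   # reserved indices
INDEX_FROM = 3              # real words start after the reserved indices
NUM_WORDS = 7               # keep only encoded indices below this value

def encode(words):
    """Map words to integers using the reserved-index layout described above."""
    seq = [START]
    for w in words:
        rank = toy_word_index.get(w)
        if rank is not None and rank + INDEX_FROM < NUM_WORDS:
            seq.append(rank + INDEX_FROM)
        else:
            seq.append(OOV)  # unknown or too-rare word
    return seq

def decode(seq):
    """Reverse the mapping; reserved indices are shown as '?'."""
    reverse = {rank: w for w, rank in toy_word_index.items()}
    return " ".join(reverse.get(i - INDEX_FROM, "?") for i in seq)

toy_seq = encode("the movie was great".split())
print(toy_seq)          # [1, 4, 5, 6, 2]
print(decode(toy_seq))  # ? the movie was ?
```

Note that "great" falls outside the toy vocabulary limit, so it is encoded as the unknown index and cannot be recovered when decoding.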

In [None]:
word_index = imdb.get_word_index()
print(len(word_index))
for word in ['hello', 'world']:
    print(word, word_index[word])

In [None]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
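# Keys are integer ranks, so we look up a rank rather than a string;
# rank 1 is the most frequent word
reverse_word_index.get(1)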

Let's see what the data actually says. Before we loaded the data, the reviews had already been preprocessed and mapped from strings (words) to integers (indices). To see the data in a (somewhat) human-readable form, we'll write a function `inverse_transform` that reverses this process:

In [None]:
def inverse_transform(seq, word_index):
    # word_index is a dictionary mapping words to an integer index
    # We reverse it, mapping integer indices to words
    reverse_word_index = {value: key for (key, value) in word_index.items()}

    # We decode the review; note that our indices were offset by 3
    # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
    decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in seq])

    return decoded_review

In [None]:
neg_idx = list(y_train).index(0)
pos_idx = list(y_train).index(1)

In [None]:
# Let's look at a negative review
negative_decoded_review = inverse_transform(x_train[neg_idx], word_index)

# And a positive review
positive_decoded_review = inverse_transform(x_train[pos_idx], word_index)

In [None]:
print(negative_decoded_review)

In [None]:
print(positive_decoded_review)

## 1b. More Data Processing

As part of the next step, we'll do a bit more data processing.

The first thing to consider is how long each sequence is. With our Cats vs. Dogs classifier, each image was resized to be the same size and shape. Keras expects data to be formatted like this. Let's look at how long our reviews are:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
review_lengths = [len(row) for row in x_train]


In [None]:
print("Mean: {}".format(np.mean(review_lengths)))
print("Standard Deviation: {}".format(np.std(review_lengths)))
print("Max: {}".format(max(review_lengths)))
print("Min: {}".format(min(review_lengths)))

In [None]:
_ = plt.hist(review_lengths, bins=20)

As you can see, most reviews are around 200 words long. There's a long tail of some more long-winded reviews, and a few very short ones as well.

We'll have to normalize the sequences so that each one is the same length. We'll do this in two ways: reviews longer than `MAX_SEQUENCE_LENGTH` are cut down to that length (by default, `pad_sequences` keeps the last `MAX_SEQUENCE_LENGTH` tokens), and shorter reviews are "padded" by adding 0's to the beginning:
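The padding and truncation step can be sketched in plain Python. This is an illustration only: `pad_or_truncate` is a hypothetical helper, and it assumes that both padding and truncation happen at the front of the sequence, which is our understanding of the `pad_sequences` defaults.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen items (maxlen >= 1).

    Long sequences keep their last maxlen tokens; short ones get
    pad_value added at the front.
    """
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])
    return [pad_value] * (maxlen - len(seq)) + list(seq)

short_review = [1, 14, 22]
long_review = [1, 5, 6, 7, 8, 9, 10]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_or_truncate(long_review, maxlen=5))   # [6, 7, 8, 9, 10]
```

Truncating at the front means a long review loses its opening words, including the start-of-sequence marker, while its ending is preserved.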

In [None]:
from keras.preprocessing import sequence
MAX_FEATURES = 10000 # number of words to consider as features
MAX_SEQUENCE_LENGTH = 500 # cut texts after this number of words (among top max_features most common words)
BATCH_SIZE = 128
EMBEDDING_DIM = 200

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
x_test = sequence.pad_sequences(x_test, maxlen=MAX_SEQUENCE_LENGTH)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

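Each sequence is now exactly 500 tokens long. Shorter reviews were padded with 0's at the beginning, which `inverse_transform` renders as `?`. Let's look at what the first training review looks like now: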

In [None]:
print(x_train[0])

In [None]:
print(inverse_transform(x_train[0], word_index))

And here's what the longest review looks like:

TODO

## 2. Train a CNN for Sentiment Analysis

TODO: Do some descriptions of what CNNs for text look like

In [None]:
metrics = {}

In [None]:
from keras.callbacks import TensorBoard
callbacks = [TensorBoard('./logs', batch_size=BATCH_SIZE)]

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense
from keras import layers
from keras.optimizers import RMSprop

model = Sequential(name='model_no_pretrained_embeddings')
# Map each word index to a trainable EMBEDDING_DIM-dimensional vector
model.add(Embedding(input_dim=MAX_FEATURES,
                    output_dim=EMBEDDING_DIM,
                    input_length=MAX_SEQUENCE_LENGTH))
# Two rounds of convolution over windows of 7 tokens, each followed by pooling
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(Flatten())
# Sigmoid squashes the output to a probability, as binary_crossentropy expects
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
model.summary()
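The sequence lengths reported by `model.summary()` can be checked by hand. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling, which is our understanding of the `Conv1D` and `MaxPooling1D` defaults; the helper names are hypothetical.

```python
def conv1d_out_len(n, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return n - kernel_size + 1

def maxpool1d_out_len(n, pool_size):
    """Output length of non-overlapping max pooling (stride == pool_size)."""
    return n // pool_size

n = 500                       # MAX_SEQUENCE_LENGTH
n = conv1d_out_len(n, 7)      # 494
n = maxpool1d_out_len(n, 5)   # 98
n = conv1d_out_len(n, 7)      # 92
n = maxpool1d_out_len(n, 5)   # 18
flattened_units = n * 32      # 32 filters in the last Conv1D layer
print(n, flattened_units)     # 18 576
```

Under these assumptions, the `Flatten` layer hands 576 values per review to the final `Dense` layer.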

In [None]:
history = model.fit(x_train, y_train, epochs=10,
                    batch_size=BATCH_SIZE,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks)

Similarly to the Cats vs. Dogs notebook, we're going to load a pretrained model and its training history:

In [None]:
model = models.load_model('saved_models/imdb.h5')

In [None]:
# Load the saved training history
import pickle
with open('logs/imdb_history.pkl', 'rb') as f:
    h = pickle.load(f)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(h['acc'], marker='.', linestyle='dotted', alpha=0.4, label='IMDB Training Acc')
plt.plot(h['val_acc'], marker='.', label="IMDB Validation Acc")
plt.xlabel('# epochs')
plt.ylim((0.5, 0.92))
plt.legend(loc='upper center', ncol=2,mode='expand')
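One way to read a curve like this programmatically is to find the epoch with the highest validation accuracy, which is a common heuristic for deciding how long to train. The numbers below are invented for illustration and are not results from this notebook; `toy_history` is a hypothetical stand-in for a history dictionary such as `h`.

```python
# Hypothetical validation accuracies, one per epoch (not real results).
toy_history = {"val_acc": [0.61, 0.74, 0.82, 0.85, 0.84, 0.83]}

val_acc = toy_history["val_acc"]
# Index of the epoch with the highest validation accuracy (0-based)
best_epoch = max(range(len(val_acc)), key=lambda e: val_acc[e])

print("Best epoch (1-based):", best_epoch + 1)   # Best epoch (1-based): 4
print("Validation accuracy:", val_acc[best_epoch])
```

Training past this point in the toy data only lowers validation accuracy, which is the usual sign of overfitting.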

In [None]:
model = Sequential(name='model_4_epochs')
model.add(Embedding(input_dim=MAX_FEATURES,
                    output_dim=EMBEDDING_DIM,
                    input_length=MAX_SEQUENCE_LENGTH))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(Flatten())
# Sigmoid squashes the output to a probability, as binary_crossentropy expects
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
model.summary()

# Same architecture as before, trained for only 4 epochs
history = model.fit(x_train, y_train, epochs=4,
                    batch_size=BATCH_SIZE,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks)

In [None]:
# word_index maps words to ranks, so look up rank 2 in the reversed mapping
reverse_word_index.get(2)

In [None]:
model.save('saved_models/imdb_4_epochs.h5')

In [None]:
model = models.load_model('saved_models/imdb_4_epochs.h5')

In [None]:
# Let's try predicting our own movie reviews:
from nltk.tokenize import word_tokenize

def words2seq(words):
    seq = []
    for w in words:
        idx = word_index.get(w)
        # Encoded indices are offset by 3, so the offset index must fit in the vocabulary
        if idx is not None and idx + 3 < MAX_FEATURES:
            seq.append(idx + 3)
        else:
            seq.append(2)  # 2 is the placeholder for out-of-vocabulary words
    return seq

def prepare_text(text):
    # Lowercase and tokenize so the words match the keys of word_index
    text = text.lower()
    tokens = word_tokenize(text)
    return words2seq(tokens)

def classify_reviews(texts, model):
    # pad_sequences accepts a list of variable-length lists
    x = [prepare_text(text) for text in texts]
    x = sequence.pad_sequences(x, maxlen=MAX_SEQUENCE_LENGTH)
    # Threshold the predicted scores at 0.5: 1 = positive, 0 = negative
    return (model.predict(x) > 0.5).astype('int32')

In [None]:
positive_mamma_mia = """
Even better than the original, creating a great backstory and bringing a touching and gratifying closure to the \
mother-daughter story of Mamma Mia. Excellent choreography, catchy songs and beautiful performances by Lilly James and \
Amanda Seyfried, plus just the right amount of humor and sentimentality.
"""

In [None]:
negative_mamma_mia = """
As a film, it was overly reliant on the audiences nostalgia, incorporating the lower quality Abba songs which \
remind you how much more you wanted to watch the original. The original Swedish script echoes in this, with much \
of the dialogue being poorly localised and therefore making very little sense at all. \
A very basic and safe plot is used, making it evident that this film was only made as a cash grab from a fanbase still \
in love with the original
"""

In [None]:
classify_reviews([positive_mamma_mia, negative_mamma_mia],
                model)
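The classifier's raw outputs are scores between 0 and 1 when the final layer uses a sigmoid. As a small sketch of how such scores become labels, the hypothetical helper below applies a 0.5 threshold; the scores are invented and are not predictions from the model above.

```python
def probabilities_to_labels(probs, threshold=0.5):
    """Turn scores in [0, 1] into sentiment labels using a fixed threshold."""
    return ["positive" if p > threshold else "negative" for p in probs]

# Made-up scores for two reviews (illustration only)
toy_probs = [0.91, 0.12]
print(probabilities_to_labels(toy_probs))  # ['positive', 'negative']
```

Moving the threshold trades off the two kinds of error; 0.5 is simply the conventional default for a balanced dataset.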

# 3. Add Pretrained Word Embeddings

## Word Embeddings

In [None]:
EMBEDDINGS_PATH = '/Users/alec/Data/glove.6B/glove.6B.200d.txt'

In [None]:
import numpy as np

In [None]:
embeddings_index = {}

# Each line holds a word followed by its vector components, separated by spaces
with open(EMBEDDINGS_PATH, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
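To show what the parsing loop does without needing the embeddings file, the sketch below runs the same logic on two made-up lines held in memory. The words and numbers are invented, and real vectors in the file used here have 200 components rather than 3.

```python
import io
import numpy as np

# Two invented lines in the "word v1 v2 ... vn" text format
sample_file = io.StringIO("the 0.1 0.2 0.3\nmovie -0.4 0.5 0.6\n")

toy_embeddings_index = {}
for line in sample_file:
    values = line.split()
    word = values[0]                                  # first field is the word
    coefs = np.asarray(values[1:], dtype="float32")   # the rest is the vector
    toy_embeddings_index[word] = coefs

print(sorted(toy_embeddings_index))         # ['movie', 'the']
print(toy_embeddings_index["movie"].shape)  # (3,)
```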

In [None]:
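embedding_matrix = np.zeros((MAX_FEATURES, EMBEDDING_DIM))
for word, i in word_index.items():
    # The loaded sequences offset every word index by 3 (0, 1 and 2 are reserved),
    # so the vector for a word must go in row i + 3 to line up with the data.
    row = i + 3
    if row < MAX_FEATURES:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[row] = embedding_vector
        # Words not found in the embedding index stay all-zeros.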
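Row alignment is easy to get wrong here, so a toy version may help. The sketch below assumes the encoding described earlier in this notebook, where word ranks are offset by 3 in the sequences; all names, words, and vectors are made up for illustration.

```python
import numpy as np

INDEX_FROM = 3   # offset applied to word ranks in the encoded sequences
NUM_WORDS = 6    # toy vocabulary size (rows in the matrix)
DIM = 4          # toy embedding dimension

# Made-up word ranks and vectors for illustration only
toy_word_index = {"the": 1, "movie": 2, "great": 3, "zyzzyva": 4}
toy_embeddings = {
    "the": np.full(DIM, 0.1, dtype="float32"),
    "movie": np.full(DIM, 0.2, dtype="float32"),
    "great": np.full(DIM, 0.3, dtype="float32"),
}

toy_matrix = np.zeros((NUM_WORDS, DIM), dtype="float32")
for word, rank in toy_word_index.items():
    row = rank + INDEX_FROM              # row must match the encoded index
    vector = toy_embeddings.get(word)
    if row < NUM_WORDS and vector is not None:
        toy_matrix[row] = vector

# Count rows that received a pretrained vector
covered = int(np.any(toy_matrix != 0, axis=1).sum())
print(covered, "of", NUM_WORDS, "rows have a pretrained vector")
# 2 of 6 rows have a pretrained vector
```

Rows 0 to 3 stay at zero because they correspond to the reserved indices, and "great" is dropped because its offset index falls outside the toy vocabulary.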

In [None]:
word_index.get('movie')

## Model w/ Pretrained Embeddings

In [None]:
from keras.models import Sequential
from keras import layers
from keras.layers import Flatten, Embedding
from keras.optimizers import RMSprop



In [None]:
# Something's not right here
model_pretrained_emb = Sequential(name='model_pretrained_embeddings')
# Initialize the embedding layer with the pretrained vectors in embedding_matrix
model_pretrained_emb.add(
    Embedding(input_dim=MAX_FEATURES,
              output_dim=EMBEDDING_DIM,
              weights=[embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH))
model_pretrained_emb.add(layers.Conv1D(32, 7, activation='relu'))
model_pretrained_emb.add(layers.MaxPooling1D(5))
model_pretrained_emb.add(layers.Conv1D(32, 7, activation='relu'))
model_pretrained_emb.add(layers.MaxPooling1D(5))
model_pretrained_emb.add(Flatten())
# Sigmoid squashes the output to a probability, as binary_crossentropy expects
model_pretrained_emb.add(layers.Dense(1, activation='sigmoid'))

In [None]:
model_pretrained_emb.compile(optimizer=RMSprop(lr=1e-4),
                             loss='binary_crossentropy',
                             metrics=['acc'])
history_pretrained_emb = model_pretrained_emb.fit(x_train, y_train,
                                                  validation_data=(x_test, y_test),
                                                  epochs=10, batch_size=BATCH_SIZE)

In [None]:
# Threshold the predicted scores at 0.5 to get class labels (1 = positive)
y_pred_pre = (model_pretrained_emb.predict(x_test) > 0.5).astype('int32')

In [None]:
model_pretrained_emb.save('saved_models/imdb_pretrained.h5')

In [None]:
# Extract the history dictionary before saving it
h_pre = history_pretrained_emb.history

In [None]:
with open('logs/history_pretrained.pkl', 'wb') as f:
    pickle.dump(h_pre, f)

In [None]:
histories = {'imdb': h,
            'imdb_pretrained': h_pre}

In [None]:
# def plot_scores(histories):
fig, ax = plt.subplots()
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
for i, (name, history) in enumerate(histories.items()):
    # Training curve: dotted and faded; validation curve: solid, same color
    ax.plot(history['acc'], marker='.', linestyle='dotted', alpha=0.4,
            label="{} train acc".format(name), color=colors[i])
    ax.plot(history['val_acc'], marker='.',
            label="{} val acc".format(name), color=colors[i])

# ax.set_title('CNN Sentiment Analysis Validation Accuracy')
ax.set_xlabel('# epochs')
fig.legend(loc='upper center', ncol=2, mode='expand')

# plot_scores(histories)

# 4. Compare a Traditional Machine Learning Baseline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import classification_report

In [None]:
def seqs2bow(x, vectorizer=None):
    """
    Takes a list of sequences
    and converts them into a bag of words vector.
    """
    x_dicts = []
    for seq in x:
        d = {}
        for word in seq:
            if word not in d:
                d[word] = 0
            d[word] += 1
        x_dicts.append(d)
        
    if not vectorizer:
        vectorizer = DictVectorizer()
        x = vectorizer.fit_transform(x_dicts)
    else:
        x = vectorizer.transform(x_dicts)
    
    return x, vectorizer
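A bag-of-words vector simply counts how often each index occurs in a sequence and discards word order. The sketch below shows the idea on two tiny made-up sequences using only the standard library; `toy_bow` is a hypothetical helper that returns dense lists rather than the sparse matrix produced by `DictVectorizer`.

```python
from collections import Counter

def toy_bow(seqs, vocab_size):
    """Convert integer sequences to dense count vectors of length vocab_size."""
    rows = []
    for seq in seqs:
        counts = Counter(seq)   # index -> number of occurrences
        rows.append([counts.get(i, 0) for i in range(vocab_size)])
    return rows

# Two made-up encoded reviews
toy_seqs = [[1, 4, 5, 4], [1, 2, 2]]
print(toy_bow(toy_seqs, vocab_size=6))
# [[0, 1, 0, 0, 2, 1], [0, 1, 2, 0, 0, 0]]
```

Because the sequences passed to `seqs2bow` in this notebook are already padded, index 0 will also be counted as a feature there.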

In [None]:
x_train_new, vectorizer = seqs2bow(x_train)

In [None]:
x_test_new, vectorizer = seqs2bow(x_test, vectorizer)

In [None]:
print(x_train_new.shape)

In [None]:
clf = RandomForestClassifier()
clf.fit(x_train_new, y_train)

In [None]:
from sklearn.metrics import accuracy_score

# Accuracy on the training set
pred_train = clf.predict(x_train_new)
bow_train_acc = accuracy_score(y_train, pred_train)

In [None]:
pred = clf.predict(x_test_new)

In [None]:
# Accuracy on the held-out set
bow_val_acc = accuracy_score(y_test, pred)

In [None]:
# def plot_scores(histories):
fig, ax = plt.subplots()
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
for i, (name, history) in enumerate(histories.items()):
    ax.plot(history['acc'], marker='.', linestyle='dotted', alpha=0.4,
            label="{} train acc".format(name), color=colors[i])
    ax.plot(history['val_acc'], marker='.',
            label="{} val acc".format(name), color=colors[i])

# Add horizontal lines showing the BOW accuracy, using the next unused color
bow_color = colors[len(histories)]
ax.hlines(y=bow_train_acc, xmin=0, xmax=10, label='BOW training accuracy', color=bow_color, linestyle='dotted')
ax.hlines(y=bow_val_acc, xmin=0, xmax=10, label='BOW validation accuracy', color=bow_color, alpha=0.4)

# ax.set_title('CNN Sentiment Analysis Validation Accuracy')
ax.set_xlabel('# epochs')
fig.legend(loc='upper center', ncol=3, mode='expand')

# plot_scores(histories)

In [None]:
import pandas as pd

def score_table():
    # One row per model: best training and validation accuracy reached
    rows = [
        {'Model Name': 'IMDB',
         'Max Training Accuracy': max(histories['imdb']['acc']),
         'Max Validation Accuracy': max(histories['imdb']['val_acc'])},
        {'Model Name': 'IMDB Pretrained Embeddings',
         'Max Training Accuracy': max(histories['imdb_pretrained']['acc']),
         'Max Validation Accuracy': max(histories['imdb_pretrained']['val_acc'])},
        {'Model Name': 'Random Forest BOW',
         'Max Training Accuracy': bow_train_acc,
         'Max Validation Accuracy': bow_val_acc},
    ]
    return pd.DataFrame(rows, columns=['Model Name', 'Max Training Accuracy',
                                       'Max Validation Accuracy'])

In [None]:
df = score_table()
df

In [None]:
fig

# Conclusion