# Document Classification
I'm interested in the task of _document classification_. In this notebook I try out several techniques for document classification using the [IMDB Large Movie Review dataset](https://ai.stanford.edu/~amaas/data/sentiment/). Each row in the dataset consists of a written movie review along with the rating associated with the review (1 - 10). The dataset counts a review as positive if its rating is 7 or higher and as negative if its rating is 4 or lower; reviews with neutral ratings (5 or 6) are not included. I've reformulated the task as _sentiment classification_ by assigning a positive/negative label to each review.

After the normal data downloading and cleaning phase I experiment with a few different strategies:
- V1: learning an embedding layer
- V2: training GloVe embeddings on the dataset
- V3a: using pretrained embeddings (GloVe)
- V3b: using pretrained embeddings (Google universal embeddings)
- V4: a more complicated deep learning architecture: embed sentences and documents, then use a CNN over documents

## Imports

In [10]:
import os
import re
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  1.14.0
Eager mode:  False
GPU is NOT AVAILABLE


## Load the data

In [11]:
! ls ../data/aclImdb/train

labeledBow.feat pos             unsupBow.feat   urls_pos.txt
neg             unsup           urls_neg.txt    urls_unsup.txt


In [12]:
# Load data and assign to the correct class (positive or negative)
# Ignore the pre-assigned train/test split

data = []
data_dirs = ["../data/aclImdb/train/pos", "../data/aclImdb/test/pos", "../data/aclImdb/train/neg", "../data/aclImdb/test/neg"]

for dir_ in data_dirs:
    # The directory name determines the label: 1 = positive, 0 = negative.
    class_ = 1 if dir_.endswith("pos") else 0

    for file in os.listdir(dir_):
        with open(os.path.join(dir_, file), "r", encoding="utf-8") as f:
            text = f.read()

        # File names look like "<id>_<rating>.txt"; the rating is not needed for the sentiment task.
        # rating = int(file.split("_")[1].split(".")[0])
        data.append([class_, text])
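
If the rating is ever needed, it can be recovered from the file name. The sketch below assumes names of the form `<id>_<rating>.txt`; `parse_rating` and `rating_to_label` are hypothetical helpers, not part of this notebook. Note that reading only the first character after the underscore would turn a rating of 10 into 1, so the whole number is parsed.

```python
import re

def parse_rating(filename):
    """Return the integer rating encoded in a name like '123_10.txt'."""
    match = re.fullmatch(r"\d+_(\d+)\.txt", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match.group(1))

def rating_to_label(rating):
    """Map a 1-10 rating to 1 (positive) or 0 (negative)."""
    return 1 if rating > 5 else 0

print(parse_rating("123_10.txt"))
print(rating_to_label(parse_rating("45_3.txt")))
```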

In [13]:
df = pd.DataFrame(data, columns=["class", "text"])

In [14]:
# Look at some bad reviews
pd.set_option('display.max_colwidth', -1)
df[df["class"] == 0].head(3)

Unnamed: 0,class,text
25000,0,"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."
25001,0,"Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question ""why in Gods name would they create another one of these dumpster dives of a movie?"" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we're from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well."
25002,0,"Ouch! This one was a bit painful to sit through. It has a cute and amusing premise, but it all goes to hell from there. Matthew Modine is almost always pedestrian and annoying, and he does not disappoint in this one. Deborah Kara Unger and John Neville turned in surprisingly decent performances. Alan Bates and Jennifer Tilly, among others, played it way over the top. I know that's the way the parts were written, and it's hard to blame actors, when the script and director have them do such schlock. If you're going to have outrageous characters, that's OK, but you gotta have good material to make it work. It didn't here. Run away screaming from this movie if at all possible."


In [15]:
# Good reviews.
df[df["class"] == 1].head(3)

Unnamed: 0,class,text
0,1,"For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan ""The Skipper"" Hale jr. as a police Sgt."
1,1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of ""Rosemary's Baby"" and ""The Exorcist""--but what a combination! Based on the best-seller by Jeffrey Konvitz, ""The Sentinel"" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****"
2,1,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie."


## Clean the data

In [16]:
def pre_process_document(document):
    # Normalize the text.
    document = document.lower()

    # Remove non-alphanumeric characters.
    document = re.sub(r"[^a-zA-Z0-9\s]", "", document)
    
    # Collapse runs of spaces into a single space.
    document = re.sub(r" +", r" ", document)
    
    # Strip leading and trailing whitespace.
    document = document.strip()
    
    return document
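
One thing to watch for: the raw reviews contain HTML line breaks (`<br />`), and removing only the punctuation leaves a stray `br` token that can fuse with the preceding word. A possible variant, sketched here under the assumption that dropping tags is acceptable for this task, replaces tags with a space first. `strip_html_then_clean` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def strip_html_then_clean(document):
    document = document.lower()
    # Replace HTML tags such as <br /> with a space before dropping punctuation.
    document = re.sub(r"<[^>]+>", " ", document)
    # Remove everything except letters, digits, and whitespace.
    document = re.sub(r"[^a-z0-9\s]", "", document)
    # Collapse any run of whitespace into a single space.
    document = re.sub(r"\s+", " ", document)
    return document.strip()

sample = "Great cast.<br /><br />Branagh steals the film!"

# Punctuation-only removal, for comparison.
punctuation_only = re.sub(r"[^a-zA-Z0-9\s]", "", sample.lower())

print(punctuation_only)               # great castbr br branagh steals the film
print(strip_html_then_clean(sample))  # great cast branagh steals the film
```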

In [17]:
df["text"] = df["text"].apply(pre_process_document)

In [18]:
df.head(3)

Unnamed: 0,class,text
0,1,for a movie that gets no respect there sure are a lot of memorable quotes listed for this gem imagine a movie where joe piscopo is actually funny maureen stapleton is a scene stealer the moroni character is an absolute scream watch for alan the skipper hale jr as a police sgt
1,1,bizarre horror movie filled with famous faces but stolen by cristina raines later of tvs flamingo road as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the gateway to hell the scenes with raines modeling are very well captured the mood music is perfect deborah raffin is charming as cristinas pal but when raines moves into a creepy brooklyn heights brownstone inhabited by a blind priest on the top floor things really start cooking the neighbors including a fantastically wicked burgess meredith and kinky couple sylvia miles beverly dangelo are a diabolical lot and eli wallach is great fun as a wily police detective the movie is nearly a crosspollination of rosemarys baby and the exorcistbut what a combination based on the bestseller by jeffrey konvitz the sentinel is entertainingly spooky full of shocks brought off well by director michael winner who mounts a thoughtfully downbeat ending with skill 12 from
2,1,a solid if unremarkable film matthau as einstein was wonderful my favorite part and the only thing that would make me go out of my way to see this again was the wonderful scene with the physicists playing badmitton i loved the sweaters and the conversation while they waited for robbins to retrieve the birdie


## Train/val/test split

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
# 60% train; the remaining 40% is split 20/80 into validation (8%) and test (32%).
train, test_val = train_test_split(df, test_size=0.4)
val, test = train_test_split(test_val, test_size=0.8)

In [21]:
train.shape, val.shape, test.shape

((30000, 2), (4000, 2), (16000, 2))
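
The shapes above follow from the two split fractions. As a quick check, here is a small stdlib-only sketch of the arithmetic; `split_sizes` is a hypothetical helper, and it ignores the rounding scikit-learn applies when the sizes do not divide evenly.

```python
from fractions import Fraction

def split_sizes(n_rows, holdout=Fraction(2, 5), test_share=Fraction(4, 5)):
    """Return (train, val, test) sizes for a two-stage split."""
    n_holdout = int(n_rows * holdout)     # rows set aside by the first split
    n_test = int(n_holdout * test_share)  # share of the holdout used for testing
    n_val = n_holdout - n_test
    n_train = n_rows - n_holdout
    return n_train, n_val, n_test

print(split_sizes(50000))  # (30000, 4000, 16000)
```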

In [22]:
X_train, X_val, X_test = train["text"], val["text"], test["text"]
y_train, y_val, y_test = train["class"], val["class"], test["class"]

## Vectorize text

In [23]:
vocab = []
vocab_size = 3000

# Stanford was nice enough to give us the vocab.
with open("../data/aclImdb/imdb.vocab", "r") as f:
    for line in f:
        vocab.append(line.strip())

# Map each vocabulary item to an integer between 1 and vocab_size
vocab_vector_map = {}
for index, word in enumerate(vocab[:vocab_size]): # only use the vocab_size (3000) most common words
    vocab_vector_map[word] = index + 1 # reserve 0 for padding
    
def map_vectors(reviews):
    mapped_reviews = []
    
    for review in reviews:
        mapped_review = []
        for word in review.split(" "):
            # Skip out-of-vocabulary words.
            if word in vocab_vector_map:
                mapped_review.append(vocab_vector_map[word])
        mapped_reviews.append(mapped_review)
    return mapped_reviews

In [24]:
# Map the words in each document to its numerical representation.
from keras.preprocessing import sequence

max_words = 300
X_train2 = sequence.pad_sequences(map_vectors(X_train), maxlen=max_words)
X_val2 = sequence.pad_sequences(map_vectors(X_val), maxlen=max_words)
X_test2 = sequence.pad_sequences(map_vectors(X_test), maxlen=max_words)

y_train = np.array(y_train)
y_val = np.array(y_val)
y_test = np.array(y_test)

Using TensorFlow backend.
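
With its default settings, `pad_sequences` pads at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` items. The pure-Python sketch below mimics that default behavior on plain lists; `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([8, 24, 85], 5))            # [0, 0, 8, 24, 85]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

This is why the encoded document below starts with a run of zeros, and why a long review loses its opening words rather than its ending.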


In [25]:
# Peek at an encoded document
X_train2[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    8,   24,   85,  138,   13,
          3,  333,  153, 1333,   32,   11,    6,   29,    4, 1285,   15,
          2,  137,   15,   32, 1264,    4,    3,  952,    1,   18,    6,
        522,    2,    4,    3, 1617,  266,  288,  173,   13,   72,   13,
          3,  853,   80,   31,   11,  922,  117,    5,   32,  874,  506,
         13,    3, 1484, 1267, 2014, 1177,    2,   

In [26]:
# Make sure the transformation was correct.
reversed_map = {v: k for k, v in vocab_vector_map.items()}
reversed_map[0] = "0"

decoded = [reversed_map[i] for i in X_train2[0]]
" ".join(decoded)

'0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 in his first go as a hollywood director henry an that is all of issues with and such with an equally of a business the film is directed and of a wonderfully put together cast as well as a screenplay also by that brings life to an otherwise genre as a delivers subtle unexpected charming and very welcome its star whose state is especially given his line of the william h again captures our as alex a husband and father who finds in the most of a young attractive named sarah whom he meets in the at a office where he the of dr john to his growing to the family business that his father donald sutherland built alex whose to lead a new life is by the fear of disappointing his father an for sarah which ultimately leads him to understand the and of being a husband to his wife and more to him a good father to hi

## V1: Simple model (RNN)
LSTM with an embedding layer. Each document is truncated to 300 words, and shorter documents are zero-padded. An embedding layer of 32 dimensions is added, so this layer's output is (data, 300, 32). This is then fed to an LSTM with output of (data, 100). Then a dense layer with output shape (data, 1) is added (binary classification). The word embeddings are learned during training, so during classification the embedding layer serves as a preprocessor. In `V3a` I will use pretrained embeddings to see if performance changes.

In [134]:
# Define the simple model with our custom embedding layer.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

embedding_dim=32

model=Sequential()
model.add(Embedding(vocab_size + 1, embedding_dim, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

Model: "sequential_78"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_74 (Embedding)     (None, 300, 32)           96032     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_27 (Dense)             (None, 1)                 101       
Total params: 149,333
Trainable params: 149,333
Non-trainable params: 0
_________________________________________________________________
None
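
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the sizes are the ones used in this notebook (3000 words plus the padding index, 32 embedding dimensions, 100 LSTM units).

```python
def embedding_params(vocab_rows, dim):
    # One vector of length `dim` per vocabulary row.
    return vocab_rows * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(3000 + 1, 32),
    lstm_params(32, 100),
    dense_params(100, 1),
]
print(counts, sum(counts))  # [96032, 53200, 101] 149333
```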


In [135]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [None]:
# Define callbacks.
patience = 50
early_stop = EarlyStopping('val_loss', patience=patience)
reduce_lr = ReduceLROnPlateau('val_loss', factor=0.1, patience=int(patience/4), verbose=1)
model_checkpoint = ModelCheckpoint("../models/doc_classif/doc_classif_v1.{epoch:02d}-{val_acc:.2f}.model",
                                   "val_loss", verbose=1, save_best_only=True)
callbacks = [model_checkpoint, early_stop, reduce_lr]
 
model.fit(X_train2, y_train, validation_data=(X_val2, y_val), batch_size=64, epochs=5, callbacks=callbacks)
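
Note how `patience` interacts with `epochs` here: with a patience of 50 and only 5 epochs, early stopping cannot trigger, so in practice only the checkpoint callback has an effect. The sketch below illustrates patience-based stopping in plain Python. `epochs_run` is a hypothetical helper and the loss values are made up for illustration; it is not the Keras implementation.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before patience-based early stopping halts training."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses: best at epoch 2, worse afterwards.
losses = [0.36, 0.34, 0.35, 0.37, 0.39]

print(epochs_run(losses, patience=50))  # 5: training runs to the end
print(epochs_run(losses, patience=2))   # 4: stops after two epochs without improvement
```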

In [163]:
from tensorflow.keras.models import load_model

# The checkpoint from the second epoch was the best-performing one, at 86% validation accuracy.
model = load_model("../models/doc_classif/doc_classif_v1.02-0.86.model")

score = model.evaluate(X_test2, y_test)



In [164]:
print(f"The model's accuracy is: {score[1]:.3f}")

The model's accuracy is: 0.909


### Make sure the embeddings make sense

In [165]:
# Check out what the word embeddings look like
from tensorflow.keras.models import Model

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[0].output)
intermediate_output = intermediate_layer_model.predict(X_test2[:3])

In [166]:
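# Example of a word embedding.
# Note: word_id is read from position -1, while the embedding comes from position -2 (the preceding token).
word_id = reversed_map[X_test2[0, -1]]
word_embed_1 = intermediate_output[0, -2]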

In [167]:
word_id, word_embed_1

('video',
 array([ 0.00781669, -0.05589531,  0.01132666, -0.0148676 ,  0.03820979,
        -0.02667308, -0.00871681,  0.02055935,  0.00824991,  0.02728681,
        -0.01356855,  0.01657996, -0.0638892 , -0.03288768,  0.00123704,
        -0.05473331,  0.02520202,  0.01562783,  0.05673045,  0.02553965,
         0.01083835, -0.0684996 , -0.05392539, -0.02837973, -0.00661056,
        -0.00187142,  0.05335394,  0.05562951, -0.01034331,  0.00165773,
        -0.01067892, -0.00635923], dtype=float32))

## V2: Simple Model (CNN)
Let's see if I can get a higher accuracy with a CNN.

In [207]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, ReLU, MaxPool1D, Dense, Flatten

# Define the model hyperparameters.
kernel_size = 5
stride_size = 2

# Define the model.
model = Sequential()
model.add(Embedding(vocab_size + 1, embedding_dim, input_length=max_words))

model.add(Conv1D(embedding_dim, kernel_size=kernel_size, activation="relu", strides=stride_size))
model.add(MaxPool1D(6))

# model.add(Conv1D(embedding_dim, kernel_size=kernel_size, activation="relu", strides=stride_size))
# model.add(MaxPool1D(5))

model.add(Flatten())
model.add(Dense(embedding_dim, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001)))

model.add(Dense(1, activation="sigmoid", kernel_regularizer=tf.keras.regularizers.l2(0.001)))
    
model.summary()

Model: "sequential_93"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_89 (Embedding)     (None, 300, 32)           96032     
_________________________________________________________________
conv1d_128 (Conv1D)          (None, 148, 32)           5152      
_________________________________________________________________
max_pooling1d_96 (MaxPooling (None, 24, 32)            0         
_________________________________________________________________
flatten_33 (Flatten)         (None, 768)               0         
_________________________________________________________________
dense_54 (Dense)             (None, 32)                24608     
_________________________________________________________________
dense_55 (Dense)             (None, 1)                 33        
Total params: 125,825
Trainable params: 125,825
Non-trainable params: 0
_________________________________________________________________
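
The output shapes in this summary follow from the usual length formulas for unpadded ("valid") convolution and pooling. A minimal sketch of the arithmetic, using hypothetical helper names and the hyperparameters defined above:

```python
def conv1d_length(n, kernel, stride):
    # Number of window positions without padding.
    return (n - kernel) // stride + 1

def pool1d_length(n, pool):
    # Pooling windows do not overlap: the stride equals the pool size.
    return (n - pool) // pool + 1

steps = conv1d_length(300, kernel=5, stride=2)  # 148
pooled = pool1d_length(steps, pool=6)           # 24
flat = pooled * 32                              # 768 features after Flatten

conv_params = 5 * 32 * 32 + 32                  # kernel * in_channels * filters + biases
dense_params = flat * 32 + 32                   # weights + biases of the hidden layer

print(steps, pooled, flat, conv_params, dense_params)  # 148 24 768 5152 24608
```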

In [208]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

# Define callbacks.
patience = 50
early_stop = EarlyStopping('val_loss', patience=patience)
reduce_lr = ReduceLROnPlateau('val_loss', factor=0.1, patience=int(patience/4), verbose=1)
model_checkpoint = ModelCheckpoint("../models/doc_classif/doc_classif_v2.{epoch:02d}-{val_acc:.2f}.model",
                                   "val_loss", verbose=1, save_best_only=True)
callbacks = [model_checkpoint, early_stop, reduce_lr]
 
model.fit(X_train2, y_train, validation_data=(X_val2, y_val), batch_size=64, epochs=10, callbacks=callbacks)

Train on 30000 samples, validate on 4000 samples
Epoch 1/10
Epoch 00001: val_loss improved from inf to 0.36244, saving model to ../models/doc_classif/doc_classif_v2.01-0.86.model
Epoch 2/10
Epoch 00002: val_loss improved from 0.36244 to 0.34185, saving model to ../models/doc_classif/doc_classif_v2.02-0.86.model
Epoch 3/10
Epoch 00003: val_loss did not improve from 0.34185
Epoch 4/10
Epoch 00004: val_loss did not improve from 0.34185
Epoch 5/10
Epoch 00005: val_loss did not improve from 0.34185
Epoch 6/10
Epoch 00006: val_loss did not improve from 0.34185
Epoch 7/10
Epoch 00007: val_loss did not improve from 0.34185
Epoch 8/10
Epoch 00008: val_loss did not improve from 0.34185
Epoch 9/10
Epoch 00009: val_loss did not improve from 0.34185
Epoch 10/10
Epoch 00010: val_loss did not improve from 0.34185


<tensorflow.python.keras.callbacks.History at 0x1b114f0080>

In [None]:
# model = load_model("../models/doc_classif/doc_classif_v2.01-0.76.model")

score = model.evaluate(X_test2, y_test)

In [None]:
# This is clearly overfitting!
print(f"The model's accuracy is: {score[1]:.3f}")

In [195]:
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[0].output)
intermediate_output = intermediate_layer_model.predict(X_test2[:3])

In [196]:
# Example of a word embedding
word_id = reversed_map[X_test2[0, -1]]
word_embed_2 = intermediate_output[0, -2]

In [197]:
# If the embeddings that the RNN and CNN learned are similar, this should be close to 0. That's not the case!
word_embed_2 - word_embed_1

array([-0.11223558,  0.08183846,  0.01147118, -0.01565666,  0.00983325,
        0.09597315,  0.09680772, -0.02314371,  0.07450233, -0.13953364,
        0.01541835, -0.02070905,  0.10246259,  0.00691553,  0.00536317,
        0.07562193, -0.02554925, -0.05911353, -0.11711088,  0.00868019,
        0.02955585,  0.10348634, -0.0146784 ,  0.0522786 , -0.01883752,
        0.0121856 , -0.01050882, -0.0756413 ,  0.07307447, -0.01438064,
        0.07249837,  0.00597489], dtype=float32)
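
A caveat on this comparison: two embedding layers trained from different random initializations need not share a coordinate system, so a large element-wise difference does not by itself mean that they encode different information. One alternative, sketched below with toy 2-D vectors that are purely illustrative, is to compare the similarity between pairs of words within each model. Here the second "model" is simply the first rotated by 90 degrees: the coordinates differ, yet the cosine similarity between the two words is unchanged.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings for two words in "model 1".
word_a_1, word_b_1 = [1.0, 0.0], [0.6, 0.8]
# The same two words in "model 2": the whole space rotated by 90 degrees.
word_a_2, word_b_2 = [0.0, 1.0], [-0.8, 0.6]

# The element-wise difference for word a is far from zero...
print([x2 - x1 for x1, x2 in zip(word_a_1, word_a_2)])
# ...but the relationship between the two words is the same in both models.
print(cosine_similarity(word_a_1, word_b_1), cosine_similarity(word_a_2, word_b_2))
```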

## V2: Training an embedding layer using GloVe
Same as **V1** except that the embeddings are trained using GloVe. These are then fed to the RNN.

## V3a: Simple model + pretrained embeddings
Same as **V1** except that pretrained GloVe embeddings are used.

## V3b: Simple model + pretrained embeddings
Same as **V1** except that pretrained ELMo embeddings are used.