# Document Classification
I'm interested in the task of _document classification_. In this notebook I try out several techniques for document classification using the [IMDB Large Movie Review dataset](https://ai.stanford.edu/~amaas/data/sentiment/). Each row in the dataset consists of a written movie review along with the rating associated with the review (1 - 10). The dataset defines a positive review as one rated 7 or higher and a negative review as one rated 4 or lower; reviews with neutral ratings (5 - 6) are not included. I've reformulated this task as _sentiment classification_ by assigning a positive/negative label to each review.

After the normal data downloading and cleaning phase I experiment with a few different strategies:
- V1: learning an embedding layer
- V2: training GloVe embeddings on the dataset
- V3a: using pretrained embeddings (???)
- V3b: using pretrained embeddings (Google universal embeddings)
- V4: a more complicated deep learning architecture: embed sentences and documents, then use a CNN over documents

## Imports

In [1]:
import os
import re
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.4.1
Eager mode:  True
GPU is NOT AVAILABLE


## Load the data

In [2]:
! ls ../data/aclImdb/train

labeledBow.feat pos             unsupBow.feat   urls_pos.txt
neg             unsup           urls_neg.txt    urls_unsup.txt


In [3]:
# Load data and assign to the correct class (positive or negative)
# Ignore the pre-assigned train/test split

data = []
data_dirs = ["../data/aclImdb/train/pos", "../data/aclImdb/test/pos", "../data/aclImdb/train/neg", "../data/aclImdb/test/neg"]

for dir_ in data_dirs:
    # The last path component ("pos" or "neg") determines the label.
    class_ = 1 if os.path.basename(dir_) == "pos" else 0

    for file in os.listdir(dir_):
        with open(os.path.join(dir_, file), "r", encoding="utf-8") as f:
            lines = f.read()

        # rating = int(file.split("_")[1].split(".")[0]) # we won't actually use this for the sentiment analysis task
        data.append([class_, lines])
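
The commented-out `rating` line hints at how the star rating is encoded in each filename. As a minimal sketch, assuming filenames follow the dataset's `<id>_<rating>.txt` pattern and the dataset's published thresholds (4 or lower is negative, 7 or higher is positive), the rating could be parsed and mapped as follows. `parse_rating` and `rating_to_class` are hypothetical helper names, not part of this notebook.

```python
import os

def parse_rating(filename):
    """Return the integer rating encoded in a name like '200_8.txt'."""
    stem = os.path.splitext(os.path.basename(filename))[0]  # e.g. '200_8'
    return int(stem.split("_")[1])

def rating_to_class(rating):
    """Map a 1-10 rating to 1 (positive) or 0 (negative); None if neutral."""
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None

print(parse_rating("200_8.txt"), rating_to_class(parse_rating("200_8.txt")))
```

Splitting off the extension before converting keeps two-digit ratings such as 10 intact.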

In [4]:
df = pd.DataFrame(data, columns=["class", "text"])

In [5]:
# Look at some bad reviews
pd.set_option('display.max_colwidth', None)
df[df["class"] == 0].head(3)



Unnamed: 0,class,text
25000,0,"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form."
25001,0,"Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question ""why in Gods name would they create another one of these dumpster dives of a movie?"" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we're from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well."
25002,0,"Ouch! This one was a bit painful to sit through. It has a cute and amusing premise, but it all goes to hell from there. Matthew Modine is almost always pedestrian and annoying, and he does not disappoint in this one. Deborah Kara Unger and John Neville turned in surprisingly decent performances. Alan Bates and Jennifer Tilly, among others, played it way over the top. I know that's the way the parts were written, and it's hard to blame actors, when the script and director have them do such schlock. If you're going to have outrageous characters, that's OK, but you gotta have good material to make it work. It didn't here. Run away screaming from this movie if at all possible."


In [6]:
# Good reviews.
df[df["class"] == 1].head(3)

Unnamed: 0,class,text
0,1,"For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan ""The Skipper"" Hale jr. as a police Sgt."
1,1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of ""Rosemary's Baby"" and ""The Exorcist""--but what a combination! Based on the best-seller by Jeffrey Konvitz, ""The Sentinel"" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****"
2,1,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie."


## Clean the data

In [7]:
def pre_process_document(document):
    # Normalize the text.
    document = document.lower()

    # Remove everything except letters, digits, and whitespace.
    document = re.sub(r"[^a-zA-Z0-9\s]", "", document)
    
    # Collapse runs of spaces into a single space.
    document = re.sub(r" +", r" ", document)
    
    # Strip leading and trailing whitespace.
    document = document.strip()
    
    return document
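
The raw reviews contain HTML line breaks (`<br />`), and removing punctuation alone leaves a stray `br` token behind. A possible variant, offered as a sketch rather than the cleaning used for the results in this notebook, replaces those tags with a space first. `pre_process_document_html` is a hypothetical name.

```python
import re

def pre_process_document_html(document):
    # Normalize the text.
    document = document.lower()
    # Replace HTML line breaks such as <br /> or <br> with a space.
    document = re.sub(r"<br\s*/?>", " ", document)
    # Remove everything except letters, digits, and whitespace.
    document = re.sub(r"[^a-z0-9\s]", "", document)
    # Collapse runs of whitespace into a single space and trim the ends.
    document = re.sub(r"\s+", " ", document).strip()
    return document

sample = "Great movie!<br /><br />Loved it."
print(pre_process_document_html(sample))  # great movie loved it
```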

In [8]:
df["text"] = df["text"].apply(pre_process_document)

In [9]:
df.head(3)

Unnamed: 0,class,text
0,1,for a movie that gets no respect there sure are a lot of memorable quotes listed for this gem imagine a movie where joe piscopo is actually funny maureen stapleton is a scene stealer the moroni character is an absolute scream watch for alan the skipper hale jr as a police sgt
1,1,bizarre horror movie filled with famous faces but stolen by cristina raines later of tvs flamingo road as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the gateway to hell the scenes with raines modeling are very well captured the mood music is perfect deborah raffin is charming as cristinas pal but when raines moves into a creepy brooklyn heights brownstone inhabited by a blind priest on the top floor things really start cooking the neighbors including a fantastically wicked burgess meredith and kinky couple sylvia miles beverly dangelo are a diabolical lot and eli wallach is great fun as a wily police detective the movie is nearly a crosspollination of rosemarys baby and the exorcistbut what a combination based on the bestseller by jeffrey konvitz the sentinel is entertainingly spooky full of shocks brought off well by director michael winner who mounts a thoughtfully downbeat ending with skill 12 from
2,1,a solid if unremarkable film matthau as einstein was wonderful my favorite part and the only thing that would make me go out of my way to see this again was the wonderful scene with the physicists playing badmitton i loved the sweaters and the conversation while they waited for robbins to retrieve the birdie


## Train/val/test split

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# 60% train; the remaining 40% is split into 8% validation and 32% test.
train, test_val = train_test_split(df, test_size=0.4)
val, test = train_test_split(test_val, test_size=0.8)

In [12]:
train.shape, val.shape, test.shape

((30000, 2), (4000, 2), (16000, 2))

In [13]:
X_train, X_val, X_test = train["text"], val["text"], test["text"]
y_train, y_val, y_test = train["class"], val["class"], test["class"]
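
The split above is not seeded, so the rows in each partition change between runs. Below is a minimal sketch of a reproducible 60/8/32 split, assuming a fixed seed is acceptable. It works on row indices with only the standard library so it runs without the dataset; `split_indices` is a hypothetical helper. With scikit-learn, passing `random_state` to `train_test_split` would serve the same purpose.

```python
import random

def split_indices(n_rows, seed=0):
    """Shuffle row indices with a fixed seed and cut them 60% / 8% / 32%."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = n_rows * 60 // 100
    n_val = n_rows * 8 // 100
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(50000)
print(len(train_idx), len(val_idx), len(test_idx))  # 30000 4000 16000
```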

## Vectorize text

In [14]:
vocab = []
vocab_size = 3000

# Stanford was nice enough to give us the vocab.
with open("../data/aclImdb/imdb.vocab", "r") as f:
    for line in f:
        vocab.append(line.strip())

# Map each vocabulary item to an integer between 1 and vocab_size
vocab_vector_map = {}
for index, word in enumerate(vocab[:vocab_size]): # only use the vocab_size (3000) most common words
    vocab_vector_map[word] = index + 1 # reserve 0 for padding
    
def map_vectors(reviews):
    mapped_reviews = []
    
    for review in reviews:
        # Words outside the vocabulary are dropped.
        mapped_review = [vocab_vector_map[word] for word in review.split(" ") if word in vocab_vector_map]
        mapped_reviews.append(mapped_review)
    return mapped_reviews
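
To make the mapping concrete, here is a small self-contained sketch with a made-up four-word vocabulary (`toy_vocab`, `toy_map`, and `encode` are illustrative names). Index 0 stays reserved for padding, and out-of-vocabulary words are dropped, as in `map_vectors`.

```python
toy_vocab = ["the", "movie", "was", "great"]  # hypothetical, most frequent first
toy_map = {word: index + 1 for index, word in enumerate(toy_vocab)}

def encode(review, word_to_id):
    """Map each known word to its integer id; unknown words are skipped."""
    return [word_to_id[word] for word in review.split(" ") if word in word_to_id]

print(encode("the movie was surprisingly great", toy_map))  # [1, 2, 3, 4]
```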

In [15]:
# Map the words in each document to its numerical representation.
from tensorflow.keras.preprocessing import sequence

max_words = 300
X_train2 = sequence.pad_sequences(map_vectors(X_train), maxlen=max_words)
X_val2 = sequence.pad_sequences(map_vectors(X_val), maxlen=max_words)
X_test2 = sequence.pad_sequences(map_vectors(X_test), maxlen=max_words)

y_train = np.array(y_train)
y_val = np.array(y_val)
y_test = np.array(y_test)
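
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so long reviews keep their last `max_words` tokens. The pure-Python sketch below mimics that default behaviour for illustration; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 1, 2, 3], [3, 4, 5, 6]]
```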

In [16]:
# Peek at an encoded document
X_train2[0]

array([  19, 1273,   11,  216,    8,  161,   14,  447,    1,    6,   80,
        108,    8,  321, 2179,    2,    1,  125,   38,    6,    1,  103,
         33,  445, 2996,   31,    1,  330,    4,    3,   40,   13,  249,
          8,    8,  321,    2,    1,  321,    7,   67,  543,    3,  481,
        527,   37,   19,   72, 1568,   17,    8,    1,  202,  930,    1,
       2052,  769,  401,    1, 1277,   29,   23,    1,  212, 2912,  712,
         12,  109,   31,  470,   10,    6,   85, 1459,   22,   59,  138,
         19,    5,  301,   13,   27,    4,    1,   43,   20,    1,    4,
          1,  427,    1, 2064, 2054,  123,   22,    6,   15, 2098,   34,
         59,   26,   19,  942,    2,    1, 2064,   38,    6,   48,   84,
       2725,  223,    8,    1,  376,    8,   27,  316,    1, 1682, 1092,
          4,    3,  103,    6,  596,   13,   22,    8,  998,    4,    3,
       2848, 2536, 2961,    1,  168,  103,   33,    6,   32, 1615,   44,
       1135, 1299,   15,    1, 2274,   19,    1, 16

In [17]:
# Make sure the transformation was correct.
reversed_map = {v: k for k, v in vocab_vector_map.items()}
reversed_map[0] = "0"

decoded = [reversed_map[i] for i in X_train2[0]]
" ".join(decoded)

'on themes that almost in work for example the is also seen in black angel and the man there is the character who becomes mentally by the death of a or as found in in black and the black it can leave a viewer feeling like on well ground but in the right hands the plots sorry dialogue the narrative all are the point fortunately lady was being by sound this is first noir he would go on to himself as one of the if not the of the style the killers cross here he is with woody they would be on christmas and the killers there is some great storytelling done in the camera in one shot the mental state of a character is shown as he in front of a mirror multiple personalities the same character who is an artist has van self with the hanging on the wall in his apartment but what and are really doing in lady is practically creating the look for noir released very early in its all here the the of atmosphere and the the versions of reality when scott in prison the of light etc it is a terrific pictur

## V1: Simple model (RNN)
LSTM with an embedding layer. Each document is reduced to 300 words; documents with fewer words are zero-padded. An embedding layer of 32 dimensions is added, so this layer's output is (data, 300, 32). This is then fed to an LSTM with an output of (data, 100). Then a dense layer with shape (data, 1) is added (binary classification). The word embeddings are learned during training, so during classification the embedding layer serves as a pre-processor. In `V3a` and `V3b` I will use pretrained embeddings to see if performance changes.

In [18]:
# Define the simple model with our custom embedding layer.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

embedding_dim = 32

model = Sequential()
model.add(Embedding(vocab_size + 1, embedding_dim, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 32)           96032     
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 149,333
Trainable params: 149,333
Non-trainable params: 0
_________________________________________________________________
None
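
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are illustrative.

```python
vocab_size, embedding_dim, lstm_units = 3000, 32, 100

# Embedding: one row per token id, including the padding id 0.
embedding_params = (vocab_size + 1) * embedding_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 96032 53200 101
print(embedding_params + lstm_params + dense_params)   # 149333
```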


In [19]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [20]:
# Define callbacks.
# Note: with only 5 epochs, a patience of 50 (and 12 for the learning-rate schedule)
# means early stopping and learning-rate reduction never trigger in this run.
patience = 50
early_stop = EarlyStopping(monitor='val_loss', patience=patience)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=int(patience/4), verbose=1)
model_checkpoint = ModelCheckpoint("../models/doc_classif/doc_classif_v1.{epoch:02d}-{val_accuracy:.2f}.model",
                                   monitor="val_loss", verbose=1, save_best_only=True)
callbacks = [model_checkpoint, early_stop, reduce_lr]
 
model.fit(X_train2, y_train, validation_data=(X_val2, y_val), batch_size=64, epochs=5, callbacks=callbacks)

Epoch 1/5

Epoch 00001: val_loss improved from inf to 0.36160, saving model to ../models/doc_classif/doc_classif_v1.01-0.85.model




INFO:tensorflow:Assets written to: ../models/doc_classif/doc_classif_v1.01-0.85.model/assets




Epoch 2/5

Epoch 00002: val_loss improved from 0.36160 to 0.29753, saving model to ../models/doc_classif/doc_classif_v1.02-0.88.model




INFO:tensorflow:Assets written to: ../models/doc_classif/doc_classif_v1.02-0.88.model/assets




Epoch 3/5

Epoch 00003: val_loss did not improve from 0.29753
Epoch 4/5

Epoch 00004: val_loss did not improve from 0.29753
Epoch 5/5

Epoch 00005: val_loss did not improve from 0.29753


<tensorflow.python.keras.callbacks.History at 0x7fe9a0fe9040>

In [22]:
! ls ../models/doc_classif/

doc_classif_v1.01-0.85.model doc_classif_v1.02-0.88.model


In [23]:
from tensorflow.keras.models import load_model

# The second epoch of the model (out of 5) was already the best-performing, at 88% validation accuracy.
model = load_model("../models/doc_classif/doc_classif_v1.02-0.88.model")

score = model.evaluate(X_test2, y_test)



In [24]:
print(f"The model's accuracy is: {score[1]:.3f}")

The model's accuracy is: 0.882


### Make sure the embeddings make sense

In [25]:
# Check out what the word embeddings look like
from tensorflow.keras.models import Model

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[0].output)
intermediate_output = intermediate_layer_model.predict(X_test2[:3])

In [26]:
# Example of a word embedding. Use the same position for the word and its vector.
position = -2
word_id = reversed_map[X_test2[0, position]]
word_embed_1 = intermediate_output[0, position]

In [27]:
word_id, word_embed_1

('a',
 array([-0.0428787 ,  0.04246109, -0.04952083, -0.00935871,  0.01690659,
        -0.03302329,  0.02850872, -0.01944215, -0.02780089,  0.01906136,
         0.0510727 , -0.02535792,  0.03279592,  0.02701469, -0.02171413,
         0.00376951, -0.03193901,  0.00115354, -0.00207962, -0.0245267 ,
         0.03997526,  0.00283179, -0.01656234,  0.01373642,  0.01304472,
         0.02795264, -0.04893907,  0.05676555, -0.03450371, -0.00905003,
        -0.02006874,  0.00302825], dtype=float32))

## V2: Simple Model (CNN)
Let's see if I can get a higher accuracy with a CNN.

In [28]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, ReLU, MaxPool1D, Dense, Flatten

# Define the model hyperparameters.
kernel_size = 5
stride_size = 2

# Define the model.
model = Sequential()
model.add(Embedding(vocab_size + 1, embedding_dim, input_length=max_words))

model.add(Conv1D(embedding_dim, kernel_size=kernel_size, activation="relu", strides=stride_size))
model.add(MaxPool1D(6))

# model.add(Conv1D(embedding_dim, kernel_size=kernel_size, activation="relu", strides=stride_size))
# model.add(MaxPool1D(5))

model.add(Flatten())
model.add(Dense(embedding_dim, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001)))

model.add(Dense(1, activation="sigmoid", kernel_regularizer=tf.keras.regularizers.l2(0.001)))
    
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 32)           96032     
_________________________________________________________________
conv1d (Conv1D)              (None, 148, 32)           5152      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 24, 32)            0         
_________________________________________________________________
flatten (Flatten)            (None, 768)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                24608     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 125,825
Trainable params: 125,825
Non-trainable params: 0
________________________________________________
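
The output shapes and parameter counts in this summary follow from the layer settings. The sketch below reproduces them with plain arithmetic, assuming the Keras defaults of `'valid'` padding for `Conv1D` and `MaxPool1D` and a pooling stride equal to the pool size; the variable names are illustrative.

```python
max_words, embedding_dim, vocab_size = 300, 32, 3000
kernel_size, stride, pool_size = 5, 2, 6

# Conv1D with 'valid' padding: floor((length - kernel) / stride) + 1 steps.
conv_len = (max_words - kernel_size) // stride + 1
# MaxPool1D with 'valid' padding and stride equal to the pool size.
pool_len = conv_len // pool_size
flat_dim = pool_len * embedding_dim

embedding_params = (vocab_size + 1) * embedding_dim
conv_params = kernel_size * embedding_dim * embedding_dim + embedding_dim
dense1_params = flat_dim * embedding_dim + embedding_dim
dense2_params = embedding_dim + 1

print(conv_len, pool_len, flat_dim)  # 148 24 768
print(embedding_params + conv_params + dense1_params + dense2_params)  # 125825
```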

In [29]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

# Define callbacks.
# Note: with only 10 epochs, a patience of 50 (and 12 for the learning-rate schedule)
# means early stopping and learning-rate reduction never trigger in this run.
patience = 50
early_stop = EarlyStopping(monitor='val_loss', patience=patience)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=int(patience/4), verbose=1)
model_checkpoint = ModelCheckpoint("../models/doc_classif/doc_classif_v2.{epoch:02d}-{val_accuracy:.2f}.model",
                                   monitor="val_loss", verbose=1, save_best_only=True)
callbacks = [model_checkpoint, early_stop, reduce_lr]
 
model.fit(X_train2, y_train, validation_data=(X_val2, y_val), batch_size=64, epochs=10, callbacks=callbacks)

Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.33608, saving model to ../models/doc_classif/doc_classif_v2.01-0.86.model
INFO:tensorflow:Assets written to: ../models/doc_classif/doc_classif_v2.01-0.86.model/assets




Epoch 2/10

Epoch 00002: val_loss improved from 0.33608 to 0.33048, saving model to ../models/doc_classif/doc_classif_v2.02-0.86.model
INFO:tensorflow:Assets written to: ../models/doc_classif/doc_classif_v2.02-0.86.model/assets




Epoch 3/10

Epoch 00003: val_loss did not improve from 0.33048
Epoch 4/10

Epoch 00004: val_loss did not improve from 0.33048
Epoch 5/10

Epoch 00005: val_loss did not improve from 0.33048
Epoch 6/10

Epoch 00006: val_loss did not improve from 0.33048
Epoch 7/10

Epoch 00007: val_loss did not improve from 0.33048
Epoch 8/10

Epoch 00008: val_loss did not improve from 0.33048
Epoch 9/10

Epoch 00009: val_loss did not improve from 0.33048
Epoch 10/10

Epoch 00010: val_loss did not improve from 0.33048


<tensorflow.python.keras.callbacks.History at 0x7fe991a86790>

In [30]:
score = model.evaluate(X_test2, y_test)



In [31]:
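# This evaluates the final-epoch weights, not the best checkpoint.
# The validation loss stopped improving after epoch 2, which points to overfitting.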
print(f"The model's accuracy is: {score[1]:.3f}")

The model's accuracy is: 0.842


In [32]:
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.layers[0].output)
intermediate_output = intermediate_layer_model.predict(X_test2[:3])

In [33]:
# Example of a word embedding. Use the same position for the word and its vector.
position = -2
word_id = reversed_map[X_test2[0, position]]
word_embed_2 = intermediate_output[0, position]

In [34]:
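# If the RNN and the CNN had learned similar embeddings, this difference would be close to 0. That's not the case!
# This is not surprising: each model is trained separately and learns its own embedding space.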
word_embed_2 - word_embed_1

array([ 0.07320897, -0.05457532,  0.05022319,  0.1117916 , -0.00034211,
       -0.04083229, -0.0026621 ,  0.02942782, -0.0960895 , -0.1511858 ,
       -0.05323165,  0.03701774,  0.01807283, -0.06696399,  0.02081559,
        0.0259468 , -0.01137567,  0.0957579 ,  0.0161357 ,  0.05464319,
       -0.0994353 , -0.00317309, -0.04847529, -0.00262528, -0.14018223,
       -0.07103251,  0.02541416, -0.1559343 ,  0.13609698,  0.03910544,
       -0.03721023, -0.0457418 ], dtype=float32)
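
A raw difference between vectors from two independently trained models is hard to interpret, because each model is free to arrange its embedding space differently. One alternative, sketched here under that assumption, is to compare similarity structure within each model, for example the cosine similarity between pairs of words. The vectors below are made up for illustration and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings for two words in one model.
good = [0.2, -0.1, 0.4]
great = [0.4, -0.2, 0.8]   # same direction as `good`, twice the length
print(round(cosine_similarity(good, great), 6))  # 1.0
```

If two models rank the same word pairs as similar, their embeddings agree in structure even when the raw coordinates differ.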

## V2: Training an embedding layer using GloVe
Same as **V1** except that the embeddings are trained with GloVe on this dataset. These are then fed to the RNN.

## V3a: Simple model + pretrained embeddings
Same as **V1** except that pretrained GloVe embeddings are used.
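
As a rough sketch of how pretrained vectors could be wired in: GloVe text files store one word per line followed by its vector components, separated by spaces. The helper below (a hypothetical `build_embedding_matrix`) turns such lines into a matrix whose row `i` holds the vector for word id `i`, leaving zeros for padding and for words missing from the file. It assumes word ids run from 1 to the vocabulary size, as in `vocab_vector_map`. The sample lines are made up and are not real GloVe values.

```python
def build_embedding_matrix(glove_lines, word_to_id, dim):
    """Return a (len(word_to_id) + 1) x dim matrix as a list of rows."""
    matrix = [[0.0] * dim for _ in range(len(word_to_id) + 1)]  # row 0 = padding
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word_to_id and len(values) == dim:
            matrix[word_to_id[word]] = [float(v) for v in values]
    return matrix

# Made-up stand-ins for lines of a GloVe file and for vocab_vector_map.
sample_lines = ["the 0.1 0.2 0.3", "movie -0.4 0.5 0.6", "unused 0.7 0.8 0.9"]
sample_map = {"the": 1, "movie": 2, "great": 3}

matrix = build_embedding_matrix(sample_lines, sample_map, dim=3)
print(matrix[2])  # [-0.4, 0.5, 0.6]
```

Converted with `np.array(matrix)`, this could initialize the `Embedding` layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(...)` together with `trainable=False` to keep the pretrained vectors fixed.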

## V3b: Simple model + pretrained embeddings
Same as **V1** except that pretrained ELMo embeddings are used.