# Short text classification bake-off!

Blog posts showing off deep learning approaches for some task are now a dime a dozen. One thing I *haven't* come across too much is blog posts comparing "standard" approaches to text classification.

In this notebook, I'm going to compare logistic regression on bag-of-ngrams features (a standard industry workhorse for text classification) with a ConvNet (the "new kid" baseline) on a simple Twitter sentiment analysis task.

*N.B.* This work was inspired by some experiments I was doing at work on search query classification. I can't share that data, but several Twitter sentiment datasets are freely available, so I'm using one of those here. The notebook is basically structured as a semi-narrated, directed exploration.

## The data

We're fetching a set of labeled Twitter sentiment analysis data from http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

In [23]:
import requests
import os

url = "http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip"
f_name = os.path.basename(url)
r = requests.get(url, stream=True)
with open(f_name, "wb") as f_out:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f_out.write(chunk)

Let's put that in a `data` subdirectory and unzip so we can take a peek.

In [28]:
import errno
# check if the directory exists and handle the relevant OSError (thanks StackOverflow!)
path = "./data"
try:
    os.makedirs(path)
except OSError as exc:
    if exc.errno == errno.EEXIST and os.path.isdir(path):
        pass
    else:
        raise

# extract the zip
from zipfile import ZipFile
with ZipFile("Sentiment-Analysis-Dataset.zip") as myzip:
    myzip.extractall(path)

In [29]:
ls data/

Sentiment Analysis Dataset.csv


OK, let's see what kind of file it is, and what it looks like...

In [35]:
!file data/Sentiment\ Analysis\ Dataset.csv

data/Sentiment Analysis Dataset.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators


Ugh, it's got a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark). Why?! 

In [30]:
!head data/Sentiment\ Analysis\ Dataset.csv

﻿ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140,                     is so sad for my APL friend.............
2,0,Sentiment140,                   I missed the New Moon trailer...
3,1,Sentiment140,              omg its already 7:30 :O
4,0,Sentiment140,          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
5,0,Sentiment140,         i think mi bf is cheating on me!!!       T_T
6,0,Sentiment140,         or i just worry too much?        
7,1,Sentiment140,       Juuuuuuuuuuuuuuuuussssst Chillin!!
8,0,Sentiment140,       Sunny Again        Work Tomorrow  :-|       TV Tonight
9,1,Sentiment140,      handed in my uniform today . i miss you already


OK, it's basic CSV; let's try to load it up the naive way and hope that there are no stray commas in the final field.

In [62]:
import csv

f_in = open("data/Sentiment Analysis Dataset.csv")
reader = csv.reader(f_in, delimiter=',', quoting=csv.QUOTE_MINIMAL)
header = reader.next()
header

['\xef\xbb\xbfItemID', 'Sentiment', 'SentimentSource', 'SentimentText']
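The `\xef\xbb\xbf` glued to the front of `ItemID` is the BOM leaking into the first header field. Here is a minimal sketch of one way to avoid that, assuming Python 3 (the cells in this notebook are Python 2) and a tiny in-memory stand-in for the real file. Decoding with the `utf-8-sig` codec drops a leading BOM if one is present.

```python
import csv
import io

# Stand-in for the first bytes of the real file: BOM + CRLF-terminated rows.
raw = (b"\xef\xbb\xbfItemID,Sentiment,SentimentSource,SentimentText\r\n"
       b"1,0,Sentiment140,is so sad\r\n")

# "utf-8-sig" strips a leading BOM while decoding;
# newline="" leaves the CRLF handling to the csv module.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as f_in:
    reader = csv.reader(f_in)
    header = next(reader)
    first_row = next(reader)

print(header)
```

For a file on disk, the equivalent would be `open(path, encoding="utf-8-sig", newline="")`.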

In [63]:
all_data = []
for row in reader:
    label = int(row[1])
    tweet = row[-1]
    all_data.append((tweet, label))

len(all_data)

1578614
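On the comma worry: `csv.reader` copes with commas inside *quoted* fields, but an unquoted comma in a tweet would split it into extra columns, and `row[-1]` would then keep only the last fragment. Here is a hedged sketch (Python 3, made-up rows) of a loader that rejoins everything after the third column. Whether the real file contains such rows isn't something I've checked here.

```python
import csv
import io

# Made-up rows for illustration: one quoted tweet with a comma, one unquoted.
sample = io.StringIO(
    'ItemID,Sentiment,SentimentSource,SentimentText\n'
    '1,0,Sentiment140,"sad, very sad"\n'
    '2,1,Sentiment140,omg, its already 7:30\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header

rows = []
for row in reader:
    label = int(row[1])
    # Columns 0-2 are fixed; anything from column 3 onward belongs to the tweet.
    tweet = ",".join(row[3:])
    rows.append((tweet, label))

print(rows)
```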

OK, about 1.6M data points; nice. Now let's shuffle it, then split into train/dev/test sets.

In [64]:
from random import shuffle

shuffle(all_data)
VAL_SPLIT = 0.1
nb_val_samples = int(VAL_SPLIT * len(all_data))
# doing an 80/10/10 split
train_data = all_data[:-2*nb_val_samples]
dev_data = all_data[-2*nb_val_samples:-nb_val_samples]
test_data = all_data[-nb_val_samples:]

len(train_data)
len(dev_data)
len(test_data)
assert len(all_data) == len(train_data) + len(dev_data) + len(test_data)
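`shuffle` above draws from the global RNG, so every run produces a different split. Here is a small sketch of the same 80/10/10 slicing with a seeded `random.Random`. The seed value and the `split_data` name are arbitrary choices for illustration.

```python
from random import Random


def split_data(items, val_split=0.1, seed=13):
    """Shuffle a copy of items reproducibly and return (train, dev, test).

    Assumes there are enough items for int(val_split * len(items)) >= 1.
    """
    items = list(items)
    Random(seed).shuffle(items)
    n_val = int(val_split * len(items))
    train = items[:-2 * n_val]
    dev = items[-2 * n_val:-n_val]
    test = items[-n_val:]
    return train, dev, test


train, dev, test = split_data(range(1000))
print(len(train), len(dev), len(test))
```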

Final peek at the raw training data to make sure it mostly looks like what we expect...

In [65]:
from random import sample
sample(train_data, 5)

[('tonight was amazing, but I miss the good old days ', 0),
 ("@volupty it's only cool because tomorrow my blond cousin comes  and his blue eyes too :p",
  1),
 ('@mrskutcher definatley agree  blew flawless out of the runnings with that performance!!',
  1),
 ('@newO_nyboR yeahh haha. Hope u feel better sooon ', 1),
 ('just got off the phone with my Daddy Doodle. I miss him.  http://plurk.com/p/114ons',
  0)]

*N.B.* Given the approach I'm going to take below, we're going to replace all @-mentions with a fixed string (e.g. `"AT_MENTION"`).

In [82]:
# start by separating inputs and labels, then do the @-mention replacements
train_inputs, train_labels = zip(*train_data)


In [83]:
import re
X_train = []
# this is *very* coarse and will surely match things that we don't want it to
mention_patt = re.compile(ur'@\w+', re.UNICODE)
for item in train_inputs:
    try:
        X_train.append(mention_patt.sub("AT_MENTION", item.decode("utf8")))
    except Exception:
        print item
        break

OK, looks like that worked. Onward!
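To make the "very coarse" caveat concrete, here's a quick sketch (Python 3 syntax, made-up tweet) of a case where `@\w+` over-matches, next to a hypothetical stricter variant that skips an `@` preceded by a word character. The stricter pattern is only an illustration; it is not what the rest of this notebook uses.

```python
import re

coarse = re.compile(r"@\w+")
# Hypothetical stricter variant: ignore "@" that directly follows a word character.
stricter = re.compile(r"(?<!\w)@\w+")

tweet = "@volupty mail me at bob@example.com"
coarse_out = coarse.sub("AT_MENTION", tweet)
stricter_out = stricter.sub("AT_MENTION", tweet)

print(coarse_out)
print(stricter_out)
```

The coarse pattern also rewrites the middle of the email address, while the stricter one leaves it alone.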

## First contender: old-school LR+BoW

[Insert clever/insightful stuff about LR + BoW approaches here]

In [96]:
# the model
from sklearn.linear_model import LogisticRegressionCV
# using 2 cores (of my 4) for fitting
lr = LogisticRegressionCV(n_jobs=2)

*N.B.* We're using `LogisticRegressionCV` rather than plain `LogisticRegression` because the former cross-validates the `C` parameter (the inverse regularization strength) for free as part of training.

In [97]:
# the feature extractor
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer="char", lowercase=False, binary=False, ngram_range=(3, 6))

The `"char"` analyzer extracts character-based ngrams, *including across word boundaries*. I'm keeping casing AS-IS because *a priori* I can imagine it being helpful in deciding whether some piece of text has emotional weight. The range of ngram sizes I'm looking at is not justified beyond it having worked for me in the past in other text classification scenarios.
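To show what "across word boundaries" means in practice, here is a pure-Python approximation of character n-gram extraction. It is a sketch, not scikit-learn's code: as far as I know, the real analyzer also collapses runs of whitespace first, which this version skips, and `char_ngrams` is a made-up helper name.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every character n-gram of text for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


trigrams = char_ngrams("so sad", 3, 3)
all_grams = char_ngrams("so sad")
print(trigrams)
```

Note that `'o s'` and `' sa'` straddle the space between the two words, which is exactly the cross-boundary behavior described above.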

In [101]:
# put them together in an end-to-end trainable/callable pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("vec", vec), ("lr", lr)])

# let's fit it on 100k data points and see how long it takes
%time pipeline.fit(X_train[:100000], train_labels[:100000])

CPU times: user 3min 37s, sys: 8.83 s, total: 3min 46s
Wall time: 19min 31s


Pipeline(steps=[('vec', CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(3, 6), preprocessor=None, stop_words=None,
        str...2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0))])

20 minutes...so we can ballpark ~4hrs for the full training set. I've seen reports that DL classification really only comes into its own (metrics-wise) with datasets of size *O(1e6)*...let's see what we can get out of this one.
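The ~4hrs figure can be reproduced with a bit of arithmetic, under the (strong) assumption that fitting time grows linearly with the number of rows. In practice the vocabulary also grows with more data, so treat this as a rough lower-end guess.

```python
total_rows = 1578614               # len(all_data) from above
n_val = int(0.1 * total_rows)      # same rule as nb_val_samples
n_train = total_rows - 2 * n_val   # size of the 80% training slice

wall_seconds_100k = 19 * 60 + 31   # "Wall time: 19min 31s" for 100k rows

# Assumption: wall time scales linearly with the number of training rows.
est_hours = wall_seconds_100k * (n_train / 100000.0) / 3600.0
print(round(est_hours, 1))
```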

## Second contender: new-school ConvNet/CNN
### (where "new" == "already outdated")

The approach we're taking here is a fairly standard one for using ConvNets for sentence classification, without much in the way of bells or whistles. [Kim (2014)](https://arxiv.org/abs/1408.5882) is one of the canonical references for this approach. Our architecture closely resembles the one used there, and pictured here (source: Denny Britz's [blog post on CNNs for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/); one of the clearest expositions of this I've come across):

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png" alt="ConvNet for sentence classification" style="width: 800px;"/>

**N.B.** The principal difference in the approach taken here is that I'll be using *character*-based embeddings (trained from scratch), rather than word embeddings, as shown above (and used in Kim 2014). Mostly this is because I used char ngrams for the BoNG+LR classifier.

In [99]:
# lots of stuff to import
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Activation, concatenate, Embedding, Conv1D, GlobalMaxPooling1D
from keras.utils.np_utils import *
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import *

Using Theano backend.


**N.B.** I'm using the _CPU_ for this. My Mac has a GeForce GT 650M, but I recently upgraded CUDA and drivers, and things are a bit borked (w.r.t. getting Keras to see the GPU).

We'll start with defining some useful bits of stuff here, for easier tweaking later on (should it prove necessary), and then do some data munging to get it into the shape that our model needs.

In [120]:
# DEFINE SOME USEFUL STUFF
# some of these vals are cribbed from other experiments with
# other short text classification tasks; YMMV, caveat emptor, yadda yadda...do some experimenting
VALIDATION_SPLIT = 0.1
EMBEDDING_DIM = 32
# max len in first 100k is 317, but very few longer than 256
MAX_SEQUENCE_LENGTH = 256
BATCH_SIZE = 64
# keeping this small(ish) to keep training time down
FILTERS = 50
HIDDEN_DIMS = 250
P_DROPOUT = 0.25
EPOCHS = 2


# using the tokenizer that comes with Keras
# this fxn also maps the input vocab to sequences of integer indices and returns a char-to-index map
def tokenize_texts(texts):
    print('Tokenizing')
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    print('Found %d unique tokens' % (len(word_index),))
    return sequences, word_index


# truncate/pad input sequences so that they're all the same length
def pad_seqs(sequences):
    print('Padding, encoding, train/dev split')
    data = pad_sequences(sequences,
                         maxlen=MAX_SEQUENCE_LENGTH,
                         padding='post',
                         truncating='post')
    return data


sequences, word_index = tokenize_texts(X_train[:100000])
data = pad_seqs(sequences)
# imported as `np` because later cells refer to it by that alias
import numpy as np
labels = np.asarray(train_labels)

Tokenizing
Found 185 unique tokens
Padding, encoding, train/dev split
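For anyone unfamiliar with the `padding='post'` / `truncating='post'` options: both act on the *end* of each sequence. Here is a pure-Python stand-in for the per-sequence behavior. It is a sketch rather than Keras's implementation, and `pad_post` is a made-up name.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Cut seq to maxlen by dropping the tail, then right-pad with pad_value."""
    seq = list(seq)[:maxlen]
    return seq + [pad_value] * (maxlen - len(seq))


short = pad_post([5, 2, 9], maxlen=6)       # shorter than maxlen: zeros appended
long_ = pad_post(range(1, 11), maxlen=6)    # longer than maxlen: tail dropped
print(short, long_)
```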


I'll be using Keras's [functional API](https://keras.io/getting-started/functional-api-guide/). I've only started using it, and already prefer it (plus it's a bit closer to the [PyTorch](http://pytorch.org) approach, which I've finally started playing with). OK, let's define the main building blocks of the model (again, there are design decisions in here that come from previous experimentation, literature-based suggestions, and Twitter-suggested rules of thumb; normally you'd want to use some kind of hyperparam optimization for this).

In [114]:
x = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
# index 0 is reserved for padding (the tokenizer's indices start at 1), hence the + 1
emb = Embedding(len(word_index) + 1,
                EMBEDDING_DIM,
                input_length=MAX_SEQUENCE_LENGTH,
                trainable=True)(x)
# we'll set up convolutions of size 3, 4, 5, and 6, like the ngram_range we used above,
# do max-pooling on each, then concatenate the output before running it through a dense layer
conv_3 = Conv1D(FILTERS, 3, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_3 = GlobalMaxPooling1D()(conv_3)
conv_4 = Conv1D(FILTERS, 4, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_4 = GlobalMaxPooling1D()(conv_4)
conv_5 = Conv1D(FILTERS, 5, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_5 = GlobalMaxPooling1D()(conv_5)
conv_6 = Conv1D(FILTERS, 6, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_6 = GlobalMaxPooling1D()(conv_6)
merged = concatenate([maxpool_3, maxpool_4, maxpool_5, maxpool_6], axis=-1)
hidden = Dense(HIDDEN_DIMS)(merged)
dropout = Dropout(P_DROPOUT)(hidden)
dropout = Activation("relu")(dropout)
out = Dense(1, activation="sigmoid")(dropout)
model = Model(inputs=x, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
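As a sanity check on the architecture, the tensor sizes and trainable-parameter count can be worked out by hand from the constants above. This is hand arithmetic under the usual weights-plus-biases formulas, not output from `model.summary()`.

```python
MAX_SEQUENCE_LENGTH = 256
EMBEDDING_DIM = 32
FILTERS = 50
HIDDEN_DIMS = 250
VOCAB_ROWS = 185 + 1  # 185 characters found above, plus the reserved index 0

kernel_sizes = [3, 4, 5, 6]

# A "valid" convolution with stride 1 yields L - k + 1 positions.
conv_lengths = {k: MAX_SEQUENCE_LENGTH - k + 1 for k in kernel_sizes}
# Global max pooling keeps one value per filter, so each branch adds FILTERS features.
merged_dim = FILTERS * len(kernel_sizes)

embedding_params = VOCAB_ROWS * EMBEDDING_DIM
conv_params = sum(k * EMBEDDING_DIM * FILTERS + FILTERS for k in kernel_sizes)
dense_params = merged_dim * HIDDEN_DIMS + HIDDEN_DIMS
output_params = HIDDEN_DIMS * 1 + 1
total_params = embedding_params + conv_params + dense_params + output_params

print(conv_lengths, merged_dim, total_params)
```

So this is a small model by DL standards: well under 100k parameters, with more than half of them in the dense layer.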

Alright; model defined and data ready. Let's see how long it takes to train this bad boy.

In [122]:
%time hist = model.fit(data[:100000], labels[:100000], batch_size=BATCH_SIZE, epochs=EPOCHS)

Epoch 1/2
Epoch 2/2
CPU times: user 35min 18s, sys: 46 s, total: 36min 4s
Wall time: 9min 9s


Half the time to train! Can't say I expected that...now let's see whether these classifiers are any good at their assigned tasks.

## Eval

One of my pet peeves about posts like this is the lack of detail w.r.t. things like system hardware/architecture,
training time, etc. as well as the simplicity of the eval metrics (usually just raw accuracy).

*Blah blah, more to say here about that...*

Need to munge the dev data the same way we did the training data, for both the LR and CNN use cases (remove @-mentions, turn into ndarrays, sequencify, etc.).

In [137]:
dev_inputs, dev_labels = zip(*dev_data)

X_dev = []
for item in dev_inputs:
    try:
        X_dev.append(mention_patt.sub("AT_MENTION", item.decode("utf8")))
    except Exception:
        print item
        break

y_dev = np.asarray(dev_labels[:10000])
y_dev.shape

(10000,)

In [128]:
from sklearn.metrics import classification_report

%time print classification_report(y_dev[:10000], pipeline.predict(X_dev[:10000]), digits=3)

             precision    recall  f1-score   support

          0      0.801     0.790     0.796      5010
          1      0.792     0.803     0.798      4990

avg / total      0.797     0.797     0.797     10000

CPU times: user 3.07 s, sys: 61.4 ms, total: 3.13 s
Wall time: 3.13 s
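For reference, the per-class numbers in a report like this come from simple counts of true positives, false positives and false negatives. Here is a library-free sketch on made-up labels. `per_class_prf` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_prf(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
p1, r1, f1 = per_class_prf(y_true, y_pred, positive=1)
print(p1, r1, f1)
```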


In [135]:
dev_sequences, _ = tokenize_texts(X_dev[:10000])
X_dev_cnn = pad_seqs(dev_sequences)

Tokenizing
Found 168 unique tokens
Padding, encoding, train/dev split
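One thing worth flagging: `tokenize_texts` fits a fresh `Tokenizer` on whatever it is given, and the dev call reports 168 tokens against 185 for training, so a given character is not guaranteed the same index in both. If the indices differ, the embedding rows learned in training get looked up with the wrong keys at eval time, which may depress the CNN scores. Here is a library-free sketch of the alternative: fit the index once on training text, then reuse it everywhere. `fit_char_index` and `encode` are hypothetical helpers, and sending unseen characters to a dedicated index is one design choice among several.

```python
from collections import Counter


def fit_char_index(texts):
    """Map each character to an index >= 2, most frequent first.

    Index 0 is left for padding and index 1 for unseen characters.
    """
    counts = Counter(ch for text in texts for ch in text)
    # Sort by descending count, then by character, so ties are deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 2 for i, ch in enumerate(ordered)}


def encode(texts, char_index, unseen=1):
    """Encode texts with an already-fitted index; never refits."""
    return [[char_index.get(ch, unseen) for ch in text] for text in texts]


train_texts = ["aab", "abc"]   # toy stand-ins for X_train
dev_texts = ["cab", "abz"]     # toy stand-ins for X_dev; "z" is unseen
char_index = fit_char_index(train_texts)
dev_encoded = encode(dev_texts, char_index)
print(char_index, dev_encoded)
```

With this particular scheme the embedding layer would need `len(char_index) + 2` rows.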


In [155]:
%time print classification_report(y_dev[:10000], np.round(model.predict(X_dev_cnn[:10000])).astype("int32"), digits=3)

             precision    recall  f1-score   support

          0      0.785     0.711     0.746      5010
          1      0.735     0.804     0.768      4990

avg / total      0.760     0.757     0.757     10000

CPU times: user 33.5 s, sys: 327 ms, total: 33.8 s
Wall time: 8.5 s


Hmm...four points lower on f-score, AND almost 3 times slower in prediction. We'll need something better than this to take to the boss if we want to convince him to let us play with DL stuff.
Let's try training for twice as long and see what comes of that. If that doesn't help, we can always try with more data...

In [156]:
%time hist = model.fit(data[:100000], labels[:100000], batch_size=BATCH_SIZE, epochs=2*EPOCHS)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
CPU times: user 1h 11min 44s, sys: 1min 30s, total: 1h 13min 14s
Wall time: 18min 26s


Alright, we're basically at the same training time as for the LR pipeline now. The training loss kept dropping, which is good, **BUT** do note that I'm *not* checking against validation loss here; `model.fit()` lets you pass in a validation set as an additional param, and you can set up checks against the loss on the validation set (which doesn't get used as training data) to decide when to stop training. Normally I'd have done that.
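In Keras the usual route is a validation set passed to `fit()` together with the `EarlyStopping` callback. The rule behind that kind of check is simple enough to sketch without any library: stop once the validation loss has failed to improve for `patience` epochs in a row. The function below is an illustration of that rule on made-up losses, not Keras's implementation.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.52, 0.47, 0.45, 0.46, 0.48]
stop_at = early_stop_epoch(losses, patience=2)
print(stop_at)
```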

Alright, let's see how good this baby is.

In [157]:
%time print classification_report(y_dev[:10000], np.round(model.predict(X_dev_cnn[:10000])).astype("int32"), digits=3)

             precision    recall  f1-score   support

          0      0.719     0.864     0.785      5010
          1      0.829     0.661     0.736      4990

avg / total      0.774     0.763     0.760     10000

CPU times: user 35.7 s, sys: 1 s, total: 36.7 s
Wall time: 9.44 s


Hmm, not promising. We squeezed a tiny bit more performance out, in terms of average metrics, but we're still a few points away from the scores the basic system got.

Let's try training on twice as much data...don't forget that we truncated `data` right when we defined it up above, so let's revisit that.

In [160]:
sequences, word_index = tokenize_texts(X_train[:500000])
data = pad_seqs(sequences)
data.shape

Tokenizing
Found 196 unique tokens
Padding, encoding, train/dev split


(500000, 256)

In [161]:
%time hist = model.fit(data[:200000], labels[:200000], batch_size=BATCH_SIZE, epochs=2*EPOCHS)

Epoch 1/4
 13312/200000 [>.............................] - ETA: 507s - loss: 0.4519 - acc: 0.7866

IndexError: index 189 is out of bounds for size 186
Apply node that caused the error: AdvancedSubtensor1(embedding_2/embeddings, Reshape{1}.0)
Toposort index: 48
Inputs types: [TensorType(float32, matrix), TensorType(int32, vector)]
Inputs shapes: [(186, 32), (16384,)]
Inputs strides: [(128, 4), (4,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Reshape{3}(AdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-114-3a8528a90ed6>", line 6, in <module>
    trainable=True)(x)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/layers/embeddings.py", line 134, in call
    out = K.gather(self.embeddings, inputs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 483, in gather
    y = reference[indices]

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
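Reading the traceback against the earlier cells: the embedding matrix has shape `(186, 32)`, i.e. `len(word_index) + 1` for the 185 characters found in the first 100k tweets. Re-tokenizing 500k tweets found 196 characters, so indices up to 196 can now reach a layer that only has 186 rows. A small guard like the following would surface the mismatch before any training time is spent. It's a sketch, and `check_indices` is a hypothetical helper.

```python
def check_indices(sequences, embedding_rows):
    """Raise ValueError if any index would fall outside the embedding matrix."""
    largest = max((max(seq) for seq in sequences if len(seq) > 0), default=0)
    if largest >= embedding_rows:
        raise ValueError(
            "index %d needs at least %d embedding rows, but the layer has %d"
            % (largest, largest + 1, embedding_rows)
        )
    return largest


# Toy sequences standing in for `data`; 189 mirrors the index in the traceback.
ok = check_indices([[1, 5, 0], [185, 2, 0]], embedding_rows=186)
try:
    check_indices([[1, 5, 0], [189, 2, 0]], embedding_rows=186)
    failed = False
except ValueError:
    failed = True
print(ok, failed)
```

Two plausible remedies are to keep using the tokenizer fitted for the original model, or to rebuild the model with an embedding sized for the new vocabulary.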