# Short text classification bake-off!

Blog posts showing off deep learning approaches for some task are now a dime a dozen. One thing I *haven't* come across too much is blog posts comparing "standard" approaches to text classification.

In this notebook, I'm going to compare logistic regression on bag-of-ngrams features (a standard industry workhorse for text classification) with a ConvNet (the "new kid" baseline) on a simple Twitter sentiment analysis task.

*N.B.* This work was inspired by some experiments I was doing at work on search query classification with some data that I can't share; several Twitter sent-eval datasets are freely available. It's basically structured as a semi-narrated directed exploration.

## Javascript "beep" test

In [1]:
%%javascript
// Source - https://stackoverflow.com/a/23395136

Jupyter.beep = () => {
    var snd = new Audio("data:audio/wav;base64,//uQRAAAAWMSLwUIYAAsYkXgoQwAEaYLWfkWgAI0wWs/ItAAAGDgYtAgAyN+QWaAAihwMWm4G8QQRDiMcCBcH3Cc+CDv/7xA4Tvh9Rz/y8QADBwMWgQAZG/ILNAARQ4GLTcDeIIIhxGOBAuD7hOfBB3/94gcJ3w+o5/5eIAIAAAVwWgQAVQ2ORaIQwEMAJiDg95G4nQL7mQVWI6GwRcfsZAcsKkJvxgxEjzFUgfHoSQ9Qq7KNwqHwuB13MA4a1q/DmBrHgPcmjiGoh//EwC5nGPEmS4RcfkVKOhJf+WOgoxJclFz3kgn//dBA+ya1GhurNn8zb//9NNutNuhz31f////9vt///z+IdAEAAAK4LQIAKobHItEIYCGAExBwe8jcToF9zIKrEdDYIuP2MgOWFSE34wYiR5iqQPj0JIeoVdlG4VD4XA67mAcNa1fhzA1jwHuTRxDUQ//iYBczjHiTJcIuPyKlHQkv/LHQUYkuSi57yQT//uggfZNajQ3Vmz+Zt//+mm3Wm3Q576v////+32///5/EOgAAADVghQAAAAA//uQZAUAB1WI0PZugAAAAAoQwAAAEk3nRd2qAAAAACiDgAAAAAAABCqEEQRLCgwpBGMlJkIz8jKhGvj4k6jzRnqasNKIeoh5gI7BJaC1A1AoNBjJgbyApVS4IDlZgDU5WUAxEKDNmmALHzZp0Fkz1FMTmGFl1FMEyodIavcCAUHDWrKAIA4aa2oCgILEBupZgHvAhEBcZ6joQBxS76AgccrFlczBvKLC0QI2cBoCFvfTDAo7eoOQInqDPBtvrDEZBNYN5xwNwxQRfw8ZQ5wQVLvO8OYU+mHvFLlDh05Mdg7BT6YrRPpCBznMB2r//xKJjyyOh+cImr2/4doscwD6neZjuZR4AgAABYAAAABy1xcdQtxYBYYZdifkUDgzzXaXn98Z0oi9ILU5mBjFANmRwlVJ3/6jYDAmxaiDG3/6xjQQCCKkRb/6kg/wW+kSJ5//rLobkLSiKmqP/0ikJuDaSaSf/6JiLYLEYnW/+kXg1WRVJL/9EmQ1YZIsv/6Qzwy5qk7/+tEU0nkls3/zIUMPKNX/6yZLf+kFgAfgGyLFAUwY//uQZAUABcd5UiNPVXAAAApAAAAAE0VZQKw9ISAAACgAAAAAVQIygIElVrFkBS+Jhi+EAuu+lKAkYUEIsmEAEoMeDmCETMvfSHTGkF5RWH7kz/ESHWPAq/kcCRhqBtMdokPdM7vil7RG98A2sc7zO6ZvTdM7pmOUAZTnJW+NXxqmd41dqJ6mLTXxrPpnV8avaIf5SvL7pndPvPpndJR9Kuu8fePvuiuhorgWjp7Mf/PRjxcFCPDkW31srioCExivv9lcwKEaHsf/7ow2Fl1T/9RkXgEhYElAoCLFtMArxwivDJJ+bR1HTKJdlEoTELCIqgEwVGSQ+hIm0NbK8WXcTEI0UPoa2NbG4y2K00JEWbZavJXkYaqo9CRHS55FcZTjKEk3NKoCYUnSQ0rWxrZbFKbKIhOKPZe1cJKzZSaQrIyULHDZmV5K4xySsDRKWOruanGtjLJXFEmwaIbDLX0hIPBUQPVFVkQkDoUNfSoDgQGKPekoxeGzA4DUvnn4bxzcZrtJyipKfPNy5w+9lnXwgqsiyHNeSVpemw4bWb9psYeq//uQZBoABQt4yMVxYAIAAAkQoAAAHvYpL5m6AAgAACXDAAAAD59jblTirQe9upFsmZbpMudy7Lz1X1DYsxOOSWpfPqNX2WqktK0DMvuGwlbNj44TleLPQ+Gsfb+GOWOKJoIrWb3cIMeeON6lz2umTqMXV8Mj30yWPpjoSa9ujK8SyeJP5y5mOW1D6hvLepeveEAEDo0mgCRClOEgANv3B9a6fikgUSu/DmAMATrGx7nng5p5iimPNZsfQLYB2sDLIkzRKZOHGAaUyDcpFBSLG9MCQALgA
IgQs2YunOszLSAyQYPVC2YdGGeHD2dTdJk1pAHGAWDjnkcLKFymS3RQZTInzySoBwMG0QueC3gMsCEYxUqlrcxK6k1LQQcsmyYeQPdC2YfuGPASCBkcVMQQqpVJshui1tkXQJQV0OXGAZMXSOEEBRirXbVRQW7ugq7IM7rPWSZyDlM3IuNEkxzCOJ0ny2ThNkyRai1b6ev//3dzNGzNb//4uAvHT5sURcZCFcuKLhOFs8mLAAEAt4UWAAIABAAAAAB4qbHo0tIjVkUU//uQZAwABfSFz3ZqQAAAAAngwAAAE1HjMp2qAAAAACZDgAAAD5UkTE1UgZEUExqYynN1qZvqIOREEFmBcJQkwdxiFtw0qEOkGYfRDifBui9MQg4QAHAqWtAWHoCxu1Yf4VfWLPIM2mHDFsbQEVGwyqQoQcwnfHeIkNt9YnkiaS1oizycqJrx4KOQjahZxWbcZgztj2c49nKmkId44S71j0c8eV9yDK6uPRzx5X18eDvjvQ6yKo9ZSS6l//8elePK/Lf//IInrOF/FvDoADYAGBMGb7FtErm5MXMlmPAJQVgWta7Zx2go+8xJ0UiCb8LHHdftWyLJE0QIAIsI+UbXu67dZMjmgDGCGl1H+vpF4NSDckSIkk7Vd+sxEhBQMRU8j/12UIRhzSaUdQ+rQU5kGeFxm+hb1oh6pWWmv3uvmReDl0UnvtapVaIzo1jZbf/pD6ElLqSX+rUmOQNpJFa/r+sa4e/pBlAABoAAAAA3CUgShLdGIxsY7AUABPRrgCABdDuQ5GC7DqPQCgbbJUAoRSUj+NIEig0YfyWUho1VBBBA//uQZB4ABZx5zfMakeAAAAmwAAAAF5F3P0w9GtAAACfAAAAAwLhMDmAYWMgVEG1U0FIGCBgXBXAtfMH10000EEEEEECUBYln03TTTdNBDZopopYvrTTdNa325mImNg3TTPV9q3pmY0xoO6bv3r00y+IDGid/9aaaZTGMuj9mpu9Mpio1dXrr5HERTZSmqU36A3CumzN/9Robv/Xx4v9ijkSRSNLQhAWumap82WRSBUqXStV/YcS+XVLnSS+WLDroqArFkMEsAS+eWmrUzrO0oEmE40RlMZ5+ODIkAyKAGUwZ3mVKmcamcJnMW26MRPgUw6j+LkhyHGVGYjSUUKNpuJUQoOIAyDvEyG8S5yfK6dhZc0Tx1KI/gviKL6qvvFs1+bWtaz58uUNnryq6kt5RzOCkPWlVqVX2a/EEBUdU1KrXLf40GoiiFXK///qpoiDXrOgqDR38JB0bw7SoL+ZB9o1RCkQjQ2CBYZKd/+VJxZRRZlqSkKiws0WFxUyCwsKiMy7hUVFhIaCrNQsKkTIsLivwKKigsj8XYlwt/WKi2N4d//uQRCSAAjURNIHpMZBGYiaQPSYyAAABLAAAAAAAACWAAAAApUF/Mg+0aohSIRobBAsMlO//Kk4soosy1JSFRYWaLC4qZBYWFRGZdwqKiwkNBVmoWFSJkWFxX4FFRQWR+LsS4W/rFRb/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////VEFHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAU291bmRib3kuZGUAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMjAwNGh0dHA6Ly93d3cuc291bmRib3kuZGUAAAAAAAAAACU=");  
    snd.play();
}

<IPython.core.display.Javascript object>

In [4]:
%%javascript
Jupyter.beep()

<IPython.core.display.Javascript object>

**COOL, THAT WORKED.**

## The data

We're fetching a set of labeled Twitter sentiment analysis data from http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

In [23]:
import requests
import os

url = "http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip"
f_name = os.path.basename(url)
r = requests.get(url, stream=True)
with open(f_name, "wb") as f_out:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f_out.write(chunk)

Let's put that in a `data` subdirectory and unzip so we can take a peek.

In [28]:
import errno
# check if the directory exists and handle the relevant OSError (thanks StackOverflow!)
path = "./data"
try:
    os.makedirs(path)
except OSError as exc:
    if exc.errno == errno.EEXIST and os.path.isdir(path):
        pass
    else:
        raise

# extract the zip
from zipfile import ZipFile
with ZipFile("Sentiment-Analysis-Dataset.zip") as myzip:
    myzip.extractall(path)

In [29]:
ls data/

Sentiment Analysis Dataset.csv


OK, let's see what kind of file it is, and what it looks like...

In [35]:
!file data/Sentiment\ Analysis\ Dataset.csv

data/Sentiment Analysis Dataset.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators


Ugh, it's got a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark). Why?! 

In [30]:
!head data/Sentiment\ Analysis\ Dataset.csv

﻿ItemID,Sentiment,SentimentSource,SentimentText
1,0,Sentiment140,                     is so sad for my APL friend.............
2,0,Sentiment140,                   I missed the New Moon trailer...
3,1,Sentiment140,              omg its already 7:30 :O
4,0,Sentiment140,          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
5,0,Sentiment140,         i think mi bf is cheating on me!!!       T_T
6,0,Sentiment140,         or i just worry too much?        
7,1,Sentiment140,       Juuuuuuuuuuuuuuuuussssst Chillin!!
8,0,Sentiment140,       Sunny Again        Work Tomorrow  :-|       TV Tonight
9,1,Sentiment140,      handed in my uniform today . i miss you already


OK, it's basic CSV; let's try to load it up the naive way and hope that there are no unquoted commas in the final field.

In [4]:
import csv

f_in = open("data/Sentiment Analysis Dataset.csv")
reader = csv.reader(f_in, delimiter=',', quoting=csv.QUOTE_MINIMAL)
header = reader.next()
header

['\xef\xbb\xbfItemID', 'Sentiment', 'SentimentSource', 'SentimentText']
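That `\xef\xbb\xbf` stuck to the front of `ItemID` is the BOM leaking into the header. Here's a minimal sketch of one way around it, assuming Python 3 (the notebook itself runs on Python 2) and a tiny made-up stand-in file rather than the real dataset: open the file with the `utf-8-sig` codec, which strips the BOM on read. The `csv` module also copes with commas in the last field, as long as that field is quoted.

```python
import csv
import io
import os
import tempfile

# Hypothetical two-line stand-in for the real file: UTF-8 with BOM, CRLF line endings.
sample = 'ItemID,Sentiment,SentimentSource,SentimentText\r\n1,0,Sentiment140,"sad, very sad"\r\n'
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with io.open(tmp_path, "w", encoding="utf-8-sig", newline="") as f_out:
    f_out.write(sample)  # the utf-8-sig codec prepends the BOM for us

# Reading back with utf-8-sig drops the BOM, so the first header cell is clean.
with io.open(tmp_path, "r", encoding="utf-8-sig", newline="") as f_in:
    rows = list(csv.reader(f_in))

header, first_row = rows[0], rows[1]
print(header)
print(first_row)
```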

In [5]:
all_data = []
for row in reader:
    label = int(row[1])
    tweet = row[-1]
    all_data.append((tweet, label))

len(all_data)

1578614

OK, about 1.58M data points; nice. Now let's shuffle it, then split into train/val/test sets.

In [6]:
from random import shuffle

shuffle(all_data)
VAL_SPLIT = 0.1
nb_val_samples = int(VAL_SPLIT * len(all_data))
# doing an 80/10/10 split
train_data = all_data[:-2*nb_val_samples]
dev_data = all_data[-2*nb_val_samples:-nb_val_samples]
test_data = all_data[-nb_val_samples:]

len(train_data)
len(dev_data)
len(test_data)
assert len(all_data) == len(train_data) + len(dev_data) + len(test_data)
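The shuffle above is unseeded, so every run produces a different split. If you want the split to be repeatable, something like the following sketch should do; the helper name `split_80_10_10`, the seed value, and the toy data are all my own placeholders, not part of the original run.

```python
import random


def split_80_10_10(items, val_split=0.1, seed=13):
    """Shuffle a copy of `items` with a fixed seed, then cut it 80/10/10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(val_split * len(items))
    if n_val == 0:
        raise ValueError("not enough items to carve out dev/test sets")
    train = items[:-2 * n_val]
    dev = items[-2 * n_val:-n_val]
    test = items[-n_val:]
    return train, dev, test


# Toy stand-in for all_data: (tweet, label) pairs.
toy = [("tweet %d" % i, i % 2) for i in range(1000)]
train, dev, test = split_80_10_10(toy)
print(len(train), len(dev), len(test))
```

Calling it twice with the same seed gives the same three lists, which makes later comparisons between models easier to trust.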

Final peek at the raw training data to make sure it mostly looks like what we expect...

In [65]:
from random import sample
sample(train_data, 5)

[('tonight was amazing, but I miss the good old days ', 0),
 ("@volupty it's only cool because tomorrow my blond cousin comes  and his blue eyes too :p",
  1),
 ('@mrskutcher definatley agree  blew flawless out of the runnings with that performance!!',
  1),
 ('@newO_nyboR yeahh haha. Hope u feel better sooon ', 1),
 ('just got off the phone with my Daddy Doodle. I miss him.  http://plurk.com/p/114ons',
  0)]

*N.B.* Given the approach I'm going to take below, we're going to replace all @-mentions with a fixed string (e.g. `"AT_MENTION"`).

In [7]:
# start by separating inputs and labels, then do the @-mention replacements
train_inputs, train_labels = zip(*train_data)

In [8]:
import re
X_train = []
# this is *very* coarse and will surely match things that we don't want it to
mention_patt = re.compile(ur'@\w+', re.UNICODE)
for item in train_inputs:
    try:
        X_train.append(mention_patt.sub("AT_MENTION", item.decode("utf8")))
    except Exception:
        print item
        break

OK, looks like that worked. Onward!
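To make the "very coarse" caveat concrete, here's a quick sketch of what the pattern does to an email address, next to a slightly stricter variant. The stricter pattern is just an illustration of mine (it is not what the notebook uses), and the example text is made up.

```python
import re

# Same pattern as in the notebook (minus the Python 2 `ur` prefix).
mention_patt = re.compile(r"@\w+", re.UNICODE)
# Hypothetical stricter variant: skip an @ that directly follows a word character or another @.
strict_patt = re.compile(r"(?<![\w@])@\w+", re.UNICODE)

text = "@alice mail me at bob@example.com, thanks @carol!"
coarse = mention_patt.sub("AT_MENTION", text)
strict = strict_patt.sub("AT_MENTION", text)
print(coarse)
print(strict)
```

For a sentiment task the coarse version is probably harmless, since emails are rare in tweets, but it's worth knowing what it eats.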

## First contender: old-school LR+BoW

[Insert clever/insightful stuff about LR + BoW approaches here]

In [96]:
# the model
from sklearn.linear_model import LogisticRegressionCV
# using 2 cores (of my 4) for fitting
lr = LogisticRegressionCV(n_jobs=2)

*N.B.* We're using `LogisticRegressionCV` rather than plain `LogisticRegression` because the former picks the regularization strength `C` by cross-validation as part of training, so we get that tuning for free.

In [97]:
# the feature extractor
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer="char", lowercase=False, binary=False, ngram_range=(3, 6))

The `"char"` analyzer extracts character-based ngrams, *including across word boundaries*. I'm keeping casing AS-IS because *a priori* I can imagine it being helpful in deciding whether some piece of text has emotional weight. The range of ngram sizes I'm looking at is not justified beyond it having worked for me in the past in other text classification scenarios.
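To see what "across word boundaries" means in practice, here's a rough pure-Python approximation of character n-gram extraction. It's a sketch of the idea, not scikit-learn's actual implementation, and `char_ngrams` is a name I made up.

```python
def char_ngrams(text, n_min=3, n_max=6):
    """Return every substring of length n_min..n_max, spaces included."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


grams = char_ngrams("so SAD")
print(grams)
```

Note that `"o S"` and `" SA"` straddle the space between the two words, and that `"SAD"` stays upper-case because nothing is lowercased.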

In [101]:
# put them together in an end-to-end trainable/callable pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("vec", vec), ("lr", lr)])

# let's fit it on 100k data points and see how long it takes
%time pipeline.fit(X_train[:100000], train_labels[:100000])

CPU times: user 3min 37s, sys: 8.83 s, total: 3min 46s
Wall time: 19min 31s


Pipeline(steps=[('vec', CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(3, 6), preprocessor=None, stop_words=None,
        str...2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0))])

20 minutes...so we can ballpark ~4hrs for the full training set. I've seen reports that DL classification really only comes into its own (metrics-wise) with datasets of size *O(1e6)*...let's see what we can get out of this one.
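For the record, the ~4hr ballpark comes from the arithmetic below. It assumes fitting time scales linearly with the number of examples, which is a big assumption for a CV'd LR over a growing ngram vocabulary, so treat it as a lower-bound-ish guess.

```python
n_total = 1578614                 # rows in the CSV
n_val = int(0.1 * n_total)        # size of each of the dev and test sets
n_train = n_total - 2 * n_val     # what's left for training

wall_seconds_100k = 19 * 60 + 31  # measured wall time for 100k examples
scale = n_train / 100000.0        # how many times bigger the full training set is
est_hours = wall_seconds_100k * scale / 3600.0

print(n_train, round(est_hours, 1))
```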

## Second contender: new-school ConvNet/CNN
### (where "new" == "already somewhat outdated")

The approach we're taking here is a fairly standard way of using ConvNets for sentence classification, without much in the way of bells or whistles. [Kim (2014)](https://arxiv.org/abs/1408.5882) is one of the canonical references for this approach. Our architecture closely resembles the one used there, and pictured here (source: Denny Britz's [blog post on CNNs for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/); one of the clearest expositions of this I've come across):

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png" alt="ConvNet for sentence classification" style="width: 800px;"/>

**N.B.** The principal difference in the approach taken here is that I'll be using *character*-based embeddings (trained from scratch), rather than word embeddings, as shown above (and used in Kim 2014). Mostly this is because I used char ngrams for the BoNG+LR classifier.

In [54]:
# lots of stuff to import
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Activation, concatenate, Embedding, Conv1D, GlobalMaxPooling1D
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using Theano backend.


**N.B.** I'm using the _CPU_ for this. My Mac has a GeForce GT 650M, but I recently upgraded CUDA and drivers, and things are a bit borked (w.r.t. getting Keras to see the GPU).

We'll start with defining some useful bits of stuff here, for easier tweaking later on (should it prove necessary), and then do some data munging to get it into the shape that our model needs.

In [55]:
# DEFINE SOME USEFUL STUFF
# some of these vals are cribbed from other experiments with
# other short text classification tasks; YMMV, caveat emptor, yadda yadda...do some experimenting
VALIDATION_SPLIT = 0.1
EMBEDDING_DIM = 32
# max len in first 100k is 317, but very few longer than 256
MAX_SEQUENCE_LENGTH = 256
BATCH_SIZE = 64
# keeping this small(ish) to keep training time down
FILTERS = 50
HIDDEN_DIMS = 250
P_DROPOUT = 0.25
EPOCHS = 2


# using the tokenizer that comes with Keras
# this fxn also maps the input vocab to sequences of integer indices and returns a char-to-index map
def tokenize_texts(texts):
    print('Tokenizing')
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    print('Found %d unique tokens' % (len(word_index),))
    return sequences, word_index


# truncate/pad input sequences so that they're all the same length
def pad_seqs(sequences):
    print('Padding, encoding, train/dev split')
    data = pad_sequences(sequences,
                         maxlen=MAX_SEQUENCE_LENGTH,
                         padding='post',
                         truncating='post')
    return data


sequences, word_index = tokenize_texts(X_train[:100000])
data = pad_seqs(sequences)
import numpy
labels = numpy.asarray(train_labels)

Tokenizing
Found 185 unique tokens
Padding, encoding, train/dev split
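In case the `padding='post'` / `truncating='post'` settings are unclear, here's a small pure-Python sketch of what I understand them to do: cut sequences that are too long at the end, and fill short ones with zeros at the end. The `pad_post` helper and the three-character `char_index` are hypothetical, just for illustration.

```python
def pad_post(ids, max_len, pad_value=0):
    """Truncate at the end, then pad at the end, so every row has length max_len."""
    ids = list(ids)[:max_len]
    return ids + [pad_value] * (max_len - len(ids))


# Hypothetical char-to-index map; index 0 is kept free for padding.
char_index = {"a": 1, "b": 2, "c": 3}
encoded = [[char_index[ch] for ch in text] for text in ["abcab", "ca"]]
batch = [pad_post(ids, max_len=4) for ids in encoded]
print(batch)
```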


I'll be using Keras's [functional API](https://keras.io/getting-started/functional-api-guide/). I've only started using it, and already prefer it (plus it's a bit closer to the [PyTorch](http://pytorch.org) approach, which I've finally started playing with). OK, let's define the main building blocks of the model (again, there are design decisions in here that come from previous experimentation, literature-based suggestions, and Twitter-suggested rules of thumb; normally you'd want to use some kind of hyperparam optimization for this).

In [114]:
x = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
# index 0 is reserved: the Tokenizer starts numbering at 1 and pad_sequences pads with 0, hence the + 1
emb = Embedding(len(word_index) + 1,
                EMBEDDING_DIM,
                input_length=MAX_SEQUENCE_LENGTH,
                trainable=True)(x)
# we'll set up convolutions of size 3, 4, 5, and 6, like the ngram_range we used above,
# do max-pooling on each, then concatenate the output before running it through a dense layer
conv_3 = Conv1D(FILTERS, 3, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_3 = GlobalMaxPooling1D()(conv_3)
conv_4 = Conv1D(FILTERS, 4, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_4 = GlobalMaxPooling1D()(conv_4)
conv_5 = Conv1D(FILTERS, 5, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_5 = GlobalMaxPooling1D()(conv_5)
conv_6 = Conv1D(FILTERS, 6, padding="valid",
                activation="relu", strides=1)(emb)
maxpool_6 = GlobalMaxPooling1D()(conv_6)
merged = concatenate([maxpool_3, maxpool_4, maxpool_5, maxpool_6], axis=-1)
hidden = Dense(HIDDEN_DIMS)(merged)
dropout = Dropout(P_DROPOUT)(hidden)
dropout = Activation("relu")(dropout)
out = Dense(1, activation="sigmoid")(dropout)
model = Model(inputs=x, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
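As a sanity check on the architecture, here's the shape and parameter arithmetic worked out by hand. This is my own back-of-envelope sketch (it assumes the 185-token vocabulary found above and the usual weights-plus-bias counts for each layer); `model.summary()` is the authoritative source.

```python
VOCAB_SIZE = 185 + 1          # 185 chars found by the tokenizer, plus the reserved index 0
EMBEDDING_DIM = 32
MAX_SEQUENCE_LENGTH = 256
FILTERS = 50
HIDDEN_DIMS = 250
KERNEL_SIZES = (3, 4, 5, 6)

# With padding="valid" and stride 1, a width-k convolution yields L - k + 1 positions.
conv_out_lengths = {k: MAX_SEQUENCE_LENGTH - k + 1 for k in KERNEL_SIZES}
# Global max-pooling keeps one value per filter, so concatenation gives 4 * 50 features.
merged_dim = FILTERS * len(KERNEL_SIZES)

n_params = VOCAB_SIZE * EMBEDDING_DIM                                         # embedding table
n_params += sum(k * EMBEDDING_DIM * FILTERS + FILTERS for k in KERNEL_SIZES)  # conv layers
n_params += merged_dim * HIDDEN_DIMS + HIDDEN_DIMS                            # hidden dense layer
n_params += HIDDEN_DIMS * 1 + 1                                               # sigmoid output

print(conv_out_lengths, merged_dim, n_params)
```

So this is a small model by DL standards, well under 100k trainable parameters.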

Alright; model defined and data ready. Let's see how long it takes to train this bad boy.

In [122]:
%time hist = model.fit(data[:100000], labels[:100000], batch_size=BATCH_SIZE, epochs=EPOCHS)

Epoch 1/2
Epoch 2/2
CPU times: user 35min 18s, sys: 46 s, total: 36min 4s
Wall time: 9min 9s


Half the time to train! Can't say I expected that...now let's see whether these classifiers are any good at their assigned tasks.

## Third contender: old-school-new-kid (py)FastText

[FastText](https://github.com/facebookresearch/fastText) is a library released by the **Facebook AI Research** (FAIR) group in 2016. It's very specifically designed for the efficient training of high-quality word embeddings and text classifiers, and by all accounts is at or near state-of-the-art on these tasks, and should at the very least be considered a strong baseline to beat for any new models.

We'll be using the [pyfasttext](https://github.com/vrasneur/pyfasttext) Python bindings here.
(*N.B.* Installing via `pip` resulted in some issues for me (on Mac Sierra). I had to `conda install gcc`, then use that compiler to run `python setup.py install`, and now everything seems to be peachy.)

In [1]:
from pyfasttext import FastText

FastText requires training data to be in files, and in a specific format, so we'll handle that here with the training data.

In [67]:
import codecs
with codecs.open("data/sentiment.train.ft.txt", "w") as f_out:
    for label, datum in zip(train_labels[:100000], X_train[:100000]):
        try:
            f_out.write("%s %s" % ("__label__good" if int(label) else "__label__bad", datum.encode("utf8")) + "\n")
        except UnicodeEncodeError:
            print datum
            break
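The format in question is one example per line, with the label written as a `__label__`-prefixed token ahead of the text. Here's a small sketch of a formatter that also guards against tweets containing newlines, which would otherwise split one example across two lines. `to_fasttext_line` is a hypothetical helper; unlike the cell above, it collapses all runs of whitespace.

```python
def to_fasttext_line(label, text):
    """Format one (label, text) pair as a single fastText training line."""
    tag = "__label__good" if int(label) else "__label__bad"
    # split() + join() collapses newlines, tabs, and repeated spaces into single spaces
    one_line = " ".join(text.split())
    return "%s %s" % (tag, one_line)


lines = [
    to_fasttext_line(1, "so happy\ntoday"),
    to_fasttext_line(0, "ugh  Monday"),
]
print("\n".join(lines))
```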

Let's take a peek to make sure that looks right.

In [68]:
!head data/sentiment.train.ft.txt

__label__bad AT_MENTION i feel for you, ? ??? ???????? full ???? ?? ???????????????  it hits the theaters in 9 days tho! i m waiting impatiently!!
__label__good hi iÂ´m from german pleas helf 
__label__good AT_MENTION did I send you wrong state...no he was in MN-Minnesota...my mind is blogged 
__label__good AT_MENTION Hi, Yesterday I finaly mixed with Ina the purple nurple. 
__label__bad Insomnia. D'you know... It's 2am and I still can't keep my eyes closing. Damn! Again? I bet it's 4 or 5 
__label__good ohhh degrassi marathon marryy me  ha playing cards with my sister &amp;; watching degrassi, it's the one where rickk shoots jimmy ://
__label__bad AT_MENTION Haha Ernest! ....buries his mother? You lie! Rest in Peace Jim Varney 
__label__bad waiting for everyone to leave so i can have the house to myself. working tonight 
__label__bad AT_MENTION the clash are too good for their armand van halen ears. 
__label__good AT_MENTION  U leave tomorrow?  I'm so excited for you!!  Have 

Seems OK enough for this. Training is nice and simple, and according to the folks at FAIR, it's lightning fast (FastText is chock full of optimizations that make it quick to train, quick to use, AND space efficient).

In [85]:
ft = FastText()
# we'll do 2 epochs, like we did (to start) for the ConvNet
#
# I looked at the options on the FastText website and am using values to mimic our previous examples
%time ft.supervised(input="data/sentiment.train.ft.txt", output="sentiment_model", verbose=3, epoch=EPOCHS, lr=0.1, minn=3, maxn=6, wordNgrams=0)

CPU times: user 22.5 s, sys: 1.95 s, total: 24.4 s
Wall time: 15.1 s


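Holy cripes **15s**! That's roughly 36x faster than the ConvNet (about 9 minutes) and nearly 80x faster than the LR pipeline (about 19.5 minutes), so close to two orders of magnitude in the best case. That gives A LOT more time for experimentation and hyperparam tuning.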

Are we sure we trained on the same amount of data?!

In [70]:
!wc -l data/sentiment.train.ft.txt

  100001 data/sentiment.train.ft.txt


Sure enough (give or take: the one extra line is presumably a tweet with an embedded newline).

Alright, time to get down to brass tacks and see how these things actually fare, metrics-wise.

## Eval

One of my pet peeves about posts like this is the lack of detail w.r.t. things like system hardware/architecture,
training time, etc. as well as the simplicity of the eval metrics (usually just raw accuracy).

In the spirit of transparency and (one hopes) replicability, here's what I'm working with.

### Specs

I'm on a MacBook Pro with

* 16GB 1600 MHz DDR3 RAM
* 2.6 GHz Intel Core i7 processor
* NVIDIA GeForce 650M GPU (not that I'm using that thus far)
* running macOS Sierra.

### Metrics

Rather than just look at raw accuracy (which is distressingly frequent in DL papers I've come across), we're going to be looking at precision and recall (and, incidentally, f-score). These metrics give us a bit more insight into how our models do on each of our classes, with respect to different types of errors (e.g. false positives/negatives). We get these metrics in a nice table from [`scikit-learn`](http://scikit-learn.org).
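For anyone who wants the definitions spelled out, here's a minimal sketch of per-class precision, recall, and F1 computed from raw counts. It's meant to mirror what each row of the report shows, not to replace `classification_report`; the function name and the toy labels are mine.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the class treated as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative for class 1.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred, positive=1)
print(p, r, f)
```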

In [59]:
from sklearn.metrics import classification_report

We need to munge the dev data the same way we did the training data, for the LR, ConvNet, and FastText use cases (remove @-mentions, turn into ndarrays, sequencify, *&c*).

In [71]:
import numpy as np
dev_inputs, dev_labels = zip(*dev_data)

X_dev = []
for item in dev_inputs:
    try:
        X_dev.append(mention_patt.sub("AT_MENTION", item.decode("utf8")))
    except Exception:
        print item
        break

y_dev = np.asarray(dev_labels[:10000])
y_dev.shape

(10000,)

In [128]:
%time print classification_report(y_dev[:10000], pipeline.predict(X_dev[:10000]), digits=3)

             precision    recall  f1-score   support

          0      0.801     0.790     0.796      5010
          1      0.792     0.803     0.798      4990

avg / total      0.797     0.797     0.797     10000

CPU times: user 3.07 s, sys: 61.4 ms, total: 3.13 s
Wall time: 3.13 s


Pretty decent numbers, and decently quick to classify 10k items.

Let's see how the CNN model does.

In [135]:
dev_sequences, _ = tokenize_texts(X_dev[:10000])
X_dev_cnn = pad_seqs(dev_sequences)

Tokenizing
Found 168 unique tokens
Padding, encoding, train/dev split
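One thing that may be worth double-checking here: `tokenize_texts` builds and fits a brand-new `Tokenizer` on whatever it's given, so the dev set gets its own char-to-index map (168 tokens vs. 185 on the training slice). As far as I know, Keras assigns indices by frequency rank, so the same character can end up with a different integer in the two maps. The sketch below mimics that behaviour in pure Python with made-up strings; `fit_char_index` and `encode` are hypothetical stand-ins, not Keras functions.

```python
from collections import Counter


def fit_char_index(texts):
    """Rank characters by frequency (ties broken alphabetically); most frequent gets index 1."""
    counts = Counter(ch for text in texts for ch in text)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ch: i + 1 for i, (ch, _) in enumerate(ranked)}


def encode(text, char_index):
    """Map each known character to its index, skipping unknown characters."""
    return [char_index[ch] for ch in text if ch in char_index]


train_index = fit_char_index(["aaab", "aab"])   # 'a' is most frequent here
dev_index = fit_char_index(["abbb", "bbc"])     # 'b' is most frequent here

with_train_map = encode("ab", train_index)
with_dev_map = encode("ab", dev_index)
print(with_train_map, with_dev_map)
```

The safer pattern is to fit once on the training texts, keep that tokenizer around, and only call its `texts_to_sequences` on dev and test data.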


In [155]:
%time print classification_report(y_dev[:10000], np.round(model.predict(X_dev_cnn[:10000])).astype("int32"), digits=3)

             precision    recall  f1-score   support

          0      0.785     0.711     0.746      5010
          1      0.735     0.804     0.768      4990

avg / total      0.760     0.757     0.757     10000

CPU times: user 33.5 s, sys: 327 ms, total: 33.8 s
Wall time: 8.5 s


Hmm...four points lower on f-score, and almost 3 times slower in prediction. We'll need something better than this to take to the boss if we want to convince him to let us play with DL stuff.

Let's see how FastText does.

In [78]:
with codecs.open("data/sentiment.dev.ft.txt", "w") as f_out:
    for label, datum in zip(dev_labels, X_dev):
        try:
            f_out.write("%s %s" % ("__label__good" if label else "__label__bad", datum.encode("utf8")) + "\n")
        except UnicodeEncodeError:
            print datum
            break

In [79]:
!head -5 data/sentiment.dev.ft.txt

__label__bad AT_MENTION already worked out today 
__label__bad AT_MENTION Ah yes right. I don't know why didn't I think of it  Thanks a bunch!
__label__good AT_MENTION tomorrow I should have some cools pics for you if weather is good 
__label__good AT_MENTION ADD me to your posse  NZKZKL #epicpetwars
__label__bad AT_MENTION AT_MENTION thank you for the shows,thank you very much for everything and for all time here in Brazil. It was amazing.Good bye 


In [86]:
y_ft = [u"good" if x else u"bad" for x in dev_labels]
%time print classification_report(y_ft[:10000], np.array(ft.predict(X_dev[:10000])), digits=3)

             precision    recall  f1-score   support

        bad      0.776     0.788     0.782      4907
       good      0.793     0.781     0.787      5093

avg / total      0.785     0.784     0.785     10000

CPU times: user 510 ms, sys: 1.2 ms, total: 512 ms
Wall time: 512 ms


As expected, it's FAST; 6x faster than LR in prediction. Moreover, those numbers are pretty good; better than the ConvNet out of the box, and very nearly as good as the LR+BoW model (OK, it would have been cool if they were better).

Let's see what we can do to beef our metrics up a bit for the ConvNet and FastText models.

## Squeezing better performance out of our models
### (wherein we try to show that CNNs and FastText are worth pursuing)

Let's start by trying to retrain our ConvNet for twice as long. If that doesn't help, we can always try with more data...

In [156]:
%time hist = model.fit(data[:100000], labels[:100000], batch_size=BATCH_SIZE, epochs=2*EPOCHS)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
CPU times: user 1h 11min 44s, sys: 1min 30s, total: 1h 13min 14s
Wall time: 18min 26s


Alright, we're basically at the same training time as for the LR pipeline now. The loss kept dropping, which is good, **BUT** do note that I'm *not* checking against validation loss here; `model.fit()` lets you pass in a validation set as an additional param, and you can set up checks against the loss on the validation set (which doesn't get used as training data) to decide when to stop training. Normally I'd have done that. (Also note that calling `fit()` again on the same `model` picks up from the already-trained weights rather than starting from scratch.)
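For reference, the stopping rule I have in mind is the usual patience-based one. Keras ships an `EarlyStopping` callback for this, but the logic is simple enough to sketch framework-free. The helper name and the loss values below are made up purely for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stopped_epoch) under a patience-based stopping rule."""
    best = float("inf")
    best_epoch = None
    stopped_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best - min_delta:
            # validation loss improved: remember it and reset the counter
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, stopped_epoch


# Hypothetical per-epoch validation losses (not from the runs in this notebook).
val_losses = [0.52, 0.47, 0.46, 0.48, 0.50, 0.45]
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stopped_epoch)
```

Note the trade-off in the toy numbers: with `patience=2` training stops at epoch 4 and never sees the better loss at epoch 5.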

Alright, let's see how good this baby is.

In [157]:
%time print classification_report(y_dev[:10000], np.round(model.predict(X_dev_cnn[:10000])).astype("int32"), digits=3)

             precision    recall  f1-score   support

          0      0.719     0.864     0.785      5010
          1      0.829     0.661     0.736      4990

avg / total      0.774     0.763     0.760     10000

CPU times: user 35.7 s, sys: 1 s, total: 36.7 s
Wall time: 9.44 s


Hmm, not promising. We squeezed a tiny bit more performance out, in terms of average metrics, but we're still a few points away from the scores the basic system got.

What happens when we train the FastText model for twice as long?

In [87]:
%time ft.supervised(input="data/sentiment.train.ft.txt", output="sentiment_model", verbose=3, epoch=(2*EPOCHS), lr=0.1, minn=3, maxn=6, wordNgrams=0)

CPU times: user 36.4 s, sys: 2.46 s, total: 38.9 s
Wall time: 18.3 s


In [88]:
%time print classification_report(y_ft[:10000], np.array(ft.predict(X_dev[:10000])), digits=3)

             precision    recall  f1-score   support

        bad      0.779     0.797     0.788      4907
       good      0.800     0.783     0.791      5093

avg / total      0.790     0.790     0.790     10000

CPU times: user 524 ms, sys: 29.9 ms, total: 554 ms
Wall time: 534 ms


Alright...a teeny bit better. Can we do more? Well, one thing we did here to make things "fair" was train everything under the same regime, namely using only character `{3,4,5,6}`-grams.
From what I understand, FastText by default includes both word unigrams and character ngrams in its models. Let's let it use word-level features too, by setting `wordNgrams=2` (of course, to make this apples to apples, we'll want to retrain the LR+BoW model to allow this, as well).
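A quick sketch of the two kinds of features in play, as I understand them: FastText's character ngrams (`minn`/`maxn`) are computed *within* each word, with `<` and `>` added as boundary markers, while `wordNgrams` controls ngrams over whole tokens. The helpers below are my own illustration; the real library also hashes these features into buckets, which I'm ignoring.

```python
def fasttext_style_subwords(word, minn=3, maxn=6):
    """Character n-grams of a single word, wrapped in < and > boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_ngrams(tokens, max_n=2):
    """All word n-grams from unigrams up to length max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


subwords = fasttext_style_subwords("where", minn=3, maxn=3)
bigrams = word_ngrams("i miss him".split(), max_n=2)
print(subwords)
print(bigrams)
```

This differs from the `CountVectorizer` setup, where character ngrams run straight across spaces, so the two models aren't looking at exactly the same features even with matching `3` to `6` ranges.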

In [92]:
%time ft.supervised(input="data/sentiment.train.ft.txt", output="sentiment_model", verbose=3, epoch=(2*EPOCHS), lr=0.1, minn=3, maxn=6, wordNgrams=2)

CPU times: user 38.5 s, sys: 2.35 s, total: 40.9 s
Wall time: 17.6 s


In [93]:
%time print classification_report(y_ft[:10000], np.array(ft.predict(X_dev[:10000])), digits=3)

             precision    recall  f1-score   support

        bad      0.781     0.807     0.794      4907
       good      0.808     0.782     0.795      5093

avg / total      0.795     0.794     0.794     10000

CPU times: user 558 ms, sys: 30.7 ms, total: 588 ms
Wall time: 568 ms


Let's try training the ConvNet on twice as much data...don't forget that we truncated `data` right when we defined it up above, so let's revisit that.

In [160]:
sequences, word_index = tokenize_texts(X_train[:500000])
data = pad_seqs(sequences)
data.shape

Tokenizing
Found 196 unique tokens
Padding, encoding, train/dev split


(500000, 256)

In [161]:
%time hist = model.fit(data[:200000], labels[:200000], batch_size=BATCH_SIZE, epochs=2*EPOCHS)

Epoch 1/4
 13312/200000 [>.............................] - ETA: 507s - loss: 0.4519 - acc: 0.7866

IndexError: index 189 is out of bounds for size 186
Apply node that caused the error: AdvancedSubtensor1(embedding_2/embeddings, Reshape{1}.0)
Toposort index: 48
Inputs types: [TensorType(float32, matrix), TensorType(int32, vector)]
Inputs shapes: [(186, 32), (16384,)]
Inputs strides: [(128, 4), (4,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Reshape{3}(AdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-114-3a8528a90ed6>", line 6, in <module>
    trainable=True)(x)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/layers/embeddings.py", line 134, in call
    out = K.gather(self.embeddings, inputs)
  File "/Users/fredmailhot/anaconda/envs/fast-ai/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 483, in gather
    y = reference[indices]

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
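The traceback looks consistent with a vocabulary mismatch: the embedding layer was built with `len(word_index) + 1 = 186` rows back when the tokenizer had found 185 characters, but re-tokenizing 500k tweets found 196, so some sequences now contain indices the embedding table doesn't have. Here's a tiny sketch of the check that would have caught it; the helper and the example ids are hypothetical (apart from 189, which is the index in the error).

```python
old_vocab = 185                   # unique chars found in the first 100k tweets
new_vocab = 196                   # unique chars found in the first 500k tweets
embedding_rows = old_vocab + 1    # rows in the already-built embedding table (index 0 included)


def out_of_range(ids, n_rows):
    """Return the ids that would index past the end of an n_rows embedding table."""
    return [i for i in ids if i >= n_rows]


# A made-up encoded sequence containing two of the newly introduced indices.
bad = out_of_range([5, 42, 189, 196], embedding_rows)
print(embedding_rows, bad)
```

Fitting the tokenizer once (on all the training text you plan to use) before building the model, or rebuilding the model after re-tokenizing, should avoid this.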