# Stochastic Gradient Descent and Vowpal Wabbit

The first thing to note is that this is **NOT** a regular IPython notebook.  It uses the bash kernel, which you need to install from [here](https://github.com/takluyver/bash_kernel), to execute normal Linux commands rather than just regular Python.  You need IPython version 3 for it to work.  It's pretty sweet though--you can switch back and forth between executing Linux commands and Python commands by going to the "Change kernel" option in the "Kernel" menu above.

We'll be looking at a dataset of click logs from online display advertising, released by Criteo.  It's a much larger version of a [Kaggle competition dataset](https://www.kaggle.com/c/criteo-display-ad-challenge).  The full dataset is a terabyte and has records for 24 days, but we'll only be looking at a single day of data.  You can get the data [here](http://labs.criteo.com/downloads/download-terabyte-click-logs/).  There's also a nice [blog post](http://fastml.com/vowpal-wabbit-eats-big-data-from-the-criteo-competition-for-breakfast/) about using VW in the Criteo contest.

Let's see how large the file is:

In [None]:
ls -lh data/day_0

46 GB is pretty gigantic, and clearly too large to load into my laptop's RAM.  So VW, which streams examples from disk instead of loading them all into memory, is a good option here.  Let's see how many lines (examples) there are in the file:

In [None]:
# THIS WILL TAKE A LONG TIME
wc -l data/day_0
#195,841,983

Let's take a look at the first two lines of the file:

In [None]:
head -2 data/day_0

Here's a description of the dataset from the Criteo website:

The columns are tab separated with the following schema:<br>
&lt;label&gt; &lt;integer feature 1&gt; … &lt;integer feature 13&gt; &lt;categorical feature 1&gt; … &lt;categorical feature 26&gt;
When a value is missing, the field is just empty.

So the first field is the target value (a 1 when someone clicked on the ad, a 0 when they didn't).  Then we have 13 numeric features, followed by 26 categorical features which we would normally need to expand out into dummies.
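
To make the schema concrete, here is a minimal Python sketch that splits one raw line into its three parts.  The helper name `parse_criteo_line` and the example row are made up for illustration; only the column layout (1 label, 13 integers, 26 categoricals, empty field meaning missing) comes from the description above.

```python
# Hypothetical helper, not part of the Criteo tooling.
NUM_INT = 13   # integer feature columns
NUM_CAT = 26   # categorical feature columns

def parse_criteo_line(line):
    """Split one tab-separated log line into (label, ints, cats)."""
    # strip only the newline: a trailing tab means the last field is missing
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    # an empty field means "missing", which we keep as None
    ints = [int(v) if v != "" else None for v in fields[1:1 + NUM_INT]]
    cats = [v if v != "" else None
            for v in fields[1 + NUM_INT:1 + NUM_INT + NUM_CAT]]
    return label, ints, cats

# Made-up row: a click, two missing integers, one missing categorical.
example = "\t".join(
    ["1"] + ["5", "", "12", ""] + ["3"] * 9 + ["a1b2c3d4"] * 25 + [""]
) + "\n"
label, ints, cats = parse_criteo_line(example)
```

Note that the line is stripped with `rstrip("\n")` rather than `strip()`, since stripping tabs would silently drop missing fields at the end of a row.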

We'll be using Vowpal Wabbit (VW), which you can get from GitHub [here](https://github.com/JohnLangford/vowpal_wabbit).  We'll be working with version 7.7, though I doubt the particular version is crucial.  Windows installation instructions are [here](https://github.com/JohnLangford/vowpal_wabbit/blob/master/README.windows.txt), though I've never tried to get it working on Windows, so YMMV.

In [None]:
vw --version

VW has a huge array of commandline options, which you can read about from the help menu:

In [None]:
vw --help

The first thing that we need to do is to convert the log file into the input format that VW expects.  VW comes with a utility called `vw-csv2bin`, but we're going to write some simple python code to do that:

In [None]:
import re

In [None]:
def tsv_to_vw(tsv_file, vw_file, skip_lines, num_lines):
    print("\nTurning %s into %s..." % (tsv_file, vw_file))

    # open our input file and an output file to write to
    with open(tsv_file, 'r') as infile, open(vw_file, 'w') as outfile:        
        lines_read=0
        lines_skipped=0
        # read the file line by line
        for line in infile:
            # we want to skip the first skip_lines lines of the file
            if skip_lines!=0 and lines_skipped<skip_lines:
                lines_skipped += 1
                continue
  
            # if we've converted num_lines already, stop
            if lines_read>= num_lines: return

            # otherwise, convert the line
            out_line = ""
            # get rid of the newline at the end of the line
            line = re.sub('\n', '', line)
            # split the line on tabs
            data = re.split('\t', line)

            # pop off our target/label column and write the label | for vw
            target = data.pop(0)        
            out_line += "1 | " if target == "1" else "-1 | "

            # write the 13 integer features in a form like feature:val, e.g. f0:124
            for i in range(13):
                out_line += "f%s:" % i
                if data[i] == "":
                    out_line += "0 "
                else:
                    out_line += "%s " % data[i]

            # all the rest are the categorical features, so we just write these directly
            # and vw will interpret them as F:1 when they're present, F:0 when they're not
            for i in range(13, len(data)):
                if data[i] == "": continue
                out_line += "f%s_%s " % (i, data[i])

            out_line += "\n"
            outfile.write(out_line)
            lines_read += 1

In [None]:
# DON'T RE-RUN THIS, BECAUSE IT WILL TAKE FOREVER...
# also, only write out 2mm lines because my hard drive fills up...
tsv_to_vw("data/day_0", "data/day_0.vw", skip_lines=0, num_lines=2000000)

In [None]:
tsv_to_vw("data/day_0", "data/day_0.test.vw", skip_lines=2000000, num_lines=2000000)

In [None]:
head -10 data/day_0.vw

We can validate the format with the super useful [VW format validator tool](http://hunch.net/~vw/validate.html).
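
For a quick offline check, a rough sanity test can also be sketched in Python.  This is not the validator's logic, and the function name `looks_like_vw_line` is made up; it only checks the simple `label | feature[:value] ...` layout that our converter writes, not the full VW input format (which also allows namespaces, importance weights, and tags).

```python
def looks_like_vw_line(line):
    """Rough check for the 'label | feat[:value] ...' layout used here."""
    head, sep, body = line.rstrip("\n").partition("|")
    if sep != "|":
        return False          # no label/feature separator at all
    if "|" in body:
        return False          # our converter never writes a second namespace
    try:
        float(head.strip())   # the label must be numeric
    except ValueError:
        return False
    for token in body.split():
        name, colon, value = token.partition(":")
        if name == "" or ":" in value:
            return False      # empty name or more than one colon
        if colon:
            try:
                float(value)  # an explicit value must be numeric
            except ValueError:
                return False
    return True
```

Running this over the first few thousand lines of `data/day_0.vw` would catch gross conversion mistakes before spending time on training.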

Next, we'll train a logistic regression model.  The `-f` flag tells it where to store the final model, and the `-d` option is for the input data.  (Note, this provides realtime feedback in the terminal that you can't see in the notebook...)

In [None]:
vw --loss_function logistic -f data/day_0.model -d data/day_0.vw
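
To get a feel for what happens to each example as it streams by, here is a minimal sketch of one plain stochastic gradient descent update for logistic loss with labels in {-1, +1}.  This is a textbook update under a fixed learning rate, not VW's actual update rule (as far as I know, VW's default adds adaptive and normalized scaling on top), and the function name and toy data are made up.

```python
import math

def sgd_logistic_step(weights, features, label, learning_rate=0.5):
    """Apply one plain SGD update for logistic loss; label must be -1 or +1.

    weights and features are dicts keyed by feature name.  Returns the loss
    measured before the update.
    """
    score = sum(weights.get(name, 0.0) * value
                for name, value in features.items())
    loss = math.log1p(math.exp(-label * score))
    # derivative of log(1 + exp(-label * score)) with respect to score
    grad_score = -label / (1.0 + math.exp(label * score))
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * grad_score * value
    return loss

# Two made-up examples in the same spirit as our converted lines.
toy_data = [
    ({"f0": 1.0, "f13_a1b2": 1.0}, 1),
    ({"f0": 0.0, "f13_c3d4": 1.0}, -1),
]
weights = {}
losses = [sgd_logistic_step(weights, x, y) for x, y in toy_data * 20]
```

Each update only touches the weights of features present in the current example, which is why this style of learning never needs the whole dataset in memory.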

Instead of displaying the logistic loss values, we can pass `--binary`, which turns predictions into -1/+1 and reports the 0/1 loss (the error rate, so the binary accuracy is one minus that number):

In [None]:
vw --loss_function logistic --binary -f data/day_0.model -d data/day_0.vw

The dataset is 94% 0's, so always predicting 0 would already be right 94% of the time.  Getting an accuracy of 97% means that the model is indeed learning something about the 1's.
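
The comparison against the majority class can be spelled out in a few lines of Python.  The labels and predictions below are made up and simply chosen to mirror the 94% and 97% figures; they are not VW output.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    hits = sum(1 for y, p in zip(labels, preds) if y == p)
    return hits / float(len(labels))

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / float(len(labels))

# Toy data: 94 non-clicks and 6 clicks, with half of the clicks caught.
toy_labels = [-1] * 94 + [1] * 6
toy_preds = [-1] * 94 + [1] * 3 + [-1] * 3

baseline = majority_baseline(toy_labels)   # 0.94
model_acc = accuracy(toy_labels, toy_preds)  # 0.97
```

With classes this imbalanced, accuracy alone says little, which is why we look at the ROC curve further down.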

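We can have VW make multiple passes over the data with `--passes`; the `-c` flag builds the cache file that VW needs in order to re-read the data.  More passes often help, though with enough of them the model can start to overfit: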

In [None]:
vw --loss_function logistic --binary -f data/day_0.model -d data/day_0.vw -c --passes 2

And we can apply the model to new data with the `-t` flag.  Predictions will be written to the file specified by the `-p` flag and raw predictions to the file specified by the `-r` flag:

In [None]:
vw -t --binary -i data/day_0.model -d data/day_0.test.vw -p data/day_0.test.preds -r data/day_0.test.raw.preds

In [None]:
head -10 data/day_0.test.preds

In [None]:
head -10 data/day_0.test.raw.preds
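
Since the model was trained with logistic loss, a raw prediction can be read as a log-odds score, and the logistic (sigmoid) function maps it to a click probability.  Here is a minimal sketch; the function name is made up, and it assumes each line of the raw predictions file holds a single score.

```python
import math

def raw_to_probability(raw):
    """Map a raw linear score (log-odds) to a probability in (0, 1)."""
    # branch on the sign so that math.exp never overflows
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    e = math.exp(raw)
    return e / (1.0 + e)

# A score of 0 sits exactly on the decision boundary.
example_probs = [raw_to_probability(r) for r in (-3.0, 0.0, 3.0)]
```

The ROC curve and AUC below depend only on the ordering of the scores, so they come out the same whether we use the raw scores or these probabilities.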

Let's write a file with only the true test labels:

In [None]:
cut -d "|" -f 1 data/day_0.test.vw > data/day_0.test.true.labels

In [None]:
head -10 data/day_0.test.true.labels

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
labels = pd.read_csv("data/day_0.test.true.labels", header=None)
labels.columns = ["label"]
labels.head()

In [None]:
preds = pd.read_csv("data/day_0.test.raw.preds", header=None)
preds.columns = ["pred"]
preds.head()

In [None]:
fpr, tpr, thresholds = roc_curve(labels.label, preds.pred)
fpr_rand = tpr_rand = np.linspace(0, 1, 10)

plt.plot(fpr, tpr)
plt.plot(fpr_rand, tpr_rand, linestyle='--')
plt.show()

In [None]:
roc_auc_score(labels.label, preds.pred)
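
One way to read the AUC: it is the probability that a randomly chosen click gets a higher score than a randomly chosen non-click, with ties counting as half.  The brute-force sketch below computes it that way on made-up scores.  It is quadratic in the number of examples, so it is only for building intuition, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 pairs are ordered correctly.
toy_auc = pairwise_auc([1, -1, 1, -1], [0.9, 0.1, 0.4, 0.6])
```

A value of 0.5 corresponds to the dashed diagonal in the plot above, and 1.0 to a perfect ranking.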

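We can tell VW to generate quadratic interaction features and use them in the model.  The two letters after `-q` name the namespaces to cross by their first character, and `-b 24` enlarges the hashed weight table to 2^24 entries to make room for the extra features.  (Our converter writes everything into the default, unnamed namespace, so as far as I can tell `aa` only has an effect if the features are placed in a namespace starting with `a`.)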

In [None]:
vw --loss_function logistic --binary -q aa -b 24 -d data/day_0.vw
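
Conceptually, a quadratic interaction is just the product of two feature values, stored under a combined name.  The sketch below illustrates the idea on a made-up example; the function name and the `a^b` naming are assumptions for illustration, and VW's own pair enumeration and hashing differ in the details.

```python
from itertools import combinations_with_replacement

def quadratic_features(features):
    """Return every unordered pair of features, valued by their product."""
    crossed = {}
    pairs = combinations_with_replacement(sorted(features.items()), 2)
    for (name_a, val_a), (name_b, val_b) in pairs:
        crossed["%s^%s" % (name_a, name_b)] = val_a * val_b
    return crossed

# Three made-up features expand into 3 * 4 / 2 = 6 crossed features.
toy = {"f0": 2.0, "f1": 3.0, "f13_ab": 1.0}
crossed = quadratic_features(toy)
```

The count grows quadratically with the number of active features per example, which is why a larger hash table helps keep collisions down.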

Same with cubic features:

In [None]:
vw --loss_function logistic --binary --cubic aaa -d data/day_0.vw

We can add lasso or ridge penalties to the model:

In [None]:
vw --loss_function logistic --binary --l1 0.1 -d data/day_0.vw

In [None]:
vw --loss_function logistic --binary --l2 0.1 -d data/day_0.vw

If we had defined feature namespaces, we could mask entire chunks of features in and out of the model with the `--ignore` and `--keep` options.

We can get an idea of feature importances with the `vw-varinfo` script.  Let's first generate a tiny version of our training dataset so that this will go quickly:

In [None]:
head -1000 data/day_0.vw > data/day_0.small.vw

In [None]:
vw-varinfo -d data/day_0.small.vw

Finally, we can also train a linear support vector machine by using the "hinge" loss function:

In [None]:
vw --loss_function hinge --binary -f data/day_0.model -d data/day_0.vw
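
The only thing that changes between the two models is the per-example loss.  Here is a small sketch of both losses as functions of the label (-1 or +1) and the raw score; the function names are made up, and these are the standard textbook definitions rather than code taken from VW.

```python
import math

def hinge_loss(label, score):
    """Zero once the example is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - label * score)

def logistic_loss(label, score):
    """Smooth loss that keeps shrinking but never reaches zero."""
    return math.log1p(math.exp(-label * score))

# Compare the two losses for a positive example at a few scores.
comparison = [(s, hinge_loss(1, s), logistic_loss(1, s))
              for s in (-1.0, 0.0, 0.5, 2.0)]
```

Hinge loss stops caring about an example once it is classified with enough margin, while logistic loss keeps nudging the weights, which is also why only the logistic scores have a natural reading as probabilities.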