# **VowpalWabbit Text Classification**

A movie production company wants to build a real time IMDB review extraction and prediction system. The task is to predict whether a particular review is positive or negative. The data provided has only the actual reviews as text. 

## **Install Vowpal Wabbit**

Installing vowpal-wabbit package on Ubuntu is as easy as:

1. Follow the instructions here: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Dependencies

2. Run the following commands on terminal:

    `sudo apt-get update`
    
    `sudo apt-get install vowpal-wabbit`

In [1]:
import vowpal_wabbit 

## **CHECK IF THE INSTALLATION IS WORKING**

In [2]:
#!vw --help

![alt text](fig/ex.png "Title")

## **LOAD DATA**

Download the data from: http://ai.stanford.edu/~amaas/data/sentiment/

In [3]:
import os
import numpy as np
import re
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score, roc_auc_score

# Read the train Data
path_to_movies = os.getcwd() + '/data/'
reviews_train = load_files(os.path.join(path_to_movies, 'train'))
text_train, y_train = reviews_train.data, reviews_train.target

In [4]:
print("Number of documents in training data: %d" % len(text_train))
print(np.bincount(y_train))

Number of documents in training data: 75000
[12500 12500 50000]


In [5]:
text_train[74000]

b"i seen this poor done comic film.but it would been better if they used a family that wasn't based on nothing.the movie starts as Denise finds a dinosaur bone in his back yard,and a fake<br /><br />who claims to be a dinosaur hunter comes to stay with the family.and he is real annoying too.the film did not deliver much laughs.the only two funny things in the movie was when Denise tries to sneak that fudge candy to his room and the scene where Mr Wilson's washer overflowed.it would be a better film if it involved a different character.the 93 version was a real one.this one wasn't a Denise the menace at all. and i saw this one 2 or 3 times a month when it was on HBO."

In [6]:
print(np.bincount(y_train))

[12500 12500 50000]


## **Treatment**

Basic removal of all the words with less than three characters (like ‘is’, ‘in’ etc.) because these do not add any valuable information to the model.

In [7]:
import re
def to_vw_format(document, label=None):
      return str(label or '') + ' |text ' + ' '.join(re.findall('\w{3,}', document.lower())) + '\n'

In [8]:
to_vw_format(str(text_train[0]), 1 if y_train[0] == 1 else -1)

'-1 |text full then unknown actors tsf great big cuddly romp film the idea bunch bored teenagers ripping off the local sink factory odd enough but add the black humour that forsyth are good and your for real treat the comatose van driver itself worth seeing and the canal side chase just too real anything but funny and for anyone who lived glasgow great know where that film\n'

## **Divide the data**

In [9]:
# Splitting train data to train and validation sets
train_size = int(0.7 * len(text_train))
train, train_labels = text_train[:train_size], y_train[:train_size]
valid, valid_labels = text_train[train_size:], y_train[train_size:]

In [10]:
# Convert and save in vowpal wabbit format
with open('movie_reviews_train.vw', 'w') as vw_train_data:
    for text, target in zip(train, train_labels):
        vw_train_data.write(to_vw_format(str(text), 1 if target == 1 else -1))
with open('movie_reviews_valid.vw', 'w') as vw_train_data:
    for text, target in zip(valid, valid_labels):
        vw_train_data.write(to_vw_format(str(text), 1 if target == 1 else -1))

In [11]:
len(train_labels), len(valid_labels)

(52500, 22500)

In [12]:
!head -2 movie_reviews_train.vw

-1 |text full then unknown actors tsf great big cuddly romp film the idea bunch bored teenagers ripping off the local sink factory odd enough but add the black humour that forsyth are good and your for real treat the comatose van driver itself worth seeing and the canal side chase just too real anything but funny and for anyone who lived glasgow great know where that film
-1 |text amount disappointment getting these days seeing movies like partner jhoom barabar and now heyy babyy gonna end habit seeing first day shows the movie utter disappointment because had the potential become laugh riot only the xc3 xa9butant director sajid khan hadn tried too many things only saving grace the movie were the last thirty minutes which were seriously funny elsewhere the movie fails miserably first half was desperately been tried look funny but wasn next minutes were emotional and looked totally artificial and illogical when you are out for movie like this you don expect much logic but all the flaws 

## **TRAINING**

* **-d** This is used to include the  path to train the data for the command line.
* **–loss_function** This is used to declare the loss function (e.g. squared, logistic, hinge etc.).
* **-f**  This is used to save the model for using it on a different dataset.


In [13]:
# Fitting a logistic regression for predicting the sentiment of a review
#logistic|hinge
!vw -d movie_reviews_train.vw --loss_function hinge -f movie_reviews_model.vw --quiet

## **TESTING**

* **-i** Read the model from the given file
* **-t -d** Declare that we are dealing with dataset without labels (test) at this path
* **-p** Save predictions to this file
* **–quiet** Do not print any steps taken for prediction

In [14]:
!vw -i movie_reviews_model.vw -t -d movie_reviews_valid.vw -p movie_valid_pred.txt --quiet

In [15]:
with open('movie_valid_pred.txt') as pred_file:
    valid_prediction = [float(label) for label in pred_file.readlines()]
    print("Accuracy: {}".format(round(accuracy_score(valid_labels, [int(pred_prob > 0) for pred_prob in valid_prediction]), 5)))
    #print("AUC: {}".format(round(roc_auc_score(valid_labels, valid_prediction), 5)))

Accuracy: 0.2128


## **N-GRAMS** 

In [16]:
!vw -d movie_reviews_train.vw --loss_function logistic --ngram 2 -f movie_reviews_model_bigram.vw

Generating 2-grams for all namespaces.
final_regressor = movie_reviews_model_bigram.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = movie_reviews_train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.0000      132
0.492365 0.291584            2            2.0  -1.0000  -1.0831      350
0.715313 0.938260            4            4.0   1.0000  -1.6409      224
0.698638 0.681964            8            8.0  -1.0000  -0.9543      468
0.448672 0.198706           16           16.0  -1.0000  -1.9253      350
0.604503 0.760333           32           32.0  -1.0000  -1.1030      336
0.570287 0.536072           64           64.0  -1.0000  -1.9983      374
0.498139 0.425990          128          128.0  -1.0000  -1.5878      172
0.474052 0.449966          256          2

In [17]:
!vw -i movie_reviews_model.vw -t -d movie_reviews_valid.vw -p movie_reviews_model_bigram.vw --quiet

In [18]:
with open('movie_reviews_model_bigram.vw') as pred_file:
    valid_prediction = [float(label) for label in pred_file.readlines()]
    print("Accuracy: {}".format(round(accuracy_score(valid_labels, [int(pred_prob > 0) for pred_prob in valid_prediction]), 5)))
    #print("AUC: {}".format(round(roc_auc_score(valid_labels, valid_prediction), 5)))

Accuracy: 0.2128


## **Model Interpretability**

In [19]:
!vw -d movie_reviews_train.vw --loss_function logistic --ngram 2 --invert_hash movie_reviews_readable_model_bigram.vw

Generating 2-grams for all namespaces.
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = movie_reviews_train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.0000      132
0.492365 0.291584            2            2.0  -1.0000  -1.0831      350
0.715313 0.938260            4            4.0   1.0000  -1.6409      224
0.698638 0.681964            8            8.0  -1.0000  -0.9543      468
0.448672 0.198706           16           16.0  -1.0000  -1.9253      350
0.604503 0.760333           32           32.0  -1.0000  -1.1030      336
0.570287 0.536072           64           64.0  -1.0000  -1.9983      374
0.498139 0.425990          128          128.0  -1.0000  -1.5878      172
0.474052 0.449966          256          256.0  -1.0000  -0.8328      308
0.467471 0.46089

## **REGULARIZATION**

In [20]:
!vw -d movie_reviews_train.vw --l1 0.00005 --l2 0.00005 --loss_function logistic --ngram 2 -f movie_reviews_model_bigram.vw

Generating 2-grams for all namespaces.
using l1 regularization = 5e-05
using l2 regularization = 5e-05
final_regressor = movie_reviews_model_bigram.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = movie_reviews_train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0  -1.0000   0.0000      132
0.492415 0.291683            2            2.0  -1.0000  -1.0827      350
0.715097 0.937779            4            4.0   1.0000  -1.6396      224
0.698451 0.681805            8            8.0  -1.0000  -0.9516      468
0.448968 0.199485           16           16.0  -1.0000  -1.9166      350
0.603455 0.757942           32           32.0  -1.0000  -1.1053      336
0.569885 0.536316           64           64.0  -1.0000  -1.9675      374
0.497651 0.425417          128          128.0  -1.