29 MAR 2016<br/>
source: [Out-of-core Learning and Model Persistence](https://github.com/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb)

[**Out-of-core**](https://en.wikipedia.org/wiki/Out-of-core_algorithm) or external memory algorithms are algorithms that are designed to process data that is too large to fit into a computer's main memory at one time.
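
To make the idea concrete, here is a minimal sketch (not part of the original notebook) of a one-pass computation that never holds the full dataset in memory. The function name `running_mean` and the generator standing in for data on disk are illustrative assumptions.

```python
def running_mean(values):
    """Compute the mean of an iterable in one pass, one item at a time."""
    count, mean = 0, 0.0
    for x in values:
        count += 1
        # incremental update: only the current item and two numbers are kept
        mean += (x - mean) / count
    return mean


# A generator stands in for records that would be read lazily from disk.
stream = (i for i in range(1, 101))
print(running_mean(stream))
```

The same pattern, updating a small state from one record or one batch at a time, is what the incremental classifier later in this post relies on.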

## The IMDb Movie Dataset

In [8]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')

df.tail()

|       | review | sentiment | set |
|-------|--------|-----------|-----|
| 49995 | Towards the end of the movie, I felt it was to... | 0 | train |
| 49996 | This is the kind of movie that my enemies cont... | 0 | train |
| 49997 | I saw 'Descent' last night at the Stockholm Fi... | 0 | train |
| 49998 | Some films that you pick up for a pound turn o... | 0 | train |
| 49999 | This is one of the dumbest films, I've ever se... | 0 | train |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
review       50000 non-null object
sentiment    50000 non-null int64
set          50000 non-null object
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [9]:
df.sentiment.unique()

array([1, 0])

In [7]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df[['review', 'sentiment']].to_csv('out/shuffled_movie_data.csv', index=False)

## Preprocessing Text Data

Now, let us define a simple tokenizer that splits the text into individual word tokens. We will use simple regular expressions to remove HTML markup and all non-word characters except emoticons, convert the text to lowercase, remove stop words, and apply the Porter stemming algorithm to reduce the words to their root form.

In [2]:
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

stop = stopwords.words('english')
porter = PorterStemmer()

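def tokenizer(text):
    # strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) and ;-( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace runs of non-word characters with a space, then append the emoticons without the "nose"
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # drop stop words and stem the remaining tokens
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized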

In [3]:
tokenizer('This :) is a <a> test! :-)</br>')

[u'test', u':)', u':)']

In [4]:
re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', 'This :) is a <a> test! :-)</br>'.lower())

[':)', ':-)']

In [5]:
re.findall('<[^>]*>', 'This :) is a <a> test! :-)</br>'.lower())

['<a>', '</br>']

## Out-of-core learning

In [7]:
%%bash

head -n 2 out/shuffled_movie_data.csv

review,sentiment
"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70's, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The power

First, we define a generator that returns the document body and the corresponding class label:

In [6]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
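
The slicing in `stream_docs` relies on every line ending in a comma, a single-digit label, and a newline, and it leaves the CSV quoting in the text untouched. As an alternative sketch (Python 3, not the notebook's method), the standard `csv` module can parse each row instead. The name `stream_docs_csv` is hypothetical, and note that this variant yields the unquoted text, so its output differs slightly from the function above.

```python
import csv


def stream_docs_csv(path):
    """Hypothetical variant of stream_docs that parses rows with csv.reader."""
    with open(path, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for text, label in reader:
            # csv.reader has already removed the surrounding quotes
            # and turned doubled quotes ("") back into single ones
            yield text, int(label)
```

Like the original, this is a generator, so only one row is held in memory at a time.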

To confirm that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [11]:
next(stream_docs('out/shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

Having confirmed that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [7]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
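
The function above raises `StopIteration` if the stream runs out before `size` documents have been read. If a shorter final batch is preferable, one possible sketch uses `itertools.islice`. The name `get_minibatch_safe` is hypothetical and not part of the original notebook.

```python
from itertools import islice


def get_minibatch_safe(doc_stream, size):
    """Return up to `size` (text, label) pairs; fewer if the stream ends early."""
    batch = list(islice(doc_stream, size))
    if not batch:
        return [], []
    docs, y = zip(*batch)  # split the pairs into two parallel sequences
    return list(docs), list(y)


stream = iter([('good film', 1), ('bad film', 0), ('fine', 1)])
print(get_minibatch_safe(stream, 2))
print(get_minibatch_safe(stream, 2))
```

An empty result then signals that the stream is exhausted, which a training loop can use as its stopping condition.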

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a **bag-of-words model** of our documents.<br/>
More info: [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329)
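
As a rough illustration of the hashing trick, the toy sketch below maps tokens to positions in a fixed-length count vector without ever building a vocabulary. It is not how `HashingVectorizer` is implemented: the function name `hash_vectorize`, the tiny `n_features`, and the use of `zlib.crc32` as the hash are all assumptions made for the example.

```python
import zlib


def hash_vectorize(tokens, n_features=16):
    """Toy hashing trick: count tokens in a fixed-length vector."""
    vec = [0] * n_features
    for tok in tokens:
        # crc32 is a stand-in hash; the index depends only on the token itself,
        # so no dictionary of known words has to be stored or fitted
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec


print(hash_vectorize(['good', 'movie', 'good']))
```

Because nothing is learned from the data, the vectorizer has no state to fit, which is what makes it suitable for streaming. The price is that different tokens can collide in the same position, which a large `n_features` such as `2**21` makes less likely.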

In [8]:
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

Using the [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) from scikit-learn, we will instantiate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent.<br/>
More info about the algorithm: [Artificial neurons](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#Online-Learning-via-Stochastic-Gradient-Descent)
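
To show what "learning incrementally" means here, the following is a simplified sketch of a single stochastic gradient descent update for logistic regression in plain Python. It leaves out the regularization and learning-rate schedule that `SGDClassifier` applies, and the names `sigmoid`, `sgd_step`, and the fixed learning rate `eta` are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w, b, x, y, eta=0.1):
    """Update weights w and bias b from one training sample (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    error = p - y  # gradient of the log loss with respect to the net input
    w = [wi - eta * error * xi for wi, xi in zip(w, x)]
    b = b - eta * error
    return w, b


# The model is updated one sample at a time, so the data never
# has to be in memory all at once.
w, b = [0.0, 0.0], 0.0
for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 100:
    w, b = sgd_step(w, b, x, y)
print(w, b)
```

Each call only needs the current sample and the current weights, which is why the same model can be trained batch by batch with `partial_fit`.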

In [9]:
from sklearn.linear_model import SGDClassifier

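# Note: newer scikit-learn releases spell loss='log' as loss='log_loss' and replaced n_iter with max_iter
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)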
doc_stream = stream_docs(path='out/shuffled_movie_data.csv')

In [10]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:32


The preceding code used the first 45,000 movie reviews to train the classifier, which leaves 5,000 reviews for testing:

In [11]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: {:.3f}'.format(clf.score(X_test, y_test)))

Accuracy: 0.873


The predictive performance, an accuracy of ~87%, is quite reasonable given that we only used the default parameters and did not do any hyperparameter optimization.

Having estimated the model performance, let us use those last 5,000 test samples to update our model.

In [12]:
clf = clf.partial_fit(X_test, y_test)

## Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we closed this IPython notebook at this point, we would have to go through the whole learning process again every time we wanted to make a prediction on new data.

So, to reuse this model, we could use the `pickle` module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

In [14]:
import joblib
import os

if not os.path.exists('./pkl_objects'):
    os.mkdir('./pkl_objects')
    
joblib.dump(vect, 'pkl_objects/vectorizer.pkl')
joblib.dump(clf, 'pkl_objects/clf.pkl')

['pkl_objects/clf.pkl',
 'pkl_objects/clf.pkl_01.npy',
 'pkl_objects/clf.pkl_02.npy',
 'pkl_objects/clf.pkl_03.npy',
 'pkl_objects/clf.pkl_04.npy']

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can reuse those objects later. However, `pickle` and `joblib` have a **known issue** with objects or functions defined in a `__main__` block: functions are pickled by reference to their module and name, not by value, so we would get an `AttributeError: Can't get attribute [x] on <module '__main__'>` if we unpickled them later in a different session. Thus, to pickle the tokenizer function, we can write it to a file and import it to get the namespace "right".
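
A small sketch with the standard library shows why the namespace matters. Pickling a module-level function stores only the module and function names, which are looked up again at load time. The helper name `pickled_by_reference` is hypothetical, and `json.dumps` merely serves as an importable example function.

```python
import json
import pickle


def pickled_by_reference(func):
    """Check that the pickle payload names the function's module and name."""
    payload = pickle.dumps(func)
    return (func.__module__.encode() in payload
            and func.__name__.encode() in payload)


print(pickled_by_reference(json.dumps))

# Unpickling imports the module and fetches the attribute by name,
# so we get back the very same function object.
restored = pickle.loads(pickle.dumps(json.dumps))
print(restored is json.dumps)
```

A function defined interactively lives in `__main__`, so a fresh session has nothing under that name to look up. A function imported from `tokenizer.py` can be found again as long as the file is importable.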

In [15]:
%%writefile tokenizer.py

import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

stop = stopwords.words('english')
porter = PorterStemmer()

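def tokenizer(text):
    # strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) and ;-( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace runs of non-word characters with a space, then append the emoticons without the "nose"
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # drop stop words and stem the remaining tokens
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized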

Writing tokenizer.py


In [17]:
from tokenizer import tokenizer

joblib.dump(tokenizer, 'pkl_objects/tokenizer.pkl')

['pkl_objects/tokenizer.pkl']

Now, let us restart this IPython notebook and check whether we can load our serialized objects:

In [1]:
import joblib

tokenizer = joblib.load('pkl_objects/tokenizer.pkl')
vect = joblib.load('pkl_objects/vectorizer.pkl')
clf = joblib.load('pkl_objects/clf.pkl')

After loading the `tokenizer`, the `HashingVectorizer`, and the trained logistic regression model, we can use them to make predictions on new data, which can be useful, for example, if we want to embed our classifier into a web application.

In [2]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

array([0])

In [3]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)

array([1])
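
For use in an application, the two steps above could be wrapped in a small helper. The sketch below is one possible shape, not part of the original notebook: the name `classify` and the `LABELS` mapping are assumptions (the mapping follows the two examples above, where 0 is the negative and 1 the positive review), and it expects the loaded `vect` and `clf` objects to be passed in.

```python
# assumed mapping of the dataset's 0/1 sentiment labels
LABELS = {0: 'negative', 1: 'positive'}


def classify(review, vect, clf):
    """Return (label, probability) for a single review string."""
    X = vect.transform([review])
    y = int(clf.predict(X)[0])
    # predict_proba is available because the classifier uses the logistic loss
    proba = float(max(clf.predict_proba(X)[0]))
    return LABELS[y], proba
```

With the objects loaded above, a call would look like `classify('I loved this movie', vect, clf)`.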