# MLOps - Movieclassifier

Applying machine learning to sentiment analysis using the IMDb movie review dataset.

## Import Relevant Libraries

In [2]:
import numpy as np
import pandas as pd
import tarfile
import os
import pyprind

## Data

We will assemble the individual text documents from the decompressed download archive into a single CSV file.
In the following code section, we read the movie reviews into a pandas DataFrame object.

In [3]:
# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = '../data/raw/aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:17


In the preceding code, we first initialized a new progress bar object, pbar, with 50,000 iterations, which is the number of documents we read in. Using nested for loops, we iterated over the train and test subdirectories in the main aclImdb directory and read the individual text files from the pos and neg subdirectories. Each review was collected into the df pandas DataFrame together with an integer class label (1 = positive and 0 = negative).

Since the class labels in the assembled dataset are sorted, we now shuffle the DataFrame using the permutation function from the np.random submodule. This will be useful for splitting the dataset into training and test sets in later sections, when we stream the data directly from our local drive.

In [4]:
seed = 28
np.random.seed(seed)
df = df.reindex(np.random.permutation(df.index))
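
As a small sanity check of the shuffling step, the sketch below applies the same reindex-plus-permutation idea to a toy DataFrame (the six rows and the helper name `shuffle_rows` are made up for illustration). It uses a local `RandomState` rather than the global seed, which is an assumption of this sketch and not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the assembled reviews: labels are sorted, as in the real data.
toy = pd.DataFrame({
    'review': ['good', 'great', 'fine', 'bad', 'awful', 'poor'],
    'sentiment': [1, 1, 1, 0, 0, 0],
})

def shuffle_rows(frame, seed):
    """Return the rows of `frame` in a seeded random order (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    return frame.reindex(rng.permutation(frame.index))

shuffled_a = shuffle_rows(toy, seed=28)
shuffled_b = shuffle_rows(toy, seed=28)

# Same seed gives the same order, and every review keeps its own label.
same_order = shuffled_a.index.equals(shuffled_b.index)
pairs_intact = shuffled_a.sort_index().equals(toy)
```

Because reindex moves whole rows, the review text and its label always travel together; only the row order changes.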

Let's look at the first rows of the DataFrame:

In [5]:
df.head()

|       | review | sentiment |
|-------|--------|-----------|
| 18924 | Stu Ungar is considered by many to be the grea... | 0 |
| 5868  | This movie kicks ass, bar none. Bam and his cr... | 1 |
| 33809 | I got to see this on the plane to NZ last week... | 1 |
| 1768  | i consider this movie as one of the most inter... | 1 |
| 44063 | This was the only time I ever walked out on a ... | 0 |


For our own convenience, we will also store the assembled and shuffled movie review dataset as a CSV file:

In [6]:
df.to_csv('../data/interim/movie_data.csv', index=False)

## Remove special characters

In [7]:
df.loc[15, 'review'][-50:]

'A sequel of sorts is in the works.<br /><br />7/10'

As you can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain many useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will now remove all punctuation marks except for emoticon characters, such as :), since those are certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular expression (regex) library, re, as shown here:

In [8]:
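import re

def preprocessor(text):
    # strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word runs to spaces, re-append emoticons without the nose
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text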

In [9]:

preprocessor(df.loc[15, 'review'][-50:])

'a sequel of sorts is in the works 7 10'

In [10]:

preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
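
To make the emoticon pattern easier to read, here is a minimal sketch that splits the same regular expression into its three parts and runs it on a made-up sample string (the name `EMOTICON` and the sample text are illustrative only).

```python
import re

# Same pattern as in preprocessor, written as three adjacent raw strings.
EMOTICON = re.compile(
    r'(?::|;|=)'      # eyes: ':' or ';' or '='
    r'(?:-)?'         # optional nose: '-'
    r'(?:\)|\(|D|P)'  # mouth: ')', '(', 'D' or 'P'
)

sample = "Loved it :-) but the ending =( was weak ;D ... 7/10"
found = EMOTICON.findall(sample)

# preprocessor drops the nose so that ':-)' and ':)' become the same token.
normalized = [e.replace('-', '') for e in found]
```

Note that the pattern is not anchored to word boundaries, so a string such as `x=P` would also yield `=P`; for movie reviews this is probably an acceptable trade-off.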

In [11]:
df['review'] = df['review'].apply(preprocessor)

## Tokenize

In [12]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [13]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [14]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

To remove stop-words from the movie reviews, we will use the set of English stop-words available from the NLTK library (127 words in the release used here; the count differs between NLTK releases), which can be obtained by calling the nltk.download function:

In [15]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fallou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
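
The filtering step itself does not depend on NLTK. The sketch below uses a tiny hand-written stop list (a hypothetical stand-in for `stopwords.words('english')`) and converts it to a set, since membership tests on a set do not scan the whole collection the way a list lookup does.

```python
# Hypothetical mini stop list standing in for stopwords.words('english').
stop_list = ['a', 'and', 'the', 'is', 'of']
stop_set = set(stop_list)  # set lookup avoids a linear scan per token

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = 'a runner likes running and runs a lot'.split()
filtered = remove_stop_words(tokens, stop_set)
```

With 50,000 reviews the difference between list and set lookups may become noticeable, although this notebook does not measure it.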

## Training

### Online Learning

In [17]:
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    # raw strings avoid invalid-escape warnings for \W, \) and \(
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
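    # the CSV was written as UTF-8, so read it back with the same encoding
    with open(path, 'r', encoding='utf-8') as csv_file:
        next(csv_file)  # skip header
        for line in csv_file:
            # each line ends with ',<label>\n': drop those 3 characters for the text
            text, label = line[:-3], int(line[-2])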
            yield text, label

In [18]:
next(stream_docs(path='../data/interim/movie_data.csv'))

('"Stu Ungar is considered by many to be the greatest poker / gin player of all time - an extraordinary self-destructive force of nature - tiny in stature, but a huge heart for the game.<br /><br />What we have here is a kind of Hallmark film about the dangers of gambling. Sure, he wins, he loses, he blows it all on sex, drugs, and more gambling we get it, but where is the real play - where is what made him the greatest card player of all time.<br /><br />Much too flat, and frankly boring in places, this gets a four because we get to learn something about Stu the man, but Stu the card player, nada.<br /><br />Nicely shot and presented up to a point this is the perfect example of how not to make a film about cards: honestly, ESPN\'s coverage of the World Series is more watchable than this.<br /><br />A waste of a great chance."',
 0)
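
The slicing in stream_docs relies on every line ending in a comma, a single-digit label, and a newline, with each review on one line. A minimal sketch on a throwaway file (the two sample rows are made up for illustration) shows what comes out, including the quotes that the CSV writer places around fields containing commas:

```python
import os
import tempfile

def stream_docs(path):
    """Yield (text, label) pairs from a 'review,sentiment' CSV, one line at a time."""
    with open(path, 'r', encoding='utf-8') as csv_file:
        next(csv_file)  # skip header
        for line in csv_file:
            # line looks like '<text>,<label>\n'
            text, label = line[:-3], int(line[-2])
            yield text, label

# Write a two-row file in the same layout as movie_data.csv.
rows = 'review,sentiment\n"Great, fun film",1\nDull plot,0\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(rows)

parsed = list(stream_docs(tmp.name))
os.remove(tmp.name)
```

The quotes stay in the text because the line is never parsed as CSV. That is harmless here, since the tokenizer later discards non-word characters.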

In [19]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
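
One detail of get_minibatch is easy to miss: if the stream runs out before `size` documents have been read, the partially filled batch is discarded and `(None, None)` is returned. The sketch below repeats the function and feeds it a five-item toy stream (made up for illustration) to show this.

```python
def get_minibatch(doc_stream, size):
    """Return up to `size` documents and labels, or (None, None) if the stream ends early."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Five toy documents with alternating labels.
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)  # only one document left
```

Here `third` is `(None, None)` even though one document was still available, so the total number of documents should be a multiple of the batch size if every document is meant to be used.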

In [23]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
clf = SGDClassifier(loss='log_loss', random_state=1, n_iter_no_change=1)
doc_stream = stream_docs(path='../data/interim/movie_data.csv')
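
HashingVectorizer needs no fitted vocabulary because each token is mapped to a column index by a hash function, which is what makes it suitable for streaming. The following is a simplified, pure-Python illustration of that idea and not scikit-learn's implementation: it uses MD5 from hashlib as a stand-in hash, whereas scikit-learn uses a signed MurmurHash3 and normalizes the resulting vectors by default. The function name `hashed_counts` is hypothetical.

```python
import hashlib

def hashed_counts(tokens, n_features=2**21):
    """Map tokens to a sparse {column: count} dict via a stable hash (illustrative only)."""
    counts = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        col = int.from_bytes(digest[:8], 'big') % n_features
        counts[col] = counts.get(col, 0) + 1
    return counts

# The same tokens always land in the same columns, with no vocabulary to store.
vec_a = hashed_counts(['great', 'film', 'great'])
vec_b = hashed_counts(['great', 'film', 'great'])
```

Different tokens can collide in the same column; a large `n_features` such as 2**21 keeps the chance of that low.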

In [24]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:30


In [25]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.885


In [26]:

clf = clf.partial_fit(X_test, y_test)

## Experiment tracking with MLflow

In [36]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import mlflow
import mlflow.sklearn
import pyprind


pbar = pyprind.ProgBar(45)

# Set tracking uri
mlflow.set_tracking_uri("file://" + "/Volumes/partition/projects/learning/movieclassifier/mlruns")
# Set experiment
mlflow.set_experiment(experiment_name="baselines")

# Tracking
with mlflow.start_run() as run:
    vect = HashingVectorizer(decode_error='ignore',
                             n_features=2**21,
                             preprocessor=None,
                             tokenizer=tokenizer)

    # loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
    clf = SGDClassifier(loss='log_loss', random_state=1, n_iter_no_change=1)
    doc_stream = stream_docs(path='../data/interim/movie_data.csv')

    classes = np.array([0, 1])
    for _ in range(45):
        X_train, y_train = get_minibatch(doc_stream, size=1000)
        if not X_train:
            break
        X_train = vect.transform(X_train)
        clf.partial_fit(X_train, y_train, classes=classes)
        pbar.update()


    X_test, y_test = get_minibatch(doc_stream, size=5000)
    X_test = vect.transform(X_test)
    score = clf.score(X_test, y_test)
    print('Accuracy: %.3f' % score)

    mlflow.log_metric("score", score)
    mlflow.sklearn.log_model(clf, "model")

INFO: 'baseline' does not exist. Creating a new experiment
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:30
Accuracy: 0.885
