In [1]:
#Chapter 8 Applying Machine Learning to Sentiment Analysis

#How to classify documents based on their polarity:
#the attitude of the writer

#In particular, we are going to work with a dataset of 50000 movie reviews from IMDb
# and build a predictor that can distinguish between positive and negative reviews

# We will cover:
# 1. Cleaning and preparing text data
# 2. Building feature vectors from text documents
# 3. Training a machine learning model to classify positive and negative movie reviews
# 4. Working with large text datasets using out-of-core learning
# 5. Inferring topics from document collections for categorization


#Preparing the IMDb movie review data for text processing

#Sentiment analysis, sometimes also called opinion mining
# analyzing the polarity of documents
# classification of documents based on the expressed opinions or emotions of the authors

# We will work on a large dataset of movie reviews from the IMDb
# that consists of 50,000 polar movie reviews labeled as either positive or negative
# Positive: a movie was rated with more than six stars
# Negative: a movie was rated with fewer than five stars

# We will:
# 1. Download the dataset
# 2. Preprocess it into a useable format for machine learning tools
# 3. Extract meaningful information from a subset of these movie reviews to build a machine learning model
# Goal: predict whether a certain reviewer liked or disliked a movie


#Obtaining the IMDb movie review dataset

import os 
import sys
import tarfile
import time

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

#code to download the dataset
def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size
    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                     (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()
    
if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    if (sys.version_info < (3,0)):
        import urllib
        urllib.urlretrieve(source,target,reporthook)
    else:
        import urllib.request
        urllib.request.urlretrieve(source,target,reporthook)









In [2]:
#code to unzip the dataset
# unpack the Gzip-compressed tarball archive directly in Python

if not os.path.isdir('aclImdb'):
    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

In [3]:
# read the movie reviews into a pandas DataFrame object
# this can take up to 10 minutes
# to visualize the progress and the estimated time until completion, we use the
# Python Progress Indicator (PyPrind): pip install pyprind

import pyprind
import pandas as pd
import os

#change the `basepath` to the directory of the unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos':1, 'neg':0}
# initialize a new progress bar object pbar
# with 50,000 iterations, which is the number of docs we will read in
pbar = pyprind.ProgBar(50000) 
rows = []

for s in ('test','train'): #iterate over the train and test subdirectories
    for l in ('pos','neg'): #read the individual text files from the pos and neg subdirectories
        path = os.path.join(basepath,s,l)
        for file in os.listdir(path):
            with open(os.path.join(path,file),
                      'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            # collect the rows in a list and build the DataFrame once at the end;
            # DataFrame.append was removed in pandas 2.0
            rows.append([txt,labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review','sentiment'])


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:35


In [4]:
# Since the class labels in the assembled dataset are sorted,
# we need to shuffle the DataFrame using the permutation function from the np.random submodule
# this will be useful later when we split the dataset into training and test sets

# later, we will stream the data from our local drive directly,
# so we store the assembled and shuffled movie review dataset as a CSV file



import numpy as np
# Shuffle
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

In [5]:
# Saving the assembled data as CSV file

df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

In [6]:
import pandas as pd

df = pd.read_csv('movie_data.csv',encoding = 'utf-8')
df.head(3)

                                              review  sentiment
0  This tearful movie about a sister and her batt...          1
1  It's too kind to call this a "fictionalized" a...          0
2  Truly bad and easily the worst episode I have ...          0


In [7]:
# Introducing the bag-of-words model
# represent text as numerical feature vectors
# Idea:
# 1. Create a vocabulary of unique tokens
# 2. Construct a feature vector from each document 
#    that contains the frequency of the words in that document

# Since the unique words in each document represent only a small subset
# of all the words in the bag-of-words vocabulary
# the feature vectors will mostly consist of zeros ==> Sparse

# Transforming documents into feature vectors

# By calling the fit_transform method on CountVectorizer
# we just constructed the vocabulary of the bag-of-words model 
# and transformed the following 3 sentences into sparse feature vectors

# 1. The sun is shining
# 2. The weather is sweet
# 3. The sun is shining, the weather is sweet, and one and one is two

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [8]:
# The vocabulary is stored in a Python dictionary 
# that maps the unique words to integer indices (not counts) 

print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [9]:
# Next let us print the feature vectors that we just created:

# Each index position in the feature vectors shown here corresponds to 
# the integer values that are stored as dictionary items in the CountVectorizer vocabulary. 
# For example, the first feature at index position 0 represents the count of the word 'and', 
# which only occurs in the last document, 
# and the word 'is' at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. 
# Those values in the feature vectors are also called 
# the raw term frequencies: tf(t,d),
# the number of times a term t occurs in a document d.

print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
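
To make the mapping between the vocabulary and the counts concrete, here is a minimal pure-Python sketch that rebuilds the same matrix by hand. It assumes a simplified tokenizer (lowercase, tokens of two or more word characters), which behaves like CountVectorizer's defaults for these three sentences; the helper names `tokenize` and `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(text):
    # lowercase the text and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(d) for d in documents]
    # vocabulary: unique tokens in alphabetical order, mapped to column indices
    unique_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # one count per vocabulary entry, in column order
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

vocabulary, bag_manual = bag_of_words(docs)
print(vocabulary)
for row in bag_manual:
    print(row)
```

Each row is one document and each column one vocabulary entry, so most entries are zero as soon as the vocabulary grows beyond a handful of words.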


In [10]:
#Assessing word relevancy via term frequency-inverse document frequency

np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. 

Term frequency-inverse document frequency (tf-idf):
- Downweight those frequently occurring words in the feature vectors. 

The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here, *tf(t, d)* is the term frequency, and the inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. 

Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf = True,
                         norm = 'l2',
                         smooth_idf = True)

print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word 'is' had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word 'is' is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.

However, `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that was implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn's TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

To make sure that we understand how `TfidfTransformer` works, let us walk
through an example and calculate the tf-idf of the word 'is' in the 3rd document.

The word 'is' has a term frequency of 3 (tf = 3) in document 3, and the document frequency of this term is 3 since the term 'is' occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}("is", d3) = \log \frac{1+3}{1+3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [12]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs +1)/(3+1))
tfidf_is = tf_is * (idf_is +1)

print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d3) = 0.45$$

As we can see, the results match the results returned by scikit-learn's `TfidfTransformer` (below). Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

In [13]:
#replicate the 'TfidfTransformer''s result

tfidf = TfidfTransformer(use_idf=True, norm = None, smooth_idf = True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [15]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])
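
The same calculation can be carried out for all three documents at once. The following NumPy sketch applies the smoothed idf, the "+1" in the tf-idf product, and the L2-normalization described above to the raw term frequencies of the three example sentences; the variable names (`tf`, `df_counts`, `manual_tfidf`) are chosen here for illustration and are not part of scikit-learn.

```python
import numpy as np

# raw term frequencies of the three example sentences (rows = documents)
tf = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 1, 1, 0, 1],
               [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=float)

n_docs = tf.shape[0]
# document frequency: in how many documents does each term occur?
df_counts = (tf > 0).sum(axis=0)
# smoothed idf as in the scikit-learn equation above
idf = np.log((1 + n_docs) / (1 + df_counts))
# tf-idf before normalization: tf * (idf + 1)
raw = tf * (idf + 1)
# L2-normalize every document vector to unit length
manual_tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(np.round(manual_tfidf, 2))
```

Up to rounding, the rows should agree with the matrix printed by `TfidfTransformer` earlier.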

In [17]:
# Clean text data

# Before building our bag-of-words model,
# we need to clean the text data by stripping it of all unwanted characters

# Display the last 50 characters from the first document in the reshuffled movie review dataset
# the text contains HTML markup and punctuation and other non-letter characters

df.loc[0,'review'][-50:]

' wonderful journey from life to death.<br /><br />'

In [18]:
# Though punctuation marks can represent useful, additional information in NLP contexts
# for simplicity, we will remove all punctuation marks 
# except for emoticon characters such as :) as those are useful for sentiment analysis
# we will use Python's regular expression (regex) library, `re`

import re

def preprocessor(text):
    
    # remove all of the HTML markup from the movie reviews
    text = re.sub(r'<[^>]*>','',text)
    
    # find emoticons and store them temporarily in `emoticons`
    # (raw strings avoid invalid-escape warnings for \) and \W)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    
    # remove all non-word characters from the text via the regex [\W]+
    # and convert the text into lowercase characters
    text = (re.sub(r'[\W]+',' ',text.lower()) + 
            ' '.join(emoticons).replace('-','')) # add the temporarily stored emoticons to the end of the processed document string
    # in addition, :-) is converted to :) for consistency (removing the '-')

    # although the addition of the emoticon characters to the end of the cleaned document string
    # may not look like the most elegant approach,
    # the order of the words does not matter in our bag-of-words model 
    # if our vocabulary consists of only one-word tokens
    
    return text

In [20]:
# test that the preprocessor works correctly
preprocessor(df.loc[0,'review'][-50:])

' wonderful journey from life to death '

In [21]:
# test that the preprocessor works correctly
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
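
The emoticon pattern is compact and easy to misread, so here is the same regular expression spelled out with `re.VERBOSE`. This is only an annotated restatement of the pattern used in `preprocessor`; the names `EMOTICON`, `sample`, `found`, and `normalized` are introduced for this illustration.

```python
import re

# the emoticon pattern from preprocessor, written in verbose form
EMOTICON = re.compile(r"""
    (?::|;|=)       # eyes: a colon, a semicolon, or an equals sign
    (?:-)?          # optional nose: a hyphen
    (?:\)|\(|D|P)   # mouth: closing or opening parenthesis, D, or P
""", re.VERBOSE)

sample = "</a>This :) is :( a test :-)!"

# all groups are non-capturing, so findall returns the full matches
found = EMOTICON.findall(sample)
print(found)

# dropping the nose makes :-) and :) count as the same token
normalized = [e.replace('-', '') for e in found]
print(normalized)
```

Because the mouth alternatives `D` and `P` are upper-case, the pattern has to be applied before the text is lower-cased, as `preprocessor` does.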

In [23]:
#apply the preprocessor function to all the movie reviews in our DataFrame
df['review'] = df['review'].apply(preprocessor)

In [25]:
#Processing documents into tokens

#split the text corpora into individual elements

#Natural Language Toolkit (NLTK)
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    # tokenize the document by splitting the cleaned document
    # at its whitespace characters
    return text.split()

def tokenizer_porter(text):
    #word stemming + tokenization
    # transform a word into its root form
    # Porter stemmer algorithm
    return [porter.stem(word) for word in text.split()]

In [26]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [28]:
tokenizer_porter('runners like running and thus they run')

# stemming may create non-real words (e.g. 'thu' from 'thus').
# Lemmatization aims to obtain the canonical (grammatically correct)
# forms of individual words, the so-called lemmas,
# but it is computationally more difficult and expensive compared to stemming,
# and in practice, it has been observed that stemming and lemmatization 
# have little impact on the performance of text classification


['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [30]:
# Stop-word removal
# Stop-words: words that are extremely common in all sorts of texts and probably bear no (or little) useful information
# e.g. 'is', 'and', 'has', and 'like'
# Removing stop-words can be useful if we are working with raw or normalized term frequencies 
# rather than tf-idfs, which already downweight frequently occurring words

# use the set of English stop-words that is available from the NLTK library
# (127 words in older NLTK releases; the exact count depends on the installed version)
import nltk

nltk.download('stopwords')

#load and apply the English stop-word set as follows:

from nltk.corpus import stopwords

stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

[nltk_data] Downloading package stopwords to /home/Rex/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['runner', 'like', 'run', 'run', 'lot']

In [31]:
# Training a logistic regression model for document classification
# strip HTML and punctuation to speed up the GridSearch later:

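# note: label-based .loc slicing includes the end label, so the training
# slice stops at 24999 to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values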

In [34]:
# logistic regression
# use a `GridSearchCV` object to find the optimal set of parameters 
# for our logistic regression model
# using 5-fold stratified cross-validation

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# TfidfVectorizer = CountVectorizer + TfidfTransformer
tfidf = TfidfVectorizer(strip_accents = None,
                        lowercase = False,
                        preprocessor = None)

# 2 parameter dictionaries: 
#   1) TfidfVectorizer with its default settings (use_idf=True, norm='l2')
#   2) use_idf = False and norm = None
#    to train a model based on raw term frequencies
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'], # regularization
               'clf__C': [1.0, 10.0, 100.0]}, # regularization strength
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'], # regularization
               'clf__C': [1.0, 10.0, 100.0]} # regularization strength
              ]

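# solver='liblinear' supports both the 'l1' and 'l2' penalties in the grid;
# the default solver of recent scikit-learn releases does not support 'l1'
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])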
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                            scoring = 'accuracy',
                            cv = 5,
                            verbose = 1,
                            n_jobs = -1)

**Important Note about `n_jobs`**

Please note that it is highly recommended to use `n_jobs=-1` (instead of `n_jobs=1`) in the previous code example to utilize all available cores on your machine and speed up the grid search. However, some Windows users reported issues when running the previous code with the `n_jobs=-1` setting, related to pickling the `tokenizer` and `tokenizer_porter` functions for multiprocessing on Windows. One workaround is to fall back to `n_jobs=1`. Another is to replace those two functions, `[tokenizer, tokenizer_porter]`, with `[str.split]`; note, however, that the simple `str.split` does not support stemming.

**Important Note about the running time**

Executing the following code cell **may take up to 30-60 min** depending on your machine, since based on the parameter grid we defined, there are 2*2*2*3*5 + 2*2*2*3*5 = 240 models to fit.
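
The count of 240 can be checked with a few lines of plain Python. The sketch below only counts options, so the stop-word list and the tokenizer functions are replaced by placeholder strings; `candidates_per_grid`, `n_candidates`, and `n_fits` are names chosen for this illustration.

```python
from math import prod

# placeholder values: only the number of options per parameter matters here
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]
n_folds = 5

# each dictionary contributes the product of its option counts
candidates_per_grid = [prod(len(options) for options in grid.values())
                       for grid in param_grid]
n_candidates = sum(candidates_per_grid)
n_fits = n_candidates * n_folds

print(candidates_per_grid, n_candidates, n_fits)
```

Every option added to one parameter multiplies the number of fits for that dictionary, which is why trimming the grid shortens the search so effectively.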

If you do not wish to wait so long, you could reduce the size of the dataset by decreasing the number of training samples, for example, as follows:

    X_train = df.loc[:2500, 'review'].values
    y_train = df.loc[:2500, 'sentiment'].values
    
However, note that decreasing the training set size to such a small number will likely result in poorly performing models. Alternatively, you can delete parameters from the grid above to reduce the number of models to fit -- for example, by using the following:

    param_grid = [{'vect__ngram_range': [(1, 1)],
                   'vect__stop_words': [stop, None],
                   'vect__tokenizer': [tokenizer],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0]},
                  ]

In [35]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 47.8min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 65.8min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_tr

In [36]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fab90e59ea0>} 
CV Accuracy: 0.894


In [37]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.901


####  Start comment:
    
Please note that `gs_lr_tfidf.best_score_` is the average k-fold cross-validation score. I.e., if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:

In [40]:
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# create a simple data set of random integers that shall represent our class labels
np.random.seed(0)
np.set_printoptions(precision = 6)
y = [np.random.randint(3) for i in range(25)]
X = (y+np.random.randn(25)).reshape(-1,1)

# feed the indices of 5 cross-validation folds (`cv5_idx`) to the scorer,
# which returns 5 accuracy scores
# (random_state is omitted: it has no effect when shuffle=False,
# and recent scikit-learn releases reject the combination)
cv5_idx = list(StratifiedKFold(n_splits = 5, shuffle = False).split(X,y))

cross_val_score(LogisticRegression(random_state=123),X,y,cv=cv5_idx)

array([0.6, 0.4, 0.6, 0.2, 0.6])

By executing the code above, we created a simple data set of random integers that shall represent our class labels. Next, we fed the indices of 5 cross-validation folds (`cv5_idx`) to the `cross_val_score` scorer, which returned 5 accuracy scores -- these are the 5 accuracy values for the 5 test folds.  

Next, let us use the `GridSearchCV` object and feed it the same 5 cross-validation sets (via the pre-generated `cv5_idx` indices):

In [41]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(LogisticRegression(),{},cv=cv5_idx,verbose=3).fit(X,y)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV]  ................................................................
[CV] ...................................... , score=0.6, total=   0.0s
[CV]  ................................................................
[CV] ...................................... , score=0.4, total=   0.0s
[CV]  ................................................................
[CV] ...................................... , score=0.6, total=   0.0s
[CV]  ................................................................
[CV] ...................................... , score=0.2, total=   0.0s
[CV]  ................................................................
[CV] ...................................... , score=0.6, total=   0.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


As we can see, the scores for the 5 folds are exactly the same as the ones from `cross_val_score` earlier.

Now, the `best_score_` attribute of the `GridSearchCV` object, which becomes available after `fit`ting, returns the average accuracy score of the best model:

In [42]:
gs.best_score_

0.48

As we can see, the result above is consistent with the average score computed with `cross_val_score`.

In [43]:
cross_val_score(LogisticRegression(), X, y, cv=cv5_idx).mean()

0.48

#### End comment.

<br>

In [49]:
#Working with Bigger data - online algorithms and out-of-core learning

# This cell is added for convenience so that the notebook can
# be executed starting here, without executing prior code in
# notebook

import os
import gzip

if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        # trailing spaces keep the concatenated message readable
        print('Please place a copy of the movie_data.csv.gz '
              'in this directory. You can obtain it by '
              'a) executing the code in the beginning of this '
              'notebook or b) by downloading it from GitHub: '
              'https://github.com/rasbt/python-machine-learning-'
              'book-2nd-edition/blob/master/code/ch08/movie_data.csv.gz')
    else:
        # decompress the archive; the context managers close both files
        with gzip.open('movie_data.csv.gz','rb') as in_f, \
                open('movie_data.csv','wb') as out_f:
            out_f.write(in_f.read())
            


In [55]:
# Out-of-core learning:
# work with large datasets by fitting the classifier incrementally
# on smaller batches of the dataset

# Stochastic gradient descent:
# an optimization algorithm that updates the model's weights using one sample at a time

# make use of the `partial_fit` function of the `SGDClassifier` in scikit-learn
# to stream the documents directly from our local drive
# and train a logistic regression model using small mini-batches of documents

import numpy as np 
import re
from nltk.corpus import stopwords

# The `stop` is defined as earlier in this chapter
# Added it here for convenience, so that this section
# can be run as standalone without executing prior code
# in the directory

stop = stopwords.words('english')

def tokenizer(text):
    # clean the unprocessed text data from the movie_data.csv file and
    # separate it into word tokens while removing stop words
    text = re.sub(r'<[^>]*>', '', text)
    # note: because the lower-cased text is searched here, the upper-case
    # alternatives D and P never match, so :D and :P are not captured
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    # reads in and returns one document at a time
    with open(path,'r',encoding='utf-8') as csv:
        next(csv) #skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])           
            yield text, label            
            

In [58]:
# to verify that our stream_docs function works correctly
# read in the first document from the `movie_data.csv`
# return a tuple consisting of the review text and the corresponding class label

next(stream_docs(path='movie_data.csv'))

('This tearful movie about a sister and her battle to save as many souls as she can is very moving. The film does well in picking up the characters and showing how Sister Helen deals with each.<br /><br />A wonderful journey from life to death.<br /><br />',
 1)
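
`stream_docs` slices each line by position, which relies on every record ending in `,<label>\n` and leaves any CSV quoting characters in the review text. As a hedged alternative, the sketch below lets the standard-library `csv` module parse each record. The function name `stream_docs_csv` and the tiny demo file are made up for this illustration, and the sketch assumes the two-column layout (`review`, `sentiment`) written by `to_csv` earlier.

```python
import csv
import os
import tempfile

def stream_docs_csv(path):
    # hypothetical variant of stream_docs: csv.reader handles quoted
    # fields, embedded commas, and doubled quotes for us
    with open(path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for text, label in reader:
            yield text, int(label)

# write a tiny stand-in for movie_data.csv to a temporary directory
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_reviews.csv')
with open(demo_path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['review', 'sentiment'])
    writer.writerow(['A "great", moving film.', 1])
    writer.writerow(['Truly bad.', 0])

demo_docs = list(stream_docs_csv(demo_path))
print(demo_docs)
```

The generator still yields one `(text, label)` tuple at a time, so it could be passed to `get_minibatch` in the same way as `stream_docs`.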

In [66]:
def get_minibatch(doc_stream,size):

# take a document stream from the stream_docs function and
# return a particular number of documents specified by the size parameter
    
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
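
Note that `get_minibatch` returns `None, None` as soon as the stream runs out, even if some documents were already collected for the current batch. If that final partial batch matters, one possible variant uses `itertools.islice`; the name `get_minibatch_partial` and the toy stream are assumptions made for this sketch.

```python
from itertools import islice

def get_minibatch_partial(doc_stream, size):
    # hypothetical variant of get_minibatch: take up to `size` documents
    # and return whatever is left when the stream ends early
    batch = list(islice(doc_stream, size))
    if not batch:
        return None, None
    docs, y = zip(*batch)
    return list(docs), list(y)

# toy stream of five (text, label) pairs
demo_stream = (('doc %d' % i, i % 2) for i in range(5))

first = get_minibatch_partial(demo_stream, 3)
second = get_minibatch_partial(demo_stream, 3)
third = get_minibatch_partial(demo_stream, 3)
print(first)
print(second)
print(third)
```

The second call returns the two remaining documents instead of discarding them, and only the third call signals the end of the stream.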

In [67]:
# We cannot use CountVectorizer or TfidfVectorizer for out-of-core learning:
# CountVectorizer requires holding the complete vocabulary in memory
# TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies

# But we can use HashingVectorizer in scikit-learn
# data-independent, makes use of the hashing trick via the 32-bit MurmurHash3 function

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features = 2**21, # large number of features to reduce the chance of hash collisions,
                         # but this also increases the number of coefficients in our logistic regression model
                         preprocessor = None,
                         tokenizer = tokenizer)

# initialize a logistic regression classifier by setting the loss parameter;
# the log loss is named 'log_loss' in scikit-learn >= 1.1
# (older releases call it 'log')
clf = SGDClassifier(loss='log_loss', random_state = 1, max_iter = 1)
    
doc_stream = stream_docs(path='movie_data.csv')
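
The hashing trick mentioned in the comments above can be sketched in a few lines. This is only an illustration: it uses `zlib.crc32` from the standard library as a stand-in hash (scikit-learn uses MurmurHash3), a tiny `n_features`, and a hypothetical helper name `hashed_counts`; the sign alternation that `HashingVectorizer` applies by default is left out.

```python
import zlib

def hashed_counts(tokens, n_features=16):
    # map every token to a column via hash(token) mod n_features;
    # no vocabulary has to be stored, but different tokens may collide
    vec = [0] * n_features
    for tok in tokens:
        index = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[index] += 1
    return vec

tokens = 'the sun is shining the weather is sweet'.split()
vec = hashed_counts(tokens)
print(vec)
```

Because the column index depends only on the token itself, documents can be vectorized one mini-batch at a time without a fitting step; a larger `n_features` lowers the chance that two different tokens share a column.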

In [68]:
import pyprind # estimate the progress of our learning algorithm

pbar = pyprind.ProgBar(45) # initialize the progress bar object with 45 iterations

classes = np.array([0,1])

for _ in range(45): #iterate over 45 mini-batches of documents
    X_train, y_train = get_minibatch(doc_stream,size =1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:17


In [69]:
X_test, y_test = get_minibatch(doc_stream, size = 5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.874
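
To illustrate what incremental fitting does conceptually, here is a minimal NumPy sketch of logistic regression trained by stochastic gradient descent, one sample at a time, over a sequence of mini-batches. It is not scikit-learn's implementation: the class `TinySGDLogReg`, the fixed learning rate `eta`, and the synthetic two-feature data are all assumptions made for this example.

```python
import numpy as np

class TinySGDLogReg:
    """Hypothetical minimal logistic regression trained by SGD."""

    def __init__(self, n_features, eta=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.eta = eta

    def partial_fit(self, X, y):
        # one pass over the mini-batch, updating the weights per sample
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-(xi @ self.w + self.b)))
            error = yi - p
            self.w += self.eta * error * xi
            self.b += self.eta * error
        return self

    def predict(self, X):
        return (X @ self.w + self.b >= 0.0).astype(int)

# synthetic, linearly separable data standing in for the hashed documents
rng = np.random.RandomState(1)
X_demo = rng.randn(1200, 2)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = TinySGDLogReg(n_features=2)
# feed the first 1,000 samples in mini-batches of 100
for start in range(0, 1000, 100):
    model.partial_fit(X_demo[start:start + 100], y_demo[start:start + 100])

# evaluate on the 200 samples the model has not seen
accuracy = np.mean(model.predict(X_demo[1000:]) == y_demo[1000:])
print('Accuracy: %.3f' % accuracy)
```

The model state lives entirely in `w` and `b`, so only the current mini-batch has to be held in memory at any time.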


In [73]:
#use the last 5,000 docs to update our model

clf = clf.partial_fit(X_test,y_test)

# word2vec, a modern alternative to the bag-of-words model
# unsupervised learning algorithm based on neural networks
# that attempts to automatically learn the relationship between words
# idea: put words that have similar meanings into similar clusters,
# and via clever vector-spacing, the model can reproduce certain word relations
# using simple vector arithmetic (e.g. king - man + woman ≈ queen)

In [74]:
# Topic modelling:
# describe the broad task of assigning topics to unlabelled text documents
# categorization, topic-assigning, clustering
# unsupervised learning

# Decomposing text documents with Latent Dirichlet Allocation (LDA)
# NOTE: Latent Dirichlet Allocation (LDA), NOT Linear Discriminant Analysis (LDA)
# LDA is a generative probabilistic model that tries to find groups of words
# that appear frequently together across different documents
# These frequently co-occurring words represent our topics,
# assuming each document is a mixture of different words

# input: bag-of-words matrix
# LDA decomposes the bag-of-words matrix into two new matrices:
# 1) a document-to-topic matrix, 2) a word-to-topic matrix
# if we multiply those 2 matrices together, we would be able to reproduce the input,
# the bag-of-words matrix, with the lowest possible error
# in practice, we are interested in the topics that LDA found in the bag-of-words matrix
# downside: the number of topics is a hyperparameter of LDA
# that has to be specified manually beforehand


# Latent Dirichlet Allocation with scikit-learn

# Decompose the movie review dataset and categorize it into different topics
# restrict the analysis to 10 different topics

# First: load the dataset into a pandas DataFrame using the local movie_data.csv

import pandas as pd
df = pd.read_csv('movie_data.csv',encoding='utf-8')
df.head(3)

                                              review  sentiment
0  This tearful movie about a sister and her batt...          1
1  It's too kind to call this a "fictionalized" a...          0
2  Truly bad and easily the worst episode I have ...          0


In [75]:
# Use the CountVectorizer to create the bag-of-words matrix as input to the LDA
# we use scikit-learn's built-in English stop word library via stop_words = 'english'

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words = 'english',
                        max_df = .1, #maximum document frequency, exclude words that occur too frequently across documents (but set arbitrarily)
                        max_features = 5000) #most frequently occurring 5,000 words only (but set arbitrarily), limit the dimensionality
X = count.fit_transform(df['review'].values)

In [76]:
# fit a LatentDirichletAllocation estimator to the bag-of-words matrix
# and infer the 10 different topics from the documents

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components = 10,
                                random_state = 123,
                                learning_method = 'batch') # do its estimation based on all available training data (the bag-of-words matrix) in one iteration
# slower than `online` (analogous to online or mini-batch learning) but more accurate results

X_topics = lda.fit_transform(X)



In [78]:
lda.components_.shape

(10, 5000)
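
The shape above is the topic-by-word side of the decomposition: one row per topic, one column per vocabulary entry. The following NumPy sketch shows how the shapes of the two matrices fit together. The sizes and the random Dirichlet draws are made up for illustration; note also that scikit-learn's `components_` holds unnormalized weights, whereas the rows here are normalized to sum to 1.

```python
import numpy as np

rng = np.random.RandomState(123)
n_docs, n_topics, n_words = 6, 3, 8   # made-up sizes for illustration

# document-to-topic matrix: each row is a mixture over topics (sums to 1)
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)
# topic-to-word matrix: each row is a distribution over words (sums to 1)
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics)

# multiplying the two gives a matrix with the shape of the bag-of-words input
doc_word = doc_topic @ topic_word

print(doc_topic.shape, topic_word.shape, doc_word.shape)
print(doc_word.sum(axis=1))
```

In the movie review example the corresponding shapes would be (50000, 10) for `X_topics` and (10, 5000) for `lda.components_`.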

In [80]:
n_top_words = 5
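# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0
feature_names = count.get_feature_names_out()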

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx +1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort() [:-n_top_words-1:-1]]))
    
    

Topic 1:
worst minutes script awful stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art feel
Topic 5:
police guy car dead murder
Topic 6:
horror house sex blood gore
Topic 7:
role performance comedy actor performances
Topic 8:
series episode episodes war tv
Topic 9:
book version original effects fi
Topic 10:
action fight guy kids fun


Based on reading the 5 most important words for each topic, we may guess that the LDA identified the following topics:
    
1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

To confirm that the categories make sense based on the reviews, let's print the first 300 characters of the top 3 reviews from the horror movie category (category 6 at index position 5):

In [82]:
horror = X_topics[:,5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx +1))
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...
