[Sebastian Raschka](http://sebastianraschka.com), 2015

https://github.com/rasbt/python-machine-learning-book

# Python Machine Learning - Code Examples

# Chapter 8 - Applying Machine Learning To Sentiment Analysis

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

In [1]:
# %load_ext watermark
# %watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn,nltk

In [2]:
# to install watermark just uncomment the following line:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

<br>
<br>

### Overview

- [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)
- [Introducing the bag-of-words model](#Introducing-the-bag-of-words-model)
  - [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)
  - [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)
  - [Cleaning text data](#Cleaning-text-data)
  - [Processing documents into tokens](#Processing-documents-into-tokens)
- [Training a logistic regression model for document classification](#Training-a-logistic-regression-model-for-document-classification)
- [Working with bigger data – online algorithms and out-of-core learning](#Working-with-bigger-data-–-online-algorithms-and-out-of-core-learning)
- [Summary](#Summary)

<br>
<br>

# Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute

`tar -zxf aclImdb_v1.tar.gz`

B) If you are working with Windows, download an archiver such as [7Zip](http://www.7-zip.org) to extract the files from the download archive.
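
As a platform-independent alternative, Python's standard-library `tarfile` module can also do the extraction. The following is a minimal sketch, not part of the original workflow: to stay runnable without the download, it first builds a tiny stand-in archive (`demo.tar.gz` and the helper name `extract_archive` are hypothetical) and then extracts it. For the real dataset you would pass the path of `aclImdb_v1.tar.gz` instead.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(path=target_dir)


with tempfile.TemporaryDirectory() as tmp:
    # build a tiny stand-in archive so the example runs without the download
    src = os.path.join(tmp, 'review.txt')
    with open(src, 'w', encoding='utf-8') as f:
        f.write('A great movie.')
    archive = os.path.join(tmp, 'demo.tar.gz')  # hypothetical file name
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(src, arcname='demo/pos/review.txt')

    # extract it again and list what came out
    out_dir = os.path.join(tmp, 'extracted')
    extract_archive(archive, out_dir)
    extracted_files = sorted(os.listdir(os.path.join(out_dir, 'demo', 'pos')))

print(extracted_files)
```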

### Compatibility Note:

I received an email from a reader who was having trouble reading the movie review texts due to encoding issues. Typically, Python's default encoding is set to `'utf-8'`, which shouldn't cause trouble when running this IPython notebook. You can check the encoding on your machine by starting a new Python interpreter from the command line and executing

    >>> import sys
    >>> sys.getdefaultencoding()
    
If the returned result is **not** `'utf-8'`, you probably need to change your Python's encoding to `'utf-8'`, for example by typing `export PYTHONIOENCODING=utf8` in your terminal shell before launching this IPython notebook. (Note that this is a temporary change, and it needs to be executed in the same shell that you'll use to launch `ipython notebook`.)

Alternatively, you can replace the lines 

    with open(os.path.join(path, file), 'r') as infile:
    ...
    pd.read_csv('./movie_data.csv')
    ...
    df.to_csv('./movie_data.csv', index=False)

by 

    with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
    ...
    pd.read_csv('./movie_data.csv', encoding='utf-8')
    ...
    df.to_csv('./movie_data.csv', index=False, encoding='utf-8')
    
in the following cells to achieve the desired effect.
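
To see why the explicit `encoding='utf-8'` argument helps, here is a small sketch that is not tied to the movie data. It prints the interpreter's default string encoding and the locale encoding that `open()` typically falls back to when no `encoding` is given, then writes and re-reads a non-ASCII string with the encoding spelled out. The temporary file name `review.txt` is just a placeholder.

```python
import locale
import os
import sys
import tempfile

# encoding used for str <-> bytes conversions inside the interpreter
print(sys.getdefaultencoding())
# encoding that open() typically falls back to without an `encoding` argument
print(locale.getpreferredencoding(False))

sample = "Я його з'їв."  # non-ASCII text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'review.txt')  # placeholder file name
    # writing and reading with the same explicit encoding is platform-independent
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(sample)
    with open(path, 'r', encoding='utf-8') as infile:
        round_trip = infile.read()

print(round_trip == sample)
```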

In [3]:
import pandas as pd
import os

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = './aclImdb_short'

labels = {'pos':1, 'neg':0}
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # collect rows in a list; `DataFrame.append` was removed in pandas 2.0
            rows.append([txt, labels[l]])
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

Shuffling the DataFrame:

In [4]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
print("df = \n", df)

df = 
                                                review  sentiment
22  Bromwell High is nothing short of brilliant. E...          1
20  Bromwell High is a cartoon comedy. It ran at t...          1
25  I came in in the middle of this film so I had ...          1
4   This movie was sadly under-promoted but proved...          1
10  Once again Mr. Costner has dragged out a movie...          0
15  I wish I knew what to make of a movie like thi...          0
28  Very good drama although it appeared to have a...          1
11  This is a pale imitation of 'Officer and a Gen...          0
18  I'm not a big fan of musicals, although this t...          0
29  Working-class romantic drama from director Mar...          1
27  Although I didn't like Stanley & Iris tremendo...          1
35  I basically skimmed through the movie but just...          0
37  This is really a new low in entertainment. Eve...          0
2   My yardstick for measuring a movie's watch-abi...          1
39  This is one of

Optional: Saving the assembled data as a CSV file:

In [5]:
df.to_csv('./movie_data.csv', index=False)

In [6]:
import pandas as pd
df = pd.read_csv('./movie_data.csv')
df.head(3)

                                              review  sentiment
0  Bromwell High is nothing short of brilliant. E...          1
1  Bromwell High is a cartoon comedy. It ran at t...          1
2  I came in in the middle of this film so I had ...          1


<hr>
### Note

If you have problems creating the `movie_data.csv` file in the previous section, you can download a zip archive from
https://github.com/rasbt/python-machine-learning-book/tree/master/code/datasets/movie
<hr>

<br>
<br>

# Introducing the bag-of-words model

...

## Transforming words into feature vectors

In [7]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(ngram_range=(1,1))
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
print("bag = \n", bag)

bag = 
   (0, 2)	1
  (0, 1)	1
  (0, 3)	1
  (0, 5)	1
  (1, 4)	1
  (1, 6)	1
  (1, 1)	1
  (1, 5)	1
  (2, 0)	1
  (2, 4)	1
  (2, 6)	1
  (2, 2)	1
  (2, 1)	2
  (2, 3)	1
  (2, 5)	2


In [8]:
print(count.vocabulary_)

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}


In [9]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
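
To make the mapping from text to count vectors concrete, the following sketch rebuilds the same vocabulary and count matrix with the standard library only. It is an illustration of the idea rather than scikit-learn's implementation; the helper name `tokenize` is hypothetical, and its regular expression mirrors `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters).

```python
import re

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']


def tokenize(doc):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# vocabulary: every unique token gets a column index in alphabetical order
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# bag-of-words matrix: one row per document, one count per vocabulary entry
bag = []
for doc in docs:
    row = [0] * len(vocabulary)
    for tok in tokenize(doc):
        row[vocabulary[tok]] += 1
    bag.append(row)

print(vocabulary)
print(bag)
```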


<br>

## Assessing word relevancy via term frequency-inverse document frequency

In [10]:
np.set_printoptions(precision=2)

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.56 0.56 0.   0.43 0.  ]
 [0.   0.43 0.   0.   0.56 0.43 0.56]
 [0.4  0.48 0.31 0.31 0.31 0.48 0.31]]


In [12]:
tf_is = 2 
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1) )
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 2.00
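
The arithmetic in the cell above corresponds to the smoothed tf-idf variant selected by `smooth_idf=True`. Written out, with $n_d$ the total number of documents and $\text{df}(t)$ the number of documents that contain term $t$ (symbols chosen here for illustration):

$$
\text{idf}(t) = \ln\frac{1 + n_d}{1 + \text{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)
$$

For the term "is" in the third document, $\text{tf} = 2$, $\text{df} = 3$ and $n_d = 3$, so $\text{idf} = \ln(4/4) + 1 = 1$ and $\text{tf-idf} = 2 \cdot 1 = 2$.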


In [13]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf_all = tfidf.fit_transform(count.fit_transform(docs)).toarray()
raw_tfidf = raw_tfidf_all[-1]  # unnormalized tf-idf vector of the last document
print("raw_tfidf_all =\n", raw_tfidf_all)

raw_tfidf_all =
 [[0.   1.   1.29 1.29 0.   1.   0.  ]
 [0.   1.   0.   0.   1.29 1.   1.29]
 [1.69 2.   1.29 1.29 1.29 2.   1.29]]


In [14]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.4 , 0.48, 0.31, 0.31, 0.31, 0.48, 0.31])
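
The whole chain (document frequencies, smoothed idf, raw tf-idf, L2 normalization) can also be traced by hand. The sketch below uses only the standard library and hard-codes the count matrix from the bag-of-words example; it is a step-by-step illustration, not how `TfidfTransformer` is implemented internally.

```python
import math

# raw term counts per document, columns ordered as in count.vocabulary_:
# and, is, shining, sun, sweet, the, weather
counts = [[0, 1, 1, 1, 0, 1, 0],
          [0, 1, 0, 0, 1, 1, 1],
          [1, 2, 1, 1, 1, 2, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smoothed inverse document frequency, as with smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# raw tf-idf of the last document, then L2 normalization
raw = [tf * w for tf, w in zip(counts[-1], idf)]
norm = math.sqrt(sum(v * v for v in raw))
l2 = [round(v / norm, 2) for v in raw]

print([round(v, 2) for v in raw])  # [1.69, 2.0, 1.29, 1.29, 1.29, 2.0, 1.29]
print(l2)                          # [0.4, 0.48, 0.31, 0.31, 0.31, 0.48, 0.31]
```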

<br>

## Cleaning text data

In [15]:
df.loc[0, 'review'][-50:]


'maginable, then Bromwell High will not disappoint!'

In [16]:
import re
def preprocessor(text):
    # raw strings avoid invalid-escape warnings for `\)`, `\(` and `\W`
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
           ' '.join(emoticons).replace('-', '')
    return text

In [17]:
preprocessor(df.loc[0, 'review'][-50:])

'maginable then bromwell high will not disappoint '

In [18]:
preprocessor("</a>This :) is :( a test :-)!, Я його з'їв.")

'this is a test я його з їв :) :( :)'
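
To show what each regular expression contributes, the following sketch runs the same steps one at a time on a made-up review snippet and keeps the intermediate results. The variable names and the sample text are illustrative only.

```python
import re

text = '<br />Great movie :-) but the ending... :('  # made-up sample

# 1) strip HTML markup
no_html = re.sub(r'<[^>]*>', '', text)

# 2) collect emoticons before punctuation is thrown away
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# 3) lowercase and collapse every run of non-word characters into one space
words = re.sub(r'[\W]+', ' ', no_html.lower())

# 4) append the emoticons, with the "nose" character removed
cleaned = words + ' '.join(emoticons).replace('-', '')

print(no_html)    # Great movie :-) but the ending... :(
print(emoticons)  # [':-)', ':(']
print(cleaned)    # great movie but the ending :) :(
```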

In [19]:
print ("before preprocessing:\n", df.head())
df['review'] = df['review'].apply(preprocessor)
print ("\nafter preprocessing")
df.head()

before preprocessing:
                                               review  sentiment
0  Bromwell High is nothing short of brilliant. E...          1
1  Bromwell High is a cartoon comedy. It ran at t...          1
2  I came in in the middle of this film so I had ...          1
3  This movie was sadly under-promoted but proved...          1
4  Once again Mr. Costner has dragged out a movie...          0

after preprocessing


                                              review  sentiment
0  bromwell high is nothing short of brilliant ex...          1
1  bromwell high is a cartoon comedy it ran at th...          1
2  i came in in the middle of this film so i had ...          1
3  this movie was sadly under promoted but proved...          1
4  once again mr costner has dragged out a movie ...          0


<br>

## Processing documents into tokens

In [20]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [21]:
tokenizer('runners like running and thus they run')
ukrText = "</a>This :) is :( a test :-)!, Я його з'їв."
print (tokenizer(ukrText))

['</a>This', ':)', 'is', ':(', 'a', 'test', ':-)!,', 'Я', 'його', "з'їв."]


In [22]:
tokenizer_porter('runners like running and thus they run')
print (tokenizer_porter(ukrText))

['</a>thi', ':)', 'is', ':(', 'a', 'test', ':-)!,', 'Я', 'його', "з'їв."]


In [23]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\makov\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
print("len =", len(stop))
print("stop =\n", stop)

myPorter = tokenizer_porter('a runner likes running and runs a lot')
print ("myPorter =\n", myPorter)

lexemmas = []
for word in myPorter:
    if word not in stop:
        lexemmas.append(word)
print ("full cycle lexemmas =\n", lexemmas)
        
lexemmasMinus10 = [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]
print ("lexemmasMinus10 =", lexemmasMinus10)

print ("no -10 =")
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

len = 179
stop =
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same

['runner', 'like', 'run', 'run', 'lot']
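
The stop-word filtering step itself needs no NLTK. In the sketch below, the stemmed tokens are written out by hand to match the result above, and `tiny_stop` is a hypothetical miniature stand-in for NLTK's English stop-word list.

```python
# hypothetical miniature stop list (NLTK's English list is much longer)
tiny_stop = {'a', 'and', 'the', 'is'}

# Porter-stemmed tokens of 'a runner likes running and runs a lot',
# written out by hand for this illustration
stemmed = ['a', 'runner', 'like', 'run', 'and', 'run', 'a', 'lot']

# keep only the tokens that are not stop words
filtered = [w for w in stemmed if w not in tiny_stop]

print(filtered)  # ['runner', 'like', 'run', 'run', 'lot']
```

Using a `set` rather than a list for the stop words makes each membership test a constant-time lookup, which can matter when filtering many long reviews.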

<br>
<br>

# Training a logistic regression model for document classification

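HTML and punctuation were already stripped by the `preprocessor` above, which speeds up the grid search below. Next, split the DataFrame into training and test sets: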

In [25]:
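# Note: label-based `.loc` slices include both endpoints, so row 20 ends up
# in both the training and the test set; use `.iloc` for a disjoint split.
X_train = df.loc[:20, 'review'].values
y_train = df.loc[:20, 'sentiment'].values
X_test = df.loc[20:, 'review'].values
y_test = df.loc[20:, 'sentiment'].values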

In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, 
                        lowercase=False, 
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
             ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, 
                           scoring='accuracy',
                           cv=5, verbose=1,
                           n_jobs=-1)

In [27]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   17.3s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:   21.3s finished
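
The figures of 48 candidates and 240 fits follow directly from the parameter grid: each of the two dictionaries contributes the Cartesian product of its value lists, and every candidate is fitted once per cross-validation fold. The sketch below recounts this with the standard library; the strings stand in for the actual tokenizer functions and stop-word list, and `n_candidates` is a hypothetical helper.

```python
from itertools import product

# same shape as the notebook's param_grid; strings are placeholders
grid_a = {'vect__ngram_range': [(1, 1)],
          'vect__stop_words': ['stop', None],
          'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
          'clf__penalty': ['l1', 'l2'],
          'clf__C': [1.0, 10.0, 100.0]}
# the second grid adds two single-valued settings
grid_b = dict(grid_a, vect__use_idf=[False], vect__norm=[None])


def n_candidates(grid):
    # number of parameter combinations in one grid dictionary
    return len(list(product(*grid.values())))


n_folds = 5
total_candidates = n_candidates(grid_a) + n_candidates(grid_b)
total_fits = total_candidates * n_folds

print(total_candidates, total_fits)  # 48 240
```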


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...nalty='l2', random_state=0, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=T

In [28]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 1.0, 'clf__penalty': 'l1', 'vect__ngram_range': (1, 1), 'vect__norm': None, 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'a

In [29]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.400


# Working with bigger data – online algorithms and out-of-core learning

In [30]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    # raw strings avoid invalid-escape warnings for `\)`, `\(` and `\W`
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [31]:
next(stream_docs(path='./movie_data.csv'))

('"Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It\'s vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three ""protagonists"" for want of a better term, the show doesn\'t shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren\'t afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"',
 1)
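
The slicing in `stream_docs` relies on every line ending with a comma, a one-digit label, and a newline. The sketch below replays that logic on an in-memory stand-in for `movie_data.csv`; the two review texts are made up, and `stream_docs_from` is a hypothetical variant that takes an open file handle instead of a path.

```python
import io

# stand-in for movie_data.csv: header plus two made-up rows
fake_csv = io.StringIO(
    'review,sentiment\n'
    '"Great film, would watch again.",1\n'
    '"Dull and far too long.",0\n'
)


def stream_docs_from(handle):
    next(handle)  # skip header
    for line in handle:
        # each line ends with ',<label>\n': drop those three characters to
        # get the text, and read the label from the second-to-last character
        text, label = line[:-3], int(line[-2])
        yield text, label


parsed = list(stream_docs_from(fake_csv))
print(parsed)
```

Note that this approach assumes one review per line and a trailing newline after every label; a quoted field spanning several lines, or a final line without a newline, would be parsed incorrectly. Python's `csv` module is a more robust choice when that cannot be guaranteed.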

In [32]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
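
A quick way to see how `get_minibatch` behaves at the end of a stream is to feed it a small generator. In the sketch below, `toy_stream` is a hypothetical five-document stream; `get_minibatch` is repeated unchanged so the example is self-contained.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


def toy_stream():
    # hypothetical stream of five (text, label) pairs
    for i in range(5):
        yield 'review %d' % i, i % 2


stream = toy_stream()
first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)  # only one document left

print(first)   # (['review 0', 'review 1'], [0, 1])
print(second)  # (['review 2', 'review 3'], [0, 1])
print(third)   # (None, None)
```

The third call shows a consequence of the design: when the stream runs out in the middle of a batch, the documents already read for that batch are discarded and `(None, None)` is returned.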

In [33]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# note: scikit-learn >= 1.1 names this loss 'log_loss'; 'log' was removed in 1.3
clf = SGDClassifier(loss='log', random_state=1, max_iter=1, tol=None)
doc_stream = stream_docs(path='./movie_data.csv')
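
The reason `HashingVectorizer` suits out-of-core learning is that it needs no stored vocabulary: a hash function maps each token straight to a column index. The sketch below illustrates this hashing trick with the standard library. It is not scikit-learn's implementation: `HashingVectorizer` uses a 32-bit MurmurHash3 and, by default, additionally applies signs and normalization, whereas this sketch uses MD5 as a stand-in and keeps plain counts. The helper names are hypothetical.

```python
import hashlib


def hashed_index(token, n_features=2**21):
    # stable hash of the token, reduced to a column index
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'little') % n_features


def hash_vectorize(tokens, n_features=2**21):
    # sparse count vector as {column index: count}; no vocabulary is stored
    vec = {}
    for tok in tokens:
        idx = hashed_index(tok, n_features)
        vec[idx] = vec.get(idx, 0) + 1
    return vec


vec = hash_vectorize(['great', 'movie', 'great'])
print(vec)
```

Because the index depends only on the token, documents can be vectorized one minibatch at a time without a prior pass over the whole dataset. The price is that different tokens may collide in the same column, which a large `n_features` such as `2**21` makes rare.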

In [34]:
classes = np.array([0, 1])
for _ in range(3):
    X_train, y_train = get_minibatch(doc_stream, size=10)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
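
To illustrate what incremental fitting means, here is a deliberately simplified online logistic regression in plain Python. `TinyOnlineLogReg` is a hypothetical stand-in, not `SGDClassifier`: it uses a constant learning rate and no regularization, whereas `SGDClassifier` applies an L2 penalty and a learning-rate schedule by default. The toy feature dictionaries are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyOnlineLogReg:
    """Hypothetical minimal logistic regression trained by SGD."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}  # sparse weights: feature name -> weight
        self.bias = 0.0

    def predict_proba(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        return sigmoid(z)

    def partial_fit(self, X, y):
        # one gradient step on the logistic loss per sample; the state
        # carries over between calls, so batches can arrive one at a time
        for x, target in zip(X, y):
            error = target - self.predict_proba(x)
            for k, v in x.items():
                self.weights[k] = self.weights.get(k, 0.0) + self.lr * error * v
            self.bias += self.lr * error
        return self


# three identical toy minibatches: one positive and one negative sample each
batches = [([{'good': 1}, {'bad': 1}], [1, 0])] * 3

model = TinyOnlineLogReg()
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch)

p_good = model.predict_proba({'good': 1})
p_bad = model.predict_proba({'bad': 1})
print(p_good > 0.5, p_bad < 0.5)
```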



In [35]:
X_test, y_test = get_minibatch(doc_stream, size=10)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.500


In [36]:
clf = clf.partial_fit(X_test, y_test)

# Summary