In [1]:
'''
Applying Machine Learning to Sentiment Analysis

In this chapter, we will delve into a subfield of natural language processing (NLP) called sentiment analysis 
and learn how to use machine learning algorithms to classify documents based on their polarity: the attitude of the writer.
The topics that we will cover in the following sections include:
1. Cleaning and preparing text data
2. Building feature vectors from text documents
3. Training a machine learning model to classify positive and negative movie reviews
4. Working with large text datasets using out-of-core learning

We will be working with a large dataset of movie reviews from the Internet Movie Database (IMDb) that has been collected by 
Maas et al. (A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. 
In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 
pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics). 
The movie review dataset consists of 50,000 polar movie reviews that are labeled as either positive or negative; 
here, positive means that a movie was rated with more than six stars on IMDb, and negative means that a movie was rated with fewer than five
stars on IMDb. In the following sections, we will learn how to extract meaningful information from a subset of these movie reviews to build 
a machine learning model that can predict whether a certain reviewer liked or disliked a movie.
The movie review dataset (84.1 MB) can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ 
as a gzip-compressed tarball archive. Once it has been unpacked into ./aclImdb, the code below reads the individual review files into a single DataFrame:
'''
import os
import pandas as pd
import pyprind

pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0,
# and growing a DataFrame row by row is slow.
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            # The review files are assumed to be UTF-8 encoded.
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:09:35

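The loading loop above expects the unpacked archive in ./aclImdb. A minimal sketch of the unpacking step with Python's tarfile module could look as follows. The helper name extract_archive and the archive file name in the usage comment are assumptions, so adjust them to the file that was actually downloaded.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir='.'):
    """Unpack a gzip-compressed tarball into dest_dir; return the top-level names."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        # Record the top-level entries so the caller can check the folder name.
        top_level = sorted({m.name.split('/')[0] for m in tar.getmembers()})
        if hasattr(tarfile, 'data_filter'):
            # Newer Python versions provide a filter that rejects unsafe members.
            tar.extractall(dest_dir, filter='data')
        else:
            tar.extractall(dest_dir)
    return top_level


# Hypothetical usage (the file name is an assumption):
# extract_archive('aclImdb_v1.tar.gz')
```

The function only defines the step. Nothing is extracted until it is called with the path of a real archive.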

In [3]:
'''
Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the permutation function from the np.random
submodule. This will be useful for splitting the dataset into training and test sets in later sections, when we stream the data directly from our local drive. 
For our own convenience, we will also store the assembled and shuffled movie review dataset as a CSV file:
'''
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv', index=False)
df = pd.read_csv('./movie_data.csv')

Unnamed: 0,review,sentiment
0,"Election is a Chinese mob movie, or triads in ...",1
1,I was just watching a Forensic Files marathon ...,0
2,Police Story is a stunning series of set piece...,1
3,"Dear Readers,<br /><br />The final battle betw...",1
4,I have seen The Perfect Son about three times....,1
5,A brilliant portrait of a traitor (Victor McLa...,1
6,If ever a potential movie must've sounded like...,1
7,I'd always wanted David Duchovney to go into t...,1
8,Perhaps if only to laugh at the way my favorit...,0
9,"Even though the story is light, the movie flow...",1


# Cleaning text data

In [6]:
'''
As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. 
While HTML markup does not contain much useful semantics, punctuation marks can represent useful, 
additional information in certain NLP contexts. However, for simplicity, 
we will now remove all punctuation marks except for emoticon characters such as ":)", 
since those are certainly useful for sentiment analysis. 
To accomplish this task, we will use Python's regular expression (regex) library, re:
'''
import re
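def preprocessor(text):
    # Strip HTML markup. Raw strings keep the backslashes in the patterns intact.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect the emoticons before the punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space, then
    # append the emoticons (without the '-' nose) to the end of the document.
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
           ' '.join(emoticons).replace('-', '')
    return text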
df['review'] = df['review'].apply(preprocessor)

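To see what the cleaning step does, it can be applied to a short string. The sketch below repeats the preprocessor definition so that it runs on its own. The two sample strings are made up for illustration.

```python
import re


def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
           ' '.join(emoticons).replace('-', '')
    return text


# The markup and punctuation are removed; the emoticons move to the end.
cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)

# A document that ends in a word character gets the emoticons attached
# directly to its last word, because no separator is inserted.
glued = preprocessor('nice :) film')
print(glued)  # nice film:)
```

The second call shows an edge case worth knowing about: with a whitespace tokenizer, 'film:)' would be treated as a single token.
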
# Processing documents into tokens

In [13]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer('runners like running and thus they run')
tokenizer_porter('runners like running and thus they run')

'''
Before we jump into the next section, where we will train a machine learning model using the bag-of-words model, 
let us briefly talk about another useful topic called stop-word removal. 
Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information 
that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like.
Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs,
which are already downweighting frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library,
which can be obtained by calling the nltk.download function:
'''
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

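Stop-word removal itself is a simple filter over the token list. The following sketch uses a tiny hand-written stop-word set (an assumption for illustration; the NLTK list loaded above is much longer) and a hypothetical helper remove_stop_words, so that it runs without any downloads.

```python
# Illustrative stop-word set; the real list comes from NLTK.
stop_demo = {'a', 'and', 'has', 'is', 'the'}


def tokenizer(text):
    return text.split()


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word collection."""
    return [w for w in tokens if w not in stop_words]


tokens = tokenizer('a runner likes running and runs a lot')
filtered = remove_stop_words(tokens, stop_demo)
print(filtered)  # ['runner', 'likes', 'running', 'runs', 'lot']
```

With the NLTK list, the same filter would be written as [w for w in tokens if w not in stop].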

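The remark that tf-idfs already downweight frequent words can be checked on a toy corpus. The sketch below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the form scikit-learn documents for its default smooth_idf=True setting. The three documents are made up, and the L2 normalization that TfidfVectorizer applies by default is left out.

```python
import math
from collections import Counter

docs = [
    'the sun is shining',
    'the weather is sweet',
    'the sun is shining and the weather is sweet',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    # Raw term count times idf, without any normalization.
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}


weights = tfidf(tokenized[2])
for term, weight in sorted(weights.items()):
    print('%-8s %.2f' % (term, weight))
```

In the third document, 'is' occurs twice and 'and' once, but because 'is' appears in every document its idf is exactly 1, while the rarer 'and' gets a larger idf. The gap between their weights is therefore smaller than the gap between their raw counts.
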
# Training a logistic regression model for document classification

In [None]:
'''
In this section, we will train a logistic regression model to classify the movie reviews into positive and negative reviews.
First, we will divide the DataFrame of cleaned text documents into 25,000 documents for training and 25,000 documents for testing:
'''
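# .loc slices by label and includes the end point, so df.loc[:25000] would return
# 25,001 rows and share row 25000 with the test set. Slice the arrays by position instead.
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]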
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [
    # tf-idf features (TfidfVectorizer defaults)
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words': [stop, None],
     'vect__tokenizer': [tokenizer, tokenizer_porter],
     'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]},
    # raw term frequencies: idf weighting and normalization switched off
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words': [stop, None],
     'vect__tokenizer': [tokenizer, tokenizer_porter],
     'vect__use_idf': [False], 'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]},
]
# The liblinear solver supports both the 'l1' and 'l2' penalties searched above.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

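Before starting the grid search, it helps to estimate how many models it will train. The sketch below counts the parameter combinations in a grid shaped like the one above. String placeholders stand in for the real tokenizers and stop-word list, and count_candidates is a hypothetical helper, not part of scikit-learn.

```python
# Placeholders replace the tokenizer functions and the stop-word list.
param_grid_demo = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]


def count_candidates(grid):
    """Sum, over the sub-grids, the product of the option counts."""
    total = 0
    for sub_grid in grid:
        n = 1
        for values in sub_grid.values():
            n *= len(values)
        total += n
    return total


n_candidates = count_candidates(param_grid_demo)
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 48 240
```

With 5-fold cross-validation, the 48 candidates lead to 240 fits on text features, plus one final refit of the best candidate on the full training set. This is why the search can take a long time on 25,000 reviews.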