# Obtaining the IMDb movie review dataset

In [1]:
#Before reading data from files, make sure that the directory is set up correctly
import os
os.getcwd()

'/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08'

In [6]:
#change path to point to data folder
path = r'/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data/'
os.chdir(path)
os.getcwd()

'/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data'

In [7]:
#check contents of current working directory
os.listdir(path)

['aclImdb', 'aclImdb_v1.tar.gz']

In [10]:
#read contents of all text files into a dataframe
import os
import pandas as pd

labels = {'pos': 1, 'neg': 0}
rows = []
for cur_folder in ('test', 'train'):
    for cur_lab in ('pos', 'neg'):
        path = os.path.join(os.getcwd(), 'aclImdb', cur_folder, cur_lab)
        print(path) #show which folder is being read
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            #collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[cur_lab]])
df = pd.DataFrame(rows, columns = ['review', 'sentiment'])

/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data/aclImdb/test/pos
/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data/aclImdb/test/neg
/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data/aclImdb/train/pos
/Users/Jaan/Documents/gitHubCode/PythonMachineLearningBook/Chapter_08/data/aclImdb/train/neg


In [12]:
#the files were read class by class, so shuffle the rows (fixed seed) before writing them to a csv file
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
path = os.getcwd()
df.to_csv(os.path.join(path, 'movie_data.csv'), index = False)

In [13]:
#make sure that the file we created exists in the directory
os.listdir(os.getcwd())

['aclImdb', 'aclImdb_v1.tar.gz', 'movie_data.csv']

In [14]:
#read some contents of the file to ensure it's correct
df = pd.read_csv('movie_data.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


# Introducing the bag-of-words model

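The bag-of-words model allows us to represent text as numerical feature vectors. We build a vocabulary of the unique tokens found across all documents, then describe each document by how often each vocabulary word occurs in it.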
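As a rough illustration of the idea, the following sketch builds a bag-of-words representation by hand for three short documents. It uses only the standard library and a naive whitespace tokenizer, so it is a simplified stand-in and not how scikit-learn implements it; the variable names are chosen for this example only.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Naive tokenization: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary maps each unique word to a column index (alphabetical order).
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Each document becomes a vector of raw word counts, one entry per vocabulary word.
bag = []
for tokens in tokenized:
    counts = Counter(tokens)
    bag.append([counts[word] for word in unique_words])

print(vocabulary)
for row in bag:
    print(row)
```

Each row has one entry per vocabulary word, so documents of different lengths end up as vectors of the same size.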

### Transforming words into feature vectors

The CountVectorizer class in scikit-learn builds a bag-of-words model from raw text. By default it uses a 1-gram (unigram) representation, in which every token is a single word. Its ngram_range parameter selects other n-gram models: for example, initializing a new CountVectorizer instance with ngram_range=(2,2) switches to a 2-gram representation, in which every token is a pair of adjacent words.
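To make the difference between 1-grams and 2-grams concrete, here is a small plain-Python sketch. The helper `word_ngrams` is hypothetical and written for this illustration; it only mimics the idea with whitespace tokenization and is not CountVectorizer's actual implementation.

```python
def word_ngrams(text, n):
    """Return the contiguous n-word sequences of a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'The sun is shining'
print(word_ngrams(sentence, 1))  # 1-grams: single words
print(word_ngrams(sentence, 2))  # 2-grams: pairs of adjacent words
```

A text with fewer than n words yields no n-grams at all, which is one reason larger n values produce sparser features on short documents.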

In [19]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['The sun is shining', 
                'The weather is sweet', 
                'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
print(count.vocabulary_)

{u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2}
None


In [17]:
#print the raw term frequencies: the number of times each term t occurs in each document d
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


### Assessing word relevancy via term frequency-inverse document frequency

Term frequency-inverse document frequency (tf-idf) can be used to downweight words that occur frequently in documents across different class labels and therefore provide little discriminatory information. The tf-idf is defined as the product of the term frequency and the inverse document frequency.
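The product can be worked through by hand on the count matrix printed above. The sketch below assumes the smoothed variant that, to my understanding, scikit-learn's TfidfTransformer uses with its default settings: idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, followed by scaling each row to unit L2 length. Other definitions of idf exist, so treat the formula as an assumption.

```python
import math

# Raw term counts of the three example documents
# (columns: and, is, shining, sun, sweet, the, weather).
counts = [
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Multiply term frequencies by idf, then scale the row to unit L2 length."""
    raw = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 2) for v in row])
```

In the third document, "is" has the highest raw count but appears in every document, so its idf is the smallest possible and its weight stays close to that of the rarer word "and".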

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_idf = TfidfTransformer()
np.set_printoptions(precision = 2)
print (tf_idf.fit_transform(bag).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]


### Cleaning text data

The reviews contain HTML markup and punctuation, which we strip with regular expressions. For an introduction to regular expressions, see the tutorial on the Google Developers portal at https://developers.google.com/edu/python/regular-expressions or the official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.

In [28]:
#explore first few lines of data
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [33]:
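#remove HTML markup and all punctuation using regular expressions, but keep emoticons
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) #strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) #collect emoticons such as :) or :-(
    #lowercase, replace non-word characters with spaces, then append the emoticons without the '-' nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ''.join(emoticons).replace('-', '')
    return text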
print(preprocessor(df.loc[0, 'review'][-50:]))
print(preprocessor("</a>This :) is :( a test :-)!"))

is seven title brazil not available
this is a test :):(:)


In [34]:
#clean all the reviews of our movie review dataset
df['review'] = df['review'].apply(preprocessor)

### Processing documents into tokens

One way to tokenize documents is to split them into individual words at the whitespace characters of the cleaned document. Another useful technique in the context of tokenization is word stemming: the process of transforming a word into its root form, which allows us to map related words to the same stem.

In [35]:
#split a sentence into words 
def tokenizer(text):
    return text.split()
print (tokenizer('runners like running and hence, they run'))

['runners', 'like', 'running', 'and', 'hence,', 'they', 'run']


In [37]:
#stem words using Porter stemmer
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenzier_porter(sentence):
    return [porter.stem(word) for word in sentence.split()]
print(tokenzier_porter('runners like running and hence, they run'))


[u'runner', u'like', u'run', u'and', u'hence,', u'they', u'run']


A technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words, the so-called lemmas. However, lemmatization is computationally more expensive than stemming, and in practice stemming and lemmatization have been observed to have little impact on the performance of text classification.

In [38]:
#download the NLTK stop-word list, used below to remove stop words
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Jaan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [40]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenzier_porter('a runner likes running and runs a lot') if w not in stop]

[u'runner', u'like', u'run', u'run', u'lot']

# Training a logistic regression model for document classification

In [41]:
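#create training and test sets from original dataset
#iloc slices exclude the end position, so row 25000 lands only in the test set
#(label-based df.loc[:25000] includes it and would put that row in both sets)
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values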

In [49]:
#find optimal set of params using CV
from sklearn.model_selection import GridSearchCV #sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

#create the tf-idf vectorizer; the reviews are already cleaned, so lowercasing and preprocessing are switched off
tf_idf = TfidfVectorizer(strip_accents = None, lowercase = False, preprocessor = None)

#create a set of parameters 
param_grid = [
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words':[stop, None], 
    'vect__tokenizer':[tokenizer, tokenzier_porter], 'clf__penalty':['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0]}, 
    
    {'vect__ngram_range': [(1, 1)], 'vect__stop_words':[stop, None], 
    'vect__tokenizer':[tokenizer, tokenzier_porter], 'vect__use_idf': [False],
    'vect__norm': [None], 'clf__penalty':['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}, 
]

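#create pipeline to compute tfidf of reviews and then fit LogisticRegression
#solver = 'liblinear' supports both the l1 and l2 penalties listed in param_grid
lr_tfidf = Pipeline([
        ('vect', tf_idf), 
        ('clf', LogisticRegression(random_state = 0, solver = 'liblinear'))
    ])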

#run crossvalidation on the pipeline with various options set in param_grid
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring = 'accuracy', 
                          cv = 5, verbose = 1, n_jobs = -1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 43.8min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 56.9min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=Tru...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__tokenizer': [<function tokenizer at 0x106324e60>, <function tokenzier_porter at 0x10ee69d70>], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'y...x10ee69d70>], 'vect__use_idf': [False], 'clf__C': [1.0, 10.0, 100.0], 'clf__penalty': ['l1', 'l2']}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1

In [50]:
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)
print('Best CV accuracy: %.3f' % gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test accuracy: %.3f' % clf.score(X_test, y_test))

Best parameter set: {'vect__ngram_range': (1, 1), 'vect__tokenizer': <function tokenizer at 0x106324e60>, 'clf__penalty': 'l2', 'clf__C': 10.0, 'vect__stop_words': None}
Best CV accuracy: 0.897
Test accuracy: 0.898


# Working with bigger data – online algorithms and out-of-core learning

In [62]:
import os
import re
import numpy as np
from nltk.corpus import stopwords
stop = stopwords.words('english')

#tokenize a document: strip HTML markup and punctuation, keep emoticons, drop stop words
def tokenizer(sentence):
    sentence = re.sub(r'<[^>]*>', '', sentence)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', sentence.lower())
    sentence = re.sub(r'[\W]', ' ', sentence.lower()) + ''.join(emoticons).replace('-', '')
    tokenized = [w for w in sentence.split() if w not in stop]
    return tokenized

#generator that reads and yields one document at a time
def stream_docs(path):
    with open(path, 'r') as file:
        next(file) #skip header
        for line in file:
            #each line ends with ',<label>\n', so the label is the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label

#return the next `size` documents from the stream, or (None, None) once the stream runs out
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration: 
        return None, None
    return docs, y
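To see how the two helpers interact without touching the 50,000-review file, the sketch below runs the same logic on a tiny in-memory CSV. It is an adaptation for illustration: here `stream_docs` takes any iterable of lines instead of a file path, and `fake_csv` is a made-up stand-in for movie_data.csv.

```python
import io

def stream_docs(lines):
    """Yield (text, label) pairs one at a time from an iterable of CSV lines."""
    lines = iter(lines)
    next(lines)  # skip the header row
    for line in lines:
        # Each line ends with ',<label>\n', so the label is the second-to-last character.
        yield line[:-3], int(line[-2])

def get_minibatch(doc_stream, size):
    """Collect `size` documents; return (None, None) if the stream ends first."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stand-in for movie_data.csv with five tiny reviews.
fake_csv = io.StringIO(
    'review,sentiment\n'
    'great movie,1\n'
    'terrible plot,0\n'
    'loved it,1\n'
    'boring,0\n'
    'a classic,1\n'
)

doc_stream = stream_docs(fake_csv)
print(get_minibatch(doc_stream, size=2))
print(get_minibatch(doc_stream, size=2))
print(get_minibatch(doc_stream, size=2))
```

The third call returns (None, None) because only one review is left: a batch that cannot be filled completely is discarded rather than returned short. Only one minibatch is held in memory at a time, which is what makes the approach suitable for data that does not fit in memory.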

In [64]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error = 'ignore', n_features = 2 ** 21, 
                        preprocessor = None, tokenizer = tokenizer)
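#loss = 'log_loss' selects logistic regression (it was named 'log' before scikit-learn 1.1);
#the n_iter argument has been removed and is not needed here, since each partial_fit call makes one pass over its minibatch
clf = SGDClassifier(loss = 'log_loss', random_state = 1)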
doc_stream = stream_docs(path = os.getcwd()+'/movie_data.csv')
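HashingVectorizer needs no fitting step because it maps each token to a column index with a hash function instead of a learned vocabulary. The following sketch shows that idea in plain Python. It is a simplified illustration under stated assumptions: `hashed_features` is a hypothetical helper, md5 is used only as a convenient stable hash and is not the hash function scikit-learn uses, and the vector is kept tiny so it can be printed.

```python
import hashlib

def hashed_features(tokens, n_features=2 ** 4):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # md5 is a stand-in for a stable hash; it is NOT what scikit-learn uses.
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

vec = hashed_features('the sun is shining and the weather is sweet'.split())
print(vec)
```

Different tokens can collide in the same column when the vector is this small. Choosing a large number of features, such as the 2 ** 21 used above, makes collisions much less likely.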

In [65]:
#train the model
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size = 1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)

In [66]:
#test the model
X_test, y_test = get_minibatch(doc_stream, size = 5000)
X_test = vect.transform(X_test)
print ('Accuracy: %.3f' %clf.score(X_test, y_test))

Accuracy: 0.868


A more modern alternative to the bag-of-words model is word2vec, an algorithm that Google released in 2013 (T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationships between words. The idea behind word2vec is to place words with similar meanings close together in a vector space, so that the model can reproduce certain words using simple vector arithmetic, for example, king - man + woman = queen.
The original C-implementation, with useful links to the relevant papers and alternative implementations, can be found at https://code.google.com/p/word2vec/.
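The vector arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-made for this sketch (loosely: royalty, male, female) and are not learned by word2vec; real embeddings have hundreds of dimensions and are learned from a large corpus. The `analogy` helper is likewise hypothetical.

```python
import math

# Hand-made toy vectors (hypothetical, NOT learned by word2vec).
# Dimensions loosely mean: royalty, male, female.
vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy('king', 'man', 'woman'))
```

With these toy vectors the nearest remaining word to king - man + woman is "queen", which mirrors the behaviour described for learned embeddings.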