# Chapter 8 - Applying Machine Learning To Sentiment Analysis
<center><img src="./images/sentiment.png" alt="Sentiment analysis" style="width: 400px;"/></center>

### Overview

- [Preparing the IMDb movie review data for text processing](#Preparing-the-IMDb-movie-review-data-for-text-processing)
  - [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)
  - [Preprocessing the movie dataset into more convenient format](#Preprocessing-the-movie-dataset-into-more-convenient-format)
- [Introducing the bag-of-words model](#Introducing-the-bag-of-words-model)
  - [Transforming documents into feature vectors](#Transforming-documents-into-feature-vectors)
  - [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)
  - [Cleaning text data](#Cleaning-text-data)
  - [Processing documents into tokens](#Processing-documents-into-tokens)
- [Training a logistic regression model for document classification](#Training-a-logistic-regression-model-for-document-classification)


## Sentiment analysis
- A subfield of Natural Language Processing (NLP)
- Classify documents based on their polarity
    - the attitude of the writer
    - sometimes called "opinion mining"
- Example from the Internet Movie Database (IMDb)
    - 50000 movie reviews
    - Predictor for negative (<= 4 stars) and positive (>= 7 stars) reviews, out of 10 stars
- Similar examples with discussion fora
    - Predictor for ideas and non-ideas (Lego, beer brewing, ...)

## Topics
- Cleaning and preparing text data
- Building feature vectors from text documents
- Training a machine learning model to classify positive and negative movie reviews
- Working with large text datasets using out-of-core learning
- Inferring topics from document collections for categorization

# Preparing the IMDb movie review data for text processing 

## Obtaining the IMDb movie review dataset

The IMDb movie review set can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

0) Use the code in the following cells to retrieve and extract the dataset automatically.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute

`tar -zxf aclImdb_v1.tar.gz`

B) If you are working with Windows, download an archiver such as [7Zip](http://www.7-zip.org) to extract the files from the download archive.

**Optional code to download and unzip the dataset via Python:**

In [1]:
import os
import sys
import tarfile
import time


source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    if duration == 0:
        duration = 10**-3
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size
    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()

In [2]:
# This download takes around 5-60 seconds at NMBU
if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    
    if (sys.version_info < (3, 0)):
        import urllib
        urllib.urlretrieve(source, target, reporthook)
    
    else:
        import urllib.request
        urllib.request.urlretrieve(source, target, reporthook)

100% | 80 MB | 1.55 MB/s | 51 sec elapsed

In [3]:
# The extraction can take several minutes as all 50,000 reviews are stored as separate text files
# (101,111 files). 
# Extracting to a synced folder (Dropbox, Google Drive, OneDrive, ...) may slow the process further.
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

## Preprocessing the movie dataset into more convenient format
Read all review files and collect them sequentially into a pandas DataFrame.

In [6]:
import pyprind       # pip install pyprind, if you haven't used it before
import pandas as pd
import os

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            # DataFrame.append was removed in pandas 2.0,
            # so collect the rows in a list and build the frame once
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:06:48


### Shuffling the DataFrame
- The data were read systematically: test, train; pos, neg.
- Shuffling before storage means we can stream the data from file and obtain a random flow of reviews

In [7]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

### Save the assembled data as CSV file
We will later be streaming from this file

In [8]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [9]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [10]:
df.shape

(50000, 2)
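
Because the shuffled reviews are stored row by row, they can later be read one at a time instead of loading the whole file. The following is a minimal sketch of such a stream, assuming the two-column layout `review,sentiment` written above; the helper name `stream_docs` and the toy CSV content are illustrative and not part of the original notebook.

```python
import csv
import io


def stream_docs(handle):
    """Yield (review, sentiment) pairs one row at a time from a CSV file handle."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row: review,sentiment
    for text, label in reader:
        yield text, int(label)


# Toy stand-in for movie_data.csv (hypothetical content, same column layout)
toy_csv = io.StringIO(
    'review,sentiment\n'
    '"Great film, loved it",1\n'
    '"Dull and far too long",0\n'
)
stream = list(stream_docs(toy_csv))
print(stream)
```

With the real file, the handle would come from `open('movie_data.csv', 'r', encoding='utf-8', newline='')`.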

<hr>
### Note

If you have problems creating the `movie_data.csv` file, you can download a zip archive from 
https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch08/
<hr>

# Introducing the bag-of-words model

- Represent documents as counts of words
- Vocabulary across all documents
- Sparse representation
    - Only part of the vocabulary used in each text
- Many ways to implement this:
    - Potential for crazy overhead
    - Here: scikit-learn CountVectorizer

## Transforming documents into feature vectors

By calling the fit_transform method on CountVectorizer, we will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [2]:
# Vocabulary with ordering (as dictionary)
print(count.vocabulary_)
print(sorted(count.vocabulary_))

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']


### Bag of words
Each document is represented as an array of word counts, which serves as its feature vector. The values in the feature vectors are also called the raw term frequencies: *tf(t, d)*, the number of times a term *t* occurs in a document *d*.

In [14]:
print(type(bag))
print(bag.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(3, 9)


In [15]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


In [16]:
print(sorted(count.vocabulary_))

['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']
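
To make the counting step concrete, the matrix above can be reproduced by hand. This is a sketch only: it assumes lower-casing and tokens of two or more word characters, which mirrors the default settings of `CountVectorizer`, and the names `tokenize` and `bag_by_hand` are illustrative.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def tokenize(doc):
    # Lower-case, then keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# The vocabulary is the alphabetically sorted set of all tokens
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, one column per vocabulary term
bag_by_hand = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bag_by_hand.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bag_by_hand:
    print(row)
```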


### n-grams and K-mers
- Single word counts => 1-gram (what we did above)
- Counts of word sequences
    - 2-gram: Sentence: "DAT300 is great and I love it."
        - DAT300 is, is great, great and, and I, I love, love it
    - 3-gram: Sentence: "DAT300 is great and I love it."
        - DAT300 is great, is great and, great and I, and I love, I love it
- Spam filters showed good performance with 3-grams and 4-grams (in 2007)
- Parameter to the CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
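
The word n-grams listed above can be generated with a sliding window over the token list. A minimal sketch, assuming tokens are runs of word characters; the helper `word_ngrams` is illustrative and not a scikit-learn function.

```python
import re


def word_ngrams(sentence, n):
    # Split into word tokens, dropping punctuation
    words = re.findall(r'\w+', sentence)
    # Slide a window of n words over the token list
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = 'DAT300 is great and I love it.'
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```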


In [3]:
count2 = CountVectorizer(ngram_range=(2, 2))  # recent scikit-learn versions require a tuple
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag2 = count2.fit_transform(docs)

In [4]:
print(sorted(count2.vocabulary_))
print(bag2.toarray())

['and one', 'is shining', 'is sweet', 'is two', 'one and', 'one is', 'shining the', 'sun is', 'sweet and', 'the sun', 'the weather', 'weather is']
[[0 1 0 0 0 0 0 1 0 1 0 0]
 [0 0 1 0 0 0 0 0 0 0 1 1]
 [2 1 1 1 1 1 1 1 1 1 1 1]]


- For instance, the bigram 'and one' appears:

    - 0 times in the first sentence
    - 0 times in the second sentence
    - 2 times in the third sentence.

## Assessing word relevancy via term frequency-inverse document frequency

- Words occurring frequently in multiple documents from both/all classes carry little discriminatory information and should be downweighted
    - term frequency-inverse document frequency (tf-idf)

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

- *tf(t, d)*: term frequency
- *idf(t, d)*: inverse document frequency:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

- $n_d$ = total number of documents, *df(d, t)* is the number of documents *d* that contain the term *t*
- the 1 in the denominator is optional; it avoids division by zero for terms that occur in no document (without it, a term present in every document gets idf = log 1 = 0)
- log is used to ensure that low document frequencies are not given too much weight

In [19]:
# Transform tf to tf-idf:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)

np.set_printoptions(precision=2)
print(bag.toarray())
print(sorted(count.vocabulary_))
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray()) # Word in many documents => less variation in tf-idf

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']
[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


### scikit-learn implementation of tf-idf
- differs a bit from the textbook version:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

- tfs are often normalized before computing tf-idfs, while scikit-learn normalizes tf-idfs instead (L2 per document):
$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
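
As a check on the formulas in this subsection, the smoothed idf, the tf-idf product, and the L2 normalization can be computed by hand with NumPy for the three example sentences. This is a sketch using the natural logarithm and the count matrix printed earlier; the variable names are illustrative.

```python
import numpy as np

# Raw term counts for the three example sentences (columns in vocabulary order)
counts = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 1, 1, 0, 1],
                   [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=float)

n_d = counts.shape[0]                     # number of documents
df = (counts > 0).sum(axis=0)             # documents containing each term
idf = np.log((1 + n_d) / (1 + df))        # smoothed idf
tfidf_raw = counts * (idf + 1)            # tf * (idf + 1)

# L2-normalize each document (row) to unit length
tfidf_by_hand = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

print(np.round(tfidf_by_hand, 2))
```

The rounded rows agree with the `TfidfTransformer` output shown above, for example 0.5 for 'and' and 0.45 for 'is' in the third sentence.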

### Using scikit-learn implementation of tf-idf when "smooth_idf=False"
- *idf(t, d)*: inverse document frequency (this equation is used when the parameter `smooth_idf=False` is set in scikit-learn):

$$\text{idf}(t,d) = \log\frac{n_d}{\text{df}(d, t)} + 1,$$
$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

- $n_d$ = #documents, *df(d, t)* is the number of documents *d* that contain the term *t*
- The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored.

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
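
A short worked example of the unsmoothed idf, using document frequencies taken from the three example sentences ('and' occurs in one of them, 'sun' in two, 'is' in all three). This is a sketch with the natural logarithm; the dictionary names are illustrative.

```python
import math

n_d = 3  # number of documents
document_frequency = {'and': 1, 'sun': 2, 'is': 3}

# idf(t) = log(n_d / df(t)) + 1
idf_unsmoothed = {term: math.log(n_d / df) + 1
                  for term, df in document_frequency.items()}

for term, value in idf_unsmoothed.items():
    print(f'{term}: {value:.4f}')
```

The term 'is' appears in every document, so its logarithm is zero and the added 1 keeps its weight at exactly 1 instead of removing it.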


## Cleaning text data

In [90]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

- We want to remove HTML tags and punctuation (retaining smileys).
- This should be done before generating the bag-of-words
- Regular Expressions: https://medium.com/analytics-vidhya/regular-expression-in-python-5ab2e8b707f1

In [21]:
import re
def preprocessor(text):
    # Remove HTML tags: the pattern <[^>]*> matches anything that starts with <
    # and ends with >, and re.sub replaces each match with an empty string.
    text = re.sub(r'<[^>]*>', '', text)
    
    # Find the most typical emoticons (smileys), e.g. :), :-D and ;P.
    # Raw strings (r'...') avoid invalid-escape warnings in recent Python versions.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    
    # Replace all non-word characters with spaces, convert to lower-case and
    # append the emoticons (without the "nose" character '-') to the end.
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [91]:
# Effect of preprocessor on example
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [93]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [24]:
# Synthetic example:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
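
To see what each regular expression contributes, the three steps can be run separately on the synthetic example. This is a sketch that uses the same patterns as the `preprocessor` function; the intermediate variable names are illustrative.

```python
import re

text = "</a>This :) is :( a test :-)!"

# Step 1: strip HTML tags
no_tags = re.sub(r'<[^>]*>', '', text)

# Step 2: collect emoticons before punctuation is removed
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_tags)

# Step 3: lower-case and collapse every run of non-word characters to one space
words_only = re.sub(r'[\W]+', ' ', no_tags.lower())

# Step 4: append the emoticons, with the "nose" removed
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_tags))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```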

### Apply the preprocessor

In [25]:
# This takes a few seconds
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens
- Raw text can be converted to words in several ways
    - Basic: Splitting at blank spaces
- Often useful to remove variations of a word
    - Word stemming looks for the stem of a word
    - Porter stemmer (published 1979/80) still used a lot
    - Snowball stemmer (Porter2/English), Lancaster (Paice/Husk) are faster, but more aggressive
    - Part of the Natural Language Toolkit (conda install nltk / pip install nltk)

In [28]:
# !pip install nltk # first time
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

# Define basic tokenizer and Porter stemmer version
def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [29]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [30]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

### Stop-words, Stemming and Lemmatization
- Some words are so common that they are usually removed before analysis
    - is, and, has, like, ...
    - 127 such words in the NLTK library
    - tf-idfs are robust against stop words
    - Some types of language processing need the stop words too.
- Stemming is a technique in natural language processing and information retrieval where words are reduced to their base or root form, e.g., "running" -> "run".
- Lemmatization is a technique in natural language processing that involves reducing words to their base or dictionary form. Unlike stemming, which may produce a root form of a word that is not necessarily a valid word in the language, lemmatization ensures that the resulting word (called the lemma) is a valid word, e.g., "better" -> "good".
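
The difference between stemming and lemmatization is easiest to see in how crude a stemmer's rules are. The following is a deliberately naive suffix-stripping sketch; it is not the Porter algorithm, and the function `naive_stem` and its two rules are assumptions made purely for illustration.

```python
def naive_stem(word):
    """Strip a few common English suffixes using fixed rules (toy example)."""
    # Rule 1: drop 'ing' and undo a doubled final consonant (running -> runn -> run)
    if word.endswith('ing') and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    # Rule 2: drop a plural-looking 's' (runners -> runner), but keep 'ss'
    if word.endswith('s') and not word.endswith('ss') and len(word) > 3:
        return word[:-1]
    return word


stems = [naive_stem(w) for w in 'runners like running and thus they run'.split()]
print(stems)
```

The rule-based approach turns "thus" into the non-word "thu", which is the kind of result a lemmatizer avoids by looking words up in a dictionary.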


In [31]:
import nltk

# Update to most recent stop-words
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kristl\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [32]:
from nltk.corpus import stopwords

# Combine tokenizer with Porter stemmer and stop-word removal
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

# Training a logistic regression model for document classification
- We will train a logistic regression (LR) classifier on half of the reviews (25,000) and test on the remaining 25,000.
- Preprocessing of HTML was done earlier
    - including lower-case conversion and emoticon handling.
- Use GridSearchCV to test the effect of stemming, stop-words, L1/L2 regularization and the C-parameter

In [79]:
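# .loc slicing includes the end label, so stop at 24999 to get exactly
# 25,000 training rows and avoid sharing row 25000 with the test set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values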
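
When splitting by row number, it matters that label-based slicing with `.loc` includes the end label, whereas position-based slicing with `.iloc` excludes the end position. A small sketch on a toy frame (the data are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'review': list('abcdef'),
                    'sentiment': [1, 0, 1, 0, 1, 0]})

# Label-based: rows 0, 1, 2 AND 3
n_loc = len(toy.loc[:3])

# Position-based: rows 0, 1, 2 only
n_iloc = len(toy.iloc[:3])

# A clean half/half split without overlap
train, test = toy.iloc[:3], toy.iloc[3:]
overlap = set(train.index) & set(test.index)

print(n_loc, n_iloc, overlap)
```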

In [34]:
# NB! NB! NB!    
# This code worked in 2018.
# Now the stop words are handled differently, hence a new version below.


from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single class.
tfidf = TfidfVectorizer(strip_accents=None, # Already preprocessed
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None], # Not this time, but use idf with normalization
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None], # Not this time
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],       # Raw counts without normalization 
               'vect__norm':[None],           # --------------||----------------
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='saga'))])
# Solver specified to silence warning and to enable l1 regularization

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=1) # Number of jobs different from 1 sometimes crashes on Windows.

### Tokenizing of stop words
- In the newer scikit-learn (as of 2019.09.12), stop words need to be preprocessed before entering the TfidfVectorizer

In [35]:
# Run each stop word through the same tokenizer that the vectorizer will use
stops = [tokenizer(s)[0] for s in stop]
stopsPorter = [tokenizer_porter(s)[0] for s in stop]

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single class.
tfidf = TfidfVectorizer(strip_accents=None, # Already preprocessed
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stops, None], # Not this time, but use idf with normalization
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stops, None], # Not this time
               'vect__tokenizer': [tokenizer],
               'vect__use_idf':[False],       # Raw counts without normalization 
               'vect__norm':[None],           # --------------||----------------
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stopsPorter, None], # Not this time, but use idf with normalization
               'vect__tokenizer': [tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stopsPorter, None], # Not this time
               'vect__tokenizer': [tokenizer_porter],
               'vect__use_idf':[False],       # Raw counts without normalization 
               'vect__norm':[None],           # --------------||----------------
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='saga'))])
# Solver specified to silence warning and to enable l1 regularization

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1) # Number of jobs different from 1 sometimes crashes on Windows.

### Fitting
Fitting the 2*2*2*3*5*2 = 240 models took around 30-60 minutes in 2018. From 2019 it takes several hours. :(  
Lowering the number of samples or parameters will make it quicker, but may reduce the performance greatly.  
The NLTK toolkit is for educational purposes, giving transparent solutions but at the expense of speed!
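
The number of fits follows directly from the parameter grid: each dictionary contributes the product of its option counts, and every candidate is fitted once per cross-validation fold. A sketch of that arithmetic, where placeholder strings stand in for the tokenizer functions and stop-word lists (the name `grids` is illustrative):

```python
from math import prod

# Options shared by all four grid dictionaries
base = {'vect__ngram_range': [(1, 1)],
        'clf__penalty': ['l1', 'l2'],
        'clf__C': [1.0, 10.0, 100.0]}

grids = []
for tok, stop_list in (('tokenizer', 'stops'),
                       ('tokenizer_porter', 'stopsPorter')):
    # tf-idf with normalization
    tfidf_grid = dict(base, vect__stop_words=[stop_list, None],
                      vect__tokenizer=[tok])
    # raw counts without normalization
    raw_grid = dict(tfidf_grid, vect__use_idf=[False], vect__norm=[None])
    grids += [tfidf_grid, raw_grid]

n_candidates = sum(prod(len(values) for values in grid.values())
                   for grid in grids)
n_fits = n_candidates * 5  # cv=5 folds

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set.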

In [None]:
# Put the tea kettle on, take a warm bath, brush your teeth, go to sleep, check back next morning ...
#gs_lr_tfidf.fit(X_train, y_train)

In [38]:
# Pickle (store to disk) the Grid Search CV object
# import pickle
# with open('gs_lr_tfidf.pickle', 'wb') as f:
#    pickle.dump(gs_lr_tfidf, f, pickle.HIGHEST_PROTOCOL)

In [43]:
# To open an object that has been pickled, you need to import the object's dependencies and local functions
import pickle
with open('gs_lr_tfidf.pickle', 'rb') as f:
    gs_lr_tfidf = pickle.load(f)

In [40]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x0000026F45750E50>} 
CV Accuracy: 0.897


In [41]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899


### Other options
- Our choices of preprocessing, counting, etc. were not tested in all possible variations
- Logistic Regression was tested, but other classifiers are also worth trying
- Naïve Bayes is popular for text classification
    - Good performance on small datasets
    - Variants used for K-mer classifications of nucleotides
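
To illustrate why Naïve Bayes works with very little data, here is a minimal multinomial Naïve Bayes sketch in NumPy with Laplace smoothing (alpha = 1). The four-word vocabulary and the count matrix are made-up toy data, and the code is a simplified illustration rather than the scikit-learn implementation.

```python
import numpy as np

# Toy word counts; columns: good, bad, great, boring (hypothetical vocabulary)
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 2, 0, 1],
              [0, 1, 0, 2]])
y = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

alpha = 1.0                 # Laplace smoothing
classes = np.unique(y)

# Log prior: fraction of training documents per class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Smoothed per-class word probabilities
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
log_theta = np.log((word_counts + alpha) /
                   (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))


def predict(x):
    # Class score = log prior + sum over words of count * log probability
    scores = log_prior + log_theta @ x
    return classes[np.argmax(scores)]


pred_pos = predict(np.array([1, 0, 1, 0]))
pred_neg = predict(np.array([0, 1, 0, 1]))
print(pred_pos, pred_neg)
```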

In [46]:
# Naïve Bayes (no grid search this time)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
tfidfOpt = TfidfVectorizer(strip_accents=None, # Already preprocessed
                           lowercase=False,
                           preprocessor=None,
                           ngram_range=(1, 1), # must be a tuple in recent scikit-learn
                           stop_words=None,
                           tokenizer=tokenizer_porter,
                           max_features=1000)
nb_tfidf = Pipeline([('vect', tfidfOpt),
                    ('clf', MultinomialNB())])

In [47]:
nb_tfidf.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 TfidfVectorizer(lowercase=False, max_features=1000,
                                 ngram_range=[1, 1],
                                 tokenizer=<function tokenizer_porter at 0x0000026F45750DC0>)),
                ('clf', MultinomialNB())])

In [48]:
nb_tfidf.score(X_test, y_test)

0.83412

### word2vec
- Released by Google in 2013
- Unsupervised learning based on neural networks
- Attempts to automatically learn the relationship between words
    - Similar words in similar clusters
- Can reproduce certain words using vector math
    - Example: king - man + woman = queen
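
The vector arithmetic can be illustrated with hand-made two-dimensional vectors. These are not learned word2vec embeddings: the two axes (royalty, gender) and all values are assumptions chosen so that the analogy works out exactly.

```python
import numpy as np

# Hand-made toy vectors: [royalty, gender]; NOT trained embeddings
vectors = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
}


def cosine(a, b):
    # Cosine similarity: 1 means same direction, -1 means opposite
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = vectors['king'] - vectors['man'] + vectors['woman']

# Pick the word whose vector points in the most similar direction
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)
```

Real embeddings have hundreds of dimensions and the match is only approximate, so the query words are usually excluded from the candidates.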