## Part 3: Text Data

Apply the following text representation techniques (and any variations, such as stopword removal in BoW) on the Movies Review dataset (http://ai.stanford.edu/~amaas/data/sentiment/). Consider experimenting with the following: n-grams, stopword removal, punctuation removal, lemmatization, etc.
1. Bag of Words (BoW)
2. Term Frequency - Inverse Document Frequency (TF-IDF)
3. Feature hashing
4. Apply sentiment analysis (two-class text classification) machine learning algorithm(s) to compare performance of models across these text representation techniques. Apply algorithm(s) of your choice (use any Python library of your choice) and compare results (such as number of features, model performance)

## Business Problem


"Sentiment analysis, sometimes also called opinion mining, is a popular subdiscipline of the broader field of NLP; it is concerned with analyzing the polarity of documents. A popular task in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic. " - "Python Machine Learning"

Team X was hired to tackle another tough problem faced by an entertainment company (confidential) that wants to understand how movie reviews could affect the future performance of its movies and box office sales. Specifically, the company wants to know the overall sentiment of the movie reviews from its customer base.

The client provided Team X with a dataset of 50,000 polar movie reviews. Each review is labeled either positive (more than six stars on IMDb) or negative (fewer than five stars on IMDb).
Team X conducted extensive data cleaning and preparation and used various tools to extract meaningful information from these movie reviews. Eventually they built a machine learning model that can predict whether a certain reviewer liked or disliked a movie.

In the following sections, Team X will download the dataset, preprocess it into a useable format for machine learning tools, and extract meaningful information from a subset of these movie reviews to build a machine learning model that can predict whether a certain reviewer liked or disliked a movie.

Today, informed by the discussions and presentations at the recent Sentiment Analysis Symposium, let’s examine the business case for sentiment analysis in the movie industry.

##### What is the purpose of the sentiment analysis on the movie reviews?

1. Companies could use the results to forecast box office sales.

2. The movie company (for example, Disney or Netflix) could determine the overall sentiment of the reviews and so understand the aggregate picture: the voice of the market rather than just of individuals.

3. The company could explore relationships between sentiments and the characteristics of the people who expressed them, as well as trends over time and how opinions propagate through social networks.

#### Import Library

In [1]:
import pyprind
import pandas as pd
import os

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
import re

import matplotlib.pyplot as plt  # data visualization
import seaborn as sns # data visualization
%matplotlib inline

In [39]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/unoyiyi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/unoyiyi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/unoyiyi/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/unoyiyi/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### 1. Load Dataset 

In [2]:
# change the `basepath` to the directory of the
# unzipped movie dataset

#basepath = '/Users/Sebastian/Desktop/aclImdb/'
basepath = '/Users/unoyiyi/Desktop/MSBASpring/423/Final Project/aclImdb/'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:02


In [3]:
## shuffling the dataframe
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

In [4]:
##Optional: Saving the assembled data as CSV file:
        
df.to_csv('./movie_data.csv', index=False)


In [5]:
df = pd.read_csv('./movie_data.csv')
df.head(3)

index | review | sentiment
--- | --- | ---
0 | My family and I normally do not watch local mo... | 1
1 | Believe it or not, this was at one time the wo... | 0
2 | After some internet surfing, I found the "Home... | 0


In [174]:
df.shape

(50000, 2)

### 2. Cleaning Text Data

The first important step before we build our bag-of-words or any other model is to clean the text data by stripping it of all unwanted characters.

The cleaning consists of the following steps:

- a. Lemmatization (removing inflectional endings only and returning the base or dictionary form of a word, which is known as the lemma)
- b. Text preprocessor
    - remove punctuation, digits, and line breaks
    - convert all characters to lowercase
    - remove all of the HTML markup
- c. Stopword removal (is, the, for, to, etc.). Stop words are common words that carry little meaning on their own. Search engines usually filter them out of queries because they match a vast amount of unnecessary information.

In [7]:
df.loc[0,'review'][-50:]

'to Star Cinema!! Way to go, Jericho and Claudine!!'

In [64]:
##a. lemmatization
# map universal POS tags to the WordNet POS codes that the lemmatizer expects
POS_MAP = {"NOUN": "n", "VERB": "v", "ADV": "r", "ADJ": "a"}
lemmatiser = WordNetLemmatizer()  # create once and reuse across calls

def lemma(words):
    words_tag = pos_tag(words, tagset='universal')
    # lemmatize nouns, verbs, adverbs, and adjectives; leave other tokens unchanged
    return [lemmatiser.lemmatize(word, pos=POS_MAP[tag]) if tag in POS_MAP else word
            for word, tag in words_tag]

In [43]:
##b. text preprocessor (includes stopword removal and lemmatization)
STOPWORDS = set(stopwords.words('english'))  # build once; set membership tests are fast

def preprocessor(text, punc_num=True, root=True, stopword=True):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    if punc_num:
        # collect emoticons before lowercasing so that :D and :P can still match
        emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
        # replace punctuation with spaces, re-append the emoticons, drop digits
        text = re.sub(r'[\W]+', ' ', text) + ' ' + \
            ' '.join(emoticons).replace('-', '')
        text = re.sub(r'[0-9]', '', text)
    text = text.lower()
    if root:
        words = word_tokenize(text)
        words = lemma(words)
        text = ' '.join(words)

    if stopword:
        words = [word for word in text.split() if word not in STOPWORDS]
        text = ' '.join(words)
    return text

In [44]:
preprocessor(df.loc[0, 'review'][-50:])

'ons star cinema way go jericho claudine'

In [175]:
## test how the preprocessor works: the sentence exercises lemmatization, stopword removal, and punctuation removal
preprocessor("</a>This :) is :( a 50 tests :-)!")

'test : ) : ( : )'

In [47]:
df['review'] = df['review'].apply(preprocessor)

In [48]:
df.head()

index | review | sentiment
--- | --- | ---
0 | family normally watch local movie simple reaso... | 1
1 | believe one time bad movie ever see since time... | 0
2 | internet surf find homefront series dvd ioffer... | 0
3 | one unheralded great work animation though mak... | 1
4 | sixty anyone long hair hip distant attitude co... | 0


### Divide dataset into train and test

We divide the dataset into train and test sets for the machine learning models that follow.

In [149]:
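## note: .loc slices by label and includes both endpoints, so X_train gets
## 25,001 rows and the row labeled 25000 also appears in X_test;
## df.iloc[:25000] and df.iloc[25000:] would give a non-overlapping split
X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values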
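
The difference between label-based and position-based slicing is easy to see on a small frame. The following is a minimal sketch on hypothetical toy data (the frame `toy` and its contents are made up for illustration); it only shows why the label-based split yields one extra, shared row.

```python
import pandas as pd

# Toy frame with a default integer index 0..9 (hypothetical data).
toy = pd.DataFrame({'review': [f'review {i}' for i in range(10)],
                    'sentiment': [i % 2 for i in range(10)]})

# Label-based slicing with .loc includes both endpoints.
loc_train = toy.loc[:5, 'review']
loc_test = toy.loc[5:, 'review']

# Position-based slicing with .iloc excludes the stop position.
iloc_train = toy.iloc[:5]['review']
iloc_test = toy.iloc[5:]['review']

print(len(loc_train), len(loc_test))    # the row labeled 5 is counted in both
print(len(iloc_train), len(iloc_test))  # no row is shared
```

With `.iloc`, the two halves partition the frame, which is usually what a train/test split needs.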

In [105]:
X = df['review'].values
y = df['sentiment'].values

In [176]:
## We can also use the train_test_split to randomly split dataset to test and train. 
## In future practice, we could try different combination of the split ratio.

#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 0)


### 3. Vectorization

#### 3.1. Bag of Words (BoW)

The bag-of-words model allows us to represent text as numerical feature vectors. The idea behind the bag-of-words model is quite simple and can be summarized as follows:

1. We create a vocabulary of unique tokens (for example, words) from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them sparse. Do not worry if this sounds too abstract; in the following subsections, we will walk through the process of creating a simple bag-of-words model step-by-step.

- "Python Machine Learning"

In [150]:
## Fit CountVectorizer on the cleaned reviews to build the
## bag-of-words vocabulary, then transform the train and test
## reviews into sparse feature vectors with that same vocabulary:

docs = df["review"]
count = CountVectorizer().fit(docs)
bag_train = count.transform(X_train)
bag_test = count.transform(X_test)

In [151]:
## check whether the train and test matrices have the same number of features (columns).
bag_train.shape

(25001, 90420)

In [152]:
bag_test.shape

(25000, 90420)

In [153]:
## Print the contents of the vocabulary 
print(count.vocabulary_)



"As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words that are mapped to integer indices. Next let us print the feature vectors that we just created:

Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the first feature at index position 0 resembles the count of the word *and*, which only occurs in the last document, and the word *is* at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: *tf(t, d)*, the number of times a term *t* occurs in a document *d*. "

- from Python Machine Learning Book Chap 08
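
The quoted passage describes a three-sentence toy corpus rather than the movie reviews. The sketch below reproduces that situation, assuming the three sentences are the ones used in the quoted book chapter; the names `example_docs`, `example_count`, and `example_bag` are chosen here so that they do not overwrite the notebook's own variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed toy corpus: the three sentences from the quoted book chapter.
example_docs = ['The sun is shining',
                'The weather is sweet',
                'The sun is shining, the weather is sweet, and one and one is two']

example_count = CountVectorizer()
example_bag = example_count.fit_transform(example_docs)

# Vocabulary: word -> column index (columns are in alphabetical order).
print(sorted(example_count.vocabulary_.items(), key=lambda kv: kv[1]))

# One row per sentence, one column per vocabulary word (raw term frequencies).
print(example_bag.toarray())
```

Column 0 holds the counts of *and* (non-zero only in the last sentence) and column 1 holds the counts of *is* (non-zero in every sentence), which matches the description in the quote.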

In [157]:
feature_names = count.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(feature_names)



In [158]:
number_of_features = len(feature_names)
print(number_of_features)

90420
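
The assignment also asks about variations such as n-grams and stopword removal inside the vectorizer. The following is a minimal sketch on a hypothetical toy corpus (not the review data) of how `ngram_range` and `stop_words` change the number of features; the same parameters could be passed to the `CountVectorizer` used above, at the cost of a much larger vocabulary for bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to make the feature counts easy to check.
toy_docs = ['The sun is shining',
            'The weather is sweet',
            'The sun is shining, the weather is sweet, and one and one is two']

variants = {
    'unigrams': CountVectorizer(),
    'unigrams + bigrams': CountVectorizer(ngram_range=(1, 2)),
    'unigrams, English stop words removed': CountVectorizer(stop_words='english'),
}

# Fit each variant and record the size of its vocabulary.
feature_counts = {name: len(vec.fit(toy_docs).vocabulary_)
                  for name, vec in variants.items()}

for name, n_features in feature_counts.items():
    print(f'{name}: {n_features} features')
```

Adding bigrams grows the vocabulary, while removing stop words shrinks it, so both choices directly affect the "number of features" column of any model comparison.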


#### 3.2. Term Frequency - Inverse Document Frequency (TF-IDF)

"When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. " - chap 8 "Python Machine Learning"

The tf-idf is the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

Here, tf(t, d) is the term frequency, and the inverse document frequency idf(t, d) is:

$$\text{idf}(t,d) = \log\frac{n_d}{1 + \text{df}(d, t)}$$

where $n_d$ is the total number of documents, and df(d, t) is the number of documents *d* that contain the term *t*.

Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
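
To get a feel for the formula, the sketch below evaluates it for a hypothetical corpus of three documents. The natural logarithm is an assumption here, since the text does not state the base, and `idf_textbook` is a helper name made up for this illustration.

```python
import math

N_DOCS = 3  # total number of documents in a hypothetical corpus


def idf_textbook(doc_freq, n_docs=N_DOCS):
    """idf(t, d) = log(n_d / (1 + df(d, t))), using the natural logarithm."""
    return math.log(n_docs / (1 + doc_freq))


# Evaluate the idf for terms that occur in 1, 2, or all 3 documents.
idf_values = {doc_freq: idf_textbook(doc_freq) for doc_freq in (1, 2, 3)}
for doc_freq, value in idf_values.items():
    print(f'df = {doc_freq}: idf = {value:.3f}')
```

The idf shrinks as the document frequency grows, which is the downweighting effect described above.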

In [159]:
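## TfidfVectorizer combines CountVectorizer and TfidfTransformer:
## it computes raw term frequencies and transforms them into tf-idfs

from sklearn.feature_extraction.text import TfidfVectorizer

The equation for the inverse document frequency that is implemented in scikit-learn (with the default `smooth_idf=True`) is as follows:

$$\text{idf} (t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$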

The tf-idf equation that was implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn's TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
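
As a small numeric illustration of the L2-normalization formula, the sketch below normalizes a hypothetical vector with NumPy; the values are made up and are not taken from the review data.

```python
import numpy as np

v = np.array([3.0, 4.0, 0.0])          # hypothetical un-normalized tf-idf vector
v_norm = v / np.sqrt(np.sum(v ** 2))   # divide by the L2-norm, here sqrt(9 + 16) = 5

print(v_norm)
print(np.linalg.norm(v_norm))          # length of the normalized vector
```

After normalization, long and short documents produce vectors of the same length, so a document's word count no longer dominates its feature values.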

To make sure that we understand how TfidfTransformer works, let us walk through an example and calculate the tf-idf of the word *is* in the 3rd document of a small three-document example.

The word *is* has a term frequency of 3 (tf = 3) in document 3, and the document frequency of this term is 3 since the term *is* occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}(\text{"is"}, d3) = \log \frac{1+3}{1+3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$
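
The hand calculation can be checked against scikit-learn. The sketch below assumes that the three-document example consists of the three sentences used in the quoted book chapter; `norm=None` switches off the L2-normalization so that the raw tf-idf value is visible.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Assumed three-document example from the quoted book chapter.
example_docs = ['The sun is shining',
                'The weather is sweet',
                'The sun is shining, the weather is sweet, and one and one is two']

example_count = CountVectorizer()
term_counts = example_count.fit_transform(example_docs)
is_col = example_count.vocabulary_['is']   # column index of the word "is"

# Raw tf-idf without normalization: tf * (idf + 1) = 3 * (0 + 1).
raw = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_is = raw.fit_transform(term_counts).toarray()[2, is_col]

# Default behaviour: the same values scaled to unit L2 length per document.
l2 = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
l2_doc3 = l2.fit_transform(term_counts).toarray()[2]

print(raw_is)
print(np.linalg.norm(l2_doc3))
```

The unnormalized value for *is* in the third document agrees with the hand calculation, and the normalized document vector has length 1.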

In [160]:
vect = TfidfVectorizer()
vect.fit(docs)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [161]:
print(len(vect.vocabulary_))

90420


In [162]:
print(vect.vocabulary_)



In [163]:
## transform the datasets after tfidf techniques to test and train datasets
tfidf_test = vect.transform(X_test)
tfidf_train = vect.transform(X_train)

In [164]:
## check the shape of the test and train datasets
tfidf_test.shape

(25000, 90420)

In [165]:
tfidf_train.shape

(25001, 90420)

In [166]:
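## tfidf_train is the sparse training matrix from above; converting the full matrix
## to a dense array needs a lot of memory, so slice it first if memory is tight
print(tfidf_train.toarray())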

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


#### 3.3. Feature Hashing

- BoW and TF-IDF models need to keep the vocabulary in memory to encode documents, which can require a lot of memory for large datasets and slow down processing
- An alternative is feature hashing (the "hashing trick"), which applies a hash function to each token to map the token string directly to a feature integer index
- The HashingVectorizer class (in scikit-learn) implements this approach: it tokenizes documents, hashes the tokens, and encodes the documents as needed
- The key advantage of this strategy is low memory use for large datasets; the downsides are that there is no way to convert the encoding back to words and that different words can collide in the same feature

"class slide session 5 - page 19"

In [167]:
from sklearn.feature_extraction.text import HashingVectorizer

In [169]:
## HashingVectorizer is stateless: it needs no fit step and builds no
## vocabulary. Calling transform hashes each token to a column index
## and returns sparse feature vectors for the train and test datasets.
## Note: n_features=20 leaves only 20 columns, so many different words
## share a column (the scikit-learn default is 2**20).
vectorizer = HashingVectorizer(n_features=20)
feature_hashing_test = vectorizer.transform(X_test)
feature_hashing_train = vectorizer.transform(X_train)
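
The properties of the hashing trick described in this section can be seen on a tiny example. This is a minimal sketch on a hypothetical two-sentence corpus; the width of 8 columns is arbitrary and deliberately small so that the output is readable.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy corpus (not the review data).
toy_docs = ['a great movie with a great cast',
            'a dull plot and a weak ending']

hasher = HashingVectorizer(n_features=8)   # fixed output width, chosen arbitrarily
hashed = hasher.transform(toy_docs)        # no fit step is needed

print(hashed.shape)                        # the width equals n_features
print(hasattr(hasher, 'vocabulary_'))      # no vocabulary is stored
print(hashed.toarray().round(2))
```

Because the output width is fixed up front, memory use does not grow with the corpus, but with too few columns many words land in the same column and the model can no longer tell them apart.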

### 4. Sentiment Analysis
#### 4.1. Using Bag of words

In [156]:
#logistic regression
logreg = LogisticRegression()
logreg.fit(bag_train,y_train)
y_pred = logreg.predict(bag_test)
print(classification_report(y_test,y_pred))

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(bag_test, y_test)))

             precision    recall  f1-score   support

          0       0.88      0.87      0.88     12527
          1       0.87      0.88      0.88     12473

avg / total       0.88      0.88      0.88     25000

Accuracy of logistic regression classifier on test set: 0.88


In [155]:
#Decision Tree
dtree = DecisionTreeClassifier()
dtree.fit(bag_train,y_train)
pred_test = dtree.predict(bag_test)
print(classification_report(y_test,pred_test))

             precision    recall  f1-score   support

          0       0.72      0.72      0.72     12527
          1       0.72      0.72      0.72     12473

avg / total       0.72      0.72      0.72     25000



#### 4.2. Using TF-IDF

In [170]:
#logistic regression
logreg = LogisticRegression()
logreg.fit(tfidf_train,y_train)
y_pred = logreg.predict(tfidf_test)
print(classification_report(y_test,y_pred))

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(tfidf_test, y_test)))

             precision    recall  f1-score   support

          0       0.90      0.87      0.89     12527
          1       0.88      0.90      0.89     12473

avg / total       0.89      0.89      0.89     25000

Accuracy of logistic regression classifier on test set: 0.89


In [171]:
#Decision Tree
dtree = DecisionTreeClassifier()
dtree.fit(tfidf_train,y_train)
pred_test = dtree.predict(tfidf_test)
print(classification_report(y_test,pred_test))

             precision    recall  f1-score   support

          0       0.71      0.71      0.71     12527
          1       0.71      0.71      0.71     12473

avg / total       0.71      0.71      0.71     25000



#### 4.3. Using Feature Hashing

In [172]:
#logistic regression
logreg = LogisticRegression()
logreg.fit(feature_hashing_train,y_train)
y_pred = logreg.predict(feature_hashing_test)
print(classification_report(y_test,y_pred))

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(feature_hashing_test, y_test)))

             precision    recall  f1-score   support

          0       0.62      0.61      0.61     12527
          1       0.61      0.62      0.62     12473

avg / total       0.61      0.61      0.61     25000

Accuracy of logistic regression classifier on test set: 0.61


In [173]:
#Decision Tree
dtree = DecisionTreeClassifier()
dtree.fit(feature_hashing_train,y_train)
pred_test = dtree.predict(feature_hashing_test)
print(classification_report(y_test,pred_test))

             precision    recall  f1-score   support

          0       0.54      0.54      0.54     12527
          1       0.54      0.54      0.54     12473

avg / total       0.54      0.54      0.54     25000



#### 4.4. GridSearchCV

In [42]:
# split into train and test (label-based .loc slices, which include both endpoints)
X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values

In [43]:
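from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from nltk.stem.porter import PorterStemmer

## The helpers below are not defined elsewhere in this notebook; they are
## assumed to follow the definitions in "Python Machine Learning", chap 8.
stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)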

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

## liblinear supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [44]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 13.7min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 75.2min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 101.6min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_tr

In [45]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x116dd3400>} 
CV Accuracy: 0.893
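
The score reported by `best_score_` is a cross-validation accuracy on the training data. To compare a grid-searched model with the other models on equal terms, it also needs to be scored on held-out data. The sketch below shows the pattern on a hypothetical toy dataset with a much smaller grid; all `toy_*` names and example reviews are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four positive and four negative "reviews".
toy_reviews = ['great movie', 'wonderful acting', 'loved this film', 'great fun',
               'terrible movie', 'awful acting', 'hated this film', 'boring mess']
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LogisticRegression(solver='liblinear', random_state=0))])

# Parameters are addressed as <step name>__<parameter name>.
toy_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
            'clf__C': [1.0, 10.0]}

toy_search = GridSearchCV(toy_pipe, toy_grid, scoring='accuracy', cv=2)
toy_search.fit(toy_reviews, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)     # mean cross-validation accuracy on the training data

# With refit=True (the default), score() uses the best pipeline on held-out data.
held_out_accuracy = toy_search.score(['great film', 'awful mess'], [1, 0])
print(held_out_accuracy)
```

In the notebook, the equivalent call would be `gs_lr_tfidf.score(X_test, y_test)`, which yields a test accuracy that is directly comparable with the other models.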


### 5. Result Comparison for different Methods and Models

Text Representation Technique | Logistic Regression (test F1-score) | Decision Tree (test F1-score) | Logistic Regression with GridSearchCV (5-fold CV accuracy)
------------- | ------------- | ------ | ---
Bag of Words | 0.88 | 0.72 | NA
TF-IDF | 0.89 | 0.71 | 0.893
Feature Hashing | 0.61 | 0.54 | NA


### Summary of the Solution & Key Highlights

For this exercise, we applied sentiment analysis (two-class text classification) with two machine learning algorithms, logistic regression and a decision tree, and compared model performance across the three text representation techniques.

Based on the results, the grid-searched logistic regression on TF-IDF features is the best candidate for the sentiment analysis: it has the highest score, a cross-validated accuracy of 0.893. Note that this score is a 5-fold cross-validation accuracy on the training data, whereas the other scores are measured on the test set, so the comparison is only approximate.

#### Key highlight (Running Time):

During the process, we found that feature hashing was the fastest technique to run and TF-IDF the slowest. In addition, the most accurate model, the GridSearchCV logistic regression on TF-IDF, took about 100 minutes to run (101.6 minutes in the log above).

If the company cares about running time, the TF-IDF GridSearchCV may not be the optimal choice. Since the grid search takes well over an hour and the plain logistic regression on TF-IDF takes only a few minutes, the company should choose the plain TF-IDF logistic regression, given that the two have similar accuracy scores (both around 0.89).
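
Running times like the ones discussed here can be measured rather than estimated. The following is a minimal sketch on a hypothetical toy corpus; the numbers depend on the machine and corpus, so it illustrates the measurement pattern only and says nothing about the timings on the review data.

```python
import time
from sklearn.feature_extraction.text import (CountVectorizer, HashingVectorizer,
                                             TfidfVectorizer)

# Hypothetical toy corpus: two sentences repeated to get a measurable workload.
toy_corpus = ['the acting was great and the plot was fun',
              'the plot was dull and the ending was weak'] * 500

timings = {}
for name, vec in [('bag of words', CountVectorizer()),
                  ('tf-idf', TfidfVectorizer()),
                  ('feature hashing', HashingVectorizer(n_features=20))]:
    start = time.perf_counter()
    vec.fit_transform(toy_corpus)              # vectorize the whole corpus
    timings[name] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s')
```

Wrapping the model fitting calls in the same way would give comparable numbers for the classifiers and the grid search.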


### Key Learnings

There are three key learnings in this part of the project:

1. When creating the train and test matrices for the different methods, Team X needs to transform both sets with one and the same fitted vectorizer, rather than fitting a separate vectorizer on each set. Fitting separately produces a different vocabulary, and therefore different features, for each set, so a model trained on one cannot score the other.
   - For example, when Team X first ran the logistic regression model with the bag-of-words vectorization, the model would not run because the number of features in the bag-of-words test matrix differed from the number in the train matrix. Team X went back to the vectorization step and realized that the text vectorization had been done separately for the train and test datasets. Team X solved the problem by fitting the vectorizer once on the full review dataset and then transforming the train and test reviews with it, which guarantees the same features in both. Fitting on the training reviews only and transforming both sets with that vectorizer would also work, and it keeps test-set vocabulary out of the features.
   
2. It takes a long time to reopen the notebook after the vectorization part, because the printed vocabulary and feature names are stored in the file. To speed up reopening, Team X should clear those outputs or avoid printing the full feature lists.

3. Variations of the text representation techniques are crucial to model performance. In the exercise, Team X used several steps to clean the dataset before processing. Cleaning the text data by stripping it of all unwanted characters should be the first step before we build our bag-of-words or any other model.
   - a. Lemmatization (removing inflectional endings only and returning the base or dictionary form of a word, which is known as the lemma)
   - b. Text preprocessor
       - remove punctuation and line breaks
       - convert all characters to lowercase
       - remove all of the HTML markup
   - c. Stopword removal (is, the, for, to, etc.). Stop words are common words that carry little meaning on their own. Search engines usually filter them out of queries because they match a vast amount of unnecessary information.
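
The feature mismatch from the first key learning can be reproduced on a tiny example. The sketch below uses hypothetical toy reviews and, as one possible variant, fits the shared vectorizer on the training reviews only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews.
toy_train = ['great movie', 'terrible movie', 'great cast']
toy_test = ['great ending', 'terrible plot twist']

# Fitting one vectorizer per set gives each matrix its own vocabulary and width.
separate_train = CountVectorizer().fit_transform(toy_train)
separate_test = CountVectorizer().fit_transform(toy_test)

# Fitting once and reusing the fitted vectorizer keeps the columns aligned.
shared = CountVectorizer().fit(toy_train)
shared_train = shared.transform(toy_train)
shared_test = shared.transform(toy_test)

print(separate_train.shape, separate_test.shape)  # different numbers of columns
print(shared_train.shape, shared_test.shape)      # same number of columns
```

Words that appear only in the test reviews, such as *ending*, are simply ignored by the shared vectorizer, which is the behavior a deployed model would see on new reviews.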
