# <center> Milestone 2: Feature Engineering, Baseline Model & Interpretability </center>

### <center> Authors: Albina Cako & Joshua Dalphy </center>

# Table of Contents

* [1. Getting Started](#1)
* [2. Data Transformation](#2)
 * [2.1) Stemming the Dataset](#2.1)
 * [2.2) Removing Numerical Values](#2.2)
* [3. Feature Extraction & Baseline Modelling](#3)
 * [3.1) Bag of Words - Count](#3.1)
 * [3.2) Bag of Words - TF-IDF](#3.2)
 * [3.3) Bag of Bi-Grams - Count](#3.3)
 * [3.4) Bag of Bi-Grams - TF-IDF](#3.4)
* [4. Model Interpretation](#4)
 * [4.1) Confusion Matrices Comparison and Model Time Performance](#4.1)
 * [4.2) Feature Importance Using Random Forest](#4.2)
* [5. Conclusions](#5)

# 1. Getting Started<a class="anchor" id="1"></a>

In this milestone we extract features from our movie review dataset using Bag of Words (count and TF-IDF) and Bag of Bi-Grams (count and TF-IDF). Logistic regression was chosen as the baseline model for sentiment prediction. The four feature extraction variants are compared using the confusion matrix and the runtime of the logistic regression model.
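As a minimal sketch of how the four feature extraction variants differ, the snippet below applies count and TF-IDF vectorizers to a two-document toy corpus. The corpus and all variable names are hypothetical and are not taken from the movie review dataset; only the vectorizer classes and parameters match those used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not taken from the movie review dataset)
toy_docs = ["good movi good act", "bad movi"]

# Bag of Words with raw counts: one column per distinct token
unigram_counts = CountVectorizer(ngram_range=(1, 1))
count_matrix = unigram_counts.fit_transform(toy_docs).toarray()
unigram_vocab = sorted(unigram_counts.vocabulary_)

# Bag of Bi-Grams with raw counts: one column per pair of adjacent tokens
bigram_counts = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_counts.fit_transform(toy_docs).toarray()
bigram_vocab = sorted(bigram_counts.vocabulary_)

# TF-IDF reweights the counts so tokens shared by every document weigh less
tfidf = TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True)
tfidf_matrix = tfidf.fit_transform(toy_docs).toarray()

print(unigram_vocab)
print(count_matrix)
print(bigram_vocab)
print(tfidf_matrix.round(2))
```

In the second toy document, "movi" occurs in both documents and so receives a lower TF-IDF weight than "bad", even though each appears once.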

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
import model_evaluation_utils as meu

# Set print options
np.set_printoptions(precision=2, linewidth=80)

# Load the cleaned preprocessed dataset
dataset = pd.read_csv('Movie_Reviews_Clean.csv')

# Remove extra column
dataset = dataset.drop(columns=['Unnamed: 0'])

# replace positive with 1 and negative with 0
dataset.Sentiments = dataset.Sentiments.replace('positive',1)
dataset.Sentiments = dataset.Sentiments.replace('negative',0)

dataset = shuffle(dataset)

# Retrieve the reviews
dataset.head(3)

Unnamed: 0,Reviews,Sentiments
2932,not understand comment focus mcconaughey never...,1
5537,let us say simple word even maker film may cha...,0
1414,year lose gorgeous jane parker maureen osulliv...,1


In [3]:
# Create review and sentiment arrays
reviews = np.array(dataset['Reviews'])
sentiment = np.array(dataset['Sentiments'])

# Split the reviews into testing and training 70/30
index = round(0.7*len(reviews))

train_reviews = reviews[:index]
test_reviews  = reviews[index:]

train_sentiments = sentiment[:index]
test_sentiments  = sentiment[index:]

# 2. Data Transformation<a class="anchor" id="2"></a>

## 2.1) Stemming the Dataset<a class="anchor" id="2.1"></a>

In [4]:
#import nltk
#nltk.download()

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
porter = PorterStemmer()

def stemSentence(sentence):
    # Tokenize the sentence, stem each token, and append a space after each stem
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)
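To illustrate what stemming does to individual words, here is a small sketch using the same `PorterStemmer`. The helper `stem_tokens` and the example input are hypothetical, and `str.split()` stands in for `word_tokenize` so that the sketch runs without downloading NLTK data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    """Stem whitespace-separated tokens and rejoin them with single spaces."""
    # str.split() replaces word_tokenize here (an assumption made for this sketch)
    return " ".join(stemmer.stem(token) for token in text.split())

stemmed_example = stem_tokens("wasted movies")
print(stemmed_example)
```

Stems such as `wast` and `movi` are the forms that appear as feature names in the sections below.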

In [5]:
# Apply the stemming function to the training and test reviews
train_reviews = [stemSentence(sentence) for sentence in train_reviews]
test_reviews = [stemSentence(sentence) for sentence in test_reviews]

## 2.2) Removing Numerical Values<a class="anchor" id="2.2"></a>

In [6]:
# Define a function to remove digits from a sentence
import re
def remove_digits(text):
    text = re.sub(r'[0-9]+', '', text)
    #text = re.sub('[^a-zA-Z\s]', '', text)
    return text

In [8]:
# Apply the function to the training and test reviews
train_reviews = [remove_digits(doc) for doc in train_reviews]
test_reviews = [remove_digits(doc) for doc in test_reviews]

# 3. Feature Extraction & Baseline Modelling<a class="anchor" id="3"></a>

In [20]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

## 3.1) Bag of Words - Count<a class="anchor" id="3.1"></a>

### 3.1.1) Feature Extraction<a class="anchor" id="3.1.1"></a>

In [11]:
# build BOW features on train reviews
BOW_cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,1))
BOW_cv_train_features = BOW_cv.fit_transform(train_reviews)

# transform test reviews into features
BOW_cv_test_features = BOW_cv.transform(test_reviews)

In [12]:
print('BOW model:> Train features shape:', BOW_cv_train_features.shape, ' Test features shape:', BOW_cv_test_features.shape)

BOW model:> Train features shape: (4900, 24959)  Test features shape: (2100, 24959)


In [15]:
# Visualize some of the features
bow_feature_names = BOW_cv.get_feature_names()
bow_feature_names[200:220]

['adher',
 'adhes',
 'adibah',
 'adieu',
 'adio',
 'aditiya',
 'aditya',
 'adject',
 'adjoin',
 'adjust',
 'adkin',
 'adler',
 'administ',
 'administr',
 'admir',
 'admireranyon',
 'admiss',
 'admit',
 'admitedli',
 'admitt']

In [16]:
BOW_matrix = BOW_cv_train_features.toarray()
bow_df = pd.DataFrame(BOW_matrix, columns=bow_feature_names)
bow_df.iloc[0:3,2000:2010]

Unnamed: 0,bernard,bernhard,berni,bernic,bernier,bernsen,bernstein,berri,berrisford,berryman
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0


### 3.1.2) Baseline Modelling - Logistic Regression<a class="anchor" id="3.1.2"></a>

In [22]:
%%time
lr_bow_count = LogisticRegression(penalty='l2', max_iter=200, C=1)

# Logistic Regression model on BOW-count features
lr_bow_predictions = meu.train_predict_model(classifier=lr_bow_count, 
                                             train_features=BOW_cv_train_features, train_labels=train_sentiments,
                                             test_features=BOW_cv_test_features, test_labels=test_sentiments)

Wall time: 851 ms


In [23]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_predictions,
                                      classes=[1,0])

Model Performance metrics:
------------------------------
Accuracy: 0.8743
Precision: 0.8747
Recall: 0.8743
F1 Score: 0.8743

Model Classification report:
------------------------------
              precision    recall  f1-score   support

           1       0.86      0.89      0.87      1038
           0       0.89      0.86      0.87      1062

    accuracy                           0.87      2100
   macro avg       0.87      0.87      0.87      2100
weighted avg       0.87      0.87      0.87      2100


Prediction Confusion Matrix:
------------------------------
          Predicted:     
                   1    0
Actual: 1        922  116
        0        148  914


## 3.2) Bag of Words - TF-IDF<a class="anchor" id="3.2"></a>

### 3.2.1) Feature Extraction<a class="anchor" id="3.2.1"></a>

In [13]:
# build TFIDF features on train reviews
BOW_tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,1),
                     sublinear_tf=True)
BOW_tv_train_features = BOW_tv.fit_transform(train_reviews)

# transform test reviews into features
BOW_tv_test_features = BOW_tv.transform(test_reviews)

In [14]:
print('TFIDF model:> Train features shape:', BOW_tv_train_features.shape, ' Test features shape:', BOW_tv_test_features.shape)

TFIDF model:> Train features shape: (4900, 24959)  Test features shape: (2100, 24959)


In [19]:
# Visualize some of the features
bow_tv_feature_names = BOW_tv.get_feature_names()

BOW_tv_matrix = BOW_tv_train_features.toarray()
bow_tv_df = pd.DataFrame(BOW_tv_matrix, columns=bow_tv_feature_names)
bow_tv_df.iloc[0:3,2000:2010]

Unnamed: 0,bernard,bernhard,berni,bernic,bernier,bernsen,bernstein,berri,berrisford,berryman
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.2.2) Baseline Modelling - Logistic Regression<a class="anchor" id="3.2.2"></a>

In [24]:
%%time
lr_bow_tfidf = LogisticRegression(penalty='l2', max_iter=200, C=1)

# Logistic Regression model on BOW-TF-IDF features
lr_bow_tfidf_predictions = meu.train_predict_model(classifier=lr_bow_tfidf, 
                                             train_features=BOW_tv_train_features, train_labels=train_sentiments,
                                             test_features=BOW_tv_test_features, test_labels=test_sentiments)

Wall time: 293 ms


In [25]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_tfidf_predictions,
                                      classes=[1,0])

Model Performance metrics:
------------------------------
Accuracy: 0.8833
Precision: 0.8839
Recall: 0.8833
F1 Score: 0.8833

Model Classification report:
------------------------------
              precision    recall  f1-score   support

           1       0.87      0.90      0.88      1038
           0       0.90      0.87      0.88      1062

    accuracy                           0.88      2100
   macro avg       0.88      0.88      0.88      2100
weighted avg       0.88      0.88      0.88      2100


Prediction Confusion Matrix:
------------------------------
          Predicted:     
                   1    0
Actual: 1        935  103
        0        142  920


## 3.3) Bag of Bi-Grams - Count<a class="anchor" id="3.3"></a>

### 3.3.1) Feature Extraction<a class="anchor" id="3.3.1"></a>

In [26]:
# build Bag of Bi-Grams count features on train reviews
BOBG_cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(2,2))
BOBG_cv_train_features = BOBG_cv.fit_transform(train_reviews)

# transform test reviews into features
BOBG_cv_test_features = BOBG_cv.transform(test_reviews)

In [31]:
print('BOBG-count model:> Train features shape:', BOBG_cv_train_features.shape, ' Test features shape:', BOBG_cv_test_features.shape)

BOBG-count model:> Train features shape: (4900, 385586)  Test features shape: (2100, 385586)


In [28]:
# Visualize some of the features
BOBG_feature_names = BOBG_cv.get_feature_names()
BOBG_feature_names[200:220]

['abil endless',
 'abil everi',
 'abil expert',
 'abil express',
 'abil feel',
 'abil figur',
 'abil film',
 'abil find',
 'abil fine',
 'abil full',
 'abil give',
 'abil good',
 'abil help',
 'abil hold',
 'abil immedi',
 'abil inflict',
 'abil job',
 'abil lie',
 'abil like',
 'abil limit']

In [29]:
BOBG_matrix = BOBG_cv_train_features.toarray()
bobg_df = pd.DataFrame(BOBG_matrix, columns=BOBG_feature_names)
bobg_df.iloc[0:3,2000:2010]

Unnamed: 0,acid lay,acid morph,acid movi,acid mr,acid mushroom,acid never,acid poptart,acid quickli,acid seem,acid start
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0


### 3.3.2) Baseline Modelling - Logistic Regression<a class="anchor" id="3.3.2"></a>

In [37]:
%%time
lr_bobg_count = LogisticRegression(penalty='l2', max_iter=100, C=1)

# Logistic Regression model on BOBG-count features
lr_bobg_count_predictions = meu.train_predict_model(classifier=lr_bobg_count, 
                                             train_features=BOBG_cv_train_features, train_labels=train_sentiments,
                                             test_features=BOBG_cv_test_features, test_labels=test_sentiments)

Wall time: 3.08 s


In [38]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bobg_count_predictions,classes=[1,0])

Model Performance metrics:
------------------------------
Accuracy: 0.8295
Precision: 0.833
Recall: 0.8295
F1 Score: 0.8292

Model Classification report:
------------------------------
              precision    recall  f1-score   support

           1       0.80      0.88      0.84      1038
           0       0.87      0.78      0.82      1062

    accuracy                           0.83      2100
   macro avg       0.83      0.83      0.83      2100
weighted avg       0.83      0.83      0.83      2100


Prediction Confusion Matrix:
------------------------------
          Predicted:     
                   1    0
Actual: 1        911  127
        0        231  831


## 3.4) Bag of Bi-Grams - TF-IDF<a class="anchor" id="3.4"></a>

### 3.4.1) Feature Extraction<a class="anchor" id="3.4.1"></a>

In [30]:
# build TFIDF features on train reviews
BOBG_tv_tfidf = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(2,2),
                     sublinear_tf=True)
BOBG_tv_tfidf_train_features = BOBG_tv_tfidf.fit_transform(train_reviews)

# transform test reviews into features
BOBG_tv_tfidf_test_features = BOBG_tv_tfidf.transform(test_reviews)

In [32]:
print('BOBG-tfidf model:> Train features shape:', BOBG_tv_tfidf_train_features.shape, ' Test features shape:', BOBG_tv_tfidf_test_features.shape)

BOBG-tfidf model:> Train features shape: (4900, 385586)  Test features shape: (2100, 385586)


In [34]:
# Visualize some of the features
BOBG_tfidf_feature_names = BOBG_tv_tfidf.get_feature_names()

BOBG_tfidf_matrix = BOBG_tv_tfidf_train_features.toarray()
bobg_tfidf_df = pd.DataFrame(BOBG_tfidf_matrix, columns=BOBG_tfidf_feature_names)
bobg_tfidf_df.iloc[0:3,2000:2010]

Unnamed: 0,acid lay,acid morph,acid movi,acid mr,acid mushroom,acid never,acid poptart,acid quickli,acid seem,acid start
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.4.2) Baseline Modelling - Logistic Regression<a class="anchor" id="3.4.2"></a>

In [39]:
%%time
lr_bobg_tfidf = LogisticRegression(penalty='l2', max_iter=100, C=1)

# Logistic Regression model on BOBG-TF-IDF features
lr_bobg_tfidf_predictions = meu.train_predict_model(classifier=lr_bobg_tfidf, 
                                             train_features=BOBG_tv_tfidf_train_features, train_labels=train_sentiments,
                                             test_features=BOBG_tv_tfidf_test_features, test_labels=test_sentiments)

Wall time: 2.16 s


In [40]:
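# Evaluate the TF-IDF predictions. The output recorded below came from a run that passed
# lr_bobg_count_predictions here, so it repeats the section 3.3.2 results until the cell is rerun.
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bobg_tfidf_predictions,classes=[1,0])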

Model Performance metrics:
------------------------------
Accuracy: 0.8295
Precision: 0.833
Recall: 0.8295
F1 Score: 0.8292

Model Classification report:
------------------------------
              precision    recall  f1-score   support

           1       0.80      0.88      0.84      1038
           0       0.87      0.78      0.82      1062

    accuracy                           0.83      2100
   macro avg       0.83      0.83      0.83      2100
weighted avg       0.83      0.83      0.83      2100


Prediction Confusion Matrix:
------------------------------
          Predicted:     
                   1    0
Actual: 1        911  127
        0        231  831


# 4. Model Interpretation<a class="anchor" id="4"></a>

## 4.1) Confusion Matrices Comparison and Model Time Performance<a class="anchor" id="4.1"></a>

Confusion matrices were used to compare the four baseline logistic regression models, one per feature extraction method.

### Bag of Words: Count vs. TF-IDF
Looking at the confusion matrices, the model with the highest accuracy was the one trained on Bag of Words - TF-IDF features. It performed better than the count model, with an accuracy of 0.8833 versus 0.8743, roughly one percentage point higher. Precision was also higher for both the positive and the negative class. The TF-IDF model produced fewer false positives (142 versus 148) and fewer false negatives (103 versus 116) than the count model. In both models, false positives outnumbered false negatives.

Both logistic regression models ran quickly: 851 ms for count and 293 ms for TF-IDF, so the TF-IDF model ran about three times faster. For a larger dataset, the TF-IDF model would be preferred for its shorter runtime.

Overall, the TF-IDF trained model outperformed the count model. 
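The comparison above can be checked with a short calculation. This sketch copies the confusion-matrix counts from the outputs in sections 3.1.2 and 3.2.2; the function and variable names are introduced here for illustration only.

```python
# Confusion-matrix counts copied from the outputs in sections 3.1.2 and 3.2.2
# Layout: (true positives, false negatives, false positives, true negatives)
confusion_counts = {
    "BOW count": (922, 116, 148, 914),
    "BOW TF-IDF": (935, 103, 142, 920),
}

def error_rates(tp, fn, fp, tn):
    """Return accuracy, false positive rate, and false negative rate."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    false_positive_rate = fp / (fp + tn)  # negatives predicted as positive
    false_negative_rate = fn / (fn + tp)  # positives predicted as negative
    return accuracy, false_positive_rate, false_negative_rate

rates = {name: error_rates(*counts) for name, counts in confusion_counts.items()}
for name, (acc, fpr, fnr) in rates.items():
    print(f"{name}: accuracy={acc:.4f}, FPR={fpr:.4f}, FNR={fnr:.4f}")
```

Working with rates rather than raw counts keeps the comparison valid even though the two classes have slightly different support (1038 positive and 1062 negative reviews).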

### Bag of Bi-Grams: Count vs. TF-IDF
The confusion matrices reported for the Bag of Bi-Grams count and TF-IDF models are identical. However, the evaluation cell in section 3.4.2 was run with `lr_bobg_count_predictions` rather than `lr_bobg_tfidf_predictions`, so the second matrix repeats the count results and does not measure the TF-IDF model. No conclusion about the effect of TF-IDF on the bi-gram features can be drawn until that cell is rerun with the TF-IDF predictions.

The logistic regression model trained on count features ran slower than the one trained on TF-IDF features (3.08 s versus 2.16 s). On runtime alone, the TF-IDF model would therefore be preferred over the count Bag of Bi-Grams model.


## 4.2) Feature Importance Using Random Forest<a class="anchor" id="4.2"></a>

In [49]:
# Import required libraries for random forest feature importance
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sns

### 4.2.1) Bag of Words - Count<a class="anchor" id="4.2.1"></a>

In [42]:
# Fit a random forest to estimate feature importance
bow_count_RF = RandomForestClassifier(n_estimators=100,random_state = 40)
bow_count_RF = bow_count_RF.fit(BOW_cv_train_features,train_sentiments)

In [51]:
feature_imp_bow_count = pd.Series(bow_count_RF.feature_importances_,index=bow_feature_names).sort_values(ascending=False)
print("The number of features are:",bow_count_RF.n_features_)
feature_imp_bow_count[:10]

The number of features are: 24959


bad       0.022325
wast      0.009658
great     0.009370
not       0.007198
love      0.006977
beauti    0.005598
bore      0.005547
excel     0.005399
aw        0.004676
movi      0.004617
dtype: float64

#### Identifying Unimportant Features

In [56]:
bow_feature_remove = []
for name in feature_imp_bow_count.index:
    if feature_imp_bow_count[name] == 0:
        bow_feature_remove.append(name)

print("Number of features to remove are:",len(bow_feature_remove))
print("\nDisplay the first ten features that will be deleted:")
bow_feature_remove[:10]

Number of features to remove are: 12955

Display the first ten features that will be deleted:


['aavjo',
 'gangstermovi',
 'aaja',
 'gam',
 'frutti',
 'abet',
 'entwistl',
 'gard',
 'frenchwoman',
 'enrol']

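### 4.2.2) Bag of Words - TF-IDF<a class="anchor" id="4.2.2"></a>

In [44]:
# Fit a separate random forest on the TF-IDF features. The outputs recorded in this section came
# from a run that refit bow_count_RF here, so the count and TF-IDF importances shown are identical.
bow_tfidf_RF = RandomForestClassifier(n_estimators=100,random_state = 40)
bow_tfidf_RF = bow_tfidf_RF.fit(BOW_tv_train_features,train_sentiments)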

In [52]:
feature_imp_bow_tfidf = pd.Series(bow_tfidf_RF.feature_importances_,index=bow_feature_names).sort_values(ascending=False)
print("The number of features are:",bow_tfidf_RF.n_features_)
feature_imp_bow_tfidf[:10]

The number of features are: 24959


bad       0.022325
wast      0.009658
great     0.009370
not       0.007198
love      0.006977
beauti    0.005598
bore      0.005547
excel     0.005399
aw        0.004676
movi      0.004617
dtype: float64

### 4.2.3) Bag of Bi-Grams - Count<a class="anchor" id="4.2.3"></a>

In [46]:
# Fit a random forest to estimate feature importance
bobg_count_RF = RandomForestClassifier(n_estimators=100,random_state = 40)
bobg_count_RF = bobg_count_RF.fit(BOBG_cv_train_features,train_sentiments)

In [54]:
feature_imp_bobg_count = pd.Series(bobg_count_RF.feature_importances_,index=BOBG_feature_names).sort_values(ascending=False)
print("The number of features are:",bobg_count_RF.n_features_)
feature_imp_bobg_count[:10]

The number of features are: 385586


bad movi     0.009489
not even     0.006314
wast time    0.005406
one bad      0.004259
bad film     0.003904
bad act      0.003404
not wast     0.003240
look like    0.002857
movi bad     0.002468
must see     0.002430
dtype: float64

### 4.2.4) Bag of Bi-Grams - TF-IDF<a class="anchor" id="4.2.4"></a>

In [48]:
# Fit a random forest to estimate feature importance
bobg_tfidf_RF = RandomForestClassifier(n_estimators=100,random_state = 40)
bobg_tfidf_RF = bobg_tfidf_RF.fit(BOBG_tv_tfidf_train_features ,train_sentiments)

In [55]:
feature_imp_bobg_tfidf = pd.Series(bobg_tfidf_RF.feature_importances_,index=BOBG_feature_names).sort_values(ascending=False)
print("The number of features are:",bobg_tfidf_RF.n_features_)
feature_imp_bobg_tfidf[:10]

The number of features are: 385586


bad movi            0.009668
not even            0.006154
wast time           0.006069
bad film            0.005183
one bad             0.004875
not wast            0.004077
look like           0.003653
movi bad            0.003521
bad act             0.003065
highli recommend    0.002675
dtype: float64

#### Identifying Unimportant Features

In [58]:
bobg_feature_remove = []
for name in feature_imp_bobg_count.index:
    if feature_imp_bobg_count[name] == 0:
        bobg_feature_remove.append(name)

print("Number of features to remove are:",len(bobg_feature_remove))
print("\nDisplay the first ten features that will be deleted:")
bobg_feature_remove[:10]

Number of features to remove are: 300998

Display the first ten features that will be deleted:


['franc dalen',
 'armi get',
 'franci matthew',
 'art master',
 'fortun get',
 'franci may',
 'franci drake',
 'fulci bava',
 'armi garrison',
 'frasier make']
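One way the zero-importance features might later be dropped is with a boolean column mask applied directly to the sparse feature matrix. The sketch below uses small hypothetical arrays in place of a fitted forest's `feature_importances_` and the vectorizer output; it is not part of the original analysis.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for feature_importances_ and the vectorizer vocabulary
toy_importances = np.array([0.6, 0.0, 0.4, 0.0])
toy_feature_names = np.array(["bad", "aavjo", "great", "gam"])

# Hypothetical sparse document-term matrix with one column per feature
toy_features = csr_matrix(np.array([[1, 0, 0, 2],
                                    [0, 1, 3, 0]]))

# Boolean mask over columns: True where the forest made use of the feature
keep_mask = toy_importances > 0

removed_names = toy_feature_names[~keep_mask].tolist()
kept_names = toy_feature_names[keep_mask].tolist()

# Column-slice the sparse matrix without converting it to a dense array
reduced_features = toy_features[:, np.flatnonzero(keep_mask)]

print(removed_names, kept_names, reduced_features.shape)
```

Slicing the sparse matrix avoids calling `toarray()`, which matters for the bi-gram matrices with 385586 columns.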

# 5. Conclusions<a class="anchor" id="5"></a>

In Milestone 2 of the project, four models were run, using Bag of Words (count and TF-IDF) and Bag of Bi-Grams (count and TF-IDF) for feature extraction. Logistic regression was chosen as the baseline model. Model performance was evaluated using the confusion matrix and runtime. The best-performing model used the features extracted with Bag of Words - TF-IDF; it had the best confusion matrix metrics and the shortest runtime. A random forest was used to identify important and unimportant features in the dataset. Exploration of these features is left for later milestones.

The next step is to tune the logistic regression model trained on Bag of Words TF-IDF features. The goal is a higher accuracy and an improvement across the confusion matrix, especially fewer false positives, which were the more frequent error in this model.
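A possible starting point for that tuning step is a grid search over the regularisation strength `C`. The sketch below runs on a hypothetical toy corpus, and the grid values, pipeline step names, and fold count are illustrative assumptions rather than choices made in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train_reviews / train_sentiments
toy_reviews = ["great movi love", "excel act great", "beauti film love", "must see excel",
               "bad movi wast", "bore film bad", "aw act wast", "bad bore aw"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline so each fold fits its own vocabulary
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=200)),
])

# Candidate regularisation strengths; the grid values are illustrative only
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_reviews, toy_labels)

best_c = search.best_params_["clf__C"]
print(best_c, search.best_score_)
```

Because the stated goal is fewer false positives, `scoring="precision"` could be tried in place of accuracy.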