# Import Libraries and Data

In [1]:
import pandas as pd # dataframe/data cleaning/manipulation
import numpy as np # array computations
from matplotlib import pyplot as plt # plotting/graphing
import matplotlib.patches as mpatches
from sklearn.tree import plot_tree, export_text, DecisionTreeClassifier # Decision tree algorithm and plotting functions for the Decision tree
from sklearn.naive_bayes import MultinomialNB # Naive Bayes algorithm
from sklearn.model_selection import train_test_split, cross_val_score # train test split and cross validation accuracy functions
from sklearn.metrics import accuracy_score, roc_auc_score # Various model evaluation metrics
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2 # filter method functions
from sklearn.feature_selection import SequentialFeatureSelector as SFS, RFE # wrapper method function
from sklearn.feature_extraction.text import CountVectorizer # text data conversion function

Note: If you are using Google Colab, you must upload the `movie.csv` file from Canvas by doing the following:

* On the left-side bar, click the folder icon.
* Click the 'Upload to session storage' button.
* Upload the CSV file; it will appear below the 'sample_data' folder.

**Unfortunately, this process must be done every time the runtime is disconnected - just a quirk with Google Colab.**

If you are using Jupyter notebook, just make sure the CSV file is in the same folder location as this .ipynb file.

In [2]:
movie_df = pd.read_csv('movie.csv',index_col=0)

# Text Conversion

The dataset we are working with is a collection of 2000 movie reviews, and we are trying to predict, based on the words in each review, whether its overall sentiment is positive or negative. The first 5 rows of the dataset are shown below.

In [3]:
movie_df.head()

| Unnamed: 0 | text | @@class@@ |
|---|---|---|
| 0 | plot : two teen couples go to a church party ,... | neg |
| 1 | the happy bastard's quick movie review \ndamn ... | neg |
| 2 | it is movies like these that make a jaded movi... | neg |
| 3 | " quest for camelot " is warner bros . ' firs... | neg |
| 4 | synopsis : a mentally unstable man undergoing ... | neg |


Before going further, it's clear we need to do some data manipulation in order to be able to convert this unstructured text into a dataset suitable for training a predictive model.

While we already have an outcome/dependent variable assigned to each review in the dataset (whether the review was positive or negative), we need to generate a set of independent variables that capture the content of each review.

To do this, we will use the **"bag of words"** methodology, where we take each word from every review as a feature/independent variable in order to represent the contents of the movie reviews holistically. The result will be a structured `(X)` dataset with a very large number of columns and the same number of rows as the original dataset, where each row represents a movie review, and each column is one of the many words that appeared in the movie reviews.
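To make the idea concrete, here is a minimal hand-rolled sketch of a binary bag of words in plain Python. The three toy reviews are hypothetical and are not taken from `movie.csv`; splitting on whitespace is a simplifying assumption.

```python
# Toy reviews (hypothetical, not taken from movie.csv)
toy_reviews = ["great plot great acting", "bad plot", "great fun"]

# Vocabulary: every unique word across all reviews, sorted alphabetically
vocabulary = sorted({word for review in toy_reviews for word in review.split()})

# One row per review, one column per vocabulary word (1 = present, 0 = absent)
toy_bag = [[int(word in review.split()) for word in vocabulary]
           for review in toy_reviews]

print(vocabulary)
for row in toy_bag:
    print(row)
```

Each row has one entry per vocabulary word, so the number of columns grows with the vocabulary while the number of rows stays equal to the number of reviews.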

First, let's get the text of every movie review and store it as `text_data`. We can achieve this by using the `.tolist()` method from Pandas on the text column from `movie_df`. The result will be a flat Python list of strings, where each string holds the full text of one movie review.

**Note:** `text_data` will be too large to preview if we call the object.

In [4]:
text_data = movie_df['text'].tolist()
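As a quick illustration of what `.tolist()` returns, the sketch below uses a hypothetical two-row DataFrame (the name `toy_df` and its contents are made up for this example). Slicing is one way to peek at a large list without printing all of it.

```python
import pandas as pd

# Hypothetical stand-in for movie_df with the same kind of 'text' column
toy_df = pd.DataFrame({'text': ['a fine film', 'a dull film'],
                       'class': ['pos', 'neg']})

# One Python string per row, in row order
toy_text = toy_df['text'].tolist()

print(type(toy_text), len(toy_text))
# Preview only the first review, truncated to its first 6 characters
print(toy_text[0][:6])
```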

Now, to create a structured binary dataset that we can use for training, we use the [CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from Scikit-Learn, which converts our list of review texts into a matrix of token counts (stored as a SciPy sparse matrix rather than a dense Numpy array).

There are two hyperparameters of interest we will set:

- `binary = True` - This will tell the count vectorizer to check whether a particular word (token) is present in the document (1 if present, 0 if not) instead of counting the frequency of each word.
- `max_features = 1000` - This will limit the number of features (unique words) to the top 1000, ranked by how often each unique word shows up across the entire list of words.

Once we've configured the settings for our count vectorizer, we simply call `.fit()` on our text data so that the count vectorizer analyzes it and builds a vocabulary (bag) of words!

In [5]:
vectorizer = CountVectorizer(binary = True, max_features = 1000).fit(text_data)

Like any model object in Scikit-Learn, printing `vectorizer` will only show the settings used for the model. The code below [transforms](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform) our `text_data` into our desired bag-of-words representation, where each review is transformed into a vector.

**Note:** `bag_of_words` will also be too large to preview if we call the object.

In [6]:
bag_of_words = vectorizer.transform(text_data)

Each vector in `bag_of_words` has a length of 1000 due to `max_features = 1000`, and each element in the vector corresponds to a word in the vocabulary. The elements in the vector are binary (1 or 0) indicating the presence or absence of the word in each movie review. This is better visualized with the code below.

In [7]:
print(bag_of_words.toarray())
print('\n Our bag of words object has', bag_of_words.shape[0], 'rows, each representing a movie review, and', bag_of_words.shape[1], 'columns.')

[[1 0 0 ... 1 0 1]
 [0 0 0 ... 1 0 1]
 [0 0 0 ... 1 1 0]
 ...
 [1 1 0 ... 1 1 1]
 [1 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 1]]

 Our bag of words object has 2000 rows, each representing a movie review, and 1000 columns.
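The matrix returned by the vectorizer is a SciPy sparse matrix, which is why `.toarray()` is needed to print it as a regular array. A minimal sketch on a hypothetical toy corpus shows how to check this and how to measure the share of non-zero cells (the `density` name is just for illustration):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, standing in for text_data
toy_docs = ["good good movie", "bad movie", "good fun"]
toy_matrix = CountVectorizer(binary=True).fit_transform(toy_docs)

n_rows, n_cols = toy_matrix.shape
# nnz counts the stored non-zero entries; density is their share of all cells
density = toy_matrix.nnz / (n_rows * n_cols)

print(issparse(toy_matrix))
print(toy_matrix.shape, toy_matrix.nnz, density)
```

The same three lines (`shape`, `nnz`, and the ratio) can be applied to `bag_of_words` to see how sparse the real dataset is.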


We can also see the 1000 words selected by our count vectorizer with the code below.

In [8]:
vectorizer.get_feature_names_out()

array(['10', 'ability', 'able', 'about', 'above', 'absolutely', 'across',
       'act', 'acting', 'action', 'actor', 'actors', 'actress', 'actual',
       'actually', 'add', 'after', 'again', 'against', 'age', 'agent',
       'ago', 'air', 'alien', 'all', 'almost', 'alone', 'along',
       'already', 'also', 'although', 'always', 'am', 'amazing',
       'america', 'american', 'among', 'amount', 'amusing', 'an', 'and',
       'annoying', 'another', 'any', 'anyone', 'anything', 'anyway',
       'apparently', 'appear', 'appearance', 'appears', 'are', 'aren',
       'around', 'art', 'as', 'ask', 'asks', 'aspect', 'at', 'atmosphere',
       'attempt', 'attempts', 'attention', 'audience', 'audiences',
       'away', 'awful', 'back', 'background', 'bad', 'based', 'basically',
       'battle', 'be', 'beautiful', 'because', 'become', 'becomes',
       'been', 'before', 'begin', 'beginning', 'begins', 'behind',
       'being', 'believable', 'believe', 'best', 'better', 'between',
       'beyond'

# Create and Evaluate Model - Decision Tree

Now that we have created a structured `X` dataset for training, we are ready to construct a classification model to start making predictions! To begin, we can define our features and target variable from our dataset and split them into training and testing sets.

In [9]:
X = bag_of_words
y = movie_df.iloc[:,1] # class variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33)
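Because `train_test_split` shuffles the rows randomly, scores computed from a split will vary somewhat from run to run unless a seed is fixed. The sketch below, on hypothetical toy arrays, shows that passing the same `random_state` gives the same split twice; `random_state = 0` and `stratify` are illustrative choices, not settings used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, balanced classes
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array(['neg'] * 5 + ['pos'] * 5)

# Same seed, same split; stratify keeps the class balance similar in both parts
split_a = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```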

To start, let's create a decision tree model using the settings we have been using previously for the British Bank Dataset and evaluate the model based on 10-Fold Cross Validation ROC AUC % and Test Classification Accuracy.

In [10]:
model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, random_state = 3).fit(X_train, y_train)

score = cross_val_score(model, X_train, y_train, cv = 10, scoring = 'roc_auc').mean()
print('Decision Tree 10-Fold Cross Validation ROC AUC:', round(score*100,2),'%')

print('Decision Tree Test Classification Accuracy:', round(float(accuracy_score(model.predict(X_test), y_test)) * 100,2),'%')

Decision Tree 10-Fold Cross Validation ROC AUC: 67.44 %
Decision Tree Test Classification Accuracy: 60.15 %


The above results are definitely less than desirable; decision trees are typically not the best choice for modeling text data, especially when represented as a binary bag-of-words. There are a few reasons why:

1. **High Dimensionality:** Text data, especially when transformed into a bag-of-words model, tends to be high-dimensional (i.e., it has many features). Even with a limit of 1000 words, each data point is represented as a vector of 1000 dimensions. Decision trees are not very efficient with high-dimensional data and can lead to overfitting.

2. **Sparse Data:** Bag-of-words models are typically sparse, meaning that most of the features (words) will have a value of 0 (absent) for most documents. Decision trees, which rely on partitioning data based on feature values, can struggle with sparse data. They might end up creating very complex trees that do not generalize well.

3. **Binary Features:** In our specific case, the bag-of-words model is binary. This can limit the effectiveness of decision trees, as they just split on whether a word is present or not, without considering the frequency or weight of the words.

4. **Computational Complexity:** For large datasets, decision trees can become computationally expensive to train. The complexity increases with the number of features and depth of the tree. With high-dimensional data, trees can grow deep and large, leading to increased computational costs and potentially poorer performance.

Moreover, while decision trees can show how words interact (through "if" statements), they can become too fixated on the specific reviews they were trained on, which is another way of saying they overfit.
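The "if" statements a tree learns can be printed with `export_text`. The following is a minimal sketch on hypothetical toy data with two made-up word columns, where the label depends only on whether `bad` is present:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical binary bag of words: columns are the words 'bad' and 'great'
toy_words = ['bad', 'great']
toy_X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
toy_y = np.array(['neg', 'neg', 'pos', 'pos', 'neg', 'pos'])

toy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2,
                                  random_state=3).fit(toy_X, toy_y)

# Text rendering of the learned rules, one line per split or leaf
rules = export_text(toy_tree, feature_names=toy_words)
print(rules)
```

On the real dataset, `vectorizer.get_feature_names_out()` could be converted to a list and passed as `feature_names` to see which words the tree splits on.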

Fortunately, as discussed in class, there is another model type that is able to handle text data more effectively, and is discussed next!

# Create and Evaluate Model - Naive Bayes

The Naive Bayes classifier is a simple yet powerful tool for text classification. The basic principle, in context, is that if a movie review contains words commonly seen in negative reviews and rarely in positive reviews, the review is more likely to be negative (and vice versa).

Essentially, Naive Bayes helps us decide whether a review is likely positive or negative based on the words it contains and conditional probabilities.

**What makes the method "Naive" is that the classifier assumes that the presence of each word is independent of the others, given the review's class.** As a result, Naive Bayes does not capture the way some words work together to change the overall meaning of a statement, which can be important (especially if we don't have many words to work with).

However, the naive assumption simplifies the required calculations and generally produces much better results for large text data than any of the other classification models covered in this class!



In Scikit-Learn, we use the [MultinomialNB()](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) class to create and train a Naive Bayes model!

Compared to other models, `MultinomialNB` has very few parameters, and only two of them are within the scope of this class. You can experiment with:

- `alpha` - Additive smoothing value.
- `force_alpha` - If set to `False` and `alpha` is less than 1e-10, `alpha` is set to 1e-10. If `True`, `alpha` remains unchanged.
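To see what `alpha` does, the sketch below computes the Naive Bayes probabilities by hand on a hypothetical toy dataset and compares them with `MultinomialNB`. The toy matrix, the three unnamed word columns, and the `new_review` vector are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical binary bag of words: 4 reviews, 3 words
toy_X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
toy_y = np.array(['neg', 'neg', 'pos', 'pos'])
alpha = 1.0  # additive (Laplace) smoothing

classes = np.unique(toy_y)  # sorted: ['neg', 'pos']
log_prior = np.log(np.array([(toy_y == c).mean() for c in classes]))

# Per-class word counts, with alpha added so no word gets probability zero
word_counts = np.array([toy_X[toy_y == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score a new review that contains only the first word
new_review = np.array([[1, 0, 0]])
joint = new_review @ log_likelihood.T + log_prior
manual_proba = np.exp(joint - joint.max())
manual_proba /= manual_proba.sum()

sk_proba = MultinomialNB(alpha=alpha).fit(toy_X, toy_y).predict_proba(new_review)
print(manual_proba, sk_proba)
```

Without smoothing, the first word would have a count of 0 in the `pos` class, which would force the `pos` probability to exactly zero; adding `alpha` avoids that.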



In [15]:
model = MultinomialNB().fit(X_train, y_train)

score = cross_val_score(model, X_train, y_train, cv = 10, scoring = 'roc_auc').mean()
print('Naive Bayes 10-Fold Cross Validation ROC AUC:', round(score*100,2),'%')

print('Naive Bayes Test Classification Accuracy:', round(float(accuracy_score(model.predict(X_test), y_test)) * 100,2),'%')

Naive Bayes 10-Fold Cross Validation ROC AUC: 89.31 %
Naive Bayes Test Classification Accuracy: 79.85 %


As seen above, our Naive Bayes model performed much better than our Decision Tree model. Try experimenting with the Naive Bayes parameters to see if you can get better results than those shown above!

Alternatively, other models that you could look into for text data are [Support Vector Machines](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) in Scikit-Learn or (Convolutional) Neural Networks through Tensorflow/Keras.

# Applying Feature Selection to Naive Bayes

We can apply the feature selection techniques discussed in class and in the Feature_Selection_Final.ipynb file on Canvas to Naive Bayes!

## Filter Method

Suppose we want to narrow our selection further, from the 1000 most frequent words chosen during text conversion down to the top 250 words.

Using the `SelectKBest()` class to automatically select the top 250 words from our `X` dataset, we can compare the results between scoring based on mutual information (through `mutual_info_classif`) and the chi-square test (through `chi2`).

**Note:** `fit_transform()` is a method that runs `fit()` and `transform()` in a single call. In this case, we use it to calculate a score for each feature (word) in `X` with respect to our target variable (`y`) and then, based on those scores, keep only the 250 words from `X` with the highest mutual information or chi-square scores.
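A small sketch on hypothetical toy data shows that `fit_transform()` gives the same result as calling `fit()` and then `transform()`, and how `get_support()` reveals which words were kept. The word names and `k = 2` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical binary bag of words; 'awful' and 'great' track the class,
# while 'movie' and 'the' appear equally often in both classes
toy_words = np.array(['awful', 'great', 'movie', 'the'])
toy_X = np.array([[1, 0, 1, 1],
                  [1, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 1],
                  [0, 1, 1, 1],
                  [0, 1, 0, 1]])
toy_y = np.array(['neg', 'neg', 'neg', 'pos', 'pos', 'pos'])

selector = SelectKBest(chi2, k=2)
one_step = selector.fit_transform(toy_X, toy_y)
two_step = SelectKBest(chi2, k=2).fit(toy_X, toy_y).transform(toy_X)

# Boolean mask over the original columns, True for the selected ones
kept = toy_words[selector.get_support()]
print(np.array_equal(one_step, two_step), kept)
```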

In [16]:
# Mutual Information

X_new = SelectKBest(mutual_info_classif, k = 250).fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.33)

model = MultinomialNB().fit(X_train, y_train)

score = cross_val_score(model, X_train, y_train, cv = 10, scoring = 'roc_auc').mean()
print('Mutual Information Naive Bayes 10-Fold Cross Validation ROC AUC:', round(score*100,2),'%')

print('Mutual Information Naive Bayes Test Classification Accuracy:', round(float(accuracy_score(model.predict(X_test), y_test)) * 100,2),'%')

Mutual Information Naive Bayes 10-Fold Cross Validation ROC AUC: 91.83 %
Mutual Information Naive Bayes Test Classification Accuracy: 83.64 %


In [17]:
# Chi-Square Test

X_new = SelectKBest(chi2, k = 250).fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.33)

model = MultinomialNB().fit(X_train, y_train)

score = cross_val_score(model, X_train, y_train, cv = 10, scoring = 'roc_auc').mean()
print('Chi-Square Naive Bayes 10-Fold Cross Validation ROC AUC:', round(score*100,2),'%')

print('Chi-Square Naive Bayes Test Classification Accuracy:', round(float(accuracy_score(model.predict(X_test), y_test)) * 100,2),'%')

Chi-Square Naive Bayes 10-Fold Cross Validation ROC AUC: 91.46 %
Chi-Square Naive Bayes Test Classification Accuracy: 85.91 %


## Wrapper Method

We can apply the wrapper method using the `SFS()` class and use forward selection to greedily build a subset of 250 words from the 1000 most frequent words we originally selected during text conversion.

**Note:** This will take a SIGNIFICANT amount of time to run **(we strongly recommend not running this on Google Colab)** because, unlike the datasets we have worked with previously, we have 1000 features to consider, as opposed to ~54 in the employee attrition dataset and ~13 in the British Bank dataset. The `n_jobs = -1` parameter tells SFS to use all processors on the machine to parallelize and speed up the computation.
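A rough count of model fits gives a sense of the cost. The sketch below assumes that forward selection scores every remaining candidate word with 10-fold cross validation at each step, and it ignores the time per fit, so it is an estimate rather than a measurement.

```python
n_total = 1000   # candidate words
n_select = 250   # words to keep
n_folds = 10     # cv = 10

# At step i (0-based), i words are already chosen, so n_total - i candidates remain
candidates_tried = sum(n_total - i for i in range(n_select))

# Each candidate is evaluated once per fold
total_fits = candidates_tried * n_folds

print(candidates_tried, total_fits)
```

Even though each Naive Bayes fit is fast, over two million of them add up, which is why parallelizing with `n_jobs = -1` helps.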

In [14]:
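# Pass a fresh, unfitted Naive Bayes estimator so the selection does not depend on whichever `model` was defined last
sfs = SFS(MultinomialNB(), n_features_to_select=250, direction='forward', scoring='roc_auc', cv=10, n_jobs = -1).fit(X,y)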

X_new = sfs.transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.33)

model = MultinomialNB().fit(X_train, y_train)

score = cross_val_score(model, X_train, y_train, cv = 10, scoring = 'roc_auc').mean()
print('Wrapper-Selected Naive Bayes 10-Fold Cross Validation ROC AUC:', round(score*100,2),'%')

print('Wrapper-Selected Naive Bayes Test Classification Accuracy:', round(float(accuracy_score(model.predict(X_test), y_test)) * 100,2),'%')

Wrapper-Selected Naive Bayes 10-Fold Cross Validation ROC AUC: 93.89 %
Wrapper-Selected Naive Bayes Test Classification Accuracy: 86.36 %
