# IMDB Movie Reviews

In this lab we will work through a sentiment analysis task using the IMDB Movie Review Dataset. The dataset consists of 50K movie reviews, each labeled as either positive or negative according to its sentiment toward a film. We will use a simple Logistic Regression model for binary classification. This notebook guides you through the four main steps of the task:

1. Text Preprocessing
2. Feature Engineering
3. Model Fitting
4. Hyper-Parameter Tuning

## 1. Text Preprocessing

The first step in this process is loading the data. The data can be found [here](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download). Download the data from this page, which produces a zip archive. Unzipping the archive produces a file named "IMDB Dataset.csv". Create a folder to hold data used in this course and place the "IMDB Dataset.csv" file in it. Next, make sure that every package in the imports below is installed. The "nltk.download()" lines only need to be run once, so comment them out with a # at the beginning of the line after the downloads have completed.

In [1]:
import os
import re

import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from tqdm import tqdm

[nltk_data] Downloading package punkt to /home/x-dchawra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/x-dchawra/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/x-dchawra/nltk_data...


In the following cell, set the **data_path** variable equal to the path of the "IMDB Dataset.csv" file.

In [4]:
data_path = "/home/x-dchawra/nlpexperiments/battelle_example/IMDB Dataset.csv.zip"


### Normalization and Cleaning

In the next two cells, the data will be loaded, cleaned, and normalized. This will include:

1. Removing HTML chunks
2. Transforming text to lower case
3. Tokenizing text
4. Removing Stop Words and punctuation
5. Lemmatizing Words

In [5]:
data = pd.read_csv(data_path)
data.columns = ['text', 'label']
data['label'] = data.label.map({'negative': 0, 'positive': 1})

wnl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punct_remover = str.maketrans('', '', punctuation)

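def preprocess_text(text):
    # Replace HTML tags such as <br /> with a space so adjacent words do not merge
    text = re.sub('<.*?>', ' ', text)
    # Lower-case, then split into runs of word characters and runs of punctuation
    tokens = wordpunct_tokenize(text.lower())
    # Strip punctuation characters; pure-punctuation tokens become empty strings
    tokens = [i.translate(punct_remover) for i in tokens]
    # Drop the empty strings and the stop words
    tokens = [i for i in tokens if i and i not in stop_words]
    # Reduce each remaining token to its WordNet lemma (treated as a noun by default)
    tokens = [wnl.lemmatize(i) for i in tokens]
    text = " ".join(tokens)

    return text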

In [6]:
text_data = []
for i in tqdm(data.text):
    text_data.append(preprocess_text(i))
data['text'] = text_data

100%|██████████| 50000/50000 [00:26<00:00, 1878.94it/s]
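
To see what the cleaning steps do to a single review, here is a small dependency-free sketch. It is an illustration under assumptions, not the notebook's pipeline: `TOY_STOP_WORDS` is a hypothetical mini stop-word list standing in for NLTK's English list, the regular expression approximates `wordpunct_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import re
from string import punctuation

# Hypothetical mini stop-word list for illustration; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "was", "a", "but", "it", "i", "and"}
PUNCT_REMOVER = str.maketrans("", "", punctuation)

def toy_preprocess(text):
    """Apply steps 1-4: HTML removal, lower-casing, tokenizing, filtering."""
    text = re.sub(r"<.*?>", " ", text)            # 1. replace HTML tags with a space
    text = text.lower()                           # 2. lower-case
    tokens = re.findall(r"\w+|[^\w\s]+", text)    # 3. word runs and punctuation runs
    tokens = [t.translate(PUNCT_REMOVER) for t in tokens]
    # 4. drop emptied punctuation tokens and stop words
    return [t for t in tokens if t and t not in TOY_STOP_WORDS]

sample = "The plot was thin,<br /><br />but I LOVED it!"
print(toy_preprocess(sample))
```

Only the content-bearing words survive, which is what the Bag-of-Words features in the next section are built from.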


## 2. Feature Engineering

For features, we will use a Bag-of-Words model: each review is represented as a vector that contains the count of each term from the corpus that appears in the review. We can alter the features that are generated by setting the size, or range of sizes, of the ngrams used to build the vector. An ngram is a sequence of n consecutive words that is treated as a single token. For instance, in a bigram (2-gram) model, the vector for each review consists of the counts of each two-word phrase. It is also possible to use a mixed-gram model, which combines ngrams of several sizes.
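
As a rough illustration of how ngram counts are formed, the sketch below counts ngrams in one toy review using only the standard library. The helper `ngram_counts` and its parameters are hypothetical names chosen for this example; CountVectorizer does the equivalent work across the whole corpus and also builds the shared vocabulary.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=1):
    """Count every ngram of size n_min..n_max in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the review
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

review = "great movie great cast"
print(ngram_counts(review, 1, 1))  # unigrams
print(ngram_counts(review, 2, 2))  # bigrams
print(ngram_counts(review, 1, 2))  # mixed: unigrams and bigrams together
```

Larger ngrams capture more context, but the number of distinct ngrams grows quickly, which is why the vocabulary size is worth watching when you change the range.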

**Experiment:** Try different sizes and ranges of ngrams and different thresholds for minimum and maximum frequency for feature generation. Observe and note the effects on training time and performance on the test set. The exact variables in question are ngram_min, ngram_max, min_df, and max_df. Please refer to the documentation of CountVectorizer [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer) to determine acceptable values. In short, the 4 values that can be tuned are:

* ngram_min - Minimum size of ngrams to use. Acceptable values are integers greater than or equal to 1
* ngram_max - Maximum size of ngrams to use. Acceptable values are integers greater than or equal to 1
* min_df - Minimum document frequency for a term to be included in the model. Acceptable values are integers greater than or equal to 1 to denote a raw count of documents, or float values in [0.0, 1.0] to denote a fraction of documents
* max_df - Maximum document frequency for a term to be included in the model. Same acceptable values as min_df
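
The difference between integer and float thresholds is easy to miss, so here is a minimal sketch of document-frequency filtering on a toy corpus. The function `filter_by_document_frequency` is a hypothetical helper written for this example; it is meant to mirror how integer and float values are interpreted, not to reproduce CountVectorizer's implementation.

```python
def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies in [min_df, max_df].

    Integers are absolute document counts; floats are fractions of the corpus.
    """
    n_docs = len(docs)
    # Convert fractional thresholds to absolute document counts
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs

    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # set(): count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1

    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

docs = ["good film", "good cast", "bad film", "good plot"]
print(filter_by_document_frequency(docs, min_df=2))              # in at least 2 documents
print(filter_by_document_frequency(docs, min_df=1, max_df=0.5))  # in at most 50% of documents
```

Note that `max_df=1` (an integer) would keep only terms found in a single document, while `max_df=1.0` (a float) keeps terms found in up to 100% of documents.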

In [31]:
ngram_min = 4
ngram_max = 4
min_df = 0.0  # 0.0 keeps every ngram; raise it to prune rare ngrams
max_df = 1.0

train_df, test_df = train_test_split(data, test_size=0.2)
# Pass the variables above so that changing them actually changes the features
vectorizer = CountVectorizer(ngram_range=(ngram_min, ngram_max), min_df=min_df, max_df=max_df)
x_train = vectorizer.fit_transform(train_df.text)
x_test = vectorizer.transform(test_df.text)

In [32]:
print(x_train[0])

  (0, 1568310)	1
  (0, 2643426)	1
  (0, 3724563)	1
  (0, 1519218)	1
  (0, 3133568)	1
  (0, 2434180)	1
  (0, 3962110)	1
  (0, 4539252)	1
  (0, 108675)	1
  (0, 732034)	1
  (0, 2634185)	1
  (0, 3190483)	1
  (0, 3106058)	1
  (0, 2871070)	1
  (0, 4498894)	1
  (0, 4009322)	1
  (0, 837501)	1
  (0, 4195180)	1
  (0, 1559155)	1
  (0, 570856)	1
  (0, 1944949)	1
  (0, 962040)	1
  (0, 1548798)	1
  (0, 3811383)	1
  (0, 4219951)	1
  :	:
  (0, 2235001)	1
  (0, 2098097)	1
  (0, 3550170)	1
  (0, 4186725)	1
  (0, 837473)	1
  (0, 3159563)	1
  (0, 1550249)	1
  (0, 3263911)	1
  (0, 3422386)	1
  (0, 2626088)	1
  (0, 2692626)	1
  (0, 106896)	1
  (0, 4388265)	1
  (0, 2307969)	1
  (0, 4036417)	1
  (0, 4488578)	1
  (0, 1676162)	1
  (0, 1469673)	1
  (0, 487034)	1
  (0, 3804963)	1
  (0, 842321)	1
  (0, 2650982)	1
  (0, 4135561)	1
  (0, 3954804)	1
  (0, 2193423)	1


In [33]:
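import numpy as np

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
# Sum each column directly on the sparse matrix instead of building a dense copy
term_counts = np.asarray(x_train.sum(axis=0)).ravel().tolist()
term_counts = sorted(zip(terms, term_counts), key=lambda x: x[1])
term_counts.reverse()
num_terms = len(terms)

print(f"Number of Unique Tokens: {num_terms}")
print("Most Common Terms:")
print("-" * len("Most Common Terms:"))
for i, j in term_counts[:10]:
    print(f"{i} - {j}")
print()
print("#"*20)
print()

least_common = term_counts[-10:]
least_common.reverse()

print("Least Common Terms:")
print("-" * len("Least Common Terms:"))
for i, j in least_common:
    print(f"{i} - {j}")

Note that the counts are summed directly on the sparse matrix. Calling x_train.toarray() first would try to build a dense copy of the entire training matrix, which fails at this vocabulary size:

MemoryError: Unable to allocate 1.32 TiB for an array with shape (40000, 4552209) and data type int64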
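
The 1.32 TiB figure in the error message above follows directly from the array shape: a dense array stores one 8-byte int64 value for every (review, term) pair, including all the zeros. A quick check of the arithmetic:

```python
rows, cols = 40_000, 4_552_209   # shape reported in the error message
bytes_per_value = 8              # one int64 value

dense_bytes = rows * cols * bytes_per_value
dense_tib = dense_bytes / 2**40  # 1 TiB = 2**40 bytes
print(f"Dense array: {dense_tib:.2f} TiB")
```

A sparse matrix stores only the non-zero entries, and each review contains a tiny fraction of the 4.5 million 4-grams, so the same data fits comfortably in memory as long as it stays sparse.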

## 3. Model Fitting

In this section, we will be fitting the Logistic Regression model. Be aware that the number of tokens used in the Bag-of-Words will determine the size of the model. Including too many features can make the model run for several minutes.
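
To make the link between vocabulary size and model size concrete, the sketch below shows how a fitted logistic regression scores one review: it holds one coefficient per vocabulary term plus an intercept. The weights, intercept, and review here are made-up values for illustration only, and `predict_positive_probability` is a hypothetical helper, not part of scikit-learn.

```python
import math

def predict_positive_probability(features, weights, bias):
    """Sigmoid of the weighted sum over the non-zero features of one review."""
    # features maps a vocabulary column index to that term's count in the review
    z = bias + sum(weights[idx] * count for idx, count in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 4-term vocabulary with made-up coefficients
weights = [1.2, -0.8, 0.0, 0.5]
bias = -0.1
review = {0: 2, 1: 1}  # term 0 appears twice, term 1 once

p = predict_positive_probability(review, weights, bias)
print(f"P(positive) = {p:.3f}")
print(f"Parameters to learn: {len(weights) + 1}")
```

With millions of ngrams in the vocabulary, the model has millions of coefficients to estimate, which is where the extra training time comes from.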

In [9]:
print(f"Number of Logical Processors: {os.cpu_count()}")

Number of Logical Processors: 128


Above you will see the number of logical processors your computer can use for parallel processing. In general, the more processors a model uses, the faster it trains, at the cost of more memory. Using every processor is my recommendation. The **num_processors** variable below is set to -1, which asks scikit-learn to use every processor. Be aware that scikit-learn ignores n_jobs when the solver is 'liblinear', as it is in the next cell, so this setting only takes effect if you switch to a different solver. If your model still takes more than ~5 minutes to train, then you have likely included too many tokens in your model.

In [10]:
num_processors = -1

model = LogisticRegression(max_iter=2000, n_jobs=num_processors, solver='liblinear')
# Fit on the sparse matrix directly; a dense copy via toarray() can exhaust memory
model.fit(x_train, train_df.label.values)



LogisticRegression(max_iter=2000, n_jobs=-1, solver='liblinear')

The cell below reports the accuracy of the model on a held-out test set. The maximum score is 1.0, and 0.5 is equivalent to a random guess.

In [11]:
test_acc = model.score(x_test, test_df.label.values)
print(f"Test Set Accuracy: {test_acc}")

Test Set Accuracy: 0.7758
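
For a classifier, `score` returns the fraction of test reviews whose predicted label matches the true label. The sketch below computes the same quantity by hand on made-up labels, and shows why 0.5 is the baseline when the two classes are equally common.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration (1 = positive, 0 = negative), balanced 4 vs 4
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))              # 6 of 8 predictions are correct
print(accuracy(y_true, [1] * len(y_true)))   # always guessing "positive"
```

Always predicting one class scores exactly 0.5 on balanced labels, so a useful model needs to land clearly above that.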


### Notes

In the next Markdown cell, add notes on your observations. What is the effect of the number of tokens included? How did it impact performance? What size of ngrams and number of tokens produced the best model? Simply using frequency might not be the best method for selecting features. How might you do it differently?

Add notes here

## 4. Hyper-Parameter Tuning

Run the next two cells to perform hyper-parameter tuning and see the results of the best hyper-parameters on the test set. It is not uncommon for hyper-parameter tuning to take several minutes. If it takes longer than ~10 minutes, you may need to interrupt the cell or restart the kernel and try again with a different ngram size.

In [12]:
import numpy as np

# The liblinear solver supports only the l1 and l2 penalties
parameters = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

log_model = LogisticRegression(max_iter=2000, solver='liblinear')
grid = GridSearchCV(log_model, parameters, n_jobs=-1)
grid.fit(x_train, train_df.label.values)


GridSearchCV(estimator=LogisticRegression(max_iter=2000, solver='liblinear'),
             n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']})
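
To get a feel for why tuning takes a while, the sketch below enumerates the combinations in the grid shown in the output above and counts the model fits. It assumes 5-fold cross-validation, which is scikit-learn's default when `cv` is not passed to GridSearchCV; the variable names are chosen for this example only.

```python
from itertools import product

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
n_folds = 5  # assumption: the default number of folds when cv is not passed

# Every combination of one value per hyper-parameter is a candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(f"Candidates: {len(candidates)}")
print(f"Cross-validation fits: {len(candidates) * n_folds}")
print(candidates[:3])
```

Each value you add to the grid multiplies the number of fits, so it pays to widen the search gradually rather than all at once.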

In [13]:
best_model = grid.best_estimator_
print("Best_Parameters:")
for k, v in grid.best_params_.items():
    print(f"{k} = {v}")
print()
print("#" * 20)
print()

best_test_acc = best_model.score(x_test, test_df.label.values)
print(f"New Test Accuracy: {best_test_acc}")

Best_Parameters:
C = 1
penalty = l1

####################

New Test Accuracy: 0.7776


**Experiment:** Test different variables and values for hyper-parameter tuning to see the best results that you can generate. Provide notes in the next cell for your findings in testing with hyper-parameters.

Add notes here