# IMDB Movie Review Sentiment Analysis - Suman Malla - 200537767

##### In this project, we analyze a dataset of IMDB movie reviews, preprocess the text, and classify each review as positive or negative. The workflow covers text preprocessing, feature extraction with TF-IDF, and model training with Random Forest and Gradient Boosting classifiers, with hyperparameters tuned using GridSearchCV.

## Installing the necessary packages at once

In [9]:
!pip install dask dask-ml pandas scikit-learn nltk



## Importing the necessary libraries

## Loading Dataset

In [10]:
import dask.dataframe as dd
import pandas as pd

# Load the IMDB dataset using Dask
ddf = dd.read_csv('IMDB_dataset.csv')

# Display the first few rows of the dataset
ddf.head(15)

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
5,Phil the Alien is one of those quirky films wh...,negative
6,I saw this movie when I was about 12 when it c...,negative
7,So im not a big fan of Boll's work but then ag...,negative
8,This a fantastic movie of three prisoners who ...,positive
9,This movie made it into one of my top 10 most ...,negative


## 1. Preprocess Text Data (3 pts):
- ***Remove Punctuation: Strip punctuation marks to clean the text.***
- ***Tokenization: Break down the text into individual words or tokens.***
- ***Remove Stopwords: Eliminate common words that do not contribute much to the sentiment (e.g., "the," "is," "and").***
- ***Lemmatization/Stemming: Reduce words to their base or root form (e.g., "running" to "run").***

In [13]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the necessary tools for text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Replace HTML line-break tags with a space (the text is already lowercase)
    text = text.replace('<br />', ' ')
    
    # Remove punctuation
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])
    
    # Tokenization and remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    
    # Lemmatization
    words = [lemmatizer.lemmatize(word) for word in words]
    
    return ' '.join(words)

# Apply text preprocessing in parallel using Dask
ddf['processed_review'] = ddf['review'].map(preprocess_text, meta=('review', 'str'))

ddf.head(10)


Unnamed: 0,review,sentiment,processed_review
0,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
1,"Probably my all-time favorite movie, a story o...",positive,probably alltime favorite movie story selfless...
2,I sure would like to see a resurrection of a u...,positive,sure would like see resurrection dated seahunt...
3,"This show was an amazing, fresh & innovative i...",negative,show amazing fresh innovative idea 70 first ai...
4,Encouraged by the positive comments about this...,negative,encouraged positive comment film looking forwa...
5,Phil the Alien is one of those quirky films wh...,negative,phil alien one quirky film humour based around...
6,I saw this movie when I was about 12 when it c...,negative,saw movie 12 came recall scariest scene big bi...
7,So im not a big fan of Boll's work but then ag...,negative,im big fan boll work many enjoyed movie postal...
8,This a fantastic movie of three prisoners who ...,positive,fantastic movie three prisoner become famous o...
9,This movie made it into one of my top 10 most ...,negative,movie made one top 10 awful movie horrible was...
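
The `alltime` token in row 1 comes from the character filter in `preprocess_text`: punctuation is deleted rather than replaced with a space, so hyphenated words are fused into one token. The sketch below reproduces that behavior with the standard library only and contrasts it with one possible alternative. The function names `strip_punct_delete` and `strip_punct_space` are hypothetical and are not part of the notebook.

```python
import re

def strip_punct_delete(text: str) -> str:
    """Same filter as preprocess_text: drop every character that is
    neither alphanumeric nor whitespace."""
    return ''.join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

def strip_punct_space(text: str) -> str:
    """Alternative (an assumption, not the notebook's method): replace each
    run of characters that are neither word characters nor whitespace
    with a single space."""
    return re.sub(r'[^\w\s]+', ' ', text.lower()).strip()

sample = "Probably my all-time favorite movie"
print(strip_punct_delete(sample).split())  # 'all-time' is fused into one token
print(strip_punct_space(sample).split())   # 'all' and 'time' stay separate
```

Which variant is preferable depends on whether fused tokens such as `alltime` are useful features for the classifier; the notebook's results were produced with the deleting variant.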


## 2. TF-IDF Vectorization (2 pts):
The preprocessed text is converted into numerical features with TF-IDF (Term Frequency-Inverse Document Frequency), which weights each word by how often it appears in a review relative to how common it is across the entire dataset. The vocabulary is capped at the 5,000 most frequent terms (max_features=5000).
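
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The three documents and the helper `tfidf_vector` are made up for illustration and do not come from the IMDB data.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already "preprocessed" into space-separated tokens
docs = [
    "wonderful movie wonderful cast",
    "awful movie",
    "wonderful story",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for doc in tokenized:
    weights = tfidf_vector(doc)
    print({w: round(v, 3) for w, v in zip(vocab, weights) if v > 0})
```

In this toy corpus `awful` appears in one document and `movie` in two, so `awful` receives the larger IDF weight.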

In [17]:
# Convert the processed Dask DataFrame to a Pandas DataFrame
df = ddf.compute()

# Define the target variable
y = df['sentiment'].map({'positive': 1, 'negative': 0})

from sklearn.feature_extraction.text import TfidfVectorizer
# TFIDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['processed_review'])


In [18]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the reviews as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Random Forest
rf_params = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
rf_grid = GridSearchCV(RandomForestClassifier(), rf_params, cv=3, n_jobs=-1, verbose=1)
rf_grid.fit(X_train, y_train)

print(f'Best Parameters for Random Forest: {rf_grid.best_params_}')


Fitting 3 folds for each of 4 candidates, totalling 12 fits



Best Parameters for Random Forest: {'max_depth': 20, 'n_estimators': 200}
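
The "4 candidates, totalling 12 fits" line in the log follows directly from the size of the parameter grid. As a rough illustration, the sketch below enumerates the same grid with `itertools.product`; the names `param_grid`, `candidates`, and `total_fits` are introduced here only for the example.

```python
from itertools import product

# Same grid as rf_params in the cell above
param_grid = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
cv_folds = 3

# Each combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

# Every candidate is trained once per cross-validation fold
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, GridSearchCV then trains the winning candidate once more on the full training set, which the log line does not count. Adding a third value to either parameter would raise the count to 6 candidates and 18 fits, so grid size grows multiplicatively.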


In [20]:
from sklearn.ensemble import GradientBoostingClassifier

# Hyperparameter tuning for Gradient Boosting
gb_params = {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_params, cv=3, n_jobs=-1, verbose=1)
gb_grid.fit(X_train, y_train)

print(f'Best Parameters for Gradient Boosting: {gb_grid.best_params_}')


Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best Parameters for Gradient Boosting: {'learning_rate': 0.1, 'n_estimators': 200}


In [21]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate Random Forest
rf_best = rf_grid.best_estimator_
rf_pred = rf_best.predict(X_test)
print(f'Random Forest Accuracy: {accuracy_score(y_test, rf_pred)}')
print("Random Forest Classification Report")
print(classification_report(y_test, rf_pred))

# Evaluate Gradient Boosting
gb_best = gb_grid.best_estimator_
gb_pred = gb_best.predict(X_test)
print(f'Gradient Boosting Accuracy: {accuracy_score(y_test, gb_pred)}')
print("Gradient Boosting Classification Report")
print(classification_report(y_test, gb_pred))


Random Forest Accuracy: 0.8438
Random Forest Classification Report
              precision    recall  f1-score   support

           0       0.88      0.80      0.84      2527
           1       0.82      0.88      0.85      2473

    accuracy                           0.84      5000
   macro avg       0.85      0.84      0.84      5000
weighted avg       0.85      0.84      0.84      5000

Gradient Boosting Accuracy: 0.8336
Gradient Boosting Classification Report
              precision    recall  f1-score   support

           0       0.87      0.79      0.83      2527
           1       0.80      0.88      0.84      2473

    accuracy                           0.83      5000
   macro avg       0.84      0.83      0.83      5000
weighted avg       0.84      0.83      0.83      5000
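
The precision, recall, and f1-score columns in the reports above are all derived from counts of correct and incorrect predictions per class. The following sketch shows the arithmetic on a small made-up label set. The labels and the helper `binary_report` are hypothetical and are not the notebook's predictions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels (hypothetical): the model over-predicts the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]

for label in (0, 1):
    p, r, f = binary_report(y_true, y_pred, positive=label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}")
```

The toy data shows the same qualitative pattern as both reports: a model that leans toward predicting positive has higher precision but lower recall on class 0, and the reverse on class 1.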

