# **Task 2 - Sentiment Classification**

For Task 2 of this exercise, we perform *sentiment classification* using traditional ML models (Multinomial Naive Bayes, SVM, and Logistic Regression) and an LSTM, and then compare their performance.

This notebook has the following objectives:
- Clean and preprocess the dataset
    - Normalize the text
    - Remove stopwords
    - Tokenize the text
- Apply TF-IDF to weight terms by their importance
- Train MNB, SVM, Logistic Regression, and LSTM models
- Evaluate the models and plot their results
    - Accuracy
    - F1-Score
    - Recall
    - Precision
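
As a reference for the four metrics listed above, here is a minimal sketch that derives them from true/false positive and negative counts in plain Python. The helper name `binary_metrics` and the toy labels are illustrative assumptions, not part of the notebook; in practice `sklearn.metrics` provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up for illustration)
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 3/5, while precision, recall, and F1 are all 2/3, because there are two true positives, one false positive, and one false negative.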

## Objective 1: Clean and Preprocess the Dataset

In [1]:
import pandas as pd
df = pd.read_csv('./data/task2_dataset/IMDB-Dataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


As we can see, some of the reviews contain HTML tags such as `<br />`, which we strip during preprocessing.
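
To get a feel for how common these tags are before cleaning, a small regex check can help. The sketch below uses two made-up review strings (not rows from the dataset) and the same `<[^>]+>` pattern that `preprocess_text` relies on further down; `TAG_PATTERN` and `sample_reviews` are hypothetical names.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <p>
TAG_PATTERN = re.compile(r"<[^>]+>")

# Hypothetical examples, written for illustration only
sample_reviews = [
    "A lovely film. <br /><br />The pacing is great.",
    "Plain text review without any markup.",
]

# Count the tags in each review, then replace every tag with a space
tag_counts = [len(TAG_PATTERN.findall(review)) for review in sample_reviews]
stripped = [TAG_PATTERN.sub(" ", review) for review in sample_reviews]

print(tag_counts)
print(stripped)
```

On the full DataFrame, the same pattern could be applied with `df['review'].str.contains(r'<[^>]+>')` to count affected rows.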

In [3]:
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
# Text normalization
# Defined as a function for reusability

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure required NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text: str) -> str:
    """Clean a review and return it as a space-joined string of tokens.

    Steps: HTML tag removal, lowercasing, escape character and punctuation
    removal, whitespace normalization, tokenization, and stopword removal.
    """
    # Remove HTML tags like <br>, <p>, etc.
    no_tags = re.sub(r'<[^>]+>', ' ', text)

    # Lowercase all characters
    lowercase = no_tags.lower()

    # Replace escape characters such as \n, \t, \r with spaces
    no_escapes = re.sub(r'[\n\t\r]', ' ', lowercase)

    # Remove punctuation (keeps letters, digits, underscores, and whitespace)
    no_punc = re.sub(r'[^\w\s]', '', no_escapes)

    # Normalize whitespace
    clean_text = re.sub(r'\s+', ' ', no_punc).strip()

    # Tokenize text
    tokens = word_tokenize(clean_text)

    # Remove stopwords
    # Note: punctuation was stripped first, so contractions such as "theres" survive this filter
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # Join the tokens back into one string, which is what TfidfVectorizer expects
    return ' '.join(filtered_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


heres sample text html tags punctuation new lines


In [5]:
df['preprocessed'] = df['review'].apply(preprocess_text)

In [6]:
df.head(10)

Unnamed: 0,review,sentiment,preprocessed
0,One of the other reviewers has mentioned that ...,positive,one reviewers mentioned watching 1 oz episode ...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically theres family little boy jake thinks...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter matteis love time money visually stunni...
5,"Probably my all-time favorite movie, a story o...",positive,probably alltime favorite movie story selfless...
6,I sure would like to see a resurrection of a u...,positive,sure would like see resurrection dated seahunt...
7,"This show was an amazing, fresh & innovative i...",negative,show amazing fresh innovative idea 70s first a...
8,Encouraged by the positive comments about this...,negative,encouraged positive comments film looking forw...
9,If you like original gut wrenching laughter yo...,positive,like original gut wrenching laughter like movi...


In [11]:
from sklearn.model_selection import train_test_split

# Define X, y
X = df['preprocessed'].values
y = df['sentiment'].values

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
len(X_train), len(X_test), len(y_train), len(y_test)

(40000, 10000, 40000, 10000)
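
Besides the split sizes, it can be worth checking that both splits have a similar share of positive and negative labels. The following is a minimal sketch with toy labels; `label_share` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter


def label_share(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy splits (made up) that are deliberately unbalanced
toy_train = ["positive"] * 3 + ["negative"] * 5
toy_test = ["positive"] * 1 + ["negative"] * 1

train_share = label_share(toy_train)
test_share = label_share(toy_test)
print(train_share, test_share)
```

In the notebook this would be called as `label_share(y_train)` and `label_share(y_test)`. If the shares differ noticeably, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits.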

In [15]:
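# Apply the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
# Fit the vocabulary and IDF weights on the training set only
trans_Xtrain = vec.fit_transform(X_train)
# Transform the test set with the fitted vectorizer (X_test, not X_train)
trans_Xtest = vec.transform(X_test)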
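
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF using the formula that `TfidfVectorizer` applies with its default settings, as far as I understand them: smoothed IDF `ln((1 + n) / (1 + df)) + 1`, raw term counts, and L2 normalization per document. It splits on whitespace only, so it does not reproduce the vectorizer's own tokenization, and `tfidf_rows` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = ["great movie great cast", "boring movie"]
rows = tfidf_rows(toy_docs)
print(rows)
```

The term "movie" appears in both toy documents, so it receives a lower weight than "great" or "boring", which each appear in only one.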

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# -------------------------------
# 1. Multinomial Naive Bayes
# -------------------------------
nb_model = MultinomialNB(alpha=1.0)  # alpha: smoothing parameter
nb_model.fit(trans_Xtrain, y_train)

# -------------------------------
# 2. Logistic Regression
# -------------------------------
lr_model = LogisticRegression(
    C=1.0,                # inverse of regularization strength
    max_iter=1000
)
lr_model.fit(trans_Xtrain, y_train)

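# -------------------------------
# 3. Support Vector Machine
# -------------------------------
svm_model = SVC(
    kernel='linear',  # linear kernels usually work well on sparse, high-dimensional text features
    C=1.0,            # regularization parameter
    probability=True  # enable probability estimates if needed later (adds internal cross-validation, so training is slower)
)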
svm_model.fit(trans_Xtrain, y_train)

ValueError: Found input variables with inconsistent numbers of samples: [10000, 40000]
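
The `ValueError` above reports 10000 and 40000 samples, which match the test and training sizes from the split. One plausible cause, assuming the failing call compared `y_test` with predictions made on `trans_Xtest`, is a test matrix that has as many rows as the training set. A small guard like the sketch below can surface such a mismatch before any metric is computed; `check_same_count` is a hypothetical helper, not a scikit-learn function.

```python
def check_same_count(**counts):
    """Raise ValueError if the named sample counts are not all equal."""
    if len(set(counts.values())) > 1:
        raise ValueError(f"Inconsistent sample counts: {counts}")
    return True


# Matching counts pass silently
ok = check_same_count(test_rows=10000, test_labels=10000)

# Mismatched counts, using the numbers from the error message
try:
    check_same_count(test_rows=40000, test_labels=10000)
    message = ""
except ValueError as err:
    message = str(err)

print(ok, message)
```

In the notebook this could be called as `check_same_count(test_rows=trans_Xtest.shape[0], test_labels=len(y_test))`, using `.shape[0]` because the TF-IDF output is a sparse matrix.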