## Step 1: Import Required Libraries

In [2]:
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

## Step 2: Loading the Dataset

This code loads the YouTube comments dataset from a CSV file. The dataset contains three columns: `video_id`, `comment` and `label`. Only the `comment` and `label` columns are used for classification. The dataset is then split into training and test sets using an 80/20 ratio. The split preserves the sentiment label distribution using `stratify=y` and ensures reproducibility with a fixed `random_state`.

In [None]:
CSV_PATH = r"youtube_comments_cleaned.csv"

df = pd.read_csv(CSV_PATH)

X = df["comment"].astype(str).values
y = df["label"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
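
A quick way to confirm that a stratified split kept the label proportions is to compare the class shares in both subsets. The sketch below uses a hypothetical helper, `label_distribution`, and toy labels (the label names and counts are assumptions, not taken from the dataset); in the notebook the same helper could be applied to `y_train` and `y_test`.

```python
from collections import Counter

def label_distribution(labels):
    """Return each label's share of the total, rounded to 3 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(count / total, 3) for label, count in sorted(counts.items())}

# Toy labels standing in for y_train and y_test (hypothetical values).
toy_train = ["positive"] * 48 + ["neutral"] * 24 + ["negative"] * 8
toy_test = ["positive"] * 12 + ["neutral"] * 6 + ["negative"] * 2

train_dist = label_distribution(toy_train)
test_dist = label_distribution(toy_test)

print("train:", train_dist)
print("test :", test_dist)
```

If the two dictionaries are close to each other, the 80/20 split has preserved the class balance as intended.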

## Step 3: Text Preprocessing Function

This code defines a custom tokenizer that performs essential text cleaning steps: converting text to lowercase, extracting alphabetic tokens, removing stop words and applying lemmatisation. These steps help reduce noise, standardise the text and map different grammatical forms of a word to a common base form. Because `lemmatize` is called without a part-of-speech tag, it treats every token as a noun, so it mainly reduces plural forms. Passing the same tokenizer to every vectorizer keeps full control over the preprocessing and ensures consistent text cleaning across all models.

In [5]:
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
token_pattern = re.compile(r"[A-Za-z]+")

def custom_tokenizer(text):
    text = text.lower()
    tokens = token_pattern.findall(text)
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return tokens

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
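
To see what the first three cleaning steps do to a comment, the sketch below reproduces them without NLTK. It is a simplified stand-in for `custom_tokenizer`: `demo_tokenizer` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-written sample rather than NLTK's English list, and lemmatisation is left out.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"this", "is", "the", "i", "it", "a", "and"}
DEMO_TOKEN_PATTERN = re.compile(r"[A-Za-z]+")

def demo_tokenizer(text):
    """Lowercase, keep alphabetic runs, drop stop words (no lemmatisation)."""
    tokens = DEMO_TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in DEMO_STOP_WORDS]

demo_tokens = demo_tokenizer("This video is AMAZING!!! I watched it 3 times :)")
print(demo_tokens)  # ['video', 'amazing', 'watched', 'times']
```

Note that the pattern `[A-Za-z]+` discards digits, punctuation and emoticons, and splits contractions at the apostrophe, so `don't` becomes the two tokens `don` and `t`.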


## Step 4: Model Evaluation Function

This function trains a given model, generates predictions on the test set and calculates accuracy, precision, recall and F1 score. Macro averaging is used so that every class contributes equally to the score regardless of its size, which is important when the dataset is imbalanced. The function prints the results and returns them for comparison.

In [6]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average='macro', zero_division=0
    )

    print(f"{name}:")
    print(f"Accuracy       : {acc:.4f}")
    print(f"Macro Precision: {precision:.4f}")
    print(f"Macro Recall   : {recall:.4f}")
    print(f"Macro F1       : {f1:.4f}")
    print()

    return {
        "model": name,
        "accuracy": acc,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1
    }
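
The effect of macro averaging is easiest to see on a small imbalanced example. The sketch below computes per-class F1 by hand in plain Python and then averages it. The helper `per_class_f1` and the toy labels are hypothetical, and undefined precision or recall is treated as 0, mirroring `zero_division=0` above.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Toy data: 8 "pos" and 2 "neg" comments, with a model that always predicts "pos".
toy_true = ["pos"] * 8 + ["neg"] * 2
toy_pred = ["pos"] * 10

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_f1 = per_class_f1(toy_true, toy_pred)
toy_macro_f1 = sum(toy_f1.values()) / len(toy_f1)

print(f"Accuracy: {toy_accuracy:.4f}")
print(f"Macro F1: {toy_macro_f1:.4f}")
```

Accuracy looks acceptable at 0.8 because the majority class dominates, while macro F1 drops to about 0.44 because the minority class scores 0 and counts just as much as the majority class.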

## Step 5: Machine Learning Pipeline

This code defines four different ML pipelines. The first two pipelines use **CountVectorizer**, which represents text as raw word counts, paired with **Multinomial Naive Bayes**. The next two pipelines use **TF-IDF**, which down-weights common words, paired with **Logistic Regression**. Each pairing is tested with unigram features alone (`ngram_range=(1, 1)`) and with unigrams plus bigrams (`ngram_range=(1, 2)`) to evaluate the impact of n-gram features.
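
How TF-IDF down-weights common words can be illustrated with the inverse document frequency term alone. The sketch below assumes the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is what scikit-learn documents as the default for `TfidfVectorizer`. The function name `smoothed_idf` and the document counts are hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size

idf_common = smoothed_idf(n_docs, doc_freq=900)  # word found in most comments
idf_rare = smoothed_idf(n_docs, doc_freq=5)      # word found in few comments

print(f"IDF of common word: {idf_common:.3f}")
print(f"IDF of rare word  : {idf_rare:.3f}")
```

A word that appears in every document gets the minimum weight of 1.0, and the weight grows as the word becomes rarer, so distinctive terms contribute more to the feature vector than ubiquitous ones.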

In [7]:
# NB + Count (unigram)
nb_count_uni = Pipeline([
    ("vect", CountVectorizer(
        tokenizer=custom_tokenizer,
        ngram_range=(1, 1),
        min_df=2
    )),
    ("clf", MultinomialNB())
])

# NB + Count (unigrams + bigrams)
nb_count_bi = Pipeline([
    ("vect", CountVectorizer(
        tokenizer=custom_tokenizer,
        ngram_range=(1, 2),
        min_df=2
    )),
    ("clf", MultinomialNB())
])

# LR + TF-IDF (unigram)
lr_tfidf_uni = Pipeline([
    ("vect", TfidfVectorizer(
        tokenizer=custom_tokenizer,
        ngram_range=(1, 1),
        min_df=2
    )),
    ("clf", LogisticRegression(max_iter=300, n_jobs=-1))
])

# LR + TF-IDF (unigrams + bigrams)
lr_tfidf_bi = Pipeline([
    ("vect", TfidfVectorizer(
        tokenizer=custom_tokenizer,
        ngram_range=(1, 2),
        min_df=2
    )),
    ("clf", LogisticRegression(max_iter=300, n_jobs=-1))
])
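
The difference between the two `ngram_range` settings can be sketched in plain Python. The hypothetical helper `extract_ngrams` below mimics how n-gram features are built from an already tokenized comment. It is an illustration of the idea, not scikit-learn's internal code.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Build space-joined n-grams for every n in the inclusive range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

toy_tokens = ["great", "video", "quality"]  # hypothetical tokenizer output

unigram_features = extract_ngrams(toy_tokens, ngram_range=(1, 1))
bigram_features = extract_ngrams(toy_tokens, ngram_range=(1, 2))

print(unigram_features)  # ['great', 'video', 'quality']
print(bigram_features)   # ['great', 'video', 'quality', 'great video', 'video quality']
```

With `ngram_range=(1, 2)` the unigrams are kept and the bigrams are added on top, so the "bigram" pipelines work with a larger vocabulary than the unigram ones.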

## Step 6: Training and Evaluating Models

This code runs each of the four pipelines and evaluates its performance using the evaluation function defined in Step 4. The results are stored for later comparison.

In [12]:
results = []

results.append(
    evaluate_model("NB + Count (unigram)", nb_count_uni, X_train, y_train, X_test, y_test)
)
results.append(
    evaluate_model("NB + Count (bigram)", nb_count_bi, X_train, y_train, X_test, y_test)
)
results.append(
    evaluate_model("LR + TF-IDF (unigram)", lr_tfidf_uni, X_train, y_train, X_test, y_test)
)
results.append(
    evaluate_model("LR + TF-IDF (bigram)", lr_tfidf_bi, X_train, y_train, X_test, y_test)
)



NB + Count (unigram):
Accuracy       : 0.6476
Macro Precision: 0.6376
Macro Recall   : 0.4930
Macro F1       : 0.4830

NB + Count (bigram):
Accuracy       : 0.6453
Macro Precision: 0.6283
Macro Recall   : 0.4857
Macro F1       : 0.4699

LR + TF-IDF (unigram):
Accuracy       : 0.6768
Macro Precision: 0.6598
Macro Recall   : 0.5256
Macro F1       : 0.5283

LR + TF-IDF (bigram):
Accuracy       : 0.6667
Macro Precision: 0.6681
Macro Recall   : 0.5059
Macro F1       : 0.5000



## Step 7: Model Comparison

This code sorts all model results by macro F1 score and prints a concise comparison table. This makes it easy to identify the strongest model and to compare the feature extraction methods.

In [13]:
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by="f1_macro", ascending=False)

print("\nModel Comparison Table:\n")
print(results_df)


Model Comparison Table:

                   model  accuracy  precision_macro  recall_macro  f1_macro
2  LR + TF-IDF (unigram)  0.676768         0.659753      0.525595  0.528258
3   LR + TF-IDF (bigram)  0.666667         0.668133      0.505935  0.500010
0   NB + Count (unigram)  0.647587         0.637637      0.493007  0.482965
1    NB + Count (bigram)  0.645342         0.628333      0.485695  0.469923
