# Hyperparameter Tuning of ML Models

In this notebook, we tune the hyperparameters of several machine learning models (Logistic Regression, Naive Bayes, Random Forest Classifier, LinearSVC, and SGD SVM) to optimize their performance. Because each model exposes many hyperparameters, we focus on those that most influence performance.

We begin with RandomizedSearchCV, which samples a fixed number of hyperparameter combinations from predefined distributions. This explores the parameter space broadly at a lower computational cost than an exhaustive search.

Based on these preliminary results, we then employ GridSearchCV to exhaustively evaluate all combinations within a narrowed hyperparameter grid, fine-tuning the models for optimal accuracy.
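
The two-stage idea can be sketched on a small synthetic problem. This is only an illustration under stated assumptions: the data comes from `make_classification` rather than the review corpus, and the 0.5x/1x/2x grid around the best `C` is a simplified stand-in for the refinement helper defined later in this notebook.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic binary problem standing in for the review data (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Stage 1: sample 5 values of C from uniform(loc=0.01, scale=10), i.e. [0.01, 10.01].
coarse = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions={"C": uniform(0.01, 10)},
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
coarse.fit(X, y)
best_c = float(coarse.best_params_["C"])

# Stage 2: exhaustively evaluate a narrow grid centred on the stage-1 winner.
fine = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [best_c * 0.5, best_c, best_c * 2.0]},
    cv=3,
    scoring="accuracy",
)
fine.fit(X, y)

print("coarse best C:", round(best_c, 4))
print("fine grid:", [round(c, 4) for c in fine.param_grid["C"]])
print("fine best C:", round(float(fine.best_params_["C"]), 4))
```

The random stage costs `n_iter * cv` fits regardless of how wide the distribution is, while the grid stage costs one fit per grid point per fold, so the grid is kept deliberately small.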

Due to computational constraints, we did not tune the kernelized SVC with a linear kernel; LinearSVC and an SGD-trained linear SVM are tuned instead.

After tuning all models, TF-IDF combined with Logistic Regression performed best, reaching the highest test accuracy (0.8976 after grid search).

# Imports

In [None]:
import pandas as pd
from collections import defaultdict

from google.colab import drive

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    GridSearchCV,
    RandomizedSearchCV
)

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC

from sklearn.pipeline import make_pipeline

from scipy.stats import uniform, randint

In [None]:
import torch_xla
import torch_xla.core.xla_model as xm
device = xm.xla_device()
print(f"Using device: {device}")

Using device: xla:0


# Loading dataset

In [None]:
drive.mount('/content/drive')

neg_df = pd.read_csv('/content/drive/MyDrive/it1244/cleaned_neg_reviews_regular_stopwords.csv')
neg_df['label'] = 0

pos_df = pd.read_csv('/content/drive/MyDrive/it1244/cleaned_pos_reviews_regular_stopwords.csv')
pos_df['label'] = 1

df = pd.concat([neg_df, pos_df], ignore_index=True)

print(df.head())
print("Total reviews:", len(df))

Mounted at /content/drive
    FileName                                       Cleaned_Text  label
0  23129.txt  not even goebbels could pulled propaganda stun...      0
1  22912.txt  plot fizzled reeked irreconcilable difference ...      0
2  23622.txt  first look cover picture look like good rock n...      0
3  23637.txt  drama core anna display genuine truth actor ag...      0
4  23109.txt  magic lassie opened radio city music hall wa f...      0
Total reviews: 50000


# Splitting dataset

In [None]:
X = df['Cleaned_Text']
y = df['label']

# Split the data into training and testing sets.
# The test set size is 20% of the total data, 'stratify=y' ensures that both training and test sets maintain the same class distribution as the original dataset.
# 'random_state=42' is set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))

Training set size: 40000
Test set size: 10000
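
As a quick check of what `stratify=y` does, the following sketch splits a toy balanced label list (made up for illustration, not the review labels) and counts the classes on each side.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy balanced labels; X is just an index so the split is easy to inspect.
y = [0] * 50 + [1] * 50
X = list(range(len(y)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification, both splits keep the 50/50 class ratio.
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```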


# Vectorization

BoW Vectorizer

In [None]:
# This code performs text vectorization on the X train and test sets.
# The resulting feature shapes are printed to compare the effect of stopword removal on vocabulary size and representation.

bow = CountVectorizer(
    max_features=50000,
    min_df=2, # ignore terms that appear in fewer than 2 documents
    lowercase=False, # assumes text is already lowercased
    tokenizer=lambda x: x.split(), # split the text on whitespace (custom tokenizer)
    preprocessor=None, # no custom preprocessor (this is the default)
    token_pattern=None # not used because a custom tokenizer is supplied
)
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)
print("Regular Stopwords - Trained BoW shape:", X_train_bow.shape)
print("Regular Stopwords - Test BoW shape:", X_test_bow.shape)

Regular Stopwords - Trained BoW shape: (40000, 50000)
Regular Stopwords - Test BoW shape: (10000, 50000)


TF-IDF Vectorizer

In [None]:
# the TF-IDF vectoriser has the same parameters as stated in BoW

tfidf = TfidfVectorizer(
    max_features=50000,
    min_df=2,
    lowercase=False,
    tokenizer=lambda x: x.split(),
    preprocessor=None,
    token_pattern=None
)


X_train_tfidf = tfidf.fit_transform(X_train)

X_test_tfidf = tfidf.transform(X_test)

print("Regular Stopwords - Trained TF-IDF shape:", X_train_tfidf.shape)
print("Regular Stopwords - Test TF-IDF shape:", X_test_tfidf.shape)

Regular Stopwords - Trained TF-IDF shape: (40000, 50000)
Regular Stopwords - Test TF-IDF shape: (10000, 50000)
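
For intuition about what the TF-IDF weights represent, the sketch below recomputes them by hand on a toy corpus (made-up text). It assumes scikit-learn's default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. It ignores `min_df` and `max_features` for brevity.

```python
import math
from collections import Counter

docs = ["good movie good plot", "bad movie", "good acting"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed default behaviour).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, "->", {t: round(w, 3) for t, w in sorted(row.items())})
```

Rare terms such as "plot" receive a larger IDF than common ones such as "good", which is the main difference from the raw counts used by BoW.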


In [None]:
# Here, we define a function to evaluate the performance of a trained model on test data.
def evaluate_model(model, X_test, y_test, name="Model"):
    print(f"\n Evaluating: {name}")

    # Predict the labels for the test data using the provided model.
    y_pred = model.predict(X_test)

    # Calculate evaluation metrics: accuracy, F1 score, precision, and recall.
    test_accuracy = accuracy_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred)
    test_recall = recall_score(y_test, y_pred)

    print(f"Test Accuracy:  {round(test_accuracy, 4)}")
    print(f"Test F1 Score:  {round(test_f1, 4)}")
    print(f"Test Precision:{round(test_precision, 4)}")
    print(f"Test Recall:   {round(test_recall, 4)}")

    print("\n Classification Report:")
    print(classification_report(y_test, y_pred))
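
The four scores printed by `evaluate_model` all derive from the confusion counts. The sketch below works them out by hand on a ten-item toy example (labels are made up), treating class 1 as the positive class, which is the default for scikit-learn's binary precision, recall, and F1.

```python
# Toy labels (made up for illustration): 1 = positive review, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```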


# Randomised Search

We first run RandomizedSearchCV to narrow down the ranges for the grid search later on.

## Helper Functions

This helper extracts the best hyperparameter combinations from a fitted RandomizedSearchCV.

In [None]:
def get_top_params(random_search, top_n=3):
    results_df = pd.DataFrame(random_search.cv_results_)
    top_params = results_df.sort_values(by='mean_test_score', ascending=False).head(top_n)
    return top_params['params'].tolist()
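
The ranking logic can be tried without running a search. In this sketch, `cv_results` is a hypothetical stand-in for `random_search.cv_results_` with made-up scores, and `top_params_from_results` is an illustrative variant of the helper that accepts the results dictionary directly.

```python
import pandas as pd

# Hypothetical stand-in for random_search.cv_results_ (values are made up).
cv_results = {
    "params": [{"alpha": 0.2}, {"alpha": 0.7}, {"alpha": 0.5}, {"alpha": 0.9}],
    "mean_test_score": [0.81, 0.86, 0.84, 0.85],
}


def top_params_from_results(results, top_n=3):
    """Return the parameter dicts of the top_n candidates by mean CV score."""
    results_df = pd.DataFrame(results)
    top = results_df.sort_values(by="mean_test_score", ascending=False).head(top_n)
    return top["params"].tolist()


top = top_params_from_results(cv_results)
print(top)
```

Only the parameter dictionaries are returned; the scores themselves are dropped, so the later grid refinement treats the three candidates equally.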

Function to run RandomizedSearchCV

In [None]:
# This function sets up RandomizedSearchCV with the provided model and hyperparameter distribution.
# It will perform n_iter iterations using cross-validation (cv), optimizing for accuracy.

def run_randomsearch_and_evaluate(model, param_dist, X_train, X_test, y_train, y_test, name="Model", n_iter=20, cv=5):
    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_dist,
        n_iter=n_iter,
        cv=cv,
        scoring='accuracy',
        random_state=42,
        n_jobs=-1,
        verbose=1
    )

    print(f"\nRunning RandomizedSearchCV for {name} ...")
    random_search.fit(X_train, y_train)

    # Extract the top 3 hyperparameter settings based on the mean test score.
    top_params = get_top_params(random_search, top_n=3)
    print(f"\nTop Parameters for {name}:", top_params)

    # Retrieve the best estimator from the search.
    best_model = random_search.best_estimator_

    # Evaluate the best model on the test data and print the evaluation metrics.
    evaluate_model(best_model, X_test, y_test, name=name)

    return top_params, best_model


## Multinomial Naive Bayes

In [None]:
nb_param_dist = {'alpha': uniform(0.01, 1.0)}
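
One detail worth keeping in mind: `scipy.stats.uniform` takes `loc` and `scale`, not a lower and upper bound, so `uniform(0.01, 1.0)` covers the interval from 0.01 to 1.01. The short check below illustrates this; the same convention applies to the other `uniform(...)` distributions used in this notebook.

```python
from scipy.stats import uniform

# uniform(loc, scale) is uniform on [loc, loc + scale].
alpha_dist = uniform(0.01, 1.0)
low, high = alpha_dist.support()
print("alpha range:", low, high)

# The same rule gives roughly [0.01, 10.01] for uniform(0.01, 10).
c_low, c_high = uniform(0.01, 10).support()
print("C range:", c_low, c_high)

# Draw a few candidate values the way a randomised search would.
samples = alpha_dist.rvs(size=5, random_state=42)
print(samples)
```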

With BoW

In [None]:
nb_top_params_bow, nb_best_bow = run_randomsearch_and_evaluate(
    model=MultinomialNB(),
    param_dist=nb_param_dist,
    X_train=X_train_bow,
    X_test=X_test_bow,
    y_train=y_train,
    y_test=y_test,
    name="Naive Bayes (BoW)"
)


Running RandomizedSearchCV for Naive Bayes (BoW) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Naive Bayes (BoW): [{'alpha': np.float64(0.7419939418114051)}, {'alpha': np.float64(0.8761761457749352)}, {'alpha': np.float64(0.9799098521619943)}]

 Evaluating: Naive Bayes (BoW)
Test Accuracy:  0.8533
Test F1 Score:  0.85
Test Precision:0.8696
Test Recall:   0.8312

 Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.88      0.86      5000
           1       0.87      0.83      0.85      5000

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



With TF-IDF

In [None]:
nb_top_params_tfidf, nb_best_tfidf = run_randomsearch_and_evaluate(
    model=MultinomialNB(),
    param_dist=nb_param_dist,
    X_train=X_train_tfidf,
    X_test=X_test_tfidf,
    y_train=y_train,
    y_test=y_test,
    name="Naive Bayes (TF-IDF)"
)


Running RandomizedSearchCV for Naive Bayes (TF-IDF) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Naive Bayes (TF-IDF): [{'alpha': np.float64(0.6086584841970366)}, {'alpha': np.float64(0.6111150117432088)}, {'alpha': np.float64(0.9799098521619943)}]

 Evaluating: Naive Bayes (TF-IDF)
Test Accuracy:  0.8636
Test F1 Score:  0.8625
Test Precision:0.8695
Test Recall:   0.8556

 Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.87      0.86      5000
           1       0.87      0.86      0.86      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Logistic Regression

In [None]:
logreg_param_dist = {
    'C': uniform(0.01, 10),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'max_iter': [400]
}

With BoW

In [None]:
logreg_top_params_bow, logreg_best_bow = run_randomsearch_and_evaluate(
    model=LogisticRegression(random_state=42),
    param_dist=logreg_param_dist,
    X_train=X_train_bow,
    X_test=X_test_bow,
    y_train=y_train,
    y_test=y_test,
    name="Logistic Regression (BoW)"
)


Running RandomizedSearchCV for Logistic Regression (BoW) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Logistic Regression (BoW): [{'C': np.float64(0.017787658410143285), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}, {'C': np.float64(0.5741157902710026), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}, {'C': np.float64(0.5908361216819946), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}]

 Evaluating: Logistic Regression (BoW)
Test Accuracy:  0.8882
Test F1 Score:  0.8901
Test Precision:0.875
Test Recall:   0.9058

 Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.87      0.89      5000
           1       0.88      0.91      0.89      5000

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



With TF-IDF

In [None]:
logreg_top_params_tfidf, logreg_best_tfidf = run_randomsearch_and_evaluate(
    model=LogisticRegression(random_state=42),
    param_dist=logreg_param_dist,
    X_train=X_train_tfidf,
    X_test=X_test_tfidf,
    y_train=y_train,
    y_test=y_test,
    name="Logistic Regression (TF-IDF)"
)


Running RandomizedSearchCV for Logistic Regression (TF-IDF) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Logistic Regression (TF-IDF): [{'C': np.float64(3.347086111390218), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}, {'C': np.float64(5.152344384136116), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}, {'C': np.float64(6.193860093330873), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}]

 Evaluating: Logistic Regression (TF-IDF)
Test Accuracy:  0.8972
Test F1 Score:  0.8986
Test Precision:0.8867
Test Recall:   0.9108

 Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.88      0.90      5000
           1       0.89      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



## Random Forest Classifier

In [None]:
rf_param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None] + list(range(5, 20)),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2']
}

With BoW

In [None]:
rf_top_params_bow, rf_best_bow = run_randomsearch_and_evaluate(
    model=RandomForestClassifier(random_state=42),
    param_dist=rf_param_dist,
    X_train=X_train_bow,
    X_test=X_test_bow,
    y_train=y_train,
    y_test=y_test,
    name="Random Forest (BoW)"
)


Running RandomizedSearchCV for Random Forest (BoW) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Random Forest (BoW): [{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 104}, {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 6, 'min_samples_split': 6, 'n_estimators': 138}, {'max_depth': 16, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 120}]

 Evaluating: Random Forest (BoW)
Test Accuracy:  0.855
Test F1 Score:  0.8571
Test Precision:0.8451
Test Recall:   0.8694

 Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.84      0.85      5000
           1       0.85      0.87      0.86      5000

    accuracy                           0.85     10000
   macro avg       0.86      0.85      0.85     10000
weighted avg       0.86      0.85      0.85     10000



With TF-IDF

In [None]:
rf_top_params_tfidf, rf_best_tfidf = run_randomsearch_and_evaluate(
    model=RandomForestClassifier(random_state=42),
    param_dist=rf_param_dist,
    X_train=X_train_tfidf,
    X_test=X_test_tfidf,
    y_train=y_train,
    y_test=y_test,
    name="Random Forest (TF-IDF)"
)


Running RandomizedSearchCV for Random Forest (TF-IDF) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for Random Forest (TF-IDF): [{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 104}, {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 6, 'min_samples_split': 6, 'n_estimators': 138}, {'max_depth': 16, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 120}]

 Evaluating: Random Forest (TF-IDF)
Test Accuracy:  0.851
Test F1 Score:  0.8516
Test Precision:0.8484
Test Recall:   0.8548

 Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      5000
           1       0.85      0.85      0.85      5000

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



## SVM

We run cross-validated searches only for LinearSVC and an SGD-trained linear SVM, because `SVC(kernel='linear')` scales poorly with the number of samples and would take very long to run on the 40,000 training reviews.

### LinearSVC

In [None]:
svc_pipeline = make_pipeline(
    StandardScaler(with_mean=False),
    LinearSVC(max_iter=1_000_000, random_state=42)
)

svc_param_dist = {
    'linearsvc__C': uniform(0.01, 10)
}
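
The `linearsvc__C` key follows the pipeline naming convention: `make_pipeline` names each step after its lowercased class name, and a double underscore separates the step name from the parameter. A small sketch to confirm the names:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# with_mean=False keeps sparse input sparse (centering would densify it).
pipe = make_pipeline(StandardScaler(with_mean=False), LinearSVC())

# Step names are generated from the class names.
print(list(pipe.named_steps))

# Nested parameters are addressed as <step>__<parameter>.
print("linearsvc__C" in pipe.get_params())

pipe.set_params(linearsvc__C=0.5)
print(pipe.named_steps["linearsvc"].C)
```

Because the scaler is part of the pipeline, it is refitted on the training portion of each cross-validation fold, so scaling statistics are not computed on the validation fold.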

With BoW

In [None]:
svc_top_params_bow, svc_best_bow = run_randomsearch_and_evaluate(
    model=svc_pipeline,
    param_dist=svc_param_dist,
    X_train=X_train_bow,
    X_test=X_test_bow,
    y_train=y_train,
    y_test=y_test,
    name="LinearSVC (BoW)"
)


Running RandomizedSearchCV for LinearSVC (BoW) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for LinearSVC (BoW): [{'linearsvc__C': np.float64(0.21584494295802448)}, {'linearsvc__C': np.float64(0.5908361216819946)}, {'linearsvc__C': np.float64(1.5701864044243652)}]

 Evaluating: LinearSVC (BoW)
Test Accuracy:  0.8393
Test F1 Score:  0.8394
Test Precision:0.839
Test Recall:   0.8398

 Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      5000
           1       0.84      0.84      0.84      5000

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



With TF-IDF

In [None]:
svc_top_params_tfidf, svc_best_tfidf = run_randomsearch_and_evaluate(
    model=svc_pipeline,
    param_dist=svc_param_dist,
    X_train=X_train_tfidf,
    X_test=X_test_tfidf,
    y_train=y_train,
    y_test=y_test,
    name="LinearSVC (TF-IDF)"
)


Running RandomizedSearchCV for LinearSVC (TF-IDF) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for LinearSVC (TF-IDF): [{'linearsvc__C': np.float64(0.21584494295802448)}, {'linearsvc__C': np.float64(0.5908361216819946)}, {'linearsvc__C': np.float64(3.7554011884736247)}]

 Evaluating: LinearSVC (TF-IDF)
Test Accuracy:  0.8332
Test F1 Score:  0.834
Test Precision:0.8302
Test Recall:   0.8378

 Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.83      0.83      5000
           1       0.83      0.84      0.83      5000

    accuracy                           0.83     10000
   macro avg       0.83      0.83      0.83     10000
weighted avg       0.83      0.83      0.83     10000



### SGDClassifier

In [None]:
sgd_pipeline = make_pipeline(
    StandardScaler(with_mean=False),
    SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
)

sgd_param_dist = {
    'sgdclassifier__alpha': uniform(1e-5, 1e-2),
    'sgdclassifier__penalty': ['l2', 'l1', 'elasticnet']
}

With BoW

In [None]:
sgd_top_params_bow, sgd_best_bow = run_randomsearch_and_evaluate(
    model=sgd_pipeline,
    param_dist=sgd_param_dist,
    X_train=X_train_bow,
    X_test=X_test_bow,
    y_train=y_train,
    y_test=y_test,
    name="SGD SVM (BoW)"
)


Running RandomizedSearchCV for SGD SVM (BoW) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for SGD SVM (BoW): [{'sgdclassifier__alpha': np.float64(0.006193860093330872), 'sgdclassifier__penalty': 'elasticnet'}, {'sgdclassifier__alpha': np.float64(0.006021150117432088), 'sgdclassifier__penalty': 'elasticnet'}, {'sgdclassifier__alpha': np.float64(0.004570699842170359), 'sgdclassifier__penalty': 'elasticnet'}]

 Evaluating: SGD SVM (BoW)
Test Accuracy:  0.8703
Test F1 Score:  0.872
Test Precision:0.8608
Test Recall:   0.8834

 Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.86      0.87      5000
           1       0.86      0.88      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



With TF-IDF

In [None]:
sgd_top_params_tfidf, sgd_best_tfidf = run_randomsearch_and_evaluate(
    model=sgd_pipeline,
    param_dist=sgd_param_dist,
    X_train=X_train_tfidf,
    X_test=X_test_tfidf,
    y_train=y_train,
    y_test=y_test,
    name="SGD SVM (TF-IDF)"
)


Running RandomizedSearchCV for SGD SVM (TF-IDF) ...
Fitting 5 folds for each of 20 candidates, totalling 100 fits

Top Parameters for SGD SVM (TF-IDF): [{'sgdclassifier__alpha': np.float64(0.006193860093330872), 'sgdclassifier__penalty': 'elasticnet'}, {'sgdclassifier__alpha': np.float64(0.006021150117432088), 'sgdclassifier__penalty': 'elasticnet'}, {'sgdclassifier__alpha': np.float64(0.004570699842170359), 'sgdclassifier__penalty': 'elasticnet'}]

 Evaluating: SGD SVM (TF-IDF)
Test Accuracy:  0.8638
Test F1 Score:  0.8661
Test Precision:0.8518
Test Recall:   0.8808

 Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.85      0.86      5000
           1       0.85      0.88      0.87      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



# Grid Search

## Helper Functions

This function takes the top hyperparameters and creates variations of them to explore a narrower region of the hyperparameter space for fine-tuning.

In [None]:
# This function creates a refined hyperparameter grid based on the top parameters from conducting Randomised Search.
# It adjusts numeric parameter values by applying a set of multipliers and incorporates any fixed parameters provided.

def create_refined_param_grid(top_params,
                              base_multipliers=(0.5, 1.0, 2.0),
                              fixed_params=None):

    param_grid = defaultdict(set)

    for params in top_params:
        for key, value in params.items():
            if isinstance(value, (int, float)): # If the parameter value is numeric (integer or float), adjust it using the base multipliers.
                for factor in base_multipliers:
                    if key == 'n_estimators': # For 'n_estimators', multiply and convert to integer.
                        param_grid[key].add(int(value * factor))
                    elif key in ('min_samples_split', 'min_samples_leaf', 'max_depth'): # For specific tree-based parameters, adjust and ensure they remain within acceptable ranges.
                        new_value = round(value * factor)
                        if key == 'min_samples_split' and new_value >= 2:
                            param_grid[key].add(new_value)
                        elif key == 'min_samples_leaf' and new_value >= 1:
                            param_grid[key].add(new_value)
                        elif key == 'max_depth' and new_value >= 1: # For max_depth, the scaled value must be at least 1 (a max_depth of None is not numeric and is added unchanged in the else branch below).
                            param_grid[key].add(new_value)
                    else:
                        param_grid[key].add(round(value * factor, 5)) # For any other numeric parameter, multiply and round the result.
            else:
                param_grid[key].add(value) # If the parameter value is not numeric, simply add the original value.

    # Convert the sets into sorted lists to create the refined grid.
    refined_grid = {}
    for key, values in param_grid.items():
        try:
            refined_grid[key] = sorted(values)
        except TypeError:
            refined_grid[key] = list(values)

    if fixed_params:
        for key, value in fixed_params.items():
            refined_grid[key] = value  # If there are any fixed parameters provided, override the refined grid with these fixed values.

    return refined_grid

Function to run GridSearchCV

In [None]:
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # Define a cross-validation strategy using StratifiedKFold to maintain the class distribution.

# This function runs GridSearchCV using a refined parameter grid, evaluates the best model, and prints results.
def run_gridsearch_and_evaluate(X_train, X_test, y_train, y_test, model, top_params, name, fixed_params=None):
    param_grid = create_refined_param_grid(top_params, fixed_params=fixed_params)

    grid_clf = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        scoring={
            'accuracy': 'accuracy',
            'precision': 'precision',
            'recall': 'recall',
            'f1': 'f1'
        },
        cv=cv_strategy,
        refit='accuracy',
        verbose=5,
        n_jobs=-1
    )

    print(f"\nRunning GridSearchCV for {name} ...")
    grid_clf.fit(X_train, y_train)

    print("\nBest Parameters:", grid_clf.best_params_)
    print("Best CV Score:", grid_clf.best_score_)

    best_model = grid_clf.best_estimator_
    evaluate_model(best_model, X_test, y_test, name=name)
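
Because the refined grid is a Cartesian product, its size grows quickly with the number of tuned parameters. The sketch below is a compact re-implementation of the refinement rules (written for illustration, not the notebook's function itself) applied to the top-3 Random Forest settings printed by the randomised search on BoW features, to estimate how many fits the grid search will need.

```python
from math import prod

# Top-3 Random Forest settings reported by the randomised search (BoW).
top_params = [
    {"max_depth": None, "max_features": "sqrt", "min_samples_leaf": 3,
     "min_samples_split": 5, "n_estimators": 104},
    {"max_depth": None, "max_features": "log2", "min_samples_leaf": 6,
     "min_samples_split": 6, "n_estimators": 138},
    {"max_depth": 16, "max_features": "sqrt", "min_samples_leaf": 1,
     "min_samples_split": 2, "n_estimators": 120},
]
multipliers = (0.5, 1.0, 2.0)
minimum = {"min_samples_split": 2, "min_samples_leaf": 1, "max_depth": 1}

grid = {}
for params in top_params:
    for key, value in params.items():
        values = grid.setdefault(key, set())
        if not isinstance(value, (int, float)):
            values.add(value)  # None and strings pass through unchanged
            continue
        for factor in multipliers:
            if key == "n_estimators":
                values.add(int(value * factor))
            else:
                scaled = round(value * factor)
                if scaled >= minimum[key]:
                    values.add(scaled)

grid["random_state"] = {42}  # fixed parameter with a single value

n_candidates = prod(len(v) for v in grid.values())
n_folds = 5
print({k: len(v) for k, v in grid.items()})
print("candidates:", n_candidates, "fits:", n_candidates * n_folds)
```

This gives 2520 candidates and 12600 fits, which agrees with the Random Forest grid search log. Trimming even one parameter to a single value would shrink the search considerably.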

## Multinomial Naive Bayes

In [None]:
nb_model = MultinomialNB()
fixed_nb = {}

With BoW

In [None]:
run_gridsearch_and_evaluate(
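    # Fit and evaluate on the same BoW features. The output recorded below
    # appears to come from a run that passed X_train_tfidf as the training matrix.
    X_train_bow, X_test_bow, y_train, y_test,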
    model=nb_model,
    top_params=nb_top_params_bow,
    name="Naive Bayes + BoW",
    fixed_params=fixed_nb
)


Running GridSearchCV for Naive Bayes + BoW ...
Fitting 5 folds for each of 9 candidates, totalling 45 fits

Best Parameters: {'alpha': np.float64(0.74199)}
Best CV Score: 0.861675

 Evaluating: Naive Bayes + BoW
Test Accuracy:  0.8562
Test F1 Score:  0.8481
Test Precision:0.8988
Test Recall:   0.8028

 Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.91      0.86      5000
           1       0.90      0.80      0.85      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



With TF-IDF

In [None]:
run_gridsearch_and_evaluate(
    X_train_tfidf, X_test_tfidf, y_train, y_test,
    model=nb_model,
    top_params=nb_top_params_tfidf,
    name="Naive Bayes + TF-IDF",
    fixed_params=fixed_nb
)


Running GridSearchCV for Naive Bayes + TF-IDF ...
Fitting 5 folds for each of 9 candidates, totalling 45 fits

Best Parameters: {'alpha': np.float64(0.60866)}
Best CV Score: 0.86165

 Evaluating: Naive Bayes + TF-IDF
Test Accuracy:  0.8636
Test F1 Score:  0.8625
Test Precision:0.8695
Test Recall:   0.8556

 Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.87      0.86      5000
           1       0.87      0.86      0.86      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Logistic Regression

In [None]:
log_reg_model = LogisticRegression(random_state=42)


fixed_logreg = {
    'solver': ['liblinear'],
    'max_iter': [400]
}

With BoW

In [None]:
run_gridsearch_and_evaluate(
    X_train_bow, X_test_bow, y_train, y_test,
    model=log_reg_model,
    top_params=logreg_top_params_bow,
    name="Logistic Regression + BoW",
    fixed_params=fixed_logreg
)


Running GridSearchCV for Logistic Regression + BoW ...
Fitting 5 folds for each of 9 candidates, totalling 45 fits

Best Parameters: {'C': np.float64(0.03558), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV Score: 0.88555

 Evaluating: Logistic Regression + BoW
Test Accuracy:  0.8874
Test F1 Score:  0.889
Test Precision:0.8767
Test Recall:   0.9016

 Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.87      0.89      5000
           1       0.88      0.90      0.89      5000

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



With TF-IDF

In [None]:
run_gridsearch_and_evaluate(
    X_train_tfidf, X_test_tfidf, y_train, y_test,
    model=log_reg_model,
    top_params=logreg_top_params_tfidf,
    name="Logistic Regression + TF-IDF",
    fixed_params=fixed_logreg
)


Running GridSearchCV for Logistic Regression + TF-IDF ...
Fitting 5 folds for each of 9 candidates, totalling 45 fits

Best Parameters: {'C': np.float64(3.09693), 'max_iter': 400, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV Score: 0.893

 Evaluating: Logistic Regression + TF-IDF
Test Accuracy:  0.8976
Test F1 Score:  0.899
Test Precision:0.8871
Test Recall:   0.9112

 Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.88      0.90      5000
           1       0.89      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



## Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(random_state=42)
fixed_rf = {'random_state': [42]}

With BoW

In [None]:
run_gridsearch_and_evaluate(
    X_train_bow, X_test_bow, y_train, y_test,
    model=rf_model,
    top_params=rf_top_params_bow,
    name="Random Forest + BoW",
    fixed_params=fixed_rf
)


Running GridSearchCV for Random Forest + BoW ...
Fitting 5 folds for each of 2520 candidates, totalling 12600 fits





Best Parameters: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 276, 'random_state': 42}
Best CV Score: 0.865475

 Evaluating: Random Forest + BoW
Test Accuracy:  0.8666
Test F1 Score:  0.8681
Test Precision:0.8586
Test Recall:   0.8778

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.86      0.87      5000
           1       0.86      0.88      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



With TF-IDF

In [None]:
run_gridsearch_and_evaluate(
    X_train_tfidf, X_test_tfidf, y_train, y_test,
    model=rf_model,
    top_params=rf_top_params_tfidf,
    name="Random Forest + TF-IDF",
    fixed_params=fixed_rf
)


Running GridSearchCV for Random Forest + TF-IDF ...
Fitting 5 folds for each of 2520 candidates, totalling 12600 fits





Best Parameters: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 276, 'random_state': 42}
Best CV Score: 0.860225

 Evaluating: Random Forest + TF-IDF
Test Accuracy:  0.8605
Test F1 Score:  0.8624
Test Precision:0.851
Test Recall:   0.874

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.85      0.86      5000
           1       0.85      0.87      0.86      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Linear SVC

In [None]:
svc_pipeline = make_pipeline(
    StandardScaler(with_mean=False),
    LinearSVC(max_iter=1_000_000, random_state=42)
)

# Hand-picked seed values for C; these are not the RandomizedSearchCV results from above.
top_params_svc = [
    {'linearsvc__C': 1.0}, {'linearsvc__C': 0.5}, {'linearsvc__C': 2.0}
]

With BoW

In [None]:
run_gridsearch_and_evaluate(
    X_train_bow, X_test_bow, y_train, y_test,
    model=svc_pipeline,
    top_params=top_params_svc,
    name="LinearSVC + BoW"
)


Running GridSearchCV for LinearSVC + BoW ...
Fitting 5 folds for each of 5 candidates, totalling 25 fits

Best Parameters: {'linearsvc__C': 0.25}
Best CV Score: 0.829525

 Evaluating: LinearSVC + BoW
Test Accuracy:  0.8391
Test F1 Score:  0.8392
Test Precision:0.8386
Test Recall:   0.8398

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      5000
           1       0.84      0.84      0.84      5000

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



With TF-IDF

In [None]:
run_gridsearch_and_evaluate(
    X_train_tfidf, X_test_tfidf, y_train, y_test,
    model=svc_pipeline,
    top_params=top_params_svc,
    name="LinearSVC + TF-IDF"
)


Running GridSearchCV for LinearSVC + TF-IDF ...
Fitting 5 folds for each of 5 candidates, totalling 25 fits

Best Parameters: {'linearsvc__C': 0.25}
Best CV Score: 0.8297000000000001

 Evaluating: LinearSVC + TF-IDF
Test Accuracy:  0.8332
Test F1 Score:  0.834
Test Precision:0.8302
Test Recall:   0.8378

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.83      0.83      5000
           1       0.83      0.84      0.83      5000

    accuracy                           0.83     10000
   macro avg       0.83      0.83      0.83     10000
weighted avg       0.83      0.83      0.83     10000
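
Both LinearSVC searches selected C = 0.25, which is the smallest value in the five-point grid generated from the seeds 0.5, 1.0 and 2.0. A best value on the edge of a grid can mean the optimum lies outside it. The sketch below shows one possible way to detect this and propose a follow-up grid; `expand` and `on_boundary` are hypothetical helpers, not part of the notebook.

```python
def expand(values, multipliers=(0.5, 1.0, 2.0)):
    """Return the sorted, de-duplicated grid produced by scaling each seed value."""
    return sorted({round(v * m, 5) for v in values for m in multipliers})


def on_boundary(best, grid):
    """True if the best value is the smallest or largest grid point."""
    return best in (grid[0], grid[-1])


grid = expand([1.0, 0.5, 2.0])
best_c = 0.25  # value selected by both LinearSVC grid searches

print("grid:", grid)
print("best on boundary:", on_boundary(best_c, grid))

# If the winner sits on the edge, re-centre a new grid on it.
next_grid = expand([best_c]) if on_boundary(best_c, grid) else grid
print("next grid:", next_grid)
```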



## SGD Classifier

In [None]:
sgd_pipeline = make_pipeline(
    StandardScaler(with_mean=False),
    SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
)

# Hand-picked seed values; these are not the RandomizedSearchCV results from above.
top_params_sgd = [
    {'sgdclassifier__alpha': 0.0001, 'sgdclassifier__penalty': 'l2'},
    {'sgdclassifier__alpha': 0.001, 'sgdclassifier__penalty': 'l1'}
]

With BoW

In [None]:
run_gridsearch_and_evaluate(
    X_train_bow, X_test_bow, y_train, y_test,
    model=sgd_pipeline,
    top_params=top_params_sgd,
    name="SGD SVM + BoW"
)


Running GridSearchCV for SGD SVM + BoW ...
Fitting 5 folds for each of 12 candidates, totalling 60 fits

Best Parameters: {'sgdclassifier__alpha': 0.002, 'sgdclassifier__penalty': 'l2'}
Best CV Score: 0.8134750000000001

 Evaluating: SGD SVM + BoW
Test Accuracy:  0.8384
Test F1 Score:  0.8402
Test Precision:0.831
Test Recall:   0.8496

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.83      0.84      5000
           1       0.83      0.85      0.84      5000

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



With TF-IDF

In [None]:
run_gridsearch_and_evaluate(
    X_train_tfidf, X_test_tfidf, y_train, y_test,
    model=sgd_pipeline,
    top_params=top_params_sgd,
    name="SGD SVM + TF-IDF"
)


Running GridSearchCV for SGD SVM + TF-IDF ...
Fitting 5 folds for each of 12 candidates, totalling 60 fits

Best Parameters: {'sgdclassifier__alpha': 0.0001, 'sgdclassifier__penalty': 'l2'}
Best CV Score: 0.8006249999999999

 Evaluating: SGD SVM + TF-IDF
Test Accuracy:  0.8232
Test F1 Score:  0.8241
Test Precision:0.8197
Test Recall:   0.8286

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.82      0.82      5000
           1       0.82      0.83      0.82      5000

    accuracy                           0.82     10000
   macro avg       0.82      0.82      0.82     10000
weighted avg       0.82      0.82      0.82     10000



# Conclusion

After conducting hyperparameter tuning, we trained each model with the best hyperparameters.

The saved models can be found here: https://drive.google.com/drive/folders/1zFIHeiYR8FIGrFkIerj3VOJS1VztmSuW?usp=sharing
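
The notebook does not show how the final models were saved. As one possible approach, the sketch below refits a TF-IDF plus Logistic Regression pipeline with the best `C` found by the grid search and stores it with `joblib`. The toy corpus, the file name, and the use of `joblib` are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned reviews (made-up examples).
texts = ["great film loved acting", "terrible plot boring film",
         "loved great story", "boring terrible acting"]
labels = [1, 0, 1, 0]

# Best grid-search setting for TF-IDF + Logistic Regression; L2 is the default penalty.
model = make_pipeline(
    TfidfVectorizer(lowercase=False, tokenizer=str.split, token_pattern=None),
    LogisticRegression(C=3.09693, solver="liblinear", max_iter=400, random_state=42),
)
model.fit(texts, labels)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "logreg_tfidf.joblib")
joblib.dump(model, path)

# Reload and check that the restored pipeline behaves like the original.
restored = joblib.load(path)
print(restored.predict(["loved great acting"]))
```

Saving the vectorizer and classifier together as one pipeline means raw text can be passed straight to `predict` after loading.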