# Hate Speech Classification Project Submission

**50.007 Machine Learning - Summer 2024**

**Group Name: Borky**

**Task 1:**
We implemented a plain logistic regression model and saved its predictions as `LogRed_Prediction_Plain.csv`, which obtained a score of 0.39226. We then modified the implementation to bring its performance closer to the scikit-learn logistic regression model by adding L2 regularization, adding a convergence check, and increasing the number of epochs so that the model could converge properly. The modified model scored 0.62412 (saved as `LogRed_Prediction.csv`), compared with 0.68446 for the scikit-learn model (saved as `SK_Learn_LogisticRegression_Predictions.csv`).

**Task 2:**
We experimented with reducing the dimensionality of the dataset using PCA and evaluating with KNN, using the following numbers of principal components: 100, 500, 1000, and 2000. Using 500 principal components yielded the highest F1 score of 0.57897, while 2000 components gave the lowest, 0.50479. This suggests that a moderate reduction in dimensionality retains the essential information while discarding noise, which improves the F1 score. On the local 80/20 validation split shown in the Task 2 section (weighted F1), 100 components scored highest (0.6031) and 2000 lowest (0.4858).

While a larger number of components captures more variance, it also increases computational cost and the risk of overfitting. The best performance at 500 components reflects a balance between reducing dimensionality and retaining meaningful information, which lets the KNN model generalize better.

**Task 3:**
We explored various classification models from scikit-learn and XGBoost. Our best score of 0.71870 was achieved using a Voting Classifier combining Random Forest, Extra Trees, and Logistic Regression.

**Results:**
- **First Attempt Score:** 0.69386 (PCA + Logistic Regression)
- **Random Forest Classifier Score:** 0.71254
- **Extra Trees Classifier Score:** 0.71424
- **Best Score (Voting Classifier):** 0.71870

**Classifiers Experimented With and Tuned:**
- Logistic Regression: Dimension reduction with PCA and TruncatedSVD
- SVM: Dimension reduction with PCA
- Decision Tree
- Random Forest: Dimension reduction with PCA
- AdaBoost
- Extra Trees
- GradientBoostingClassifier
- XGBClassifier
- Multinomial Naive Bayes
- Complement Naive Bayes
- Bernoulli Naive Bayes
- Voting Classifier of various other classifiers


## Task 1: Logistic Regression Implementation

### 1.1: Plain Logistic Regression. Score = 0.39226

In [1]:
import numpy as np
import pandas as pd

# Read csv files
train_df = pd.read_csv('./dataset/train_tfidf_features.csv')
test_df = pd.read_csv('./dataset/test_tfidf_features.csv')

X_train = train_df.drop(['id', 'label'], axis=1) # Features
y_train = train_df['label'] # Labels
X_test = test_df.drop(['id'], axis=1) # Test Features

print(X_train.head())
print(y_train.head())

     0    1    2    3    4    5    6    7    8    9  ...  4990  4991  4992  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   

   4993  4994  4995  4996  4997  4998  4999  
0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
1   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
2   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0  

[5 rows x 5000 columns]
0    1
1    0
2    1
3    0
4    1
Name: label, dtype: int64


In [3]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(y, y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(X, y, y_hat):
    m = X.shape[0]
    dw = (1/m) * np.dot(X.T, (y_hat - y))
    db = (1/m) * np.sum(y_hat - y)
    return dw, db

def train(X, y, bs, epochs, lr):
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0
    y = y.values.reshape(m,1)
    losses = []
    
    for epoch in range(epochs):
        for i in range((m - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = X[start_i:end_i]
            yb = y[start_i:end_i]
            
            # Calculate hypothesis
            y_hat = sigmoid(np.dot(xb, w) + b)
            
            # Getting gradients of loss
            dw, db = gradients(xb, yb, y_hat)
            
            # Update parameters
            w -= lr*dw
            b -= lr*db
            
        # Calculating loss and appending to list
        l = loss(y, sigmoid(np.dot(X, w) + b))
        losses.append(l)
        
    return w, b, losses

def predict(X, w, b):
    # Predicted probabilities, shape (m, 1)
    preds = sigmoid(np.dot(X, w) + b)
    # Threshold at 0.5 and flatten to a 1-D array of 0/1 labels
    return (np.asarray(preds) > 0.5).astype(int).ravel()

In [8]:
# Training
w, b, l = train(X_train, y_train, bs=64, epochs=100, lr=0.01)

# Save predictions to CSV for the test set
y_test_pred = predict(X_test, w, b)
predictions_df = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
predictions_df.to_csv('./predictions/LogRed_Prediction_Plain.csv', index=False)
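
The plain `sigmoid` and `loss` above can raise overflow warnings or return `nan` when `z` is very large in magnitude or a prediction saturates at exactly 0 or 1. A minimal sketch of more defensive variants is shown below; the names `stable_sigmoid` and `safe_log_loss` and the `eps` value are illustrative assumptions, not part of the submitted code.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)) so that exp(z) <= 1.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def safe_log_loss(y, y_hat, eps=1e-10):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-1000.0, 0.0, 1000.0])
probs = stable_sigmoid(z)
demo_loss = safe_log_loss(np.array([0.0, 1.0, 1.0]), probs)
print(probs, demo_loss)
```

Both helpers take and return NumPy arrays, so they could stand in for `sigmoid` and `loss` in the training loop without other changes.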

### 1.2: Modified Logistic Regression. Score = 0.62412

In [None]:
import numpy as np
import pandas as pd

train_df = pd.read_csv('./dataset/train_tfidf_features.csv')
test_df = pd.read_csv('./dataset/test_tfidf_features.csv')

X_train = train_df.drop(['id', 'label'], axis=1) # Features
y_train = train_df['label'] # Labels
X_test = test_df.drop(['id'], axis=1) # Test Features

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

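# Loss function with L2 regularization
def loss(y, y_hat, w, lambda_):
    y_hat = np.clip(y_hat, 1e-10, 1-1e-10) # Avoid log(0)
    log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Note: this penalty is scaled by lambda_/2, while gradients() scales its
    # penalty term by lambda_/m, so the reported loss is not exactly the
    # objective that the updates minimize.
    reg_loss = (lambda_ / 2) * np.sum(w**2) # L2 regularization
    return log_loss + reg_loss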

# Gradients calculation with regularization
def gradients(X, y, y_hat, w, lambda_):
    m = X.shape[0]
    dw = (1/m) * np.dot(X.T, (y_hat - y)) + (lambda_ / m) * w
    db = (1/m) * np.sum(y_hat - y)
    return dw, db

def train(X, y, bs, epochs, lr, lambda_):
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0
    y = y.values.reshape(m, 1)
    losses = []
    
    for epoch in range(epochs):
        for i in range((m - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = X[start_i:end_i]
            yb = y[start_i:end_i]
            
            y_hat = sigmoid(np.dot(xb, w) + b)
            
            dw, db = gradients(xb, yb, y_hat, w, lambda_)
            
            w -= lr * dw
            b -= lr * db
            
        l = loss(y, sigmoid(np.dot(X, w) + b), w, lambda_)
        losses.append(l)
        
        # Convergence check
        if epoch > 0 and np.abs(losses[-1] - losses[-2]) < 1e-6:
            print(f'Converged at epoch {epoch}')
            break
    
    return w, b, losses

def predict(X, w, b):
    # Predicted probabilities, shape (m, 1)
    preds = sigmoid(np.dot(X, w) + b)
    # Threshold at 0.5 and flatten to a 1-D array of 0/1 labels
    return (np.asarray(preds) > 0.5).astype(int).ravel()

lambda_ = 0.01
w, b, l = train(X_train, y_train, bs=64, epochs=1000, lr=0.01, lambda_=lambda_)

y_test_pred = predict(X_test, w, b)
predictions_df = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
predictions_df.to_csv('./predictions/LogRed_Prediction.csv', index=False)
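
One way to confirm that a regularized loss and its gradient agree is a finite-difference check. The sketch below assumes the penalty is written as `(lambda_ / (2 * m)) * np.sum(w**2)`, whose gradient is `(lambda_ / m) * w`, the form used in the `dw` term above. The function names `reg_loss` and `reg_gradients` and the synthetic data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def reg_loss(X, y, w, b, lambda_):
    """Mean log loss plus an L2 penalty scaled by lambda_ / (2 * m)."""
    m = X.shape[0]
    y_hat = np.clip(sigmoid(X @ w + b), 1e-10, 1 - 1e-10)
    log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return log_loss + (lambda_ / (2 * m)) * np.sum(w ** 2)

def reg_gradients(X, y, w, b, lambda_):
    """Analytic gradients of reg_loss with respect to w and b."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)
    dw = (X.T @ (y_hat - y)) / m + (lambda_ / m) * w
    db = np.sum(y_hat - y) / m
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=(20, 1)).astype(float)
w = rng.normal(size=(3, 1))
b, lambda_, h = 0.1, 0.5, 1e-6

dw, _ = reg_gradients(X, y, w, b, lambda_)

# Central finite differences, one weight at a time.
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (reg_loss(X, y, w + step, b, lambda_)
                  - reg_loss(X, y, w - step, b, lambda_)) / (2 * h)

max_diff = np.max(np.abs(dw - numeric))
print(max_diff)
```

A small `max_diff` indicates that the loss being logged and the gradient being applied describe the same objective.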


### 1.3: SKLearn Logistic Regression for comparison. Score = 0.68446

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

train_df = pd.read_csv('./dataset/train_tfidf_features.csv')
test_df = pd.read_csv('./dataset/test_tfidf_features.csv')

X_train = train_df.drop(['id', 'label'], axis=1)  # Features
y_train = train_df['label']  # Labels
X_test = test_df.drop(['id'], axis=1)  # Test Features

log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)
y_train_pred = log_reg.predict(X_train)

y_test_pred = log_reg.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({"id": test_df["id"], "label": y_test_pred})
output.to_csv("./predictions/SK_Learn_LogisticRegression_Predictions.csv", index=False)


## Task 2: PCA Dimension Reduction

In [2]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("./dataset/train_tfidf_features.csv")
test_df = pd.read_csv("./dataset/test_tfidf_features.csv")

X_train = train_df.drop(["id", "label"], axis=1)
y_train = train_df["label"]
X_test = test_df.drop(["id"], axis=1)

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [3]:
components_list = [2000, 1000, 500, 100]
results = []

for n_components in components_list:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_split)
    X_val_pca = pca.transform(X_val)
    X_test_pca = pca.transform(X_test)
    
    neigh = KNeighborsClassifier(n_neighbors=2)
    neigh.fit(X_train_pca, y_train_split)
    
    y_val_pred = neigh.predict(X_val_pca)
    
    f1 = f1_score(y_val, y_val_pred, average='weighted')
    results.append((n_components, f1))
    print(f"Components: {n_components}, F1 Score: {f1:.4f}")

    y_test_pred = neigh.predict(X_test_pca)

    output = pd.DataFrame({"id": test_df["id"], "label": y_test_pred})
    output.to_csv(f"./predictions/KNN_Predictions_{n_components}_components.csv", index=False)

print("Results Summary:")
for n_components, f1 in results:
    print(f"Number of components: {n_components}, F1 Score: {f1}")

Components: 2000, F1 Score: 0.4858
Components: 1000, F1 Score: 0.5881
Components: 500, F1 Score: 0.5954
Components: 100, F1 Score: 0.6031
Results Summary:
Number of components: 2000, F1 Score: 0.48576472368733437
Number of components: 1000, F1 Score: 0.588069875239095
Number of components: 500, F1 Score: 0.5954067618257358
Number of components: 100, F1 Score: 0.6031241190672758
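
To judge how much variance a given number of components retains, the cumulative explained variance can be inspected before choosing `n_components`. The sketch below uses a plain NumPy SVD on synthetic data rather than the TF-IDF features. The helper `explained_variance_ratio` is an illustrative stand-in for the `explained_variance_ratio_` attribute that scikit-learn's `PCA` exposes after fitting.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                 # PCA operates on centred data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(0)
# Synthetic data: 5 latent factors mixed into 50 features, plus small noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)
# Smallest number of components whose cumulative ratio reaches 95%.
k95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(k95, cumulative[:6])
```

On real data the curve usually flattens gradually, and the point where it flattens is a reasonable starting range for the component counts to compare.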


## Task 3: Implementation of Other Models

### 3.0. Loading dataset and accuracy functions

In [None]:
# Load dataset
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer

train_df = pd.read_csv("./dataset/train_tfidf_features.csv")
test_df = pd.read_csv("./dataset/test_tfidf_features.csv")

X_train = train_df.drop(["id", "label"], axis=1)
y_train = train_df["label"]
X_test = test_df.drop(["id"], axis=1)

import os
if not os.path.exists('./predictions'):
    os.makedirs('./predictions')

def cross_validation(model, X, y):
    scores = cross_val_score(model, X, y, cv=5, scoring=make_scorer(f1_score, average='weighted'))
    return scores.mean()

def save_predictions(y_pred, filename):
    output = pd.DataFrame({"id": test_df["id"], "label": y_pred})
    output.to_csv(filename, index=False)

In [21]:
# Functions for self-evaluation on training set
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

# Named class_f1_score so that it does not shadow sklearn.metrics.f1_score,
# which cross_validation() and later cells import and rely on.
def class_f1_score(y_true, y_pred, class_label):
    tp = np.sum((y_true == class_label) & (y_pred == class_label))
    fp = np.sum((y_true != class_label) & (y_pred == class_label))
    fn = np.sum((y_true == class_label) & (y_pred != class_label))
    
    # Avoid division by zero when the class never appears
    if tp + 0.5 * (fp + fn) == 0:
        return 0
    
    f1 = tp / (tp + 0.5 * (fp + fn))
    return f1

def macro_f1_score(y_true, y_pred):
    f1_hateful = class_f1_score(y_true, y_pred, class_label=1)
    f1_non_hateful = class_f1_score(y_true, y_pred, class_label=0)
    return (f1_hateful + f1_non_hateful) / 2

def self_evaluatation_training_set(y_train, y_train_pred):
    print("Training set accuracy:", accuracy(y_train, y_train_pred))
    print("F1 Score for Hateful (class 1):", class_f1_score(y_train, y_train_pred, 1))
    print("F1 Score for Non-Hateful (class 0):", class_f1_score(y_train, y_train_pred, 0))
    print("Macro F1 Score:", macro_f1_score(y_train, y_train_pred))
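
The per-class formula used above, F1 = TP / (TP + 0.5 * (FP + FN)), can be checked by hand on a toy example. In the sketch below, the helper `f1_for_class` and the label arrays are made up for illustration and are not part of the notebook.

```python
import numpy as np

def f1_for_class(y_true, y_pred, class_label):
    """F1 for one class, computed from TP, FP and FN counts."""
    tp = np.sum((y_true == class_label) & (y_pred == class_label))
    fp = np.sum((y_true != class_label) & (y_pred == class_label))
    fn = np.sum((y_true == class_label) & (y_pred != class_label))
    denom = tp + 0.5 * (fp + fn)
    return 0.0 if denom == 0 else tp / denom

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

f1_pos = f1_for_class(y_true, y_pred, 1)  # TP=2, FP=1, FN=1 -> 2 / 3
f1_neg = f1_for_class(y_true, y_pred, 0)  # TP=4, FP=1, FN=1 -> 4 / 5
macro = (f1_pos + f1_neg) / 2
print(f1_pos, f1_neg, macro)
```

Because the macro average weights both classes equally, the weaker minority-class F1 pulls the macro score below the accuracy of 0.75 on this toy data.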

### 3.1: Logistic Regression and SVM, with PCA and TSVD. Best score = 0.69597

In [None]:
# PCA dimensionality reduction and logistic regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV

components_list = [2000, 1000, 500, 100]
log_reg_cv = LogisticRegressionCV(cv=5, random_state=0) # Raising max_iter to 1000 made no difference

for n_components in components_list:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    log_reg_cv.fit(X_train_pca, y_train)

    y_pred = log_reg_cv.predict(X_test_pca)

    # Self evaluation on training set
    y_train_pred = log_reg_cv.predict(X_train_pca)
    print("Components:", n_components)
    self_evaluatation_training_set(y_train, y_train_pred)


    save_predictions(y_pred, f'./predictions/LogisticRegressionCV_PCA_{n_components}_components_Predictions.csv')
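
The self-evaluation above scores the model on the same data it was trained on, which tends to overstate performance. One alternative, sketched below, is to wrap PCA and the classifier in a `Pipeline` so that cross-validation refits PCA inside each fold. The dataset here is a synthetic stand-in from `make_classification`, and the component count and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the TF-IDF features (assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

pipe = Pipeline([
    ("pca", PCA(n_components=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# PCA is fitted on the training part of each fold only,
# so the held-out part never influences the projection.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1_macro")
print(scores.mean())
```

The same pattern would apply to the real features by passing `X_train` and `y_train` in place of the synthetic arrays.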

In [24]:
# SVM
from sklearn.svm import SVC

kernels = ['sigmoid', 'linear', 'rbf']

for kernel in kernels:
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_test_pred = svm.predict(X_test)

    #Self evaluation on training set
    y_train_pred = svm.predict(X_train)
    self_evaluatation_training_set(y_train, y_train_pred)
    
    save_predictions(y_test_pred, f"./predictions/svm_{kernel}_predictions_default.csv")

Training set accuracy: 0.7540735567970205
F1 Score for Hateful (class 1): 0.6373154823206316
F1 Score for Non-Hateful (class 0): 0.8139637260081
Macro F1 Score: 0.7256396041643658
Training set accuracy: 0.8127327746741154
F1 Score for Hateful (class 1): 0.7275651879444632
F1 Score for Non-Hateful (class 0): 0.8573328604362476
Macro F1 Score: 0.7924490241903555
Training set accuracy: 0.9362779329608939
F1 Score for Hateful (class 1): 0.9122947537044453
F1 Score for Non-Hateful (class 0): 0.94996115706256
Macro F1 Score: 0.9311279553835027


In [None]:
# PCA dimensionality reduction followed by SVM. SVMs typically perform better than
# logistic regression on high-dimensional, unstructured data such as images and text.
# We tried multiple kernels; they performed slightly better than logistic regression
# but took a long time to run.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=500)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

kernels = ['linear', 'poly', 'rbf', 'sigmoid']

for kernel in kernels:
    svm = SVC(kernel=kernel, random_state=0)
    svm.fit(X_train_pca, y_train)
    
    y_pred = svm.predict(X_test_pca)

    # Self evaluation on training set
    y_train_pred = svm.predict(X_train_pca)
    print("Kernel:", kernel)
    self_evaluatation_training_set(y_train, y_train_pred)

    save_predictions(y_pred, f'./predictions/SVM_PCA_500_components_{kernel}_Predictions.csv')

Kernel: linear
Training set accuracy: 0.7314944134078212
F1 Score for Hateful (class 1): 0.5715081723625557
F1 Score for Non-Hateful (class 0): 0.8044915254237288
Macro F1 Score: 0.6879998488931423
Kernel: poly
Training set accuracy: 0.8564944134078212
F1 Score for Hateful (class 1): 0.7715397443023903
F1 Score for Non-Hateful (class 0): 0.8953932298294731
Macro F1 Score: 0.8334664870659316
Kernel: rbf
Training set accuracy: 0.8780842644320298
F1 Score for Hateful (class 1): 0.8254893794252395
F1 Score for Non-Hateful (class 0): 0.9063184724768591
Macro F1 Score: 0.8659039259510493
Kernel: sigmoid
Training set accuracy: 0.6504888268156425
F1 Score for Hateful (class 1): 0.4891988433407042
F1 Score for Non-Hateful (class 0): 0.7343653250773994
Macro F1 Score: 0.6117820842090518


In [None]:
# TSVD and Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD

n_components_list = [100, 500, 1000]

for n_components in n_components_list:
    print(f"\nApplying TSVD with {n_components} components")
    
    tsvd = TruncatedSVD(n_components=n_components)
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    
    print(f"\nTraining Logistic Regression with {n_components} TSVD components")
    
    lr = LogisticRegression(random_state=0, max_iter=100)
    lr.fit(X_train_tsvd, y_train)
    
    y_pred = lr.predict(X_test_tsvd)
    
    # Self evaluation on training set
    y_train_pred = lr.predict(X_train_tsvd)
    self_evaluatation_training_set(y_train, y_train_pred)
    
    save_predictions(y_pred, f'./predictions/LogisticRegression_Predictions_{n_components}_components_TSVD.csv')

### 3.2: Decision Tree. Best score = 0.67732

In [None]:
# Decision Tree Classifier, no tuning
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree.fit(X_train, y_train)

y_pred = decision_tree.predict(X_test)

# Self evaluation on training set. Shows a lot of overfitting
y_train_pred = decision_tree.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred)

save_predictions(y_pred, './predictions/DecisionTree_Predictions.csv')

In [None]:
# Decision Tree Classifier with max_depth after considering overfitting. Performs worse
from sklearn.tree import DecisionTreeClassifier

max_depth_values = [10, 100, 500]

# Train and predict using Decision Tree for each max_depth value
for max_depth in max_depth_values:
    decision_tree = DecisionTreeClassifier(random_state=0, max_depth=max_depth)
    decision_tree.fit(X_train, y_train)
    
    y_pred = decision_tree.predict(X_test)
    
    save_predictions(y_pred, f'./predictions/DecisionTree_max_depth_{max_depth}_Predictions.csv')

### 3.3: Random Forest. Best score = 0.71254

In [None]:
# Random Forest Implementation, no tuning, first high score = 0.71254
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import f1_score, make_scorer

rf = RandomForestClassifier(n_estimators=200, random_state=0)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

# Self evaluation on training set
y_train_pred = rf.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred)

save_predictions(y_pred, './predictions/RandomForest_Predictions.csv')

In [None]:
# Random Forest Hyperparameter tuning, performs worse
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score
import numpy as np

param_distributions = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=0)

scorer = make_scorer(f1_score, average='macro')

random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_distributions, 
                                   n_iter=20, cv=3, scoring=scorer, verbose=2, random_state=0, n_jobs=-1)

random_search.fit(X_train, y_train)

best_params = random_search.best_params_
best_rf = random_search.best_estimator_

print("Best Parameters:", best_params)

y_pred = best_rf.predict(X_test)

save_predictions(y_pred, './predictions/RandomForest_Tuned_Predictions.csv')

# Self evaluation on training set
y_train_pred = best_rf.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Parameters: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': None}
Training set accuracy: 0.9917946927374302
F1 Score for Hateful (class 1): 0.9892358195282083
F1 Score for Non-Hateful (class 0): 0.9933706333160939
Macro F1 Score: 0.9913032264221511


In [None]:
# PCA and random forest, performs worse
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

n_components_list = [100, 500, 1000]

for n_components in n_components_list:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    n_estimators_list = [100, 200, 300]
    
    for n_estimators in n_estimators_list:
        print(f"\nTraining Random Forest with {n_estimators} estimators, {n_components} components")
        
        # Random Forest Implementation
        rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        rf.fit(X_train_pca, y_train)
        
        y_pred = rf.predict(X_test_pca)
        
        # Self evaluation on training set
        y_train_pred = rf.predict(X_train_pca)
        self_evaluatation_training_set(y_train, y_train_pred)
        
        # Save predictions
        save_predictions(y_pred, f'./predictions/RandomForest_Predictions_{n_components}_components_{n_estimators}_estimators.csv')


Training Random Forest with 100 estimators, 100 components
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954965269826731
F1 Score for Non-Hateful (class 0): 0.9972257488127145
Macro F1 Score: 0.9963611378976938

Training Random Forest with 200 estimators, 100 components
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954937752997786
F1 Score for Non-Hateful (class 0): 0.9972267920094007
Macro F1 Score: 0.9963602836545896

Training Random Forest with 300 estimators, 100 components
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954979015642884
F1 Score for Non-Hateful (class 0): 0.9972252269200019
Macro F1 Score: 0.9963615642421452

Training Random Forest with 100 estimators, 500 components
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954972143783867
F1 Score for Non-Hateful (class 0): 0.9972254878909005
Macro F1 Score: 0.9963613511346436

Training Random Forest with 200

In [8]:
# Random Forest Implementation, varying estimators, performs worse
from sklearn.ensemble import RandomForestClassifier

n_estimators_values = [180, 220]

for n_estimators in n_estimators_values:
    print(f"Training Random Forest with n_estimators={n_estimators}")
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    rf.fit(X_train, y_train)
    
    y_pred = rf.predict(X_test)
    
    y_train_pred = rf.predict(X_train)
    self_evaluatation_training_set(y_train, y_train_pred)
    
    save_predictions(y_pred, f"./predictions/RandomForest_{n_estimators}_estimators_Predictions.csv")

Training Random Forest with n_estimators=180
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954965269826731
F1 Score for Non-Hateful (class 0): 0.9972257488127145
Macro F1 Score: 0.9963611378976938
Training Random Forest with n_estimators=220
Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954965269826731
F1 Score for Non-Hateful (class 0): 0.9972257488127145
Macro F1 Score: 0.9963611378976938


In [9]:
# Random Forest Implementation, varying max_depth. performs worse
from sklearn.ensemble import RandomForestClassifier

max_depth_values = [10, 50, 100, 500]

for max_depth in max_depth_values:
    print(f"Training Random Forest with max_depth={max_depth}")
    rf = RandomForestClassifier(n_estimators=200, max_depth=max_depth, random_state=0)
    rf.fit(X_train, y_train)
    
    y_pred = rf.predict(X_test)
    
    y_train_pred = rf.predict(X_train)
    self_evaluatation_training_set(y_train, y_train_pred)
    
    save_predictions(y_pred, f"./predictions/RandomForest_max_depth_{max_depth}_Predictions.csv")

Training Random Forest with max_depth=10
Training set accuracy: 0.6213337988826816
F1 Score for Hateful (class 1): 0.0133434420015163
F1 Score for Non-Hateful (class 0): 0.7657077017246966
Macro F1 Score: 0.38952557186310643
Training Random Forest with max_depth=50
Training set accuracy: 0.8119762569832403
F1 Score for Hateful (class 1): 0.6755046700813498
F1 Score for Non-Hateful (class 0): 0.8676416369669412
Macro F1 Score: 0.7715731535241455
Training Random Forest with max_depth=100
Training set accuracy: 0.8948440409683427
F1 Score for Hateful (class 1): 0.8403851249889586
F1 Score for Non-Hateful (class 0): 0.9215950015186358
Macro F1 Score: 0.8809900632537973
Training Random Forest with max_depth=500
Training set accuracy: 0.9962756052141527
F1 Score for Hateful (class 1): 0.9951130116065975
F1 Score for Non-Hateful (class 0): 0.9969913501316284
Macro F1 Score: 0.996052180869113


In [None]:
# Random Forest with recursive feature elimination, performs worse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=1000, step=10)
X_train_selected = rfe.fit_transform(X_train, y_train)
X_test_selected = rfe.transform(X_test)
selected_features = X_train.columns[rfe.support_]

print(f"Number of selected features: {len(selected_features)}")

rf_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
rf_classifier.fit(X_train_selected, y_train)

test_predictions = rf_classifier.predict(X_test_selected)

save_predictions(test_predictions, './predictions/rf_rfe_submission.csv')

### 3.4: Other Ensemble Methods. Best score = 0.71424

In [4]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost Implementation, no tuning
ada = AdaBoostClassifier(n_estimators=200, random_state=0)

ada.fit(X_train, y_train)

y_pred_ada = ada.predict(X_test)

# Self evaluation on training set
y_train_pred_ada = ada.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_ada)

save_predictions(y_pred_ada, './predictions/AdaBoost_Predictions.csv')


Training set accuracy: 0.7410381750465549
F1 Score for Hateful (class 1): 0.6061946902654868
F1 Score for Non-Hateful (class 0): 0.8070920756025664
Macro F1 Score: 0.7066433829340266


In [None]:
# Extra Trees Classifier. New Best Score = 0.71424
from sklearn.ensemble import ExtraTreesClassifier

extra_trees = ExtraTreesClassifier(n_estimators=200, random_state=0)

extra_trees.fit(X_train, y_train)

y_pred_extra_trees = extra_trees.predict(X_test)

# Self evaluation on training set
y_train_pred_extra_trees = extra_trees.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_extra_trees)

save_predictions(y_pred_extra_trees, './predictions/ExtraTrees_Predictions.csv')


Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954903309638462
F1 Score for Non-Hateful (class 0): 0.9972280949025135
Macro F1 Score: 0.9963592129331799


In [8]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Implementation, no tuning
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)

gb.fit(X_train, y_train)

y_pred_gb = gb.predict(X_test)

# Self evaluation on training set
y_train_pred_gb = gb.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_gb)

save_predictions(y_pred_gb, './predictions/GradientBoosting_Predictions.csv')


Training set accuracy: 0.7321927374301676
F1 Score for Hateful (class 1): 0.5319365337672904
F1 Score for Non-Hateful (class 0): 0.8124388653407238
Macro F1 Score: 0.6721876995540071


### 3.5: Ensemble Methods using Voting Classifier. Best score = 0.71870

In [22]:
# Voting Classifier with RandomForest, ExtraTrees and LogisticRegression. Best score = 0.71870
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)
lr = LogisticRegression(random_state=0)

voting = VotingClassifier(estimators=[
    ('rf', rf),
    ('et', et),
    ('lr', lr)
], voting='hard')

voting.fit(X_train, y_train)

y_pred_voting = voting.predict(X_test)

# Self evaluation on training set
y_train_pred_voting = voting.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_voting)

save_predictions(y_pred_voting, './predictions/Voting_Predictions.csv')


Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954937752997786
F1 Score for Non-Hateful (class 0): 0.9972267920094007
Macro F1 Score: 0.9963602836545896
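
With `voting='hard'`, the ensemble predicts the label chosen by the majority of its estimators. A minimal NumPy sketch of that rule for binary labels follows; the prediction rows are made up for illustration and do not come from the fitted models.

```python
import numpy as np

# One row per estimator, one column per sample (made-up predictions).
preds = np.array([
    [1, 0, 1, 0, 1],  # e.g. random forest
    [1, 1, 0, 0, 1],  # e.g. extra trees
    [0, 1, 1, 0, 1],  # e.g. logistic regression
])

# Count the votes for class 1 and predict 1 where a strict majority agrees.
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one > preds.shape[0] / 2).astype(int)
print(majority)
```

With three voters there are no ties. With an even number of voters, a tie in this sketch falls to class 0 because a strict majority is required.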


In [24]:
# Voting Classifier with RandomForest and ExtraTrees.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)

voting = VotingClassifier(estimators=[
    ('rf', rf),
    ('et', et),
], voting='hard')

voting.fit(X_train, y_train)

y_pred_voting = voting.predict(X_test)

# Self evaluation on training set
y_train_pred_voting = voting.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_voting)

save_predictions(y_pred_voting, './predictions/Voting_RF_ET_Predictions.csv')


Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954903309638462
F1 Score for Non-Hateful (class 0): 0.9972280949025135
Macro F1 Score: 0.9963592129331799


In [25]:
# Voting Classifier with RandomForest, ExtraTrees and GradientBoosting.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)

voting = VotingClassifier(estimators=[
    ('rf', rf),
    ('et', et),
    ('gb', gb)
], voting='hard')

voting.fit(X_train, y_train)

y_pred_voting = voting.predict(X_test)

# Self evaluation on training set
y_train_pred_voting = voting.predict(X_train)
self_evaluatation_training_set(y_train, y_train_pred_voting)

save_predictions(y_pred_voting, './predictions/Voting_RF_ET_GB_Predictions.csv')


Training set accuracy: 0.996566573556797
F1 Score for Hateful (class 1): 0.9954930868535635
F1 Score for Non-Hateful (class 0): 0.997227052685999
Macro F1 Score: 0.9963600697697812


### 3.6: Naive Bayes. Best score = 0.70821

Note: we did not try Gaussian Naive Bayes because it is better suited to continuous, normally distributed features.

#### 1: Multinomial Naive Bayes

In [9]:
# model: Multinomial Naive Bayes
# key hyperparameter: alpha 1.0
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

# predict on validation set
y_val_pred = model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score for MultinomialNB: {macro_f1}")

# predict on test set
y_test_pred = model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('MultinomialNB_1.csv', index=False)  # Kaggle submission name: task3_MultinomialNB.csv

Macro F1 score for MultinomialNB: 0.6666332622584117
Predictions saved for MultinomialNB.
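
The `alpha` hyperparameter controls additive (Laplace/Lidstone) smoothing of the per-class feature probabilities, which take the form `(N_yi + alpha) / (N_y + alpha * n)`, where `N_yi` is the total weight of feature `i` in class `y`, `N_y` is the class total, and `n` is the number of features. The sketch below computes these on a toy count matrix; the data and the helper name `smoothed_feature_probs` are illustrative.

```python
import numpy as np

def smoothed_feature_probs(X, y, alpha):
    """Per-class smoothed feature probabilities for multinomial Naive Bayes."""
    n_features = X.shape[1]
    probs = {}
    for c in np.unique(y):
        counts = X[y == c].sum(axis=0)  # N_yi for each feature i
        probs[int(c)] = (counts + alpha) / (counts.sum() + alpha * n_features)
    return probs

X_toy = np.array([[2.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0]])
y_toy = np.array([0, 0, 1])

p_small = smoothed_feature_probs(X_toy, y_toy, alpha=0.1)
p_one = smoothed_feature_probs(X_toy, y_toy, alpha=1.0)
p_large = smoothed_feature_probs(X_toy, y_toy, alpha=4.0)
print(p_small[0].round(3), p_one[0].round(3), p_large[0].round(3))
```

Larger `alpha` values pull the probabilities toward uniform, so features never seen in a class are penalized less heavily.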


In [11]:
# Best Naive Bayes score = 0.70821
# model: Multinomial Naive Bayes with GridSearchCV, SMOTE
# key hyperparameters:
# - alpha in [0.1, 0.5, 1.0, 2.0]; best is 2.0
# - 3-fold cross-validation

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training split only, to oversample the minority class
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = MultinomialNB()
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0] 
}
grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=3, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote) #hyperparameter tuning

best_params = grid_search.best_params_
best_score = grid_search.best_score_
print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('MultinomialNB_2.csv', index=False)  # name in Kaggle submission: task3_MultinomialNB_simplified.csv

Best parameters: {'alpha': 2.0}
Macro F1 score on validation set: 0.6849255756869093
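
SMOTE balances the training split by creating synthetic minority samples rather than duplicating existing ones. The sketch below illustrates only the interpolation idea, using NumPy on made-up 2-D points. It is not imblearn's implementation, and `smote_like_sample` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples in a 2-D feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation weight in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so it stays inside the region the minority class already occupies. In the cells of this section the resampling is applied to `X_train` only, so the validation split contains no synthetic rows.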


In [12]:
# model: Multinomial Naive Bayes with GridSearchCV, SMOTE (improved)
# key hyperparameter: 
# - alpha [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 10.0] , best is 4.0
# - cross validation fold 6 , best is 6 after testing a few values

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = MultinomialNB()
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 10.0]
}
grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=6, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('MultinomialNB_best.csv', index=False)

Best parameters: {'alpha': 4.0}
Macro F1 score on validation set: 0.6865894030598448


#2 Complement Naive Bayes

In [13]:
# model: Complement Naive Bayes with SMOTE
# key hyperparameter: 
# - alpha 1.0

import pandas as pd
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = ComplementNB(alpha=1.0)
model.fit(X_train_smote, y_train_smote)

y_val_pred = model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score for ComplementNB: {macro_f1}")

y_test_pred = model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('ComplementNB_1.csv', index=False)

Macro F1 score for ComplementNB: 0.687515848490134


In [16]:
# model: Complement Naive Bayes with SMOTE
# key hyperparameter: 
# - alpha 0.01, 0.1, 0.5, 1.0, 2.0 , best is 2.0
# - cross validation fold 5

import pandas as pd
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])  
y = train_df['label']
X_test = test_df.drop(columns=['id'])  

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = ComplementNB()
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}

grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=5, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote)

best_params = grid_search.best_params_
best_score = grid_search.best_score_
print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('ComplementNB_2.csv', index=False)

Best parameters: {'alpha': 2.0}
Macro F1 score on validation set: 0.6849255756869093


In [17]:
# model: Complement Naive Bayes with SMOTE (improved)
# key hyperparameter: 
# - alpha [2.0, 4.0, 6.0, 8.0, 10.0] , best is 6.0
# - cross validation fold 10 , best is 10 after testing a few values

import pandas as pd
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = ComplementNB()
param_grid = {'alpha': [2.0, 4.0, 6.0, 8.0, 10.0]}

grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=10, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('ComplementNB_best.csv', index=False)

Best parameters: {'alpha': 6.0}
Macro F1 score on validation set: 0.6906421879781679


#3 Bernoulli Naive Bayes

In [21]:
# model: Bernoulli Naive Bayes with SMOTE
# key hyperparameter: 
# - alpha [0.1, 0.5, 1.0, 2.0, 4.0, 6.0] , best is 4.0
# - cross validation fold 5

import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import Binarizer
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])  
y = train_df['label']
X_test = test_df.drop(columns=['id'])

# Binarize the features
binarizer = Binarizer()
X_binarized = binarizer.fit_transform(X)
X_test_binarized = binarizer.transform(X_test)

X_train, X_val, y_train, y_val = train_test_split(X_binarized, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = BernoulliNB()
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 4.0, 6.0]}

grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=5, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test_binarized)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('BernoulliNB_1.csv', index=False)

Best parameters: {'alpha': 4.0}
Macro F1 score on validation set: 0.7043370429336944


In [19]:
# model: Bernoulli Naive Bayes with SMOTE
# key hyperparameter: 
# - alpha [0.5, 1.0, 2.0, 4.0, 6.0] , best is 2.0
# - cross validation fold 10, best is 10 after testing a few values

import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import Binarizer
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])  
y = train_df['label']
X_test = test_df.drop(columns=['id'])

# Binarize the features
binarizer = Binarizer()
X_binarized = binarizer.fit_transform(X)
X_test_binarized = binarizer.transform(X_test)

X_train, X_val, y_train, y_val = train_test_split(X_binarized, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

model = BernoulliNB()
param_grid = {'alpha': [0.5, 1.0, 2.0, 4.0, 6.0]}

grid_search = GridSearchCV(model, param_grid, scoring='f1_macro', cv=10, n_jobs=1)
grid_search.fit(X_train_smote, y_train_smote)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters: {best_params}")

best_model = grid_search.best_estimator_
y_val_pred = best_model.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score on validation set: {macro_f1}")

y_test_pred = best_model.predict(X_test_binarized)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('BernoulliNB_best.csv', index=False)

Best parameters: {'alpha': 2.0}
Macro F1 score on validation set: 0.6988315321643366
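
The Bernoulli cells above binarize the TF-IDF features with `Binarizer` before fitting. As a side note, `BernoulliNB` can also threshold its input itself through its `binarize` parameter (default `0.0`). The sketch below, on synthetic data standing in for TF-IDF values (`X_toy` and `y_toy` are made up), checks that both routes give the same model.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import Binarizer

rng = np.random.default_rng(0)

# Hypothetical sparse, non-negative features standing in for TF-IDF values.
X_toy = rng.random((200, 20)) * (rng.random((200, 20)) < 0.3)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Route 1: binarize explicitly, then tell BernoulliNB not to binarize again.
X_bin = Binarizer(threshold=0.0).fit_transform(X_toy)
explicit = BernoulliNB(alpha=1.0, binarize=None).fit(X_bin, y_toy)

# Route 2: let BernoulliNB threshold the raw features itself.
implicit = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_toy, y_toy)

same_predictions = np.array_equal(explicit.predict(X_bin), implicit.predict(X_toy))
same_parameters = np.allclose(explicit.feature_log_prob_, implicit.feature_log_prob_)
print("Identical predictions:", same_predictions)
print("Identical parameters:", same_parameters)
```

Either route works. Relying on the built-in threshold means the same raw feature matrix can be shared with models that need the original TF-IDF weights.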


#4 All 3 Naive Bayes models: Multinomial, Complement and Bernoulli

In [22]:
# model: 3 Naive Bayes Models with SMOTE
# key hyperparameter: 
# - alpha 1.0

import pandas as pd
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

binarizer = Binarizer()
X_binarized = binarizer.fit_transform(X)
X_test_binarized = binarizer.transform(X_test)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_bin, X_val_bin = train_test_split(X_binarized, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_bin_smote, y_train_bin_smote = smote.fit_resample(X_train_bin, y_train)

model_multinomial = MultinomialNB()
model_complement = ComplementNB()
model_bernoulli = BernoulliNB()

# Create an ensemble with VotingClassifier
ensemble = VotingClassifier(estimators=[
    ('multinomial', model_multinomial),
    ('complement', model_complement),
    ('bernoulli', model_bernoulli)
], voting='hard')  # 'hard' voting

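# VotingClassifier clones and refits its estimators, so fitting the three models
# separately would not affect the ensemble. BernoulliNB thresholds its input at
# binarize=0.0 by default, so the raw TF-IDF features can be passed to all three.
ensemble.fit(X_train_smote, y_train_smote)

# Score on the held-out validation split. The value recorded below came from a run
# that fit the ensemble on (X_val, y_val) itself, so it overstates performance.
y_val_pred = ensemble.predict(X_val)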
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score for VotingClassifier with Naive Bayes models: {macro_f1}")

y_test_pred = ensemble.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('NaiveBayesEnsemble.csv', index=False)

Macro F1 score for VotingClassifier with Naive Bayes models: 0.8428333829608325


In [28]:
# model: 3 Naive Bayes Models with SMOTE
# key hyperparameter: 
# - multinomial alpha [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0] , best is 2.0
# - complement alpha [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0] , best is 2.0
# - bernoulli alpha [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0] , best is 4.0
# - cross validation fold 5

import pandas as pd
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import Binarizer
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

train_df = pd.read_csv('train_tfidf_features.csv')
test_df = pd.read_csv('test_tfidf_features.csv')

X = train_df.drop(columns=['id', 'label'])
y = train_df['label']
X_test = test_df.drop(columns=['id'])

binarizer = Binarizer()
X_binarized = binarizer.fit_transform(X)
X_test_binarized = binarizer.transform(X_test)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_bin, X_val_bin = train_test_split(X_binarized, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_bin_smote, y_train_bin_smote = smote.fit_resample(X_train_bin, y_train)

param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0]}

# Set up GridSearchCV for each Naive Bayes model
grid_search_multinomial = GridSearchCV(MultinomialNB(), param_grid, scoring='f1_macro', cv=5, n_jobs=1)
grid_search_complement = GridSearchCV(ComplementNB(), param_grid, scoring='f1_macro', cv=5, n_jobs=1)
grid_search_bernoulli = GridSearchCV(BernoulliNB(), param_grid, scoring='f1_macro', cv=5, n_jobs=1)

grid_search_multinomial.fit(X_train_smote, y_train_smote)
grid_search_complement.fit(X_train_smote, y_train_smote)
grid_search_bernoulli.fit(X_train_bin_smote, y_train_bin_smote)

# Get the best models
best_multinomial = grid_search_multinomial.best_estimator_
best_complement = grid_search_complement.best_estimator_
best_bernoulli = grid_search_bernoulli.best_estimator_

print(f"Best MultinomialNB alpha: {grid_search_multinomial.best_params_['alpha']}")
print(f"Best ComplementNB alpha: {grid_search_complement.best_params_['alpha']}")
print(f"Best BernoulliNB alpha: {grid_search_bernoulli.best_params_['alpha']}")

# Create an ensemble with VotingClassifier using the best models
ensemble = VotingClassifier(estimators=[
    ('multinomial', best_multinomial),
    ('complement', best_complement),
    ('bernoulli', best_bernoulli)
], voting='hard')

ensemble.fit(X_train_smote, y_train_smote)
y_val_pred = ensemble.predict(X_val)
macro_f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Macro F1 score for VotingClassifier with Naive Bayes models: {macro_f1}")

y_test_pred = ensemble.predict(X_test)
submission = pd.DataFrame({'id': test_df['id'], 'label': y_test_pred})
submission.to_csv('NaiveBayesEnsemble_best.csv', index=False)

Best MultinomialNB alpha: 2.0
Best ComplementNB alpha: 2.0
Best BernoulliNB alpha: 4.0
Macro F1 score for VotingClassifier with Naive Bayes models: 0.6849255756869093
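
Two properties of `VotingClassifier` are worth keeping in mind when reading the ensemble cells: it trains clones of the estimators it is given, and with `voting='hard'` it predicts the majority label among them. The sketch below demonstrates both on synthetic data. `X_toy`, `y_toy` and the `binarize=0.5` threshold are choices made for this illustration only.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB

rng = np.random.default_rng(1)

# Hypothetical non-negative features and binary labels.
X_toy = rng.random((300, 15))
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

mnb, cnb, bnb = MultinomialNB(), ComplementNB(), BernoulliNB(binarize=0.5)
ensemble = VotingClassifier(
    estimators=[("multinomial", mnb), ("complement", cnb), ("bernoulli", bnb)],
    voting="hard",
)
ensemble.fit(X_toy, y_toy)

# The ensemble trains clones, so the objects passed in are left unfitted.
original_fitted = hasattr(mnb, "classes_")
print("Original MultinomialNB fitted:", original_fitted)

# Reproduce the hard vote by hand from the fitted clones in estimators_.
votes = np.stack([est.predict(X_toy) for est in ensemble.estimators_])
majority = (votes.sum(axis=0) >= 2).astype(int)  # at least 2 of 3 vote for class 1
matches = np.array_equal(majority, ensemble.predict(X_toy))
print("Manual majority matches ensemble.predict:", matches)
```

Since only the clones are trained, the data passed to `ensemble.fit` is what determines the ensemble's predictions, regardless of whether the individual estimators were fitted beforehand.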


### 3.7: Gradient Boosting. Best score = 0.68816

In [None]:
import numpy as np
import pandas as pd

df_raw = pd.read_csv('train_tfidf_features.csv')
# df_text = pd.read_csv('train.csv')
df_raw_comp = pd.read_csv('test_tfidf_features.csv')
df_comp_text = pd.read_csv('test.csv')

from sklearn.model_selection import train_test_split
# Select the feature columns, skipping the first two columns (id, label). The CSV header row is consumed by read_csv.
df_features = df_raw.iloc[:, 2:]
df_labels = df_raw.iloc[:, 1]

# Select the feature columns of the competition submission dataset, skipping the first column (id). The CSV header row is consumed by read_csv.
df_features_comp = df_raw_comp.iloc[:, 1:]

# Convert the DataFrame to a numpy array
features = df_features.to_numpy()
labels = df_labels.to_numpy()[:,np.newaxis]

features_comp = df_features_comp.to_numpy()

#### XGB

We used GridSearch and RandomSearch to search for the best hyperparameters.  

The chosen hyperparameters for GridSearch were:  
* learning rate
* number of trees (estimators) created
* max depth per tree
* subsample (fraction of samples used to fit each tree)
* colsample_bytree (fraction of features used to fit each tree)

Because RandomSearch runs faster than GridSearch, the following additional hyperparameters were included in its search:
* reg_alpha (L1 regularization term on weights)
* reg_lambda (L2 regularization term on weights)

#### Search for parameters

In [None]:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

xgb_model = XGBClassifier(tree_method="hist", objective='binary:logistic', n_jobs=4)

# Grid Search
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='f1', n_jobs=4)
grid_search.fit(features, labels)
print("Best parameters found by grid search:", grid_search.best_params_)


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(3, 10),
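    # scipy's uniform(loc, scale) samples from [loc, loc + scale]
    'learning_rate': uniform(0.01, 0.1),     # [0.01, 0.11]
    'subsample': uniform(0.6, 0.4),          # [0.6, 1.0]; values above 1 are invalid
    'colsample_bytree': uniform(0.6, 0.4),   # [0.6, 1.0]; values above 1 are invalid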
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(1, 2)
}

xgb_model = XGBClassifier(tree_method="hist", device="cpu", objective='binary:logistic', n_jobs=4)

random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_dist, n_iter=10, cv=3, scoring='f1', random_state=42, n_jobs=4)
random_search.fit(features, labels)
print("Best parameters found by random search:", random_search.best_params_)
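
The distributions in `param_dist` come from `scipy.stats`, where `uniform(loc, scale)` draws from `[loc, loc + scale]` rather than from `[low, high]`, and `randint(low, high)` excludes the upper bound. Fractions such as `subsample` and `colsample_bytree` must stay within (0, 1], so a range from 0.6 to 1.0 corresponds to `uniform(0.6, 0.4)`. A short sketch to confirm the sampled ranges (the variable names are chosen for this example):

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [low, high].
subsample_dist = uniform(0.6, 0.4)       # samples in [0.6, 1.0]
learning_rate_dist = uniform(0.01, 0.1)  # samples in [0.01, 0.11]

# randint(low, high) excludes the upper bound: integers 3 to 9 here.
max_depth_dist = randint(3, 10)

subsample_draws = subsample_dist.rvs(size=1000, random_state=42)
lr_draws = learning_rate_dist.rvs(size=1000, random_state=42)
depth_draws = max_depth_dist.rvs(size=1000, random_state=42)

print("subsample range:", subsample_draws.min(), subsample_draws.max())
print("learning_rate range:", lr_draws.min(), lr_draws.max())
print("max_depth range:", depth_draws.min(), depth_draws.max())
```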

Based on the results of GridSearch and RandomSearch, the following hyperparameter sets were used to train the XGB classifier (with the resulting score where one was recorded):  
* n_estimators=300, max_depth=7, learning_rate=0.1, reg_alpha=0, reg_lambda=1 (score: 0.680)
* n_estimators=500, max_depth=7, learning_rate=0.1, reg_alpha=0, reg_lambda=1 (score: 0.734)
* n_estimators=500, max_depth=10, learning_rate=0.1, reg_alpha=0, reg_lambda=1, subsample=0.8
* n_estimators=781, max_depth=9, learning_rate=0.06247746602583892, reg_lambda=2.9475110376829186, subsample=0.8327713404303042, reg_alpha=0.04666566321361543, colsample_bytree=0.6230624250414157

#### Trainer

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Explored PCA, keeping enough components to explain 95% of the variance, though it did not seem to improve the model's F1 score.
pca = PCA(n_components=0.95)
clf = XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.1, reg_alpha=0, reg_lambda=1, subsample=0.8, objective='binary:logistic')
pipeline = Pipeline(steps=[('pca', pca), ('clf', clf)])
pipeline.fit(features, labels)
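
Passing a float between 0 and 1 as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on synthetic data (the `scales` vector and `X_toy` are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 features whose variances fall off quickly.
scales = np.array([10, 8, 6, 4, 2, 1, 0.5, 0.3, 0.2, 0.1])
X_toy = rng.normal(size=(500, 10)) * scales

pca = PCA(n_components=0.95).fit(X_toy)
kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()

print(f"Components kept: {kept} of {X_toy.shape[1]}")
print(f"Variance explained: {explained:.4f}")
```

The number of components retained therefore depends on the data, and it can be read from `n_components_` after fitting.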

In [None]:
from sklearn.metrics import f1_score

# F1 on the training data itself; it measures fit, not generalization.
preds = pipeline.predict(features)
f1 = f1_score(labels, preds)
print("F1 Score:", f1)

#### Make Predictions

In [None]:
predictions = pipeline.predict(features_comp)

# copy df_comp_text to df_submission
df_submission = df_comp_text.copy()
# Add the predictions to the df_submission DataFrame
df_submission['label'] = predictions
# remove the post column
df_submission = df_submission.drop(columns=['post'])
df_submission.to_csv('submission.csv', index=False)