# Part 3: advanced model

Many different methods were tried for the advanced model. The ones below yielded meaningful results.

In [1]:
import numpy as np
from scipy.sparse import csr_matrix, hstack, save_npz
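# Load the precomputed features; allow_pickle is needed because the
# sparse matrices are stored as pickled objects inside 0-d arrays.
data = np.load("../advanced_features.npz", allow_pickle=True)
X_train = data['X_train'].item()  # .item() unwraps the sparse matrix
X_val = data['X_val'].item()
Y_train = data['Y_train'].ravel()  # flatten labels to 1-D
Y_val = data['Y_val'].ravel()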


### naive bayes with tf-idf:

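This model uses multinomial naive Bayes (the variant that models word-frequency features) together with tf-idf encoding. It yields noticeably poorer results than the support vector machine in the next section: roughly 0.83 on every metric, versus roughly 0.88.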

In [2]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
import pickle

# alpha=0 with force_alpha=True disables additive (Laplace) smoothing entirely
model = MultinomialNB(alpha=0,force_alpha=True)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_val)
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))

with open("../advanced_model_NB.pkl",'wb') as f:
    pickle.dump(model,f)

del model,Y_pred

Accuracy: 0.8306951054703566
Precision: 0.8352561144439317
Recall: 0.8185366782501071
F1 score: 0.8268118816642024


  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(


### support vector machine with tf-idf:

This is the complex model we have settled on, as it produces the best results of all the tested models.

`LinearSVC` has trouble converging when fitted on this dataset. From experimenting with hyperparameters, I have found that beyond roughly 100 iterations the metrics no longer improve and plateau at around 0.88 across the board. The largest value tried was 1000 iterations.

sklearn also provides `SVC`, which supports nonlinear kernels and might produce better results. However, the fit time of `SVC` scales at least quadratically with the number of samples. Since our dataset has close to a million samples, we use `LinearSVC` instead.

Attempts to fit the regular `SVC`, even on fractions of the 995,000-sample dataset, ran for hours without producing results.
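
To give a rough sense of why a kernel SVM is impractical at this scale, here is a back-of-the-envelope sketch. It assumes 8-byte float64 kernel values and a hypothetical 10,000-row subsample as the baseline; libsvm does not actually materialize the full kernel matrix, so the first figure is only an upper-bound illustration:

```python
n_samples = 995_000   # dataset size quoted in the text
bytes_per_entry = 8   # float64 kernel values (assumption)

# Size of a dense n x n kernel (Gram) matrix, if it were fully materialized.
gram_bytes = n_samples ** 2 * bytes_per_entry
print(f"dense kernel matrix: {gram_bytes / 1e12:.2f} TB")


def quadratic_slowdown(n_small, n_full):
    """Factor by which an O(n^2) fit slows down going from n_small to n_full."""
    return (n_full / n_small) ** 2


# Hypothetical baseline: a fit on a 10,000-row subsample.
print(f"slowdown vs 10k rows: {quadratic_slowdown(10_000, n_samples):,.0f}x")
```

Under purely quadratic scaling, a fit that takes one minute on 10,000 rows would take close to a week on the full dataset, and the real scaling can be worse than quadratic.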

In [3]:
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
import pickle

model1 = LinearSVC(dual="auto",tol=1e-3, C=0.5,max_iter=300)
model1.fit(X_train,Y_train)
Y_pred = model1.predict(X_val)

# LinearSVC uses a linear kernel, so label the report accordingly
print("Linear SVM (LinearSVC) validation metrics:")
print("Accuracy:",accuracy_score(Y_val,Y_pred))
print("Precision:",precision_score(Y_val,Y_pred,average="binary"))
print("Recall:",recall_score(Y_val,Y_pred,average="binary"))
print("F1 score:",f1_score(Y_val,Y_pred,average="binary"))

with open("../advanced_model_SVC.pkl",'wb') as f:
    pickle.dump(model1,f)

del Y_pred, model1


Linear SVM (LinearSVC) validation metrics:
Accuracy: 0.8823197602679358
Precision: 0.8875781212150574
Recall: 0.8721140572190222
F1 score: 0.8797781406069919
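
As a quick consistency check on the reported numbers, F1 should equal the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values printed in the two cell outputs (the helper name `f1_from` is just for illustration):

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the cell outputs.
reported = {
    "MultinomialNB": (0.8352561144439317, 0.8185366782501071, 0.8268118816642024),
    "LinearSVC": (0.8875781212150574, 0.8721140572190222, 0.8797781406069919),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1={f1_from(p, r):.6f}, reported F1={f1:.6f}")
```

Both models have precision slightly above recall, so each one misses somewhat more positives than it falsely flags.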


