# Part 3: advanced model

Many different methods were tried for the advanced model. These are the ones that yielded meaningful results.

In [2]:
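import numpy as np
from scipy.sparse import csr_matrix, hstack, save_npz

# The feature matrices were stored as pickled objects, so allow_pickle=True is required
data = np.load("../advanced_features.npz", allow_pickle=True)
# .item() unwraps the 0-d object array that holds each feature matrix
X_train = data['X_train'].item()
X_test = data['X_test'].item()
# Flatten the label arrays to 1-D
Y_train = data['Y_train'].ravel()
Y_test = data['Y_test'].ravel()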
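
The `.item()` calls above suggest that the feature matrices were saved as SciPy sparse matrices through `np.savez`. That is an assumption about how `advanced_features.npz` was produced, not something the notebook states. A minimal sketch of that round trip, using toy data and a hypothetical file name:

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, issparse

# Toy stand-ins for the real feature matrix and labels (hypothetical data).
X_demo = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.5]]))
y_demo = np.array([[0], [1]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_features.npz")  # hypothetical file name
    # np.savez has no native sparse format, so the matrix ends up wrapped
    # in a 0-d object array and pickled.
    np.savez(path, X_train=X_demo, Y_train=y_demo)

    # allow_pickle=True is needed to read the pickled object back.
    with np.load(path, allow_pickle=True) as archive:
        wrapped = archive["X_train"]           # 0-d array, dtype=object
        X_loaded = wrapped.item()              # unwrap to the csr_matrix
        y_loaded = archive["Y_train"].ravel()  # flatten (n, 1) -> (n,)

print(wrapped.shape, wrapped.dtype)
print(issparse(X_loaded), X_loaded.shape, y_loaded.shape)
```

Only load pickled archives from sources you trust, since unpickling can execute arbitrary code.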

### Logistic regression with tf-idf encoding:

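This model yields good results thanks to the tf-idf encoding. However, it still relies on logistic regression, which is a comparatively simple method. Note that with the default iteration limit the solver stops before converging, as the warning in the output below shows.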

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Fit on the tf-idf features; max_iter is left at its default value
model = LogisticRegression(tol=1e-3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred))

# Free memory before the next experiment
del model, Y_pred

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

           0       0.87      0.88      0.88     43036
           1       0.88      0.86      0.87     42060

    accuracy                           0.87     85096
   macro avg       0.87      0.87      0.87     85096
weighted avg       0.87      0.87      0.87     85096
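
The warning in the output above recommends raising `max_iter`. The sketch below shows one way to check whether a given iteration cap is enough. It runs on synthetic data, not on the tf-idf features, and the helper `fit_and_check` is a hypothetical name introduced for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical); the real features are tf-idf vectors.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)


def fit_and_check(max_iter):
    """Fit with the given iteration cap and report whether the solver converged."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(tol=1e-3, max_iter=max_iter).fit(X_demo, y_demo)
    converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, converged


for cap in (2, 1000):
    model, converged = fit_and_check(cap)
    print(f"max_iter={cap}: iterations used={model.n_iter_[0]}, converged={converged}")
```

On the real dataset the cap that suffices may differ, and scaling the features or switching solvers are the other options the warning mentions.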



### Naive Bayes with tf-idf:

Multinomial naive Bayes (the variant that incorporates word-frequency features) is used here in conjunction with the tf-idf encoding. It yields poorer results than logistic regression.

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Multinomial naive Bayes with default smoothing, fitted on the tf-idf features
model = MultinomialNB()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(classification_report(Y_test, Y_pred))

# Free memory before the next experiment
del model, Y_pred

              precision    recall  f1-score   support

           0       0.81      0.83      0.82     43036
           1       0.82      0.80      0.81     42060

    accuracy                           0.82     85096
   macro avg       0.82      0.82      0.82     85096
weighted avg       0.82      0.82      0.82     85096
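
For readers who want to see the multinomial naive Bayes setup described above end to end, here is a small sketch. The corpus, the labels, and the use of `TfidfVectorizer` with default settings are all assumptions made for illustration. The notebook loads precomputed features and does not show how they were built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus; the 0/1 labels are arbitrary for this illustration.
texts = [
    "markets rally on earnings",
    "earnings beat forecasts",
    "aliens built the pyramids",
    "pyramids hide alien secrets",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X_demo = vectorizer.fit_transform(texts)  # sparse matrix of tf-idf weights

# MultinomialNB accepts the fractional tf-idf weights in place of raw counts.
nb = MultinomialNB(alpha=1.0)
nb.fit(X_demo, labels)

pred = nb.predict(vectorizer.transform(["earnings forecasts"]))
print(pred)
```

Both query words occur only in class 0 documents, so the classifier assigns the query to class 0.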



### Support vector machine with tf-idf:

This is the complex model we have settled on, as it produces the best results of all tested models.

LinearSVC has trouble converging when fitted to this dataset. From experimenting with hyperparameters, I found that beyond roughly 100 iterations the metrics stop improving and plateau at around 0.88 for precision, recall, and F1-score. The largest value tried was max_iter=1000.

scikit-learn also provides SVC, which supports nonlinear kernels and might produce better results. However, the training time of SVC scales at least quadratically with the number of samples. Since our dataset has close to a million samples, we use LinearSVC instead.

Attempts to run SVC(), even on fractions of the 995,000-sample dataset, ran for hours without finishing.
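
To put the scaling argument in numbers, the sketch below estimates how much more expensive the full dataset would be than a subset, assuming the cost grows exactly as the square of the sample count. That exponent is the lower bound quoted above, and the helper `relative_cost` and the subset sizes are hypothetical choices for illustration.

```python
def relative_cost(n_small, n_large, exponent=2):
    """Cost ratio between two sample counts when cost grows as n**exponent."""
    return (n_large / n_small) ** exponent


full = 995_000  # dataset size quoted in the text
for subset in (10_000, 100_000):
    ratio = relative_cost(subset, full)
    print(f"{subset:>7} -> {full} samples: ~{ratio:,.0f}x the training cost")
```

Under this assumption, a fit that takes one minute on 10,000 samples would take on the order of 9,900 minutes, roughly a week, on the full dataset. Even a 100,000-sample subset is about 99 times cheaper than the full run.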

In [4]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import pickle

# dual="auto" lets scikit-learn choose between the primal and dual formulation
model1 = LinearSVC(dual="auto", max_iter=1000, C=0.5)
model1.fit(X_train, Y_train)
Y_pred = model1.predict(X_test)

# LinearSVC is a linear SVM, so the report is labelled accordingly
print("Linear SVM Classification Report:")
print(classification_report(Y_test, Y_pred))

# Save the fitted model for later use
with open("../advanced_model.pkl", 'wb') as f:
    pickle.dump(model1, f)

del Y_pred, model1


Linear SVM Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     43036
           1       0.88      0.87      0.88     42060

    accuracy                           0.88     85096
   macro avg       0.88      0.88      0.88     85096
weighted avg       0.88      0.88      0.88     85096



