### Table of contents:

* [3. Classification](#chapter3)
    * [3.1 Requirements](#section_3_1)
    * [3.2 Imports](#section_3_2)
    * [3.3 Get data](#section_3_3)
    * [3.4 Train and Test dataset](#section_3_4)
    * [3.5 Generate feature vectors](#section_3_5)
    * [3.6 Models training and optimization (Machine Learning)](#section_3_6)
        * [3.6.1 Logistics Regression](#section_3_6_1)
            * [3.6.1.1 With bag-of-words](#section_3_6_1_1)
            * [3.6.1.2 With TF-IDF](#section_3_6_1_2)
        * [3.6.2 Linear Support Vector Classifier (SVC)](#section_3_6_2)
            * [3.6.2.1 With bag-of-words](#section_3_6_2_1)
            * [3.6.2.2 With TF-IDF](#section_3_6_2_2)
        * [3.6.3 Multinomial Naive Bayes](#section_3_6_3)
            * [3.6.3.1 With bag-of-words](#section_3_6_3_1)
            * [3.6.3.2 With TF-IDF](#section_3_6_1_2)
        * [3.6.4 K-Nearest Neighbors](#section_3_6_4)
            * [3.6.4.1 With bag-of-words](#section_3_6_4_1)
            * [3.6.4.2 With TF-IDF](#section_3_6_4_2)
        * [3.6.5 Random Forest](#section_3_6_5)
            * [3.6.5.1 With bag-of-words](#section_3_6_5_1)
            * [3.6.5.2 With TF-IDF](#section_3_6_5_2)
        * [3.6.6 XGBoost](#section_3_6_6)
            * [3.6.6.1 With bag-of-words](#section_3_6_6_1)
            * [3.6.6.2 With TF-IDF](#section_3_6_6_2)
    * [3.7 Models training and optimization (Deep learning)](#section_3_7)
        * [3.7.1 Convolutional Neural Network](#section_3_7_1)
    * [3.8 Evaluation analysis](#section_3_8)

# 3. Classification <a class="anchor" id="chapter3"></a>

## 3.1 Requirements <a class="anchor" id="section_3_1"></a>

In [1]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install scikit-optimize

In [None]:
pip install tensorflow 

In [None]:
pip install tensorflow-gpu 

## 3.2 Imports <a class="anchor" id="section_3_2"></a>

In [1]:
import pandas as pd
import pickle

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from keras.preprocessing import text, sequence
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation,Embedding, LSTM, Conv1D, Flatten, MaxPooling1D

from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.utils import use_named_args

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as scores

2022-03-24 13:15:16.251803: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-24 13:15:16.251875: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## 3.3 Get data <a class="anchor" id="section_3_3"></a>

In [2]:
data = pd.read_pickle('data_preprocessed.pkl')
data.head()

Unnamed: 0,label,content
0,0,prisão perpétua homem tentou assassinar senado...
1,0,john nash matemático mente brilhante morre aci...
2,1,mito reeleição mínima garantida cavaco sairá d...
3,0,morreu rita levintalcini grande dama ciência i...
4,0,trás porta amarela homem problemas psicológico...


## 3.4 Train and Test dataset <a class="anchor" id="section_3_4"></a>

In [3]:
# Divide the data into a 80% train dataset and 20% test dataset

X = data.loc[:,'content']
y = data.loc[:,'label']

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, data.index, test_size=0.2, random_state=55)

print("Number of news in train dataset: " + str(len(X_train)))
print("Number of news in train dataset: " + str(len(X_test)))

Number of news in train dataset: 516
Number of news in train dataset: 129


## 3.5 Generate feature vectors <a class="anchor" id="section_3_5"></a>

In [4]:
# Generate bag-of-words feature vectors

bow_vectorizer = CountVectorizer(lowercase=False)
bow_train = bow_vectorizer.fit_transform(X_train)
bow_test = bow_vectorizer.transform(X_test)

# (Number of news, Number of features/unique words in training dataset)
bow_train.shape

# IF SAVED - Load saved vectorizer
#bow_vectorizer = pickle.load(open("vectors/BOW.pickle", 'rb'))

(516, 40212)

In [5]:
# Generate TF-IDF feature vectors
# "words that are unique to particular document would have higher weights compared to words that are used commonly across documents"

tfidf_vectorizer = TfidfVectorizer(lowercase=False,)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

# (Number of news, Number of features/unique words in training dataset)
tfidf_train.shape

# IF SAVED - Load saved vectorizer
#tfidf_vectorizer = pickle.load(open("vectors/TFIDF.pickle", 'rb'))

(516, 40212)

In [6]:
# Generate Word Embeddings vectors and matrix
# Word2Vec PT

results = Counter()
data['content'].str.lower().str.split().apply(results.update) # Number of unique words in all dataset

max_length_content = 200 # Max number of content that each object/news artcile will have after padding
vocabulary_length = len(results)

# Create word index
token = text.Tokenizer(lower=False, num_words=vocabulary_length) 
token.fit_on_texts(data['content']) # Tokenize all corpus
word_index = token.word_index # Index of unique words (dictionary)

# Convert text to sequence of tokens and pad them to ensure equal length vectors 
we_train = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=max_length_content)
we_test = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=max_length_content)

# Load the pre-trained word-embedding vectors (Word2Vec 100D)
embedding_index = {}
with open("pre-trained/cbow_s100.txt", "r") as we_file:
    first_line = True
    for line in we_file:
        try:
            if first_line: 
                first_line = False
            else:
                line = line.split()
                word = line[0]
                vector = np.asarray(line[1:], dtype='float32')
                embedding_index[word] = vector
        except:
            pass

# Create token-embedding mapping (with words from our dataset)
embedding_matrix = np.zeros((vocabulary_length, 100))
for word, i in word_index.items():
    if i > vocabulary_length - 1:
        break
    else:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

#(Number (vocabulary size) of features/unique words in all dataset, Number of dimensions/features)
embedding_matrix.shape

# IF SAVED - Load saved vectorizer
#embedding_index = pickle.load(open("vectors/WE.pickle", 'rb'))

(45307, 100)

In [7]:
# Save feature vectorizers

with open('vectors/TFIDF.pickle', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

with open('vectors/BOW.pickle', 'wb') as f:
    pickle.dump(bow_vectorizer, f)

with open('vectors/WE.pickle', 'wb') as f:
    pickle.dump(embedding_index, f)

## 3.6 Models training and optimization (Machine Learning) <a class="anchor" id="section_3_6"></a>


#### Optimization

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. They define how our model is actually structured.

The optimization is performed with the K-fold cross validation startegy, in order to essentially combine training and validation data for both learning the model parameters and evaluating the model without introducing data leakage.

Libraries used:

- Scikit-optimize: uses a Sequential model-based optimization algorithm to find optimal solutions for hyperparameter search problems in less time;


In [8]:
# Create list to store evaluation metrics for all models

evaluation_metrics = []

# Function to store evaluation metrics for a model

def evaluation(evaluation_metrics_list, model, model_name, test_features, test_labels, feature_vectorizer_name, deep_learning=False):

    y_pred = model.predict(test_features)

    if deep_learning:
        y_pred = [int(round(p[0])) for p in y_pred]

    # Performance metrics
    accuracy = accuracy_score(test_labels, y_pred)*100

    # Precision, recall, f1 scores
    precision, recall, f1score, support = scores(y_test, y_pred, average='micro')

    # Add metrics to evaluation list
    evaluation_metrics_list.append(dict([
        ('Model', model_name),
        ('Feature Vectorizer', feature_vectorizer_name),
        ('Accuracy (%)', round(accuracy, 2)),
        ('Precision', round(precision, 2)),
        ('Recall', round(recall, 2)),
        ('F1', round(f1score, 2))
    ]))

    return evaluation_metrics_list

### 3.6.1 Logistic Regression <a class="anchor" id="section_2_6_1"></a>

- Is a classical linear method for binary classification (fits a line to best separate the two classes);
- It can handle both dense and sparse input;


#### 3.6.1.1 With bag-of-words <a class="anchor" id="section_3_6_1_1"></a>

In [9]:
# Model training (default/basic hyper-parameters)

model_lr_bow = LogisticRegression(random_state=0, solver="liblinear") # solver: algorithm to use in the optimization problem (liblinear for small datasets)
#IF SAVED - Load saved model
#model_lr_bow = pickle.load(open("models/LogisticRegressionBOW.pickle", 'rb'))

In [10]:
# Hyper-parameteres tuning

space  = [Real(0, 10, name='C')] # Inverse of regularization strength

@use_named_args(space)
def objective(**params):
    model_lr_bow.set_params(**params)

    return -np.mean(cross_val_score(model_lr_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_C_lr_bow = res_gp.x[0]

print("""Best parameters: - C=%f""" % (best_C_lr_bow))

Best parameters: - C=0.228294


In [11]:
# Model training with BEST hyper-parameteres

model_lr_bow = LogisticRegression(random_state=0, C=best_C_lr_bow, solver="liblinear").fit(bow_train, y_train)

# Model saving
with open('models/LogisticRegressionBOW.pickle', 'wb') as f:
    pickle.dump(model_lr_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_lr_bow, model_name="Logistic Regression", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("Logistic Regression (Bag-of-words) successfully trained and stored.")

Logistic Regression (Bag-of-words) successfully trained and stored.


#### 3.6.1.2 With TF-IDF <a class="anchor" id="section_3_6_1_2"></a>

In [12]:
# Model training (default/basic hyper-parameters)

model_lr_tfidf = LogisticRegression(random_state=0)
# IF SAVED - Load saved model
#model_lr_tfidf = pickle.load(open("models/LogisticRegressionTFIDF.pickle", 'rb'))

In [13]:
# Hyper-parameteres tuning

space  = [Real(0, 10, name='C')] # Inverse of regularization strength

@use_named_args(space)
def objective(**params):
    model_lr_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_lr_tfidf, tfidf_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_C_lr_tfidf = res_gp.x[0]

print("""Best parameters: - C=%f""" % (best_C_lr_tfidf))

Best parameters: - C=10.000000


In [14]:
# Model training with BEST hyper-parameteres

model_lr_tfidf = LogisticRegression(random_state=0, C=best_C_lr_tfidf, solver="liblinear").fit(tfidf_train, y_train)

# Model saving
with open('models/LogisticRegressionTFIDF.pickle', 'wb') as f:
    pickle.dump(model_lr_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_lr_tfidf, model_name="Logistic Regression", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("Logistic Regression (TF-IDF) sucessfully trained and stored.")

Logistic Regression (TF-IDF) sucessfully trained and stored.


### 3.6.2 Linear Support Vector Classifier (SVC) <a class="anchor" id="section_3_6_2"></a>

 - Supports both dense and sparse input 

#### 3.6.2.1 With bag-of-words <a class="anchor" id="section_3_6_2_1"></a>

In [15]:
# Model training (default/basic hyper-parameters)

model_svc_bow = LinearSVC(random_state=0)
# IF SAVED - Load saved model
#model_svc_bow = pickle.load(open("models/LinearSVCBOW.pickle", 'rb'))

In [16]:
# Hyper-parameteres tuning

space  = [Real(1, 15, name='C')] # Inverse of regularization strength

@use_named_args(space)
def objective(**params):
    model_svc_bow.set_params(**params)

    return -np.mean(cross_val_score(model_svc_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_C_svc_bow = res_gp.x[0]

print("""Best parameters: - C=%f""" % (best_C_svc_bow))

Best parameters: - C=9.299825


In [17]:
# Model training with BEST hyper-parameteres

model_svc_bow = LinearSVC(random_state=0, C=best_C_svc_bow).fit(bow_train, y_train)

# Model saving
with open('models/LinearSVCBOW.pickle', 'wb') as f:
    pickle.dump(model_svc_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_svc_bow, model_name="Support Vector Classifier (Linear)", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("Linear Suppport Vector Classifier (Bag-of-words) successfully trained and stored.")

Linear Suppport Vector Classifier (Bag-of-words) successfully trained and stored.


#### 3.6.2.2 With TF-IDF <a class="anchor" id="section_3_6_2_2"></a>

In [18]:
# Model training (default/basic hyper-parameters)

model_svc_tfidf = LinearSVC(random_state=0)
# IF SAVED - Load saved model
#model_svc_tfidf = pickle.load(open("models/LinearSVCTFIDF.pickle", 'rb'))

In [19]:
# Hyper-parameteres tuning

space  = [Real(1, 15, name='C')] # Inverse of regularization strength

@use_named_args(space)
def objective(**params):
    model_svc_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_svc_tfidf, tfidf_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_C_svc_tfidf = res_gp.x[0]

print("""Best parameters: - C=%f""" % (best_C_svc_tfidf))

Best parameters: - C=3.992197


In [20]:
# Model training with BEST hyper-parameteres

model_svc_tfidf = LinearSVC(random_state=0, C=best_C_svc_tfidf).fit(tfidf_train, y_train)

# Model saving
with open('models/LinearSVCTFIDF.pickle', 'wb') as f:
    pickle.dump(model_svc_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_svc_tfidf, model_name="Support Vector Classifier (Linear)", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("Linear Support Vector Classifier (TF-IDF) sucessfully trained and stored.")

Linear Support Vector Classifier (TF-IDF) sucessfully trained and stored.


### 3.6.3 Multinomial Naive Bayes <a class="anchor" id="section_3_6_3"></a>

- Probabilistic approach to classifying documents in the case of acknowledging the frequency of a specified word in a text document;
- Achieves well on discrete types as the number of words found in a document;
- Conditional independence is assumed in real data and it attempts to approximate to the optimal soltuion;
- Is a quick classifier.

#### 3.6.3.1 With bag-of-words <a class="anchor" id="section_3_6_3_1"></a>

In [21]:
# Model training (default/basic hyper-parameters)

model_nb_bow = MultinomialNB()
# IF SAVED - Load saved model
#model_nb_bow = pickle.load(open("models/MultinomialNaiveBayesBOW.pickle", 'rb'))

In [22]:
# Hyper-parameteres tuning

space  = [Real(0, 10, name='alpha')] # Additive (Laplace/Lidstone) smoothing parameter

@use_named_args(space)
def objective(**params):
    model_nb_bow.set_params(**params)

    return -np.mean(cross_val_score(model_nb_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_alpha_nb_bow = res_gp.x[0]

print("""Best parameters: - alpha=%f""" % (best_alpha_nb_bow))

Best parameters: - alpha=0.479317


In [23]:
# Model training with BEST hyper-parameteres

model_nb_bow = MultinomialNB(alpha=best_alpha_nb_bow).fit(bow_train, y_train)

# Model saving
with open('models/MultinomialNaiveBayesBOW.pickle', 'wb') as f:
    pickle.dump(model_nb_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_nb_bow, model_name="Naive Bayes (Multinomial)", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("Multinomial Naive Bayes (Bag-of-words) successfully trained and stored.")

Multinomial Naive Bayes (Bag-of-words) successfully trained and stored.


#### 3.6.3.2 With TF-IDF <a class="anchor" id="section_3_6_3_2"></a>

In [24]:
# Model training (default/basic hyper-parameters)

model_nb_tfidf = MultinomialNB()
# IF SAVED - Load saved model
#model_nb_tfidf = pickle.load(open("models/MultinomialNaiveBayesTFIDF.pickle", 'rb'))

In [25]:
# Hyper-parameteres tuning

space  = [Real(0, 10, name='alpha')] # Additive (Laplace/Lidstone) smoothing parameter

@use_named_args(space)
def objective(**params):
    model_nb_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_nb_tfidf, tfidf_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_alpha_nb_tfidf = res_gp.x[0]

print("""Best parameters: - alpha=%f""" % (best_alpha_nb_tfidf))

Best parameters: - alpha=0.002528


In [26]:
# Model training with BEST hyper-parameteres

model_nb_tfidf = MultinomialNB(alpha=best_alpha_nb_tfidf).fit(tfidf_train, y_train)

# Model saving
with open('models/MultinomialNaiveBayesTFIDF.pickle', 'wb') as f:
    pickle.dump(model_lr_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_nb_tfidf, model_name="Naive Bayes (Multinomial)", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("Multinomial Naive Bayes (TF-IDF) sucessfully trained and stored.")

Multinomial Naive Bayes (TF-IDF) sucessfully trained and stored.


### 3.6.4 K-Nearest Neighbors <a class="anchor" id="section_3_6_4"></a>

- The data is classified based on vote among the k nearest neighbors;
- Number of neighbors can be sqrt(number of data objects) = sqrt(516) = 22;
- Should be preferred when the data-set is relatively small.

#### 3.6.4.1 With bag-of-words <a class="anchor" id="section_3_6_4_1"></a>

In [27]:
# Model training (default/basic hyper-parameters)

model_knn_bow = KNeighborsClassifier(n_neighbors=22, metric='euclidean') # n_neighbors = sqrt(516)
# IF SAVED - Load saved model
#model_knn_bow = pickle.load(open("models/KNNBOW.pickle", 'rb'))

In [28]:
# Hyper-parameteres tuning

space  = [Integer(1, 30, name='n_neighbors'), # Number of neighbors to use
          Categorical(["euclidean", "manhattan"], name='metric')] # The distance metric to use for the tree
            

@use_named_args(space)
def objective(**params):
    model_knn_bow.set_params(**params)

    return -np.mean(cross_val_score(model_knn_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_n_knn_bow = res_gp.x[0]
best_metric_knn_bow = res_gp.x[1]

print("Best parameters: - N_neighbors=%d Metric= %s" % (best_n_knn_bow, best_metric_knn_bow))

Best parameters: - N_neighbors=29 Metric= euclidean


In [29]:
# Model training with BEST hyper-parameteres

model_knn_bow = KNeighborsClassifier(n_neighbors=best_n_knn_bow, metric=best_metric_knn_bow).fit(bow_train, y_train)

# Model saving
with open('models/KNNBOW.pickle', 'wb') as f:
    pickle.dump(model_knn_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_knn_bow, model_name="K-Nearest Neighbors", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("K-Nearest Neighbors (Bag-of-words) successfully trained and stored.")

K-Nearest Neighbors (Bag-of-words) successfully trained and stored.


#### 3.6.4.2 With TF-IDF <a class="anchor" id="section_3_6_4_2"></a>

In [30]:
# Model training (default/basic hyper-parameters)

model_knn_tfidf = KNeighborsClassifier(n_neighbors=22 , metric= 'euclidean')
# IF SAVED - Load saved model
#model_knn_tfidf = pickle.load(open("models/KNNTFIDF.pickle", 'rb'))

In [31]:
# Hyper-parameteres tuning

space  = [Integer(1, 30, name='n_neighbors'), # Number of neighbors to use
          Categorical(["euclidean", "manhattan"], name='metric')] # The distance metric to use for the tree
            

@use_named_args(space)
def objective(**params):
    model_knn_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_knn_tfidf, tfidf_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_n_knn_tfidf = res_gp.x[0]
best_metric_knn_tfidf = res_gp.x[1]

print("Best parameters: - N_neighbors=%d Metric= %s" % (best_n_knn_tfidf, best_metric_knn_tfidf))

Best parameters: - N_neighbors=10 Metric= euclidean


In [32]:
# Model training with BEST hyper-parameteres

model_knn_tfidf = KNeighborsClassifier(n_neighbors=best_n_knn_tfidf, metric=best_metric_knn_tfidf).fit(tfidf_train, y_train)

# Model saving
with open('models/KNNTFIDF.pickle', 'wb') as f:
    pickle.dump(model_knn_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_knn_tfidf, model_name="K-Nearest Neighbors", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("K-Nearest Neighbors (TF-IDF) successfully trained and stored.")

K-Nearest Neighbors (TF-IDF) successfully trained and stored.


### 3.6.5 Random Forest <a class="anchor" id="section_3_6_5"></a>

#### 3.6.5.1 With bag-of-words <a class="anchor" id="section_3_6_5_1"></a>

In [33]:
# Model training (default/basic hyper-parameters)

model_rf_bow = RandomForestClassifier(criterion='entropy', random_state=0)
# IF SAVED - Load saved model
#model_rf_bow = pickle.load(open("models/RandomForestBOW.pickle", 'rb'))

In [34]:
# Hyper-parameteres tuning

space  = [Integer(10, 200, name='n_estimators'), # the number of trees in the forest
          Integer(3,30, name ='max_depth')] # the maximum allowable depth for each decision tree 

@use_named_args(space)
def objective(**params):
    model_rf_bow.set_params(**params)

    return -np.mean(cross_val_score(model_rf_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_nestimators_rf_bow = res_gp.x[0]
best_maxdepth_rf_bow = res_gp.x[0]

print("""Best parameters: - n_estimators=%d max_depth=%d""" % (best_nestimators_rf_bow, best_maxdepth_rf_bow))

Best parameters: - n_estimators=200 max_depth=200


In [35]:
# Model training with BEST hyper-parameteres

model_rf_bow = RandomForestClassifier(n_estimators=best_nestimators_rf_bow, max_depth=best_maxdepth_rf_bow, criterion='entropy', random_state=0).fit(bow_train, y_train)

# Model saving
with open('models/RandomForestBOW.pickle', 'wb') as f:
    pickle.dump(model_rf_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_rf_bow, model_name="Random Forest", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("Random Forest (Bag-of-words) successfully trained and stored.")

Random Forest (Bag-of-words) successfully trained and stored.


#### 3.6.5.2 With TF-IDF <a class="anchor" id="section_3_6_5_2"></a>

In [36]:
# Model training (default/basic hyper-parameters)

model_rf_tfidf = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
# IF SAVED - Load saved model
#model_rf_tfidf = pickle.load(open("models/RandomForestTFIDF.pickle", 'rb'))

In [37]:
# Hyper-parameteres tuning

space  = [Integer(100, 300, name='n_estimators'),
          Integer(3,30, name ='max_depth')] # the maximum allowable depth for each decision tree 

@use_named_args(space)
def objective(**params):
    model_rf_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_rf_tfidf, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_nestimators_rf_tfidf = res_gp.x[0]
best_maxdepth_rf_tfidf = res_gp.x[0]

print("""Best parameters: - n_estimators=%d max_depth:%d""" % (best_nestimators_rf_tfidf, best_maxdepth_rf_tfidf))

Best parameters: - n_estimators=174 max_depth:174


In [38]:
# Model training with BEST hyper-parameteres

model_rf_tfidf = RandomForestClassifier(n_estimators=best_nestimators_rf_tfidf, max_depth=best_maxdepth_rf_tfidf, criterion='entropy', random_state=0).fit(tfidf_train, y_train)

# Model saving
with open('models/RandomForestTFIDF.pickle', 'wb') as f:
    pickle.dump(model_rf_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_rf_tfidf, model_name="Random Forest", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("Random Forest (TF-IDF) successfully trained and stored.")

Random Forest (TF-IDF) successfully trained and stored.


### 3.6.6 XGBoost <a class="anchor" id="section_2_6_6"></a>

- Has been the winning algorithm in a number of recent Kaggle competitions;
- Is an ensemble learner like Random Forest algorithm. This means it will generate a final model based on a combination of individual models.

#### 3.6.6.1 With bag-of-words <a class="anchor" id="section_3_6_6_1"></a>

In [39]:
# Model training 

model_xgb_bow = XGBClassifier()
# IF SAVED - Load saved model
#model_xgb_bow = pickle.load(open("models/XGBoostBOW.pickle", 'rb'))

In [None]:
# Hyper-parameteres tuning

space  = [Categorical(['exact', 'approx', 'hist'], name='tree_method'),  # tree method to use
          Integer(3, 10, name ='max_depth'), # the maximum allowable depth for each decision tree 
          Real(1, 6, name ='min_child_weight'),
          Real(0, 1, name='learning_rate'), # boosting learning rate
          Real(0, 9, name="gamma") # minimum loss reduction required to make a further partition on a leaf node of the tree
          ] 

@use_named_args(space)
def objective(**params):
    model_xgb_bow.set_params(**params)

    return -np.mean(cross_val_score(model_xgb_bow, bow_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_treemethod_xgb_bow = res_gp.x[0]
best_maxdepth_xgb_bow = res_gp.x[1]
best_minchildweight_xgb_bow = res_gp.x[2]
best_lr_xgb_bow = res_gp.x[3]
best_gamma_xgb_bow = res_gp.x[4]

print("""Best parameters: - tree_method=%s max_depth=%d min_child_weight=%f learning_rate=%f gamma=%f""" % (best_treemethod_xgb_bow, best_maxdepth_xgb_bow, best_minchildweight_xgb_bow, best_lr_xgb_bow, best_gamma_xgb_bow))

In [40]:
# Model training with BEST hyper-parameteres

model_xgb_bow = XGBClassifier(tree_method='hist', max_depth=8, min_child_weight=3, learning_rate=0.9, gamma=1, n_estimators=1000, subsample=0.8).fit(bow_train, y_train)

# Model saving
with open('models/XGBoostBOW.pickle', 'wb') as f:
    pickle.dump(model_xgb_bow, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_xgb_bow, model_name="XGBoost", test_features=bow_test, test_labels=y_test, feature_vectorizer_name="Bag-of-words")

print("XGBoost (Bag-of-words) successfully trained and stored.")

XGBoost (Bag-of-words) successfully trained and stored.


#### 3.6.6.2 With TF-IDF <a class="anchor" id="section_3_6_6_2"></a>

In [41]:
# Model training

model_xgb_tfidf = XGBClassifier()
# IF SAVED - Load saved model
#model_xgb_tfidf = pickle.load(open("models/XGBoostTFIDF.pickle", 'rb'))

In [None]:
# Hyper-parameteres tuning

space  = [Categorical(['exact', 'approx', 'hist'], name='tree_method'),  # tree method to use
          Integer(3, 10, name ='max_depth'), # the maximum allowable depth for each decision tree 
          Real(1, 6, name ='min_child_weight'),
          Real(0, 1, name='learning_rate'), # boosting learning rate
          Real(0, 9, name="gamma") # minimum loss reduction required to make a further partition on a leaf node of the tree
          ] 

@use_named_args(space)
def objective(**params):
    model_xgb_tfidf.set_params(**params)

    return -np.mean(cross_val_score(model_xgb_, tfidf_train, y_train, cv=3, n_jobs=-1, scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, random_state=0)

best_treemethod_xgb_tfidf = res_gp.x[0]
best_maxdepth_xgb_tfidf = res_gp.x[1]
best_minchildweight_xgb_tfidf = res_gp.x[2]
best_lr_xgb_tfidf = res_gp.x[3]
best_gamma_xgb_tfidf = res_gp.x[4]

print("""Best parameters: - tree_method=%s max_depth=%d min_child_weight=%f learning_rate=%f gamma=%f""" % (best_treemethod_xgb_tfidf, best_maxdepth_xgb_tfidf, best_minchildweight_xgb_tfidf, best_lr_xgb_tfidf, best_gamma_xgb_tfidf))

In [42]:
# Model training with BEST hyper-parameteres

model_xgb_tfidf = XGBClassifier(tree_method='hist', max_depth=8, min_child_weight=3, learning_rate=0.9, gamma=1, n_estimators=1000, subsample=0.8).fit(tfidf_train, y_train)

# Model saving
with open('models/XGBoostTFIDF.pickle', 'wb') as f:
    pickle.dump(model_xgb_tfidf, f)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model_xgb_tfidf, model_name="XGBoost", test_features=tfidf_test, test_labels=y_test, feature_vectorizer_name="TF-IDF")

print("XGBoost (TF-IDF) successfully trained and stored.")

XGBoost (TF-IDF) successfully trained and stored.


## 3.7 Models training and optimization (Deep Learning) <a class="anchor" id="section_3_7"></a>

### 3.7.1 Convolutional Neural Network <a class="anchor" id="section_3_7_1"></a>

In [43]:
# Model training

model = Sequential()

# Add layers
model.add(Embedding(vocabulary_length, 100, weights=[embedding_matrix], trainable=False, input_length=max_length_content))
model.add(Dropout(0.5))
model.add(Conv1D(16, 3, padding="valid", activation="relu"))
model.add(MaxPooling1D())
model.add(Conv1D(16, 3, padding="valid", activation="relu"))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(100, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid")) # binary classification

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])
    
history = model.fit(we_train, y_train, batch_size=32, epochs=10)

# Store evaluation metrics
evaluation_metrics = evaluation(evaluation_metrics_list=evaluation_metrics, model=model, model_name="Convolutional Neural Network", test_features=we_test, test_labels=y_test, feature_vectorizer_name="Word Embeddings", deep_learning=True)

print("CNN (Word Embeddings) successfully trained.")

2022-03-24 15:00:58.535039: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-03-24 15:00:58.535318: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-24 15:00:58.535808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (530S): /proc/driver/nvidia/version does not exist
2022-03-24 15:00:58.540713: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CNN (Wo

In [17]:
## Test accuracy

_, test_acc = model.evaluate(we_test, y_test, verbose=0)

print('Test accuracy: %.3f' % (test_acc))

Test accuracy: 0.891


## 3.8 Evaluation analysis <a class="anchor" id="section_3_8"></a>

The final evaluation (with the test dataset) is performed in the optimized models.

In [44]:
evaluation = pd.DataFrame(data=evaluation_metrics)
evaluation.columns = ['Model', 'Feature Vectorizer', 'Accuracy', 'Precision', 'Recall', 'F1']
evaluation = evaluation.sort_values(by='Accuracy', ascending=False)
evaluation

Unnamed: 0,Model,Feature Vectorizer,Accuracy,Precision,Recall,F1
8,Random Forest,Bag-of-words,92.25,0.92,0.92,0.92
9,Random Forest,TF-IDF,91.47,0.91,0.91,0.91
1,Logistic Regression,TF-IDF,90.7,0.91,0.91,0.91
3,Support Vector Classifier (Linear),TF-IDF,89.92,0.9,0.9,0.9
4,Naive Bayes (Multinomial),Bag-of-words,89.92,0.9,0.9,0.9
5,Naive Bayes (Multinomial),TF-IDF,89.92,0.9,0.9,0.9
0,Logistic Regression,Bag-of-words,86.82,0.87,0.87,0.87
7,K-Nearest Neighbors,TF-IDF,86.82,0.87,0.87,0.87
12,Convolutional Neural Network,Word Embeddings,86.82,0.87,0.87,0.87
2,Support Vector Classifier (Linear),Bag-of-words,85.27,0.85,0.85,0.85
