# Model Training and Evaluation
In this notebook, we train three machine learning models (Logistic Regression, SVM, and Random Forest) on the features produced by the vectorisation step, and evaluate each model with 5-fold cross-validation.

## Set Up Dependencies

In [2]:
import pandas as pd
import numpy as np
import pickle
from scipy import sparse
from sklearn.model_selection import cross_validate, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

## Load Text Representations and Labels

Retrieve the DataFrame

In [3]:
with open('data/df_balanced.pkl', 'rb') as f:
    df_balanced = pickle.load(f)

df_balanced.head()

,class,tweet,cleaned_text,Word2Vec,TF-IDF,SentenceTrans
0,0,I LOVE my 10 &amp; 5 but most days they remind...,I LOVE my 10 &amp; 5 but most days they remind...,"[love, my, amp, but, most, day, they, remind, ...",love my amp but most day they remind me why bi...,i love my amp but most days they remind me w...
1,1,She be thinking she throwing that pussy back s...,She be thinking she throwing that pussy back s...,"[she, be, think, she, throw, that, pussy, back...",she be think she throw that pussy back so good...,she be thinking she throwing that pussy back s...
2,1,RT @lamessican: I love when bitches throw shad...,I love when bitches throw shade. Just confirms...,"[love, when, bitch, throw, shade, just, confir...",love when bitch throw shade just confirm do so...,i love when bitches throw shade just confirms ...
3,1,"If you ain't a hoe, get up out my trap house @...","If you ain't a hoe, get up out my trap house .","[if, you, ain, hoe, get, up, out, my, trap, ho...",if you ain hoe get up out my trap house,if you aint a hoe get up out my trap house
4,0,Just hit 40 in flappy bird.&#128527;,Just hit 40 in flappy bird.&#128527;,"[just, hit, in, flappy, bird]",just hit in flappy bird,just hit in flappy bird


Retrieve the Text Representations

In [4]:
x_tfidf = sparse.load_npz('representations/x_tfidf.npz')
x_w2v = np.load('representations/x_w2v.npy', allow_pickle=True)
x_st = np.load('representations/x_st.npy', allow_pickle=True)

Retrieve the labels

In [5]:
y = df_balanced['class']

## Initialise K-Fold Cross Validation

In [6]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1'
}
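To illustrate what the `KFold` splitter does, here is a minimal sketch on a toy array (`X_toy` is a hypothetical stand-in, not the notebook's data). It shows that each sample is held out exactly once across the five folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features (hypothetical stand-in for a feature matrix)
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
seen_test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    # Each fold trains on 8 samples and tests on the remaining 2
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_indices.extend(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")

# Every sample appears in a test fold exactly once
covers_all = sorted(seen_test_indices) == list(range(10))
print("Each sample held out once:", covers_all)
```

Note that `KFold` ignores the labels when splitting. If class proportions per fold matter, `StratifiedKFold` is the usual alternative; this notebook works on a balanced dataset with shuffling, so plain `KFold` is a reasonable choice.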

## Checking the Dimensionality of Feature Matrices
Ensure that each feature matrix has the shape (n_samples, n_features) and that all three have the same number of rows.

In [7]:
print("TF-IDF shape:", x_tfidf.shape)
print("Word2Vec shape:", x_w2v.shape)
print("Sentence Transformers shape:", x_st.shape)

TF-IDF shape: (8326, 7924)
Word2Vec shape: (8326, 100)
Sentence Transformers shape: (8326, 384)
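The visual check above could also be automated. The sketch below uses a hypothetical helper, `check_row_alignment`, and small toy matrices (not the notebook's data) to verify that every feature matrix is two-dimensional and has one row per label.

```python
import numpy as np
from scipy import sparse

def check_row_alignment(labels, **matrices):
    """Hypothetical helper: raise ValueError unless every matrix is 2-D
    and has as many rows as there are labels. Returns the row count."""
    n_samples = len(labels)
    for name, matrix in matrices.items():
        if len(matrix.shape) != 2:
            raise ValueError(f"{name} is not 2-D: shape {matrix.shape}")
        if matrix.shape[0] != n_samples:
            raise ValueError(
                f"{name} has {matrix.shape[0]} rows, expected {n_samples}"
            )
    return n_samples

# Toy stand-ins for a label vector, a dense matrix, and a sparse matrix
y_toy = np.array([0, 1, 0, 1])
dense_toy = np.zeros((4, 3))
sparse_toy = sparse.csr_matrix(np.eye(4))

n_checked = check_row_alignment(y_toy, dense=dense_toy, sparse=sparse_toy)
print("All matrices aligned with", n_checked, "labels")
```

In the notebook itself, a call along the lines of `check_row_alignment(y, tfidf=x_tfidf, w2v=x_w2v, st=x_st)` would play the same role.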


## Initialise Results Dictionary

In [8]:
# Every (vectorisation, classifier) pair starts with each metric set to None
metric_names = ['accuracy', 'precision', 'recall', 'f1']
classifier_names = ['Logistic Regression', 'SVM', 'Random Forest']
vectorisation_names = ['TF-IDF', 'Word2Vec', 'Sentence Transformers']

results = {
    method: {clf_name: dict.fromkeys(metric_names) for clf_name in classifier_names}
    for method in vectorisation_names
}

## Classification

A brief overview of the classification techniques employed (from [geeksforgeeks](https://www.geeksforgeeks.org/)):

- Logistic Regression

    A supervised machine learning algorithm for binary classification tasks (binomial logistic regression). It applies a sigmoid function to a weighted sum of the independent variables to produce a probability value between 0 and 1.

- Support Vector Machine (SVM)

    A supervised machine learning algorithm for linear or non-linear classification, regression, and outlier detection. Its main objective is to find the optimal hyperplane in an n-dimensional feature space that separates the data into different classes with the largest possible margin.

- Random Forest

    An ensemble learning algorithm built from decision trees. During training it creates a number of decision trees, each constructed from a random sample of the data and a random subset of the features. In prediction, the algorithm aggregates the results of all trees, either by voting (classification) or averaging (regression).
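Two of the ideas above can be made concrete in a few lines: the sigmoid mapping used by logistic regression and the majority vote used by a random forest classifier. This is only a sketch; the weights, bias, inputs, and tree votes are made-up values, not parameters learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature logistic regression model
weights = np.array([0.8, -0.4])
bias = 0.1

# Two made-up samples
x = np.array([[2.0, 1.0],
              [-1.0, 3.0]])

# Weighted sum of the features, then the sigmoid gives a probability for class 1
probs = sigmoid(x @ weights + bias)
preds = (probs >= 0.5).astype(int)
print("Probabilities:", probs)
print("Predicted classes:", preds)

# Majority vote: made-up class predictions from five trees for one sample
tree_votes = np.array([1, 0, 1, 1, 0])
forest_prediction = int(np.bincount(tree_votes).argmax())
print("Forest prediction:", forest_prediction)
```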


Define vectorisation methods and text representations

In [9]:
vectorisations = {
    'TF-IDF': x_tfidf,
    'Word2Vec': x_w2v,
    'Sentence Transformers': x_st
}

Initialising classifiers

In [10]:
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(kernel='linear'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

Iterate over every vectorisation method and classifier, running cross-validation for each pair

In [11]:
for method_name, vectorisation in vectorisations.items():
    for clf_name, clf in classifiers.items():
        # cross_validate fits a fresh clone of clf on each fold,
        # so reusing the same instance across feature sets is safe
        scores = cross_validate(clf, vectorisation, y, cv=cv, scoring=scoring)
        # Store the mean of each metric over the 5 folds
        for metric in scoring:
            results[method_name][clf_name][metric] = scores[f'test_{metric}'].mean()

print(results)

{'TF-IDF': {'Logistic Regression': {'accuracy': 0.9391055881812184, 'precision': 0.9728144858965873, 'recall': 0.9036319349816964, 'f1': 0.9368586073952276}, 'SVM': {'accuracy': 0.9489545007192065, 'precision': 0.9751041968972315, 'recall': 0.921581908282521, 'f1': 0.9475267252925155}, 'Random Forest': {'accuracy': 0.9469124586771646, 'precision': 0.9665930753879118, 'recall': 0.9258183052803337, 'f1': 0.9456852811216745}}, 'Word2Vec': {'Logistic Regression': {'accuracy': 0.8431405715439328, 'precision': 0.8481279332767697, 'recall': 0.8360224428397979, 'f1': 0.8419136465342361}, 'SVM': {'accuracy': 0.8558717901575044, 'precision': 0.8655969741228402, 'recall': 0.8428452658665702, 'f1': 0.853872758724537}, 'Random Forest': {'accuracy': 0.8732862514375119, 'precision': 0.8930938248898203, 'recall': 0.8480535719066079, 'f1': 0.8698458946701276}}, 'Sentence Transformers': {'Logistic Regression': {'accuracy': 0.9191687485805133, 'precision': 0.9371299595629475, 'recall': 0.8988966284151949
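The loop above keeps only the mean of each metric. `cross_validate` actually returns one score per fold, so the spread across folds is also available. The sketch below shows this on synthetic data from `make_classification` (an assumption for illustration, not the tweet features), reporting the mean and standard deviation for two metrics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data (stand-in for the real features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
demo_metrics = ['accuracy', 'f1']

scores_demo = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv_demo,
    scoring=demo_metrics,
)

# Each 'test_<metric>' entry is an array with one score per fold
summary = {
    metric: (scores_demo[f'test_{metric}'].mean(), scores_demo[f'test_{metric}'].std())
    for metric in demo_metrics
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether small differences between classifiers are larger than the fold-to-fold variation.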

Convert the results dictionary into a DataFrame

- A dictionary comprehension flattens the nested results into a new dictionary whose keys are (vectorisation method, classification technique) tuples
- `pd.DataFrame.from_dict()`: converts the flattened dictionary into a DataFrame
- `orient='index'`: specifies the dictionary keys (the vectorisation-classifier tuples) as the row labels, and the dictionary values as the row data

In [12]:
results_df = pd.DataFrame.from_dict({(i, j): results[i][j]
                                     for i in results.keys()
                                     for j in results[i].keys()},
                                     orient='index')  # Dictionary keys are treated as labels

results_df.head(10)

,,accuracy,precision,recall,f1
TF-IDF,Logistic Regression,0.939106,0.972814,0.903632,0.936859
TF-IDF,SVM,0.948955,0.975104,0.921582,0.947527
TF-IDF,Random Forest,0.946912,0.966593,0.925818,0.945685
Word2Vec,Logistic Regression,0.843141,0.848128,0.836022,0.841914
Word2Vec,SVM,0.855872,0.865597,0.842845,0.853873
Word2Vec,Random Forest,0.873286,0.893094,0.848054,0.869846
Sentence Transformers,Logistic Regression,0.919169,0.93713,0.898897,0.917542
Sentence Transformers,SVM,0.922171,0.944822,0.897017,0.920205
Sentence Transformers,Random Forest,0.89767,0.930286,0.860301,0.893745
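Once the results are in a DataFrame indexed by (vectorisation, classifier), they are easy to query. The sketch below rebuilds the same flattening step on a small made-up dictionary (`toy_results`, with invented scores) and then picks out the best pair and one vectorisation's rows. The index level names `vectorisation` and `classifier` are labels chosen here for illustration.

```python
import pandas as pd

# Made-up nested results with the same structure as the notebook's dictionary
toy_results = {
    'A': {'clf1': {'accuracy': 0.90, 'f1': 0.88},
          'clf2': {'accuracy': 0.85, 'f1': 0.86}},
    'B': {'clf1': {'accuracy': 0.80, 'f1': 0.79},
          'clf2': {'accuracy': 0.92, 'f1': 0.91}},
}

# Flatten to {(vectorisation, classifier): metrics} and build the DataFrame
toy_df = pd.DataFrame.from_dict(
    {(method, clf): metrics
     for method, clfs in toy_results.items()
     for clf, metrics in clfs.items()},
    orient='index',
)

# Make the two-level index explicit and give the levels readable names
toy_df.index = pd.MultiIndex.from_tuples(
    list(toy_df.index), names=['vectorisation', 'classifier']
)

# Best (vectorisation, classifier) pair by F1 score
best_pair = toy_df['f1'].idxmax()
print("Best pair:", best_pair)

# All classifiers for a single vectorisation method
rows_for_a = toy_df.xs('A', level='vectorisation')
print(rows_for_a)
```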


Save the DataFrame for use in the next notebook.

In [13]:
with open('results/results_df.pkl', 'wb') as f:
    pickle.dump(results_df, f)