# 📊 Cross-Evaluation for All Final Models – v1.2 Architecture

This notebook performs **5-fold cross-validation** on all finalized models using the v1.2 architecture. The goal is to evaluate model stability, generalization, and consistency across different data splits.

### ✅ Models Evaluated:
- **Discipline Classifier**  
  Logistic Regression + Bigram TF-IDF (v1.1)
  
- **Subfield Classifiers**  
  SVM + SMOTE + Bigram TF-IDF (v1.2) — evaluated separately for CS, IS, and IT

- **Methodology Classifier**  
  SVM + SMOTE + Title + Abstract TF-IDF (v2.0)

---

Each model is validated using **stratified k-fold cross-validation**, which estimates how well it performs on held-out samples **within the same dataset**. Averaging over several splits reduces the risk that a score reflects one particular split and gives a more reliable signal of generalization, which is especially important for small, imbalanced academic datasets.
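
As a rough illustration of what stratification does here, the sketch below splits a label vector with the same Methodology counts as this dataset (49 / 46 / 10) into five folds and prints the per-class counts of each test fold. The dummy feature matrix and the variable names are assumptions made for the example only.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Label vector mirroring the Methodology distribution of the dataset.
y = np.array(["Qualitative"] * 49 + ["Quantitative"] * 46 + ["Mixed"] * 10)
X_dummy = np.zeros((len(y), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (_, test_idx) in enumerate(cv.split(X_dummy, y), start=1):
    counts = Counter(y[test_idx].tolist())
    fold_counts.append(counts)
    print(f"Fold {fold}: {dict(counts)}")
```

Each test fold receives two of the ten Mixed abstracts, so any per-fold figure for that class rests on very few samples.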

## 🧹 Load Pre-Trained Models and Vectorizers

In this step, we load the finalized `.pkl` artefacts (serialized with joblib) trained earlier. These include:

- The **Discipline Classifier** (Logistic Regression with bigram TF-IDF)
- The **Subfield Classifiers** (SVM + SMOTE for CS, IS, and IT using discipline-specific TF-IDF)
- The **Methodology Classifier** (SVM + SMOTE using Title + Abstract as input)

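These artefacts fix the pipeline architecture and preprocessing used for cross-validation. Note that scikit-learn's `cross_val_score` clones each estimator and refits it on every training fold, so the loaded models contribute their hyperparameters rather than their fitted weights, while the loaded vectorizers are reused as fitted.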
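
One detail worth keeping in mind when passing a saved estimator to `cross_val_score`: scikit-learn fits clones of the estimator on each training fold and leaves the object you passed in untouched. The minimal sketch below shows this with an unfitted `LogisticRegression` on synthetic data; the data and variable names are assumptions and not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the notebook's dataset).
X, y = make_classification(n_samples=105, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)  # deliberately left unfitted

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# cross_val_score fitted five clones; the object we passed in is untouched.
still_unfitted = not hasattr(model, "coef_")
print("Scores:", scores.round(3))
print("Original estimator still unfitted:", still_unfitted)
```

The same holds for a loaded, already fitted model: its learned coefficients are not what gets scored, only its configuration.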

In [1]:
import joblib
import os

path = "Artefacts/"

# Discipline Classifier (v1.1)
discipline_model = joblib.load(os.path.join(path, "discipline_classifier_logreg.pkl"))
discipline_vectorizer = joblib.load(os.path.join(path, "tfidf_vectorizer.pkl"))

# Subfield Classifiers (v1.2)
cs_model = joblib.load(os.path.join(path, "cs_subfield_classifier_svm_smote.pkl"))
cs_vectorizer = joblib.load(os.path.join(path, "cs_subfield_vectorizer_smote.pkl"))

is_model = joblib.load(os.path.join(path, "is_subfield_classifier_svm_smote.pkl"))
is_vectorizer = joblib.load(os.path.join(path, "is_subfield_vectorizer_smote.pkl"))

it_model = joblib.load(os.path.join(path, "it_subfield_classifier_svm_smote.pkl"))
it_vectorizer = joblib.load(os.path.join(path, "it_subfield_vectorizer_smote.pkl"))

# Methodology Classifier (v2.0)
methodology_model = joblib.load(os.path.join(path, "methodology_classifier_v2_titleabstract.pkl"))
methodology_vectorizer = joblib.load(os.path.join(path, "tfidf_vectorizer_methodology_v2_titleabstract.pkl"))

print("✅ All models and vectorizers loaded using joblib.")

✅ All models and vectorizers loaded using joblib.


## 📦 Load Labeled Abstract Dataset

We now load the complete dataset of 105 computing research abstracts. Each entry includes:

- `Title` and `Abstract`: Combined as input for the Methodology classifier; the other models use `Abstract` only
- `Discipline`: CS, IS, or IT
- `Subfield`: Discipline-specific research area (e.g., AI, ML, CLD, IOTNET)
- `Methodology`: Qualitative, Quantitative, or Mixed

This dataset will be used for all cross-validation runs.

In [2]:
import pandas as pd

# Load the complete labeled dataset
df = pd.read_csv("Data/NLP_Dataset_Title_Abstract_Discipline_Subfield_Methodology.csv")

# Basic overview
print("✅ Dataset loaded successfully.")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

# Label distributions
print("\nDiscipline Distribution:")
print(df['Discipline'].value_counts())

print("\nSubfield Distribution:")
print(df['Subfield'].value_counts())

print("\nMethodology Distribution:")
print(df['Methodology'].value_counts())

# Preview
df.head()

✅ Dataset loaded successfully.
Shape: (105, 6)
Columns: ['ID', 'Title', 'Abstract', 'Discipline', 'Subfield', 'Methodology']

Discipline Distribution:
CS    35
IS    35
IT    35
Name: Discipline, dtype: int64

Subfield Distribution:
CLD       10
ENT        9
SEC        9
CYB        8
IOTNET     8
AI         8
OPS        8
ML         7
IMP        7
BSP        7
CV         6
GOV        6
PAST       6
DSA        6
Name: Subfield, dtype: int64

Methodology Distribution:
Qualitative     49
Quantitative    46
Mixed           10
Name: Methodology, dtype: int64


|   | ID | Title | Abstract | Discipline | Subfield | Methodology |
|---|----|-------|----------|------------|----------|-------------|
| 0 | 1 | A survey on large language model (LLM) securit... | Large Language Models (LLMs), such as ChatGPT ... | CS | CYB | Qualitative |
| 1 | 2 | Detect Anything 3D in the Wild | Despite the success of deep learning in close-... | CS | CV | Quantitative |
| 2 | 3 | Survey of clustering algorithms | Data analysis plays an indispensable role for ... | CS | ML | Qualitative |
| 3 | 4 | Understanding egocentric activities | We present a method to analyze daily activitie... | CS | CV | Quantitative |
| 4 | 5 | High-performance Implementation of Elliptic Cu... | Elliptic curve cryptosystems are considered an... | CS | CYB | Quantitative |


## 🔍 Cross-Validation – Discipline Classifier (LogReg + Bigram TF-IDF)

We now evaluate the Discipline classifier (v1.1), which uses Logistic Regression with bigram TF-IDF features extracted from the `Abstract`.

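We'll perform 5-fold stratified cross-validation to measure how well the model generalizes across the entire dataset. Text is transformed with the final saved vectorizer before the folds are split. Because that vectorizer was fitted earlier, its vocabulary and IDF weights may already reflect abstracts that end up in a test fold, so the scores below are likely somewhat optimistic.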

In [3]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Define features and labels
X_disc = df["Abstract"]
y_disc = df["Discipline"]

# Transform text using final vectorizer
X_disc_tfidf = discipline_vectorizer.transform(X_disc)

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate
scores = cross_val_score(discipline_model, X_disc_tfidf, y_disc, cv=cv, scoring='accuracy')

# Output
print("✅ Cross-validation complete.")
print("Fold-wise Accuracies:", scores)
print("Mean Accuracy:", round(np.mean(scores), 4))
print("Std Deviation:", round(np.std(scores), 4))

✅ Cross-validation complete.
Fold-wise Accuracies: [0.80952381 0.76190476 0.9047619  0.66666667 0.57142857]
Mean Accuracy: 0.7429
Std Deviation: 0.1151
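
For comparison, a leakage-free variant would refit the vectorizer inside every training fold by wrapping it in a pipeline with the classifier. The sketch below is one way to do that under stated assumptions: the toy corpus is hypothetical, and `ngram_range=(1, 2)` with `min_df=2` is a guess at the saved vectorizer's settings, not a confirmed configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 10 short "abstracts" per discipline.
topics = {
    "CS": ["neural network", "graph algorithm", "compiler design"],
    "IS": ["enterprise adoption", "user acceptance", "business process"],
    "IT": ["cloud deployment", "network operations", "device management"],
}
texts, labels = [], []
for label, phrases in topics.items():
    for i in range(10):
        texts.append(f"{phrases[i % 3]} study of {phrases[(i + 1) % 3]}")
        labels.append(label)

# Vectorizer and classifier are refitted together on each training fold,
# so held-out text never influences the vocabulary or IDF weights.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # assumed settings
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print("Fold-wise accuracies:", pipe_scores)
```

In this notebook the equivalent call would pass `df["Abstract"]` and `df["Discipline"]` to `cross_val_score` with such a pipeline instead of the pre-transformed matrix.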


## 🧠 Cross-Validation – Subfield Classifier (CS only, v1.2)

We now evaluate the CS Subfield classifier (v1.2), which uses an SVM trained with SMOTE-augmented data and bigram TF-IDF features.

We'll apply 5-fold stratified cross-validation **within CS abstracts only** using the CS-specific vectorizer and model. This helps verify whether the classifier generalizes well across research subfields in the CS domain.

In [4]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Filter for CS discipline only
df_cs = df[df['Discipline'] == 'CS']

X_cs = df_cs["Abstract"]
y_cs = df_cs["Subfield"]

# Vectorize using CS-specific bigram TF-IDF
X_cs_tfidf = cs_vectorizer.transform(X_cs)

# Stratified 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_cs = cross_val_score(cs_model, X_cs_tfidf, y_cs, cv=cv, scoring='accuracy')

# Results
print("✅ Cross-validation complete (CS Subfield).")
print("Fold-wise Accuracies:", scores_cs)
print("Mean Accuracy:", round(np.mean(scores_cs), 4))
print("Std Deviation:", round(np.std(scores_cs), 4))

✅ Cross-validation complete (CS Subfield).
Fold-wise Accuracies: [0.28571429 0.42857143 0.14285714 0.71428571 0.42857143]
Mean Accuracy: 0.4
Std Deviation: 0.1895
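
The fold accuracies above are all multiples of 1/7, which is consistent with seven CS abstracts per test fold (35 abstracts over 5 folds). A small worked calculation makes the granularity explicit; the variable names are illustrative only.

```python
from fractions import Fraction

fold_size = 35 // 5  # 35 CS abstracts split into 5 equal test folds
fold_accuracies = [0.28571429, 0.42857143, 0.14285714, 0.71428571, 0.42857143]

# Recover the number of correctly classified abstracts in each fold.
correct_per_fold = [round(acc * fold_size) for acc in fold_accuracies]
step = Fraction(1, fold_size)

print("Correct per fold:", correct_per_fold)
print("Accuracy step per abstract:", float(step))
print("Overall:", sum(correct_per_fold), "of", fold_size * len(fold_accuracies))
```

A single abstract changes a fold's accuracy by about 0.14, so a large standard deviation across folds is expected at this sample size.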


## 🧠 Cross-Validation – Subfield Classifier (IS only, v1.2)

This section evaluates the IS Subfield classifier (v1.2), which uses an SVM trained with SMOTE and bigram TF-IDF features.

We'll perform 5-fold stratified cross-validation on only the IS abstracts using the IS-specific vectorizer and model. This reveals how well the classifier distinguishes subfields within Information Systems.

In [5]:
# Filter for IS discipline only
df_is = df[df['Discipline'] == 'IS']

X_is = df_is["Abstract"]
y_is = df_is["Subfield"]

# Vectorize using IS-specific bigram TF-IDF
X_is_tfidf = is_vectorizer.transform(X_is)

# Stratified 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_is = cross_val_score(is_model, X_is_tfidf, y_is, cv=cv, scoring='accuracy')

# Results
print("✅ Cross-validation complete (IS Subfield).")
print("Fold-wise Accuracies:", scores_is)
print("Mean Accuracy:", round(np.mean(scores_is), 4))
print("Std Deviation:", round(np.std(scores_is), 4))

✅ Cross-validation complete (IS Subfield).
Fold-wise Accuracies: [0.57142857 0.28571429 0.42857143 0.42857143 0.57142857]
Mean Accuracy: 0.4571
Std Deviation: 0.1069


## 🧠 Cross-Validation – Subfield Classifier (IT only, v1.2)

This section evaluates the IT Subfield classifier (v1.2), using an SVM trained with SMOTE and bigram TF-IDF features.

We’ll run 5-fold stratified cross-validation **only on IT abstracts** using the discipline-specific vectorizer and model. This measures how well the model generalizes across research subfields in the IT domain.

In [6]:
# Filter for IT discipline only
df_it = df[df['Discipline'] == 'IT']

X_it = df_it["Abstract"]
y_it = df_it["Subfield"]

# Vectorize using IT-specific bigram TF-IDF
X_it_tfidf = it_vectorizer.transform(X_it)

# Stratified 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_it = cross_val_score(it_model, X_it_tfidf, y_it, cv=cv, scoring='accuracy')

# Results
print("✅ Cross-validation complete (IT Subfield).")
print("Fold-wise Accuracies:", scores_it)
print("Mean Accuracy:", round(np.mean(scores_it), 4))
print("Std Deviation:", round(np.std(scores_it), 4))

✅ Cross-validation complete (IT Subfield).
Fold-wise Accuracies: [0.42857143 0.71428571 0.57142857 0.42857143 0.42857143]
Mean Accuracy: 0.5143
Std Deviation: 0.1143
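
The three subfield cells differ only in the discipline, vectorizer, and model they use, so they could be folded into one helper. The function below is a sketch of that idea; `cross_validate_subfield` and the toy frame are hypothetical names and data introduced for illustration, and the column names are taken from the dataset loaded above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def cross_validate_subfield(df, discipline, vectorizer, model, n_splits=5, seed=42):
    """Run stratified k-fold CV on the abstracts of a single discipline."""
    subset = df[df["Discipline"] == discipline]
    X = vectorizer.transform(subset["Abstract"])
    y = subset["Subfield"]
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")


# Toy frame (hypothetical data) so the sketch runs on its own.
rows = []
for subfield, phrase in [("AI", "neural planning agents"), ("CYB", "malware intrusion defense")]:
    for i in range(5):
        rows.append({"Discipline": "CS", "Subfield": subfield,
                     "Abstract": f"{phrase} experiment {i}"})
toy_df = pd.DataFrame(rows)

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_df["Abstract"])
toy_scores = cross_validate_subfield(toy_df, "CS", toy_vectorizer, SVC(kernel="linear"))
print("Mean accuracy:", toy_scores.mean().round(4))
```

With the notebook's objects, a call such as `cross_validate_subfield(df, "IS", is_vectorizer, is_model)` would reproduce the IS cell.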


## 🔬 Cross-Validation – Methodology Classifier (v2.0)

This section evaluates the Methodology classifier (v2.0), which uses an SVM trained with SMOTE and bigram TF-IDF features generated from the combined `Title + Abstract` fields.

We'll perform 5-fold stratified cross-validation on the full dataset to assess how well the model distinguishes between Qualitative, Quantitative, and Mixed Methods research.

In [7]:
# Combine Title and Abstract for input
X_meth = df["Title"] + " " + df["Abstract"]
y_meth = df["Methodology"]

# Vectorize using the v2.0 Title+Abstract TF-IDF vectorizer
X_meth_tfidf = methodology_vectorizer.transform(X_meth)

# Stratified 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_meth = cross_val_score(methodology_model, X_meth_tfidf, y_meth, cv=cv, scoring='accuracy')

# Results
print("✅ Cross-validation complete (Methodology Classifier v2.0).")
print("Fold-wise Accuracies:", scores_meth)
print("Mean Accuracy:", round(np.mean(scores_meth), 4))
print("Std Deviation:", round(np.std(scores_meth), 4))

✅ Cross-validation complete (Methodology Classifier v2.0).
Fold-wise Accuracies: [0.52380952 0.76190476 0.61904762 0.57142857 0.71428571]
Mean Accuracy: 0.6381
Std Deviation: 0.0883
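
When resampling is part of training, it should happen inside each training fold only, so the test fold keeps its natural class balance. The sketch below illustrates that pattern with plain random oversampling as a stand-in for SMOTE (a deliberate simplification, not the method used to train the saved model), a linear-kernel `SVC` (an assumed SVM variant), and synthetic dense features. It would need adapting for sparse TF-IDF matrices. If imbalanced-learn is available, its own pipeline class combined with `SMOTE` is the more usual route.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample


def oversample_to_majority(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.max())
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < target:
            extra = resample(X_cls, replace=True,
                             n_samples=int(target - count), random_state=seed)
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(target, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Synthetic features with the notebook's Methodology class counts (49 / 46 / 10).
rng = np.random.default_rng(0)
y = np.array([0] * 49 + [1] * 46 + [2] * 10)
X = rng.normal(size=(len(y), 5)) + y[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Resample the training fold only; the test fold keeps its natural imbalance.
    X_train, y_train = oversample_to_majority(X[train_idx], y[train_idx])
    model = SVC(kernel="linear").fit(X_train, y_train)
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print("Fold-wise accuracies:", np.round(fold_scores, 4))
```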


## ✅ Cross-Validation Summary – All Final Models

Below is the consolidated summary of 5-fold cross-validation results for each finalized classifier.

| **Model**                         | **Version** | **Classifier**             | **Input**            | **Mean Accuracy** | **Std Deviation** |
|----------------------------------|-------------|-----------------------------|-----------------------|-------------------|-------------------|
| Discipline                       | v1.1        | Logistic Regression         | Abstract              | **0.7429**        | 0.1151            |
| Subfield – CS                    | v1.2        | SVM + SMOTE                 | Abstract              | **0.4000**        | 0.1895            |
| Subfield – IS                    | v1.2        | SVM + SMOTE                 | Abstract              | **0.4571**        | 0.1069            |
| Subfield – IT                    | v1.2        | SVM + SMOTE                 | Abstract              | **0.5143**        | 0.1143            |
| Methodology                      | v2.0        | SVM + SMOTE                 | Title + Abstract      | **0.6381**        | 0.0883            |

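> All models were evaluated using stratified 5-fold cross-validation (`shuffle=True`, `random_state=42`).  
> Std Deviation is the population standard deviation of the five fold accuracies (`np.std` default, `ddof=0`).  
> Vectorization was consistent with each model’s architecture (bigram TF-IDF, min_df=2).  
> SMOTE was applied during training in Subfield and Methodology models to address class imbalance; whether it is re-applied inside each cross-validation fold depends on whether the saved artefact is a pipeline that includes the resampler.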
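
To put these accuracies in context, it can help to compare them with a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The short calculation below uses the label counts printed when the dataset was loaded. Subfield baselines are left out because per-discipline subfield counts are not shown in this notebook.

```python
# Label counts copied from the distributions printed earlier in this notebook.
label_counts = {
    "Discipline": {"CS": 35, "IS": 35, "IT": 35},
    "Methodology": {"Qualitative": 49, "Quantitative": 46, "Mixed": 10},
}
cv_mean_accuracy = {"Discipline": 0.7429, "Methodology": 0.6381}

baselines = {}
for task, counts in label_counts.items():
    # Accuracy of always predicting the most frequent class.
    baselines[task] = max(counts.values()) / sum(counts.values())
    print(f"{task}: baseline={baselines[task]:.4f}, "
          f"cross-validated={cv_mean_accuracy[task]:.4f}")
```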

---