# 📊 **Cross-Validation for All Models**

This notebook performs 5-fold cross-validation to evaluate the stability and generalization performance of multiple classification models. The models being evaluated include:
- **Discipline Classifier** (Logistic Regression model for predicting disciplines: CS, IS, IT)
- **Subfield Classifiers** (Logistic Regression models for predicting subfields within CS, IS, and IT)
- **Methodology Classifier** (Logistic Regression model for predicting research methodologies)

Each model is evaluated with cross-validation so that the reported accuracy does not depend on a single train/test split. The goal is to compare the models' accuracy, and how much it varies, across different subsets of the dataset.

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Machine Learning models and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Loading pre-trained models
import joblib

# Visualization (optional, if you want to plot cross-validation results)
import matplotlib.pyplot as plt

In [3]:
# Load the dataset that includes Discipline, Subfield, and Methodology labels
data = pd.read_csv('NLP_Abstract_Dataset (Method)(105).csv')
X = data['Abstract']  # The text data
y_discipline = data['Discipline']  # Discipline labels (CS, IS, IT)
y_subfield = data['Subfield']  # Subfield labels (AI, ML, etc.)
y_methodology = data['Methodology']  # Methodology labels
data.head()  # Check if the dataset has loaded correctly

| | ID | Discipline | Subfield | Methodology | Abstract |
|---|---|---|---|---|---|
| 0 | 1 | CS | CYB | QLT | Large Language Models (LLMs), such as ChatGPT ... |
| 1 | 2 | CS | CV | QNT | Despite the success of deep learning in close-... |
| 2 | 3 | CS | ML | QLT | Data analysis plays an indispensable role for ... |
| 3 | 4 | CS | CV | QNT | We present a method to analyze daily activitie... |
| 4 | 5 | CS | CYB | QNT | Elliptic curve cryptosystems are considered an... |
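
Because fold composition depends on how many abstracts carry each label, it can help to inspect the label counts before cross-validating. The following is a minimal sketch on a hypothetical six-row frame that only reuses the dataset's column names; `sample`, `label_counts`, and `min_class_size` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature frame with the same label columns as the dataset above
sample = pd.DataFrame({
    "Discipline": ["CS", "CS", "IS", "IT", "IT", "IS"],
    "Subfield": ["CYB", "CV", "BSP", "CLD", "OPS", "DSA"],
    "Methodology": ["QLT", "QNT", "QLT", "QNT", "QNT", "QLT"],
})

# Count how many rows carry each label in each target column
label_counts = {col: sample[col].value_counts().to_dict() for col in sample.columns}

# Size of the smallest class per column
min_class_size = {col: min(counts.values()) for col, counts in label_counts.items()}
print(min_class_size)
```

On the real data, the same two lines could be run on `data[['Discipline', 'Subfield', 'Methodology']]`. scikit-learn's stratified k-fold emits a warning when the smallest class has fewer members than the number of folds, so a small `min_class_size` is worth knowing about in advance.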


## 2. 🧹 Load Pre-Trained TF-IDF Vectorizer and Logistic Regression Models

In this section, we load the **pre-trained TF-IDF vectorizer** and the **Discipline Classifier** (Logistic Regression) that were previously trained on the original dataset. The remaining pre-trained models are loaded in the sections that evaluate them:
- **Subfield Classifiers** for CS, IS, and IT (Section 4)
- **Methodology Classifier** (Section 5)

In [11]:
import joblib

# Load TF-IDF vectorizer
tfidf_vectorizer = joblib.load('Artefacts/tfidf_vectorizer.pkl')

# Load Discipline classifier model
discipline_model = joblib.load('Artefacts/discipline_classifier_logreg.pkl')

## 3. 🔍 Perform 5-Fold Cross-Validation for Discipline Model

This section performs 5-fold cross-validation to evaluate the **Discipline Classifier** (Logistic Regression model for predicting disciplines: CS, IS, IT). The cross-validation results will provide an estimate of the model's performance and stability across different subsets of the dataset.

In [12]:
from sklearn.model_selection import cross_val_score

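# Perform cross-validation for Discipline model.
# cross_val_score clones the estimator and refits it on each training split,
# so the loaded weights are not reused. The TF-IDF vectorizer, however, is
# already fitted, so its vocabulary and IDF weights may include information
# from the held-out fold.
discipline_cv_scores = cross_val_score(discipline_model, tfidf_vectorizer.transform(X), y_discipline, cv=5, scoring='accuracy')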

# Display results
print(f"Discipline Model - Cross-validation scores: {discipline_cv_scores}")
print(f"Average accuracy: {discipline_cv_scores.mean()}")
print(f"Standard deviation: {discipline_cv_scores.std()}")

Discipline Model - Cross-validation scores: [0.76190476 0.80952381 0.71428571 0.85714286 0.66666667]
Average accuracy: 0.7619047619047619
Standard deviation: 0.06734350297014738
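
The summary statistics above can be reproduced by hand. The sketch below assumes the 105 abstracts (35 per discipline) are split into five equal test folds of 21; `fold_size` and `correct_per_fold` are illustrative names, not part of the notebook.

```python
import numpy as np

# Fold accuracies copied from the output above
scores = np.array([0.76190476, 0.80952381, 0.71428571, 0.85714286, 0.66666667])

# Assumption: 105 abstracts split evenly, so each test fold holds 21 abstracts
fold_size = 105 // 5
correct_per_fold = np.rint(scores * fold_size).astype(int)

mean_acc = scores.mean()
std_pop = scores.std()            # population std (ddof=0), as printed above
std_sample = scores.std(ddof=1)   # sample std, larger when there are only 5 folds

print(correct_per_fold)
print(round(mean_acc, 4), round(std_pop, 4), round(std_sample, 4))
```

Each fold accuracy is a multiple of 1/21, so a difference of one fold score "step" is a single abstract. The notebook reports the population standard deviation; with only five folds the sample standard deviation is somewhat larger, so it helps to state which one is meant.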


## 4. Perform 5-Fold Cross-Validation for CS, IS, and IT Subfield Classifiers

This section performs **5-fold cross-validation** to evaluate the performance of the **CS Subfield Classifier**, **IS Subfield Classifier**, and **IT Subfield Classifier**.

#### **What Each Classifier Predicts**:
- The **CS Subfield Classifier** predicts subfields like **AI**, **ML**, **CV**, and **CYB** within **Computer Science**.
- The **IS Subfield Classifier** predicts subfields like **BSP**, **DSA**, **ENT** within **Information Systems**.
- The **IT Subfield Classifier** predicts subfields like **CLD**, **IOTNET**, **OPS** within **Information Technology**.

#### **Cross-validation Process**:
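1. **Data Split**: The data passed to `cross_val_score` is split into 5 subsets (folds). For a classifier with `cv=5`, scikit-learn uses stratified folds, so each fold keeps roughly the same label proportions.
2. **Model Training**: For each fold, a fresh copy of the model is fitted on the other 4 folds and tested on the held-out fold. The weights of the loaded pre-trained model are not reused; only its hyperparameters are.
3. **Performance Evaluation**: The accuracy on each held-out fold is recorded, and the mean and standard deviation across the 5 folds are calculated.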

The cross-validation results will provide an estimate of the model’s stability and generalization ability across different subsets of the dataset.
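
The three steps above can be sketched as an explicit loop. This is an illustration on a hypothetical toy corpus (`texts`, `labels`), not the notebook's data, and it builds a fresh, unfitted pipeline in each fold rather than loading the saved artefacts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 10 short texts per class
texts = np.array(
    [f"neural network image model {i}" for i in range(10)]
    + [f"cloud server network deployment {i}" for i in range(10)]
)
labels = np.array(["CS"] * 10 + ["IT"] * 10)

# 1. Data split: 5 stratified folds
skf = StratifiedKFold(n_splits=5)

fold_scores = []
fold_sizes = []
for train_idx, test_idx in skf.split(texts, labels):
    # 2. Model training: fit a fresh pipeline on the 4 training folds
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts[train_idx], labels[train_idx])
    # 3. Performance evaluation: accuracy on the held-out fold
    fold_scores.append(model.score(texts[test_idx], labels[test_idx]))
    fold_sizes.append(len(test_idx))

fold_scores = np.array(fold_scores)
print(fold_scores, fold_scores.mean(), fold_scores.std())
```

`cross_val_score(..., cv=5)` performs the same kind of loop internally for a classifier, which is why the manual version is rarely needed outside of debugging.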

#### **Load Pre-trained TF-IDF Vectorizers and Classifiers**

In [13]:
# Load CS Subfield Models
cs_vectorizer = joblib.load('Artefacts/tfidf_vectorizer_cs.pkl')
cs_classifier = joblib.load('Artefacts/subfield_classifier_logreg_cs.pkl')

# Load IS Subfield Models
is_vectorizer = joblib.load('Artefacts/tfidf_vectorizer_is.pkl')
is_classifier = joblib.load('Artefacts/subfield_classifier_logreg_is.pkl')

# Load IT Subfield Models
it_vectorizer = joblib.load('Artefacts/tfidf_vectorizer_it.pkl')
it_classifier = joblib.load('Artefacts/subfield_classifier_logreg_it.pkl')

#### **Split Dataset into CS, IS, and IT Subsets**

In [14]:
# Subsets based on Discipline
cs_data = data[data['Discipline'] == 'CS']
is_data = data[data['Discipline'] == 'IS']
it_data = data[data['Discipline'] == 'IT']

# Texts and Labels for each subset
X_cs = cs_data['Abstract']
y_cs = cs_data['Subfield']

X_is = is_data['Abstract']
y_is = is_data['Subfield']

X_it = it_data['Abstract']
y_it = it_data['Subfield']

#### **Create Pipelines (Vectorizer + Classifier for Each Subfield)**

In [15]:
# CS Subfield Pipeline
cs_pipeline = make_pipeline(cs_vectorizer, cs_classifier)

# IS Subfield Pipeline
is_pipeline = make_pipeline(is_vectorizer, is_classifier)

# IT Subfield Pipeline
it_pipeline = make_pipeline(it_vectorizer, it_classifier)

#### **Perform 5-Fold Cross-Validation for Each Subfield Classifier**

In [16]:
# Cross-Validation for CS Subfield Classifier
cs_scores = cross_val_score(cs_pipeline, X_cs, y_cs, cv=5)
print("CS Subfield Classifier - Cross-validation scores:", cs_scores)
print("Average accuracy:", np.mean(cs_scores))
print("Standard deviation:", np.std(cs_scores))

# Cross-Validation for IS Subfield Classifier
is_scores = cross_val_score(is_pipeline, X_is, y_is, cv=5)
print("\nIS Subfield Classifier - Cross-validation scores:", is_scores)
print("Average accuracy:", np.mean(is_scores))
print("Standard deviation:", np.std(is_scores))

# Cross-Validation for IT Subfield Classifier
it_scores = cross_val_score(it_pipeline, X_it, y_it, cv=5)
print("\nIT Subfield Classifier - Cross-validation scores:", it_scores)
print("Average accuracy:", np.mean(it_scores))
print("Standard deviation:", np.std(it_scores))

CS Subfield Classifier - Cross-validation scores: [0.14285714 0.57142857 0.42857143 0.14285714 0.14285714]
Average accuracy: 0.2857142857142857
Standard deviation: 0.18070158058105024

IS Subfield Classifier - Cross-validation scores: [0.42857143 0.28571429 0.57142857 0.42857143 0.28571429]
Average accuracy: 0.4
Standard deviation: 0.10690449676496976

IT Subfield Classifier - Cross-validation scores: [0.42857143 0.42857143 0.28571429 0.71428571 0.28571429]
Average accuracy: 0.42857142857142855
Standard deviation: 0.15649215928719032


## 📊 Interpretation of 5-Fold Cross-Validation Results for Subfield Classifiers

### Overview:
We performed 5-fold cross-validation to evaluate the stability and generalization performance of the CS, IS, and IT Subfield Classifiers. The models were evaluated based on their average accuracy across the 5 folds and the standard deviation of these scores.

---

### CS Subfield Classifier:
- **Cross-validation scores:** [0.14285714, 0.57142857, 0.42857143, 0.14285714, 0.14285714]
- **Average accuracy:** 28.57%
- **Standard deviation:** 18.07%

**Interpretation:**
- The CS subfield classifier shows **high variance** between folds.
- With 35 CS abstracts, each test fold holds 7. One fold reaches 57% (4 of 7 correct) and another 43% (3 of 7), but the other three drop to 14% (1 of 7).
- The model appears **unstable** and does not generalize consistently across different data splits.

---

### IS Subfield Classifier:
- **Cross-validation scores:** [0.42857143, 0.28571429, 0.57142857, 0.42857143, 0.28571429]
- **Average accuracy:** 40.00%
- **Standard deviation:** 10.69%

**Interpretation:**
- The IS subfield classifier performs slightly better than the CS classifier, with a **higher average accuracy**.
- The **variance is lower**, suggesting **slightly more stable** predictions across folds.
- However, overall accuracy is still quite low, indicating difficulty in subclassifying IS abstracts reliably.

---

### IT Subfield Classifier:
- **Cross-validation scores:** [0.42857143, 0.42857143, 0.28571429, 0.71428571, 0.28571429]
- **Average accuracy:** 42.86%
- **Standard deviation:** 15.65%

**Interpretation:**
- The IT subfield classifier achieved the **highest average accuracy** among the three models, although its lead over the IS classifier (42.86% vs. 40.00%) corresponds to a single abstract out of 35.
- The standard deviation remains moderately high, showing **inconsistency** across different splits: fold accuracies range from 28.57% to 71.43%.

---

### Key Takeaways:
- **All three subfield classifiers show limited generalization** when trained on small datasets (35 records each for CS, IS, and IT).
- **High variance** suggests **sensitivity to data splits**, which is expected given the small sample size.
- **Future work** could involve:
  - Increasing dataset size.
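  - Applying stronger regularization (for example, a smaller `C` in `LogisticRegression`).
  - Using repeated stratified k-fold to average over several random splits. Note that `cross_val_score` with `cv=5` already stratifies the folds for classifiers, so stratification alone would not change these results.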
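
One way to put numbers this noisy into context is to repeat the stratified split several times and compare against a majority-class baseline. The sketch below is an illustration under stated assumptions: the random numeric features in `X` and the 12/12/11 class split in `y` are hypothetical stand-ins for TF-IDF features of a 35-abstract subset, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 35 samples, 3 classes, 20 numeric features
y = np.array([0] * 12 + [1] * 12 + [2] * 11)
X = rng.normal(size=(35, 20)) + 0.5 * y[:, None]

# 5 folds repeated 10 times with different shuffles gives 50 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)
# Baseline that always predicts the most frequent training label
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring="accuracy"
)

print(f"model:    {model_scores.mean():.3f} +/- {model_scores.std():.3f}")
print(f"baseline: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
```

Averaging over repeated splits reduces the dependence on one particular partition, and the baseline shows how much of the accuracy could be obtained without looking at the text at all.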

---

## 5. Methodology Classification - Cross-Validation

In this section, we evaluate the performance of the Methodology classification model using 5-fold stratified cross-validation.  
We load the pre-trained TF-IDF vectorizer and Logistic Regression model for Methodology classification from the Artefacts folder.

The dataset of research abstracts is transformed using the TF-IDF vectorizer, and cross-validation is performed on the transformed features.  
Accuracy scores for each fold are recorded, along with the overall average accuracy and standard deviation.

This provides insight into the model's generalization ability across different subsets of the data.

In [17]:
# Load the Methodology vectorizer and classifier
methodology_vectorizer = joblib.load('Artefacts/tfidf_vectorizer_methodology.pkl')
methodology_model = joblib.load('Artefacts/methodology_classifier_logreg.pkl')

# Transform the abstracts with the pre-fitted Methodology vectorizer
# (y_methodology was already defined when the dataset was loaded)
X_methodology = methodology_vectorizer.transform(X)

# Cross-validation
methodology_cv_scores = cross_val_score(methodology_model, X_methodology, y_methodology, cv=5, scoring='accuracy')

# Display results
print(f"Methodology Model - Cross-validation scores: {methodology_cv_scores}")
print(f"Average accuracy: {methodology_cv_scores.mean():.4f}")
print(f"Standard deviation: {methodology_cv_scores.std():.4f}")

Methodology Model - Cross-validation scores: [0.57142857 0.57142857 0.76190476 0.57142857 0.61904762]
Average accuracy: 0.6190
Standard deviation: 0.0738
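
The cell above transforms every abstract with a vectorizer that was fitted before the data was split. A leakage-free variant, sketched here on hypothetical toy abstracts (`abstracts`, `methods`), places an unfitted `TfidfVectorizer` inside the pipeline so that the vocabulary and IDF weights are learned from each training split only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy abstracts and methodology labels (not the notebook's data)
abstracts = np.array(
    [f"we interview participants about theme {i}" for i in range(10)]
    + [f"we measure accuracy on benchmark {i}" for i in range(10)]
)
methods = np.array(["QLT"] * 10 + ["QNT"] * 10)

# cross_val_score clones the whole pipeline and refits it on each training split,
# so the vectorizer never sees the held-out abstracts
leak_free = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(leak_free, abstracts, methods, cv=5, scoring="accuracy")

print(f"{scores.mean():.4f} +/- {scores.std():.4f}")
```

On the real data this would mean passing the raw `X` (abstract text) rather than the transformed matrix, with the vectorizer's original hyperparameters copied into the new `TfidfVectorizer`.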


### Methodology Classification - Cross-Validation Results

- **Average Accuracy**: 0.6190
- **Standard Deviation**: 0.0738

---

### Interpretation

The Methodology classification model achieved an average accuracy of **61.9%** across 5 folds.  
The standard deviation of the fold accuracies was **0.0738** (about 7.4 percentage points), indicating moderate variance across folds.  
This is lower than the Discipline classifier's 76.2%, which may reflect that research methodology is harder to infer from an abstract than discipline.  
Whether 61.9% represents reasonable generalization depends on the class balance of the Methodology labels, so it should be compared against a majority-class baseline.