# 📊 **Cross-Validation for All Models**

This notebook performs 5-fold cross-validation to evaluate the stability and generalization performance of multiple classification models. The models being evaluated include:
- **Discipline Classifier** (Logistic Regression model for predicting disciplines: CS, IS, IT)
- **Subfield Classifiers** (Logistic Regression models for predicting subfields within CS, IS, and IT)
- **Methodology Classifier** (Logistic Regression model for predicting research methodologies)

Cross-validation scores each model on several train/test splits rather than a single one, which gives a more reliable performance estimate and shows how much the results vary from split to split. The goal is to compare the models' performance across these subsets of the dataset.

In [12]:
import pandas as pd

# Load the dataset that includes Discipline, Subfield, and Methodology labels
data = pd.read_csv('NLP_Abstract_Dataset (Method)(105).csv')
X = data['Abstract']  # The text data
y_discipline = data['Discipline']  # Discipline labels (CS, IS, IT)
y_subfield = data['Subfield']  # Subfield labels (AI, ML, etc.)
y_methodology = data['Methodology']  # Methodology labels
data.head()  # Check if the dataset has loaded correctly

| Unnamed: 0 | ID | Discipline | Subfield | Methodology | Abstract |
|---|---|---|---|---|---|
| 0 | 1 | CS | CYB | QLT | Large Language Models (LLMs), such as ChatGPT ... |
| 1 | 2 | CS | CV | QNT | Despite the success of deep learning in close-... |
| 2 | 3 | CS | ML | QLT | Data analysis plays an indispensable role for ... |
| 3 | 4 | CS | CV | QNT | We present a method to analyze daily activitie... |
| 4 | 5 | CS | CYB | QNT | Elliptic curve cryptosystems are considered an... |


## 2. 🧹 Load Pre-Trained TF-IDF Vectorizer and Logistic Regression Models

In this section, we load the **TF-IDF vectorizer** and the **Logistic Regression models** that were previously fitted on the original dataset. These models include:
- **Discipline Classifier**
- **Subfield Classifiers** for CS, IS, and IT
- **Methodology Classifier**

In [16]:
import joblib

# Load TF-IDF vectorizer
tfidf_vectorizer = joblib.load('Artefacts/tfidf_vectorizer.pkl')

# Load Discipline classifier model
discipline_model = joblib.load('Artefacts/discipline_classifier_logreg.pkl')

# Load Subfield classifiers (CS, IS, IT)
cs_subfield_model = joblib.load('Artefacts/subfield_classifier_logreg_cs.pkl')
is_subfield_model = joblib.load('Artefacts/subfield_classifier_logreg_is.pkl')
it_subfield_model = joblib.load('Artefacts/subfield_classifier_logreg_it.pkl')

# Load Methodology classifier model
methodology_model = joblib.load('Artefacts/methodology_classifier_logreg.pkl')

## 3. 🔍 Perform 5-Fold Cross-Validation for Discipline Model

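This section performs 5-fold cross-validation to evaluate the **Discipline Classifier** (Logistic Regression model for predicting disciplines: CS, IS, IT), using accuracy as the metric. The results give an estimate of the model's performance and of how much that performance varies across different subsets of the dataset. Note that `cross_val_score` clones the estimator and refits it on each training split, so the scores reflect the model's configuration rather than the weights loaded from disk. The TF-IDF vectorizer, by contrast, is applied as loaded and is not refitted per fold.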

In [19]:
from sklearn.model_selection import cross_val_score

# Vectorize the abstracts with the loaded TF-IDF vectorizer
X_tfidf = tfidf_vectorizer.transform(X)

# Perform cross-validation for Discipline model
discipline_cv_scores = cross_val_score(
    discipline_model, X_tfidf, y_discipline, cv=5, scoring='accuracy'
)

# Display results
print(f"Discipline Model - Cross-validation scores: {discipline_cv_scores}")
print(f"Average accuracy: {discipline_cv_scores.mean()}")
print(f"Standard deviation: {discipline_cv_scores.std()}")

Discipline Model - Cross-validation scores: [0.76190476 0.80952381 0.71428571 0.85714286 0.66666667]
Average accuracy: 0.7619047619047619
Standard deviation: 0.06734350297014738
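The summary statistics above can be re-derived from the five printed fold scores. The sketch below does this with the standard library and also reports the sample standard deviation and a standard error, which the notebook does not print. It additionally recovers per-fold correct counts under the assumption that each test fold holds 21 abstracts; that fold size is inferred from the scores all being multiples of 1/21 and from the `(105)` in the dataset filename, not stated by the notebook.

```python
import statistics

# Per-fold accuracies copied from the printed output above
scores = [0.76190476, 0.80952381, 0.71428571, 0.85714286, 0.66666667]

mean = statistics.fmean(scores)
pop_std = statistics.pstdev(scores)     # population std (ddof=0), NumPy's default for .std()
sample_std = statistics.stdev(scores)   # sample std (ddof=1)
std_err = sample_std / len(scores) ** 0.5

# ASSUMPTION: 21 abstracts per test fold (105 rows / 5 folds)
fold_size = 21
correct_counts = [round(s * fold_size) for s in scores]

print(f"mean={mean:.4f}  pop_std={pop_std:.4f}  "
      f"sample_std={sample_std:.4f}  std_err={std_err:.4f}")
print(f"correct per fold (assuming {fold_size} per fold): {correct_counts}")
```

If the fold-size assumption holds, a single misclassified abstract moves a fold's accuracy by about 4.8 percentage points, so part of the fold-to-fold spread may simply reflect the small test folds.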


## 4. Perform 5-Fold Cross-Validation for CS, IS, and IT Subfield Classifiers

This section performs **5-fold cross-validation** to evaluate the performance of the **CS Subfield Classifier**, **IS Subfield Classifier**, and **IT Subfield Classifier**.

#### **What Each Classifier Predicts**:
- The **CS Subfield Classifier** predicts subfields such as **AI**, **ML**, **CV**, and **CYB** within **Computer Science**.
- The **IS Subfield Classifier** predicts subfields such as **BSP**, **DSA**, and **ENT** within **Information Systems**.
- The **IT Subfield Classifier** predicts subfields such as **CLD**, **IOTNET**, and **OPS** within **Information Technology**.

#### **Cross-validation Process**:
1. **Data Split**: The data for each classifier is split into 5 folds of roughly equal size.
2. **Model Training**: In each of the 5 iterations, the model is trained on 4 folds and tested on the remaining held-out fold, so every fold serves as the test set exactly once.
3. **Performance Evaluation**: The accuracy on each held-out fold is recorded, and the mean and standard deviation across the 5 folds are calculated.
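
The three steps above can be sketched in plain Python. This is a toy illustration rather than scikit-learn's implementation: the helper names `k_fold_indices` and `cross_validate_majority` are hypothetical, the "model" is a majority-class baseline, the labels are made up, and the folds are plain contiguous blocks (for classifiers, `cross_val_score` with an integer `cv` uses stratified folds instead).

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Step 1: split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more item
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate_majority(labels, k=5):
    """Steps 2-3: train on k-1 folds, test on the held-out fold, record accuracy."""
    scores = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        # "Training": the toy model memorises the most common training label
        majority = Counter(train_labels).most_common(1)[0][0]
        # "Testing": fraction of held-out items whose label matches the prediction
        correct = sum(labels[i] == majority for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical labels, 105 items so that each of the 5 folds holds 21
labels = ["CS", "IS", "IT", "CS", "CS"] * 21
scores = cross_validate_majority(labels, k=5)
print("per-fold accuracy:", [round(s, 3) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 3))
```

A majority-class baseline like this one is also a useful reference point: a real classifier's cross-validated accuracy is only informative if it clearly exceeds it.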

The cross-validation results will provide an estimate of the model’s stability and generalization ability across different subsets of the dataset.
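
For that estimate to reflect unseen data, any preprocessing learned from the data should ideally be fitted inside each training fold as well. One possible way to do this, sketched below, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` so that `cross_val_score` refits both on every split. This is an alternative to the notebook's approach, not a description of it: the toy corpus and labels are hypothetical stand-ins for `X` and the label columns, and the default `TfidfVectorizer` and `LogisticRegression(max_iter=1000)` settings are assumptions, since the saved models' hyperparameters are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 5 short "abstracts" per discipline
texts = [
    "neural network image classification",
    "deep learning for computer vision",
    "cryptography and elliptic curves",
    "machine learning model training",
    "language models and reasoning",
    "enterprise systems adoption in firms",
    "business process management study",
    "decision support analytics for managers",
    "information systems strategy alignment",
    "digital transformation in organizations",
    "cloud infrastructure deployment automation",
    "internet of things network protocols",
    "operations monitoring and incident response",
    "virtual machines and container orchestration",
    "network configuration for data centers",
]
labels = ["CS"] * 5 + ["IS"] * 5 + ["IT"] * 5

# The vectorizer is part of the estimator, so it is refitted on each training split
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

In the notebook itself, the same pattern would take `X` and one of the label series in place of `texts` and `labels`. Scores obtained this way may come out lower than with a vectorizer fitted on the full dataset, because vocabulary and IDF weights no longer see the held-out abstracts.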