# 📚 Cross-Validation for Discipline Classifier

This notebook performs 5-fold cross-validation to evaluate the stability and generalization performance of the Discipline Classifier (Logistic Regression model trained on TF-IDF features).

## 1. 📥 Import Libraries

In [1]:
import pandas as pd
import joblib
from sklearn.model_selection import cross_val_score

## 2. 📂 Load Dataset

In [8]:
# Load the labeled dataset
data = pd.read_csv('NLP_Abstract_Dataset (Discipline)(105).csv')  
X = data['Abstract']
y = data['Discipline']
data.head()

|   | ID | Discipline | Abstract |
|---|----|------------|----------|
| 0 | 1 | CS | Large Language Models (LLMs), such as ChatGPT ... |
| 1 | 2 | CS | Despite the success of deep learning in close-... |
| 2 | 3 | CS | Data analysis plays an indispensable role for ... |
| 3 | 4 | CS | The goal of user experience design in industry... |
| 4 | 5 | CS | Elliptic curve cryptosystems are considered an... |
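
Before cross-validating, it can help to check the label balance and look for missing abstracts, since stratified folds need enough examples of every class and a missing text value would typically make the vectorizer raise an error. The sketch below uses a hypothetical miniature frame (`toy`) with the same column names as the dataset; the label `OTHER` and the sample texts are placeholders, not values from the real file.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the dataset above.
# The "OTHER" label and the texts are placeholders for illustration only.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Discipline": ["CS", "CS", "CS", "OTHER", "OTHER", "OTHER"],
    "Abstract": ["text a", "text b", "text c", "text d", None, "text f"],
})

# Number of abstracts per discipline; the smallest class limits how many
# stratified folds can be used without warnings.
class_counts = toy["Discipline"].value_counts()
smallest_class = int(class_counts.min())

# Rows with a missing abstract should be dropped or filled before vectorizing.
missing_abstracts = int(toy["Abstract"].isna().sum())

print(class_counts)
print("Smallest class size:", smallest_class)
print("Missing abstracts:", missing_abstracts)
```

On the real data, the same two checks would be run on `data['Discipline']` and `data['Abstract']`.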


## 3. 🧹 Load Pre-Trained TF-IDF Vectorizer and Logistic Regression Model

In [9]:
# Load the fitted TF-IDF vectorizer and the trained Logistic Regression model
# (joblib is already imported in section 1)
tfidf_vectorizer = joblib.load('Artefacts/tfidf_vectorizer.pkl')
discipline_model = joblib.load('Artefacts/discipline_classifier_logreg.pkl')

## 4. ✨ Transform Input Text Data

In [10]:
# Transform abstracts into TF-IDF vectors
X_tfidf = tfidf_vectorizer.transform(X)

## 5. 🔍 Perform 5-Fold Cross-Validation

In [11]:
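# Perform 5-fold cross-validation (stratified folds, since the estimator is a classifier).
# Note: cross_val_score clones the estimator and re-fits it on each training split,
# so the loaded model's fitted weights are not reused; only its hyperparameters are.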
cv_scores = cross_val_score(discipline_model, X_tfidf, y, cv=5, scoring='accuracy')

# Display results
print("Cross-validation scores for each fold:", cv_scores)
print("Average accuracy:", cv_scores.mean())
print("Standard deviation:", cv_scores.std())

Cross-validation scores for each fold: [0.71428571 0.9047619  0.71428571 0.85714286 0.66666667]
Average accuracy: 0.7714285714285715
Standard deviation: 0.09233675918888246
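
One caveat with the cell above is that the TF-IDF vectorizer was fitted before the folds were created, so if it was fitted on these same abstracts, its vocabulary and IDF weights have already seen every test fold. A common way to avoid this is to wrap the vectorizer and classifier in a pipeline and cross-validate on the raw text, so both are re-fitted inside each training split. The following is a minimal sketch under stated assumptions: the toy corpus, the `OTHER` label, the default `TfidfVectorizer()` settings, and `max_iter=1000` are all illustrative and are not the notebook's saved artefacts or hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for data['Abstract'] and data['Discipline'].
cs_words = ["neural", "network", "algorithm", "compiler", "encryption"]
other_words = ["protein", "enzyme", "genome", "cell", "tissue"]
texts = (
    [f"study of {w} methods sample {i}" for i in range(2) for w in cs_words]
    + [f"study of {w} methods sample {i}" for i in range(2) for w in other_words]
)
labels = ["CS"] * 10 + ["OTHER"] * 10

# The vectorizer is part of the pipeline, so it is fitted only on the
# training portion of each fold and never sees that fold's test texts.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("Cross-validation scores for each fold:", scores)
print("Average accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

To apply this to the real data, `texts` and `labels` would be replaced by `X` and `y`, and the pipeline steps would be configured with the same settings that were used to train the saved artefacts.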


## 6. 📝 Interpretation of Results

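The model achieved an average accuracy of **77.1%** with a standard deviation of **9.2%** across 5 folds. Per-fold accuracy ranges from 66.7% to 90.5%, so performance varies noticeably between subsets of the data. Each fold score is a multiple of 1/21, which is consistent with 21 test abstracts per fold (105 in total), so a single misclassification shifts a fold's accuracy by about 4.8 percentage points. The results suggest reasonable generalization, but the estimate is noisy at this sample size and should not be read as evidence of a highly stable model.

Two caveats apply. First, `cross_val_score` re-fits a clone of the classifier on each training split, so these scores evaluate the model's configuration rather than the saved fitted model. Second, the TF-IDF vectorizer was fitted before the folds were created; if it was fitted on these same abstracts, the scores may be slightly optimistic.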
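
The summary statistics can be re-derived from the reported fold scores alone. The sketch below uses only the standard library; the fold size of 21 is an assumption inferred from the scores and the 105 in the dataset filename, not something the notebook prints.

```python
import statistics

# Per-fold accuracies copied from the cross-validation output above.
fold_scores = [0.71428571, 0.9047619, 0.71428571, 0.85714286, 0.66666667]

mean_acc = statistics.mean(fold_scores)
pop_std = statistics.pstdev(fold_scores)    # population std, same as NumPy's default .std()
sample_std = statistics.stdev(fold_scores)  # sample std (ddof=1), larger with only 5 folds
spread = max(fold_scores) - min(fold_scores)

# Assumed fold size of 21 test abstracts: convert accuracies back to counts.
fold_size = 21
correct_per_fold = [round(s * fold_size) for s in fold_scores]

print(f"Mean accuracy:        {mean_acc:.4f}")
print(f"Population std:       {pop_std:.4f}")
print(f"Sample std (ddof=1):  {sample_std:.4f}")
print(f"Best minus worst:     {spread:.4f}")
print("Correct per fold:", correct_per_fold)  # [15, 19, 15, 18, 14]
```

The gap between the best and worst fold is about 24 percentage points, which corresponds to only five abstracts out of 21, a reminder of how coarse each fold's score is at this dataset size.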