# 🧠 Methodology Classifier – v2.2 (SciBERT + XGBoost)
This experiment tests whether domain-specific contextual embeddings from SciBERT, combined with a non-linear classifier (XGBoost), can push Methodology classification accuracy beyond the 71% of the earlier TF-IDF + SVM setup (v2.0), with a target range of 90–95%.

## Imports

In [1]:
# Basic imports
import pandas as pd
import numpy as np

# For embeddings
from sentence_transformers import SentenceTransformer

# For preprocessing and classification
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier

# For saving models
import joblib

## Load dataset and preprocess text

In [2]:
# Load dataset
df = pd.read_csv("Data/NLP_Dataset_Title_Abstract_Discipline_Subfield_Methodology.csv")

# Combine Title + Abstract
df["text"] = df["Title"].fillna('') + " " + df["Abstract"].fillna('')
df["text"] = df["text"].str.strip()

# Drop rows with missing Methodology
df = df.dropna(subset=["Methodology"])

# Extract inputs and labels
texts = df["text"].tolist()
labels = df["Methodology"].tolist()

In [3]:
# Show first 3 rows
print("✅ Sample rows:")
display(df[["Title", "Abstract", "text", "Methodology"]].head(3))

# Show number of samples and class distribution
print("\n📊 Dataset size:", len(df))
print("\n🔢 Methodology class distribution:")
print(df["Methodology"].value_counts())

✅ Sample rows:


| | Title | Abstract | text | Methodology |
|---|---|---|---|---|
| 0 | A survey on large language model (LLM) securit... | Large Language Models (LLMs), such as ChatGPT ... | A survey on large language model (LLM) securit... | Qualitative |
| 1 | Detect Anything 3D in the Wild | Despite the success of deep learning in close-... | Detect Anything 3D in the Wild Despite the suc... | Quantitative |
| 2 | Survey of clustering algorithms | Data analysis plays an indispensable role for ... | Survey of clustering algorithms Data analysis ... | Qualitative |



📊 Dataset size: 105

🔢 Methodology class distribution:
Methodology
Qualitative     49
Quantitative    46
Mixed           10
Name: count, dtype: int64
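
A useful reference point for the accuracy figures reported later is the majority-class baseline. The sketch below recomputes class shares from the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {"Qualitative": 49, "Quantitative": 46, "Mixed": 10}
total = sum(class_counts.values())

# Share of each class in the dataset
class_shares = {name: count / total for name, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class
majority_class = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_class] / total

for name, share in class_shares.items():
    print(f"{name}: {share:.1%}")
print(f"Majority baseline ({majority_class}): {majority_baseline:.1%}")
```

Any model in this notebook should clear this baseline by a comfortable margin before its accuracy is treated as meaningful.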


## Generate SciBERT embeddings

In [4]:
model = SentenceTransformer('allenai/scibert_scivocab_uncased')
X = model.encode(texts, show_progress_bar=True)

No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with mean pooling.
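
The message above means the raw SciBERT checkpoint was wrapped with a mean-pooling layer: each text's vector is the average of its token embeddings, with padding positions excluded. The following is a dependency-free toy illustration of that idea, not the library's implementation; `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input
    return [s / max(n_real, 1) for s in summed]

# Toy example: 3 real tokens and 1 padding token, embedding size 2
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```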



## Encode Labels 

In [5]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Check label encoding
print("🔠 Label classes:", label_encoder.classes_)
print("🔢 Encoded values:", np.unique(y))

🔠 Label classes: ['Mixed' 'Qualitative' 'Quantitative']
🔢 Encoded values: [0 1 2]
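
The mapping printed above follows from `LabelEncoder` assigning integer codes in sorted order of the unique labels. A minimal pure-Python sketch of the same mapping, using a hypothetical four-item label list:

```python
labels_example = ["Qualitative", "Quantitative", "Mixed", "Qualitative"]

# Integer codes follow the sorted order of the unique labels
classes = sorted(set(labels_example))
label_to_id = {label: idx for idx, label in enumerate(classes)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in labels_example]
decoded = [id_to_label[idx] for idx in encoded]

print(label_to_id)
print(encoded)
```

Keeping the inverse mapping around matters at inference time, which is why the fitted encoder is saved alongside the model later in the notebook.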


## Train/test split + scaling

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the dense vectors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
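
To see what the stratified 80/20 split leaves for each class, the counts can be estimated by proportional allocation. This is only an approximation of what `train_test_split(..., stratify=y)` does (the library's exact rounding may differ), but it agrees with the support values in the classification report below.

```python
class_counts = {"Mixed": 10, "Qualitative": 49, "Quantitative": 46}
test_size = 0.2

# Proportional allocation, rounded to whole samples (an approximation;
# the library's exact tie-breaking may differ)
test_counts = {name: round(count * test_size) for name, count in class_counts.items()}
train_counts = {name: class_counts[name] - test_counts[name] for name in class_counts}

print("test :", test_counts)
print("train:", train_counts)
```

Under this estimate the Mixed class has 8 training samples and 2 test samples, so a single prediction changes its recall by 50 percentage points.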

## Train XGBoost Classifier

In [7]:
from xgboost import XGBClassifier

# Initialize and train XGBoost
clf = XGBClassifier(eval_metric='mlogloss', random_state=42)
clf.fit(X_train_scaled, y_train)

## Evaluate Model Performance

In [8]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on test set
y_pred = clf.predict(X_test_scaled)

# Evaluate
print("✅ Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\n📄 Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

✅ Accuracy: 0.6667

📄 Classification Report:
              precision    recall  f1-score   support

       Mixed       0.00      0.00      0.00         2
 Qualitative       0.70      0.70      0.70        10
Quantitative       0.64      0.78      0.70         9

    accuracy                           0.67        21
   macro avg       0.45      0.49      0.47        21
weighted avg       0.61      0.67      0.63        21
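
The two average rows in the report can be reproduced from the per-class rows, which shows why macro F1 is so much lower than weighted F1 here. The sketch uses the rounded values printed above; the dictionary layout is illustrative.

```python
# Per-class F1 and support copied from the report above
report = {
    "Mixed": {"f1": 0.00, "support": 2},
    "Qualitative": {"f1": 0.70, "support": 10},
    "Quantitative": {"f1": 0.70, "support": 9},
}

n_samples = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, so the missed Mixed class
# removes a full third of the achievable score
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: classes count in proportion to their support
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / n_samples

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```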





## Save model and other artefacts

In [14]:
joblib.dump(clf, "Artefacts/methodology_scibert_xgb_v2.2_model.pkl")
joblib.dump(scaler, "Artefacts/methodology_scibert_xgb_v2.2_scaler.pkl")
joblib.dump(label_encoder, "Artefacts/methodology_scibert_xgb_v2.2_label_encoder.pkl")

['Artefacts/methodology_scibert_xgb_v2.2_label_encoder.pkl']

## 🔍 Results Summary – v2.2 (SciBERT + XGBoost, 80/20 Split)

- **Accuracy**: 0.667
- **Macro F1**: 0.47
- **Best Classes**: Qualitative and Quantitative, tied (F1 = 0.70 each)
- **Mixed**: Not predicted at all (F1 = 0.00)

### 🔎 Observations:
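- Model performed reasonably well on the Qualitative and Quantitative classes (F1 = 0.70 each).
- Mixed was completely missed, likely because the class is rare: with 10 samples overall, the stratified split leaves 8 for training and only 2 for testing.
- SciBERT + XGBoost provided a semantic baseline at 0.667 accuracy, slightly below the TF-IDF + SVM setup (v2.0: 0.71).
- Scaling was applied before training, but its effect was not measured in this run; tree-based models such as XGBoost are generally insensitive to per-feature rescaling.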
- This run establishes a reproducible BERT baseline for future improvements via SMOTE, hyperparameter tuning, or cross-validation.

✅ Model, scaler, and label encoder saved as versioned artefacts in `Artefacts/`

## Apply SMOTE to Training Data and Retrain XGBoost on SMOTE Data

In [10]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE with k=1 (good for very small classes like Mixed)
smote = SMOTE(random_state=42, k_neighbors=1)

X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Check new class distribution
import collections
print("🔁 Resampled class distribution:", collections.Counter(y_train_resampled))

# Retrain model on resampled data
clf_smote = XGBClassifier(eval_metric='mlogloss', random_state=42)
clf_smote.fit(X_train_resampled, y_train_resampled)



🔁 Resampled class distribution: Counter({np.int64(1): 39, np.int64(2): 39, np.int64(0): 39})
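
SMOTE balances the classes by interpolating between a minority sample and one of its nearest minority-class neighbours; with `k_neighbors=1` every synthetic point lies on the segment to the single nearest neighbour. The toy sketch below illustrates that interpolation step only. It is not the `imblearn` implementation, and `smote_like_sample` and the 2-D vectors are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class vectors (2-D for readability)
mixed_a = [0.0, 1.0]
mixed_b = [2.0, 3.0]
synthetic = smote_like_sample(mixed_a, mixed_b, rng)

# Synthetic Mixed samples needed to reach the majority count of 39,
# assuming 8 real Mixed training samples (10 in total minus 2 in the test set)
n_synthetic_mixed = 39 - 8

print(synthetic, n_synthetic_mixed)
```

Under that assumption roughly four out of five Mixed training vectors are interpolations of the same 8 originals, so the resampled set adds balance but little new information.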


## Evaluate Model Performance and Classification Report

In [11]:
y_pred_smote = clf_smote.predict(X_test_scaled)

print("✅ Accuracy (after SMOTE):", round(accuracy_score(y_test, y_pred_smote), 4))
print("\n📄 Classification Report (after SMOTE):")
print(classification_report(y_test, y_pred_smote, target_names=label_encoder.classes_))

✅ Accuracy (after SMOTE): 0.7619

📄 Classification Report (after SMOTE):
              precision    recall  f1-score   support

       Mixed       0.00      0.00      0.00         2
 Qualitative       0.77      1.00      0.87        10
Quantitative       0.86      0.67      0.75         9

    accuracy                           0.76        21
   macro avg       0.54      0.56      0.54        21
weighted avg       0.73      0.76      0.74        21



## Save model

In [15]:
joblib.dump(clf_smote, "Artefacts/methodology_scibert_xgb_v2.2.1_smote_model.pkl")

['Artefacts/methodology_scibert_xgb_v2.2.1_smote_model.pkl']

## 🔁 v2.2.1 (SciBERT + XGBoost + SMOTE)

- Accuracy: 76.19%
- Macro F1: 0.54
- Weighted F1: 0.74
- Mixed: F1 still 0.00 (neither of the 2 Mixed test samples was predicted correctly; with so few samples this estimate is very noisy)
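
With only 21 test samples, each accuracy figure carries a wide margin of error. One way to gauge this is a Wilson score interval; the sketch below is an illustration rather than part of the original analysis, and the correct-prediction counts are inferred from the reported accuracies.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Correct counts implied by the reported accuracies on 21 test samples:
# 0.6667 * 21 = 14 (v2.2) and 0.7619 * 21 = 16 (v2.2.1)
baseline_low, baseline_high = wilson_interval(14, 21)
smote_low, smote_high = wilson_interval(16, 21)

print(f"v2.2   : {baseline_low:.2f} to {baseline_high:.2f}")
print(f"v2.2.1 : {smote_low:.2f} to {smote_high:.2f}")
```

The two intervals overlap substantially, so the gain from SMOTE on this split (two additional correct predictions) should be read as tentative.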

## SMOTE + XGBoost with Manual 5-Fold Cross-Validation

In [16]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE

# Prepare CV loop
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

accuracy_scores = []
macro_f1_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
    # Split this fold's train/validation data
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y[train_idx], y[val_idx]

    # Scale embeddings; fold-local names avoid overwriting the scaler and
    # classifier fitted on the 80/20 split above
    fold_scaler = StandardScaler()
    X_train_fold_scaled = fold_scaler.fit_transform(X_train_fold)
    X_val_fold_scaled = fold_scaler.transform(X_val_fold)

    # Apply SMOTE to the training fold only
    smote = SMOTE(random_state=42, k_neighbors=1)
    X_resampled, y_resampled = smote.fit_resample(X_train_fold_scaled, y_train_fold)

    # Train XGBoost
    fold_clf = XGBClassifier(eval_metric='mlogloss', random_state=42)
    fold_clf.fit(X_resampled, y_resampled)

    # Predict
    y_val_pred = fold_clf.predict(X_val_fold_scaled)

    # Score
    acc = accuracy_score(y_val_fold, y_val_pred)
    f1 = f1_score(y_val_fold, y_val_pred, average='macro')
    
    accuracy_scores.append(acc)
    macro_f1_scores.append(f1)
    
    print(f"✅ Fold {fold}: Accuracy = {round(acc, 4)}, Macro F1 = {round(f1, 4)}")

# Summary
print("\n📊 Final 5-Fold CV Results:")
print("Mean Accuracy:", round(np.mean(accuracy_scores), 4))
print("Std Accuracy:", round(np.std(accuracy_scores), 4))
print("Mean Macro F1:", round(np.mean(macro_f1_scores), 4))
print("Std Macro F1:", round(np.std(macro_f1_scores), 4))

✅ Fold 1: Accuracy = 0.7143, Macro F1 = 0.5
✅ Fold 2: Accuracy = 0.7619, Macro F1 = 0.7345
✅ Fold 3: Accuracy = 0.4762, Macro F1 = 0.4823
✅ Fold 4: Accuracy = 0.619, Macro F1 = 0.4402
✅ Fold 5: Accuracy = 0.7143, Macro F1 = 0.5296

📊 Final 5-Fold CV Results:
Mean Accuracy: 0.6571
Std Accuracy: 0.1017
Mean Macro F1: 0.5373
Std Macro F1: 0.1028
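
The summary statistics can be re-derived from the per-fold values printed above. Note that `np.std` uses the population formula (`ddof=0`) by default; with only 5 folds, the sample formula gives a somewhat larger spread. A small standard-library sketch:

```python
import statistics

# Fold accuracies copied from the output above
fold_accuracies = [0.7143, 0.7619, 0.4762, 0.619, 0.7143]

mean_acc = statistics.mean(fold_accuracies)
# Population standard deviation, the same formula np.std uses by default
population_std = statistics.pstdev(fold_accuracies)
# Sample standard deviation (ddof=1 in NumPy terms)
sample_std = statistics.stdev(fold_accuracies)

print(f"mean={mean_acc:.4f} pstdev={population_std:.4f} stdev={sample_std:.4f}")
```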


## 🔁 v2.2.1 Cross-Validation Results – SciBERT + XGBoost + SMOTE

- 5-fold Stratified CV performed on full dataset
- SMOTE applied within each fold to balance all 3 classes
- Classifier: XGBoost on scaled SciBERT embeddings (768-dim)

### 📊 Cross-Validation Summary:
- **Mean Accuracy**: 0.6571
- **Std Dev (Accuracy)**: 0.1017
- **Mean Macro F1**: 0.5373
- **Std Dev (Macro F1)**: 0.1028

### 🧠 Observations:
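- Performance varies noticeably across folds (accuracy 0.48 to 0.76, std ≈ 0.10), which is expected with about 21 validation samples per fold and very few Mixed examples
- Mean macro F1 (0.54) matches the single-split v2.2.1 result and is above the v2.2 baseline (0.47)
- Establishes a cross-validated semantic + balanced baseline before hyperparameter tuning or fine-tuning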