## 1. Load and Filter AI vs ML Samples

We load the same `CS_subfields.csv` dataset used for the main classifier.  
From this, we extract only rows labeled as `AI` or `ML` to train a binary disambiguator.

The goal is to reduce confusion between these two semantically overlapping subfields using a focused binary model.

In [2]:
import pandas as pd

# Load CS dataset
df = pd.read_csv("Data/CS_subfields.csv")

# Filter AI and ML only
df_filtered = df[df["Subfield"].isin(["AI", "ML"])].copy()

# Reset index
df_filtered.reset_index(drop=True, inplace=True)

# Create input text column (same as main pipeline)
df_filtered["input_text"] = df_filtered["Title"].astype(str).str.strip() + " " + df_filtered["Abstract"].astype(str).str.strip()

# Preview
df_filtered[["Title", "Subfield", "input_text"]].head()

| | Title | Subfield | input_text |
|---|---|---|---|
| 0 | Beyond Frameworks: Unpacking Collaboration Str... | AI | Beyond Frameworks: Unpacking Collaboration Str... |
| 1 | Any-to-Any Learning in Computational Pathology... | AI | Any-to-Any Learning in Computational Pathology... |
| 2 | AutoMat: Enabling Automated Crystal Structure ... | AI | AutoMat: Enabling Automated Crystal Structure ... |
| 3 | ACU: Analytic Continual Unlearning for Efficie... | AI | ACU: Analytic Continual Unlearning for Efficie... |
| 4 | Empowering Sustainable Finance with Artificial... | AI | Empowering Sustainable Finance with Artificial... |


## 2. Generate SPECTER Embeddings

We now generate dense 768-dimensional sentence embeddings for each entry in the filtered AI/ML dataset using the pretrained `allenai-specter` model from `sentence-transformers`.

These embeddings will serve as the input features for our binary classifier.

In [3]:
from sentence_transformers import SentenceTransformer

# Load SPECTER model
model = SentenceTransformer("allenai-specter")

# Create input list
texts = df_filtered["input_text"].tolist()

# Generate embeddings (this may take 1–2 minutes)
X_embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)


## 3. Encode Labels and Train Logistic Regression

We encode the subfield labels as binary values:
- **AI → 0**
- **ML → 1**

Then, we split the data and train a `LogisticRegression` model on the SPECTER embeddings to distinguish between AI and ML.
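The 0/1 assignment follows from how scikit-learn's `LabelEncoder` works: it sorts the distinct class names and numbers them in that order, so `AI` comes before `ML`. A minimal pure-Python sketch of that rule, using a made-up label list rather than the real dataset:

```python
# Toy labels standing in for df_filtered["Subfield"] (illustrative only)
labels = ["ML", "AI", "AI", "ML", "AI"]

# LabelEncoder keeps the sorted unique classes and maps each one to its position
classes = sorted(set(labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}

# Encode, then decode again to show the mapping is reversible
encoded = [label_to_id[name] for name in labels]
decoded = [classes[idx] for idx in encoded]

print(label_to_id)
print(encoded)
```

Because the order comes from sorting, the mapping can always be read back from the fitted encoder's `classes_` attribute rather than being hard-coded.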

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Encode labels: AI → 0, ML → 1
le = LabelEncoder()
y = le.fit_transform(df_filtered["Subfield"])

# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_embeddings, y, test_size=0.2, random_state=42, stratify=y
)

# Train Logistic Regression
logreg = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
logreg.fit(X_train, y_train)

# Evaluate
y_pred = logreg.predict(X_test)

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=le.classes_))

print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

          AI       0.68      0.65      0.67        60
          ML       0.67      0.70      0.68        60

    accuracy                           0.68       120
   macro avg       0.68      0.68      0.67       120
weighted avg       0.68      0.68      0.67       120

Confusion Matrix:

[[39 21]
 [18 42]]
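As a cross-check, the figures in the classification report can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's convention of rows as true classes and columns as predicted classes, in the order `[AI, ML]`:

```python
# Confusion matrix copied from the output above
# rows = true class, columns = predicted class, order = [AI, ML]
cm = [[39, 21],
      [18, 42]]
class_names = ["AI", "ML"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, name in enumerate(class_names):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this class
    actual = sum(cm[i])                    # row sum: everything that truly is this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = {"precision": precision, "recall": recall, "f1": f1}

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.3f}")
for name, m in metrics.items():
    print(f"{name}: precision={m['precision']:.3f} recall={m['recall']:.3f} f1={m['f1']:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

The diagonal holds 39 + 42 = 81 correct predictions out of 120, so the reported 0.68 accuracy is 0.675 before rounding.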


## 4. Result Interpretation and Analysis

The Logistic Regression classifier trained on SPECTER embeddings achieved the following on the 120-sample held-out test set:

- **Accuracy**: 68%
- **Macro F1-score**: 0.67
- **AI F1-score**: 0.67
- **ML F1-score**: 0.68

#### 🔍 Observations:
- The classifier performs **consistently across both classes**: precision and recall are within five points of each other for AI (0.68 / 0.65) and for ML (0.67 / 0.70).
- The test set is balanced (60 AI, 60 ML), so 68% accuracy is above the 50% chance level. This suggests that **Logistic Regression learns a usable, though imperfect, boundary** between AI and ML from SPECTER embeddings.
- 39 of the 120 test samples are still misclassified (21 AI predicted as ML, 18 ML predicted as AI), which is consistent with how much AI and ML overlap in vocabulary and abstract structure.
- This model can now act as a **second-stage disambiguator** whenever the main classifier predicts either AI or ML.

While not perfect, this is a reasonable baseline for binary disambiguation. Whether it improves subfield-level classification once integrated into the pipeline still has to be measured against the main classifier's own AI/ML predictions.
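One way the second stage might be wired in is sketched below, assuming the main classifier and the disambiguator are each exposed as a function from text to label. The names `two_stage_predict`, `main_predict`, `disambiguate`, the `"OTHER"` label, and both keyword-based stub classifiers are hypothetical placeholders, not part of this notebook.

```python
from typing import Callable

# Labels that get handed to the second-stage model
AMBIGUOUS = {"AI", "ML"}


def two_stage_predict(
    text: str,
    main_predict: Callable[[str], str],
    disambiguate: Callable[[str], str],
) -> str:
    """Return the main classifier's label, re-deciding only when it is AI or ML."""
    label = main_predict(text)
    if label in AMBIGUOUS:
        return disambiguate(text)
    return label


# Stub classifiers for illustration only (hypothetical keyword rules)
def stub_main(text: str) -> str:
    return "OTHER" if "image" in text.lower() else "AI"


def stub_disambiguator(text: str) -> str:
    return "ML" if "gradient" in text.lower() else "AI"


print(two_stage_predict("Image segmentation with transformers", stub_main, stub_disambiguator))
print(two_stage_predict("Gradient descent convergence bounds", stub_main, stub_disambiguator))
```

In the real pipeline, `disambiguate` would presumably embed the text with SPECTER, call `logreg.predict`, and map the integer back to `AI` or `ML` through the label encoder.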

## 5. Save Model Artefacts

We save the trained Logistic Regression model and the corresponding label encoder to the `Artefacts/` directory. These can be reused for integration into the main pipeline or for deployment in a two-stage classification system.

In [5]:
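import joblib
import os

# Make sure the output directory exists before writing to it
os.makedirs("Artefacts", exist_ok=True)

# Save the disambiguator and label encoder
joblib.dump(logreg, "Artefacts/ai_ml_disambiguator_logreg_v1.pkl")
joblib.dump(le, "Artefacts/ai_ml_label_encoder.pkl")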

print("✅ Saved disambiguator and label encoder.")

✅ Saved disambiguator and label encoder.
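For reuse, the saved files can be read back with `joblib.load`. The sketch below shows the round trip on a toy model so that it runs on its own: the 8-dimensional synthetic features, the temporary directory, and the file names `model.pkl` and `encoder.pkl` are assumptions for illustration, standing in for the 768-dimensional SPECTER embeddings and the files under `Artefacts/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the embeddings: two shifted Gaussian clusters
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
X[20:] += 1.0
labels = ["AI"] * 20 + ["ML"] * 20

le = LabelEncoder()
y = le.fit_transform(labels)
clf = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    encoder_path = os.path.join(tmp, "encoder.pkl")

    # Save both objects, then load them back as a later session would
    joblib.dump(clf, model_path)
    joblib.dump(le, encoder_path)
    clf_loaded = joblib.load(model_path)
    le_loaded = joblib.load(encoder_path)

# Predict integer ids, then map them back to the original string labels
pred_ids = clf_loaded.predict(X[:5])
pred_labels = le_loaded.inverse_transform(pred_ids)
print(list(pred_labels))
```

Loading the encoder together with the model keeps the integer-to-label mapping tied to the model that was trained with it.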
