# Fidelity Check



In this notebook we compare the 3 trained student models (mdeberta-v3-base, indolem-indobert-base-uncased, and indobert-base-p2) with their teacher, Gemini 2.5 Flash.

We are not evaluating model performance against a real-world ground truth. We are evaluating model agreement with Gemini, because Gemini is the teacher of these 3 models. This is also called teacher-student alignment or distillation fidelity. We want to measure how closely each model mimics the behavior of Gemini on new, unseen economic text.


In the notebooks for the 3 models, we already evaluated model performance with metrics such as macro F1 and accuracy. But that only shows performance on the golden dataset, whose labels were produced by Gemini 2.5 Flash. It doesn't tell us how similar the models' behaviour is to Gemini's on new text.

We have already created a new dataset that Gemini had never labeled, in EDA_n_Preprocessing_links.ipynb and EDA_n_Preprocessing_Scraping_Result.ipynb, with the help of the scraping script scraping.py.

Then we let Gemini label the new texts, and now we run all 3 models to produce predictions on the same data.

We compare each student model to Gemini using:
- accuracy
- macro F1
- Cohen's kappa
- KL divergence of probability vectors 
- Jensen-Shannon distance
- Correlation across class logits

Then we analyze where the students disagree with Gemini.
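
The evaluation code further down reports the Jensen-Shannon measure but not the KL divergence from the list above. Here is a minimal sketch of how KL could be computed, assuming Gemini supplies only hard labels (the same one-hot assumption the evaluation code makes). In that case KL(teacher || student) reduces to the negative log of the probability the student gives to Gemini's label. The helper name `kl_from_hard_labels`, the `eps` clipping value, and the toy arrays are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def kl_from_hard_labels(student_probs, teacher_labels, eps=1e-12):
    """Per-example KL(one_hot(teacher) || student), in nats.

    With a one-hot teacher distribution the sum collapses to
    -log(student probability of the teacher's class).
    """
    student_probs = np.asarray(student_probs, dtype=float)
    teacher_labels = np.asarray(teacher_labels, dtype=int)
    # Pick out, for every row, the probability assigned to the teacher's label
    p_teacher_class = student_probs[np.arange(len(teacher_labels)), teacher_labels]
    # Clip to avoid log(0) when the student gives the class zero probability
    return -np.log(np.clip(p_teacher_class, eps, 1.0))

# Toy data (hypothetical): 3 texts, 3 classes (0=Neutral, 1=Inflation, 2=Deflation)
toy_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.50, 0.25, 0.25],
])
toy_labels = np.array([0, 1, 2])

per_example_kl = kl_from_hard_labels(toy_probs, toy_labels)
mean_kl = per_example_kl.mean()
print(per_example_kl, mean_kl)
```

Lower values mean the student puts more probability on Gemini's label. Unlike the Jensen-Shannon measure, this quantity is unbounded, so a few confidently wrong predictions can dominate the mean.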


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.spatial.distance import jensenshannon

In [None]:
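# Unseen texts for the student models; drop Gemini's columns so only the text is used for inference
df = pd.read_csv('df_all_labeled.csv')
df = df.drop(columns=['label','label_reason'])
texts = df["clean_text"].astype(str).tolist()

# Gemini's labels for the same rows, encoded as integer class ids
# (assumes this mapping matches the label ids the students were trained with)
df_gemini = pd.read_csv('df_all_labeled.csv')
label_map = {"Neutral":0, "Inflation":1, "Deflation":2}
y_true = df_gemini["label"].map(label_map).astype(int).values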

In [3]:
# load model+tokenizer
def load_model(path):
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSequenceClassification.from_pretrained(path)
    model.eval()
    return tok, model

# predict in batches with hard max_length=512
def predict_batch(texts, tok, model, batch=16):
    out_labels = []
    out_probs = []
    for i in range(0, len(texts), batch):
        chunk = texts[i:i+batch]
        enc = tok(
            chunk,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**enc).logits
        probs = torch.softmax(logits, dim=1).numpy()
        preds = probs.argmax(axis=1)
        out_labels.extend(preds)
        out_probs.extend(probs)
    return np.array(out_labels), np.array(out_probs)

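# fidelity evaluation
# Note: scipy's jensenshannon returns the Jensen-Shannon distance,
# i.e. the square root of the JS divergence (natural log by default).
def js_divergence(p, q):
    return jensenshannon(p, q)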

def evaluate(name, preds, probs, y_true):
    acc = accuracy_score(y_true, preds)
    f1 = f1_score(y_true, preds, average="macro")
    kappa = cohen_kappa_score(y_true, preds)
    cm = confusion_matrix(y_true, preds)

    # Gemini provides hard labels only, so represent each one as a one-hot distribution
    oh = np.eye(3)[y_true]
    js = np.mean([js_divergence(probs[i], oh[i]) for i in range(len(y_true))])
    return acc, f1, kappa, cm, js
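
To help read the JS values reported below, here is a small NumPy-only sketch of the Jensen-Shannon distance against a one-hot teacher label. It is meant to mirror what `scipy.spatial.distance.jensenshannon` computes with its default natural-log base; the function name `js_distance` and the example probability vectors are assumptions made for illustration.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding

teacher = [1.0, 0.0, 0.0]            # Gemini says class 0 (one-hot)
confident_right = [0.98, 0.01, 0.01]
unsure = [1 / 3, 1 / 3, 1 / 3]
confident_wrong = [0.0, 1.0, 0.0]

d_right = js_distance(teacher, confident_right)
d_unsure = js_distance(teacher, unsure)
d_wrong = js_distance(teacher, confident_wrong)
upper_bound = np.sqrt(np.log(2))     # maximum possible distance with natural log
print(d_right, d_unsure, d_wrong, upper_bound)
```

The distance is 0 when the student matches the one-hot label exactly and reaches its maximum, sqrt(ln 2), when the student puts all of its probability on a different class, so the per-model means printed below sit on a scale bounded by those two values.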


In [4]:
# run all 3 models
preds = {}
probs = {}
results = {}

paths = {
    "indobert": "model/indobert-base-p2/model",
    "indolem": "model/indolem-indobert-base-uncased/model",
    "mdeberta": "model/mdeberta-v3-base/model"
}

for name, path in paths.items():
    tok, model = load_model(path)
    # Predictions from all 3 models on the full set of 1,908 unseen texts
    p, pr = predict_batch(texts, tok, model)
    preds[name] = p
    probs[name] = pr

    # compare against Gemini's labels with the fidelity measures: accuracy, macro F1, kappa, JS distance (stored as js_divergence), confusion matrix
    acc, f1, kappa, cm, js = evaluate(name, p, pr, y_true)
    results[name] = {
        "accuracy": acc, # higher is better
        "macro_f1": f1, # higher is better
        "kappa": kappa, # higher is better
        "js_divergence": js, # lower is better
        "confusion_matrix": cm # rows = Gemini, columns = student; fewer off-diagonal counts is better
    }

In [8]:
# Insert predictions & probabilities into the main dataframe
for name in preds:
    df[f"label_{name}"] = preds[name]
    df[f"prob_{name}_0"] = probs[name][:,0]
    df[f"prob_{name}_1"] = probs[name][:,1]
    df[f"prob_{name}_2"] = probs[name][:,2]
df.to_csv("student_predictions.csv", index=False)

In [9]:
# print results
for name, r in results.items():
    print("\nModel:", name)
    print("Accuracy:", r["accuracy"])
    print("Macro F1:", r["macro_f1"])
    print("Kappa:", r["kappa"])
    print("JS Divergence:", r["js_divergence"])
    print("Confusion Matrix:\n", r["confusion_matrix"])


Model: indobert
Accuracy: 0.7987421383647799
Macro F1: 0.6227909667332411
Kappa: 0.465271224478625
JS Divergence: 0.34039023744273833
Confusion Matrix:
 [[1313   17   58]
 [ 142  122   23]
 [ 118   26   89]]

Model: indolem
Accuracy: 0.8060796645702306
Macro F1: 0.6467614187551173
Kappa: 0.4872990385222917
JS Divergence: 0.33426381282769807
Confusion Matrix:
 [[1311   31   46]
 [ 141  123   23]
 [ 115   14  104]]

Model: mdeberta
Accuracy: 0.8139412997903563
Macro F1: 0.6553380096449603
Kappa: 0.512031709942208
JS Divergence: 0.33380855455203406
Confusion Matrix:
 [[1317   32   39]
 [ 123  142   22]
 [ 116   23   94]]


In [None]:
# rank models by macro F1, then kappa
summary = pd.DataFrame([
    {
        "model": name,
        "accuracy": r["accuracy"],
        "macro_f1": r["macro_f1"],
        "kappa": r["kappa"],
        "js_divergence": r["js_divergence"]
    }
    for name, r in results.items()
])

summary.sort_values(by=["macro_f1", "kappa"], ascending=False)

      model  accuracy  macro_f1     kappa  js_divergence
2  mdeberta  0.813941  0.655338  0.512032       0.333809
1   indolem  0.806080  0.646761  0.487299       0.334264
0  indobert  0.798742  0.622791  0.465271       0.340390


In [11]:
df_gemini['label'].value_counts()

label
Neutral      1388
Inflation     287
Deflation     233
Name: count, dtype: int64

# Insights

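All 3 models have one thing in common: **consistency**. They converge on the same pattern: strong agreement on Neutral (each model recovers about 94 to 95% of Gemini’s Neutral labels), much weaker agreement on Inflation (roughly 43 to 49%), and similarly fragile Deflation (38 to 45%). In every confusion matrix, most of the misses on the two minority classes are predicted as Neutral rather than as the opposite direction. This fits how economic news behaves in the wild: most news is neutral, inflation is more frequent, and deflation is sparse and ambiguous. The models are not confused in a chaotic way; they lean toward the majority class, which is a plausible failure mode for human annotators and LLMs as well. That’s a reassuring baseline.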
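
The per-class pattern can be read directly off the confusion matrices printed above. The sketch below copies those matrices and computes, for each Gemini label, the share of texts on which the student agrees (per-class recall). It assumes the scikit-learn layout of rows = Gemini labels and columns = student predictions, in the order Neutral, Inflation, Deflation from `label_map`; the helper name `per_class_recall` is illustrative.

```python
import numpy as np

# Confusion matrices copied from the printed results
# (rows = Gemini label, columns = student prediction)
confusion = {
    "indobert": np.array([[1313, 17, 58], [142, 122, 23], [118, 26, 89]]),
    "indolem":  np.array([[1311, 31, 46], [141, 123, 23], [115, 14, 104]]),
    "mdeberta": np.array([[1317, 32, 39], [123, 142, 22], [116, 23, 94]]),
}
classes = ["Neutral", "Inflation", "Deflation"]

def per_class_recall(cm):
    # Diagonal = agreements; row sum = number of texts Gemini gave that label
    return np.diag(cm) / cm.sum(axis=1)

recalls = {name: per_class_recall(cm) for name, cm in confusion.items()}
for name, rec in recalls.items():
    print(name, {c: round(float(r), 3) for c, r in zip(classes, rec)})
```

Reading across a row also shows where the disagreements go: in the Inflation and Deflation rows, the Neutral column holds most of the off-diagonal counts.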


Next, **mDeBERTa being best is not surprising**. The architecture is newer, uses disentangled attention, and was pretrained with a stronger objective (DeBERTa-v3 uses ELECTRA-style replaced-token detection). mDeBERTa leads on every fidelity metric, but the margin is small: about 1.5 points of accuracy and 3 points of macro F1 over the weakest model. This is good news because it signals two things at once. The stronger model generalizes better and copies Gemini’s behaviour more closely. The weaker models are still usable and not collapsing, which suggests that the training pipeline and the dataset are sound.


The class distribution in the new Gemini-labeled dataset shows a heavy skew toward Neutral (1,388 of 1,908 texts, about 73%). **This is normal for economic text classification**. Most news describes conditions rather than strong directional signals. That skew also explains why accuracy stays high for all models. Macro F1 tells the real story, and the models sit in the low-to-mid 0.6 range. This is a reasonable level when the teacher model (Gemini) is itself imperfect and the minority classes are subtle even for humans.
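
To put the accuracy and macro F1 figures in context, the sketch below computes what a trivial classifier that always predicts the majority class would score on this label distribution. The counts come from the `value_counts()` output above; the convention that a class which is never predicted gets an F1 of 0 is an assumption, chosen to match scikit-learn's default behaviour.

```python
# Label counts copied from df_gemini['label'].value_counts()
counts = {"Neutral": 1388, "Inflation": 287, "Deflation": 233}
total = sum(counts.values())
majority = max(counts, key=counts.get)

# Always predicting the majority class is right exactly on the majority rows
baseline_accuracy = counts[majority] / total

def f1_always_majority(cls):
    """F1 for one class when every prediction is the majority class."""
    if cls != majority:
        return 0.0  # never predicted, so F1 is taken as 0
    precision = counts[cls] / total  # every prediction is this class
    recall = 1.0                     # every true instance is caught
    return 2 * precision * recall / (precision + recall)

baseline_macro_f1 = sum(f1_always_majority(c) for c in counts) / len(counts)
print(majority, round(baseline_accuracy, 3), round(baseline_macro_f1, 3))
```

Under these assumptions the baseline reaches about 0.73 accuracy but only about 0.28 macro F1, so the students' accuracy of roughly 0.80 to 0.81 is a modest gain while their macro F1 of 0.62 to 0.66 is a much clearer one.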


I would say the student models capture the teacher’s logic at a **moderate level**. The agreement is stable and consistent across metrics, so there are no red flags. mDeBERTa is clearly the closest reproduction. Its kappa of about 0.51 tells us that the student is meaningfully aligned with Gemini beyond chance, but is not a clone. **That is expected because a much smaller model cannot fully match a large proprietary LLM’s contextual understanding.**
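
For readers who want to see what "beyond chance" means here, the following sketch recomputes Cohen's kappa for mDeBERTa by hand from its confusion matrix printed above. It uses the standard unweighted definition (observed agreement corrected by the agreement expected from the two label marginals); the variable names are illustrative.

```python
import numpy as np

# mDeBERTa confusion matrix copied from the printed results
# (rows = Gemini label, columns = student prediction)
cm = np.array([[1317, 32, 39],
               [123, 142, 22],
               [116, 23, 94]])

n = cm.sum()
observed = np.trace(cm) / n  # raw agreement, identical to accuracy

# Agreement expected if Gemini and the student labeled independently,
# each keeping its own class frequencies
gemini_marginals = cm.sum(axis=1) / n
student_marginals = cm.sum(axis=0) / n
expected = float(np.sum(gemini_marginals * student_marginals))

kappa = (observed - expected) / (1 - expected)
print(round(float(observed), 4), round(expected, 4), round(float(kappa), 4))
```

Because both Gemini and the student label most texts as Neutral, a large share of the raw agreement would be expected by chance alone, which is why a kappa near 0.51 sits well below the accuracy of about 0.81.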

# Improvement?

The current models reproduce Gemini with moderate strength given our limited time and resources, but there is still room to tighten the alignment on the minority classes. The Neutral class dominates the dataset, so even small improvements in recognizing Inflation and Deflation would make the index more sensitive.


On the data side, our models learned from a single dataset covering a specific slice of the Indonesian economic news space. If it were expanded with more diverse sources or longer historical periods, the student models would gain a richer sense of context. This should improve performance on the lower-frequency classes such as Deflation.


With more resources, longer-context or instruction-tuned models could understand the causal chains inside economic narratives better than plain BERT derivatives.