# Part B: Uncertainty Sampling (Top-100 High-Risk)

Compute p_green on pool_unlabeled, derive the uncertainty score u from it, and export the 100 rows with the highest u to hitl_green_100.csv.

In [4]:
import sys, os
sys.path.append(os.path.abspath(".."))

import pandas as pd
from joblib import load

from src.config import CFG
from src.data_tools import load_parquet_or_dummy
from src.embeddings import load_encoder, encode
from src.uncertainty import uncertainty_score

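# Load the dataset and the saved baseline classifier.
df = load_parquet_or_dummy(CFG.parquet_path)
clf = load("../models/baseline_clf.joblib")

# Keep only the unlabeled pool; if that split is empty, fall back to a random sample of up to 200 rows.
pool_df = df[df[CFG.split_col] == CFG.pool_split].copy()
if len(pool_df) == 0:
    pool_df = df.sample(min(200, len(df)), random_state=CFG.seed).copy()

pool_df.shape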

(150, 4)

In [5]:
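# Embed the pool texts with the configured encoder.
encoder = load_encoder(CFG.encoder_name)
X_pool = encode(encoder, pool_df[CFG.text_col].astype(str).tolist(), batch_size=CFG.embed_batch)

# Take column 1 of predict_proba as the probability of the green class
# (assumes clf.classes_ lists the green class second).
p_green = clf.predict_proba(X_pool)[:, 1]
u = uncertainty_score(p_green)

pool_df["p_green"] = p_green
pool_df["u"] = u

pool_df[["p_green","u"]].describe()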

MPNetModel LOAD REPORT from: AI-Growth-Lab/PatentSBERTa
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.

|       | p_green  | u        |
|-------|----------|----------|
| count | 150.0    | 150.0    |
| mean  | 0.000104 | 0.000207 |
| std   | 5.1e-05  | 0.000102 |
| min   | 3.2e-05  | 6.4e-05  |
| 25%   | 6.7e-05  | 0.000134 |
| 50%   | 9.1e-05  | 0.000182 |
| 75%   | 0.000129 | 0.000258 |
| max   | 0.000318 | 0.000635 |


In [6]:
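# Rank the pool by uncertainty and keep the 100 most uncertain rows.
top100 = pool_df.sort_values("u", ascending=False).head(100).copy()

# Build the review sheet: model outputs plus empty columns for the LLM suggestion and the human label.
out = top100[[CFG.doc_id_col, CFG.text_col, "p_green", "u"]].rename(columns={CFG.text_col:"text"})
out["llm_green_suggested"] = ""
out["llm_confidence"] = ""
out["llm_rationale"] = ""
out["is_green_human"] = ""
out["notes"] = ""

# Write the sheet to ../data, creating the directory if needed.
os.makedirs("../data", exist_ok=True)
out_path = "../data/hitl_green_100.csv"
out.to_csv(out_path, index=False)
print("Saved:", out_path)
out.head(3)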

Saved: ../data/hitl_green_100.csv


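|     | doc_id   | text                                   | p_green  | u        | llm_green_suggested | llm_confidence | llm_rationale | is_green_human | notes |
|-----|----------|----------------------------------------|----------|----------|---------------------|----------------|---------------|----------------|-------|
| 340 | pool_340 | Claim about manufacturing process 340. | 0.000318 | 0.000635 |                     |                |               |                |       |
| 301 | pool_301 | Claim about manufacturing process 301. | 0.000277 | 0.000553 |                     |                |               |                |       |
| 279 | pool_279 | Claim about manufacturing process 279. | 0.000262 | 0.000524 |                     |                |               |                |       |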


In this part, I implemented uncertainty sampling to identify the pool samples the baseline model is least sure about.

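I loaded the trained baseline model and used it to predict p_green, the probability of the green class, for each sample in the pool. A probability close to 0.5 means the model is uncertain about that prediction, while a probability close to 0 or 1 means it is confident. The score u turns this into a ranking: the closer p_green is to 0.5, the higher u.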
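The mapping from probability to uncertainty can be sketched as follows. The actual src.uncertainty.uncertainty_score is not shown in this notebook, so the formula used here, u = 1 - |2p - 1|, is an assumption; it is at least consistent with the describe() output above, where u is roughly twice p_green for very small probabilities. The name uncertainty_sketch is hypothetical.

```python
import numpy as np

def uncertainty_sketch(p):
    """Hypothetical stand-in for src.uncertainty.uncertainty_score.

    Assumed formula: u = 1 - |2p - 1|, which is 1.0 at p = 0.5
    and falls linearly to 0.0 at p = 0 or p = 1.
    """
    p = np.asarray(p, dtype=float)
    return 1.0 - np.abs(2.0 * p - 1.0)

# A maximally uncertain prediction, two confident ones, and a value
# on the scale of the p_green values in the output above.
probs = np.array([0.5, 0.9, 0.1, 0.000318])
scores = uncertainty_sketch(probs)
print(scores)
```

Under this assumed formula, probabilities of 0.9 and 0.1 get the same score, since both are equally far from 0.5.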

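I selected the 100 samples with the highest u because they are the most useful for improving the model. Labeling difficult or ambiguous examples helps the model refine its decision boundary, whereas labeling more easy examples adds little new information. Note that in the run above every p_green is below 0.001, so none of the selected samples is actually close to 0.5; they are only the least confident predictions in a pool the model scores as clearly not green.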
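The selection step can be illustrated on a toy frame. This is a minimal sketch: the rows are invented for illustration, select_most_uncertain is a hypothetical helper, and the u formula is the same assumed 1 - |2p - 1| rather than the project's confirmed implementation.

```python
import pandas as pd

def select_most_uncertain(df, k, score_col="u"):
    """Return the k rows with the highest uncertainty score, highest first."""
    return df.sort_values(score_col, ascending=False).head(k).copy()

# Toy pool (invented values, not taken from the real data).
toy = pd.DataFrame({
    "doc_id": ["a", "b", "c", "d"],
    "p_green": [0.02, 0.48, 0.95, 0.60],
})

# Assumed uncertainty formula: highest when p_green is near 0.5.
toy["u"] = 1.0 - (2.0 * toy["p_green"] - 1.0).abs()

picked = select_most_uncertain(toy, k=2)
print(picked["doc_id"].tolist())  # ['b', 'd']
```

Rows "b" and "d" are chosen because their probabilities lie closest to 0.5, while the confidently scored rows "a" and "c" are left out.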

This step simulates active learning: the model selects the most informative samples, and these are exported for further review.
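Once the exported sheet has been reviewed, the human labels could be read back roughly as below. This is only a sketch under assumptions: the in-memory CSV is an invented stand-in for a reviewed hitl_green_100.csv, and it assumes reviewers enter 1 or 0 in is_green_human and leave unreviewed rows blank, which the notebook does not specify.

```python
import io
import pandas as pd

# Invented stand-in for a reviewed sheet (a subset of the exported columns).
reviewed_csv = io.StringIO(
    "doc_id,text,p_green,u,is_green_human,notes\n"
    "pool_1,Example claim one.,0.40,0.80,1,\n"
    "pool_2,Example claim two.,0.55,0.90,0,not green\n"
    "pool_3,Example claim three.,0.45,0.90,,\n"
)
reviewed = pd.read_csv(reviewed_csv)

# Keep only rows a reviewer actually labeled, then cast the label to int
# (assumes labels were entered as 1 or 0).
labeled = reviewed.dropna(subset=["is_green_human"]).copy()
labeled["is_green_human"] = labeled["is_green_human"].astype(int)

print(labeled[["doc_id", "is_green_human"]])
```

Dropping blank labels first keeps partially reviewed sheets usable, since unreviewed rows are simply left out of the labeled set.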