# Zero-Shot Classification with an LLM


Here we use an LLM to classify the patent abstracts. We use Mistral's `mistral-small-latest` model, which is available through the Mistral API as part of the free tier.

The prompts passed to the LLM look roughly like this (the code further down additionally asks for the answer as JSON):

system message:

You are an expert patent-classification assistant.
Choose exactly one of the following IPC section letters for every abstract you receive
and return only that single letter—nothing else.

A: Human necessities
B: Performing operations; transporting
C: Chemistry; metallurgy
D: Textiles; paper
E: Fixed constructions
F: Mechanical engineering; lighting; heating; weapons; blasting
G: Physics
H: Electricity
Y: Emerging cross-sectional technologies

user message: Patent abstract: A portable device includes a rechargeable electro-chemical cell coupled to a power-management circuit that wirelessly transmits energy to external loads via an inductive coil. The circuit actively regulates output current to optimize efficiency while protecting the cell from over-discharge.
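
As a minimal sketch of how such a prompt can be assembled into a chat payload: `build_messages` and `SECTIONS` are hypothetical names used only for illustration, and the wording follows the example prompt above rather than the JSON-mode prompt used later in the notebook.

```python
# Hypothetical helper: builds the system/user message pair shown above.
SECTIONS = {
    "A": "Human necessities",
    "B": "Performing operations; transporting",
    "C": "Chemistry; metallurgy",
    "D": "Textiles; paper",
    "E": "Fixed constructions",
    "F": "Mechanical engineering; lighting; heating; weapons; blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging cross-sectional technologies",
}


def build_messages(abstract: str) -> list[dict]:
    """Return the chat messages for one patent abstract."""
    system = (
        "You are an expert patent-classification assistant.\n"
        "Choose exactly one of the following IPC section letters for every "
        "abstract you receive\n"
        "and return only that single letter, nothing else.\n\n"
        + "\n".join(f"{letter}: {desc}" for letter, desc in SECTIONS.items())
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Patent abstract: {abstract}"},
    ]


messages = build_messages(
    "A portable device includes a rechargeable electro-chemical cell "
    "coupled to a power-management circuit."
)
```

The system message stays identical for every abstract, so only the user message changes between requests.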

## Data Loading

In [None]:
import pandas as pd

df = pd.read_csv("/Users/hannes/Documents/NLP_final/patent_corpus.csv")

In [2]:
# subset df for only text and label
df = df[['abstract', 'label']]

In [3]:
import numpy as np
SEED = 42

# Always add a stable row_id column *once*
df["row_id"] = np.arange(len(df))

# ------------------------------------------------------------
# 1)  Build the balanced 32-shot TRAIN set
# ------------------------------------------------------------
labels        = sorted(df["label"].unique())
n_labels      = len(labels)          # 9
base          = 32 // n_labels       # → 3
remainder     = 32 - base * n_labels # → 5 extra shots

np.random.seed(SEED)
extra_labels  = np.random.choice(labels, size=remainder, replace=False)

def sample_k(grp):
    k = base + (1 if grp.name in extra_labels else 0)
    return grp.sample(k, random_state=SEED)

train_df = (
    df.groupby("label", group_keys=False)
      .apply(sample_k)
      .reset_index(drop=True)
)

print("TRAIN label counts\n", train_df["label"].value_counts())

TRAIN label counts
 label
a    4
b    4
f    4
h    4
y    4
c    3
d    3
e    3
g    3
Name: count, dtype: int64
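
The 4/3 split in the counts above follows from dividing 32 shots over 9 labels: every label gets 3, and 5 randomly chosen labels get one more. The arithmetic can be sketched on its own as follows. `shot_quota` is a hypothetical helper, and it uses Python's `random` module instead of NumPy, so the labels that receive the extra shot will generally differ from the ones in the notebook.

```python
import random


def shot_quota(labels, total_shots, seed=42):
    """Distribute total_shots over labels as evenly as possible."""
    ordered = sorted(labels)
    base, remainder = divmod(total_shots, len(ordered))
    rng = random.Random(seed)
    # Labels that receive one shot more than the base amount
    extra = set(rng.sample(ordered, remainder))
    return {label: base + (1 if label in extra else 0) for label in ordered}


quota = shot_quota(list("abcdefghy"), 32)
print(quota)
```

Whatever the seed, the quotas sum to 32, with five labels at 4 shots and four labels at 3.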


  .apply(sample_k)


In [4]:
from sklearn.model_selection import train_test_split
from datasets import Dataset, Features, Value, ClassLabel

# ------------------------------------------------------------
# 2)  Everything else → TEMP pool
# ------------------------------------------------------------
temp_df = df.loc[~df["row_id"].isin(train_df["row_id"])].reset_index(drop=True)

# 3)  Stratified split of the TEMP pool: 10% eval, 90% held-out test
eval_df, test_holdout_df = train_test_split(
    temp_df,
    test_size=0.90,
    stratify=temp_df["label"],
    random_state=SEED
)
eval_df         = eval_df.reset_index(drop=True)
test_holdout_df = test_holdout_df.reset_index(drop=True)


# ------------------------------------------------------------
# 4)  Build the unlabeled pseudo-label pool
# ------------------------------------------------------------
# treat test_holdout_df as “unlabeled” by removing the ground-truth labels
pseudo_pool_df = (
    test_holdout_df
      .drop(columns=["label"])     # <-- removes the gold labels
      .reset_index(drop=True)      # keep a clean index
)



In [5]:
# ------------------------------------------------------------
# 5)  Convert each to HF Dataset  (future-proof & RAM-safe)
# ------------------------------------------------------------
feature_spec = Features({
    "text" : Value("string"),
    "label": ClassLabel(names=labels),
    "row_id": Value("int64")
})

# Create a feature spec for the pseudo_pool_ds which does not have a 'label' column
pseudo_pool_feature_spec = Features({
    "text" : Value("string"),
    "row_id": Value("int64")
})

train_ds        = Dataset.from_pandas(train_df.rename(columns={"abstract":"text"}),        features=feature_spec)
eval_ds         = Dataset.from_pandas(eval_df.rename(columns={"abstract":"text"}),         features=feature_spec)
test_holdout_ds = Dataset.from_pandas(test_holdout_df.rename(columns={"abstract":"text"}), features=feature_spec)
# Use the pseudo_pool_feature_spec for the pseudo_pool_df
pseudo_pool_ds  = Dataset.from_pandas(pseudo_pool_df.rename(columns={"abstract":"text"}),  features=pseudo_pool_feature_spec)

print("HF splits ready:",
      len(train_ds), len(eval_ds), len(test_holdout_ds), len(pseudo_pool_ds))

HF splits ready: 32 6703 60333 60333
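
The three labeled splits should partition the corpus (32 + 6703 + 60333 = 67068 rows), while `pseudo_pool_ds` is the hold-out set again without labels. A small sanity check along these lines can catch leakage between splits. `check_partition` is a hypothetical helper, shown here on toy row ids rather than the real `row_id` columns.

```python
def check_partition(total_rows, **splits):
    """Check that the row_id collections are disjoint and cover range(total_rows)."""
    seen = set()
    for name, ids in splits.items():
        ids = set(ids)
        overlap = seen & ids
        if overlap:
            raise ValueError(
                f"{name} shares {len(overlap)} row_ids with an earlier split"
            )
        seen |= ids
    if seen != set(range(total_rows)):
        raise ValueError("splits do not cover every row exactly once")
    return {name: len(set(ids)) for name, ids in splits.items()}


# Toy stand-ins for train_df["row_id"], eval_df["row_id"], test_holdout_df["row_id"]
sizes = check_partition(10, train=[0, 1], eval=[2, 3, 4], test=[5, 6, 7, 8, 9])
print(sizes)
```

With the real data, the call would take `len(df)` and the three `row_id` columns. The pseudo pool is left out because it intentionally repeats the hold-out rows.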


## Zero-Shot Classification

In [6]:
#!pip install -qU litellm tqdm
import os

# ── 1.  API key & model -------------------------------------------------
os.environ["MISTRAL_API_KEY"] = "<YOUR_MISTRAL_API_KEY>"   #  ← paste your key here (never commit a real key)
MODEL = "mistral/mistral-small-latest"
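
A key pasted into a notebook cell is easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the key is exported in the shell before Jupyter starts, is to read it from the environment and fail early if it is missing. `require_env` is a hypothetical helper, not part of LiteLLM.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this cell"
        )
    return value


# Usage in the notebook (assumes the variable was exported beforehand):
# api_key = require_env("MISTRAL_API_KEY")
```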

In [7]:
# ================================================================
# Zero-shot IPC classification on eval_ds with Mistral (JSON mode)
# ================================================================

import os, json, time
from tqdm.auto import tqdm
import numpy as np
from sklearn.metrics import classification_report, f1_score
from litellm import completion   # MODEL carries the provider prefix ("mistral/...")

# ── 1. label mapping ----------------------------------------------------
label_names = train_ds.features["label"].names   # ["a","b",...,"y"]
name2id     = {n.upper(): i for i, n in enumerate(label_names)}

IPC_DESCR = {
    "A": "Human necessities",
    "B": "Operations / transporting",
    "C": "Chemistry / metallurgy",
    "D": "Textiles / paper",
    "E": "Fixed constructions",
    "F": "Mechanical engineering; lighting; heating; weapons; blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging cross-section technologies"
}

SYSTEM_MSG = (
    "You are an expert patent classifier.\n"
    "For each user message return **JSON** *exactly* of the form:\n"
    '{"section": "<LETTER>"}\n'
    "where <LETTER> is one of A-H or Y.\n"
    "Descriptions:\n" +
    "\n".join(f"{k}: {v}" for k, v in IPC_DESCR.items())
)

# ── 2. helper -----------------------------------------------------------
def predict_label(abstract, temp=0.3):
    prompt = f"Patent abstract:\n{abstract[:800]}\n\nReturn JSON now."
    resp = completion(
        model           = MODEL,
        api_key         = os.getenv("MISTRAL_API_KEY"),
        messages        = [
            {"role": "system", "content": SYSTEM_MSG},
            {"role": "user",   "content": prompt}
        ],
        response_format = {"type": "json_object"},   # ← JSON mode
        temperature     = temp,
        max_tokens      = 20,
    )
    try:
        section = json.loads(resp.choices[0].message.content)["section"].upper()
        return section if section in IPC_DESCR else "UNK"
    except Exception:
        return "UNK"

# ── 3. classify ---------------------------------------------------------
texts = eval_ds["text"]
true  = np.array(eval_ds["label"])
pred  = []

for t in tqdm(texts, desc="LLM zero-shot"):
    section = predict_label(t)
    pred.append(name2id.get(section, -1))
    time.sleep(1.0)          # stay under ~60 req/min on free tier

pred = np.array(pred)
valid = pred != -1

unk_rate = 1 - valid.mean()
print(f"UNK rate: {unk_rate:.2%}")

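# Pass the full label list so the scores stay well-defined even if a class
# never appears among the valid predictions.
all_ids  = list(range(len(label_names)))
macro_f1 = f1_score(true[valid], pred[valid], labels=all_ids, average="macro", zero_division=0) if valid.any() else 0.0
print("Macro-F1:", macro_f1)
print(classification_report(true[valid], pred[valid], labels=all_ids, target_names=label_names, zero_division=0))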
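
The loop above makes a single attempt per abstract, so one failed request (for example a rate-limit or network error raised by the client) could abort a run that takes over two hours. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` and its parameters are hypothetical, and which exceptions actually occur depends on the setup.

```python
import time


def with_retries(call, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Run call() and retry on any exception, doubling the wait each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, 4 s, 8 s, ...


# Usage inside the classification loop (predict_label is the notebook's helper):
# section = with_retries(lambda: predict_label(t))
```

Catching every exception is deliberately broad for a sketch. In practice it may be better to retry only on the client's rate-limit and timeout errors.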

LLM zero-shot: 100%|██████████| 6703/6703 [2:24:04<00:00,  1.29s/it]  

UNK rate: 0.00%
Macro-F1: 0.42926781437717637
              precision    recall  f1-score   support

           a       0.71      0.68      0.69       967
           b       0.64      0.11      0.19       897
           c       0.60      0.71      0.65       561
           d       0.27      0.55      0.37        56
           e       0.44      0.42      0.43       191
           f       0.25      0.80      0.38       475
           g       0.58      0.41      0.48      1438
           h       0.55      0.82      0.66      1427
           y       0.12      0.00      0.00       691

    accuracy                           0.51      6703
   macro avg       0.46      0.50      0.43      6703
weighted avg       0.53      0.51      0.47      6703
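
To make the two summary rows concrete: the macro average is the unweighted mean of the per-class F1 scores, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded values in the table, so it matches the report only up to rounding.

```python
# (f1, support) per class, copied from the report above
report = {
    "a": (0.69, 967),
    "b": (0.19, 897),
    "c": (0.65, 561),
    "d": (0.37, 56),
    "e": (0.43, 191),
    "f": (0.38, 475),
    "g": (0.48, 1438),
    "h": (0.66, 1427),
    "y": (0.00, 691),
}

n_total = sum(support for _, support in report.values())
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * support for f1, support in report.values()) / n_total

print(n_total, round(macro_f1, 2), round(weighted_f1, 2))
```

Because every class counts equally in the macro average, the 0.00 for class y and the 0.19 for class b pull it well below the scores of the larger classes.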






Overall, zero-shot classification with the LLM works somewhat better than our best model trained on the 32 labeled examples plus the 20 LLM-generated abstracts per label (F1-score of 0.43 vs. 0.36). The per-class pattern also resembles that model's. The LLM is essentially unable to detect abstracts of label y (recall 0.00 on 691 abstracts). Category y is called "General tagging of new or cross-sectional technology"; it is an inherently ambiguous category that will be hard to detect in any case. As with our best SetFit model, performance is comparatively good for classes a, c, g and h (F1 between 0.48 and 0.69). Class b stands out with reasonable precision (0.64) but very low recall (0.11).