Colab notebook written by Emma Bonutti D'Agostini, Emilien Schultz, and Julien Boelaert, July 2025.

# Initial setup

In [22]:
# Install modules
!pip install -q pandas scikit-learn openai

In [23]:
# Load modules
import pandas as pd
from openai import OpenAI
from sklearn.metrics import classification_report, f1_score
import re

In [None]:
# API, Key, and Model selection
api_url = "https://openrouter.ai/api/v1"
api_key = "YOUR KEY" # INSERT YOUR KEY HERE

client = OpenAI(
  base_url=api_url,
  api_key=api_key,
)

In [25]:
# Load data
data_path = "https://raw.githubusercontent.com/css-polytechnique/css-ipp-materials/refs/heads/main/Python-tutorials/SICSS-2025/text-classification/headlines.csv"
headlines = pd.read_csv(data_path)
headlines.head()

Unnamed: 0,headline,gold_standard
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION
2,Why Air Travel Still Sucks,TECH
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS


In [26]:
headlines['gold_standard'].value_counts()

gold_standard
POLITICS       40
SPORTS         17
BUSINESS       11
CRIME          10
STYLE           8
TECH            7
ENVIRONMENT     4
RELIGION        3
Name: count, dtype: int64

In [27]:
## We will also create a dichotomic POLITICS/OTHER variable
headlines['gold_politics'] = headlines['gold_standard'].apply(
    lambda x: 'POLITICS' if x == 'POLITICS' else 'OTHER')

headlines['gold_politics'].value_counts()

gold_politics
OTHER       60
POLITICS    40
Name: count, dtype: int64

In [31]:
# Classification function
# NB: this will require a prompt_generator function, defined below
# You don't need to modify this function


def get_predictions(prompt_generator, texts, model):
  """
  Query the API once per text, using the given model and prompt format.
  Returns one answer per text (None when a request failed).
  """
  results = []
  for i, j in texts.items():
    try:
      print(f"\rRequest element {i}", end="")
      completion = client.chat.completions.create(
        model=model,
        messages=prompt_generator(j)
      )
      results.append(completion.choices[0].message.content)
    except Exception as e:
      # Keep a placeholder so that outputs stay aligned with the inputs
      print(e)
      results.append(None)
  print("\rPrediction finished")
  return results



# Zero-shot classification

## Prompt engineering

In [32]:
# Zero shot: prompt engineering

## Function for prompt creation. You can modify the instructions here.
### NB: this is a zero-shot prompt, we will cover few-shot later

def build_prompt_basic(text):
  system_prompt=(
      "Here are some news headlines. "
      "Classify them depending on whether they talk about politics or other topics."
  )

  user_prompt=f"Classify this headline:\n\"{text}\"\n"

  return [{"role":"system",
           "content":system_prompt,
           },
           {"role":"user",
            "content": user_prompt,
           },
  ]


def build_prompt_better(text):
  system_prompt=(
      "You are a helpful and accurate news headline classifier. "
      "Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. "
      "Only respond with exactly one of those two labels."
  )

  user_prompt=f"Classify this headline:\n\"{text}\"\n"

  return [{"role":"system",
           "content":system_prompt,
           },
           {"role":"user",
            "content": user_prompt,
           },
  ]


## Check prompt on an example
text_example = "Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges"

build_prompt_better(text_example)

[{'role': 'system',
  'content': "You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Only respond with exactly one of those two labels."},
 {'role': 'user',
  'content': 'Classify this headline:\n"Gold-Winning Canadian Snowboarder Cops To Error That Wasn\'t Spotted By Judges"\n'}]

In [33]:
## Zero-shot: Inspect predictions
r = get_predictions(
    prompt_generator=build_prompt_better, #prompt you want to use
    texts=headlines["headline"][0:5], #texts you want to classify (change or remove [0:5])
    model="meta-llama/llama-3.3-70b-instruct" #model you want to use
    )

r

Prediction finished


['OTHER', 'POLITICS', 'OTHER', 'OTHER', 'OTHER']

## Validate predictions

We now validate the predictions in order to choose the best model and prompt.
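Before scoring, it can help to normalize the raw answers: a model may return variants such as `Politics.` or add surrounding whitespace, and an answer may be missing altogether. The sketch below is an illustration under those assumptions, not part of the original notebook; `normalize_label` and the `UNPARSED` placeholder are hypothetical names.

```python
def normalize_label(raw, labels=("POLITICS", "OTHER"), fallback="UNPARSED"):
    """Map a raw model answer to one of the expected labels (hypothetical helper)."""
    if not isinstance(raw, str):  # e.g. None for a missing answer
        return fallback
    # Drop surrounding whitespace, then stray quotes and periods, then upper-case
    cleaned = raw.strip().strip(".'\"").upper()
    return cleaned if cleaned in labels else fallback


raw_answers = ["POLITICS", " other\n", "Politics.", None, "I think..."]
print([normalize_label(x) for x in raw_answers])
```

Answers that cannot be mapped to a label show up as a separate `UNPARSED` class in the classification report, which makes formatting failures visible instead of silently counting them as errors of one class.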

In [34]:
## Zero-shot: validate predictions, to choose best model and prompt

### First let's create a smaller dataset
df = headlines[0:10].copy() #select more or less rows, as you wish

### Then let's predict classes on this smaller dataset, with two different models
print("Llama")
df["llama70"] = get_predictions(
    prompt_generator=build_prompt_better,
    texts=df["headline"],
    model="meta-llama/llama-3.3-70b-instruct")

print("Qwen")
df["qwen30"] = get_predictions(
    prompt_generator=build_prompt_better,
    texts=df["headline"],
    model="qwen/qwen3-30b-a3b")


Llama
Prediction finished
Qwen
Prediction finished


In [35]:
### Now let's compute quality scores, comparing with gold standard

print("**** Llama")
print(classification_report(df["gold_politics"], df["llama70"], digits=3))

print("**** Qwen")
print(classification_report(df["gold_politics"], df["qwen30"], digits=3))


**** Llama
              precision    recall  f1-score   support

       OTHER      1.000     1.000     1.000         6
    POLITICS      1.000     1.000     1.000         4

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10

**** Qwen
              precision    recall  f1-score   support

       OTHER      1.000     1.000     1.000         6
    POLITICS      1.000     1.000     1.000         4

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10



In [36]:
# We can also look at how much the models agree
print(pd.crosstab(df["llama70"], df["qwen30"]))

# Display the results, to see on which headlines the models disagree with the gold standard
df

qwen30    OTHER  POLITICS
llama70                  
OTHER         6         0
POLITICS      0         4


Unnamed: 0,headline,gold_standard,gold_politics,llama70,qwen30
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS,OTHER,OTHER,OTHER
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION,OTHER,OTHER,OTHER
2,Why Air Travel Still Sucks,TECH,OTHER,OTHER,OTHER
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE,OTHER,OTHER,OTHER
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS,OTHER,OTHER,OTHER
5,Tim Kaine Was Not The Governor of New Jersey,POLITICS,POLITICS,POLITICS,POLITICS
6,Donald Trump's Promise Of 'Insurance For Every...,POLITICS,POLITICS,POLITICS,POLITICS
7,"Yellowstone Floods Wipe Out Roads, Bridges, St...",ENVIRONMENT,OTHER,OTHER,OTHER
8,Nixon Thought LBJ Tapped His Campaign Plane in...,POLITICS,POLITICS,POLITICS,POLITICS
9,U.S. Lawmakers Join Demand For Puerto Rico Gov...,POLITICS,POLITICS,POLITICS,POLITICS


## Evaluate on testset

Now that we have chosen the best model + prompt, let's evaluate on a testset, in order to get our final quality measure.

In [37]:
## Zero-shot: testset statistics

### Now that we have chosen the best model + prompt, let's evaluate on testset
### (in this case, we take the last N lines from the gold standard)

testset_size=10 # Select more or less rows (more is better)
testset = headlines.tail(testset_size).copy()

testset["prediction"]=get_predictions(
    prompt_generator=build_prompt_better,
    texts=testset["headline"],
    model="meta-llama/llama-3.3-70b-instruct")

print(classification_report(
    testset["gold_politics"], testset["prediction"], digits=3))


Prediction finished
              precision    recall  f1-score   support

       OTHER      1.000     1.000     1.000         8
    POLITICS      1.000     1.000     1.000         2

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10



## Predict on full dataset

We can now use our chosen model and prompt to classify the whole dataset.

We will not run it here because of inference cost, but you can use the following instruction.

In [38]:
#full_prediction = get_predictions(
#    prompt_generator=build_prompt_better,
#    texts=headlines["headline"],
#    model="meta-llama/llama-3.3-70b-instruct")


# Few-shot classification

## Prompt engineering

In [39]:
few_shot_examples = [
        ("Biden signs executive order on student debt relief", "POLITICS"),
        ("Amazon launches new AI-powered Alexa features", "OTHER"),
        ("Congress debates military spending bill", "POLITICS"),
        ("PSG wins much-anticipated Champions League", "OTHER")
        ]


def build_prompt_fewshot(text: str, examples: list[tuple] = few_shot_examples):
  system_prompt = (
    "You are a strict news classifier. "
    "You must respond with one word only — either 'POLITICS' or 'OTHER'. "
    "Do not explain. Do not output anything else."
  )

  examples = "\n".join([f"Classify this headline:\n{headline}\nLabel: {label}\n\n" for headline, label in examples])

  user_prompt = f"{examples}\nClassify this headline:\n\"{text}\"\nLabel:"

  return [{"role":"system", "content":system_prompt},
          {"role":"user","content": user_prompt}
  ]


build_prompt_fewshot("This is a test")

[{'role': 'system',
  'content': "You are a strict news classifier. You must respond with one word only — either 'POLITICS' or 'OTHER'. Do not explain. Do not output anything else."},
 {'role': 'user',
  'content': 'Classify this headline:\nBiden signs executive order on student debt relief\nLabel: POLITICS\n\n\nClassify this headline:\nAmazon launches new AI-powered Alexa features\nLabel: OTHER\n\n\nClassify this headline:\nCongress debates military spending bill\nLabel: POLITICS\n\n\nClassify this headline:\nPSG wins much-anticipated Champions League\nLabel: OTHER\n\n\nClassify this headline:\n"This is a test"\nLabel:'}]

## Validate predictions

In [40]:
## Few-shot: validate predictions, to choose model and prompts

### First let's create a smaller dataset
df = headlines[0:10].copy() #select more or less rows, as you wish

### Then let's predict classes on this smaller dataset
print("Llama")
df["llama70"] = get_predictions(
    prompt_generator=build_prompt_fewshot,
    texts=df["headline"],
    model="meta-llama/llama-3.3-70b-instruct")

### Now let's compute quality scores, comparing with gold standard
print(classification_report(df["gold_politics"], df["llama70"], digits=3))


Llama
Prediction finished
              precision    recall  f1-score   support

       OTHER      1.000     0.833     0.909         6
    POLITICS      0.800     1.000     0.889         4

    accuracy                          0.900        10
   macro avg      0.900     0.917     0.899        10
weighted avg      0.920     0.900     0.901        10
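
The scores above are consistent with a single error on the ten validation headlines: one OTHER headline predicted as POLITICS. The following sketch recomputes per-class precision, recall and F1 from hypothetical label lists that reproduce those counts. The position of the misclassified row is an assumption, and `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(gold, pred, label):
    """Per-class precision, recall and F1 computed from raw label lists."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 6 OTHER and 4 POLITICS, with one OTHER predicted as POLITICS
gold = ["OTHER"] * 6 + ["POLITICS"] * 4
pred = ["OTHER"] * 5 + ["POLITICS"] * 5

for label in ("OTHER", "POLITICS"):
    p, r, f = precision_recall_f1(gold, pred, label)
    print(f"{label:>8}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")
```

With so few rows, a single error moves F1 by about 0.1, which is why larger validation and test sets give more reliable comparisons.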



## Evaluate on testset

In [41]:
## Few-shot: testset statistics

### Now that we have chosen the best model + prompt, let's evaluate on testset
### (in this case, we take the last N lines from the gold standard)

testset_size=10 # Select more or less rows (more is better)
testset = headlines.tail(testset_size).copy()

testset["prediction"]=get_predictions(
    prompt_generator=build_prompt_fewshot,
    texts=testset["headline"],
    model="meta-llama/llama-3.3-70b-instruct")

print(classification_report(
    testset["gold_politics"], testset["prediction"], digits=3))


Prediction finished
              precision    recall  f1-score   support

       OTHER      1.000     1.000     1.000         8
    POLITICS      1.000     1.000     1.000         2

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10



## Predict on full dataset

In [42]:
#full_prediction = get_predictions(
#    prompt_generator=build_prompt_fewshot,
#    texts=headlines["headline"],
#    model="meta-llama/llama-3.3-70b-instruct")


# Exercise: Chain of thought

Here we give a basic example of chain-of-thought prediction, which may yield better results. Note that it is also slower and more expensive, as we are asking the model for longer outputs.

The difficulty here is to post-process the model's output, as it is no longer a single word.

Can you do better than just keeping the last word? For example, by asking the model to output its final answer in JSON format...

In [43]:
# Function for output cleaning
def clean_output(answer):
  return re.sub(".* ", "", answer, flags= re.DOTALL)

clean_output("After thinking, this is my answer: POLITICS")

'POLITICS'

In [44]:
# Function for prompt: prompt engineering happens here!
def build_prompt_cot(text):
  system_prompt=(
      "You are a helpful and accurate news headline classifier. "
      "Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. "
      "Take a step back, think before you give your final answer. "
      "When you are done thinking, give your final answer as "
      "'FINAL ANSWER: POLITICS' or 'FINAL ANSWER: OTHER'."
  )

  user_prompt=f"Classify this headline:\n\"{text}\"\n"

  return [{"role":"system",
           "content":system_prompt,
           },
           {"role":"user",
            "content": user_prompt,
           },
  ]


## Check prompt on an example
text_example = "Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges"

build_prompt_cot(text_example)

[{'role': 'system',
  'content': "You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Take a step back, think before you give your final answer. When you are done thinking, give your final answer as 'FINAL ANSWER: POLITICS' or 'FINAL ANSWER: OTHER'."},
 {'role': 'user',
  'content': 'Classify this headline:\n"Gold-Winning Canadian Snowboarder Cops To Error That Wasn\'t Spotted By Judges"\n'}]

In [45]:
# Do predictions, and clean the outputs

### First let's create a smaller dataset
df = headlines[10:15].copy() #select more or less rows, as you wish

### Then let's get the answers
answers = get_predictions(
    prompt_generator=build_prompt_cot,
    texts=df["headline"],
    model="meta-llama/llama-3.3-70b-instruct"
)


Prediction finished


In [46]:
### And let's clean the answers
answers_clean = [clean_output(x) for x in answers]

print("Predictions:")
print(answers_clean)

print("Gold Standard:")
print(df["gold_politics"].to_list())

Predictions:
['POLITICS', 'POLITICS', 'OTHER', 'POLITICS', 'OTHER']
Gold Standard:
['POLITICS', 'POLITICS', 'OTHER', 'POLITICS', 'OTHER']
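
Keeping only the last word is fragile: an answer ending in `FINAL ANSWER: POLITICS.` would be cleaned to `POLITICS.`, with the trailing period. One possible direction for the exercise, offered as a sketch rather than a reference solution, is to search explicitly for the `FINAL ANSWER:` marker requested in the prompt. `extract_final_answer` is a hypothetical helper, and the tolerance for quotes and asterisks around the label is an assumption about how models may format their output.

```python
import re


def extract_final_answer(answer, labels=("POLITICS", "OTHER")):
    """Return the label after the last 'FINAL ANSWER:' marker, or None if absent."""
    if not isinstance(answer, str):  # e.g. None for a missing answer
        return None
    # Allow whitespace, quotes or markdown asterisks between marker and label
    pattern = r"FINAL ANSWER:[\s'\"*]*(" + "|".join(labels) + r")\b"
    matches = re.findall(pattern, answer, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


print(extract_final_answer("It mentions a senator.\nFINAL ANSWER: POLITICS."))
print(extract_final_answer("Hard to say, could be POLITICS or OTHER"))
```

Returning `None` when the marker is missing keeps unparseable answers separate from genuine predictions, so they can be counted or retried.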


# Bonus: Run local models

In this section we will see how to run an LLM locally, using quantized models and the llama-cpp-python module (which works with many models other than Llama).

Quantized versions of LLMs are much smaller than the original ones, allowing them to be run locally on cheaper hardware (even your laptop). The downside is some loss in quality, which is usually modest with Q4 or Q5 quantization.
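
To get a feel for the size reduction, here is a back-of-the-envelope sketch: the weights take roughly (number of parameters) × (bits per weight) / 8 bytes. The 3-billion parameter count is an assumed round figure, and real quantized files tend to be somewhat larger than this estimate because they also store scaling metadata and may keep some tensors at higher precision. `approx_model_size_gb` is a hypothetical helper.

```python
def approx_model_size_gb(n_params, bits_per_weight):
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 3e9  # assumed round figure for a "3B" model
for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{approx_model_size_gb(n_params, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights divides the memory needed for the weights by about four.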

**NB: this is independent from the previous code, you can start running the notebook from here**

## Initial setup


In [None]:
# Install the modules
!pip install -q pandas==2.2.2 scikit-learn==1.6.0

# Llama-cpp installation
# (see instructions at https://github.com/abetlen/llama-cpp-python)

### Llama-cpp full install (for GPU)
#!pip install -q llama-cpp-python
### Llama-cpp CPU-only install
!pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu


Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cpu
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.14.tar.gz (51.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python


In [None]:
# Load modules
import pandas as pd
from sklearn.metrics import classification_report, f1_score
from os.path import exists
import urllib.request
from llama_cpp import Llama

In [None]:
# Load data
data_path = "http://ollion.cnrs.fr/wp-content/uploads/2025/06/headlines.csv"
headlines = pd.read_csv(data_path)

# We will also create a dichotomic POLITICS/OTHER variable
headlines['gold_politics'] = headlines['gold_standard'].apply(
    lambda x: 'POLITICS' if x == 'POLITICS' else 'OTHER')

## Download and load a model

You can find many quantized models on Hugging Face: search for the model name together with the keyword GGUF.

Loading the model may take some time, depending on your hardware and model size.

In [None]:
model_url = "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"
model_file = "Llama-3.2-3B-Instruct.gguf"

if not exists(model_file):
  urllib.request.urlretrieve(model_url, model_file)

print(f"File {model_file} downloaded successfully!")

llm = Llama(
      model_path=model_file,
      #chat_format="llama-3", # Uncomment to use a specific chat format
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
      )

print("\n\nModel successfully loaded.")

llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from Llama-3.2-3B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                            

File Llama-3.2-3B-Instruct.gguf downloaded successfully!


init_tokenizer: initializing tokenizer for type 2
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_



Model successfully loaded.


Using gguf chat template: {{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date

In [None]:
# Let's check if it works (here with streaming text)

llm_stream = llm.create_chat_completion(
      temperature=0.7,
      stream=True,
      messages = [
          {
              "role": "system",
              "content": "You are Donald Trump."},
          {
              "role": "user",
              "content": "What is the airspeed velocity of an unladen swallow?"
          }
      ]
)


llm_output = ""
for i in llm_stream:
  if "content" in i['choices'][0]['delta']:
    tmp = i['choices'][0]['delta']["content"]
    llm_output = llm_output + tmp
    print(tmp, end="")

print("\n")


Folks, let me tell you, nobody knows more about great questions than I do. And this one, it's a big league question, believe me. 

Now, I've been asked this question before, and I've got to tell you, it's a total hoax. A total hoax. Nobody really knows the airspeed velocity of an unladen swallow. It's a ridiculous question, folks. 

But, if I had to make an estimate, I'd say it's a big league number, a tremendous number. Maybe 50 miles per hour? Maybe 60? I mean, it's a beautiful bird, folks, a real winner. And it's gotta be fast, believe me. 

But let me tell you, nobody, nobody, is better at estimating the airspeed velocity of an unladen swallow than I am. And I'm telling you, it's gonna be huge. Just huge.

llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    6567.44 ms /    52 tokens (  126.30 ms per token,     7.92 tokens per second)
llama_perf_context_print:        eval time =   58535.37 ms /   187 runs   (  313.02 ms per token,     3.19 tokens per second)
llama_perf_context_print:       total time =   65457.49 ms /   239 tokens






## Generate

In [None]:
# Classification function
# NB: this will require a prompt_generator function, defined below
# You can modify the temperature in this function


def get_local_predictions(prompt_generator, texts):
  results = []
  for i,j in texts.items():
    try:
      print(f"Generating element {i}")
      completion = llm.create_chat_completion(
        messages=prompt_generator(j),
        temperature=0.7,
      )
      results.append(completion['choices'][0]['message']['content'])
    except Exception as e:
      # Keep a placeholder so that outputs stay aligned with the inputs
      print(e)
      results.append(None)
  print("\rPrediction finished")
  return results



In [None]:
# Classification function
# NB: this will require a prompt_generator function, defined below
# NB: running this cell redefines the function above, with the default temperature
# You don't need to modify this function


def get_local_predictions(prompt_generator, texts):
  results = []
  for i,j in texts.items():
    try:
      print(f"Generating element {i}")
      completion = llm.create_chat_completion(
        messages=prompt_generator(j)
      )
      results.append(completion['choices'][0]['message']['content'])
    except Exception as e:
      # Keep a placeholder so that outputs stay aligned with the inputs
      print(e)
      results.append(None)
  print("\rPrediction finished")
  return results



In [None]:
# Local zero shot: prompt engineering


def build_prompt_better(text):
  system_prompt=(
      "You are a helpful and accurate news headline classifier. "
      "Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. "
      "Only respond with exactly one of those two labels."
  )

  user_prompt=f"Classify this headline:\n\"{text}\"\n"

  return [{"role":"system",
           "content":system_prompt,
           },
           {"role":"user",
            "content": user_prompt,
           },
  ]


## Check prompt on an example
text_example = "Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges"

build_prompt_better(text_example)

[{'role': 'system',
  'content': "You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Only respond with exactly one of those two labels."},
 {'role': 'user',
  'content': 'Classify this headline:\n"Gold-Winning Canadian Snowboarder Cops To Error That Wasn\'t Spotted By Judges"\n'}]

In [None]:
## Local zero-shot: validate predictions, to choose model and prompts

### First let's create a smaller dataset
df = headlines[10:15].copy() #select more or less rows, as you wish

### Then let's predict classes on this smaller dataset
df["prediction"] = get_local_predictions(
    prompt_generator=build_prompt_better,
    texts=df["headline"])


Llama.generate: 27 prefix-match hit, remaining 61 prompt tokens to eval


Generating element 10


llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    6784.17 ms /    61 tokens (  111.22 ms per token,     8.99 tokens per second)
llama_perf_context_print:        eval time =     859.59 ms /     3 runs   (  286.53 ms per token,     3.49 tokens per second)
llama_perf_context_print:       total time =    7649.68 ms /    64 tokens
Llama.generate: 73 prefix-match hit, remaining 17 prompt tokens to eval


Generating element 11


llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    3320.64 ms /    17 tokens (  195.33 ms per token,     5.12 tokens per second)
llama_perf_context_print:        eval time =     401.83 ms /     1 runs   (  401.83 ms per token,     2.49 tokens per second)
llama_perf_context_print:       total time =    3726.00 ms /    18 tokens
Llama.generate: 73 prefix-match hit, remaining 20 prompt tokens to eval


Generating element 12


llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    2259.57 ms /    20 tokens (  112.98 ms per token,     8.85 tokens per second)
llama_perf_context_print:        eval time =     292.37 ms /     1 runs   (  292.37 ms per token,     3.42 tokens per second)
llama_perf_context_print:       total time =    2555.12 ms /    21 tokens
Llama.generate: 74 prefix-match hit, remaining 30 prompt tokens to eval


Generating element 13


llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    3545.05 ms /    30 tokens (  118.17 ms per token,     8.46 tokens per second)
llama_perf_context_print:        eval time =     860.49 ms /     3 runs   (  286.83 ms per token,     3.49 tokens per second)
llama_perf_context_print:       total time =    4411.69 ms /    33 tokens
Llama.generate: 74 prefix-match hit, remaining 22 prompt tokens to eval


Generating element 14


llama_perf_context_print:        load time =    6567.77 ms
llama_perf_context_print: prompt eval time =    2508.77 ms /    22 tokens (  114.04 ms per token,     8.77 tokens per second)
llama_perf_context_print:        eval time =     290.25 ms /     1 runs   (  290.25 ms per token,     3.45 tokens per second)
llama_perf_context_print:       total time =    2802.62 ms /    23 tokens


Prediction finished


In [None]:
### Let's examine the answers
print("Gold standard:")
print(df["gold_politics"].to_list())

print("Prediction:")
print(df["prediction"].to_list())

Gold standard:
['POLITICS', 'POLITICS', 'OTHER', 'POLITICS', 'OTHER']
Prediction:
['POLITICS', 'OTHER', 'OTHER', 'POLITICS', 'OTHER']


In [None]:
### Now let's compute quality scores, comparing with gold standard
print(classification_report(df["gold_politics"], df["prediction"], digits=3))

              precision    recall  f1-score   support

       OTHER      0.667     1.000     0.800         2
    POLITICS      1.000     0.667     0.800         3

    accuracy                          0.800         5
   macro avg      0.833     0.833     0.800         5
weighted avg      0.867     0.800     0.800         5



Once you are satisfied with your model and prompting choices, you can run one final evaluation on a held-out testset, and run the inference on the whole dataset, using the `get_local_predictions` function and your custom prompt generator function.