Colab notebook written by Emma Bonutti D'Agostini and Emilien Schultz, June 2025.

## Set-up

If you want to run this code on Colab, download this notebook and upload it to a folder in your personal Google Drive account.

**We nonetheless recommend running it locally in a code editor on your personal computer, such as VSCode.**

### Packages
First, let's install and import packages needed for the application.

In [None]:
!pip install -q tqdm pandas==2.2.2 scikit-learn==1.6.0 pyyaml openai transformers


In [None]:
import pandas as pd
import json
import yaml

### OpenRouter API requests

**What is OpenRouter?**
> OpenRouter is a unified API that allows developers to access and use a wide range of powerful language models—such as OpenAI's GPT, Claude, Mistral, and others—without having to host them locally.
>
> Instead of installing and running large models on your own hardware, OpenRouter provides a simple interface to purchase and perform inference via third-party providers.

**How does it work?**
> OpenRouter acts as an intermediary that routes your model requests (inference) to a variety of model providers, depending on your selection. You simply send API requests through a single endpoint, and OpenRouter handles:
>- Model selection (unless you specify)
>- Authentication
>- Billing
>- Request routing

**Why Use OpenRouter Instead of Running Models Locally?**
> Running large generative models (Llama, Mistral, etc.) locally is challenging because of hardware limitations (GPUs), storage and memory requirements, and setup complexity.
>
> Using OpenRouter offers several key advantages:
>
>- No need for expensive local GPUs or cloud infrastructure.
>- Immediate access to multiple top-tier models.
>- Easy integration with just a few lines of code.
>- Pay-as-you-go pricing based on usage.


To use it with Python, we can use the OpenAI Python client (the `openai` package), pointed at the OpenRouter endpoint, together with an API key that has credit:
https://openrouter.ai/docs/quickstart

The key to use OpenRouter is shared in a `.txt` file. Have this file, which is shared on the drive together with this notebook, ready before you continue.
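Rather than pasting the key into the notebook, you can read it from the file at run time. Here is a minimal sketch, assuming the file contains only the key as plain text; the helper name `load_api_key` and the file name in the usage comment are hypothetical:

```python
from pathlib import Path

def load_api_key(path):
    """Read an API key from a plain-text file and strip surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No key found in {path}")
    return key

# Hypothetical usage (the file name is an assumption, adapt it to your own file):
# token_or = load_api_key(my_path + "openrouter_key.txt")
```

Reading the key from a file keeps it out of the notebook itself, which matters if you later share the notebook.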

**But be careful:**
The privacy of your data is not guaranteed, as you are transmitting its content to third parties. Make sure that your data is neither sensitive nor copyrighted.

# Text classification with GPT models

**Text classification** with GPT-like models means assigning a category to a piece of text **without the model being explicitly trained** on labeled examples for that task. Instead, the model uses its general language understanding to infer the correct label from the prompt.

There are two variants of this approach:
- zero-shot text classification (just asking the model to classify text)
- few-shot text classification (also adding one/a few examples)

In [None]:
# If you are working with Colab, connect this notebook to your personal Google Drive account.

# ***ATTENTION*** Do not pass any proprietary/private information
# ***ATTENTION*** Prefer a local solution if you can (for instance: Jupyter notebooks)

# This allows you to access, through this notebook, data files etc. stored on your drive
# As well as to create new files to save the output of the data processing pipeline

# A window will open, and you'll have to give your consent to make the connection
# You will also be asked to choose to which Google Drive account you want to connect, in case you have several

# If the process succeeds, this cell will print the message "Mounted at /content/drive"

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Establish filepath
# Place this notebook, the data sample you want to use, and any other useful files in the same folder
# If you use a google drive path, it should start with /content/drive/My Drive/
my_path = '/YOUR/FILE/PATH/'

In [None]:
# Get an OpenRouter key and paste it here
token_or = "INSERT_YOUR_KEY_HERE"

In [None]:
# Connect with the OpenRouter API
from openai import OpenAI
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=token_or,
)

In [None]:
# Test making a request
completion = client.chat.completions.create(
  model="openai/gpt-4o",
  messages=[
    {
      "role": "user",
      "content": "What is SICSS, but could you explain it in a funny way ?"
    }
  ]
)
completion

ChatCompletion(id='gen-1750922937-fvpreiAPpKcn8HeAa93o', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, imagine SICSS as the high school for data nerds where computers are the teachers and memes are the curriculum.\n\nSICSS stands for the Summer Institutes in Computational Social Science. It\'s a cool camp where social scientists and data wizards come together to level up their skills. Picture it like Hogwarts, but for those who wield Python and R instead of wands, and where the sorting hat places you based on your data analysis prowess.\n\nDuring SICSS, instead of getting detention for not doing your homework, you get mildly roasted by your peers for neglecting to comment your code. And the most popular kids? They\'re the ones who can make bar charts dance or speak fluent machine learning.\n\nSICSS is like a data-driven playground where everyone\'s competing to see who can make the best use of a CSV file. It\'s where you hear

Have a look at the response to the request.

### Choose the right model for your task

OpenRouter gives you access to a wide variety of language models through a single API interface. Choosing the right one depends on your specific needs, including performance, cost, and task complexity.

Browse available models here:
👉 https://openrouter.ai/models

Each model includes:
- A description of its capabilities
- The provider (e.g., Meta, OpenAI, Anthropic)
- Cost per token
- Performance notes (e.g., reasoning, coding, summarization)

**How to choose?** It depends on your needs, your budget, and the specificity of the task (text classification, summarization, conversation, etc.).

> Have a look at the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), which evaluates models on reasoning, language understanding, math, and other benchmarks.

Unlike tools such as **Ollama**, which often run **quantized** (compressed) versions of models locally to fit limited hardware, **OpenRouter gives you access to full-size, high-performance models running on dedicated inference servers**. This means better performance and state-of-the-art (SOTA) architectures, without using local computing resources.


For this text classification exercise, we can try two models:
- **Llama 3.3 70B** (Meta): open-access, SOTA performance
- **GPT-4o-mini** (OpenAI)

Be careful: some models are expensive!

> **Evaluate the cost (money and time)** of the request (even though prices keep falling, it is useful to have an estimate)

- For a model, find the price per token and compute the number of tokens in your request/answer
- Estimate the time on a small data sample
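The money side of this estimate is simple arithmetic once you know the token counts and the prices. Below is a rough sketch, assuming prices are quoted per million tokens with separate input and output rates; the function name `estimate_cost` and every number in the example are placeholders, not real prices:

```python
def estimate_cost(n_requests, input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Estimated cost (in the currency of the prices) for n_requests similar requests."""
    total_in = n_requests * input_tokens
    total_out = n_requests * output_tokens
    return (total_in * price_in_per_million
            + total_out * price_out_per_million) / 1_000_000

# Placeholder values: 35 input tokens and 2 output tokens per request,
# at hypothetical prices of 0.10 (input) and 0.40 (output) per million tokens.
cost = estimate_cost(100, 35, 2, 0.10, 0.40)
print(f"Estimated cost for 100 requests: {cost:.6f}")
```

Replace the placeholders with the token counts you measure and the prices listed on the model's page.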

To estimate the number of tokens in the request/answer for a specific model, we need that model's tokenizer: we can get one from Hugging Face.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4")
tokens = tokenizer.encode("this is a test", add_special_tokens=False)
print(f"Number of tokens: {len(tokens)}")


Number of tokens: 4


Let's also measure the time of the request:

In [None]:
%%time
completion = client.chat.completions.create(
  model="meta-llama/llama-3.3-70b-instruct",
  messages=[
    {
      "role": "user",
      "content": "Here is a text : Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges. Is it positive or negative or neutral. Answer only one of those options."
    }
  ]
)

Thanks to the API, the request returns very quickly!
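Outside of a notebook, or if you want to reuse the measurement, you can time a call in plain Python and extrapolate to a batch. This is a sketch only: `time_call` is a hypothetical helper, and `fake_request` is a stand-in for the real API call so that the example runs offline.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_request():
    # Stand-in for client.chat.completions.create(...), so the sketch runs offline
    time.sleep(0.05)
    return "NEUTRAL"

answer, seconds = time_call(fake_request)
print(f"One request: {seconds:.2f} s, "
      f"so 100 sequential requests take roughly {100 * seconds:.0f} s")
```

A single measurement is noisy, so timing a handful of requests and averaging gives a more reliable estimate.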

### Exercise

For 100 headline classification requests, what is the estimated cost in time and money with the two models at hand?

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4")

# Tokens in the text to process (put here an average-length text to classify)
tokens_text = tokenizer.encode("Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges",
                               add_special_tokens=False)
print(f"Number of tokens in one headline: {len(tokens_text)}\nNumber of tokens in 100 headlines: {len(tokens_text)*100}")

Number of tokens in one headline: 19
Number of tokens in 100 headlines: 1900


In [None]:
# Tokens in the prompt (put here your prompt)
tokens_request = tokenizer.encode("You are a news headlines classifier. Classify this news headline as Positive or Negative",
                                  add_special_tokens=False)
print(f"Number of tokens in one prompt: {len(tokens_request)}\nNumber of tokens in 100 prompts: {len(tokens_request)*100}")

Number of tokens in one prompt: 16
Number of tokens in 100 prompts: 1600


In [None]:
#Do the calculations accordingly

## Zero-shot text classification

**Zero-shot text classification** with GPT-like models means assigning a category to a piece of text **without the model being explicitly trained** on labeled examples for that task. Instead, the model uses its general language understanding to infer the correct label from the prompt.

**Example:**

**Task:** Classify the sentence:

> "The government passed a new climate bill."

Into one of: *Politics, Sports, Technology*


**Prompt to GPT:**
Classify the following sentence into Politics, Sports, or Technology: 'The government passed a new climate bill.'

**GPT's answer:** Politics

The model can do this thanks to its broad training on diverse language data. No retraining or fine-tuning is needed — just good prompting.

### Dataset

We will use the [**News Category Dataset**](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download) (Misra 2022). It contains English-language news headlines from the Huffington Post, annotated with categories (politics, environment, business, crime, education, etc.).


We will use a sample of 100 news headlines to limit the number of requests made to the API.

In [None]:
# Option 1: Load data from your personal drive (put a sample in the same folder as this notebook, whose path you set above)
# Skip these lines if you use Option 2
headlines = pd.read_csv(my_path + "headlines.csv")
headlines.head()

# Option 2: Retrieve a sample of news headlines from this URL (we placed it on our server for ease of use)
url = "http://ollion.cnrs.fr/wp-content/uploads/2025/06/headlines.csv"
headlines = pd.read_csv(url)
headlines.head()

Unnamed: 0,headline,gold_standard
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION
2,Why Air Travel Still Sucks,TECH
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS


In [None]:
headlines['gold_standard'].value_counts()

Unnamed: 0_level_0,count
gold_standard,Unnamed: 1_level_1
POLITICS,40
SPORTS,17
BUSINESS,11
CRIME,10
STYLE,8
TECH,7
ENVIRONMENT,4
RELIGION,3


For the moment, we will focus on the difference between POLITICS and other.

For that, we add a new gold standard column recoded in a binary way, so that the only two options are POLITICS or OTHER.

In [None]:
headlines['gold_standard_binary'] = headlines['gold_standard'].apply(lambda x : x if x == "POLITICS" else "OTHER")
headlines['gold_standard_binary'].value_counts()

Unnamed: 0_level_0,count
gold_standard_binary,Unnamed: 1_level_1
OTHER,60
POLITICS,40


### Prompt & Predictions

The idea is to prompt a model that has been trained to follow instructions. Such a model expects its input in a specific format.

There are two points to consider:

- the configuration of the model: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
- the way the API we use manages it: https://openrouter.ai/docs/api-reference/overview

For Llama 3.3, the prompt can be divided into two parts: a system prompt and a user prompt.

```
[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is the capital of France?"}
]

```

Let's try three prompts to show the variation: a not-so-good prompt, a better-formulated prompt, and a last prompt that is clarified even further. We will then compare the scores we obtain with each prompt.

In [None]:
#We put the prompt into a function
#One function for each different prompt we want to test
#The function takes as argument the text that we want to classify

#Not-so-good prompt
def get_prompt1(text):
  return [{"role":"system",
           "content":"Here are some news headlines. Classify them depending on whether they talk about politics or other topics."},
           {"role":"user","content": f"Classify this headline:\n\"{text}\"\n"}
  ]

#Better prompt
def get_prompt2(text):
  return [{"role":"system",
           "content":"Here are some news headlines. Classify them depending on whether they talk about politics or other topics. Respond with 'POLITICS' or 'OTHER'."},
           {"role":"user","content": f"Classify this headline:\n\"{text}\"\nLabel:"}
  ]

#Even better prompt
def get_prompt3(text):
  return [{"role":"system","content":"You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Only respond with exactly one of those two labels."},
          {"role":"user", "content":f"Classify this headline:\n\"{text}\"\nLabel:"}
  ]

In [None]:
get_prompt1("This is a test")

[{'role': 'system',
  'content': 'Here are some news headlines. Classify them depending on whether they talk about politics or other topics.'},
 {'role': 'user', 'content': 'Classify this headline:\n"This is a test"\n'}]

In [None]:
get_prompt2("This is a test")

[{'role': 'system',
  'content': "Here are some news headlines. Classify them depending on whether they talk about politics or other topics. Respond with 'POLITICS' or 'OTHER'."},
 {'role': 'user',
  'content': 'Classify this headline:\n"This is a test"\nLabel:'}]

In [None]:
get_prompt3("This is a test")

[{'role': 'system',
  'content': "You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Only respond with exactly one of those two labels."},
 {'role': 'user',
  'content': 'Classify this headline:\n"This is a test"\nLabel:'}]

Now we can generate the requests, using the prompts we built for this model specifically.

We loop over the headlines in the sample, making requests to the API with the prompts defined in the functions above. It is good practice to start by iterating over only a few lines, to check that everything runs smoothly without "wasting" the resources you purchased through your OpenRouter API key.

In [None]:
results = [] #initialize empty result list
for i,j in headlines["headline"][0:5].items(): #iterate over headlines in the sample; change or remove [0:5] to process a larger share of the sample or the entire sample
  try:
    print(f"Request element {i}")
    completion = client.chat.completions.create( #make a request to the API
      model="meta-llama/llama-3.3-70b-instruct", #specify the model you want to use
      messages=get_prompt1(j) #You can substitute with "get_prompt2" or "get_prompt3" (functions defined above)
    )
    results.append(completion)
  except Exception as e:
    print(e)
    results.append(None)

Request element 0
Request element 1
Request element 2
Request element 3
Request element 4


And extract the result:

In [None]:
#Display the output of the model for each classified headline, i.e. its predictions
#Test the cell above with different prompts, and evaluate which prompt gives the results that best satisfy you
classifications = [i.choices[0].message.content for i in results]
classifications

['I would classify this headline as "other topics", specifically sports. It appears to be about an athlete in the sports world, rather than a political figure or event.',
 'Although the subject matter is ancient history, I would classify this headline as "politics" because it appears to describe a significant event related to the governance and liberation of a group of people, specifically the Israelites, from the rule of Pharaoh. However, it\'s worth noting that this headline is likely to be a humorous or satirical take on the biblical account of the Exodus, rather than a serious news headline.',
 'I would classify this headline as "Other" (not politics), as it appears to be discussing a topic related to transportation and travel, rather than government or political issues.',
 'I would classify this headline as "other topics", specifically lifestyle or entertainment, as it deals with fashion, beauty, and personal resolutions, rather than politics.',
 'I would classify this headline as
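With a loosely worded prompt, the model answers in full sentences, so the label has to be recovered from the text before any evaluation. One possible sketch, assuming the two labels POLITICS and OTHER; the helper `normalize_label` is hypothetical and deliberately returns `None` when an answer mentions both labels or neither:

```python
import re

LABELS = ("POLITICS", "OTHER")

def normalize_label(answer):
    """Map a free-text model answer to one label, or None if it is ambiguous."""
    if answer is None:
        return None
    upper = answer.upper()
    # Whole-word match, so that "political" or "rather" do not count as labels
    found = {label for label in LABELS if re.search(rf"\b{label}\b", upper)}
    return found.pop() if len(found) == 1 else None

print(normalize_label("POLITICS"))
print(normalize_label('I would classify this headline as "other topics", specifically sports.'))
print(normalize_label('I would classify this headline as "Other" (not politics).'))
```

The last call returns `None` because the sentence mentions both labels. This kind of ambiguity is one reason to ask the model for the label only, as the stricter prompts do.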

After having completed these tests, let's put everything together in a function that takes as arguments a prompt generator (one of the functions get_prompt1, 2 or 3 defined above, or any other you might define following the same scheme), a list of texts (in this case, the news headlines to annotate) and a model:

In [None]:
def do_predictions(prompt_generator, texts, model):
  """
  Inference with the API for a model, a list of text and a prompt format
  """
  results = []
  for i,j in texts.items():
    try:
      completion = client.chat.completions.create(
        model=model,
        messages=prompt_generator(j)
      )
      results.append(completion)
    except Exception as e:
      print(e)
      results.append(None)
  return results
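Note that `do_predictions` appends `None` when a request fails, so reading `.choices[0].message.content` on every element would raise an error as soon as one request has failed. A minimal sketch of a safer extraction step follows; the helper name `extract_answers` is ours, and the `SimpleNamespace` objects only mimic the shape of an API response so that the example runs offline:

```python
from types import SimpleNamespace

def extract_answers(results):
    """Return the text of each completion, or None where the request failed."""
    answers = []
    for completion in results:
        if completion is None or not completion.choices:
            answers.append(None)
        else:
            content = completion.choices[0].message.content
            answers.append(content.strip() if content else None)
    return answers

# Stand-in object mimicking the shape of a ChatCompletion (assumption for the demo)
ok = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=" OTHER\n"))]
)
print(extract_answers([ok, None]))
```

Failed requests then show up as missing values, which you can retry or drop before computing any metric.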

In [None]:
r = do_predictions(get_prompt3, #prompt you want to use
                   headlines["headline"][0:5], #texts you want to classify (change or remove [0:5])
                   "meta-llama/llama-3.3-70b-instruct" #model you want to use
                   )
[i.choices[0].message.content for i in r]

['OTHER', 'POLITICS', 'OTHER', 'OTHER', 'OTHER']

Let's run it for the best prompt & for the 2 models

In [None]:
# Create a smaller dataset for test
df = headlines[0:10].copy() #select more or less rows, as you wish

r_llama33 = do_predictions(get_prompt3, df["headline"], "meta-llama/llama-3.3-70b-instruct")
r_gpt4o = do_predictions(get_prompt3, df["headline"], "openai/gpt-4o-mini")

In [None]:
# Store the predictions as new columns of the original dataset, so that then you can compare with the gold standard
df["llama3.3"] = [i.choices[0].message.content for i in r_llama33]
df["gpt4omini"] = [i.choices[0].message.content for i in r_gpt4o]

In [None]:
# Consider on which classifications the models agree or disagree
pd.crosstab(df["llama3.3"], df["gpt4omini"])

gpt4omini,OTHER,POLITICS
llama3.3,Unnamed: 1_level_1,Unnamed: 2_level_1
OTHER,5,0
POLITICS,3,2


In [None]:
# Display the results, to see on which headlines the models disagree with the gold standard
df

Unnamed: 0,headline,gold_standard,gold_standard_binary,llama3.3,gpt4omini
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS,OTHER,OTHER,OTHER
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION,OTHER,POLITICS,OTHER
2,Why Air Travel Still Sucks,TECH,OTHER,OTHER,OTHER
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE,OTHER,OTHER,OTHER
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS,OTHER,OTHER,OTHER
5,Tim Kaine Was Not The Governor of New Jersey,POLITICS,POLITICS,POLITICS,OTHER
6,Donald Trump's Promise Of 'Insurance For Every...,POLITICS,POLITICS,POLITICS,POLITICS
7,"Yellowstone Floods Wipe Out Roads, Bridges, St...",ENVIRONMENT,OTHER,OTHER,OTHER
8,Nixon Thought LBJ Tapped His Campaign Plane in...,POLITICS,POLITICS,POLITICS,OTHER
9,U.S. Lawmakers Join Demand For Puerto Rico Gov...,POLITICS,POLITICS,POLITICS,POLITICS


### Evaluation

Now that we have a prediction, we need to measure its quality. For that, we compare it to the gold standard.


The classical metrics are precision, recall, and F1-score, each of which can be averaged across classes (macro or micro average).
> **Precision**: Of all the items the model said were positive, how many actually were?
  - precision = TP /(TP+FP)
  - high precision = few false positives

> **Recall:** Of all the actual positives, how many did the model correctly identify?
  - recall = TP /(TP+FN)
  - high recall = few false negatives

> **F1-Score:** The harmonic mean of precision and recall:
  - F1 = 2 * (Precision * Recall) / (Precision + Recall)

> **Macro average**: Calculate precision/recall/F1 for each class, then take the unweighted average.

> **Micro average:** Aggregate all TP, FP, FN across classes, then compute metrics.
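To see how these definitions work in practice, the sketch below computes them by hand in plain Python. The two lists reproduce the `gold_standard_binary` and `llama3.3` columns of the 10-row table displayed earlier, as recorded in this notebook's output; a new run of the model may give different predictions. The function name `precision_recall_f1` is ours.

```python
def precision_recall_f1(gold, pred, positive):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["OTHER", "OTHER", "OTHER", "OTHER", "OTHER",
        "POLITICS", "POLITICS", "OTHER", "POLITICS", "POLITICS"]
pred = ["OTHER", "POLITICS", "OTHER", "OTHER", "OTHER",
        "POLITICS", "POLITICS", "OTHER", "POLITICS", "POLITICS"]

scores = {label: precision_recall_f1(gold, pred, label) for label in ("OTHER", "POLITICS")}
# Macro average: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

For POLITICS there are 4 true positives, 1 false positive and no false negative, hence a precision of 0.8 and a recall of 1.0.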

In [None]:
from sklearn.metrics import classification_report, f1_score
df.head()

Unnamed: 0,headline,gold_standard,gold_standard_binary,llama3.3,gpt4omini
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS,OTHER,OTHER,OTHER
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION,OTHER,POLITICS,OTHER
2,Why Air Travel Still Sucks,TECH,OTHER,OTHER,OTHER
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE,OTHER,OTHER,OTHER
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS,OTHER,OTHER,OTHER


In [None]:
# Compare performances of first model we tested with the gold standard
print(classification_report(df["gold_standard_binary"], df["llama3.3"], digits=3))

              precision    recall  f1-score   support

       OTHER      1.000     0.833     0.909         6
    POLITICS      0.800     1.000     0.889         4

    accuracy                          0.900        10
   macro avg      0.900     0.917     0.899        10
weighted avg      0.920     0.900     0.901        10



In [None]:
# Compare performances of second model we tested with the gold standard
print(classification_report(df["gold_standard_binary"], df["gpt4omini"], digits=3))

              precision    recall  f1-score   support

       OTHER      0.750     1.000     0.857         6
    POLITICS      1.000     0.500     0.667         4

    accuracy                          0.800        10
   macro avg      0.875     0.750     0.762        10
weighted avg      0.850     0.800     0.781        10



In [None]:
# Calculate the f1 score specifically
f1_score(df["gold_standard_binary"], df["llama3.3"], average="macro")

0.898989898989899

In [None]:

f1_score(df["gold_standard_binary"], df["gpt4omini"], average="macro")

0.7619047619047619

## Your turn to try

Now try to see if you can produce better results by testing different prompts.

### Some prompt engineering tips
1. Be **specific** (write more if necessary): if categories are vague, provide definitions.
2. Use **simple**, accessible language
3. Create a scenario (**system prompt**)
4. Specify requirements of **output format**; ask for the label only
5. Use **delimiters**: Put instructions at the beginning of the prompt and use ### or """ to separate the instruction and context.
6. Add **clear syntax**. Using clear syntax for your prompt—including punctuation, headings, and section markers—helps communicate intent and often makes outputs easier to parse.

For more complex applications:
7. **Encourage the model to think** through the problem before giving an answer. Telling the model to reason step-by-step can help avoid rushing to incorrect conclusions.
8. **"Chain-of-thought"**: Use sentences like “Explain the process step by step”, “Think each step”, "Thinking backwards" or “Cite the reasons behind”.
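As an illustration of tips 1 to 6, here is one way a prompt function could combine category definitions, delimiters and a strict output format. It is only a sketch: the function name `get_prompt_with_definitions` is hypothetical and the category definitions are our own wording, so adapt them to your task.

```python
def get_prompt_with_definitions(text):
    """Prompt applying the tips: definitions, delimiters, strict output format."""
    system = (
        "You are a news headline classifier.\n"
        "### Categories\n"
        "POLITICS: government, elections, legislation, politicians, public policy.\n"
        "OTHER: any headline that is not mainly about politics.\n"
        "### Output format\n"
        "Answer with exactly one label, POLITICS or OTHER, and nothing else."
    )
    # Triple quotes delimit the headline so it cannot be mistaken for an instruction
    user = f'### Headline\n"""{text}"""\n### Label'
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = get_prompt_with_definitions("This is a test")
print(messages[1]["content"])
```

Since it returns the same list-of-messages structure as `get_prompt1`, `get_prompt2` and `get_prompt3`, it can be passed to `do_predictions` in the same way.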

In [None]:
headlines.head()

Unnamed: 0,headline,gold_standard
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION
2,Why Air Travel Still Sucks,TECH
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS


## Few-shot text classification

**Few-shot classification** includes one or a few **labeled examples in the prompt** to help the model understand the task before classifying a new input.


**Example:**

**Task:** Classify the sentence:

> "The government passed a new climate bill."

Into one of: *Politics, Sports, Technology*


**Prompt to GPT:**
"Classify the following sentences into Politics, Sports, or Technology:

'The government passed a new climate bill.' => Politics

'PSG wins the European Championship.' => Sports

'OpenAI releases new artificial intelligence model.' => Technology

'Donald Trump announces new tariffs against China' =>"

**GPT's answer:** Politics

### Prompt & Predictions

In [None]:
# Let's build the prompt, with some examples
# We do so by defining a function, just as we did above
# The function takes as argument the text that we want to classify and the few-shot examples

few_shot_examples = [
        ("Biden signs executive order on student debt relief", "POLITICS"),
        ("Amazon launches new AI-powered Alexa features", "OTHER"),
        ("Congress debates military spending bill", "POLITICS"),
        ("PSG wins much-anticipated Champions League", "OTHER")
        ]

def get_prompt_few_shots(text: str, examples: list[tuple] = few_shot_examples):
  examples = "\n".join([f"Classify this headline:\n{headline}\nLabel: {label}\n\n" for headline, label in examples])
  return [{"role":"system",
           "content":"You are a strict news classifier. You must respond with one word only — either 'POLITICS' or 'OTHER'." +
           "\n Do not explain. Do not output anything else."

           },
           {"role":"user","content": f"{examples}\nClassify this headline:\n\"{text}\"\nLabel:"}
  ]

In [None]:
get_prompt_few_shots("This is a test")

[{'role': 'system',
  'content': "You are a strict news classifier. You must respond with one word only — either 'POLITICS' or 'OTHER'.\n Do not explain. Do not output anything else."},
 {'role': 'user',
  'content': 'Classify this headline:\nBiden signs executive order on student debt relief\nLabel: POLITICS\n\n\nClassify this headline:\nAmazon launches new AI-powered Alexa features\nLabel: OTHER\n\n\nClassify this headline:\nCongress debates military spending bill\nLabel: POLITICS\n\n\nClassify this headline:\nPSG wins much-anticipated Champions League\nLabel: OTHER\n\n\nClassify this headline:\n"This is a test"\nLabel:'}]

Let's use the same function as before, to obtain the predictions

In [None]:
r_llama33 = do_predictions(get_prompt_few_shots, #prompt used: few-shot version
                           df["headline"], #texts to classify
                           "meta-llama/llama-3.3-70b-instruct" #model used
                           )
r_gpt4o = do_predictions(get_prompt_few_shots,
                         df["headline"],
                         "openai/gpt-4o-mini"
                         )

df["fs_llama3.3"] = [i.choices[0].message.content for i in r_llama33] #store predictions in new column
df["fs_gpt4omini"] = [i.choices[0].message.content for i in r_gpt4o] #store predictions in new column

In [None]:
df

Unnamed: 0,headline,gold_standard,gold_standard_binary,llama3.3,gpt4omini,fs_llama3.3,fs_gpt4omini
0,Gold-Winning Canadian Snowboarder Cops To Erro...,SPORTS,OTHER,OTHER,OTHER,OTHER,OTHER
1,Breaking: Israelites in Sinai Suddenly Achieve...,RELIGION,OTHER,POLITICS,OTHER,POLITICS,OTHER
2,Why Air Travel Still Sucks,TECH,OTHER,OTHER,OTHER,OTHER,OTHER
3,7 Fashion And Beauty New Year's Resolutions To...,STYLE,OTHER,OTHER,OTHER,OTHER,OTHER
4,Panthers Owner To Treat Entire Staff To Free T...,SPORTS,OTHER,OTHER,OTHER,OTHER,OTHER
5,Tim Kaine Was Not The Governor of New Jersey,POLITICS,POLITICS,POLITICS,OTHER,POLITICS,POLITICS
6,Donald Trump's Promise Of 'Insurance For Every...,POLITICS,POLITICS,POLITICS,POLITICS,POLITICS,POLITICS
7,"Yellowstone Floods Wipe Out Roads, Bridges, St...",ENVIRONMENT,OTHER,OTHER,OTHER,OTHER,OTHER
8,Nixon Thought LBJ Tapped His Campaign Plane in...,POLITICS,POLITICS,POLITICS,OTHER,POLITICS,POLITICS
9,U.S. Lawmakers Join Demand For Puerto Rico Gov...,POLITICS,POLITICS,POLITICS,POLITICS,POLITICS,POLITICS


### Evaluation
We now compare the results with the zero-shot classification.

In [None]:
print(classification_report(df["gold_standard_binary"], df["fs_llama3.3"], digits=3))

              precision    recall  f1-score   support

       OTHER      1.000     0.833     0.909         6
    POLITICS      0.800     1.000     0.889         4

    accuracy                          0.900        10
   macro avg      0.900     0.917     0.899        10
weighted avg      0.920     0.900     0.901        10



In [None]:
print(classification_report(df["gold_standard_binary"], df["fs_gpt4omini"], digits=3))

              precision    recall  f1-score   support

       OTHER      1.000     1.000     1.000         6
    POLITICS      1.000     1.000     1.000         4

    accuracy                          1.000        10
   macro avg      1.000     1.000     1.000        10
weighted avg      1.000     1.000     1.000        10



# Model without API

Without going into the details of the application, let us have a look at the code we would need to do a text classification task with a model loaded locally, without relying on an API.

Here, we need to deal with questions related to the computational cost of the model: how much memory does it occupy, does it exist in a quantized version, how long will inference take, will your RAM be enough, etc.

Here is the model we want to use: https://huggingface.co/unsloth/Llama-3.2-3B-Instruct

- It's big: it takes time to download (about 6.4 GB)
- You will need to select a T4 GPU in Colab (change the runtime type)
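A back-of-the-envelope way to answer the memory question is to multiply the number of parameters by the bytes used per parameter. The sketch below assumes about 3.2 billion parameters for this model (an approximation) and counts the weights only; inference needs additional memory on top of that. The function name is ours.

```python
def estimate_weight_memory_gb(n_parameters, bytes_per_parameter):
    """Approximate memory needed to hold the model weights, in gigabytes."""
    return n_parameters * bytes_per_parameter / 1e9

n_params = 3.2e9  # approximate parameter count (assumption)
precisions = [("float32", 4), ("float16 / bfloat16", 2), ("int8", 1), ("int4", 0.5)]
for name, n_bytes in precisions:
    print(f"{name}: about {estimate_weight_memory_gb(n_params, n_bytes):.1f} GB")
```

The 16-bit estimate is in line with the size of the files to download, and the lower-precision rows show why quantized versions are attractive on limited hardware.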

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model_id = "unsloth/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

# Create a prompt
prompt = (
    "<|system|>\n"
    "Your task is to classify the sentiment of the text.\n"
    "<|user|>\n"
    "Here a text : Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges. "
    "Is it positive or negative or neutral. Answer only one of those options.\n"
    "<|assistant|>\n"
)

# Generate inference using pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=20,
    temperature=0.7
)

# Run inference
output = text_generator(prompt)
print(output[0]['generated_text'])


Device set to use cuda:0


<|system|>
Your task is to classify the sentiment of the text.
<|user|>
Here a text : Gold-Winning Canadian Snowboarder Cops To Error That Wasn't Spotted By Judges. Is it positive or negative or neutral. Answer only one of those options.
<|assistant|>
Neutral.


Let's try to use it on the news headlines dataset:

In [None]:
def get_prompt_llama32(text):
  return f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful and accurate news headline classifier. Your job is to classify news headlines as either 'POLITICS' or 'OTHER'. Only respond with exactly one of those two labels.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Classify this headline:\n\"{text}\"\nLabel:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

results = []
for i,j in headlines["headline"][0:10].items():
  try:
    print(f"Request element {i}")
    completion = text_generator(get_prompt_llama32(j))
    results.append(completion)
  except Exception as e:
    print(e)
    results.append(None)

Request element 0
Request element 1
Request element 2
Request element 3
Request element 4
Request element 5
Request element 6
Request element 7
Request element 8
Request element 9


In [None]:
results[0]

[{'generated_text': '\n<|begin_of_text|>\n<|start_header_id|>system<|end_header_id|>\nYou are a helpful and accurate news headline classifier. Your job is to classify news headlines as either \'POLITICS\' or \'OTHER\'. Only respond with exactly one of those two labels.\n<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\nf"Classify this headline:\n"Gold-Winning Canadian Snowboarder Cops To Error That Wasn\'t Spotted By Judges"\nLabel:\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nOTHER'}]

How would you run it on the complete dataset and clean the model output?
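As a hint for the cleaning step: the pipeline output shown above echoes the prompt before the generated answer. A minimal sketch of one way to keep only the answer, assuming the label is the first word the model generates (the helper name `extract_generated_label` is hypothetical):

```python
def extract_generated_label(prompt, generated_text):
    """Drop the echoed prompt and keep the first word that follows it."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    words = completion.strip().split()
    # Remove surrounding punctuation such as a trailing period or quotes
    return words[0].strip(".,'\"").upper() if words else None

demo_prompt = "Classify this headline:\n\"Why Air Travel Still Sucks\"\nLabel:"
print(extract_generated_label(demo_prompt, demo_prompt + "\nOTHER"))
```

The text-generation pipeline in `transformers` also accepts `return_full_text=False`, which should return only the newly generated text and make the prefix removal unnecessary.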

**Tips to understand how to format the prompt**:

> Check the **Hugging Face Model Card**: most open-source models on Hugging Face explain their expected prompt format.
>
> Look for:
- Sections like "Prompt Format", "Usage", "How to prompt this model"
- Examples of input/output pairs
- Tokenizer notes

> Read `tokenizer_config.json` or `generation_config.json`.
>
> If you're using `transformers`, you can inspect files in the model folder:
- Look for tokens like `<|user|>`, `<|assistant|>`, `<|system|>`, `bos_token`, etc. These help you infer how the prompt should be wrapped.
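To make the wrapping concrete, here is a sketch of how a list of chat messages could be turned into a Llama 3 style prompt string, using the same special tokens as `get_prompt_llama32` above. The function name `format_llama3_prompt` is ours; in practice, the tokenizer's `apply_chat_template` method in `transformers` is meant to do this for you, based on the template shipped with the model.

```python
def format_llama3_prompt(messages):
    """Assemble a Llama 3 style prompt string from a list of chat messages."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open the assistant turn so that the model generates the answer next
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

demo_messages = [
    {"role": "system", "content": "Only respond with 'POLITICS' or 'OTHER'."},
    {"role": "user", "content": 'Classify this headline:\n"This is a test"\nLabel:'},
]
print(format_llama3_prompt(demo_messages))
```

Other model families use different special tokens, so always check the model card before reusing this format.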


> Resources:
> - https://github.com/dair-ai/Prompt-Engineering-Guide
> - https://github.com/f/awesome-chatgpt-prompts