# Zero-shot Classification Tutorial

**Read this first:** I will use a pre-trained multilingual model from HuggingFace to label Wikipedia articles (in German and French) as being about politics or not.

Initially, I will test the classifier on two small files (one in German and one in French) with 10 entries each, for which I already know the correct labels. Later, I will run the classifier on two files of 1000 article titles each to measure how long the labeling takes.

## 1. Setup and installation

I'll be using the following libraries, so I'll install them first. Colab already has them, so you can skip this step there.

In [None]:
!pip install transformers pandas tqdm



Import the needed libraries.

In [1]:
import pandas as pd
from transformers import pipeline
import torch
from tqdm.notebook import tqdm

## 2. Mount Google Drive

Our small CSV files with articles and labels need to be added to this project through a process known as "mounting a drive".

A dialog will show up for us to connect to the Google Drive of our account and provide permission for accessing the drive.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

DRIVE_PATH = '/content/drive/MyDrive/Classroom/CS 234 Fall 2025/Code'

german_input = f'{DRIVE_PATH}/german_articles.csv'
french_input = f'{DRIVE_PATH}/french_articles.csv'

Mounted at /content/drive


Let's see the content of each file:

In [4]:
germanDF = pd.read_csv(german_input,index_col=0)
germanDF.head()

Unnamed: 0,title,label
0,Olaf Scholz,P
1,Alternative für Deutschland (AfD),P
2,Bundestagswahl 2025,P
3,Deutsche Einheit,P
4,Wahl zum Europäischen Parlament 2024,P


The label "P" stands for "political" and the label "NP" for "non-political". Since the entries are ordered with P first and then NP, let's look at the tail of the French data:

In [5]:
frenchDF = pd.read_csv(french_input,index_col=0)
frenchDF.tail()

Unnamed: 0,title,label
5,Tour Eiffel,NP
6,Musée du Louvre,NP
7,TGV,NP
8,Cannes (Alpes-Maritimes),NP
9,Prix Goncourt,NP


For the articles in these two files we will perform zero-shot classification and then check the accuracy of the classifier using the provided labels.

## 3. Initializing the classifier from HuggingFace

I will use a multilingual HuggingFace model that can label text in several languages, including German and French.

**Note:** It will be useful for you to read the page of the model on the [HuggingFace repository](https://huggingface.co/joeddav/xlm-roberta-large-xnli).

In [6]:
MODEL_NAME = "joeddav/xlm-roberta-large-xnli"
DEVICE = 0 if torch.cuda.is_available() else -1

print(f"Loading model: {MODEL_NAME} on device: {'GPU' if DEVICE == 0 else 'CPU'}")

# pipeline is a function from HuggingFace's transformers library
classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device=DEVICE,
    batch_size=32
)

classifier

Loading model: joeddav/xlm-roberta-large-xnli on device: CPU


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/734 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

Device set to use cpu


<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x7ef46c71d3a0>

## 4. Labeling articles with the classifier

We will have to provide the labels for our classifier. They need to be meaningful in the respective language. (I used Gemini to translate "political" and "non-political" into these two languages.)

In [7]:
german_labels = ["Politik", "Nicht-Politik"]
french_labels = ["Politique", "Non-politique"]

The classifier expects a list of texts to label and the list of candidate labels.

In [8]:
resultsDE = classifier(
        germanDF['title'].to_list(),
        candidate_labels=german_labels,
        hypothesis_template= "Dieser Text handelt von {}.",
        multi_label=False
    )

resultsDE

[{'sequence': 'Olaf Scholz',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.6694154739379883, 0.3305845558643341]},
 {'sequence': 'Alternative für Deutschland (AfD)',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.9460229277610779, 0.05397707596421242]},
 {'sequence': 'Bundestagswahl 2025',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.996518611907959, 0.003481405321508646]},
 {'sequence': 'Deutsche Einheit',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.9872920513153076, 0.012707916088402271]},
 {'sequence': 'Wahl zum Europäischen Parlament 2024',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.9934453368186951, 0.006554633378982544]},
 {'sequence': 'Frank-Walter Steinmeier',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.9787360429763794, 0.021263903006911278]},
 {'sequence': 'Energiewende',
  'labels': ['Politik', 'Nicht-Politik'],
  'scores': [0.5406709313392639, 0.4593290686607361]},
 {'sequence': 'Rhein',
  'labels': ['

The classifier labeled all 10 titles correctly. One title, Energiewende (energy transition), is a tricky one: its probability was only 0.54, but it was still labeled "political", as it should be.
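
To make the role of `hypothesis_template` more concrete, here is a minimal sketch of what a zero-shot pipeline conceptually does with it. Each candidate label is substituted into the template to form one hypothesis per label, the model scores how strongly the title entails each hypothesis, and with `multi_label=False` those scores are normalized so that they sum to 1. The helper names `build_hypotheses` and `softmax` are hypothetical, and the logits are invented numbers for illustration, not model output.

```python
import math

def build_hypotheses(template, candidate_labels):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in candidate_labels]

def softmax(logits):
    """Normalize raw scores so that they are positive and sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtracting the max improves stability
    total = sum(exps)
    return [e / total for e in exps]

german_labels = ["Politik", "Nicht-Politik"]
hypotheses = build_hypotheses("Dieser Text handelt von {}.", german_labels)
print(hypotheses)

# Hypothetical entailment logits for one title (invented values, not model output)
example_logits = [2.0, 0.5]
scores = softmax(example_logits)
print(dict(zip(german_labels, scores)))
```

This also suggests why the wording of the template and of the labels matters: the model never sees "P" or "NP", only these full sentences.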

Let's now check the results for the French articles:

In [9]:
resultsFR = classifier(
        frenchDF['title'].to_list(),
        candidate_labels=french_labels,
        hypothesis_template = "Le sujet de cet article est {}.",
        multi_label=False
    )

resultsFR

[{'sequence': 'Élections législatives françaises de 2024',
  'labels': ['Politique', 'Non-politique'],
  'scores': [0.9954914450645447, 0.00450854143127799]},
 {'sequence': 'Emmanuel Macron',
  'labels': ['Politique', 'Non-politique'],
  'scores': [0.9862161874771118, 0.013783766888082027]},
 {'sequence': 'Marine Le Pen',
  'labels': ['Politique', 'Non-politique'],
  'scores': [0.9813570976257324, 0.018642913550138474]},
 {'sequence': 'Rassemblement National',
  'labels': ['Politique', 'Non-politique'],
  'scores': [0.9355434775352478, 0.06445653736591339]},
 {'sequence': 'Laïcité',
  'labels': ['Non-politique', 'Politique'],
  'scores': [0.9903226494789124, 0.009677370078861713]},
 {'sequence': 'Tour Eiffel',
  'labels': ['Non-politique', 'Politique'],
  'scores': [0.7463275790214539, 0.25367236137390137]},
 {'sequence': 'Musée du Louvre',
  'labels': ['Non-politique', 'Politique'],
  'scores': [0.7868750691413879, 0.21312493085861206]},
 {'sequence': 'TGV',
  'labels': ['Non-politiqu

In this case the classifier made one mistake: Laïcité (secularism). This is a constitutional concept (the separation of church and state), but it is also about not following a religion. Since the word appears here without any context, it can be interpreted both ways.
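
The raw output is verbose. In it, the `labels` list of each entry is sorted by descending score, so the first label is the prediction. A minimal sketch of pulling out just the title, the top label, and its score follows; `top_predictions` is a hypothetical helper, and the three sample entries are copied from the outputs above.

```python
def top_predictions(results):
    """Return (title, top_label, top_score) for each pipeline result."""
    return [(r["sequence"], r["labels"][0], r["scores"][0]) for r in results]

# Entries copied from the German and French outputs above
sample_results = [
    {"sequence": "Olaf Scholz",
     "labels": ["Politik", "Nicht-Politik"],
     "scores": [0.6694154739379883, 0.3305845558643341]},
    {"sequence": "Energiewende",
     "labels": ["Politik", "Nicht-Politik"],
     "scores": [0.5406709313392639, 0.4593290686607361]},
    {"sequence": "Laïcité",
     "labels": ["Non-politique", "Politique"],
     "scores": [0.9903226494789124, 0.009677370078861713]},
]

for title, label, score in top_predictions(sample_results):
    print(f"{title:<15} {label:<15} {score:.2f}")
```

Note that the Laïcité error is a confident one (0.99), while the correct Energiewende prediction is barely above 0.5.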

## 5. Evaluating the accuracy of the classification

Now that we have the labels we will compare them against the ground truth in the CSV files.

This function was written by Gemini and edited by EM.

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score

def evaluate_zero_shot(classification_results, ground_truth_csv_path, lang):
    """
    Calculates and prints classification metrics based on zero-shot results.

    Args:
        classification_results (list): The list of dictionaries from the HF pipeline output.
        ground_truth_csv_path (str): Path to the CSV file with the 'label' (P, NP) column.
        lang (str): Language code of the articles, either 'de' or 'fr'.
    """
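    # Map the language-specific candidate labels to the 'P'/'NP' codes used in the CSV
    if lang == 'de':
        prediction_mapping = {"Politik": 'P', "Nicht-Politik": 'NP'}
    elif lang == 'fr':
        prediction_mapping = {"Politique": 'P', "Non-politique": 'NP'}
    else:
        raise ValueError(f"Unsupported language: {lang!r} (expected 'de' or 'fr')")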

    # Get the top predicted label and map it to 'P' or 'NP'
    y_pred = [prediction_mapping[result['labels'][0]] for result in classification_results]

    # Read the ground truth labels from the CSV
    df_gt = pd.read_csv(ground_truth_csv_path)
    y_true = df_gt['label'].tolist()

    # Use 'P' (Political) as the positive label for binary metrics (recall, precision, F1)

    # Confusion Matrix (labels=[Positive, Negative])
    # By default, sklearn sorts labels alphabetically. We force the order [P, NP] for clarity.
    cm = confusion_matrix(y_true, y_pred, labels=['P', 'NP'])

    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)

    # Recall (Sensitivity/True Positive Rate)
    recall = recall_score(y_true, y_pred, pos_label='P')

    # Precision (Positive Predictive Value)
    precision = precision_score(y_true, y_pred, pos_label='P')

    # F1-Score (Harmonic mean of precision and recall)
    f1 = f1_score(y_true, y_pred, pos_label='P')

    # Print Results
    print("--- Classification Evaluation ---")
    print(f"Confusion Matrix (True Labels: ['P', 'NP']):\n{cm}")
    print("\n")
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"Recall (for 'P'): {recall:.4f}")
    print(f"Precision (for 'P'): {precision:.4f}")
    print(f"F1-Score (for 'P'): {f1:.4f}")
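
`evaluate_zero_shot` compares predictions and ground truth purely by position, so it assumes that the results list and the CSV rows are in the same order. A minimal sketch of a sanity check that could be run first is shown below. `check_alignment` is a hypothetical helper, not part of the original notebook, and it assumes the title column of the CSV has been read into a plain list.

```python
def check_alignment(classification_results, titles):
    """Return the positions where the classified text differs from the CSV title."""
    if len(classification_results) != len(titles):
        raise ValueError(
            f"Length mismatch: {len(classification_results)} results vs {len(titles)} titles"
        )
    return [
        i for i, (result, title) in enumerate(zip(classification_results, titles))
        if result["sequence"] != title
    ]

results_demo = [{"sequence": "Olaf Scholz"}, {"sequence": "Rhein"}]
print(check_alignment(results_demo, ["Olaf Scholz", "Rhein"]))  # no mismatches
print(check_alignment(results_demo, ["Rhein", "Olaf Scholz"]))  # both rows differ
```

In this notebook it could be called as `check_alignment(resultsDE, germanDF['title'].to_list())`; an empty list means the two sources line up.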

### German results

In [11]:
evaluate_zero_shot(resultsDE, german_input, 'de')

--- Classification Evaluation ---
Confusion Matrix (True Labels: ['P', 'NP']):
[[7 0]
 [0 3]]



Accuracy: 1.0000
Recall (for 'P'): 1.0000
Precision (for 'P'): 1.0000
F1-Score (for 'P'): 1.0000


### French results

In [12]:
evaluate_zero_shot(resultsFR, french_input, 'fr')

--- Classification Evaluation ---
Confusion Matrix (True Labels: ['P', 'NP']):
[[4 1]
 [0 5]]



Accuracy: 0.9000
Recall (for 'P'): 0.8000
Precision (for 'P'): 1.0000
F1-Score (for 'P'): 0.8889


## 6. Repeat classification with bigger files

I will now run the classifier on two larger files: the 1000 top articles in German and the 1000 top articles in French. For both files I used Gemini to provide "ground truth" labels. Gemini is a more capable model than the one we are using here, but its API is not free, so it is worth comparing this free model against the labels from Gemini.

### German articles classification

In [12]:
with open('top_1000_de.txt', encoding="utf8") as inf:
  german_titles = [line.strip() for line in inf.readlines()]

import time
start_time = time.time()
german_results = classifier(
        german_titles,
        candidate_labels=german_labels,
        hypothesis_template= "Dieser Text handelt von {}.",
        multi_label=False
    )
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

Time taken: 525.00 seconds


Let's save the results, so that we don't lose them.

**IMPORTANT:** Once the file is created, download it to your computer immediately. Files in the Colab session storage are temporary, so if you forget, you may have to run the classification again.

In [13]:
import json
with open("german_results.json", "w") as f:
  json.dump(german_results, f)
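
Because the titles contain characters such as `ü` and `é`, it may be worth writing the JSON with an explicit encoding and with `ensure_ascii=False`, so that the file stays human-readable. A minimal sketch under those assumptions is shown below. `save_results` and `load_results` are hypothetical helpers, and the temporary directory only stands in for a persistent location such as the mounted `DRIVE_PATH` folder.

```python
import json
import os
import tempfile

def save_results(results, path):
    """Write pipeline results as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

def load_results(path):
    """Read results written by save_results."""
    with open(path, encoding="utf8") as f:
        return json.load(f)

demo = [{"sequence": "Alternative für Deutschland (AfD)",
         "labels": ["Politik", "Nicht-Politik"],
         "scores": [0.9460229277610779, 0.05397707596421242]}]

# A temporary directory stands in for a persistent folder such as DRIVE_PATH
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "german_results.json")
    save_results(demo, demo_path)
    restored = load_results(demo_path)

print(restored == demo)
```

Writing directly to the mounted Drive folder would also remove the need to download the file by hand, assuming the drive is still mounted when the cell runs.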

### French articles classification

In [14]:
import time

with open('top_1000_fr.txt', encoding="utf8") as inf:
  french_titles = [line.strip() for line in inf.readlines()]

start_time = time.time()
french_results = classifier(
        french_titles,
        candidate_labels=french_labels,
        hypothesis_template="Le sujet de cet article est {}.",
        multi_label=False
    )
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

Time taken: 520.69 seconds
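
To turn the two timings into something easier to compare, we can compute the throughput. This is a small worked calculation using the numbers reported above (1000 titles per file; the loading cell earlier reported CPU). `throughput` is a hypothetical helper name.

```python
def throughput(n_items, seconds):
    """Return (items per second, seconds per item)."""
    return n_items / seconds, seconds / n_items

# Timings reported above for 1000 titles each
for lang, seconds in [("de", 525.00), ("fr", 520.69)]:
    per_second, per_item = throughput(1000, seconds)
    print(f"{lang}: {per_second:.2f} titles/s, {per_item:.3f} s per title")
```

Both runs come out at roughly half a second per title, or a little under two titles per second.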


Save the results:

In [16]:
import json
with open('french_results.json', 'w') as f:
  json.dump(french_results, f)

### Evaluate the classification

We will use the two files that were created with help from Gemini: top_1000_de_labels.csv and top_1000_fr_labels.csv.

#### German evaluation

In [22]:
evaluate_zero_shot(german_results, "top_1000_de_labels.csv", 'de')

--- Classification Evaluation ---
Confusion Matrix (True Labels: ['P', 'NP']):
[[144 130]
 [160 566]]



Accuracy: 0.7100
Recall (for 'P'): 0.5255
Precision (for 'P'): 0.4737
F1-Score (for 'P'): 0.4983


While the overall accuracy is decent (0.71 is well above the 0.50 expected from randomly guessing between the two labels), the results for our class of interest, P, are not that good: precision and recall for P are both close to 0.5.
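
To see where these numbers come from, the metrics can be recomputed by hand from the four cells of the German confusion matrix. In the matrix printed above, rows are the true labels and columns are the predicted labels, both in the order P, NP. `metrics_from_confusion` is a hypothetical helper written for this illustration.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Compute accuracy, recall, precision and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)        # share of true P articles that were found
    precision = tp / (tp + fp)     # share of predicted P articles that are truly P
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# German matrix from above: [[144, 130], [160, 566]]
german_metrics = metrics_from_confusion(tp=144, fn=130, fp=160, tn=566)
for name, value in german_metrics.items():
    print(f"{name}: {value:.4f}")
```

The values agree with the sklearn output: more than half of the titles predicted as P are actually NP (160 out of 304), which is what drags precision below 0.5.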

#### French evaluation

In [19]:
evaluate_zero_shot(french_results, "top_1000_fr_labels.csv", 'fr')

--- Classification Evaluation ---
Confusion Matrix (True Labels: ['P', 'NP']):
[[ 72  66]
 [163 699]]



Accuracy: 0.7710
Recall (for 'P'): 0.5217
Precision (for 'P'): 0.3064
F1-Score (for 'P'): 0.3861


We can see that the accuracy, 0.77, is even better than for the German titles, but the F1-score for the "P" class is worse. Our classifier is doing much better with the "NP" class than with the "P" class.
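
One way to put both accuracy figures in perspective is a majority-class baseline: the accuracy we would get by always predicting the more frequent true class, which is NP in both files. The sketch below derives it from the row totals of the two confusion matrices above. `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(confusion):
    """Accuracy obtained by always predicting the most frequent true class.

    `confusion` is a 2x2 list whose rows are the true classes (P, then NP).
    """
    class_totals = [sum(row) for row in confusion]
    return max(class_totals) / sum(class_totals)

german_cm = [[144, 130], [160, 566]]
french_cm = [[72, 66], [163, 699]]

print(f"German baseline: {majority_baseline_accuracy(german_cm):.3f} (classifier: 0.710)")
print(f"French baseline: {majority_baseline_accuracy(french_cm):.3f} (classifier: 0.771)")
```

The baseline is 0.726 for German and 0.862 for French, so in both cases the classifier's accuracy is below what always answering "NP" would achieve. The higher French accuracy mostly reflects the larger share of NP titles in that file.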

## 7. Analyze Errors

Let's now look more closely at the data for the German articles to understand where the errors might lie.

In [20]:
import json
with open('german_results.json', 'r') as f:
  german_results = json.load(f)

dfDE = pd.DataFrame(german_results)
dfDE.head()

Unnamed: 0,sequence,labels,scores
0,PornHub,"[Nicht-Politik, Politik]","[0.825002908706665, 0.17499709129333496]"
1,Liste der größten Auslegerbrücken,"[Nicht-Politik, Politik]","[0.8098713755607605, 0.1901286095380783]"
2,ChatGPT,"[Nicht-Politik, Politik]","[0.7913989424705505, 0.20860107243061066]"
3,Deutschland,"[Politik, Nicht-Politik]","[0.6653300523757935, 0.33466991782188416]"
4,ZDF,"[Politik, Nicht-Politik]","[0.6198923587799072, 0.3801076412200928]"


Because the JSON file contains both the labels and their corresponding scores, we can transform the dataframe a bit. This is where pandas skills are useful.

We will create two new columns. The first is **predicted_label**:

In [22]:
dfDE['predicted_label'] = dfDE['labels'].apply(lambda x: x[0])
dfDE.head()

Unnamed: 0,sequence,labels,scores,predicted_label
0,PornHub,"[Nicht-Politik, Politik]","[0.825002908706665, 0.17499709129333496]",Nicht-Politik
1,Liste der größten Auslegerbrücken,"[Nicht-Politik, Politik]","[0.8098713755607605, 0.1901286095380783]",Nicht-Politik
2,ChatGPT,"[Nicht-Politik, Politik]","[0.7913989424705505, 0.20860107243061066]",Nicht-Politik
3,Deutschland,"[Politik, Nicht-Politik]","[0.6653300523757935, 0.33466991782188416]",Politik
4,ZDF,"[Politik, Nicht-Politik]","[0.6198923587799072, 0.3801076412200928]",Politik


The second is **score**, the score of the predicted label:

In [23]:
dfDE['score'] = dfDE['scores'].apply(lambda x: x[0])
dfDE.head()

Unnamed: 0,sequence,labels,scores,predicted_label,score
0,PornHub,"[Nicht-Politik, Politik]","[0.825002908706665, 0.17499709129333496]",Nicht-Politik,0.825003
1,Liste der größten Auslegerbrücken,"[Nicht-Politik, Politik]","[0.8098713755607605, 0.1901286095380783]",Nicht-Politik,0.809871
2,ChatGPT,"[Nicht-Politik, Politik]","[0.7913989424705505, 0.20860107243061066]",Nicht-Politik,0.791399
3,Deutschland,"[Politik, Nicht-Politik]","[0.6653300523757935, 0.33466991782188416]",Politik,0.66533
4,ZDF,"[Politik, Nicht-Politik]","[0.6198923587799072, 0.3801076412200928]",Politik,0.619892


And we drop the original columns that we don't need any longer:

In [24]:
dfDE = dfDE.drop(columns=['labels', 'scores'])
dfDE.head()

Unnamed: 0,sequence,predicted_label,score
0,PornHub,Nicht-Politik,0.825003
1,Liste der größten Auslegerbrücken,Nicht-Politik,0.809871
2,ChatGPT,Nicht-Politik,0.791399
3,Deutschland,Politik,0.66533
4,ZDF,Politik,0.619892


Now we will read the ground truth from the CSV with the data labeled by Gemini:

In [25]:
dfGT = pd.read_csv("top_1000_de_labels.csv")
dfGT.head()

Unnamed: 0,article,label
0,PornHub,NP
1,Liste der größten Auslegerbrücken,NP
2,ChatGPT,NP
3,Deutschland,P
4,ZDF,NP


I'm renaming one column to make its meaning clearer:

In [27]:
dfGT.rename(columns={'label': 'ground_truth'}, inplace=True)
dfGT.head()

Unnamed: 0,article,ground_truth
0,PornHub,NP
1,Liste der größten Auslegerbrücken,NP
2,ChatGPT,NP
3,Deutschland,P
4,ZDF,NP


Then I will merge the two dataframes based on the article name:

In [28]:
df_combined = pd.merge(dfDE, dfGT, left_on='sequence', right_on='article')
df_combined.drop(columns=['sequence'], inplace=True)
df_combined.head()

Unnamed: 0,predicted_label,score,article,ground_truth
0,Nicht-Politik,0.825003,PornHub,NP
1,Nicht-Politik,0.809871,Liste der größten Auslegerbrücken,NP
2,Nicht-Politik,0.791399,ChatGPT,NP
3,Politik,0.66533,Deutschland,P
4,Politik,0.619892,ZDF,NP
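
Merging on the title works as long as every title appears exactly once in each table. The sketch below shows one way to guard against that assumption with the `validate` argument of `pd.merge`; the two small dataframes are toy stand-ins for `dfDE` and `dfGT`, with rows copied from the tables above.

```python
import pandas as pd

predictions = pd.DataFrame({
    "sequence": ["PornHub", "ChatGPT", "Deutschland"],
    "predicted_label": ["Nicht-Politik", "Nicht-Politik", "Politik"],
})
ground_truth = pd.DataFrame({
    "article": ["PornHub", "ChatGPT", "Deutschland"],
    "ground_truth": ["NP", "NP", "P"],
})

# validate="one_to_one" raises pandas.errors.MergeError if a title is duplicated on either side
merged = pd.merge(
    predictions, ground_truth,
    left_on="sequence", right_on="article",
    how="inner", validate="one_to_one",
)

# An inner merge drops titles that appear in only one table, so compare the row counts
print(len(merged), len(predictions))
```

If the two counts differ, some titles did not match exactly (for example because of stray whitespace), and the error tables below would be computed on fewer rows than expected.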


Now I'll generate a table for the false positives.

In [32]:
false_positives_df = df_combined[(df_combined['predicted_label'] == 'Politik') & (df_combined['ground_truth'] == 'NP')]
print("False Positives DataFrame:")
false_positives_df.head(10)

False Positives DataFrame:


Unnamed: 0,predicted_label,score,article,ground_truth
4,Politik,0.619892,ZDF,NP
13,Politik,0.586764,TV Mainfranken,NP
22,Politik,0.539481,Fußball-Europameisterschaft 2024,NP
27,Politik,0.767935,Elisabeth von Österreich-Ungarn,NP
28,Politik,0.898622,Julian Nagelsmann,NP
30,Politik,0.963693,Toni Kroos,NP
48,Politik,0.801667,Helene Fischer,NP
52,Politik,0.77297,Till Lindemann,NP
54,Politik,0.805091,Michael Schumacher,NP
56,Politik,0.526945,Vera F. Birkenbihl,NP


In [31]:
false_positives_df.shape[0]

160

**False positives:** 160 articles that are **not** about politics (according to Gemini) were falsely labeled as being about politics.

Of the 10 titles shown, 8 are clearly non-political. The other two, ZDF (one of Germany's main public TV broadcasters) and Elisabeth von Österreich-Ungarn (Empress Sissi), can be considered political in certain contexts.

In [33]:
false_negatives_df = df_combined[(df_combined['predicted_label'] == 'Nicht-Politik') & (df_combined['ground_truth'] == 'P')]
print("False Negatives DataFrame:")
false_negatives_df.head(10)

False Negatives DataFrame:


Unnamed: 0,predicted_label,score,article,ground_truth
12,Nicht-Politik,0.547244,Ricarda Lang,P
15,Nicht-Politik,0.524983,Alice Weidel,P
18,Nicht-Politik,0.568354,Annalena Baerbock,P
23,Nicht-Politik,0.571846,Robert Habeck,P
66,Nicht-Politik,0.739958,Gruppe Wagner,P
74,Nicht-Politik,0.501062,Dienstgrade der Bundeswehr,P
78,Nicht-Politik,0.559827,Georgien,P
101,Nicht-Politik,0.553183,Nancy Faeser,P
106,Nicht-Politik,0.511183,Türkei,P
114,Nicht-Politik,0.723911,Taurus (Marschflugkörper),P


In [34]:
false_negatives_df.shape[0]

130

**False Negatives:** There are 130 articles that are about politics, but were wrongly labeled as being "NP".

Looking at the list, we see politician names, some countries (all countries in the Gemini list were labeled "Political") and military terms such as Taurus (Marschflugkörper) or Dienstgrade der Bundeswehr.
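
Many of the false negatives shown above have a top score only slightly above 0.5, which suggests the classifier was close to undecided. A minimal sketch of flagging such low-confidence predictions for manual review follows. The threshold of 0.6 is an arbitrary assumption chosen for illustration, `low_confidence` is a hypothetical helper, and the rows are copied from the false-negatives table.

```python
# (title, top score) pairs copied from the false-negatives table above
false_negative_scores = [
    ("Ricarda Lang", 0.547244),
    ("Alice Weidel", 0.524983),
    ("Annalena Baerbock", 0.568354),
    ("Robert Habeck", 0.571846),
    ("Gruppe Wagner", 0.739958),
    ("Dienstgrade der Bundeswehr", 0.501062),
    ("Georgien", 0.559827),
    ("Nancy Faeser", 0.553183),
    ("Türkei", 0.511183),
    ("Taurus (Marschflugkörper)", 0.723911),
]

def low_confidence(rows, threshold=0.6):
    """Return the titles whose top score falls below the threshold."""
    return [title for title, score in rows if score < threshold]

uncertain = low_confidence(false_negative_scores)
print(f"{len(uncertain)} of {len(false_negative_scores)} shown false negatives score below 0.6")
```

Only ten rows are inspected here, so this says nothing definitive about all 130 false negatives; the same filter could be applied to the full `false_negatives_df` to check whether the pattern holds.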

These results show that it is challenging for the classifier to label articles from their titles alone, and that we would have to think of alternatives, especially for article titles in languages other than English.