# Zero-shot text classification

In this lab, we are going to perform zero-shot text classification using an NLI (natural language inference) model.

## Importing the dependencies

First, we are going to import all the dependencies that we will need for this lab. If you cannot run the following code cell, do not forget to [create an environment](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/), to install the dependencies inside of it (using the command `pip install -r requirements.txt`) and to use it as your Jupyter kernel.

In [29]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['HF_HOME'] = os.getcwd() + "/cache/"
import ssl

import torch
import numpy as np
import pandas as pd
import nltk

from transformers import pipeline
from datasets import load_dataset
from nltk import word_tokenize

In [None]:
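# Work around SSL certificate errors that can block the NLTK download on some machines.
# Note that this disables HTTPS certificate verification for the default context.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download the tokenizer data required by word_tokenize.
nltk.download("punkt_tab")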

## Identifying the best device to run the model

Since we are going to perform a computing-intensive task, we must identify the most efficient device available to perform it. We do so using PyTorch, which is the back-end that we will use in this lab. We prioritize NVIDIA GPUs with CUDA installed, then Apple Silicon GPUs, and finally CPUs if none of the above is found.

If you need help installing the relevant version of PyTorch: https://pytorch.org/get-started/locally/

If you have an NVIDIA GPU but you don't know whether CUDA is installed, run the following command in a terminal:

```bash
nvcc --version
```

If you have it installed, you should see the CUDA version installed on your computer. Otherwise, you should install a PyTorch-compatible version (as listed [here](https://pytorch.org/get-started/locally/), row "Stable CUDA").

In [None]:
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device('cpu')

print(device)

Now let's download and prepare the language model that we will use! This model is a multilingual BGE model that has been fine-tuned on text labelling and NLI datasets.

Note that here, we will not load the model directly but use a `pipeline`, which is an interface between you and your model to help you perform a specific task. Here, we use a pipeline designed to facilitate zero-shot classification.

In [3]:
model_name = "MoritzLaurer/bge-m3-zeroshot-v2.0"
zeroshot_classifier = pipeline("zero-shot-classification", model=model_name, device=device)

## First experiment: Sentiment analysis

As a first test, we are going to try the model on basic sentiment analysis.

First, let's choose our label candidates and the template that we will use to transform them into NLI hypotheses.

In [None]:
label_candidates = ["joy", "anger", "sadness", "surprise", "love", "fear"]
hypothesis_template = "This text expresses {}."

In our template, the `{}` placeholder is replaced with each label in turn to produce one NLI hypothesis per label. The model then evaluates whether the text, used as the premise, entails each hypothesis. The entailment score tells us how relevant the corresponding label is for that text.
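
As a quick illustration, the hypotheses can be built by hand with Python's `str.format`. This is only a sketch of the substitution step and does not call the model. The `hypotheses` variable is introduced here for illustration.

```python
# Build one NLI hypothesis per candidate label by filling in the template.
label_candidates = ["joy", "anger", "sadness", "surprise", "love", "fear"]
hypothesis_template = "This text expresses {}."

hypotheses = [hypothesis_template.format(label) for label in label_candidates]

for hypothesis in hypotheses:
    print(hypothesis)
```

Each of these sentences is then paired with the premise and scored for entailment.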

In [None]:
premise = "Dauphine's canteen is remarkably better than the one I had at my previous job!"

In [None]:
output = zeroshot_classifier(premise, label_candidates, hypothesis_template=hypothesis_template, multi_label=True)

print("Classification results:")
for label, score in zip(output["labels"], output["scores"]):
    print(f"\tFor label '{label}': {score:.2f}")

Note that the scores do not sum to 1. That is because, with `multi_label=True`, the score for each label is computed independently of the others, which allows us to perform multi-label classification.
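
To make this concrete, here is a minimal NumPy sketch with made-up logits (they are not actual model outputs). It assumes that, with `multi_label=True`, a softmax is applied to the contradiction and entailment logits of each label separately, whereas with `multi_label=False` the softmax is applied to the entailment logits across all labels. The `softmax` helper and the variable names are illustrative only.

```python
import numpy as np

# Made-up NLI logits for three labels (assumption: not real model outputs).
# Columns: [contradiction, entailment]
toy_logits = np.array([
    [-2.0, 3.0],
    [1.0, 0.5],
    [-1.0, 2.0],
])

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Multi-label setting: each label is scored on its own
# (entailment vs. contradiction), so the scores need not sum to 1.
independent_scores = softmax(toy_logits, axis=1)[:, 1]

# Single-label setting: the entailment logits compete across labels,
# so the scores sum to 1.
competing_scores = softmax(toy_logits[:, 1], axis=0)

print("Independent scores:", independent_scores, "sum =", independent_scores.sum())
print("Competing scores:  ", competing_scores, "sum =", competing_scores.sum())
```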

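To decide which label(s) to apply to your text, there are two main strategies, each with its pros and cons:
- Select the label with the highest score: this is simple and always yields exactly one label, but it forces a choice even when no label really fits.
- Select all the labels with a score higher than a given threshold: this allows zero, one, or several labels per text, but the threshold has to be chosen and tuned.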
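
Both strategies can be sketched in a few lines of Python. The scores below are made up and simply mimic the shape of the pipeline output. The names `toy_output` and `toy_threshold` are hypothetical, and 0.5 is an arbitrary cut-off.

```python
# Made-up scores shaped like the pipeline output (not real model results).
toy_output = {
    "labels": ["joy", "surprise", "love", "sadness"],
    "scores": [0.91, 0.64, 0.12, 0.03],
}
toy_threshold = 0.5  # hypothetical cut-off, to be tuned for a real task

# Strategy 1: keep only the label with the highest score.
best_idx = max(range(len(toy_output["scores"])), key=lambda i: toy_output["scores"][i])
top_label = toy_output["labels"][best_idx]

# Strategy 2: keep every label whose score exceeds the threshold.
selected_labels = [
    label
    for label, score in zip(toy_output["labels"], toy_output["scores"])
    if score > toy_threshold
]

print("Top label:", top_label)
print("Labels above threshold:", selected_labels)
```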

## Second experiment: Text labeling at the scale of a dataset

Now that we have performed our first zero-shot classification, we may want to apply the same approach to a larger dataset. Here, we are going to test our model on a dataset of Reddit comments on climate change.

In [None]:
dst = load_dataset("cathw/reddit_climate_comment", split='train', revision="refs/convert/parquet")
dst

In [None]:
dst[0]

## Dataset cleaning process

As a first attempt to clean our dataset, we are going to keep only the comments containing between 20 and 200 words (or more precisely, tokens).

In [16]:
def get_token_count(batch):
    token_counts = []
    for comment in batch["CommentBody"]:
        if comment:
            token_counts.append(len(word_tokenize(comment, language='english')))
        else:
            token_counts.append(0)
    return {"CommentNbTokens": token_counts}

In [17]:
dst = dst.map(get_token_count, batched=True, num_proc=4)

In [None]:
dst = dst.filter(lambda row: 20 <= row["CommentNbTokens"] <= 200, num_proc=4)
dst

As you may guess, 100k+ rows may be a bit much for our little experiment. So, to speed up the process, we are going to select 1000 random rows from the dataset.

In [None]:
dst = dst.shuffle(seed=1234).select(range(1000))
dst

In [None]:
dst[0]

## Classification of the Reddit posts

Now that we have a clean and small dataset, we are going to classify the comments according to a series of labels that we will define. I have created a few labels as a starting point, including a control label `environment`, which should apply to most of the texts in our dataset, but do not hesitate to add more.

In [None]:
labels = [
    "greenhouse gases",
    "recycling",
    "air pollution",
    "water pollution",
    "individual changes",
    "systemic changes",
    "decarbonization",
    "sustainability",
    "ecologic transition",
    "environment",
]
hypothesis_template = "This text is about {}"

In [22]:
outputs = zeroshot_classifier(dst["CommentBody"], labels, hypothesis_template=hypothesis_template, multi_label=True, batch_size=4)

In [None]:
idx = 123

print("Comment:")
print(dst["CommentBody"][idx], end="\n\n")
print("Classification results:")
for label, score in zip(outputs[idx]["labels"], outputs[idx]["scores"]):
    print(f"\tFor label '{label}': {score:.2f}")

## Performing a hard labeling

Now that we have our entailment scores, we want to transform them into a list of labels for each text. Note that this step is not mandatory: for certain tasks, keeping the scores may be the better decision.

In [24]:
threshold = 0.5

In [None]:
comments_labeled = {
    "comment": dst["CommentBody"]
}
# For each label, flag the comments whose score exceeds the threshold.
for label in labels:
    label_relevance_list = []
    #label_score_list = []
    for i in range(len(outputs)):
        # The pipeline returns labels sorted by score, so we look up the label's position.
        label_idx = outputs[i]["labels"].index(label)
        label_is_relevant = outputs[i]["scores"][label_idx] > threshold
        label_relevance_list.append(label_is_relevant)
        #label_score_list.append(outputs[i]["scores"][label_idx])

    comments_labeled[label] = label_relevance_list
    #comments_labeled[f"{label}_score"] = label_score_list

comments_labeled_df = pd.DataFrame(comments_labeled)
comments_labeled_df

In [None]:
print("Aggregated results:")
for label in labels:
    print(f"\tNumber of comments with label '{label}': {comments_labeled_df[label].sum()}")
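
If you would rather keep the scores than commit to a single threshold, one option is to store them in a DataFrame and derive the hard labels afterwards. The following is a minimal sketch on made-up outputs that mimic the structure returned by the pipeline. The names `toy_outputs`, `toy_labels` and `scores_df` are hypothetical, as are the two thresholds.

```python
import pandas as pd

# Made-up pipeline-style outputs for two comments (not real model scores).
toy_outputs = [
    {"sequence": "first comment", "labels": ["recycling", "environment"], "scores": [0.82, 0.77]},
    {"sequence": "second comment", "labels": ["environment", "recycling"], "scores": [0.95, 0.10]},
]
toy_labels = ["recycling", "environment"]

# One row per comment, one column per label, holding the raw score.
# Labels come back sorted by score, so we rebuild a label -> score mapping per row.
score_rows = [dict(zip(out["labels"], out["scores"])) for out in toy_outputs]
scores_df = pd.DataFrame(score_rows, columns=toy_labels)

# Hard labels can then be derived for any threshold without re-running the model.
hard_labels_loose = scores_df > 0.5
hard_labels_strict = scores_df > 0.8

print(scores_df)
print("Counts at 0.5:", hard_labels_loose.sum().to_dict())
print("Counts at 0.8:", hard_labels_strict.sum().to_dict())
```

Keeping the scores makes it cheap to compare several thresholds, since the expensive model inference only has to run once.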