# Zero-Shot Classification

In this notebook, we use the `zero-shot-classification` pipeline from the Hugging Face Transformers library to predict the intent of each sentence in a dataset. We then compare the predicted intents with the actual labels and print the evaluation metrics.

The goal is an NLP-based framework that parses an input question and assigns its intent to one of the following question types.

Question Types:
1. Why is action A not used in the plan, rather than being used?
2. Why is action A used in the plan, rather than not being used?
3. Why is action A used in state S, rather than action B?
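
For reference in code, the three question types can be written down as a small lookup table. This is only a sketch: the name `QUESTION_TYPES` is hypothetical, and the numbering 1 to 3 simply follows the list above.

```python
# Hypothetical lookup table: intent number -> question type, following the list above.
QUESTION_TYPES = {
    1: "Why is action A not used in the plan, rather than being used?",
    2: "Why is action A used in the plan, rather than not being used?",
    3: "Why is action A used in state S, rather than action B?",
}

for intent, question in QUESTION_TYPES.items():
    print(f"{intent}: {question}")
```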

## Single Text Prediction

In [1]:
from transformers import pipeline
from pprint import pprint

In [2]:
# Load a pre-trained zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [3]:
# Define the query and candidate labels
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used in state S, rather than action B?"]
query = "What made 'push box to the left' more suitable than 'move to the right'?"

# Perform zero-shot classification
result = classifier(query, candidate_labels)
pprint(result, width=100)

{'labels': ['Why is action A used in the plan?',
            'Why is action A not used in the plan?',
            'Why is action A used in state S, rather than action B?'],
 'scores': [0.3973885476589203, 0.3694774806499481, 0.2331339716911316],
 'sequence': "What made 'push box to the left' more suitable than 'move to the right'?"}
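
It may help to see what the underlying NLI model is actually asked. As far as we understand the pipeline, each candidate label is inserted into a hypothesis template (by default `"This example is {}."`), and the query is scored against each resulting hypothesis for entailment. The sketch below only reproduces that string formatting; `build_hypotheses` is a hypothetical helper, not part of the library.

```python
# Assumed default template of the zero-shot-classification pipeline.
DEFAULT_TEMPLATE = "This example is {}."

def build_hypotheses(candidate_labels, template=DEFAULT_TEMPLATE):
    """Return the hypothesis sentence built for each candidate label."""
    return [template.format(label) for label in candidate_labels]

candidate_labels = [
    "Why is action A not used in the plan?",
    "Why is action A used in the plan?",
    "Why is action A used in state S, rather than action B?",
]

for hypothesis in build_hypotheses(candidate_labels):
    print(hypothesis)
```

Because the labels here are full questions, the hypotheses read awkwardly. The pipeline call accepts a `hypothesis_template` argument, so a different template could be tried; whether that helps on this dataset is untested.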


In [4]:
# Define the query and candidate labels
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used rather than action B?"]
query = "What made 'push box to the left' more suitable than 'move to the right'?"

# Perform zero-shot classification
result = classifier(query, candidate_labels)
pprint(result, width=100)

{'labels': ['Why is action A used rather than action B?',
            'Why is action A used in the plan?',
            'Why is action A not used in the plan?'],
 'scores': [0.7602096199989319, 0.12425892055034637, 0.11553144454956055],
 'sequence': "What made 'push box to the left' more suitable than 'move to the right'?"}


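The shorter label "Why is action A used rather than action B?" works better as an intent category than "Why is action A used in state S, rather than action B?": for this query it moves the correct intent from last place (score 0.23) to first place (score 0.76).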
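
One way to quantify the difference is the margin between the top two scores. The sketch below uses the scores printed above; `top_margin` is a hypothetical helper.

```python
def top_margin(result):
    """Return the top label and the gap between the two highest scores.

    Assumes `result` has the shape returned by the pipeline above:
    'labels' and 'scores' sorted from highest to lowest score.
    """
    scores = result["scores"]
    return result["labels"][0], scores[0] - scores[1]

# Scores copied from the two outputs above.
with_state = {
    "labels": ["Why is action A used in the plan?",
               "Why is action A not used in the plan?",
               "Why is action A used in state S, rather than action B?"],
    "scores": [0.3973885476589203, 0.3694774806499481, 0.2331339716911316],
}
without_state = {
    "labels": ["Why is action A used rather than action B?",
               "Why is action A used in the plan?",
               "Why is action A not used in the plan?"],
    "scores": [0.7602096199989319, 0.12425892055034637, 0.11553144454956055],
}

for name, result in [("with state S", with_state), ("without state S", without_state)]:
    label, margin = top_margin(result)
    print(f"{name}: top = {label!r}, margin = {margin:.3f}")
```

With the first label set the top prediction is wrong and leads by only about 0.03; with the second it is correct and leads by about 0.64.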


## Dataset Prediction using Base Model
Predict the intent of each sentence in the `text` column of the dataset CSV, compare the predictions with the actual labels, and print the evaluation metrics.

This section uses the base model, without any fine-tuning.

In [16]:
import pandas as pd
from transformers import pipeline
from sklearn.metrics import classification_report, accuracy_score
import swifter
import wandb
import os

In [17]:
# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"] = "zero-shot-classification"

# save your trained model checkpoint to wandb
os.environ["WANDB_LOG_MODEL"] = "true"

# turn off watch to log faster
os.environ["WANDB_WATCH"] = "false"

In [18]:
# Load the CSV file into a DataFrame
df = pd.read_csv('./data/combined_dataset.csv')
print(f"Number of rows in the dataset: {df.shape[0]}")
df.head()

Number of rows in the dataset: 346


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Why was action A excluded from the plan? | 1 |
| 1 | What were the reasons for omitting action A fr... | 1 |
| 2 | Can you explain why action A was not considere... | 1 |
| 3 | Why didn't the plan include action A? | 1 |
| 4 | What is the rationale for not using action A i... | 1 |


In [19]:
# Define the candidate labels and their corresponding intent numbers
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used rather than action B?"]

# Note: despite its name, this dict maps each candidate label string to its intent number (1-3)
intent_to_label = {label: intent for intent, label in enumerate(candidate_labels, start=1)}
intent_to_label

{'Why is action A not used in the plan?': 1,
 'Why is action A used in the plan?': 2,
 'Why is action A used rather than action B?': 3}

In [20]:
# Load a pre-trained zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Function to get predictions for each text
def get_prediction(text):
    result = classifier(text, candidate_labels)
    predicted_label = intent_to_label[result['labels'][0]]
    return predicted_label

In [21]:
# Apply the function to the text column
df['predicted_label'] = df['text'].swifter.apply(get_prediction)

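
Calling the classifier once per row is simple but can be slow. The pipeline also accepts a list of texts and returns one result per text, so predictions can be collected in chunks. The sketch below is written against any callable with that behaviour; `predict_intents`, `chunk_size`, and the stand-in classifier are hypothetical and only there so the example runs without downloading a model.

```python
def predict_intents(texts, classify, candidate_labels, label_to_intent, chunk_size=32):
    """Classify texts in chunks and return the intent number of each top label.

    `classify(list_of_texts, candidate_labels)` is assumed to return one dict per
    text with a 'labels' list sorted by descending score, like the pipeline above.
    """
    predictions = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results = classify(chunk, candidate_labels)
        # Some versions may return a bare dict when given a single text.
        if isinstance(results, dict):
            results = [results]
        for result in results:
            predictions.append(label_to_intent[result["labels"][0]])
    return predictions

# Stand-in classifier for illustration only: always ranks the labels as given.
def fake_classify(chunk, candidate_labels):
    return [{"sequence": text, "labels": list(candidate_labels)} for text in chunk]

labels = ["Why is action A not used in the plan?",
          "Why is action A used in the plan?",
          "Why is action A used rather than action B?"]
label_to_intent = {label: i for i, label in enumerate(labels, start=1)}

print(predict_intents(["q1", "q2", "q3"], fake_classify, labels, label_to_intent, chunk_size=2))
```

With the real pipeline this could be called as `predict_intents(df['text'].tolist(), classifier, candidate_labels, intent_to_label)`.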

In [22]:
# Compare predicted labels with actual labels
y_true = df['label']
y_pred = df['predicted_label']

# Print the classification report
print(classification_report(y_true, y_pred))
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")

              precision    recall  f1-score   support

           1       0.74      0.97      0.84       116
           2       0.98      0.94      0.96       115
           3       0.90      0.65      0.76       115

    accuracy                           0.86       346
   macro avg       0.87      0.86      0.85       346
weighted avg       0.87      0.86      0.85       346

Accuracy: 0.86
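
The report shows that class 3 has the lowest recall (0.65) and class 1 the lowest precision (0.74), which suggests that many class 3 questions are being predicted as class 1. Counting (actual, predicted) pairs makes that kind of pattern explicit. The sketch below uses made-up toy labels, not the dataset; `sklearn.metrics.confusion_matrix` provides the same information as a matrix.

```python
from collections import Counter

def confusion_pairs(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Toy labels for illustration only (not taken from the dataset).
y_true_toy = [1, 1, 2, 2, 3, 3, 3]
y_pred_toy = [1, 1, 2, 3, 1, 1, 3]

pairs = confusion_pairs(y_true_toy, y_pred_toy)
for (actual, predicted), count in sorted(pairs.items()):
    marker = "" if actual == predicted else "  <- error"
    print(f"actual={actual} predicted={predicted}: {count}{marker}")
```

On the real data the call would be `confusion_pairs(df['label'], df['predicted_label'])`.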


In [23]:
# Display the rows in which the predictions didn't match the label
incorrect_predictions = df[df['label'] != df['predicted_label']]
print(f"{incorrect_predictions.shape[0]} incorrect predictions out of {df.shape[0]} test samples.")
incorrect_predictions.head()

50 incorrect predictions out of 346 test samples.


| Unnamed: 0 | text | label | predicted_label |
|---|---|---|---|
| 46 | The player doesn't push any boxes. Shouldn't p... | 1 | 2 |
| 96 | What's the justification for not using 'move r... | 1 | 3 |
| 102 | What is the reasoning for not using 'push up'? | 1 | 3 |
| 176 | What made the plan opt for 'move left' instead... | 2 | 1 |
| 186 | What was the reasoning for 'move down' being s... | 2 | 3 |
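
As a quick sanity check, the error count agrees with the classification report: 50 errors out of 346 gives an accuracy of about 0.86, and the per-class recalls imply roughly the same number of misses. The sketch below just redoes that arithmetic with the rounded numbers from the report, so small discrepancies are expected.

```python
total, incorrect = 346, 50
accuracy = (total - incorrect) / total
print(f"accuracy from error count: {accuracy:.2f}")

# (support, recall) per class, copied from the classification report.
report = {1: (116, 0.97), 2: (115, 0.94), 3: (115, 0.65)}
expected_misses = {cls: support * (1 - recall) for cls, (support, recall) in report.items()}
for cls, misses in expected_misses.items():
    print(f"class {cls}: about {misses:.1f} misclassified")
print(f"total: about {sum(expected_misses.values()):.1f}")
```

Most of the errors therefore come from class 3, the "rather than action B" questions.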


## Dataset Prediction using Fine-Tuned Model

In [28]:
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from numpy import argmax
from utils import get_best_available_device

In [29]:
device = get_best_available_device()
print(f"Using device: {device}")

Using device: mps


In [30]:
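# The CSV has columns 'text' and 'label'. The labels are 1-3, but the 3-class
# head created below expects class ids 0-2, so shift them down by one.
ft_df = df[['text', 'label']].copy()
ft_df['label'] = ft_df['label'] - 1
train_df, val_df = train_test_split(ft_df, test_size=0.2, random_state=13)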

# Convert to Hugging Face Datasets format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
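
With `test_size=0.2` and 346 rows, the split should yield 70 validation and 276 training examples, assuming scikit-learn rounds the test share up and gives the remainder to the training set. A minimal check of that arithmetic:

```python
import math

n_rows = 346
test_size = 0.2

n_val = math.ceil(n_rows * test_size)   # assumed rounding rule: ceil for the test share
n_train = n_rows - n_val

print(n_train, n_val)
```

Note that the split is not stratified; `train_test_split` accepts a `stratify` argument if similar class proportions in both parts are wanted.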

In [43]:
# Use a tokenizer to preprocess the text data:
model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/276 [00:00<?, ? examples/s]

Map:   0%|          | 0/70 [00:00<?, ? examples/s]

In [44]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
model.to(device);

In [45]:
# pass "wandb" to the 'report_to' parameter to turn on wandb logging
training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
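
For a sense of scale, these settings amount to very few optimizer steps. Assuming a single device, no gradient accumulation, and that the last partial batch is kept (the defaults, as far as we know), the 276 training examples give:

```python
import math

n_train = 276
batch_size = 16      # per_device_train_batch_size
epochs = 3           # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```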

In [46]:
# Define the metrics to evaluate the model
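def compute_metrics(p):
    pred, labels = p
    # BART can return extra tensors alongside the logits, in which case the
    # predictions arrive as a tuple; keep only the logits (the first element)
    if isinstance(pred, tuple):
        pred = pred[0]
    pred = argmax(pred, axis=1)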
    precision, recall, f1, _ = precision_recall_fscore_support(labels, pred, average='weighted')
    acc = accuracy_score(labels, pred)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
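
`average='weighted'` computes each metric per class and then averages the results with weights proportional to each class's support. The following pure-Python sketch of weighted F1 is meant only to illustrate that definition on toy labels; in the notebook itself `precision_recall_fscore_support` does this work, and `weighted_f1` is a hypothetical name.

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n_cls
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_cls / total
    return score

# Toy example (not from the dataset): class ids 0-2.
labels_toy = [0, 0, 1, 1, 2, 2]
preds_toy = [0, 0, 1, 2, 2, 2]
print(f"{weighted_f1(labels_toy, preds_toy):.3f}")
```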

In [47]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [49]:
wandb.finish()


In [None]:
trainer.evaluate()