# Zero-Shot Classification

In this notebook, we use the `zero-shot-classification` pipeline from the Hugging Face Transformers library to predict the intent of each sentence in a dataset. We then compare the predicted intents with the actual labels and print the evaluation metrics.

The goal is an NLP-based framework that parses an input question and classifies its intent as one of the question types below.

Question Types:
1. Why is action A not used in the plan, rather than being used?
2. Why is action A used in the plan, rather than not being used?
3. Why is action A used in state S, rather than action B?

## Single Text Prediction

In [1]:
from transformers import pipeline
from pprint import pprint

In [2]:
!pwd

/Users/nitingupta/usc/aiisc/planning_ontology/Contrastive-Planning/code/intent_parsing/zero_shot_fine_tune


In [3]:
# Load a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", 
                      model="../../../models/grounded_saved_model_20240606_230822", 
                      tokenizer="../../../models/grounded_saved_model_20240606_230822")

In [21]:
# Define the query and candidate labels
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used in state S, rather than action B?"]
query = "Why is moveLeft x1 y1 used in the plan??"

# Perform zero-shot classification
result = classifier(query, candidate_labels)
pprint(result, width=100)

{'labels': ['Why is action A used in state S, rather than action B?',
            'Why is action A used in the plan?',
            'Why is action A not used in the plan?'],
 'scores': [0.511748731136322, 0.3627440631389618, 0.1255071759223938],
 'sequence': 'Why is moveLeft x1 y1 used in the plan??'}
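
The result dictionary lists the labels in descending order of score, so the first entry is the predicted intent. The sketch below (the helper name `top_intent` is hypothetical, not part of Transformers) extracts the top label together with its margin over the runner-up, using the result printed above. Here the query is a "used in the plan" question, yet the third label wins with a score of only about 0.51, so the margin is a useful signal that the prediction is not confident.

```python
def top_intent(result):
    """Return (best_label, best_score, margin) from a zero-shot result dict.

    Assumes at least two candidate labels were supplied.
    """
    # Sort explicitly instead of relying on the order of the output lists.
    ranked = sorted(zip(result["labels"], result["scores"]),
                    key=lambda pair: pair[1], reverse=True)
    (best_label, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best_label, best_score, best_score - runner_up_score


# Result copied from the cell output above.
result = {
    "labels": ["Why is action A used in state S, rather than action B?",
               "Why is action A used in the plan?",
               "Why is action A not used in the plan?"],
    "scores": [0.511748731136322, 0.3627440631389618, 0.1255071759223938],
    "sequence": "Why is moveLeft x1 y1 used in the plan??",
}

label, score, margin = top_intent(result)
print(label)
print(f"score={score:.3f} margin={margin:.3f}")
```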


In [5]:
# Define the query and candidate labels
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used rather than action B?"]
query = "What made 'push box to the left' more suitable than 'move to the right'?"

# Perform zero-shot classification
result = classifier(query, candidate_labels)
pprint(result, width=100)

{'labels': ['Why is action A used rather than action B?',
            'Why is action A used in the plan?',
            'Why is action A not used in the plan?'],
 'scores': [0.9998933672904968, 8.815569162834436e-05, 1.8498139979783446e-05],
 'sequence': "What made 'push box to the left' more suitable than 'move to the right'?"}


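In this cell, "Why is action A used rather than action B?" receives a score of about 0.9999, whereas "Why is action A used in state S, rather than action B?" reached only about 0.51 in the previous cell. This suggests that the shorter phrasing is the better intent category label, although the two cells use different queries, so the comparison is only indicative.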
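
To compare label phrasings on equal terms, the same query can be run against each candidate-label set. The following is a minimal sketch: `compare_label_sets` is a hypothetical helper, and `stub_classify` is a word-overlap stand-in (not a real model) so that the snippet runs without loading the saved model. In the notebook, the `classifier` pipeline object would be passed in place of the stub.

```python
def compare_label_sets(classify, query, label_sets):
    """Run one query against several label sets; return the top (label, score) per set."""
    summary = {}
    for name, labels in label_sets.items():
        result = classify(query, labels)
        summary[name] = (result["labels"][0], result["scores"][0])
    return summary


def stub_classify(query, labels):
    """Stand-in for the real pipeline (NOT a model): scores labels by shared words."""
    query_words = set(query.lower().split())
    raw = [len(query_words & set(label.lower().split())) + 1 for label in labels]
    total = sum(raw)
    ranked = sorted(zip(labels, (r / total for r in raw)),
                    key=lambda pair: pair[1], reverse=True)
    return {"sequence": query,
            "labels": [label for label, _ in ranked],
            "scores": [score for _, score in ranked]}


shared = ["Why is action A not used in the plan?",
          "Why is action A used in the plan?"]
label_sets = {
    "with_state": shared + ["Why is action A used in state S, rather than action B?"],
    "without_state": shared + ["Why is action A used rather than action B?"],
}
query = "What made 'push box to the left' more suitable than 'move to the right'?"

summary = compare_label_sets(stub_classify, query, label_sets)
for name, (label, score) in summary.items():
    print(f"{name}: {label!r} ({score:.3f})")
```

The scores printed by the stub are meaningless; only the structure of the comparison carries over to the real classifier.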


## Dataset Prediction using Base Model
Predict the intent of each sentence in the `text` column of the dataset CSV, compare the predictions with the actual labels, and print the evaluation metrics.

Use the base model.

In [6]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score
import swifter
import wandb
import os

In [7]:
# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"] = "zero-shot-classification"

# save your trained model checkpoint to wandb
os.environ["WANDB_LOG_MODEL"] = "true"

# turn off watch to log faster
os.environ["WANDB_WATCH"] = "false"

In [8]:
# Load the CSV file into a DataFrame
df = pd.read_csv('../../../data/sokoban/sokoban_final_grounded_dataset.csv')

print(f"Number of rows in the dataset: {df.shape[0]}")
df.head()

Number of rows in the dataset: 216


| Unnamed: 0 | text | intent |
|---|---|---|
| 0 | Why is moveLeft x1 y1 not used in the plan? | Why is action A not used in the plan? |
| 1 | Why is moveRight x1 y1 not used in the plan? | Why is action A not used in the plan? |
| 2 | Why is moveUp x1 y1 not used in the plan? | Why is action A not used in the plan? |
| 3 | Why is moveDown x1 y1 not used in the plan? | Why is action A not used in the plan? |
| 4 | Why is pushLeft x1 y1 z1 crate1 not used in th... | Why is action A not used in the plan? |


In [9]:
# Define the candidate labels and their corresponding intent numbers
candidate_labels = ["Why is action A not used in the plan?", 
                    "Why is action A used in the plan?", 
                    "Why is action A used rather than action B?"]

# Map each candidate label string to its intent number (1-3)
intent_to_label = {label: number for number, label in enumerate(candidate_labels, start=1)}
intent_to_label

{'Why is action A not used in the plan?': 1,
 'Why is action A used in the plan?': 2,
 'Why is action A used rather than action B?': 3}

In [11]:
# Load a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", 
                      model="../../../models/grounded_saved_model_20240606_230822", 
                      tokenizer="../../../models/grounded_saved_model_20240606_230822")

# Function to get predictions for each text
def get_prediction(text):
    result = classifier(text, candidate_labels)
    predicted_label = intent_to_label[result['labels'][0]]
    return predicted_label

In [12]:
# Apply the function to the text column
df['predicted_label'] = df['text'].swifter.apply(get_prediction)

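
As an alternative to applying the function row by row, the zero-shot pipeline also accepts a list of sequences and returns one result dictionary per input. The sketch below assumes that behaviour; `predict_intents` is a hypothetical helper and `stub_classify` is a stand-in (not a real model) so the snippet runs on its own. In the notebook, `classifier` would take the stub's place.

```python
def predict_intents(classify, texts, labels, label_to_id):
    """Classify all texts in one call and return the numeric intent for each."""
    results = classify(list(texts), labels)
    # Some pipeline versions return a bare dict for a single input.
    if isinstance(results, dict):
        results = [results]
    return [label_to_id[result["labels"][0]] for result in results]


def stub_classify(texts, labels):
    """Stand-in for the real pipeline (NOT a model): matches on the word 'not'."""
    results = []
    for text in texts:
        negated = " not " in text
        # Labels whose negation matches the text sort first (False < True).
        ranked = sorted(labels, key=lambda label: (" not " in label) != negated)
        results.append({"sequence": text, "labels": ranked,
                        "scores": [1 / len(labels)] * len(labels)})
    return results


candidate_labels = ["Why is action A not used in the plan?",
                    "Why is action A used in the plan?",
                    "Why is action A used rather than action B?"]
intent_to_label = {label: n for n, label in enumerate(candidate_labels, start=1)}
texts = ["Why is moveLeft x1 y1 not used in the plan?",
         "Why is moveLeft x1 y1 used in the plan?"]

predicted = predict_intents(stub_classify, texts, candidate_labels, intent_to_label)
print(predicted)
```

With the real pipeline, the call would look like `df['predicted_label'] = predict_intents(classifier, df['text'], candidate_labels, intent_to_label)`.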

In [14]:
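# Convert the intent labels to numbers.
# Note: pd.factorize numbers values in order of first appearance, so this matches
# the 1-3 numbering above only if the intents first appear in the CSV in that order.
df['label'] = pd.factorize(df['intent'])[0] + 1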
df.head()

| Unnamed: 0 | text | intent | predicted_label | label |
|---|---|---|---|---|
| 0 | Why is moveLeft x1 y1 not used in the plan? | Why is action A not used in the plan? | 1 | 1 |
| 1 | Why is moveRight x1 y1 not used in the plan? | Why is action A not used in the plan? | 1 | 1 |
| 2 | Why is moveUp x1 y1 not used in the plan? | Why is action A not used in the plan? | 1 | 1 |
| 3 | Why is moveDown x1 y1 not used in the plan? | Why is action A not used in the plan? | 1 | 1 |
| 4 | Why is pushLeft x1 y1 z1 crate1 not used in th... | Why is action A not used in the plan? | 2 | 1 |
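
Because `pd.factorize` numbers the intents by order of first appearance, the true labels line up with the predicted ones only when the CSV happens to be ordered like `candidate_labels`. A more explicit option is to encode the intent strings with the same dictionary used for the predictions and to fail loudly on any string that has no entry. The sketch below is one way to do this in plain Python; `encode_intents` is a hypothetical helper.

```python
def encode_intents(intents, label_to_id):
    """Map intent strings to numeric IDs, raising if any string is unmapped."""
    unknown = sorted(set(intents) - set(label_to_id))
    if unknown:
        raise ValueError(f"Intent strings with no ID: {unknown}")
    return [label_to_id[intent] for intent in intents]


intent_to_label = {
    "Why is action A not used in the plan?": 1,
    "Why is action A used in the plan?": 2,
    "Why is action A used rather than action B?": 3,
}

# The order of appearance differs from the ID order on purpose.
intents = ["Why is action A used in the plan?",
           "Why is action A not used in the plan?",
           "Why is action A used rather than action B?"]
encoded = encode_intents(intents, intent_to_label)
print(encoded)
```

On the DataFrame, the equivalent would be `df['intent'].map(intent_to_label)`, followed by a check that the result contains no missing values (for example with `.isna().any()`).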


In [15]:
# Compare predicted labels with actual labels
y_true = df['label']
y_pred = df['predicted_label']

# Print the classification report
print(classification_report(y_true, y_pred))
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")

              precision    recall  f1-score   support

           1       1.00      0.46      0.63        72
           2       0.00      0.00      0.00        72
           3       0.46      1.00      0.63        72

    accuracy                           0.49       216
   macro avg       0.49      0.49      0.42       216
weighted avg       0.49      0.49      0.42       216

Accuracy: 0.49
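
In the report, class 2 has precision and recall of 0.00, while class 3 has recall 1.00 but precision 0.46, which means the model over-predicts class 3. To make the definitions concrete, the sketch below computes per-class precision and recall from (true, predicted) pairs in plain Python. The data is a small toy example chosen to mimic that pattern, not the notebook's actual predictions, and `per_class_metrics` is a hypothetical helper.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall)} computed from paired labels."""
    pairs = Counter(zip(y_true, y_pred))
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        true_positives = pairs[(cls, cls)]
        predicted = sum(n for (_, pred), n in pairs.items() if pred == cls)
        actual = sum(n for (true, _), n in pairs.items() if true == cls)
        # Report 0.0 when a class is never predicted or never present.
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        metrics[cls] = (precision, recall)
    return metrics


# Toy data: class 2 is always mistaken for class 3, class 1 only sometimes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 3, 3, 3, 3, 3]

metrics = per_class_metrics(y_true, y_pred)
for cls, (precision, recall) in metrics.items():
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
```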


In [20]:
df[df['label'] == 2]

| Unnamed: 0 | text | intent | predicted_label | label |
|---|---|---|---|---|
| 72 | Why is moveLeft x1 y1 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 73 | Why is moveRight x1 y1 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 74 | Why is moveUp x1 y1 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 75 | Why is moveDown x1 y1 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 76 | Why is pushLeft x1 y1 z1 crate1 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| ... | ... | ... | ... | ... |
| 139 | Why is moveDown x9 y9 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 140 | Why is pushLeft x9 y9 z9 crate9 used in the plan? | Why is action A used in the plan? | 3 | 2 |
| 141 | Why is pushRight x9 y9 z9 crate9 used in the p... | Why is action A used in the plan? | 3 | 2 |
| 142 | Why is pushUp x9 y9 z9 crate9 used in the plan? | Why is action A used in the plan? | 3 | 2 |


In [16]:
# Display the rows in which the predictions didn't match the label
incorrect_predictions = df[df['label'] != df['predicted_label']]
print(f"{incorrect_predictions.shape[0]} incorrect predictions out of {df.shape[0]} test samples.")
incorrect_predictions.head()

111 incorrect predictions out of 216 test samples.


| Unnamed: 0 | text | intent | predicted_label | label |
|---|---|---|---|---|
| 4 | Why is pushLeft x1 y1 z1 crate1 not used in th... | Why is action A not used in the plan? | 2 | 1 |
| 5 | Why is pushRight x1 y1 z1 crate1 not used in t... | Why is action A not used in the plan? | 2 | 1 |
| 6 | Why is pushUp x1 y1 z1 crate1 not used in the ... | Why is action A not used in the plan? | 2 | 1 |
| 7 | Why is pushDown x1 y1 z1 crate1 not used in th... | Why is action A not used in the plan? | 2 | 1 |
| 12 | Why is pushLeft x2 y2 z2 crate2 not used in th... | Why is action A not used in the plan? | 2 | 1 |
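
The first misclassified rows all involve `push*` actions, which hints that errors may cluster by action type. One way to check this is to count the incorrect predictions per action name. The sketch below assumes every question starts with "Why is <action>", as in the rows shown. `errors_by_action` is a hypothetical helper, and the sample rows are copied from the tables in this notebook (truncated texts are kept as displayed).

```python
import re
from collections import Counter

# Assumption: the action name is the first word after "Why is".
ACTION_PATTERN = re.compile(r"^Why is (\w+)")


def errors_by_action(rows):
    """Count misclassified rows per action name.

    rows: iterable of (text, true_label, predicted_label) tuples.
    """
    counts = Counter()
    for text, true_label, predicted_label in rows:
        if true_label != predicted_label:
            match = ACTION_PATTERN.match(text)
            counts[match.group(1) if match else "<unparsed>"] += 1
    return counts


rows = [
    ("Why is moveLeft x1 y1 not used in the plan?", 1, 1),
    ("Why is moveRight x1 y1 not used in the plan?", 1, 1),
    ("Why is pushLeft x1 y1 z1 crate1 not used in th...", 1, 2),
    ("Why is pushRight x1 y1 z1 crate1 not used in t...", 1, 2),
    ("Why is moveLeft x1 y1 used in the plan?", 2, 3),
]

error_counts = errors_by_action(rows)
print(error_counts.most_common())
```

For the full dataset, the rows could come from `df[['text', 'label', 'predicted_label']].itertuples(index=False)`.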
