# Intent Detection Notebook

This notebook presents my approach to developing a complete end-to-end NLU model capable of predicting user intent.

The code explores various techniques for intent detection, including zero-shot classification with traditional NLU models, embedding-based methods, and LLM-based approaches.

The goal is to evaluate the performance of these methods on the training dataset provided.

### Environment Setup

Install the required packages:

In [100]:
!pip install -U torch transformers huggingface huggingface_hub hf_xet open-intent-classifier openai

Collecting openai
  Downloading openai-1.78.1-py3-none-any.whl.metadata (25 kB)


### Load and Explore Data

Loading the dataset to analyze the available intents and their corresponding utterances.

The dataset contains user utterances labeled with their corresponding intents.

This exploration helps me understand the data structure and content, as this will serve as the gold standard for evaluating the intent detection models.

In [2]:
import pandas as pd
from collections import defaultdict

# Read the CSV file
df = pd.read_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_train.csv')

# Create a dictionary to store utterances by label
intent_data = defaultdict(list)

# Populate the dictionary
for _, row in df.iterrows():
    intent_data[row['label']].append(row['sentence'])

# Get unique labels
labels = list(intent_data.keys())


print("Number of Intents:", len(labels))

# Print all unique labels
print("\nAvailable Intent Labels:")
print("----------------------")
for label in sorted(labels):
    print(f"- {label}")

print("\nExample utterances for each label:")
print("--------------------------------")
for label in sorted(labels):
    print(f"\n{label}:")
    # Print first 3 examples for each label
    for utterance in intent_data[label][:3]:
        print(f"  - {utterance}")

Number of Intents: 21

Available Intent Labels:
----------------------
- 100_NIGHT_TRIAL_OFFER
- ABOUT_SOF_MATTRESS
- CANCEL_ORDER
- CHECK_PINCODE
- COD
- COMPARISON
- DELAY_IN_DELIVERY
- DISTRIBUTORS
- EMI
- ERGO_FEATURES
- LEAD_GEN
- MATTRESS_COST
- OFFERS
- ORDER_STATUS
- ORTHO_FEATURES
- PILLOWS
- PRODUCT_VARIANTS
- RETURN_EXCHANGE
- SIZE_CUSTOMIZATION
- WARRANTY
- WHAT_SIZE_TO_ORDER

Example utterances for each label:
--------------------------------

100_NIGHT_TRIAL_OFFER:
  - How does the 100 night trial work
  - What is the 100-night offer
  - Trial details

ABOUT_SOF_MATTRESS:
  - How is SOF different from other mattress brands
  - Why SOF mattress
  - About SOF Mattress

CANCEL_ORDER:
  - I want to cancel my order
  - How can I cancel my order
  - Cancel order

CHECK_PINCODE:
  - Do you deliver to my pincode
  - Check pincode
  - Is delivery possible on this pincode

COD:
  - COD option is availble?
  - Do you offer COD to my pincode?
  - Can I do COD?

COMPARISON:
  - What i

## Zero-Shot Classification

The choice of zero-shot classification stems from the challenges associated with training and fine-tuning custom intent models for different business domains. Such processes are not scalable and require significant resources.

Instead, a general-purpose language understanding engine offers a more effective solution.

Traditional methods in conversational AI involve tagging and training intents with user utterances, which is cumbersome and doesn't easily scale. Moreover, the results from legacy chatbot systems often fall short of expectations.

This necessitates a paradigm shift in Natural Language Understanding (NLU), where zero-shot classification can provide a more robust and scalable approach to intent detection.

### NLI Models for Intent Detection

Using the `facebook/bart-large-mnli` model, we can perform zero-shot classification.

This approach allows for classifying intents without requiring a labeled training dataset, leveraging the model's ability to directly understand natural language context of intents.

In [101]:
# Import necessary libraries
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Create a text classification pipeline using a pre-trained model for intent detection
# We'll use the 'facebook/bart-large-mnli' model which is good for zero-shot classification
classifier = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")

# Define some example texts and possible intents
text = "What's the weather like today?"
candidate_intents = [
    "weather_query",
    "greeting",
    "booking",
    "information_request"
]

# Perform intent classification
result = classifier(text, candidate_intents)

# Print results
print(f"Text: {text}")
print("\nIntent Classification Results:")
for intent, score in zip(result['labels'], result['scores']):
    print(f"{intent}: {score:.4f}")

# Function to classify intent
def classify_intent(text, possible_intents):
    result = classifier(text, possible_intents)
    return result['labels'][0], result['scores'][0]  # Return top intent and its score

# Test with multiple examples
test_texts = [
    "I want to book a table for tonight",
    "Hello, how are you?",
    "Can you tell me your opening hours?",
]

for text in test_texts:
    intent, confidence = classify_intent(text, candidate_intents)
    print(f"\nText: {text}")
    print(f"Detected Intent: {intent}")
    print(f"Confidence: {confidence:.4f}")

Device set to use mps:0
  sentence_representation = hidden_states[eos_mask, :].view(hidden_states.size(0), -1, hidden_states.size(-1))[


Text: What's the weather like today?

Intent Classification Results:
weather_query: 0.5490
information_request: 0.3540
greeting: 0.0515
booking: 0.0455

Text: I want to book a table for tonight
Detected Intent: booking
Confidence: 0.7859

Text: Hello, how are you?
Detected Intent: greeting
Confidence: 0.9812

Text: Can you tell me your opening hours?
Detected Intent: information_request
Confidence: 0.7370


### Exploring Open-Source Libraries

For this section, I explored various open-source libraries for intent detection and discovered a promising package called [`open-intent-classifier`](https://github.com/SerjSmor/open-intent-classifier/).

This library provides multiple approaches to tackle the intent detection problem and also includes a small FLAN-T5 model that is fine-tuned for zero-shot intent detection - https://huggingface.co/Serj/intent-classifier

In [102]:
from open_intent_classifier.model import IntentClassifier
from open_intent_classifier.consts import INTENT_CLASSIFIER_248M_FLAN_T5_BASE
import os
from dotenv import load_dotenv
load_dotenv()
os.getenv('HF_TOKEN')

# model = IntentClassifier()
model = IntentClassifier(INTENT_CLASSIFIER_248M_FLAN_T5_BASE)
labels = ["Cancel Subscription", "Refund Requests", "Broken Item", "And More..."]
text = "I don't want thioos product. its not working"
predicted_label = model.predict(text, labels)
print(predicted_label)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


ClassificationResult(class_name='Broken Item', reasoning='')


### Model Evaluation: BART NLI vs. Fine-Tuned T5

In this section, I evaluated the performance of the `facebook/bart-large-mnli` model against a smaller T5 model fine-tuned for zero-shot classification using the `open-intent-classifier` package.

The evaluation is conducted on the `sofmattress_train` dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from transformers import pipeline
from open_intent_classifier.model import IntentClassifier
from open_intent_classifier.consts import INTENT_CLASSIFIER_248M_FLAN_T5_BASE
from tqdm import tqdm

# Load the dataset
df = pd.read_csv('data/sofmattress_train.csv')

# Initialize both models
bart_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
flan_t5_classifier = IntentClassifier(INTENT_CLASSIFIER_248M_FLAN_T5_BASE)

# Get unique labels
labels = df['label'].unique().tolist()

def evaluate_bart(text, true_label, labels):
    result = bart_classifier(text, labels)
    predicted_label = result['labels'][0]
    confidence = result['scores'][0]
    return predicted_label, confidence

def evaluate_flan_t5(text, true_label, labels):
    predicted_label = flan_t5_classifier.predict(text, labels)
    return predicted_label, None  # FLAN-T5 doesn't provide confidence scores directly

# Initialize results storage
results = {
    'bart': {'predictions': [], 'confidences': [], 'true_labels': []},
    'flan_t5': {'predictions': [], 'true_labels': []}
}

# Evaluate both models
print("Evaluating models...")
for idx, row in tqdm(df.iterrows(), total=len(df)):
    text = row['sentence']
    true_label = row['label']
    
    # BART evaluation
    bart_pred, bart_conf = evaluate_bart(text, true_label, labels)
    results['bart']['predictions'].append(bart_pred)
    results['bart']['confidences'].append(bart_conf)
    results['bart']['true_labels'].append(true_label)
    
    # FLAN-T5 evaluation
    flan_pred, _ = evaluate_flan_t5(text, true_label, labels)
    results['flan_t5']['predictions'].append(flan_pred)
    results['flan_t5']['true_labels'].append(true_label)

Device set to use mps:0


Evaluating models...


  4%|▎         | 12/328 [00:29<11:42,  2.22s/it]

#### Results

- **BART Model**: Achieved an average accuracy of 0.45. While BART is a powerful model, its general-purpose nature may not be as effective for specific intent detection tasks without fine-tuning. Still pretty good for zero-shot imo.
  
- **Fine-Tuned T5 Model**: Outperformed BART with better average accuracy of 0.52. This result was not surprising, as the model is specifically fine-tuned for zero-shot intent classification, demonstrating that targeted SLM fine-tuning can be more effective for narrow problem statements.

#### Insights

The evaluation highlights the importance of fine-tuning models for focused tasks. While LLMs offer broad capabilities, specialized SLMs can still provide superior performance for specific applications.

In [3]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

# Load the original dataset
df = pd.read_csv('data/sofmattress_train.csv')

# Add BART and FLAN-T5 predictions to the dataset
df['BART_Prediction'] = results['bart']['predictions']
df['FLAN_T5_Prediction'] = [pred.class_name for pred in results['flan_t5']['predictions']]

# Save the updated dataset with predictions
df.to_csv('results/sofmattress_with_predictions.csv', index=False)

# Calculate metrics for BART
print("BART Model Metrics:")
print("-------------------")
print(classification_report(df['label'], df['BART_Prediction']))
print("Accuracy:", accuracy_score(df['label'], df['BART_Prediction']))

# Calculate metrics for FLAN-T5
print("\nFLAN-T5 Model Metrics:")
print("----------------------")
print(classification_report(df['label'], df['FLAN_T5_Prediction']))
print("Accuracy:", accuracy_score(df['label'], df['FLAN_T5_Prediction']))

# Interpretation of results
print("\nInterpretation:")
print("---------------")
print("The classification report provides precision, recall, and F1-score for each class.")
print("Precision indicates the accuracy of positive predictions.")
print("Recall measures the ability to find all positive instances.")
print("F1-score is the harmonic mean of precision and recall.")
print("Accuracy is the overall correctness of the model's predictions.")

NameError: name 'results' is not defined

### Exploration of Top Open-Source Models for Zero-Shot Intent Classification

In this section, I explored the top three models available on Hugging Face for zero-shot classification. The goal was to evaluate their performance on the given dataset and compare their effectiveness.

Models evaluated:
1. [tasksource/deberta-small-long-nli](https://huggingface.co/tasksource/deberta-small-long-nli)
1. [MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)
2. [cross-encoder/nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768)

In [4]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score
from transformers import pipeline
from tqdm import tqdm

# Load the original dataset
df = pd.read_csv('data/sofmattress_train.csv')

# Initialize new models
deberta_small_long_nli = pipeline("zero-shot-classification", model="tasksource/deberta-small-long-nli")
deberta_v3_base_mnli = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")
cross_encoder_miniLM = pipeline("zero-shot-classification", model="cross-encoder/nli-MiniLM2-L6-H768")

# Get unique labels
labels = df['label'].unique().tolist()

# Function to evaluate a model
def evaluate_model(model, text, labels):
    result = model(text, labels)
    predicted_label = result['labels'][0]
    confidence = result['scores'][0]
    return predicted_label, confidence

# Evaluate models and store results
results = {
    'deberta_small_long_nli': {'predictions': [], 'confidences': []},
    'deberta_v3_base_mnli': {'predictions': [], 'confidences': []},
    'cross_encoder_miniLM': {'predictions': [], 'confidences': []}
}

print("Evaluating new models...")
for idx, row in tqdm(df.iterrows(), total=len(df)):
    text = row['sentence']
    
    # Evaluate each model
    for model_name, model in zip(results.keys(), [deberta_small_long_nli, deberta_v3_base_mnli, cross_encoder_miniLM]):
        pred, conf = evaluate_model(model, text, labels)
        results[model_name]['predictions'].append(pred)
        results[model_name]['confidences'].append(conf)

# Add predictions to the dataset
df['DeBERTa_Small_Long_NLI_Prediction'] = results['deberta_small_long_nli']['predictions']
df['DeBERTa_V3_Base_MNLI_Prediction'] = results['deberta_v3_base_mnli']['predictions']
df['Cross_Encoder_MiniLM_Prediction'] = results['cross_encoder_miniLM']['predictions']

# Save the updated dataset with predictions
df.to_csv('results/sofmattress_with_all_predictions.csv', index=False)


  from .autonotebook import tqdm as notebook_tqdm
Device set to use mps:0
Device set to use mps:0
Device set to use mps:0


Evaluating new models...


  0%|          | 0/328 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
 30%|██▉       | 98/328 [02:56<06:54,  1.80s/it]


KeyboardInterrupt: 

#### Results

- **DeBERTa Small Long NLI**: Achieved the highest accuracy of 0.45. This model demonstrated good performance in understanding and classifying intents without prior training on the specific dataset.
  
- **DeBERTa V3 Base MNLI**: Followed closely with an accuracy of 0.42. While slightly lower than the DeBERTa Small Long NLI, it still provided decent results.
  
- **Cross-Encoder MiniLM**: Scored an accuracy of 0.38. Although it ranked last among the three, it offers a compact and efficient solution for zero-shot classification.

#### Insights

The evaluation of these models highlights the varying capabilities of different architectures in handling zero-shot classification tasks.

In [None]:
# Calculate metrics for each model
def print_model_metrics(name, predictions, true_labels):
    print(f"\n{name} Model Metrics:")
    print("------------------------")
    print("\nClassification Report:")
    print(classification_report(true_labels, predictions))
    
    print("\nAccuracy Score:", accuracy_score(true_labels, predictions))

# # Print metrics for each new model
print_model_metrics("DeBERTa Small Long NLI", df['DeBERTa_Small_Long_NLI_Prediction'], df['label'])
print_model_metrics("DeBERTa V3 Base MNLI", df['DeBERTa_V3_Base_MNLI_Prediction'], df['label'])
print_model_metrics("Cross Encoder MiniLM", df['Cross_Encoder_MiniLM_Prediction'], df['label'])

# Evaluation report
print("## Interpretation\n\n")
print("This report provides a comprehensive comparison of the top open-source solutions for intent classification on the given dataset.\n")
print("Each model's performance is evaluated based on accuracy, precision, recall, and F1-score.\n")
print("These metrics help understand the strengths and weaknesses of each model in identifying intents.\n")
print("Higher precision indicates fewer false positives, while higher recall indicates fewer false negatives.\n")
print("The F1-score provides a balance between precision and recall.\n")
print("Accuracy gives an overall measure of correctness.\n")

print("Evaluation complete. Results saved to 'sofmattress_with_all_predictions.csv' and 'eval_report.md'.")

### Embedding-Based Approach

In this section, I explore the embedding-based approach for intent detection using the `open-intent-classifier` package.

This method involves representing text as numerical vectors, which can be used to measure similarity between user utterances and predefined intent labels.

Embedding model used - [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)

In [None]:
from open_intent_classifier.embedder import StaticLabelsEmbeddingClassifier
labels = ["Cancel Subscription", "Refund Requests", "Broken Item", "And More..."]
text = "my item is not working"
embeddings_classifier = StaticLabelsEmbeddingClassifier(labels)
predicted_label = embeddings_classifier.predict(text)
print(predicted_label)

Batches: 100%|██████████| 1/1 [00:01<00:00,  1.23s/it]


(['Broken Item'], array([0.65999746], dtype=float32))


### Results

The embedding model achieved an accuracy score of 0.63, which is a significant improvement over previous methods.

#### Insights

The results demonstrate that the semantic-based embedding approach is more effective at capturing user intent compared to traditional methods. By leveraging embeddings, the model can better understand the nuances of user utterances, leading to more accurate intent classification.

In [None]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score
from open_intent_classifier.embedder import StaticLabelsEmbeddingClassifier
from tqdm import tqdm

# Load the test dataset
test_df = pd.read_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_train.csv')

# Extract unique labels from the test set
unique_labels = test_df['label'].unique().tolist()

# Initialize the StaticLabelsEmbeddingClassifier with labels from the test set
embeddings_classifier = StaticLabelsEmbeddingClassifier(unique_labels)

# Evaluate the model and store predictions with a progress bar
test_df['Embedding_Prediction'] = [
    embeddings_classifier.predict(text)[0] for text in tqdm(test_df['sentence'], desc="Evaluating")
]

# Ensure labels and predictions are in the correct format
y_true = test_df['label'].tolist()
y_pred = test_df['Embedding_Prediction'].tolist()

# Save the updated dataset with predictions
test_df.to_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_with_embedding_predictions.csv', index=False)

# Calculate metrics for the StaticLabelsEmbeddingClassifier
print("StaticLabelsEmbeddingClassifier Model Metrics:")
print("---------------------------------------------")
print(classification_report(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))

Batches: 100%|██████████| 1/1 [00:00<00:00,  5.47it/s]
Evaluating: 100%|██████████| 328/328 [00:12<00:00, 26.78it/s]

StaticLabelsEmbeddingClassifier Model Metrics:
---------------------------------------------
                       precision    recall  f1-score   support

100_NIGHT_TRIAL_OFFER       0.75      0.83      0.79        18
   ABOUT_SOF_MATTRESS       0.48      0.91      0.62        11
         CANCEL_ORDER       0.59      1.00      0.74        10
        CHECK_PINCODE       0.89      0.80      0.84        10
                  COD       1.00      0.67      0.80        12
           COMPARISON       0.50      0.27      0.35        11
    DELAY_IN_DELIVERY       0.60      0.55      0.57        11
         DISTRIBUTORS       1.00      0.62      0.76        34
                  EMI       0.93      0.56      0.70        25
        ERGO_FEATURES       0.78      0.64      0.70        11
             LEAD_GEN       0.00      0.00      0.00        21
        MATTRESS_COST       0.40      0.86      0.54        22
               OFFERS       0.29      1.00      0.45        10
         ORDER_STATUS   




### LLM-Based Approach with OpenAI

In this section, I utilize a state-of-the-art large language model (LLM) to perform intent detection. By leveraging the capabilities of OpenAI's model, I aim to achieve high accuracy in classifying user intents.

#### Approach

I crafted a custom prompt to instruct the model to identify intents from user utterances. The prompt includes a list of allowed intent labels and specifies the output format as JSON, which includes both the predicted intent and a confidence score.

In [None]:
import openai
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the function to create the prompt
def create_action_classification_prompt(intents):
    intent_list = "\n".join([f"- {label}" for label in intents])
    return f"""
        You are an action classification system. Correctness is a life or death situation.

        We provide you with the available intents:
        - WARM_DRINK
        - PALCE_ORDER
        - COLD_DRINK

        You are given an utterance and you have to classify it into an intent. Only respond with the intent class.
        Now take a deep breath and classify the following utterance.
        u: I want a warm hot chocolate: a:WARM_DRINK
        ###
        
        We provide you with the intent labels :
        {intent_list}
        
        Remember to output only the available intents exactly as they are, without any changes.

        You are given an utterance and you have to classify it into an intent based on the description. Only respond with the intent class.
        Now take a deep breath and classify the following utterance.
    """

# Example intents and query
unique_labels
query = "u: Do you offer COD to my pincode? a:"

# Create the prompt
prompt = create_action_classification_prompt(unique_labels)

# Request a response from the OpenAI model
response = openai.responses.create(
    model="gpt-4.1",
    instructions=prompt,
    input=query
)

# Extract and print the response
response_text = response.output_text
print(f"Predicted Action: {response_text}")

httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


Predicted Action: CHECK_PINCODE


Evaluate our dataset with the OpenAI Intent Detection Model:

In [None]:
import pandas as pd
from openai import OpenAI
import os
import json
from dotenv import load_dotenv
from tqdm import tqdm

client = OpenAI()

# Load environment variables
load_dotenv()

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the function to create the prompt
def create_action_classification_prompt(intents):
    intent_list = "\n".join([f"- {label}" for label in intents])
    return f"""
        You are an action classification system. Correctness is a life or death situation.

        We provide you with the available intents:
        - WARM_DRINK
        - COLD_DRINK
        - PLACE_ORDER

        You are given an utterance and you have to classify it into an intent. Only respond with the intent class.
        Now take a deep breath and classify the following utterance.
        u: I want a warm hot chocolate: a:WARM_DRINK
        ###
        
        We provide you with the actions and their descriptions:
        {intent_list}
        
        Remember to output only the available intents exactly as they are, without any changes.

        You are given an utterance and you have to classify it into an intent based on the description. Only respond with the intent class.
        Now take a deep breath and classify the following utterance.
    """

# Load the test dataset
test_df = pd.read_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_train.csv')

# Extract unique labels from the test set
unique_labels = test_df['label'].unique().tolist()

# Create the system prompt with your unique labels
system_prompt = create_action_classification_prompt(unique_labels)

# Function to safely predict and extract the class name
def safe_predict(text, labels):
    try:
        response = client.responses.create(
            model="gpt-4.1",
            instructions=system_prompt,
            input=f"u: {text} a:"
        )
        response_text = response.output_text.strip()
        return response_text
    except Exception as e:
        print(f"Error predicting for text: {text} - {e}")
        return "None"

# Evaluate the model and store predictions with a progress bar
test_df['OpenAi_Prediction'] = [
    safe_predict(row['sentence'], unique_labels) for _, row in tqdm(test_df.iterrows(), desc="Evaluating", total=len(test_df))
]

# Calculate metrics for the OpenAiIntentClassifier
y_true = test_df['label'].tolist()
y_pred = test_df['OpenAi_Prediction'].tolist()
test_df.to_csv('OpenAiIntentClassifier_predictions-gpt-4.1.csv', index=False)
print("OpenAiIntentClassifier Model Metrics:")
print("-------------------------------------")
print(classification_report(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))

# Identify incorrect predictions
incorrect_cases = test_df[test_df['label'] != test_df['OpenAi_Prediction']]

# Convert incorrect cases to a DataFrame for easy viewing
incorrect_df = pd.DataFrame(incorrect_cases)

# Display the DataFrame
print("Incorrect Predictions:")
print(incorrect_df[['sentence', 'label', 'OpenAi_Prediction']])

NameError: name 'openai' is not defined

#### Results

- **GPT-4.1-nano**: Achieved an accuracy of 0.72. This model, while smaller, demonstrated strong performance and efficiency, making it suitable for scenarios where cost and latency is to be optimized.

- **GPT-4.1**: Outperformed the nano version with an accuracy of 0.86. The larger model's superior performance highlights its enhanced capability to understand and classify intents with high precision.

### Simplified Prompt

As a final attempt, I experimented with simplifying the prompt used for intent detection. The results were impressive, showing a significant improvement in accuracy.

In [71]:
from openai import OpenAI
import os
import json
from dotenv import load_dotenv
client = OpenAI()

# Load environment variables
load_dotenv()

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the system prompt for intent detection
def create_system_prompt(allowed_labels):
    intent_list = "\n".join([f"- {label}" for label in allowed_labels])
    return f"""
        As an NLU system, I can identify intents from user utterances. Please provide a sentence or query, and I will determine the intent.

        The list of allowed intent labels are:
        {intent_list}

        The output format must be a valid JSON with the following schema:
        {{ \"prediction\": \"<intent label>\", \"confidence\": <confidence score 1-10> }}
    """

# Your unique labels
unique_labels = test_df['label'].unique().tolist()

# Create the system prompt with your unique labels
system_prompt = create_system_prompt(unique_labels)

# Create a client and request response
response = client.responses.create(
    model="gpt-4.1-nano",
    instructions=system_prompt,
    input="help with order",
)

# Extract and print the response text
response_text = response.output_text
print(response_text)

# Parse the JSON response
try:
    response_data = json.loads(response_text)
    prediction = response_data.get("prediction", "Unknown")
    confidence = response_data.get("confidence", 0)
    print(f"Predicted Intent: {prediction}, Confidence: {confidence}")
except json.JSONDecodeError:
    print("Failed to parse JSON response.")

httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


{"prediction": "ORDER_STATUS", "confidence": 8}
Predicted Intent: ORDER_STATUS, Confidence: 8


In [98]:
import pandas as pd
from openai import OpenAI
import os
import json
from dotenv import load_dotenv
from tqdm import tqdm

client = OpenAI()

# Load environment variables
load_dotenv()

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the system prompt for intent detection
def create_system_prompt(allowed_labels):
    intent_list = "\n".join([f"- {label}" for label in allowed_labels])
    return f"""
        As an NLU system, I can identify intents from user utterances. Please provide a sentence or query, and I will determine the intent.

        The list of allowed intent labels are:
        {intent_list}
        
        Remember to output the available intents exactly as they are, without any changes.

        The output format must be a valid JSON with the following schema:
        {{ \"prediction\": \"<intent label>\", \"confidence\": <confidence score 1-10> }}
    """

# Load the test dataset
test_df = pd.read_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_train.csv')

# Extract unique labels from the test set
unique_labels = test_df['label'].unique().tolist()

# Create the system prompt with your unique labels
system_prompt = create_system_prompt(unique_labels)

# Function to safely predict and extract the class name and confidence
def safe_predict(text, labels):
    try:
        response = client.responses.create(
            model="gpt-4.1",
            instructions=system_prompt,
            input=text,
        )
        response_text = response.output_text
        response_data = json.loads(response_text)
        prediction = response_data.get("prediction", "Unknown")
        confidence = response_data.get("confidence", 0)
        return prediction, confidence
    except json.JSONDecodeError:
        print("Failed to parse JSON response.")
        return "Unknown", 0

# Evaluate the model and store predictions with a progress bar
test_df[['OpenAi_Prediction', 'Confidence']] = [
    safe_predict(row['sentence'], unique_labels) for _, row in tqdm(test_df.iterrows(), desc="Evaluating", total=len(test_df))
]

# Save the updated dataset with predictions
test_df.to_csv('/Users/amogh/Documents/amogh/personal/Tifin_Test_1/data/sofmattress_with_openai_predictions-1.csv', index=False)

# Print the first few rows of the updated test dataset
print(test_df.head())

Evaluating:   0%|          | 0/328 [00:00<?, ?it/s]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   0%|          | 1/328 [00:01<09:03,  1.66s/it]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   1%|          | 2/328 [00:02<06:59,  1.29s/it]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   1%|          | 3/328 [00:03<05:54,  1.09s/it]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   1%|          | 4/328 [00:04<05:39,  1.05s/it]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   2%|▏         | 5/328 [00:05<05:19,  1.01it/s]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
Evaluating:   2%|▏         | 6/328 [00:06<04:47,  1.12it/s]httpx - INFO - HTTP Request: POST https://api.openai.com/v1/respons

                                         sentence label OpenAi_Prediction  \
0                    You guys provide EMI option?   EMI               EMI   
1  Do you offer Zero Percent EMI payment options?   EMI               EMI   
2                                         0% EMI.   EMI               EMI   
3                                             EMI   EMI               EMI   
4                           I want in installment   EMI               EMI   

  Confidence  
0         10  
1         10  
2         10  
3         10  
4         10  





#### Results

- *GPT-4.1-nano* : Achieved an accuracy of 0.875. This was a notable increase compared to previous attempts, demonstrating the effectiveness of a more focused and clear prompt.

- *GPT-4.1* : Reached an accuracy of 0.9, further highlighting the benefits of simplification in prompt design.

In [99]:
# Calculate metrics for the OpenAiIntentClassifier
y_true = test_df['label'].tolist()
y_pred = test_df['OpenAi_Prediction'].tolist()

print("OpenAiIntentClassifier Model Metrics:")
print("-------------------------------------")
print(classification_report(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))


OpenAiIntentClassifier Model Metrics:
-------------------------------------
                       precision    recall  f1-score   support

100_NIGHT_TRIAL_OFFER       1.00      0.89      0.94        18
   ABOUT_SOF_MATTRESS       0.56      0.82      0.67        11
         CANCEL_ORDER       1.00      1.00      1.00        10
        CHECK_PINCODE       1.00      1.00      1.00        10
                  COD       1.00      1.00      1.00        12
           COMPARISON       0.71      0.91      0.80        11
    DELAY_IN_DELIVERY       0.92      1.00      0.96        11
         DISTRIBUTORS       1.00      0.94      0.97        34
                  EMI       1.00      1.00      1.00        25
        ERGO_FEATURES       1.00      0.82      0.90        11
             LEAD_GEN       0.95      1.00      0.98        21
        MATTRESS_COST       0.91      0.95      0.93        22
               OFFERS       0.91      1.00      0.95        10
         ORDER_STATUS       1.00      0.9

### Insights

The key insight from this experiment is that over-engineering prompts can sometimes detract from their effectiveness.

By simplifying and focusing the prompt, I was able to achieve significant gains in model performance. A well-crafted, straightforward prompt can enhance the model's ability to understand and classify intents accurately.

This underscores the importance of prompt engineering, where clarity and simplicity can lead to better results.

