# Generative AI and Prompt Engineering
## A program by IISc and TalentSprint
### Mini-Project: Text Classification

## Problem Statement

Identify the intent of a user question using a fine-tuned BERT model.

## Learning Objectives

At the end of the mini-project, you will be able to:

* Read the intent, questions and responses data
* Load a pre-trained BERT model
* Fine-tune the BERT model
* Get the predictions for each question

## Overview

The intent identification problem is framed as a text classification task, where a BERT model is trained to classify intent. Once the model is fine-tuned, a conversation tool is set up. For each user question, the model first predicts the intent, and a response is selected from a predefined set of responses corresponding to the predicted intent as the answer to the input question.

## Dataset

The dataset contains several intent classes. Each intent comes with a set of example questions that fall under it and a pool of suitable responses.
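The cells below read the file through the keys `intents`, `intent`, `text`, and `responses`, so the layout can be sketched roughly as follows. The two entries are abridged from the output printed later in the notebook, the real file may hold additional keys, and `validate_intents` is a hypothetical helper rather than part of the assignment.

```python
# Abridged sketch of the Intent.json layout, inferred from the keys the notebook reads.
sample = {
    "intents": [
        {
            "intent": "Greeting",
            "text": ["Hi", "Hi there", "Hola"],
            "responses": ["Hi human, please tell me your GeniSys user"],
        },
        {
            "intent": "GreetingResponse",
            "text": ["My user is Adam", "This is Adam"],
            "responses": ["Great! Hi <HUMAN>! How can I help?"],
        },
    ]
}

def validate_intents(data):
    """Hypothetical helper: return the intent names, raising if a key is missing."""
    names = []
    for entry in data["intents"]:
        for key in ("intent", "text", "responses"):
            if key not in entry:
                raise KeyError(f"intent entry is missing '{key}'")
        names.append(entry["intent"])
    return names

print(validate_intents(sample))
```

Running a check like this right after loading the file surfaces a malformed entry early, before tokenization or training starts.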

**NOTE:** Please refer to previous assignments to implement the steps related to tokenization and data processing. Additionally, check out the references given at the bottom of the notebook to implement code for other sections of this notebook.

## Grading = 10 Points

In [1]:
# prompt: Create a hidden code cell with #@title Download the Dataset. Data should be downloaded from the following link: https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json

#@title Download the Dataset
!wget https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json

--2024-09-22 04:07:06--  https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json
Resolving cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)... 172.105.52.210
Connecting to cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)|172.105.52.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69866 (68K) [application/json]
Saving to: ‘Intent.json.2’


2024-09-22 04:07:06 (665 KB/s) - ‘Intent.json.2’ saved [69866/69866]



### Import Necessary Packages

In [2]:
# Please feel free to add/remove installations here

# Initial Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

### Read the Intent, Questions, and Response Data (1 point)

In [3]:
# Path to the JSON file
file_path = '/content/Intent.json'

# Reading the JSON file
with open(file_path, 'r') as file:
    data = json.load(file)

# Extracting intents, questions, and responses
for intent_data in data['intents']:
    intent = intent_data['intent']
    questions = intent_data['text']
    responses = intent_data['responses']

    print(f"Intent: {intent}")
    print("Questions:")
    for question in questions:
        print(f"- {question}")

    print("Responses:")
    for response in responses:
        print(f"- {response}")

    print("\n" + "="*50 + "\n")

Intent: Greeting
Questions:
- Hi
- Hi there
- Hola
- Hello
- Hello there
- Hya
- Hya there
Responses:
- Hi human, please tell me your GeniSys user
- Hello human, please tell me your GeniSys user
- Hola human, please tell me your GeniSys user


Intent: GreetingResponse
Questions:
- My user is Adam
- This is Adam
- I am Adam
- It is Adam
- My user is Bella
- This is Bella
- I am Bella
- It is Bella
Responses:
- Great! Hi <HUMAN>! How can I help?
- Good! Hi <HUMAN>, how can I help you?
- Cool! Hello <HUMAN>, what can I do for you?
- OK! Hola <HUMAN>, how can I help you?
- OK! hi <HUMAN>, what can I do for you?


Intent: CourtesyGreeting
Questions:
- How are you?
- Hi how are you?
- Hello how are you?
- Hola how are you?
- How are you doing?
- Hope you are doing well?
- Hello hope you are doing well?
Responses:
- Hello, I am great, how are you? Please tell me your GeniSys user
- Hello, how are you? I am great thanks! Please tell me your GeniSys user
- Hello, I am good thank you, how are yo

# Explanation:

- This code loads the JSON data from the Intent.json file.
- It loops through each intent, extracts the associated text (questions) and responses, and prints them out.
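For training, the nested structure is easier to work with as flat (question, intent) pairs. The sketch below shows one way to do that and to count the examples per intent. `flatten_intents` and `toy_data` are illustrative names, and the toy data is an abridged stand-in for the loaded file.

```python
from collections import Counter

def flatten_intents(data):
    """Return a list of (question, intent) pairs, one per example question."""
    return [
        (question, entry["intent"])
        for entry in data["intents"]
        for question in entry["text"]
    ]

# Toy stand-in for the loaded JSON (abridged from the output above).
toy_data = {
    "intents": [
        {"intent": "Greeting", "text": ["Hi", "Hello", "Hola"], "responses": []},
        {"intent": "CourtesyGreeting", "text": ["How are you?"], "responses": []},
    ]
}

pairs = flatten_intents(toy_data)
counts = Counter(intent for _, intent in pairs)
print(pairs[0])
print(counts)
```

The per-intent counts are worth a glance before training: with only a handful of questions per class, a few epochs may not be enough for the classifier to separate them.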


### Tokenize the Questions (1 point)

In [12]:
from transformers import BertTokenizer
# Load the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Initialize lists to hold tokenized questions and corresponding intent labels
tokenized_inputs = []
attention_masks = []
intent_labels = []
intent_label_map = {}  # A dictionary to map intent names to label IDs

# Create a mapping for intents to label IDs
for idx, intent_data in enumerate(data['intents']):
    intent = intent_data['intent']
    intent_label_map[intent] = idx


# Tokenize questions and prepare input data
for intent_data in data['intents']:
    intent = intent_data['intent']  # Same key used to build intent_label_map above
    questions = intent_data.get('text', [])

    for question in questions:
        # Tokenize and encode the question using BERT tokenizer
        encoding = tokenizer.encode_plus(
            question,
            add_special_tokens=True,  # Add [CLS] and [SEP]
            max_length=20,            # Set max length (adjust as necessary)
            padding='max_length',     # Pad to max_length
            truncation=True,          # Truncate to max_length
            return_attention_mask=True, # Include attention mask
            return_tensors='pt'       # Return as PyTorch tensors
        )

        tokenized_inputs.append(encoding['input_ids'])
        attention_masks.append(encoding['attention_mask'])
        intent_labels.append(intent_label_map[intent])



In [13]:
intent_label_map

{'Greeting': 0,
 'GreetingResponse': 1,
 'CourtesyGreeting': 2,
 'CourtesyGreetingResponse': 3,
 'CurrentHumanQuery': 4,
 'NameQuery': 5,
 'RealNameQuery': 6,
 'TimeQuery': 7,
 'Thanks': 8,
 'NotTalking2U': 9,
 'UnderstandQuery': 10,
 'Shutup': 11,
 'Swearing': 12,
 'GoodBye': 13,
 'CourtesyGoodBye': 14,
 'WhoAmI': 15,
 'Clever': 16,
 'Gossip': 17,
 'Jokes': 18,
 'PodBayDoor': 19,
 'PodBayDoorResponse': 20,
 'SelfAware': 21}

# Explanation:

## 1. Intent Label Mapping:
- We create a dictionary intent_label_map to assign a unique label to each intent. The intent label is an integer, which is required for most machine learning models.

## 2. Tokenization and Padding:
- Each question is tokenized using BertTokenizer's encode_plus method, which also adds special tokens ([CLS] and [SEP]), and it pads/truncates the input sequence to a fixed maximum length (e.g., 20 tokens).

## 3. Attention Mask:
- The attention mask is used by BERT to differentiate between actual tokens and padded tokens. This is crucial for models like BERT to ignore padded positions.
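To make the padding and attention-mask step concrete, here is a plain-Python sketch of what happens to the token IDs after tokenization. It is a simplification, not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, and unlike the real tokenizer it does not re-append `[SEP]` when a sequence is truncated. The IDs for "Hi" are taken from the sample batch printed in the next section.

```python
PAD_ID = 0  # ID used for padding positions (the zeros in the sample batch)

def pad_and_mask(token_ids, max_length=20, pad_id=PAD_ID):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# "Hi" encoded as [CLS]=101, hi=7632, [SEP]=102
input_ids, attention_mask = pad_and_mask([101, 7632, 102], max_length=8)
print(input_ids)       # [101, 7632, 102, 0, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Because the longest questions in this dataset are short, a small `max_length` such as 20 keeps the tensors compact without truncating real tokens.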


### Create the Train Data with Tokenized Questions and Intent Labels (1 point)

In [14]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert lists into torch tensors
tokenized_inputs = torch.cat(tokenized_inputs, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
intent_labels = torch.tensor(intent_labels)

# Create a TensorDataset for input_ids, attention_masks, and labels
train_dataset = TensorDataset(tokenized_inputs, attention_masks, intent_labels)

# Create a DataLoader to batch the data (used by the training loop below).
# Shuffling is off, so batches follow the dataset order.
batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size)

# Print a sample batch (for validation)
for batch in train_dataloader:
    input_ids, attention_mask, labels = batch
    print(f"Input IDs: {input_ids}")
    print(f"Attention Mask: {attention_mask}")
    print(f"Labels: {labels}")
    break  # Print only the first batch

Input IDs: tensor([[  101,  7632,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  7632,  2045,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  7570,  2721,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  7592,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  7592,  2045,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1044,  3148,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1044,  3148,  2045

## 1. Tensor Conversion:
- After tokenizing and preparing the data, we convert the lists into PyTorch tensors (`tokenized_inputs`, `attention_masks`, and `intent_labels`).

## 2. TensorDataset:
- The TensorDataset is created by combining `tokenized_inputs`, `attention_masks`, and `intent_labels`, making it easy to use with PyTorch’s DataLoader for batching.
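Because the questions are stored grouped by intent and the DataLoader above is created without `shuffle=True`, each batch covers only a few neighbouring intents. Passing `shuffle=True` to `DataLoader` is the usual remedy for training data. The plain-Python sketch below only illustrates the effect of shuffled batching; `iter_batches` is a hypothetical helper, not how PyTorch implements it.

```python
import random

def iter_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of at most batch_size examples, optionally in shuffled order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

# Ten toy labels grouped by class, mimicking the notebook's dataset order.
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
ordered = list(iter_batches(labels, batch_size=4))
shuffled = list(iter_batches(labels, batch_size=4, shuffle=True, seed=0))
print(ordered)   # [[0, 0, 0, 0], [1, 1, 1, 2], [2, 2]]
print(shuffled)  # same labels, mixed across batches
```

With ordered batches, consecutive gradient steps see only one or two classes at a time, which can slow convergence on a 22-class problem.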

### Load a Pre-Trained BERT Model (1 point)

In [15]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the pre-trained BERT model for sequence classification
# Set the number of output classes to the number of intents (22 in this dataset)
num_labels = len(intent_label_map)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# Move model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Print model architecture (optional)
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

# Explanation:

## 1. BERT Tokenizer:

- The cell imports `BertTokenizer` again, but no new tokenizer is created. The `bert-base-uncased` tokenizer loaded in the tokenization step is reused to tokenize the input before feeding it into the model.

## 2. BERT Model for Sequence Classification:

- We use BertForSequenceClassification, which is a pre-trained BERT model with an added classification layer on top. This is suited for tasks like intent classification, sentiment analysis, etc.
- `num_labels` specifies the number of output classes (equal to the number of intents you have in the `intent_label_map`).

## 3. Move to GPU:

- If a GPU is available, we move the model to the GPU using `model.to(device)` to speed up training.

## 4. Model Architecture:

- Printing the model gives you insight into the architecture. The final layer is a classification head with the specified number of output labels.
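The warning printed on loading names `classifier.weight` and `classifier.bias` as the only newly initialized parameters. Assuming the head is a single linear layer from the 768-dimensional encoder output to one logit per intent (the printout above is cut off before the head, so this is inferred from the warning and the layer sizes), its size works out as follows.

```python
hidden_size = 768   # feature size of the encoder layers in the printout above
num_labels = 22     # len(intent_label_map)

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_weights = hidden_size * num_labels
head_biases = num_labels
print(head_weights + head_biases)  # 16918
```

Those parameters start from random values, which is why the model needs fine-tuning before its predictions mean anything.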

### Prepare the Model to Fine-Tune (1 point)

In [16]:
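from torch.optim import AdamW  # The AdamW class in transformers is deprecated
from transformers import get_linear_schedule_with_warmup

# Prepare optimizer and learning rate scheduler
# weight_decay=0.0 matches the default of the deprecated transformers AdamW
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8, weight_decay=0.0)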

# Number of training epochs and total steps
epochs = 10  # You can adjust the number of epochs
total_steps = len(train_dataloader) * epochs

# Create a learning rate scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,          # Number of warmup steps (optional)
    num_training_steps=total_steps  # Total number of training steps
)

# Define the loss function. It is not used in the training loop below, because
# the model computes the cross-entropy loss itself when `labels` are passed.
loss_fn = torch.nn.CrossEntropyLoss()

# Explanation:

## 1. Optimizer:

- The `AdamW` optimizer is used, which is a common choice for fine-tuning BERT. A learning rate (`lr`) of `2e-5` is a typical starting point for BERT fine-tuning.

## 2. Scheduler:

- We use a learning rate scheduler (`get_linear_schedule_with_warmup`) to gradually reduce the learning rate during training, which can help with model convergence.
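The shape of this schedule can be sketched in a few lines: a linear ramp up during warmup, then a linear decay to zero. `linear_schedule_factor` is a hypothetical stand-alone function written for illustration, and `total_steps = 100` is an arbitrary value (the notebook uses `len(train_dataloader) * epochs`).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 100  # illustrative only
for step in (0, 50, 100):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in the cell above, the learning rate starts at its full value and shrinks from the first step, reaching zero at the final step.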

### Train the Model using the Tokenized Questions and Intent Labels (1 point)

In [17]:
# Function to train the model for one epoch
def train_epoch(model, dataloader, optimizer, scheduler, device):
    model.train()  # Set model to training mode
    total_loss = 0

    for batch in dataloader:
        # Get batch data
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)

        # Clear gradients from the previous step
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        # Update the learning rate
        scheduler.step()

        total_loss += loss.item()

    # Return average loss over the training data
    avg_loss = total_loss / len(dataloader)
    return avg_loss

# Training loop
for epoch in range(epochs):
    print(f"Epoch {epoch+1}/{epochs}")

    # Train the model for one epoch
    avg_train_loss = train_epoch(model, train_dataloader, optimizer, scheduler, device)

    print(f"Training loss: {avg_train_loss:.4f}")

Epoch 1/10
Training loss: 3.1455
Epoch 2/10
Training loss: 3.0228
Epoch 3/10
Training loss: 2.9377
Epoch 4/10
Training loss: 2.8552
Epoch 5/10
Training loss: 2.7419
Epoch 6/10
Training loss: 2.6402
Epoch 7/10
Training loss: 2.5781
Epoch 8/10
Training loss: 2.5238
Epoch 9/10
Training loss: 2.4783
Epoch 10/10
Training loss: 2.4349


## Training Loop:

- The training loop runs for `epochs` iterations, and in each epoch:
  - The model is set to training mode (model.train()).
  - We loop through the train_dataloader, get each batch, and perform a forward pass.
  - We compute the loss, backpropagate (loss.backward()), and update the model's parameters (optimizer.step()).
  - After each batch, the learning rate is updated with the scheduler (scheduler.step()).
  - The average loss for the epoch is calculated and printed.
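A useful reference point for reading the training log is the loss of a model that guesses uniformly at random. For cross-entropy with 22 classes that is log(22). The short calculation below compares it with the first and last epoch losses reported above.

```python
import math

num_classes = 22  # len(intent_label_map)

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
chance_loss = math.log(num_classes)
print(f"{chance_loss:.4f}")  # 3.0910

first_epoch_loss = 3.1455  # from the training log above
last_epoch_loss = 2.4349
print(last_epoch_loss < chance_loss < first_epoch_loss)  # True
```

The final loss is still fairly close to the chance level, which suggests the model has not converged after 10 epochs. Training for more epochs, shuffling the batches, or trying a somewhat higher learning rate are plausible things to experiment with, though none of these were tested here.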

In [18]:
# Save the fine-tuned model (optional)
model.save_pretrained('./fine_tuned_bert_model')
tokenizer.save_pretrained('./fine_tuned_bert_model')

('./fine_tuned_bert_model/tokenizer_config.json',
 './fine_tuned_bert_model/special_tokens_map.json',
 './fine_tuned_bert_model/vocab.txt',
 './fine_tuned_bert_model/added_tokens.json')

# Saving the Model:

- After fine-tuning, the model and tokenizer are saved using save_pretrained, which allows for easy reloading of the fine-tuned model later.
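Note that `intent_label_map` lives only in the notebook's memory. Since the notebook never passes the intent names to the model configuration, the saved files presumably do not contain them, and a reloaded model would return bare label IDs. One simple option, sketched below, is to store the map as JSON next to the model. The file name `intent_label_map.json` and both helper functions are assumptions made for this example.

```python
import json
import os
import tempfile

def save_label_map(label_map, directory, filename="intent_label_map.json"):
    """Write the intent-to-ID mapping as JSON and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(label_map, f, indent=2)
    return path

def load_label_map(path):
    """Read the mapping back; the string keys and integer values survive JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round-trip check in a temporary directory with an abridged map.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_label_map({"Greeting": 0, "Jokes": 18}, tmp)
    restored = load_label_map(saved_path)
print(restored)
```

In the notebook this would be called as `save_label_map(intent_label_map, './fine_tuned_bert_model')` right after the two `save_pretrained` calls.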

### Create a Function to get the Predictions for each Question (1 point)

In [19]:
# Set the model to evaluation mode
model.eval()

# Function to predict intent for a list of questions
def predict_intent(questions, model, tokenizer, device, intent_label_map):
    inputs = []
    attention_masks = []

    # Tokenize each question
    for question in questions:
        encoding = tokenizer.encode_plus(
            question,
            add_special_tokens=True,  # Add [CLS] and [SEP]
            max_length=20,            # Set max length (adjust as necessary)
            padding='max_length',     # Pad to max_length
            truncation=True,          # Truncate to max_length
            return_attention_mask=True, # Include attention mask
            return_tensors='pt'       # Return as PyTorch tensors
        )
        inputs.append(encoding['input_ids'])
        attention_masks.append(encoding['attention_mask'])

    # Convert lists to tensors and move to device
    inputs = torch.cat(inputs).to(device)
    attention_masks = torch.cat(attention_masks).to(device)

    # Get predictions (no gradient required)
    with torch.no_grad():
        outputs = model(inputs, attention_mask=attention_masks)

    # Get the predicted labels (logits are the raw predictions)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1)

    # Convert label IDs to intent names
    label_id_to_intent = {v: k for k, v in intent_label_map.items()}  # Reverse the label map
    predicted_intents = [label_id_to_intent[pred.item()] for pred in predictions]

    return predicted_intents

# Example usage
questions = [
    "Hi, how are you?",
    "What is your name?",
    "Tell me a joke.",
]

# Predict intents for the questions
predicted_intents = predict_intent(questions, model, tokenizer, device, intent_label_map)

# Print the results
for question, intent in zip(questions, predicted_intents):
    print(f"Question: {question} -> Predicted Intent: {intent}")

Question: Hi, how are you? -> Predicted Intent: CourtesyGreeting
Question: What is your name? -> Predicted Intent: RealNameQuery
Question: Tell me a joke. -> Predicted Intent: Jokes


# Explanation:

## 1. Model and Tokenizer Loading:

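  - The cell reuses the fine-tuned `BertForSequenceClassification` model and the `BertTokenizer` that are already in memory; nothing is reloaded from disk. The model is switched to evaluation mode with `model.eval()`, which disables dropout.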

## 2. Prediction Function (predict_intent):

  - This function accepts a list of questions, the model, tokenizer, device (GPU/CPU), and the intent_label_map.
  - It tokenizes each question and prepares them for input to the model.
  - The inputs are concatenated into tensors, and we run a forward pass through the model without calculating gradients (torch.no_grad()).
  - The raw logits (model outputs) are converted into predicted labels using torch.argmax, which returns the class with the highest score.
  - The predicted label IDs are then mapped back to the corresponding intent names using the intent_label_map.

## 3. Example Usage:

  - The questions list contains sample questions, and the function predicts the corresponding intents using the fine-tuned model.
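`torch.argmax` always returns some intent, even when the model is unsure. A possible refinement, sketched here in plain Python on toy logits, is to turn the logits into probabilities with softmax and fall back to a placeholder label below a confidence threshold. The threshold of 0.5, the `"Unknown"` label, and the function names are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits, id_to_intent, threshold=0.5, fallback="Unknown"):
    """Return (intent, probability); use the fallback label below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # same role as argmax
    if probs[best] < threshold:
        return fallback, probs[best]
    return id_to_intent[best], probs[best]

id_to_intent = {0: "Greeting", 1: "Jokes", 2: "Thanks"}  # toy label map
confident = predict_with_confidence([4.0, 0.5, 0.1], id_to_intent)
unsure = predict_with_confidence([0.2, 0.1, 0.0], id_to_intent)
print(confident)
print(unsure)
```

The first call has one clearly dominant logit and returns `Greeting`; the second has nearly equal logits, so its top probability falls below the threshold and the fallback label is returned.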

### Create a Function to Choose the Response based on the Intent Prediction (1 point)

In [20]:
import random

# Function to choose a response for each predicted intent
def predict_intent_and_choose_response(predicted_intents, intents_data):
    # Choose a response for each predicted intent
    responses = []
    for intent in predicted_intents:
        # Find the intent in the intents_data
        for intent_data in intents_data:
            if intent_data['intent'] == intent:
                # Choose a random response from the available responses
                if intent_data['responses']:
                    response = random.choice(intent_data['responses'])
                else:
                    response = "Sorry, I don't have a response for this."
                responses.append(response)
                break

    return responses

# Assuming intents_data is loaded from the Intent.json file
intents_data = data['intents']

# Predict intents and choose responses
chosen_responses = predict_intent_and_choose_response(
    predicted_intents, intents_data
)

# Print the results
for question, intent, response in zip(questions, predicted_intents, chosen_responses):
    print(f"Question: {question} -> Predicted Intent: {intent}")
    print(f"Chosen Response: {response}")
    print("-" * 50)

Question: Hi, how are you? -> Predicted Intent: CourtesyGreeting
Chosen Response: Hello, I am good thank you, how are you? Please tell me your GeniSys user
--------------------------------------------------
Question: What is your name? -> Predicted Intent: RealNameQuery
Chosen Response: GeniSys
--------------------------------------------------
Question: Tell me a joke. -> Predicted Intent: Jokes
Chosen Response: A famous blues musician died. His tombstone bore the inscription, 'Didn't wake up this morning...'
--------------------------------------------------


# Explanation:

## 1. Choosing a Response:

  - After predicting the intent, we loop through the intents_data (from the Intent.json file) to find the matching intent.
  - Once the correct intent is found, we randomly choose one of the responses available for that intent. This adds some variety to the responses, rather than always returning the same response.

## 2. Handling Edge Cases:

  - If no responses are available for the intent, a fallback message is provided ("Sorry, I don't have a response for this.").

## 3. Example Usage:

  - The function is called with the list of predicted intents and `intents_data`. The loop that follows prints each question with its predicted intent and the chosen response.
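The nested loop scans every intent for each prediction and silently skips a prediction that matches no intent, which would misalign questions and responses. A dictionary lookup avoids both issues. The sketch below is one possible variant under those assumptions: the function names are hypothetical, and `EmptyIntent` and `NotInFile` are made-up intents used only to exercise the fallback.

```python
import random

FALLBACK = "Sorry, I don't have a response for this."

def build_response_table(intents_data):
    """Map each intent name to its list of candidate responses."""
    return {entry["intent"]: entry["responses"] for entry in intents_data}

def choose_response(intent, table, rng=random, fallback=FALLBACK):
    """Pick a random response for the intent, or the fallback if none exists."""
    candidates = table.get(intent) or []
    return rng.choice(candidates) if candidates else fallback

toy_intents = [
    {"intent": "RealNameQuery", "responses": ["GeniSys"]},
    {"intent": "EmptyIntent", "responses": []},  # made-up intent without responses
]
table = build_response_table(toy_intents)
print(choose_response("RealNameQuery", table))  # GeniSys
print(choose_response("EmptyIntent", table))    # fallback message
print(choose_response("NotInFile", table))      # fallback message
```

Building the table once and reusing it keeps each lookup constant-time and guarantees exactly one response per predicted intent.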

### Connect the above 2 Functions to take a Question from the User and Respond with Intent and the Answer (2 points)

In [23]:
def get_response(questions, model, tokenizer, device, intent_label_map, intents_data):
    """Predict the intent of each question and choose a matching response."""
    # Predict intents for the questions
    predicted_intents = predict_intent(questions, model, tokenizer, device, intent_label_map)
    # Choose a response for each predicted intent
    responses = predict_intent_and_choose_response(predicted_intents, intents_data)
    # Return the full list of responses, one per question
    return predicted_intents, responses

# Example usage
questions = [
    "Hi, how are you?",
    "What is your name?",
    "Tell me a joke.",
]

# Predict intents and choose responses for the questions
predicted_intents, responses = get_response(
    questions, model, tokenizer, device, intent_label_map, intents_data
)

# Print the results (responses are sampled at random, so they vary between runs)
for question, intent, response in zip(questions, predicted_intents, responses):
    print(f"Question: {question} -> Predicted Intent: {intent}")
    print(f"Chosen Response: {response}")
    print("-" * 50)


## References

1. https://huggingface.co/blog/bert-101
2. https://huggingface.co/docs/transformers/en/model_doc/bert
3. https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb