Chatbot using T5-Base Model

In [None]:
!pip install transformers torch pandas openpyxl



In [None]:
import pandas as pd
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

# Load the dataset (read_excel already returns a DataFrame)
data = pd.read_excel('/content/Emotion_DS-3E.xlsx')
# Display the first few rows of the dataset
print(data.head())

      ID Type                                          Utterance Dialogue_Act  \
0  194_0    T  Hi. Alvina, how are you doing today? It's good...           gt   
1  194_1    P                                    I'm just tired.           gt   
2  194_2    T                                        just tired?          crq   
3  194_3    P                                               Yeah           cd   
4  194_4    T  you know, we did some pre visit planning with ...      gc, irq   

   Emotion  
0        0  
1       -1  
2        0  
3       -1  
4        0  


1) Define the format_data function: formats each row of the DataFrame into the string used as model input. It labels the speaker as Therapist or Patient and appends the utterance, dialogue act, and emotion.

2) Generate input text: data['input_text'] = data.apply(...) applies format_data to every row and stores the result in a new column, input_text.

3) Filter patient and therapist lines: patient_turns selects the rows where the patient speaks (Type is 'P') and resets the index. therapist_responses selects the therapist rows (Type is 'T'), shifts them up by one row, and resets the index, so that the i-th patient turn lines up with the therapist turn that follows it. This positional pairing assumes the speakers strictly alternate, starting with the therapist.

4) Create the training DataFrame: training_data holds the inputs and the corresponding therapist responses. dropna() removes rows where either the input or the target is missing.

5) Convert to lists: inputs and targets are lists of strings that are fed to the model for training.

In [None]:
# Prepare input-output pairs for the model
def format_data(row):
    speaker = "Therapist" if row['Type'] == 'T' else "Patient"
    return f"{speaker}: {row['Utterance']} Dialogue Act: {row['Dialogue_Act']}, Emotion: {row['Emotion']}"

# Generate input text for each row
data['input_text'] = data.apply(format_data, axis=1)

# Filter pairs where the patient speaks and the therapist responds
patient_turns = data[data['Type'] == 'P'].reset_index()
# Shift therapist responses to align with patient inputs
therapist_responses = data[data['Type'] == 'T'].shift(-1).reset_index()

# Create the final dataset for training
training_data = pd.DataFrame({
    'input_text': patient_turns['input_text'],
    'target_text': therapist_responses['Utterance']
}).dropna()

# Convert to lists for model processing
inputs = training_data['input_text'].tolist()
targets = training_data['target_text'].tolist()
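
The positional pairing above only lines up correctly when patient and therapist turns strictly alternate. As a hedged alternative, the sketch below pairs a patient row with the next row only when that next row is a therapist turn. The helper name pair_adjacent_turns and the toy rows are made up for illustration; the column names follow the dataset shown earlier.

```python
import pandas as pd

def pair_adjacent_turns(df: pd.DataFrame) -> pd.DataFrame:
    """Pair each patient row with the therapist row that directly follows it."""
    nxt = df.shift(-1)  # row i now holds the values of original row i + 1
    mask = (df["Type"] == "P") & (nxt["Type"] == "T")
    return pd.DataFrame({
        "input_text": df.loc[mask, "input_text"],
        "target_text": nxt.loc[mask, "Utterance"],
    }).reset_index(drop=True)

# Toy frame (made-up rows) containing two consecutive patient turns
toy = pd.DataFrame({
    "Type": ["T", "P", "P", "T", "P"],
    "Utterance": ["Hello", "Hi", "I'm tired", "Just tired?", "Yeah"],
})
toy["input_text"] = "Patient: " + toy["Utterance"]

pairs = pair_adjacent_turns(toy)
print(pairs)
```

Patient turns that are not directly followed by a therapist turn are dropped rather than paired with a later, unrelated response. Rows from different conversations can still be adjacent in the file, so a check on the conversation identifier may also be needed.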

In [None]:
# Load T5 model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
# Tokenize inputs and targets, truncating anything longer than max_length tokens
max_length = 512  # or any appropriate length for your task

train_encodings = tokenizer(inputs, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
target_encodings = tokenizer(targets, padding=True, truncation=True, max_length=max_length, return_tensors="pt")


# Prepare dataset for PyTorch
class ChatbotDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

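    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them
        # directly; wrapping them in torch.tensor() again raises a UserWarning.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item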

    def __len__(self):
        return len(self.labels['input_ids'])

# Create dataset
dataset = ChatbotDataset(train_encodings, target_encodings)
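
Because the targets are padded to a common length, the labels contain padding token ids, and the loss is computed on those positions too. Hugging Face sequence-to-sequence models skip label positions set to -100, so a common adjustment is to replace padding ids in the labels with that value. The sketch below shows the idea on plain lists of made-up token ids; mask_padding is a hypothetical helper, not part of the notebook, and the losses logged in this notebook were produced without this masking.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Made-up token ids; the T5 tokenizer uses 0 for padding and 1 for </s>
batch = [[363, 19, 1, 0, 0], [8774, 1, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
```

With the tensors used in this notebook, the equivalent step would be to set the entries of target_encodings['input_ids'] that equal tokenizer.pad_token_id to -100 before building the dataset.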

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",          # Output directory for checkpoints and logs
    evaluation_strategy="steps",     # Evaluation is done every 'eval_steps'
    per_device_train_batch_size=8,   # Adjust batch size based on your hardware
    per_device_eval_batch_size=8,
    num_train_epochs=5,              # Start with 3-5 epochs, adjust based on results
    logging_dir="./logs",            # Directory for logs
    logging_steps=500,
    save_steps=1000,
    eval_steps=500,
    warmup_steps=500,
    weight_decay=0.01,               # Regularization term
    save_total_limit=3,              # Maximum number of saved checkpoints
    learning_rate=3e-4,              # Start with this, can be tuned
    load_best_model_at_end=True,     # Load best checkpoint after training
)
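
A note on warmup_steps=500: the run logged later in this notebook finishes at global_step=615, so the learning rate spends most of training ramping up. The sketch below approximates the Trainer's default linear schedule (linear warmup, then linear decay to zero) to make that visible. linear_schedule_lr is a hypothetical helper written for illustration, not a transformers function.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values taken from this notebook: lr=3e-4, warmup_steps=500, 615 total steps
for step in (100, 500, 615):
    print(step, linear_schedule_lr(step, 3e-4, 500, 615))
```

Under these settings the peak rate of 3e-4 is reached only at step 500, and it decays to zero over the remaining 115 steps. A shorter warmup may suit a run of this length better.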



In [None]:
# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,     # Training dataset
    eval_dataset=dataset,      # Reusing the training dataset for evaluation
    tokenizer=tokenizer,             # Tokenizer
)

# Train the model
trainer.train()

Step  Training Loss  Validation Loss
 500         1.4238         0.205095


TrainOutput(global_step=615, training_loss=1.2008589612759226, metrics={'train_runtime': 1230.6708, 'train_samples_per_second': 3.99, 'train_steps_per_second': 0.5, 'total_flos': 2260006865049600.0, 'train_loss': 1.2008589612759226, 'epoch': 5.0})
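
The validation loss above is computed on the same examples used for training, so it does not show how well the model generalizes. A minimal sketch of holding out part of the data follows. split_indices, the 20% fraction, and the seed are illustrative choices, not part of the original notebook.

```python
import random

def split_indices(n_examples, val_fraction=0.1, seed=42):
    """Shuffle example indices and split them into train and validation lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_val = max(1, int(n_examples * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(10, val_fraction=0.2)
print("train:", sorted(train_idx))
print("val:  ", sorted(val_idx))
```

The two index lists could then be wrapped with torch.utils.data.Subset(dataset, train_idx) and torch.utils.data.Subset(dataset, val_idx) and passed to the Trainer as train_dataset and eval_dataset respectively.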

Define the generate_response function:

This function takes an input_text, tokenizes it, generates a response with the model, and decodes the output back into human-readable text.
input_ids: the tokenized input text as a tensor, moved to the model's device.
model.generate(...): generates the response token IDs from the input.
tokenizer.decode(...): converts the generated token IDs back into text, dropping special tokens such as padding and end-of-sequence markers.

In [None]:
def generate_response(input_text):
    # Check the device of the model (GPU or CPU)
    device = model.device

    # Tokenize the input text and move it to the same device as the model
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

    # Generate the model's output
    outputs = model.generate(input_ids)

    # Decode the generated output into text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

# Example input
test_input = "Patient: I'm feeling anxious about my new job. Dialogue Act: cd, Emotion: -1"
print("Therapist:", generate_response(test_input))



Therapist: So you're feeling anxious about your job?


In [None]:
# Function to generate response
def generate_response(input_text):
    # Use the device the model is already on (GPU or CPU)
    device = model.device

    # Tokenize the input text and move it to the device
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

    # Generate the model's output with beam search, capped at 100 tokens
    outputs = model.generate(input_ids, max_length=100, num_beams=4, early_stopping=True)

    # Decode the generated output into text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Conversation loop
print("Chatbot: Hi! How can I help you today?")

while True:
    # Get input from the user
    user_input = input("You: ")

    # If the user types 'exit', end the conversation
    if user_input.lower() == "exit":
        print("Chatbot: Goodbye!")
        break

    # Generate chatbot response
    chatbot_response = generate_response(user_input)
    print(f"Chatbot: {chatbot_response}")

Chatbot: Hi! How can I help you today?
You: I'm feeling anxious about my new job.
Chatbot: 
You: hi
Chatbot: Hi,
You: I am not feeling well
Chatbot: I'm sorry to hear that. I'm sorry to hear that.
You: please can u help me
Chatbot: ???
You: I am missing my home so much.I feel like crying.
Chatbot: hmmm hmmmm hmm
You: I am very sad,depressed and feeling hopeless
Chatbot: i'm sad to hear you're feeling this way.
You: please can u provide me some techniques to stay calm and composed.
Chatbot: Please, can you please can u provide me some techniques to stay calm and composed.
You: I am not feeling like living.I feel i am worthless.
Chatbot: .
You: I am happy and satisfied in my life
Chatbot: I'm happy and satisfied in my life. I'm happy and satisfied in my life.
You: bye.
Chatbot: bye
You: exit
Chatbot: Goodbye!
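
The conversation loop passes raw user text to the model, while training inputs had the form "Patient: ... Dialogue Act: ..., Emotion: ...". That mismatch may partly explain the empty and echoed replies above. The sketch below wraps user input in the training format and collects decoding options that discourage repeated phrases. build_prompt, its default dialogue act and emotion values, and GENERATION_KWARGS are illustrative assumptions; the defaults are placeholders because neither label is known at chat time.

```python
def build_prompt(utterance, dialogue_act="cd", emotion=0):
    """Format user text like the training inputs produced by format_data."""
    return f"Patient: {utterance} Dialogue Act: {dialogue_act}, Emotion: {emotion}"

# Decoding options; no_repeat_ngram_size blocks any 3-gram from appearing twice
GENERATION_KWARGS = {
    "max_length": 100,
    "num_beams": 4,
    "early_stopping": True,
    "no_repeat_ngram_size": 3,
}

prompt = build_prompt("I am not feeling well", emotion=-1)
print(prompt)
```

Inside the loop, the call would become generate_response(build_prompt(user_input)), and the options could be passed as model.generate(input_ids, **GENERATION_KWARGS).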
