<a href="https://colab.research.google.com/github/ch3ker/sentiment-analysis/blob/main/Copie_de_Yet_another_copy_of_Healthcare_Question_Answering_with_Bio_ClinicalBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

In this project, we fine-tune a pre-trained language model for a healthcare question-answering task on `medquad.csv`, a dataset of medical questions and answers. The project title and outline refer to Bio_ClinicalBERT, a model pre-trained on clinical data and therefore a natural candidate for healthcare-related tasks. Note, however, that the code cells in this notebook load `google/flan-t5-large` and fine-tune it as a sequence-to-sequence model.


# **Process Overview**

This project focuses on fine-tuning a healthcare question-answering model using Bio_ClinicalBERT. The steps involved in this project are as follows:

1. **Loading and Exploring the Dataset**: Load the healthcare question-answering dataset (`medquad.csv`) and perform an initial exploration to understand its structure.
2. **Data Preprocessing**: Clean and tokenize the dataset to prepare it for model input.
3. **Model Initialization**: Load Bio_ClinicalBERT, a specialized model for healthcare data, and configure it for question-answering tasks.
4. **Fine-Tuning Setup**: Implement parameter-efficient fine-tuning techniques to adapt the model without overwhelming computational resources.
5. **Model Training**: Train the model on the prepared dataset to enhance its question-answering capabilities in the healthcare domain.
6. **Evaluation**: Test the fine-tuned model by generating answers to sample questions and evaluating its performance.



# **Dataset Overview**

The `medquad.csv` dataset used in this notebook is designed for healthcare question-answering tasks. As the `data.head()` output further down shows, it has four columns:

* **question:** Medical questions covering a range of health-related topics.
* **answer:** Detailed responses to the questions.
* **source:** Where the answer was obtained (for example, `NIHSeniorHealth`).
* **focus_area:** The medical topic of the question (for example, "Glaucoma"), which categorizes the content.

This dataset can be used to train and evaluate models that answer healthcare-related questions, which makes it suitable for fine-tuning a language model for the medical domain, such as Bio_ClinicalBERT.

# **Installing Necessary Libraries**

In [None]:
# Install transformers and datasets libraries
#!pip install transformers torch datasets


# **Import Libraries**

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import csv

# **Load the Dataset**

In [None]:
data = pd.read_csv('/content/medquad.csv', encoding='utf-8', nrows=2000)
data.head()

,question,answer,source,focus_area
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma
1,What causes Glaucoma ?,"Nearly 2.7 million people have glaucoma, a lea...",NIHSeniorHealth,Glaucoma
2,What are the symptoms of Glaucoma ?,Symptoms of Glaucoma Glaucoma can develop in ...,NIHSeniorHealth,Glaucoma
3,What are the treatments for Glaucoma ?,"Although open-angle glaucoma cannot be cured, ...",NIHSeniorHealth,Glaucoma
4,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma


# **Explore and Analyze the Dataset**

We will check the dataset for basic information, including the number of unique questions, repeated questions, and any potential duplicates. This will help us understand the dataset’s structure and decide on preprocessing steps.


In [None]:
print("Dataset shape:", data.shape)
print("Columns:", data.columns)

Dataset shape: (2000, 4)
Columns: Index(['question', 'answer', 'source', 'focus_area'], dtype='object')


### 1. Column Summary and Data Types

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    2000 non-null   object
 1   answer      2000 non-null   object
 2   source      2000 non-null   object
 3   focus_area  2000 non-null   object
dtypes: object(4)
memory usage: 62.6+ KB


### 2. Distribution of Questions and Answers

In [None]:
# Count the number of unique questions
unique_questions_count = data['question'].nunique()
print("Number of unique questions:", unique_questions_count)

# Count the frequency of each question
question_counts = data['question'].value_counts()
print("Top 10 most common questions:\n", question_counts.head(10))


Number of unique questions: 1433
Top 10 most common questions:
 question
What is (are) High Blood Cholesterol ?           18
What is (are) Medicare and Continuing Care ?     14
What is (are) Skin Cancer ?                      12
What is (are) Breast Cancer ?                    12
What are the treatments for Breast Cancer ?      12
What is (are) Colorectal Cancer ?                12
What is (are) Stroke ?                           11
What is (are) Leukemia ?                         10
What are the treatments for Prostate Cancer ?    10
What is (are) High Blood Pressure ?              10
Name: count, dtype: int64


### 3. Analysis of Repeated Questions
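This shows which questions are repeated in the dataset and how often each appears. Repeated questions matter for training: they carry more weight in the loss, and after a random split the same question can appear in both the training and validation sets.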

In [None]:
# Find questions that appear more than once
repeated_questions = data[data.duplicated(subset=['question'], keep=False)]

# Display repeated questions and their counts
repeated_questions_summary = repeated_questions['question'].value_counts()
print("Number of repeated questions:", len(repeated_questions_summary))
print("Examples of repeated questions:\n", repeated_questions_summary.head(10))


Number of repeated questions: 213
Examples of repeated questions:
 question
What is (are) High Blood Cholesterol ?           18
What is (are) Medicare and Continuing Care ?     14
What is (are) Colorectal Cancer ?                12
What is (are) Breast Cancer ?                    12
What is (are) Skin Cancer ?                      12
What are the treatments for Breast Cancer ?      12
What is (are) Stroke ?                           11
What is (are) High Blood Pressure ?              10
What are the treatments for Prostate Cancer ?    10
What is (are) Leukemia ?                         10
Name: count, dtype: int64


### 4. Check if repeated questions have the same or different answers

Here we group the dataset by question and count the number of unique answers for each question. We then keep only the questions that have more than one unique answer.

In [None]:
# Group by 'question' and count unique answers for each question
question_answer_variation = data.groupby('question')['answer'].nunique().reset_index()
question_answer_variation.columns = ['question', 'unique_answer_count']

# Filter questions with multiple unique answers
repeated_questions_with_different_answers = question_answer_variation[question_answer_variation['unique_answer_count'] > 1]

# Display questions that have multiple unique answers
print("Questions with multiple unique answers:")
print(repeated_questions_with_different_answers)


Questions with multiple unique answers:
                                               question  unique_answer_count
183            How to diagnose Alzheimer's Caregiving ?                    2
184               How to diagnose Alzheimer's Disease ?                    3
188                     How to diagnose Breast Cancer ?                    3
189                              How to diagnose COPD ?                    2
211   How to diagnose Creating a Family Health Histo...                    2
...                                                 ...                  ...
1415  what research (or clinical trials) is being do...                    2
1416  what research (or clinical trials) is being do...                    3
1417  what research (or clinical trials) is being do...                    2
1423  what research (or clinical trials) is being do...                    4
1425  what research (or clinical trials) is being do...                    2

[213 rows x 2 columns]


In [None]:
# Select two questions with multiple unique answers for display
sample_questions = repeated_questions_with_different_answers['question'].head(2)

# Filter the main dataset for these two questions
sample_repeated_questions_data = data[data['question'].isin(sample_questions)]

# Sort by question for organized display
sample_repeated_questions_data = sample_repeated_questions_data.sort_values(by='question')


for question in sample_questions:
    print(f"\nQuestion: {question}\n")
    answers = sample_repeated_questions_data[sample_repeated_questions_data['question'] == question]['answer']
    for idx, answer in enumerate(answers, 1):
        print(f"Answer {idx}: {answer}\n")




Question: How to diagnose Alzheimer's Caregiving ?

Answer 1: Now that your family member or friend has received a diagnosis of Alzheimers disease, its important to learn as much as you can about the disease and how to care for someone who has it. You may also want to know the right way to share the news with family and friends. Learning About Alzheimers Sometimes, you may feel that you don't know how to care for the person with Alzheimers. This is a common feeling among caregivers of people with Alzheimers because each day may bring different challenges. Learning about the disease can help you understand and cope with these challenges. Here is some information about Alzheimers and ways you can learn more about it. Alzheimers disease is an illness of the brain. It causes large numbers of nerve cells in the brain to die. This affects a persons ability to remember things and think clearly. People with Alzheimers become forgetful and easily confused and may have a hard time concentrating

# **Data Preprocessing**

### Remove Exact Duplicates

In [None]:
# Remove rows where both the question and answer are identical
data_unique = data.drop_duplicates(subset=['question', 'answer'])

# Check new number of rows after removing exact duplicates
print("Number of rows after removing exact duplicates:", data_unique.shape[0])


Number of rows after removing exact duplicates: 2000


### Check for Missing Values

In [None]:
# Display the count of missing values for each column
missing_values = data.isnull().sum()
print("Missing values per column:\n", missing_values)

# Display the percentage of missing values for each column
missing_percentage = (data.isnull().sum() / len(data)) * 100
print("\nPercentage of missing values per column:\n", missing_percentage)


Missing values per column:
 question      0
answer        0
source        0
focus_area    0
dtype: int64

Percentage of missing values per column:
 question      0.0
answer        0.0
source        0.0
focus_area    0.0
dtype: float64


### Handling Missing Values

* Since `source` and `focus_area` don’t directly contribute to the question-answering task, we can safely remove these columns. Retaining only `question` and `answer` simplifies the dataset and removes information that isn’t used during model training.
* The check above found no missing values in the 2,000 rows loaded here, so the `dropna` call below acts as a safeguard: it removes rows (not columns) whose `answer` is missing.

In [None]:
# Drop the columns 'source' and 'focus_area'
data = data.drop(columns=['source', 'focus_area'])

In [None]:
# Remove rows where 'answer' is missing
data = data.dropna(subset=['answer'])

# Verify that no missing values remain in 'question' and 'answer'
print("Remaining missing values:\n", data.isnull().sum())


Remaining missing values:
 question    0
answer      0
dtype: int64


### Removing Unwanted Characters or Excessive Whitespace:

In [None]:
# Minimal Preprocessing
def clean_text(text):
    # Remove excessive whitespace
    text = ' '.join(text.split())
    return text

# Apply cleaning to the 'question' and 'answer' columns
data['question'] = data['question'].apply(clean_text)
data['answer'] = data['answer'].apply(clean_text)
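
As a quick illustration of what this whitespace normalization does, the sketch below re-declares `clean_text` so that it runs on its own and applies it to a made-up string. The sample text is hypothetical and not taken from `medquad.csv`.

```python
def clean_text(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace.
    return ' '.join(text.split())

# Hypothetical input with mixed whitespace
sample = "  What are the\tsymptoms of \n Glaucoma ?  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

Because only whitespace is touched, punctuation and casing pass through unchanged, which matches the "minimal preprocessing" intent of the cell above.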


# **Splitting into Training and Validation Sets**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Now split into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Display the first few rows to confirm the change
print(train_data.head())

                                          question  \
968      What is (are) Myelodysplastic Syndromes ?   
240                 What is (are) Kidney Disease ?   
819  What is (are) Childhood Soft Tissue Sarcoma ?   
692                           What is (are) Gout ?   
420   What are the symptoms of Colorectal Cancer ?   

                                                answer  
968  Key Points - Myelodysplastic syndromes are a g...  
240  When you visit your doctor, here are questions...  
819  Key Points - Childhood soft tissue sarcoma is ...  
692  Sudden, Intense Joint Pain Gout is a form of a...  
420  Possible signs of colorectal cancer include: -...  
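
Because the exploration above found questions that occur several times, a purely random row split can place copies of the same question in both sets. The sketch below shows one possible alternative, assuming rows are plain dictionaries with a `question` key: it splits on unique questions so that all copies of a question stay together. The helper name `split_by_question` and the sample rows are hypothetical; this is not the split used to produce the results in this notebook.

```python
import random

def split_by_question(rows, test_fraction=0.2, seed=42):
    """Split rows so that every copy of a question lands on the same side."""
    # Sort first so the shuffle is reproducible for a given seed
    questions = sorted({row["question"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(questions)

    # Hold out a fraction of the *unique* questions (at least one)
    n_val = max(1, round(len(questions) * test_fraction))
    val_questions = set(questions[:n_val])

    train = [row for row in rows if row["question"] not in val_questions]
    val = [row for row in rows if row["question"] in val_questions]
    return train, val

# Hypothetical rows; the first question appears twice with different answers
rows = [
    {"question": "What is (are) Glaucoma ?", "answer": "answer A"},
    {"question": "What is (are) Glaucoma ?", "answer": "answer B"},
    {"question": "What causes Glaucoma ?", "answer": "answer C"},
    {"question": "What is (are) Stroke ?", "answer": "answer D"},
    {"question": "What is (are) Gout ?", "answer": "answer E"},
    {"question": "What causes Psoriasis ?", "answer": "answer F"},
]
train_rows, val_rows = split_by_question(rows)
overlap = {r["question"] for r in train_rows} & {r["question"] for r in val_rows}
print(len(train_rows), len(val_rows), overlap)
```

With a pandas DataFrame, scikit-learn's `GroupShuffleSplit` with `groups=data['question']` is a commonly used way to get the same effect.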


# **Model Setup**

## 1.Load the Model and Tokenizer

In [None]:
!pip install transformers torch




In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")



## 2. Convert the DataFrame to a Hugging Face Dataset

In [None]:
!pip install datasets




In [None]:
from datasets import Dataset
# Convert DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)


##  3. Define Tokenization and Preprocessing Function

In [None]:
def preprocess_function(examples):
    inputs = examples['question']
    targets = examples['answer']

    # Tokenize inputs and targets
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Setup the tokenizer for targets
    labels = tokenizer(targets, max_length=512, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function to the datasets
train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]
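
One detail worth knowing about the preprocessing above: with `padding="max_length"`, the label sequences contain pad tokens, and those positions count toward the loss unless they are masked. A common convention with Hugging Face sequence-to-sequence models is to replace pad ids in the labels with `-100`, which the cross-entropy loss ignores. The sketch below illustrates the idea on plain lists. The helper name `mask_padding` and the token ids are hypothetical, and `pad_token_id=0` is assumed (it is the T5 pad id). Applying this in the notebook would change the reported losses.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace pad ids with IGNORE_INDEX so padded positions do not affect the loss."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Hypothetical padded label ids (0 = pad, 1 = end-of-sequence for T5)
padded_labels = [[37, 1782, 1, 0, 0], [363, 19, 3, 9, 1]]
masked = mask_padding(padded_labels, pad_token_id=0)
print(masked)
```

In `preprocess_function`, this would correspond to passing `labels["input_ids"]` and `tokenizer.pad_token_id` to the helper before assigning `model_inputs["labels"]`.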

In [None]:
print(val_data[:5])


                                               question  \
1860                        What is (are) Thalassemia ?   
353   what research (or clinical trials) is being do...   
1333  What is the outlook for Childhood Central Nerv...   
905   what research (or clinical trials) is being do...   
1289  What are the stages of Chronic Myeloproliferat...   

                                                 answer  
1860  Thalassemias are inherited blood disorders. If...  
353   The National Eye Institute, or NEI, is conduct...  
1333  Certain factors affect prognosis (chance of re...  
905   New types of treatment are being tested in cli...  
1289  Key Points - There is no standard staging syst...  


In [None]:
print(val_data)

                                               question  \
1860                        What is (are) Thalassemia ?   
353   what research (or clinical trials) is being do...   
1333  What is the outlook for Childhood Central Nerv...   
905   what research (or clinical trials) is being do...   
1289  What are the stages of Chronic Myeloproliferat...   
...                                                 ...   
965              What are the stages of Rectal Cancer ?   
1284  Who is at risk for Adult Acute Lymphoblastic L...   
1739                       What is (are) Hypoglycemia ?   
261               How to diagnose Alzheimer's Disease ?   
535                             What causes Psoriasis ?   

                                                 answer  
1860  Thalassemias are inherited blood disorders. If...  
353   The National Eye Institute, or NEI, is conduct...  
1333  Certain factors affect prognosis (chance of re...  
905   New types of treatment are being tested in cli...  
1

## 4. Set Up Training Arguments

In [None]:
import torch
torch.cuda.empty_cache()


In [None]:
from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
    logging_strategy="steps",   # Enables step-by-step logging
    logging_steps=10,           # Display training loss every 10 steps
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
model.gradient_checkpointing_enable()


## 5. Initialize the Trainer and train the model

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Epoch,Training Loss,Validation Loss
1,1.2427,1.062044
2,1.0147,0.927548
3,0.9245,0.909394




TrainOutput(global_step=600, training_loss=3.0281810029347738, metrics={'train_runtime': 3397.4751, 'train_samples_per_second': 1.413, 'train_steps_per_second': 0.177, 'total_flos': 1.10628861640704e+16, 'train_loss': 3.0281810029347738, 'epoch': 3.0})

## 6. Save the model

In [None]:
trainer.save_model("./fine_tuned_flan_t5_large")
tokenizer.save_pretrained("./fine_tuned_flan_t5_large")


# **Evaluation before PEFT (after Full Fine-Tuning)**

## 1. Validation loss

In [None]:
# Evaluate model performance on the validation set
eval_results = trainer.evaluate()
print(f"Validation Loss: {eval_results['eval_loss']:.4f}")


Validation Loss: 0.9094


## 2. Define the Test Function

Create a function to handle testing on a set of input questions and contexts. This function will generate answers using the fine-tuned model and print or store the results for evaluation.

In [None]:
import torch

def test_model(model, tokenizer, dataset, num_samples=5, max_length=100):
    model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Loop over the dataset and generate predictions
    for i in range(num_samples):
        # Prepare inputs
        sample = dataset[i]
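        # NOTE: the reference answer is appended to the prompt as "context", so the model
        # sees the text it is expected to produce; the metric functions below use the question only.
        input_text = sample['question'] + " " + sample['answer']  # Adjust based on your dataset structure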
        inputs = tokenizer(input_text, return_tensors="pt").to(device)

        # Generate output
        with torch.no_grad():
            outputs = model.generate(inputs['input_ids'], max_length=max_length)

        # Decode and print the generated answer
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
        true_answer = sample['answer']  # Adjust if necessary for your dataset

        print(f"Question: {sample['question']}")
        print(f"Context: {sample['answer']}")
        print(f"Generated Answer: {generated_answer}")
        print(f"True Answer: {true_answer}")
        print("=" * 50)




In [None]:
test_model(model, tokenizer, val_dataset, num_samples=5, max_length=100)


Question: What is (are) Thalassemia ?
Context: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: National Heart, L

## 3. Evaluate Model Performance Quantitatively

### **BLEU Score**

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(model, tokenizer, dataset, num_samples=100):
    bleu_scores = []
    device = model.device  # Inputs are moved to the model's device before generation

    for i in range(min(num_samples, len(dataset))):
        sample = dataset[i]

        # Input text consists only of the question in this case
        input_text = sample['question']

        # Tokenize and prepare input
        inputs = tokenizer(input_text, return_tensors="pt").to(device)
        outputs = model.generate(inputs['input_ids'], max_length=100)

        # Decode the generated output
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Reference answer for calculating BLEU score
        reference_text = sample['answer']

        # Calculate BLEU score with smoothing
        score = sentence_bleu([reference_text.split()], generated_text.split(),
                              smoothing_function=SmoothingFunction().method1)
        bleu_scores.append(score)

    # Calculate the average BLEU score across all samples
    avg_bleu_score = sum(bleu_scores) / len(bleu_scores)
    print("Average BLEU score:", avg_bleu_score)
    return avg_bleu_score

# Now run the BLEU calculation on your model and dataset
average_bleu = calculate_bleu(model, tokenizer, val_dataset, num_samples=100)


Average BLEU score: 0.008185154286891651
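
One likely contributor to such a low score is length. Generation is capped at `max_length=100`, while many reference answers are much longer, and BLEU multiplies its n-gram precision by a brevity penalty when the candidate is shorter than the reference. The sketch below implements the standard brevity-penalty formula, $\exp(1 - r/c)$ for candidate length $c$ below reference length $r$. The lengths in the example are hypothetical.

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty for one candidate/reference pair."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0  # no penalty when the candidate is at least as long
    return math.exp(1 - reference_len / candidate_len)

# Hypothetical lengths: an 80-word generation scored against references of growing length
for ref_len in (80, 160, 400):
    print(ref_len, round(brevity_penalty(80, ref_len), 4))
```

The penalty shrinks quickly as the reference grows relative to the generation, so even an answer whose words all appear in the reference scores low when it covers only a small part of it.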


### **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**


In [None]:
import evaluate

# Load ROUGE metric
rouge = evaluate.load("rouge")

def calculate_rouge(model, tokenizer, dataset, num_samples=100):
    rouge_scores = []
    # Cap the loop at the dataset size to avoid an IndexError on small datasets
    for i in range(min(num_samples, len(dataset))):
        sample = dataset[i]
        input_text = sample['question']
        reference = sample['answer']

        # Generate the model's answer
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        output_ids = model.generate(inputs["input_ids"], max_length=100)
        generated_answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)

        # Compute ROUGE score
        rouge_score = rouge.compute(predictions=[generated_answer], references=[reference])
        rouge_scores.append(rouge_score)

    # Average each ROUGE variant over the samples actually scored
    avg_rouge = {key: sum(score[key] for score in rouge_scores) / len(rouge_scores) for key in rouge_scores[0]}
    print("Average ROUGE score:", avg_rouge)
    return avg_rouge

# Store the result under a new name so the `rouge` metric object is not overwritten
average_rouge = calculate_rouge(model, tokenizer, val_dataset, num_samples=100)

Average ROUGE score: {'rouge1': 0.18099617450861827, 'rouge2': 0.039737180608210076, 'rougeL': 0.1378623999401774, 'rougeLsum': 0.1378623999401774}
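
To make the `rouge1` number more concrete, the sketch below computes a simplified unigram-overlap F1 by hand. It is only an approximation of what the `evaluate` metric does: it assumes lowercasing plus whitespace tokenization and no stemming, and the function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

prediction = "glaucoma damages the optic nerve"
reference = "glaucoma is a group of diseases that can damage the optic nerve"
score = rouge1_f1(prediction, reference)
print(round(score, 4))
```

Here precision is high (4 of the 5 predicted words appear in the reference) while recall is low (4 of 12 reference words are covered), which is the typical pattern when a short generation is compared with a long reference answer.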


### **Perplexity**

In [None]:
import torch.nn.functional as F

def calculate_perplexity(model, tokenizer, dataset, num_samples=100):
    perplexities = []
    for i in range(min(num_samples, len(dataset))):
        sample = dataset[i]
        input_text = sample['question']
        reference_text = sample['answer']

        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        labels = tokenizer(reference_text, return_tensors="pt").input_ids.to(model.device)

        with torch.no_grad():
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            perplexity = torch.exp(loss).item()
            perplexities.append(perplexity)

    avg_perplexity = sum(perplexities) / len(perplexities)
    print("Average Perplexity:", avg_perplexity)
    return avg_perplexity

perplexity = calculate_perplexity(model, tokenizer, val_dataset, num_samples=100)

Token indices sequence length is longer than the specified maximum sequence length for this model (1648 > 512). Running this sequence through the model will result in indexing errors


Average Perplexity: 6.247942788600922
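
A note on how this number is aggregated: the function above exponentiates each sample's loss and then averages the resulting perplexities. Another common convention is to average the losses first and exponentiate once. The two generally differ, and the first is never smaller than the second. The sketch below compares them on made-up loss values; the function names are hypothetical.

```python
import math

def mean_of_perplexities(losses):
    """Average of per-sample perplexities (the aggregation used in the cell above)."""
    return sum(math.exp(loss) for loss in losses) / len(losses)

def perplexity_of_mean_loss(losses):
    """Perplexity of the average loss (an alternative aggregation)."""
    return math.exp(sum(losses) / len(losses))

# Hypothetical per-sample cross-entropy losses
losses = [1.0, 3.0]
print(round(mean_of_perplexities(losses), 3))
print(round(perplexity_of_mean_loss(losses), 3))
```

Because the exponential is convex, a few high-loss samples pull the mean of perplexities up sharply, so it is worth stating which aggregation is used when comparing numbers across runs.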


# **Fine-tuning**

## **1. Applying PEFT with LoRA**

## Import Necessary Libraries

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
import torch

## Load the Pretrained Model and Tokenizer

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model_lora = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")




## Configure LoRA


In [None]:
from peft import TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # Since T5 is a sequence-to-sequence model
    inference_mode=False,  # Set to False for training mode
    r=8,  # Rank of the LoRA adaptation matrix; adjust as needed
    lora_alpha=32,  # Scaling factor for LoRA
    lora_dropout=0.1  # Dropout rate for LoRA layers
)

# Wrap the model with LoRA
model = get_peft_model(model_lora, lora_config)
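
To give a feel for what `r=8` and `lora_alpha=32` mean, the sketch below counts parameters for a single weight matrix. LoRA keeps the original `d_out x d_in` matrix frozen and trains two small matrices of shapes `d_out x r` and `r x d_in`, whose product is scaled by `lora_alpha / r`. The 1024 x 1024 shape is an assumption chosen for illustration, not a value read from the model.

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Return (frozen full-matrix parameters, trainable LoRA parameters)."""
    full = d_in * d_out
    lora = r * (d_in + d_out)  # one (d_out x r) matrix plus one (r x d_in) matrix
    return full, lora

r, lora_alpha = 8, 32
full, lora = lora_param_counts(1024, 1024, r)  # assumed layer shape
scaling = lora_alpha / r                       # factor applied to the low-rank update

print(full, lora, f"{lora / full:.2%}", scaling)
```

For the wrapped model itself, PEFT's `model.print_trainable_parameters()` reports the actual trainable and total parameter counts.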


## Prepare The Dataset

In [None]:
from datasets import Dataset
# Convert DataFrames to Hugging Face Datasets
train_data = Dataset.from_pandas(train_data)
val_data = Dataset.from_pandas(val_data)

In [None]:
def preprocess_function(examples):
    inputs = [f"Question: {q}" for q in examples["question"]]
    targets = examples["answer"]
    model_inputs = tokenizer(inputs, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(targets, max_length=128, padding="max_length", truncation=True).input_ids

    model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing on your dataset
train_data = train_data.map(preprocess_function, batched=True)
val_data = val_data.map(preprocess_function, batched=True)


Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

## Train the model

In [None]:
training_args = TrainingArguments(
    output_dir="./lora_fine_tuned_model",
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
    learning_rate=5e-4,  # LoRA often benefits from a higher learning rate
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Adjust based on your hardware capacity
    save_steps=10_000,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=500,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Start training
trainer.train()

# Save the final model and tokenizer
trainer.save_model("./lora_fine_tuned_model")
tokenizer.save_pretrained("./lora_fine_tuned_model")




Epoch,Training Loss,Validation Loss
1,No log,1.682542
2,2.517800,1.573104
3,1.672400,1.550586




('./lora_fine_tuned_model/tokenizer_config.json',
 './lora_fine_tuned_model/special_tokens_map.json',
 './lora_fine_tuned_model/spiece.model',
 './lora_fine_tuned_model/added_tokens.json',
 './lora_fine_tuned_model/tokenizer.json')
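
The "No log" entry for epoch 1 in the table above follows from the step arithmetic. The sketch below assumes 1,600 training examples (as in the `Map` output), a single device, and no gradient accumulation. With a batch size of 4 an epoch takes 400 steps, so the first training-loss log at step 500 has not happened yet when epoch 1 is evaluated. The helper names are hypothetical.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size)

def epochs_with_training_log(num_examples, batch_size, num_epochs, logging_steps):
    """For each epoch, report whether a training loss has been logged by its end."""
    per_epoch = steps_per_epoch(num_examples, batch_size)
    return {
        epoch: epoch * per_epoch >= logging_steps
        for epoch in range(1, num_epochs + 1)
    }

# LoRA run: batch size 4, logging every 500 steps
lora_logs = epochs_with_training_log(1600, 4, 3, 500)
print(lora_logs)

# Full fine-tuning run: batch size 8, 3 epochs
full_run_steps = steps_per_epoch(1600, 8) * 3
print(full_run_steps)
```

The second calculation agrees with `global_step=600` in the `TrainOutput` of the first training run. Lowering `logging_steps` below 400 would give every epoch of the LoRA run a training-loss entry.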

# **Evaluation after using LoRA**

## 1. Validation loss

In [None]:
# Evaluate model performance on the validation set
eval_results = trainer.evaluate()
print(f"Validation Loss: {eval_results['eval_loss']:.4f}")

Validation Loss: 1.5506


## 2. Define the Test Function

In [None]:
def test_model(model, tokenizer, dataset, num_samples=5, max_length=100):
    model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    for i in range(num_samples):
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']  # Assuming the true answer is under 'answer' in your dataset

        # Tokenize input text and move to the correct device
        inputs = tokenizer(input_text, return_tensors="pt").to(device)

        # Generate output
        with torch.no_grad():
            outputs = model.generate(input_ids=inputs['input_ids'], max_length=max_length)

        # Decode generated and true answers
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}\n")

# Run the function
test_model(model, tokenizer, val_data, num_samples=5, max_length=100)


Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa

## **BLEU Score**

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(model, tokenizer, dataset, num_samples=100):
    bleu_scores = []
    device = model.device  # Inputs are moved to the model's device before generation

    for i in range(min(num_samples, len(dataset))):
        sample = dataset[i]

        # Input text consists only of the question in this case
        input_text = sample['question']

        # Tokenize and prepare input
        inputs = tokenizer(input_text, return_tensors="pt").to(device)

        # Access base model's generate method
        outputs = model.base_model.generate(inputs['input_ids'], max_length=100)

        # Decode the generated output
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Reference answer for calculating BLEU score
        reference_text = sample['answer']

        # Calculate BLEU score with smoothing
        score = sentence_bleu([reference_text.split()], generated_text.split(),
                              smoothing_function=SmoothingFunction().method1)
        bleu_scores.append(score)

    # Calculate the average BLEU score across all samples
    avg_bleu_score = sum(bleu_scores) / len(bleu_scores)
    print("Average BLEU score:", avg_bleu_score)
    return avg_bleu_score

# Run the BLEU calculation
average_bleu = calculate_bleu(model, tokenizer, val_data, num_samples=100)


Average BLEU score: 0.027946411910918324
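
One factor that may contribute to a score this low is length: the reference answers are long, while generation is capped at `max_length=100`, and BLEU multiplies its n-gram precision by a brevity penalty whenever the candidate is shorter than the reference. The sketch below is illustrative only. The lengths are made-up values, not measurements from this run, and `brevity_penalty` is a hypothetical helper rather than an NLTK function.

```python
import math

def brevity_penalty(ref_len: int, cand_len: int) -> float:
    """BLEU brevity penalty: 1 when the candidate is at least as long as the reference,
    otherwise exp(1 - ref_len / cand_len)."""
    if cand_len == 0:
        return 0.0
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

# Hypothetical lengths in words: a 150-word reference against shorter candidates.
for cand_len in (150, 75, 30, 15):
    print(cand_len, round(brevity_penalty(150, cand_len), 4))
```

Under these assumptions, a candidate half the length of its reference already loses most of its score before any n-gram mismatch is counted.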


## **Perplexity**

In [None]:
import torch

def calculate_perplexity(model, tokenizer, dataset, num_samples=100):
    perplexities = []
    for i in range(min(num_samples, len(dataset))):
        sample = dataset[i]
        input_text = sample['question']
        reference_text = sample['answer']

        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        labels = tokenizer(reference_text, return_tensors="pt").input_ids.to(model.device)

        with torch.no_grad():
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            perplexity = torch.exp(loss).item()
            perplexities.append(perplexity)

    avg_perplexity = sum(perplexities) / len(perplexities)
    print("Average Perplexity:", avg_perplexity)
    return avg_perplexity

perplexity = calculate_perplexity(model, tokenizer, val_data, num_samples=100)

Token indices sequence length is longer than the specified maximum sequence length for this model (1648 > 512). Running this sequence through the model will result in indexing errors


Average Perplexity: 7.777258045673371
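
Perplexity is the exponential of the mean per-token cross-entropy loss. The function above averages the per-sample perplexities, which is generally not the same number as exponentiating the average loss. The sketch below shows the difference on hypothetical loss values (they are not taken from this run) and assumes every sample has the same number of target tokens.

```python
import math

# Hypothetical per-sample mean cross-entropy losses (not from this notebook's run).
losses = [1.2, 2.0, 3.5]

# What calculate_perplexity reports: the mean of the per-sample perplexities.
mean_of_ppl = sum(math.exp(loss) for loss in losses) / len(losses)

# Alternative: exponentiate the mean loss (assumes equal token counts per sample).
ppl_of_mean = math.exp(sum(losses) / len(losses))

print(f"mean of perplexities: {mean_of_ppl:.3f}")
print(f"exp of mean loss:     {ppl_of_mean:.3f}")
```

The first value is never smaller than the second, so a few hard samples can dominate the reported average.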


# **2. Applying Prompt Tuning**

## Import Libraries

In [None]:
!pip install datasets


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import Dataset
import torch


## Define Soft Prompt Tuning Layer

In [None]:
class SoftPromptTuning(torch.nn.Module):
    def __init__(self, model, prompt_length=10, hidden_size=768):
        super(SoftPromptTuning, self).__init__()
        self.prompt_embeddings = torch.nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.01)  # Smaller values
        self.model = model
        self.prompt_length = prompt_length

    def forward(self, input_ids, attention_mask, labels=None):
        # Get the input embeddings
        inputs_embeds = self.model.get_input_embeddings()(input_ids)

        # Add the soft prompt to the beginning of the input embeddings
        prompt_embeds = self.prompt_embeddings.unsqueeze(0).expand(inputs_embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt_embeds, inputs_embeds], dim=1)

        # Extend the attention mask to cover the prompt tokens (same dtype as the original mask)
        prompt_mask = torch.ones((attention_mask.size(0), self.prompt_length), dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)

        # Pass through the model with modified inputs
        outputs = self.model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
        return outputs
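
Once the base model's parameters are frozen, the only trainable tensor in this wrapper is `prompt_embeddings`, with shape `(prompt_length, hidden_size)`. A quick count shows how small that is. The helper name `soft_prompt_param_count` is hypothetical, and the value 1024 for the hidden size of `google/flan-t5-large` is an assumption that can be checked against `base_model.config.d_model`.

```python
def soft_prompt_param_count(prompt_length: int, hidden_size: int) -> int:
    """Number of trainable values in a (prompt_length, hidden_size) soft prompt."""
    return prompt_length * hidden_size

# Class default above: prompt_length=10, hidden_size=768.
print(soft_prompt_param_count(10, 768))

# Assumption: d_model is 1024 for google/flan-t5-large (verify with base_model.config.d_model).
print(soft_prompt_param_count(10, 1024))
```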


## Load Model and Wrap It with Soft Prompt Tuning

In [None]:
# Load base model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define soft prompt tuning with model wrapper
soft_prompt_model = SoftPromptTuning(base_model, prompt_length=10, hidden_size=base_model.config.d_model)



## Prepare the Dataset


In [None]:
from datasets import Dataset
# Convert DataFrames to Hugging Face Datasets
train_data_prompt = Dataset.from_pandas(train_data)
val_data_prompt = Dataset.from_pandas(val_data)

## Define the Prompt and Data Preprocessing

In [None]:
def preprocess_function(examples):
    # Add prompts to questions
    inputs = ["Answer the following medical question: " + question for question in examples["question"]]
    targets = examples["answer"]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=128, truncation=True, padding="max_length").input_ids
    model_inputs["labels"] = labels
    return model_inputs

# Tokenize data
train_data_tokenized = train_data_prompt.map(preprocess_function, batched=True)
val_data_tokenized = val_data_prompt.map(preprocess_function, batched=True)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]
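
Because the targets are padded to `max_length`, the `labels` produced above contain padding token ids, and those positions count toward the loss. Hugging Face models skip label positions set to `-100`, so a common variation is to mask the padding before training. The sketch below shows that step on a toy batch. `mask_padding` is a hypothetical helper, and `pad_token_id=0` is an assumption for illustration; in the notebook the value would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss in Hugging Face models

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so padded positions do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy batch of already-padded label ids (made-up values).
labels = [[71, 1712, 1, 0, 0], [37, 1, 0, 0, 0]]
print(mask_padding(labels, pad_token_id=0))
```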

## Initialize the Trainer with Prompt Tuning Parameters

In [None]:
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, Trainer

# Reduce batch size and add gradient accumulation
training_args = Seq2SeqTrainingArguments(
    output_dir="./soft_prompt_tuning",
    evaluation_strategy="epoch",  # Use "no" if you want to skip validation during training
    learning_rate=5e-6,
    per_device_train_batch_size=1,  # Lower batch size
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    weight_decay=0.01,
    predict_with_generate=True,
    max_grad_norm=1.0,
    gradient_accumulation_steps=4,  # Use accumulation for larger effective batch size
    fp16=False,
    save_safetensors=False,
    save_steps=0,
)

# Enable gradient checkpointing
base_model.gradient_checkpointing_enable()

# Freeze the base model so that only the soft prompt embeddings are trained
for param in soft_prompt_model.model.parameters():
    param.requires_grad = False

# Clear any cached memory
torch.cuda.empty_cache()

# Initialize the trainer
trainer = Trainer(

    model=soft_prompt_model,
    args=training_args,
    train_dataset=train_data_tokenized,
    eval_dataset=val_data_tokenized,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 4.614409 |
| 2 | 5.075200 | 4.623473 |
| 3 | 5.114200 | 4.613799 |
| 4 | 4.950800 | 4.617537 |
| 5 | 5.112500 | 4.621651 |


TrainOutput(global_step=2000, training_loss=5.063213500976563, metrics={'train_runtime': 2991.9796, 'train_samples_per_second': 2.674, 'train_steps_per_second': 0.668, 'total_flos': 0.0, 'train_loss': 5.063213500976563, 'epoch': 5.0})
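
The `global_step=2000` reported above can be reproduced from the training arguments. The sketch below assumes 1,600 training examples on a single device and rounds partial batches up; `optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Optimizer updates for a run, assuming a trailing partial batch still triggers a step."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# Values from Seq2SeqTrainingArguments above: batch size 1, 4 accumulation steps, 5 epochs.
print(optimizer_steps(1600, per_device_batch=1, grad_accum=4, epochs=5))  # 2000
```

With an effective batch size of 4, each epoch performs 400 updates, which is why the soft prompt receives only 2,000 updates in total at a learning rate of `5e-6`.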

## Evaluation after Prompt Tuning

### 1. Validation loss

In [None]:
results = trainer.evaluate()
print(results)


{'eval_loss': 4.6216511726379395, 'eval_runtime': 38.3464, 'eval_samples_per_second': 10.431, 'eval_steps_per_second': 10.431, 'epoch': 5.0}


### 2. Define the test function

### **Testing without a prompt**



In [None]:
import torch

def test_soft_prompt_tuning(soft_prompt_model, base_model, tokenizer, dataset, num_samples=5, max_length=100):
    """
    Test the T5 model with soft prompt tuning by generating answers to questions.

    Args:
    - soft_prompt_model: The model wrapper with soft prompt tuning.
    - base_model: The underlying T5 model that performs generation.
    - tokenizer: Tokenizer for encoding and decoding text.
    - dataset: Dataset containing questions and answers.
    - num_samples: Number of samples to test.
    - max_length: Maximum length of the generated answer.
    """

    base_model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    base_model.to(device)

    for i in range(num_samples):
        # Extract a question and true answer from the dataset
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']

        # Tokenize input question
        inputs = tokenizer(input_text, return_tensors="pt").to(device)

        # Retrieve the soft prompt embeddings from the soft_prompt_model
        prompt_embeds = soft_prompt_model.prompt_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1).to(device)

        # Get T5 input embeddings for the tokenized question and concatenate with soft prompts
        input_embeddings = base_model.get_input_embeddings()(inputs['input_ids'])
        combined_embeddings = torch.cat((prompt_embeds, input_embeddings), dim=1)

        # Adjust attention mask to account for the prompt length
        prompt_mask = torch.ones((inputs['attention_mask'].size(0), soft_prompt_model.prompt_length), device=device)
        attention_mask = torch.cat([prompt_mask, inputs['attention_mask']], dim=1)

        # Generate output using the base model with the combined embeddings and modified attention mask
        with torch.no_grad():
            outputs = base_model.generate(inputs_embeds=combined_embeddings, attention_mask=attention_mask, max_length=max_length)

        # Decode the generated output
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Display results
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}\n")

# Usage example:
# Assuming `val_data_tokenized` is your validation dataset and has been prepared with question-answer pairs
test_soft_prompt_tuning(soft_prompt_model, base_model, tokenizer, val_data_tokenized, num_samples=5, max_length=100)


Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa

### **Testing with a prompt**

In [None]:
import torch

def test_detailed_generation(soft_prompt_model, base_model, tokenizer, dataset, num_samples=5, max_length=200):
    """
    Test the T5 model with soft prompt tuning by generating more detailed answers to healthcare questions.

    Args:
    - soft_prompt_model: The model wrapper with soft prompt tuning.
    - base_model: The underlying T5 model that performs generation.
    - tokenizer: Tokenizer for encoding and decoding text.
    - dataset: Dataset containing questions and answers.
    - num_samples: Number of samples to test.
    - max_length: Maximum length of the generated answer.
    """

    base_model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    base_model.to(device)

    for i in range(num_samples):
        # Extract a question and true answer from the dataset
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']

        # Add a prompt prefix to encourage detailed responses
        prompt_text = "Provide a detailed medical explanation: " + input_text

        # Tokenize the prompt text
        inputs = tokenizer(prompt_text, return_tensors="pt").to(device)

        # Retrieve the soft prompt embeddings from the soft_prompt_model
        prompt_embeds = soft_prompt_model.prompt_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1).to(device)

        # Get T5 input embeddings for the tokenized question and concatenate with soft prompts
        input_embeddings = base_model.get_input_embeddings()(inputs['input_ids'])
        combined_embeddings = torch.cat((prompt_embeds, input_embeddings), dim=1)

        # Adjust attention mask to account for the prompt length
        prompt_mask = torch.ones((inputs['attention_mask'].size(0), soft_prompt_model.prompt_length), device=device)
        attention_mask = torch.cat([prompt_mask, inputs['attention_mask']], dim=1)

        # Generate output with adjusted generation parameters
        with torch.no_grad():
            outputs = base_model.generate(
                inputs_embeds=combined_embeddings,
                attention_mask=attention_mask,
                max_length=max_length,  # Allowing longer responses
                do_sample=True,         # Enable sampling so the settings below take effect
                temperature=0.7,        # Adjusting creativity level
                top_k=50,               # Limiting sampling to top 50 tokens
                top_p=0.9               # Nucleus sampling for more coherent output
            )

        # Decode the generated output
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Display results
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}\n")

# Usage example:
# Assuming `val_data_tokenized` is your validation dataset and has been prepared with question-answer pairs
test_detailed_generation(soft_prompt_model, base_model, tokenizer, val_data_tokenized, num_samples=5, max_length=200)




Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa
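
In `transformers`, `temperature`, `top_k` and `top_p` only affect generation when sampling is enabled with `do_sample=True`. To make the two filters concrete, here is a simplified sketch on a toy next-token distribution. It is not the library's implementation: `top_k_top_p_filter` is a hypothetical helper, the probabilities are made up, and temperature is left out.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, renormalize, then keep the smallest
    prefix whose cumulative probability reaches top_p and renormalize again."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy distribution over four candidate tokens (made-up values).
next_token_probs = {"anemia": 0.5, "blood": 0.3, "iron": 0.15, "the": 0.05}
print(top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.9))
```

A token is then drawn at random from the filtered distribution, so repeated runs can produce different answers.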

### **BLEU score**

In [None]:
!pip install evaluate




In [None]:
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def test_detailed_generation_with_bleu(soft_prompt_model, base_model, tokenizer, dataset, num_samples=5, max_length=200):
    """
    Test the T5 model with soft prompt tuning by generating answers to healthcare questions
    and calculate BLEU scores to evaluate the generated answers.

    Args:
    - soft_prompt_model: The model wrapper with soft prompt tuning.
    - base_model: The underlying T5 model that performs generation.
    - tokenizer: Tokenizer for encoding and decoding text.
    - dataset: Dataset containing questions and answers.
    - num_samples: Number of samples to test.
    - max_length: Maximum length of the generated answer.
    """

    base_model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    base_model.to(device)

    bleu_scores = []  # List to store individual BLEU scores

    for i in range(num_samples):
        # Extract a question and true answer from the dataset
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']

        # Add a prompt prefix to encourage detailed responses
        prompt_text = "Provide a detailed medical explanation: " + input_text

        # Tokenize the prompt text
        inputs = tokenizer(prompt_text, return_tensors="pt").to(device)

        # Retrieve the soft prompt embeddings from the soft_prompt_model
        prompt_embeds = soft_prompt_model.prompt_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1).to(device)

        # Get T5 input embeddings for the tokenized question and concatenate with soft prompts
        input_embeddings = base_model.get_input_embeddings()(inputs['input_ids'])
        combined_embeddings = torch.cat((prompt_embeds, input_embeddings), dim=1)

        # Adjust attention mask to account for the prompt length
        prompt_mask = torch.ones((inputs['attention_mask'].size(0), soft_prompt_model.prompt_length), device=device)
        attention_mask = torch.cat([prompt_mask, inputs['attention_mask']], dim=1)

        # Generate output with adjusted generation parameters
        with torch.no_grad():
            outputs = base_model.generate(
                inputs_embeds=combined_embeddings,
                attention_mask=attention_mask,
                max_length=max_length,
                do_sample=True,  # Enable sampling so temperature, top_k and top_p take effect
                temperature=0.7,
                top_k=50,
                top_p=0.9
            )

        # Decode the generated output
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Calculate BLEU score
        reference = [true_answer.split()]  # Reference answer tokenized into a list of words
        hypothesis = generated_answer.split()  # Generated answer tokenized into a list of words
        bleu_score = sentence_bleu(reference, hypothesis, smoothing_function=SmoothingFunction().method1)
        bleu_scores.append(bleu_score)

        # Display results
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}")
        print(f"BLEU Score: {bleu_score}\n")

    # Calculate the average BLEU score across all samples
    avg_bleu_score = sum(bleu_scores) / len(bleu_scores)
    print(f"Average BLEU Score across {num_samples} samples: {avg_bleu_score}")

# Usage example:
# Assuming `val_data_tokenized` is your validation dataset with question-answer pairs
test_detailed_generation_with_bleu(soft_prompt_model, base_model, tokenizer, val_data_tokenized, num_samples=5, max_length=200)


Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa

### **ROUGE score**

In [None]:
!pip install rouge_score



In [None]:
import torch
from evaluate import load

# Load the ROUGE metric
rouge = load("rouge")

def test_detailed_generation_with_rouge(soft_prompt_model, base_model, tokenizer, dataset, num_samples=5, max_length=200):
    """
    Test the T5 model with soft prompt tuning by generating answers to healthcare questions
    and calculate ROUGE scores to evaluate the generated answers.

    Args:
    - soft_prompt_model: The model wrapper with soft prompt tuning.
    - base_model: The underlying T5 model that performs generation.
    - tokenizer: Tokenizer for encoding and decoding text.
    - dataset: Dataset containing questions and answers.
    - num_samples: Number of samples to test.
    - max_length: Maximum length of the generated answer.
    """

    base_model.eval()  # Set model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    base_model.to(device)

    # Lists to store generated answers and true answers for ROUGE calculation
    generated_answers = []
    true_answers = []

    for i in range(num_samples):
        # Extract a question and true answer from the dataset
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']

        # Add a prompt prefix to encourage detailed responses
        prompt_text = "Provide a detailed medical explanation: " + input_text

        # Tokenize the prompt text
        inputs = tokenizer(prompt_text, return_tensors="pt").to(device)

        # Retrieve the soft prompt embeddings from the soft_prompt_model
        prompt_embeds = soft_prompt_model.prompt_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1).to(device)

        # Get T5 input embeddings for the tokenized question and concatenate with soft prompts
        input_embeddings = base_model.get_input_embeddings()(inputs['input_ids'])
        combined_embeddings = torch.cat((prompt_embeds, input_embeddings), dim=1)

        # Adjust attention mask to account for the prompt length
        prompt_mask = torch.ones((inputs['attention_mask'].size(0), soft_prompt_model.prompt_length), device=device)
        attention_mask = torch.cat([prompt_mask, inputs['attention_mask']], dim=1)

        # Generate output with adjusted generation parameters
        with torch.no_grad():
            outputs = base_model.generate(
                inputs_embeds=combined_embeddings,
                attention_mask=attention_mask,
                max_length=max_length,
                do_sample=True,  # Enable sampling so temperature, top_k and top_p take effect
                temperature=0.7,
                top_k=50,
                top_p=0.9
            )

        # Decode the generated output
        generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Store answers for ROUGE score calculation
        generated_answers.append(generated_answer)
        true_answers.append(true_answer)

        # Display results
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}\n")

    # Calculate ROUGE scores
    results = rouge.compute(predictions=generated_answers, references=true_answers)
    print("ROUGE Scores:")
    print(f"ROUGE-1: {results['rouge1']}")
    print(f"ROUGE-2: {results['rouge2']}")
    print(f"ROUGE-L: {results['rougeL']}")

# Usage example:
# Assuming `val_data_tokenized` is your validation dataset with question-answer pairs
test_detailed_generation_with_rouge(soft_prompt_model, base_model, tokenizer, val_data_tokenized, num_samples=5, max_length=200)


Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa
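
ROUGE-1 measures unigram overlap between the generated answer and the reference. The sketch below computes a simplified ROUGE-1 F1 by hand to show what the metric rewards. It is an approximation for illustration: it lowercases and splits on whitespace only, so its values can differ from those of the `rouge` metric loaded above, and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("Thalassemias are blood disorders",
                "Thalassemias are inherited blood disorders"))
```

A short answer made entirely of reference words gets perfect precision but low recall against a long reference, which pulls the F1 down.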

### **Perplexity**

In [None]:
import math
import torch

def test_detailed_generation_with_perplexity_and_answer(soft_prompt_model, base_model, tokenizer, dataset, num_samples=5, max_length=200):
    base_model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    base_model.to(device)

    perplexities = []
    generated_answers = []
    true_answers = []

    for i in range(num_samples):
        sample = dataset[i]
        input_text = sample['question']
        true_answer = sample['answer']

        # Add prompt
        prompt_text = "Provide a detailed medical explanation: " + input_text
        inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
        target = tokenizer(true_answer, return_tensors="pt").input_ids.to(device)

        prompt_embeds = soft_prompt_model.prompt_embeddings.unsqueeze(0).expand(inputs['input_ids'].size(0), -1, -1).to(device)
        input_embeddings = base_model.get_input_embeddings()(inputs['input_ids'])
        combined_embeddings = torch.cat((prompt_embeds, input_embeddings), dim=1)

        prompt_mask = torch.ones((inputs['attention_mask'].size(0), soft_prompt_model.prompt_length), device=device)
        attention_mask = torch.cat([prompt_mask, inputs['attention_mask']], dim=1)

        # Calculate loss for perplexity, feeding the same soft-prompted inputs used for generation
        with torch.no_grad():
            outputs = base_model(inputs_embeds=combined_embeddings, attention_mask=attention_mask, labels=target)
            loss = outputs.loss.item()

        perplexity = math.exp(loss)
        perplexities.append(perplexity)

        # Generate an answer
        with torch.no_grad():
            generated_output = base_model.generate(
                inputs_embeds=combined_embeddings,
                attention_mask=attention_mask,
                max_length=max_length,
                do_sample=True,  # Enable sampling so temperature, top_k and top_p take effect
                temperature=0.7,
                top_k=50,
                top_p=0.9
            )
            generated_answer = tokenizer.decode(generated_output[0], skip_special_tokens=True)
            generated_answers.append(generated_answer)
            true_answers.append(true_answer)

        # Display results
        print(f"Input Question: {input_text}")
        print(f"True Answer: {true_answer}")
        print(f"Generated Answer: {generated_answer}")
        print(f"Perplexity: {perplexity}\n")

    # Average perplexity
    avg_perplexity = sum(perplexities) / len(perplexities)
    print("Average Perplexity:", avg_perplexity)

# Usage
test_detailed_generation_with_perplexity_and_answer(soft_prompt_model, base_model, tokenizer, val_data_tokenized, num_samples=5, max_length=200)


Input Question: What is (are) Thalassemia ?
True Answer: Thalassemias are inherited blood disorders. If you have one, your body makes fewer healthy red blood cells and less hemoglobin. Hemoglobin is a protein that carries oxygen to the body. That leads to anemia. Thalassemias occur most often among people of Italian, Greek, Middle Eastern, Southern Asian, and African descent. Thalassemias can be mild or severe. Some people have no symptoms or mild anemia. The most common severe type in the United States is called Cooley's anemia. It usually appears during the first two years of life. People with it may have severe anemia, slowed growth and delayed puberty, and problems with the spleen, liver, heart, or bones. Doctors diagnose thalassemias using blood tests. Treatments include blood transfusions and treatment to remove excess iron from the body. If you have mild symptoms or no symptoms, you may not need treatment. In some severe cases, you may need a bone marrow transplant. NIH: Nationa