## Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* finetune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*) and the Semantic *Type* and *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Part-A: Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval engines is that they return a list of documents rather than answers. Instead, healthcare professionals can use question answering systems to retrieve short sentences or paragraphs in response to medical queries. The main advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the MedQuAD medical question-answering dataset so that it can generate responses to medical queries.

Please refer to ***M6 Assignment-1 Fine-tune GPT2*** to get familiar with how to load the pre-trained GPT-2 tokenizer and model.

### Import required packages

In [2]:
#hugging face libraries
!pip -q install -U accelerate
!pip -q install -U transformers
!pip -q install torch

In [3]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [4]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [5]:
df = pd.read_csv("MedQuAD.csv")
df1=df.copy()
df.shape

(16412, 6)

In [6]:
df.head(10)

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
0,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is (are) Adult Acute Lymphoblastic Leukem...,Key Points - Adult acute lymphoblastic leukemi...
1,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the symptoms of Adult Acute Lymphobla...,"Signs and symptoms of adult ALL include fever,..."
2,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,How to diagnose Adult Acute Lymphoblastic Leuk...,Tests that examine the blood and bone marrow a...
3,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is the outlook for Adult Acute Lymphoblas...,Certain factors affect prognosis (chance of re...
4,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,Who is at risk for Adult Acute Lymphoblastic L...,Previous chemotherapy and exposure to radiatio...
5,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the stages of Adult Acute Lymphoblast...,Key Points - Once adult ALL has been diagnosed...
6,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the treatments for Adult Acute Lympho...,Key Points - There are different types of trea...
7,Adult Acute Myeloid Leukemia,C0220615,T191,Disorders,What is (are) Adult Acute Myeloid Leukemia ?,Key Points - Adult acute myeloid leukemia (AML...
8,Adult Acute Myeloid Leukemia,C0220615,T191,Disorders,Who is at risk for Adult Acute Myeloid Leukemi...,"Smoking, previous chemotherapy treatment, and ..."
9,Adult Acute Myeloid Leukemia,C0220615,T191,Disorders,What are the symptoms of Adult Acute Myeloid L...,"Signs and symptoms of adult AML include fever,..."


### Pre-processing and EDA

**Exercise 2: Perform the operations below on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [7]:
df.isnull().sum()

Unnamed: 0,0
Focus,14
CUI,565
SemanticType,597
SemanticGroup,565
Question,0
Answer,5


In [8]:
df.dropna(inplace=True)

In [9]:
df.shape

(15810, 6)
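
Note that `dropna()` without arguments removes a row if any column is missing, including `CUI` or `SemanticType`, which the later cells never read. The following is a minimal sketch on a toy frame (the rows are invented for illustration, not taken from MedQuAD) that contrasts this with restricting the check to the text columns through the `subset` parameter. Whether the stricter or the looser filter is preferable here is a judgment call.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from MedQuAD.
toy = pd.DataFrame({
    "Focus": ["A", "B", "C"],
    "CUI": ["C001", None, "C003"],
    "Question": ["What is A?", "What is B?", "What is C?"],
    "Answer": ["A is ...", "B is ...", None],
})

# Drops rows 1 and 2: row 1 lacks a CUI, row 2 lacks an Answer.
strict = toy.dropna()

# Drops only row 2: only the text columns are checked.
text_only = toy.dropna(subset=["Question", "Answer"])

print(len(strict), len(text_only))
```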

- **Remove duplicates from data considering `Question` and `Answer` columns**

In [10]:
df.drop_duplicates(subset=['Question', 'Answer'], inplace=True)
df.shape

(15762, 6)

**Exercise 3: Display the category name and the number of records for the top 100 categories of the `Focus` column [1 Mark]**

In [11]:
top_100_focus_categories = df['Focus'].value_counts().head(100)
top_100_focus_categories

Focus,count
Breast Cancer,53
Prostate Cancer,43
Stroke,35
Skin Cancer,34
Alzheimer's Disease,30
...,...
Medullary Sponge Kidney,11
IgA Nephropathy,11
Alagille Syndrome,11
Urinary Incontinence in Men,10


### Create Training and Validation set

**Exercise 4: Create training and validation set [2 Marks]**

- Take 4 samples per `Focus` category for each of the top 100 categories (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from those in the training set, for each of the top 100 categories (this gives 100 samples for validation)

In [12]:
import pandas as pd
training_set = pd.DataFrame()
validation_set = pd.DataFrame()

for category in top_100_focus_categories.index:
    category_samples = df[df['Focus'] == category]

    # Take 4 samples for training
    training_samples = category_samples.sample(n=4, random_state=42)
    training_set = pd.concat([training_set, training_samples])

    # Take 1 sample for validation from the remaining
    remaining_samples = category_samples.drop(training_samples.index)
    if len(remaining_samples) > 0:
        validation_sample = remaining_samples.sample(n=1, random_state=42)
        validation_set = pd.concat([validation_set, validation_sample])

print("Training set shape:", training_set.shape)
print("Validation set shape:", validation_set.shape)

Training set shape: (400, 6)
Validation set shape: (100, 6)
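
The property that matters for this split is that no row appears in both sets. The sketch below shows one way to build the split and verify that property on a toy frame. The helper name `split_per_category` and the toy data are hypothetical; it shuffles each category once and slices, rather than sampling and dropping as the cell above does.

```python
import pandas as pd

def split_per_category(df, categories, n_train=4, n_val=1, seed=42):
    """Sample disjoint train/validation rows for each Focus category."""
    train_parts, val_parts = [], []
    for category in categories:
        rows = df[df["Focus"] == category]
        # Shuffle once, then slice, so the two samples cannot overlap.
        shuffled = rows.sample(frac=1, random_state=seed)
        train_parts.append(shuffled.iloc[:n_train])
        val_parts.append(shuffled.iloc[n_train:n_train + n_val])
    # Concatenate once at the end instead of growing a frame in the loop.
    return pd.concat(train_parts), pd.concat(val_parts)

# Toy frame: 2 categories with 6 rows each (invented for illustration).
toy = pd.DataFrame({
    "Focus": ["X"] * 6 + ["Y"] * 6,
    "Question": [f"q{i}" for i in range(12)],
})
train, val = split_per_category(toy, ["X", "Y"])

# The two index sets must not intersect.
overlap = set(train.index) & set(val.index)
print(train.shape, val.shape, overlap)
```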


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the tasks below: [1.5 Marks]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files
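
The format described above can be sketched with plain strings. The function and constant names here are hypothetical, and the question-answer pairs are invented. Note that in this notebook the tags are ordinary text that the tokenizer splits into sub-word pieces; they are not registered as special tokens.

```python
QUESTION_TAG = "<question>"
ANSWER_TAG = "<answer>"

def build_sequence(question: str, answer: str) -> str:
    """Format one pair as '<question>' + question + '<answer>' + answer."""
    return f"{QUESTION_TAG}{question}{ANSWER_TAG}{answer}"

def build_corpus(pairs) -> str:
    """Join all formatted pairs into one newline-separated string."""
    return "\n".join(build_sequence(q, a) for q, a in pairs)

# Invented pairs, for illustration only.
pairs = [
    ("What is X?", "X is a condition."),
    ("What causes X?", "Unknown."),
]
corpus = build_corpus(pairs)
print(corpus)
```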

- **Combine Question and Answer for train and val data**

In [13]:
train_sequences = training_set.apply(lambda row: '<question>' + row['Question'] + '<answer>' + row['Answer'], axis=1)
val_sequences = validation_set.apply(lambda row: '<question>' + row['Question'] + '<answer>' + row['Answer'], axis=1)

- **Join the combined text using '\n' into a single string for training and validation separately**

In [14]:
train_text = '\n'.join(train_sequences)
val_text = '\n'.join(val_sequences)

- **Save the training and validation strings as text files**

In [15]:
# Write the joined training and validation strings to separate text files

with open('train_data.txt', 'w', encoding='utf-8') as f:
    f.write(train_text)

with open('val_data.txt', 'w', encoding='utf-8') as f:
    f.write(val_text)

!ls train_data.txt val_data.txt

train_data.txt	val_data.txt


**Exercise 6: Load pre-trained GPT2Tokenizer [0.5 Mark]**

- Use checkpoint = "gpt2"

In [16]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

**Exercise 7: Tokenize train and validation data and form TextDataset objects [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [17]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train_data.txt",
    block_size=128  # Or choose an appropriate block size
)

val_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="val_data.txt",
    block_size=128  # Or choose an appropriate block size
)
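
To make the role of `block_size` concrete, here is a rough sketch of the chunking idea. It assumes the dataset treats the tokenized file as one long stream and cuts it into consecutive blocks of `block_size` tokens, discarding a trailing partial block. The helper `chunk_into_blocks` is hypothetical and is not the library's code; plain integers stand in for token ids.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat token list into consecutive full blocks."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Ten stand-in token ids, cut into blocks of four.
token_ids = list(range(10))
blocks = chunk_into_blocks(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]; ids 8 and 9 are dropped
```

One consequence of this scheme is that a block can start in the middle of one answer and end in the middle of the next question, so the model sees the question-answer pairs as a continuous text stream rather than as isolated examples.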

**Exercise 8: Create a DataCollator object [0.5 Mark]**

A data collator takes a list of samples from the dataset and prepares them into a batch that can be fed into a model for training or inference.
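
As a simplified illustration of what collation means for causal language modeling, the sketch below pads samples to a common length and builds labels that mirror the inputs, with padded positions set to `-100` so the loss ignores them. The function `collate_causal_lm` is hypothetical and works on plain lists; the real collator operates on tensors, and the next-token shift is applied inside the model rather than here.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(samples, pad_id):
    """Pad token-id lists to equal length and build matching labels."""
    max_len = max(len(s) for s in samples)
    input_ids, labels = [], []
    for s in samples:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        labels.append(s + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```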

In [18]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # For causal language modeling like GPT-2, set mlm=False
)

**Exercise 9: Load pre-trained GPT2LMHeadModel [0.5 Mark]**

In [19]:
model = GPT2LMHeadModel.from_pretrained("gpt2")

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [22]:
# TrainingArguments object for GPT2 model

training_args = TrainingArguments(
    output_dir="./gpt2-medical-qa",  # Where the model predictions and checkpoints will be written
    overwrite_output_dir=True,
    num_train_epochs=30,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    report_to="none"  # Disable wandb and other logging integrations (None does not disable them in all versions)
)

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset, # Optional: include evaluation data
)

trainer.train()

# Save the trained model
model.save_pretrained("./gpt2-medical-qa")

# Save the tokenizer
tokenizer.save_pretrained("./gpt2-medical-qa")

Step,Training Loss
500,2.4833
1000,1.9222
1500,1.5667
2000,1.2738
2500,1.034
3000,0.8502
3500,0.7122
4000,0.5912
4500,0.5091
5000,0.444


('./gpt2-medical-qa/tokenizer_config.json',
 './gpt2-medical-qa/special_tokens_map.json',
 './gpt2-medical-qa/vocab.json',
 './gpt2-medical-qa/merges.txt',
 './gpt2-medical-qa/added_tokens.json')

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [26]:
def generate_response(model, tokenizer, prompt, max_length=100, num_return_sequences=1):
    """
    Generates a response using the trained GPT-2 model.

    Args:
        model: The trained GPT-2 language model.
        tokenizer: The tokenizer associated with the model.
        prompt (str): The input prompt string.
        max_length (int): The maximum total length in tokens (prompt included) of each generated sequence.
        num_return_sequences (int): The number of response sequences to generate.

    Returns:
        list: A list of generated response strings.
    """
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Move input tensor to the same device as the model
    device = model.device
    input_ids = input_ids.to(device)

    output_sequences = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id # Add this to avoid warning
    )

    generated_responses = []
    for generated_sequence in output_sequences:
        generated_sequence = generated_sequence.tolist()

        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

        # Remove the prompt from the generated text
        text = text.replace(prompt, "", 1)

        # If the model emitted an '<answer>' tag, keep the text between the
        # first '<answer>' tag and the next one (if any)
        if '<answer>' in text:
            text = text.split('<answer>')[1]

        generated_responses.append(text.strip())

    return generated_responses

# Test with some user input prompts
prompt1 = "<question>What are the symptoms of diabetes?"
response1 = generate_response(model, tokenizer, prompt1)
print(f"Prompt: {prompt1}")
print(f"Response: {response1[0]}\n")

prompt2 = "<question>How is a migraine treated?"
response2 = generate_response(model, tokenizer, prompt2)
print(f"Prompt: {prompt2}")
print(f"Response: {response2[0]}\n")

prompt3 = "<question>What is hypertension?"
response3 = generate_response(model, tokenizer, prompt3)
print(f"Prompt: {prompt3}")
print(f"Response: {response3[0]}\n")

Prompt: <question>What are the symptoms of diabetes?
Response: Diabetes can be a sign of many other things, including: - Heart disease, stroke, liver disease and diabetes, heart and blood vessel disease that damage the heart - Head and neck injuries that make it hard to breathe, even for a short time - Ails from the sun, fungi and bacteria can cause diabetes - The list of signs and symptoms for diabetes includes - Diabetes - Kidney failure - High blood pressure - Elev

Prompt: <question>How is a migraine treated?
Response: Medications called migraine medicines may help people deal head and neck pain. Medicines can also help some people temporarily, but they are not always right for everyone. People with migraines most often need professional help to get around. They may need help from family, friends, or doctor. There are many types of migraine medications. Some medicines can help relieve some of the symptoms. Others may

Prompt: <question>What is hypertension?
Response: Hormones to th
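
Because the model was trained on a continuous stream of question-answer pairs, a generation can run on into a new `<question>`. The sketch below shows one possible way to trim the decoded text to the first answer only. The helper `extract_answer` is hypothetical and is not part of `generate_response()` above; the decoded string is invented for illustration.

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Return only the first answer found in a decoded generation."""
    # Remove the echoed prompt once, from the start of the output.
    text = decoded.replace(prompt, "", 1)
    # Keep what follows the first answer tag, if the model produced one.
    if "<answer>" in text:
        text = text.split("<answer>", 1)[1]
    # Cut the text off where the model starts a new question.
    text = text.split("<question>", 1)[0]
    return text.strip()

# Invented decoded output, for illustration only.
decoded = (
    "<question>What is X?<answer>X is a condition.\n"
    "<question>What causes X?<answer>Unknown."
)
print(extract_answer(decoded, "<question>What is X?"))
```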

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate response using the untuned model, pass it as a parameter to `generate_response()` function

- Test both models (fine-tuned and untuned) with the user input prompts below:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"
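
One way to organize the comparison is a small harness that runs every prompt through every model and collects the replies side by side. The sketch below is only an outline under assumptions: `compare_models` is a hypothetical helper, and the two generators are stand-in functions so that the code runs without downloading a model. In the notebook, each generator could wrap `generate_response()` with the fine-tuned model or with a freshly loaded `GPT2LMHeadModel.from_pretrained("gpt2")`.

```python
def compare_models(prompts, generators):
    """Run every prompt through every generator and collect the replies."""
    results = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in generators.items():
            row[name] = generate(prompt)
        results.append(row)
    return results

# Stand-in generators (assumption: replace with real model calls), e.g.
#   lambda p: generate_response(model, tokenizer, p)[0]
generators = {
    "fine-tuned": lambda p: f"[fine-tuned reply to: {p}]",
    "untuned": lambda p: f"[untuned reply to: {p}]",
}

prompts = [
    "What precautions to take for a healthy life?",
    "What to do after being diagnosed with cancer?",
    "What to do when feeling sick?",
]

for row in compare_models(prompts, generators):
    print(f"Prompt: {row['prompt']}")
    print(f"  fine-tuned: {row['fine-tuned']}")
    print(f"  untuned:    {row['untuned']}\n")
```

Since the fine-tuned model was trained on text prefixed with `<question>`, it may be worth trying the prompts both with and without that prefix when comparing.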

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 1

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 1

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 2

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 2

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 3

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 3

# YOUR CODE HERE