In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('/kaggle/input/medical-ques-and-ans/intern_screening_dataset.csv')
df

Unnamed: 0,question,answer
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...
1,What is (are) Glaucoma ?,The optic nerve is a bundle of more than 1 mil...
2,What is (are) Glaucoma ?,Open-angle glaucoma is the most common form of...
3,Who is at risk for Glaucoma? ?,Anyone can develop glaucoma. Some people are a...
4,How to prevent Glaucoma ?,"At this time, we do not know how to prevent gl..."
...,...,...
16401,What is (are) Diabetic Neuropathies: The Nerve...,Autonomic neuropathy affects the nerves that c...
16402,What is (are) Diabetic Neuropathies: The Nerve...,"Proximal neuropathy, sometimes called lumbosac..."
16403,What is (are) Diabetic Neuropathies: The Nerve...,Focal neuropathy appears suddenly and affects ...
16404,How to prevent Diabetic Neuropathies: The Nerv...,The best way to prevent neuropathy is to keep ...


In [34]:
df.isna().sum()

question    0
answer      0
dtype: int64

In [2]:
# Drop rows that contain missing values
df.dropna(inplace=True)

> Some questions appear several times in the dataset with different answers. I combined all answers for each identical question into a single answer and trained the model on this aggregated dataset.

In [3]:
aggregated_answers = df.groupby('question')['answer'].apply(lambda x: ' '.join(set(x))).reset_index()
aggregated_answers

Unnamed: 0,question,answer
0,Do you have information about A1C,Summary : A1C is a blood test for type 2 diabe...
1,Do you have information about Acupuncture,Summary : Acupuncture has been practiced in Ch...
2,Do you have information about Adoption,Summary : Adoption brings a child born to othe...
3,Do you have information about Advance Directives,Summary : What kind of medical care would you ...
4,Do you have information about African American...,Summary : Every racial or ethnic group has spe...
...,...,...
14971,what research (or clinical trials) is being do...,New types of treatment are being tested in cli...
14972,what research (or clinical trials) is being do...,The National Institute of Neurological Disorde...
14973,what research (or clinical trials) is being do...,The National Institute of Neurological Disorde...
14974,what research is being done for Tuberculosis (...,TB Epidemiologic Studies Consortium\n \n The ...
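
The aggregation above joins answers through `set`, whose iteration order for strings is not guaranteed to be the same from one Python session to the next, so the combined answer text can differ between runs. A minimal sketch of an order-preserving alternative follows; `join_unique` is a hypothetical helper name, not part of the original notebook, and the sample strings are made up for illustration.

```python
def join_unique(answers):
    """Join answers with spaces, dropping exact duplicates but keeping first-seen order."""
    # dict.fromkeys preserves insertion order, unlike set
    return " ".join(dict.fromkeys(answers))


# Toy example with one repeated answer (illustrative strings only)
answers = [
    "Glaucoma damages the optic nerve.",
    "It can cause blindness.",
    "Glaucoma damages the optic nerve.",
]
combined = join_unique(answers)
print(combined)
```

With pandas, the helper could be passed to the same group-by, for example `df.groupby('question')['answer'].agg(join_unique).reset_index()`.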


In [4]:
pd.options.display.max_colwidth = None
aggregated_answers[aggregated_answers["question"]=="What is (are) Glaucoma ?"]["answer"]

11415    Glaucoma is a group of diseases that can damage the eye's optic nerve. It is a leading cause of blindness in the United States. It usually happens when the fluid pressure inside the eyes slowly rises, damaging the optic nerve. Often there are no symptoms at first. Without treatment, people with glaucoma will slowly lose their peripheral, or side vision. They seem to be looking through a tunnel. Over time, straight-ahead vision may decrease until no vision remains.    A comprehensive eye exam can tell if you have glaucoma. People at risk should get eye exams at least every two years. They include       -  African Americans over age 40    -  People over age 60, especially Mexican Americans    -  People with a family history of glaucoma       There is no cure, but glaucoma can usually be controlled. Early treatment can help protect your eyes against vision loss. Treatments usually include prescription eyedrops and/or surgery.    NIH: National Eye Institute Glaucoma is a group of 

In [5]:
pd.options.display.max_colwidth = 50
train_df, test_df = train_test_split(aggregated_answers, test_size=0.3, random_state=42)
train_df

Unnamed: 0,question,answer
1572,How to diagnose Dermatitis Herpetiformis: Skin...,A skin biopsy is the first step in diagnosing ...
12976,What is (are) Tularemia ?,Tularemia is an infection common in wild roden...
879,How many people are affected by congenital hyp...,Congenital hyperinsulinism affects approximate...
1578,How to diagnose Diabetic mastopathy ?,How is diabetic mastopathy diagnosed? The diag...
13502,What is (are) hypochondrogenesis ?,"Hypochondrogenesis is a rare, severe disorder ..."
...,...,...
5191,"What are the symptoms of Cataract, total conge...","What are the signs and symptoms of Cataract, t..."
13418,What is (are) fragile X-associated primary ova...,Fragile X-associated primary ovarian insuffici...
5390,What are the symptoms of Corneal dystrophy Thi...,What are the signs and symptoms of Corneal dys...
860,How many people are affected by common variabl...,"CVID is estimated to affect 1 in 25,000 to 1 i..."


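* I used `bart` for this dataset because it is a sequence-to-sequence model that can be trained to generate an answer directly from a question, without a context passage. I also tried `t5`, but it gave a very low score. `bert` is not suitable here because it works best for extractive question answering, where a context passage is supplied along with the question and the answer is located within that context.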

In [6]:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(test_df)

model_name = "facebook/bart-base"  # You can use "facebook/bart-large" for a larger model
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

* `preprocess_function` tokenizes every question and answer in both the training and validation datasets so they can be used to train and evaluate the model.

In [7]:
def preprocess_function(examples):
    # Questions are the encoder inputs; answers are the generation targets
    inputs = examples["question"]
    targets = examples["answer"]
    # `text_target` tokenizes the answers into `labels`; it replaces the deprecated
    # `as_target_tokenizer()` context manager (assumes a recent transformers release)
    model_inputs = tokenizer(
        inputs,
        text_target=targets,
        max_length=512,
        truncation=True,
        padding="max_length",
    )
    return model_inputs

train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/10483 [00:00<?, ? examples/s]



Map:   0%|          | 0/4493 [00:00<?, ? examples/s]
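
One detail worth checking in the preprocessing step: with `padding="max_length"`, the label sequences contain pad tokens, and unless those positions are set to `-100` they are included in the cross-entropy loss. The sketch below shows the masking on plain lists; `mask_label_padding` is a hypothetical helper, and the pad id of 1 is an assumption matching `facebook/bart-base` (in practice, read it from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_label_padding(label_ids, pad_token_id):
    """Replace pad token ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch: 0 = <s>, 2 = </s>, 1 = <pad> (assumed ids for facebook/bart-base)
batch = [[0, 713, 16, 2, 1, 1], [0, 100, 2, 1, 1, 1]]
masked = mask_label_padding(batch, pad_token_id=1)
print(masked)
```

Inside `preprocess_function`, this would be applied to the label ids before returning `model_inputs`. Masking changes what the loss measures, so losses computed this way are not directly comparable with the values reported in this notebook.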

In [8]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    save_total_limit=2,
    push_to_hub=False,
    report_to="none"
)

* I trained with a batch size of 16 for both the training and validation datasets and for 5 epochs; training for more epochs would have required more compute and time.
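
As a quick sanity check on these settings, the number of optimizer steps can be worked out from the dataset size and batch size. The sketch assumes a single GPU and no gradient accumulation (the `TrainingArguments` defaults); the 10483 training examples come from the `Map` progress output of the preprocessing cell.

```python
import math

num_train_examples = 10483  # from the Map progress output
batch_size = 16             # per_device_train_batch_size
num_epochs = 5              # num_train_epochs

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 656 3280
```

The total of 3280 agrees with the `global_step=3280` that `trainer.train()` reports in its `TrainOutput`.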

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Fine-tune the model
trainer.train()


Epoch,Training Loss,Validation Loss
1,1.968,0.855379
2,0.9545,0.806367
3,0.8913,0.781326
4,0.8466,0.770584
5,0.8264,0.767255


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=3280, training_loss=1.0378931138573624, metrics={'train_runtime': 4990.5593, 'train_samples_per_second': 10.503, 'train_steps_per_second': 0.657, 'total_flos': 1.59796682293248e+16, 'train_loss': 1.0378931138573624, 'epoch': 5.0})
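
The validation losses in the table above flatten out over the five epochs. A small sketch that computes the per-epoch improvement from those reported values (the numbers are copied from the table; nothing here is re-measured):

```python
# Validation loss per epoch, as reported in the training table
val_losses = [0.855379, 0.806367, 0.781326, 0.770584, 0.767255]

# Drop in validation loss from each epoch to the next
improvements = [
    round(prev - curr, 6) for prev, curr in zip(val_losses, val_losses[1:])
]
print(improvements)
```

Each epoch improves the validation loss by less than the previous one, with the last step gaining only about 0.003.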

In [10]:
model.save_pretrained('./medical_bart')
tokenizer.save_pretrained('./medical_bart')


('./medical_bart/tokenizer_config.json',
 './medical_bart/special_tokens_map.json',
 './medical_bart/vocab.json',
 './medical_bart/merges.txt',
 './medical_bart/added_tokens.json')

In [47]:
val_dataset

Dataset({
    features: ['question', 'answer', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 4493
})

In [63]:
from datasets import load_metric
import numpy as np

# Load the SQuAD v2 metrics
squad_metric = load_metric("squad_v2")

def compute_metrics(pred):
    # Decode the predictions and references
    pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
    labels_ids = np.where(pred.label_ids != -100, pred.label_ids, tokenizer.pad_token_id)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    
    # Prepare predictions and references in the required format
    formatted_predictions = [{"id": str(i), "prediction_text": pred_str[i]} for i in range(len(pred_str))]
    formatted_references = [{"id": str(i), "answers": {"text": [label_str[i]], "answer_start": [0]}} for i in range(len(label_str))]
    
    # Compute the metrics
    result = squad_metric.compute(predictions=formatted_predictions, references=formatted_references)
    return {"exact_match": result["exact"], "f1": result["f1"]}


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [64]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer, padding=True, max_length=512)
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    logging_steps=10,
    do_eval=True,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
results = trainer.evaluate()

# Print the evaluation results
print(f"Exact Match (EM): {results['eval_exact_match']:.2f}%")
print(f"F1 Score: {results['eval_f1']:.2f}%")


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB. GPU 0 has a total capacty of 15.89 GiB of which 1.78 GiB is free. Process 3730 has 14.12 GiB memory in use. Of the allocated memory 9.88 GiB is allocated by PyTorch, and 3.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
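
The out-of-memory error is likely related to how the plain `Trainer` evaluates a sequence-to-sequence model: it gathers the model's raw output logits over the validation set rather than generated token ids, which is both large and not what `tokenizer.batch_decode` expects. This is an inference from the error, not something the notebook confirms. One lighter alternative is to generate answers in small batches (for example with `generate_answer`) and score the decoded strings directly. The sketch below implements SQuAD-style exact match and token-level F1 in plain Python; the function names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles) follows the usual SQuAD convention rather than reproducing the `squad_v2` metric's code.

```python
import re
import string
from collections import Counter


def normalize_text(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_text(prediction) == normalize_text(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_text(prediction).split()
    ref_tokens = normalize_text(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy strings, not taken from the dataset
print(exact_match("The Glaucoma.", "glaucoma"))
print(token_f1("eye pressure test", "pressure test exam"))
```

For long free-text answers like the ones in this dataset, exact match will rarely be non-zero, so token F1 is likely the more informative of the two numbers.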

* `generate_answer` generates an answer for a given question: it tokenizes the input question, generates answer tokens with the model, and decodes those tokens back into text.

In [32]:
def generate_answer(question, max_length=50, num_beams=5):
    # Tokenize the question and move it to the same device as the model
    input_ids = tokenizer(
        question, return_tensors="pt", truncation=True, max_length=512
    ).input_ids.to(model.device)

    # Generate with beam search; max_length caps the answer length in tokens,
    # so longer answers are cut off
    generated_ids = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
        no_repeat_ngram_size=2,
    )

    # Decode the generated token ids back into text
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

In [33]:
# Example question
question = "How to diagnose Osteoarthritis ?"

answer = generate_answer(question)
print(f"Question: {question}")
print(f"Answer: {answer}")


Question: How to diagnose Osteoarthritis ?
Answer: Your doctor will diagnose osteoarthritis based on your symptoms and your physical exam. He or she will also look at your blood pressure, heart rate, and other factors that may contribute to your risk of developing the condition.
 


In [38]:
import random

# Generate answers for three randomly chosen validation questions
for _ in range(3):
    sentence = random.choice(val_dataset["question"])
    print("Question:", sentence)
    print("Answer:", generate_answer(sentence), "\n")

Question: Is neuroblastoma inherited ?
Answer: This condition is inherited in an autosomal dominant pattern, which means one copy of the altered gene in each cell is sufficient to cause the disorder. 

Question: How to diagnose Stroke ?
Answer: How is stroke diagnosed? Stroke is diagnosed based on the signs and symptoms present in each person. The following tests and procedures may be used to diagnose stroke:  Physical exam and history : An exam of the body to check general signs 

Question: Who is at risk for Osteosarcoma and Malignant Fibrous Histiocytoma of Bone? ?
Answer: Certain factors affect the risk of osteosarcoma and malignant fibrous histiocytoma of bone. These factors include:1) Having a family history of the disease in your family2) Being overweight or obese3 

