# **Model & Tokenization: Why BioGPT?**

This notebook covers the model preparation and tokenization step of a pipeline that fine-tunes a language model (BioGPT) on doctor-patient conversations using the Hugging Face transformers library.

**✅ Why BioGPT was chosen**

1. Domain specificity:
   - BioGPT is a biomedical-specific language model trained on PubMed abstracts and biomedical corpora.
   - The dataset (HealthCareMagic-100k) contains patient questions and doctor responses, which are highly clinical and medical in nature.
   - General-purpose models like GPT-2 or GPT-Neo may handle domain-specific medical vocabulary (e.g., "cardiomyopathy", "dyspnea") poorly.

2. Strong performance on medical text:
   - Microsoft's BioGPT has shown state-of-the-art results on tasks like medical question answering and information extraction.
   - It provides better contextual understanding of diseases, symptoms, and diagnoses than vanilla GPT models.

3. Pretrained checkpoint available:
   - The microsoft/BioGPT checkpoint on Hugging Face is easy to load, reproducible, and ships with both the tokenizer and the model weights.

BioGPT's training objective and vocabulary are well suited to generating doctor-style responses to patient questions.



In [14]:
from datasets import Dataset
from transformers import AutoTokenizer
import pandas as pd

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
project_path = "/content/drive/MyDrive/Fine_tuning_healthcare_gpt"

In [17]:
# The BioGPT tokenizer depends on the sacremoses package
!pip install sacremoses



In [18]:
# Load the cleaned JSONs
train_df = pd.read_json(f"{project_path}/train_data.json")
val_df = pd.read_json(f"{project_path}/val_data.json")
test_df = pd.read_json(f"{project_path}/test_data.json")

In [19]:
# Convert to HF Dataset
hf_train = Dataset.from_pandas(train_df[['patient_question', 'doctor_response']])
hf_val = Dataset.from_pandas(val_df[['patient_question', 'doctor_response']])
hf_test = Dataset.from_pandas(test_df[['patient_question', 'doctor_response']])

# Load tokenizer

In [20]:
model_name = "microsoft/BioGPT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [21]:
# Guard: fall back to the EOS token for padding if the tokenizer defines no pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize

In [22]:
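def format_for_lm(example):
    # Build one prompt/response string per example and tokenize it to a fixed length
    encoding = tokenizer(
        f"### Patient:\n{example['patient_question']}\n\n### Doctor:\n{example['doctor_response']}",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    # Causal LM labels mirror input_ids; padded positions are set to -100
    # so the loss ignores them instead of training on pad tokens
    encoding["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(encoding["input_ids"], encoding["attention_mask"])
    ]
    return encoding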
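
With `padding="max_length"`, a short example fills most of its 512 positions with padding. A common convention in causal-LM training with PyTorch is to set the labels at padded positions to -100, the default `ignore_index` of the cross-entropy loss, so those positions contribute nothing to the loss. The sketch below illustrates the relationship between `input_ids`, `attention_mask`, and `labels` on made-up token ids; `mask_padding_labels` and the toy values are hypothetical names introduced only for this illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Toy values, made up for illustration (not real BioGPT token ids).
toy_input_ids = [2, 87, 915, 44, 1, 1, 1]
toy_attention_mask = [1, 1, 1, 1, 0, 0, 0]

toy_labels = mask_padding_labels(toy_input_ids, toy_attention_mask)
print(toy_labels)  # [2, 87, 915, 44, -100, -100, -100]
```

Real tokens keep their ids as labels, while every padded slot becomes -100.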

In [23]:
tokenized_train = hf_train.map(format_for_lm)
tokenized_val = hf_val.map(format_for_lm)
tokenized_test = hf_test.map(format_for_lm)

Map:   0%|          | 0/8400 [00:00<?, ? examples/s]

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]
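
Because `truncation=True` silently cuts anything beyond `max_length=512`, it can be useful to check how many examples are affected before training. The following is a rough sketch, not part of the original notebook: `truncation_report` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a stand-in so the snippet runs without downloading a tokenizer.

```python
def truncation_report(texts, count_tokens, max_length=512):
    """Return (number of texts longer than max_length, fraction of the total)."""
    lengths = [count_tokens(t) for t in texts]
    too_long = sum(1 for n in lengths if n > max_length)
    fraction = too_long / len(lengths) if lengths else 0.0
    return too_long, fraction


# Stand-in token counter for a dry run: whitespace-separated words.
# With the real tokenizer, one could pass instead:
#   lambda t: len(tokenizer(t)["input_ids"])
def whitespace_count(text):
    return len(text.split())


sample_texts = ["short question", "word " * 600, "another brief reply"]
n_truncated, frac = truncation_report(sample_texts, whitespace_count, max_length=512)
print(n_truncated, round(frac, 3))  # 1 0.333
```

If a large share of examples exceeds the limit, the doctor response at the end of the prompt is the part that gets cut, which may be worth accounting for when choosing `max_length`.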

In [24]:
# Save for training
tokenized_train.save_to_disk(f"{project_path}/tokenized_train")
tokenized_val.save_to_disk(f"{project_path}/tokenized_val")
tokenized_test.save_to_disk(f"{project_path}/tokenized_test")

Saving the dataset (0/1 shards):   0%|          | 0/8400 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1800 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1800 [00:00<?, ? examples/s]
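
In the training notebook, the saved splits can be read back with `datasets.load_from_disk`, the counterpart of `save_to_disk`. The sketch below assumes the same directory layout as above; `tokenized_paths` and `load_tokenized_splits` are hypothetical helper names, and `datasets` is imported lazily so the path helper can be used on its own.

```python
SPLITS = ("train", "val", "test")


def tokenized_paths(project_path):
    """Map each split name to the directory written by save_to_disk."""
    return {split: f"{project_path}/tokenized_{split}" for split in SPLITS}


def load_tokenized_splits(project_path):
    """Reload all three splits from disk; requires the `datasets` package."""
    from datasets import load_from_disk  # lazy import: only needed when loading

    return {
        split: load_from_disk(path)
        for split, path in tokenized_paths(project_path).items()
    }


paths = tokenized_paths("/content/drive/MyDrive/Fine_tuning_healthcare_gpt")
print(paths["train"])
```

Calling `load_tokenized_splits(project_path)` would then return a dictionary keyed by `"train"`, `"val"`, and `"test"`.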

In [25]:
%cd /content/drive/MyDrive/Fine_tuning_healthcare_gpt

/content/drive/MyDrive/Fine_tuning_healthcare_gpt


In [26]:
!zip -r tokenized_train.zip tokenized_train
!zip -r tokenized_val.zip tokenized_val
!zip -r tokenized_test.zip tokenized_test

updating: tokenized_train/ (stored 0%)
updating: tokenized_train/data-00000-of-00001.arrow (deflated 90%)
updating: tokenized_train/state.json (deflated 38%)
updating: tokenized_train/dataset_info.json (deflated 71%)
updating: tokenized_val/ (stored 0%)
updating: tokenized_val/data-00000-of-00001.arrow (deflated 90%)
updating: tokenized_val/state.json (deflated 38%)
updating: tokenized_val/dataset_info.json (deflated 71%)
updating: tokenized_test/ (stored 0%)
updating: tokenized_test/data-00000-of-00001.arrow (deflated 90%)
updating: tokenized_test/state.json (deflated 39%)
updating: tokenized_test/dataset_info.json (deflated 71%)
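
The `updating:` lines indicate that the zip files already existed and were updated in place; `zip -r` on an existing archive does not remove entries for files that are no longer on disk. One possible alternative is to write each archive from scratch with Python's standard library. This is a sketch under that assumption, demonstrated on a temporary dummy directory rather than the real dataset; `zip_directory` is a hypothetical helper name.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(parent_dir, dir_name):
    """Write <parent_dir>/<dir_name>.zip from scratch, with entries under <dir_name>/."""
    archive_base = os.path.join(parent_dir, dir_name)
    # make_archive opens the zip in write mode, so an existing file is replaced.
    return shutil.make_archive(
        archive_base, "zip", root_dir=parent_dir, base_dir=dir_name
    )


# Dry run in a temporary directory with a dummy file (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "tokenized_demo"))
    with open(os.path.join(tmp, "tokenized_demo", "state.json"), "w") as f:
        f.write("{}")
    archive = zip_directory(tmp, "tokenized_demo")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

In the notebook, the equivalent call would be `zip_directory(project_path, "tokenized_train")`, repeated for the validation and test directories.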
