<a href="https://colab.research.google.com/github/boheling/healthAI/blob/main/SFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install necessary packages
!pip install transformers datasets trl peft --quiet
#!huggingface-cli login

from google.colab import drive
drive.mount('/content/drive')

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, TaskType

# Use the distilled 1.5B model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with LoRA for efficient fine-tuning
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # LoRA rank
    lora_alpha=32,        # Scaling factor
    lora_dropout=0.1,     # Dropout for LoRA layers
    target_modules=["q_proj", "v_proj"]  # Adjust this list based on your model's architecture
)
model = get_peft_model(model, lora_config)

# Save a pre-training backup of the LoRA-wrapped model
# (for a PEFT model, save_pretrained writes the adapter weights and config, not the full base model)
model.save_pretrained("./pretrain_model")
tokenizer.save_pretrained("./pretrain_model")

# Load a portion of the dataset and split into training and evaluation
train_dataset = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split="train[:8%]")
eval_dataset  = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split="train[8%:10%]")

# Define a tokenization function with structured formatting
def tokenize_function(examples):
    texts = []
    for i in range(len(examples["instruction"])):
        text = f"Instruction: {examples['instruction'][i]}\n"
        # Process 'input'
        if examples["input"][i]:
            if isinstance(examples["input"][i], list):
                input_text = " ".join(examples["input"][i])
            else:
                input_text = examples["input"][i]
            text += f"Input: {input_text}\n"
        # Process 'output'
        if examples["output"][i]:
            if isinstance(examples["output"][i], list):
                output_text = " ".join(examples["output"][i])
            else:
                output_text = examples["output"][i]
            text += f"Output: {output_text}\n"
        texts.append(text)
    return tokenizer(texts, truncation=True, max_length=128)

# Tokenize the training dataset and remove the original columns
unfiltered_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["instruction", "input", "output"]
)
train_tokenized = unfiltered_train.filter(lambda x: len(x["input_ids"]) > 0)

# Tokenize the evaluation dataset and remove the original columns
unfiltered_eval = eval_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["instruction", "input", "output"]
)
eval_tokenized = unfiltered_eval.filter(lambda x: len(x["input_ids"]) > 0)

# Define training arguments (checkpoints and logs go to the local Colab filesystem;
# point output_dir and logging_dir at /content/drive/MyDrive/... to keep them on Google Drive)
training_args = TrainingArguments(
    output_dir="./sft_output",  # save checkpoints here
    logging_dir="./sft_logs",
    per_device_train_batch_size=1,    # Adjust as necessary
    num_train_epochs=2,               # Increase for more training
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",      # Named eval_strategy in newer transformers releases
    eval_steps=100,
    fp16=True,                        # Mixed precision for T4 GPU
    dataloader_num_workers=2,         # Adjust based on your CPU
)

# Initialize the SFTTrainer with the LoRA-wrapped model and separate datasets
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
)

# Evaluate the pre-trained model (baseline)
print("Evaluating pre-trained model...")
pretrain_metrics = trainer.evaluate()
print("Pre-training evaluation metrics:", pretrain_metrics)

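# Sample prompt before training
# Note: this Q:/A: layout differs from the Instruction/Input/Output layout used in tokenize_function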
prompt = "Q: What could be wrong with lower back pain in a cancer patient?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7, top_p=0.9)
print("Pre-training output:", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Start fine-tuning
print("Starting fine-tuning...")
trainer.train()

# Save the fine-tuned (post-train) model (including the LoRA weights) to Google Drive
model.save_pretrained("/content/drive/MyDrive/SFT/posttrain_model")
tokenizer.save_pretrained("/content/drive/MyDrive/SFT/posttrain_model")

# Evaluate the fine-tuned model
print("Evaluating fine-tuned model...")
posttrain_metrics = trainer.evaluate()
print("Post-training evaluation metrics:", posttrain_metrics)

# Sample prompt after training
prompt = "Q: What could be wrong with lower back pain in a cancer patient?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7, top_p=0.9)
print("Post-training output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
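
The sample prompts above use a `Q:`/`A:` layout, while `tokenize_function` formats the training text as `Instruction:` / `Input:` / `Output:`. A fine-tuned model is usually probed more fairly with the layout it was trained on. The following is a minimal sketch of a helper that builds such a prompt; `build_prompt` is a hypothetical name, not part of the notebook, and the instruction string is a placeholder assumed for illustration rather than copied from the dataset.

```python
def build_prompt(instruction, input_text=""):
    """Build an inference prompt in the layout tokenize_function uses for training."""
    prompt = f"Instruction: {instruction}\n"
    if input_text:
        prompt += f"Input: {input_text}\n"
    # End on the bare "Output:" label so the model continues with the answer.
    prompt += "Output:"
    return prompt


# Placeholder instruction text (an assumption, not taken from the dataset).
prompt = build_prompt(
    "Answer the patient's medical question.",
    "What could be wrong with lower back pain in a cancer patient?",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` in place of the `Q:`/`A:` prompt, both before and after training, so that the two generations stay comparable.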

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/2243 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2243 [00:00<?, ? examples/s]



Converting eval dataset to ChatML:   0%|          | 0/2243 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/2243 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/2243 [00:00<?, ? examples/s]

Evaluating pre-trained model...


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mboheling[0m ([33mboheling-stanford-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Pre-training evaluation metrics: {'eval_loss': 4.322879314422607, 'eval_model_preparation_time': 0.0096, 'eval_runtime': 20.5515, 'eval_samples_per_second': 109.14, 'eval_steps_per_second': 13.673}
Pre-training output: Q: What could be wrong with lower back pain in a cancer patient?
A: The lower back pain in a cancer patient is likely to be due to a number of factors, including:

1. **Cancer itself** - Cancer can have a significant impact on the lower back, often leading to issues such as pain, stiffness, or difficulty moving the legs.

2. **Infiltration of cancer cells into the lower back muscles** - This can cause localized pain, stiffness, or difficulty
Starting fine-tuning...


Step,Training Loss,Validation Loss,Model Preparation Time
100,3.5417,3.529185,0.0096
200,3.3309,3.266077,0.0096
300,3.2029,3.21055,0.0096
400,3.313,3.176517,0.0096
500,3.1114,3.158074,0.0096
600,3.1228,3.142949,0.0096
700,3.2312,3.130116,0.0096
800,3.0379,3.121476,0.0096
900,2.7596,3.111855,0.0096
1000,2.8908,3.106061,0.0096


Evaluating fine-tuned model...


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Post-training evaluation metrics: {'eval_loss': 2.864748954772949, 'eval_model_preparation_time': 0.0096, 'eval_runtime': 19.5207, 'eval_samples_per_second': 114.903, 'eval_steps_per_second': 14.395}
Post-training output: Q: What could be wrong with lower back pain in a cancer patient?
A: Lower back pain in a cancer patient is often due to the treatment of the disease. Many cancer patients undergo treatment with chemotherapy and radiation. This treatment is used to improve the function of the body, and it can also help to reduce back pain. Lower back pain is often caused by the use of chemotherapy drugs and radiation therapy. This type of treatment is used to help the body recover from the treatment. Lower back
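
The two `eval_loss` values are easier to compare as perplexities. Assuming the reported loss is the mean per-token cross-entropy in nats (the usual default for causal language modeling), perplexity is simply `exp(loss)`. A minimal sketch using the two numbers printed above:

```python
import math

# eval_loss values copied from the metrics printed above.
pre_loss = 4.322879314422607
post_loss = 2.864748954772949

# Perplexity is exp(mean cross-entropy), assuming the loss is in nats per token.
pre_ppl = math.exp(pre_loss)
post_ppl = math.exp(post_loss)

print(f"pre-training perplexity:  {pre_ppl:.1f}")
print(f"post-training perplexity: {post_ppl:.1f}")
print(f"reduction factor:         {pre_ppl / post_ppl:.2f}x")
```

Lower perplexity here only means the model fits text in the training layout better. It says nothing about whether the generated answers are medically accurate, as the post-training sample illustrates.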
