In [8]:
%pip install -q transformers hf_transfer accelerate datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
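
The fork warning above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell runs before `transformers` or `tokenizers` is first used in the kernel:

```python
import os

# Set this before the first tokenizer call in the process; "false" trades
# some tokenization speed for not triggering the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting it to "true" instead keeps parallel tokenization, which the warning lists as the other valid value.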


# Basic Library Support

# NOTE: transformers>=4.37.0 is required for the Qwen2.5 series

In [1]:
import torch
import transformers

print(torch.__version__)
print(transformers.__version__)

2.8.0+cu128
4.57.3
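
The `qwen2` architecture is only known to transformers 4.37.0 and later, so a quick guard on the printed version can catch an outdated environment early. The sketch below is one way to do that check; `version_tuple` and `supports_qwen2` are hypothetical helper names, not part of any library.

```python
import re

MIN_TRANSFORMERS = (4, 37, 0)  # first release that ships the `qwen2` architecture

def version_tuple(version: str) -> tuple:
    """Turn '4.57.3' or '2.8.0+cu128' into a comparable tuple of ints."""
    core = version.split("+")[0]  # drop local build tags such as '+cu128'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits only ('0rc1' -> 0)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def supports_qwen2(version: str) -> bool:
    return version_tuple(version) >= MIN_TRANSFORMERS

print(supports_qwen2("4.57.3"))
```

In the notebook this could be called as `supports_qwen2(transformers.__version__)`.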


# Download the pre-trained model and tokenizer

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model with a causal (next-token prediction) head, which suits autocomplete-style generation
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

# Run Basic Inference to verify everything is working

In [4]:
prompt = "The capital of France is"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


The capital of France is Paris


# Inspecting the Top 5 Next-Token Probabilities

In [6]:
import torch

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

next_token_logits = outputs.logits[0, -1, :]  # logits for the last position
probs = torch.softmax(next_token_logits, dim=-1)

top5 = torch.topk(probs, 5)
for prob, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode(token_id)
    # repr() keeps tokens that contain newlines on a single output line
    print(f"Token ID {token_id.item():6} | {token_text!r} | {prob.item()*100:.2f}%")

Token ID  12095 | ' Paris' | 31.05%
Token ID  32671 | ' ______' | 10.06%
Token ID  30743 | ' ____' | 6.10%
Token ID   1304 | ' __' | 5.40%
Token ID    510 | ':\n' | 5.05%
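
To make the softmax-then-top-k step concrete without a model, here is a small pure-Python sketch on made-up logits. The four-token "vocabulary" and its scores are illustrative assumptions; the real cell does the same thing with `torch.softmax` and `torch.topk` over the full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
toy_probs = softmax(toy_logits)

for token_id, prob in top_k(toy_probs, 2):
    print(f"Token ID {token_id} | {prob * 100:.2f}%")
```

Because softmax normalizes over every token, the five probabilities printed by the notebook do not need to sum to 100%; the remainder is spread across the rest of the vocabulary.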


In [7]:
prompt = "The ICD-10 code for Type 2 Diabetes is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


The ICD-10 code for Type 2 Diabetes is:
A. 400.0



# Task: Can Qwen2.5-0.5B learn to write medical documentation through continued pretraining?

Use Case: Medical Documentation Assistant for Low-Resource Settings
- Doctors in rural clinics
- Community health workers  
- Offline capability (no internet needed)

Dataset: AGBonnet/augmented-clinical-notes (30,000 clinical notes), used here in place of PMC-Patients (167k patient summaries from PubMed Central)

What the model should LEARN (not memorize):
- Clinical writing patterns ("Patient presents with...", "On examination...")
- Medical vocabulary in context
- How to structure patient notes
- How symptoms, findings, and diagnoses connect

Before training: Generic autocomplete, doesn't know medical patterns
After training: Completes clinical notes like a doctor would write

Why 0.5B model:
- Runs on a laptop (~1 GB of weights in 16-bit precision)
- Works offline
- Fast inference
- Suitable for low-resource deployment
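
The ~1 GB figure follows from simple arithmetic on the parameter count. The sketch below assumes roughly 0.49 billion parameters for Qwen2.5-0.5B (an approximation, not measured in this notebook) and counts weights only; activations and the KV cache add to this at inference time.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

ASSUMED_PARAMS = 0.49e9  # assumption: approximate parameter count of Qwen2.5-0.5B

for label, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    print(f"{label:>9}: {weight_memory_gb(ASSUMED_PARAMS, nbytes):.2f} GB")
```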

# Download the dataset

In [12]:
from datasets import load_dataset

# This one has clinical notes and works reliably
dataset = load_dataset("AGBonnet/augmented-clinical-notes")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['idx', 'note', 'full_note', 'conversation', 'summary'],
        num_rows: 30000
    })
})


In [13]:
print(dataset['train'][0]['note'])

A a sixteen year-old girl, presented to our Outpatient department with the complaints of discomfort in the neck and lower back as well as restriction of body movements. She was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position. She would keep her head turned to the right and upwards due to the sustained contraction of the neck muscles. There was a sideways bending of the back in the lumbar region. To counter the abnormal positioning of the back and neck, she would keep her limbs in a specific position to allow her body weight to be supported. Due to the restrictions with the body movements at the neck and in the lumbar region, she would require assistance in standing and walking. She would require her parents to help her with daily chores, including all activities of self-care.
She had been experiencing these difficulties for the past four months since when she was introduced to olanzapine tablets for the control of he

In [14]:
notes = dataset['train']['note']
lengths = [len(n) for n in notes]
print(f"Total notes: {len(notes)}")
print(f"Average length: {sum(lengths)//len(lengths)} characters")
print(f"Shortest: {min(lengths)}")
print(f"Longest: {max(lengths)}")

Total notes: 30000
Average length: 2050 characters
Shortest: 1689
Longest: 2445


# Before Continued Pretraining

In [16]:
prompts = [
    "The patient was started on tablet trihexyphenidyl",
    "Diagnosis: Bipolar affective disorder, current episode",
    "On examination, the patient showed signs of dystonia with",
    "The drug was tapered gradually and replaced with"
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(f"Prompt: {prompt}")
    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Prompt: The patient was started on tablet trihexyphenidyl
Output: The patient was started on tablet trihexyphenidyl (Artane) 10 mg twice daily for 10 days. The patient was then
--------------------------------------------------
Prompt: Diagnosis: Bipolar affective disorder, current episode
Output: Diagnosis: Bipolar affective disorder, current episode of depression, or other mood disorder.
Treatment: Antidepressants, mood stabilizers, and/or
--------------------------------------------------
Prompt: On examination, the patient showed signs of dystonia with
Output: On examination, the patient showed signs of dystonia with a positive Babinski sign. The most likely diagnosis is
A. Parkinson's disease
B.
--------------------------------------------------
Prompt: The drug was tapered gradually and replaced with
Output: The drug was tapered gradually and replaced with a new drug, which was effective in 90% of patients. The new drug was effective
--------------------------------------------------


In [17]:
# Check how many tokens our notes actually have
sample_note = dataset['train'][0]['note']
tokens = tokenizer(sample_note)
print(f"This note has {len(tokens['input_ids'])} tokens")
print(f"Note length in characters: {len(sample_note)}")

This note has 417 tokens
Note length in characters: 2082


In [18]:
# How many tokens does each note have?
from tqdm import tqdm

token_counts = []
for note in tqdm(dataset['train']['note'][:1000]):  # Check first 1000
    tokens = tokenizer(note)
    token_counts.append(len(tokens['input_ids']))

print(f"Min tokens: {min(token_counts)}")
print(f"Max tokens: {max(token_counts)}")
print(f"Average tokens: {sum(token_counts)//len(token_counts)}")
print(f"Notes over 512 tokens: {sum(1 for t in token_counts if t > 512)}")

Min tokens: 379
Max tokens: 516
Average tokens: 441
Notes over 512 tokens: 2
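
With 2 of the 1,000 sampled notes above 512 tokens and a maximum of 516, truncating at 512 would affect about 0.2% of notes and drop at most 4 tokens from each. A small helper makes that cost explicit for any cutoff; the counts below are illustrative values, not the real distribution, and `truncation_report` is a hypothetical name.

```python
def truncation_report(token_counts, max_length):
    """Summarize how much a max_length cutoff would remove."""
    over = [t for t in token_counts if t > max_length]
    dropped = sum(t - max_length for t in over)
    return {
        "notes_truncated": len(over),
        "share_truncated": len(over) / len(token_counts),
        "tokens_dropped": dropped,
        "share_tokens_dropped": dropped / sum(token_counts),
    }

toy_counts = [379, 441, 500, 514, 516]  # illustrative, not the real data
report = truncation_report(toy_counts, max_length=512)
print(report)
```

In the notebook, passing the `token_counts` list from the cell above would give the figures for the sampled notes.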





# Prepare the dataset

In [19]:
def prepare_dataset(examples):
    return tokenizer(
        examples['note'],
        truncation=True,
        max_length=512,
        padding=False
    )

tokenized_dataset = dataset['train'].map(
    prepare_dataset,
    batched=True,
    remove_columns=dataset['train'].column_names
)

print(tokenized_dataset)

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 30000
})


# Training Config

In [22]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
training_args = TrainingArguments(
    output_dir="./qwen-medical",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=50,
    save_steps=500,
    bf16=True,  # bfloat16 mixed precision; requires hardware with bf16 support
)

print("Training config ready")

Training config ready
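
With `mlm=False`, the collator pads each batch to a common length and builds `labels` as a copy of `input_ids` with padding replaced by -100, the index that the loss ignores; the shift by one position happens inside the model. The pure-Python sketch below mimics that behaviour on plain lists. It assumes right-padding for illustration and masks by position, whereas the real collator masks every label equal to the pad token id.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def collate_causal_lm(batch, pad_token_id):
    """Pad variable-length token lists and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(ids + [IGNORE_INDEX] * n_pad)  # no loss on padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

out = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(out["labels"])
```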


# Start the training

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.


Step   Training Loss
  50   2.5071
 100   2.4738
 150   2.49
 200   2.4854
 250   2.4813
 300   2.4771
 350   2.4474
 400   2.4724
 450   2.4561
 500   2.4595


TrainOutput(global_step=1875, training_loss=2.4418360188802084, metrics={'train_runtime': 1720.6355, 'train_samples_per_second': 17.435, 'train_steps_per_second': 1.09, 'total_flos': 2.996920540351795e+16, 'train_loss': 2.4418360188802084, 'epoch': 1.0})
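
The numbers in `TrainOutput` can be cross-checked by hand. The sketch below recomputes the step count and throughput from the training arguments and converts the mean training loss into a perplexity. It assumes a single training process, which is consistent with the logged 1875 steps; note that this is training loss, not a held-out evaluation.

```python
import math

num_examples = 30_000
per_device_batch = 4
grad_accum = 4
num_devices = 1  # assumption: one training process

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Values copied from the TrainOutput above
train_loss = 2.4418360188802084
train_runtime_s = 1720.6355

perplexity = math.exp(train_loss)  # perplexity is exp(mean cross-entropy)
samples_per_second = num_examples / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"train perplexity:     {perplexity:.2f}")
print(f"samples per second:   {samples_per_second:.3f}")
```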

In [24]:
# Save the trained model
trainer.save_model("./qwen-medical-final")
tokenizer.save_pretrained("./qwen-medical-final")

print("Model saved!")

Model saved!


In [25]:
print(model is trainer.model)  # Should print: True

True


# After Continued Pretraining

In [27]:
prompts = [
    "The patient was started on tablet trihexyphenidyl",
    "Diagnosis: Bipolar affective disorder, current episode",
    "On examination, the patient showed signs of dystonia with",
    "The drug was tapered gradually and replaced with"
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, 
        max_new_tokens=30,
        repetition_penalty=1.2,
        do_sample=True,
        temperature=0.7
    )
    print(f"Prompt: {prompt}")
    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print("-" * 50)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Prompt: The patient was started on tablet trihexyphenidyl
Output: The patient was started on tablet trihexyphenidyl at 25 mg/day, followed by decreasing the dose to a maximum of half that daily as tolerated. The dosage continued thereafter and increased slightly every
--------------------------------------------------
Prompt: Diagnosis: Bipolar affective disorder, current episode
Output: Diagnosis: Bipolar affective disorder, current episode with recent onset of hallucinatory episodes associated with visual and auditory hallucinations.
Management:
The patient had a previous history of an unspecified substance use disorder
--------------------------------------------------
Prompt: On examination, the patient showed signs of dystonia with
Output: On examination, the patient showed signs of dystonia with myoclonus in both arms and lower limbs. There was no evidence for muscle atrophy or weakness on palpation.
The electromyography (
--------------------------------------------------
Prompt: The drug was tapered gradually and replaced with
Output: The drug was tapered gradually and replaced with an oral contraceptive (OCP) for 6 weeks. Patient started on the OCP at a dose of 10 mg daily, but her
--------------------------------------------------
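
The run before training used greedy decoding with `max_new_tokens=20`, while the run after training used sampling with `temperature=0.7`, `repetition_penalty=1.2`, and `max_new_tokens=30`, so some of the visible difference may come from the decoding settings rather than from training. One way to separate the two is to run both models through the same settings. The harness below is a sketch: `side_by_side` and the stand-in generators are hypothetical, and they exist only so the example runs without a model.

```python
def side_by_side(prompts, generators, **gen_kwargs):
    """Run every prompt through every generator with identical settings."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for label, generate in generators.items():
            row[label] = generate(prompt, **gen_kwargs)
        rows.append(row)
    return rows

# Stand-in generators; in the notebook each would wrap
# tokenizer(...) -> model.generate(...) -> tokenizer.decode(...).
def fake_base(prompt, max_new_tokens):
    return prompt + " <base continuation>"

def fake_tuned(prompt, max_new_tokens):
    return prompt + " <tuned continuation>"

rows = side_by_side(
    ["The drug was tapered gradually and replaced with"],
    {"base": fake_base, "tuned": fake_tuned},
    max_new_tokens=20,
)
for row in rows:
    for key, value in row.items():
        print(f"{key}: {value}")
```

Since `model` was trained in place (`model is trainer.model` prints `True`), the base side of such a comparison would need a fresh copy of `Qwen/Qwen2.5-0.5B` loaded separately.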
