- Allison Evanich
- Week 8 - Milestone 3

### Introduction

In this milestone, I applied what I’ve learned about fine-tuning generative AI models to create a healthcare-specific model that summarizes clinical visit information. The goal is to fine-tune OpenAI’s gpt-4o-mini-2024-07-18 snapshot to generate structured, concise clinical notes for a single specialty, in this case pediatrics.

This work builds on the earlier project idea of a generative AI system that assists clinicians by transforming unstructured visit transcripts into standardized Electronic Health Record (EHR) summaries. While Epic Systems recently integrated Abridge for general note generation, specialty-specific adaptation remains an open challenge.

By fine-tuning with domain-specific datasets, I aim to explore how customized generative AI can improve accuracy, context, and relevance in clinical documentation.

- Problem: Clinicians spend a significant portion of their time writing and editing notes in EHR systems. AI-generated summaries can help reduce administrative workload, but general-purpose models often miss specialty-specific context.
- Goal: Fine-tune a generative AI model to produce structured clinical summaries tailored to pediatric visits.
- Why it matters: Specialty fine-tuning could reduce documentation errors, improve patient communication, and save time in clinical workflows.

### Model Design Rationale

- **Use case:** Specialty-aware clinical summaries for Pediatrics.  
- **Why fine-tune:** Teach structure (SOAP) and specialty cues for concise, consistent notes.  
- **Data format:** Chat `messages` schema (one JSON object per line, JSONL), required by the current fine-tuning API.  
- **Success criteria:** Structured outputs, brevity, clinical relevance, fewer edits.

### Setup Environment

In [49]:
from openai import OpenAI
import os, json, time, pathlib, re
from itertools import islice

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

### Prepare and Convert Datasets

The following dataset is used for fine-tuning:

- Pediatric dataset (20 examples): common childhood symptoms and diagnoses.

Each example contains a short visit summary prompt and a structured completion with the standard SOAP (Subjective, Objective, Assessment, Plan) layout.

Because OpenAI’s current fine-tuning API requires the chat `messages` schema rather than prompt/completion pairs, the dataset is converted automatically below.

In [52]:
PEDS_SRC = "pediatric_notes_train.jsonl"

PEDS_CHAT = "pediatric_notes_train_chat.jsonl"

print(PEDS_SRC, "→", PEDS_CHAT)

pediatric_notes_train.jsonl → pediatric_notes_train_chat.jsonl


### Convert to Chat `messages` Schema

In [54]:
def to_chat_schema(src_path, dst_path):
    """Convert prompt/completion JSONL records to the chat `messages` schema."""
    if not pathlib.Path(src_path).exists():
        print(f"[skip] {src_path} not found, skipping conversion")
        return
    n = 0
    with open(src_path, "r", encoding="utf-8") as fin, open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # ignore blank lines
            obj = json.loads(line)
            # The prompt becomes the user turn; the completion becomes the assistant turn.
            msg = {
                "messages": [
                    {"role": "user", "content": obj["prompt"]},
                    {"role": "assistant", "content": obj["completion"].lstrip()}
                ]
            }
            fout.write(json.dumps(msg, ensure_ascii=False) + "\n")
            n += 1
    print(f"Converted {n} → {dst_path}")

to_chat_schema(PEDS_SRC, PEDS_CHAT)

Converted 20 → pediatric_notes_train_chat.jsonl


### Preview & Validate

In [56]:
def preview(path, k=2):
    p = pathlib.Path(path)
    if not p.exists():
        print(f"[warn] {path} missing")
        return
    with p.open("r", encoding="utf-8") as f:
        print(f"Preview {path}:")
        for line in islice(f, k):
            print(line.rstrip()[:300])
        print("-"*60)

def full_schema_check(path):
    p = pathlib.Path(path)
    if not p.exists():
        print(f"[warn] {path} missing")
        return
    n = 0
    with p.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            obj = json.loads(line)
            assert "messages" in obj and isinstance(obj["messages"], list), f"missing messages at line {i}"
            assert len(obj["messages"]) == 2, f"expected 2 messages at line {i}"
            assert obj["messages"][0]["role"] == "user", f"user role missing at line {i}"
            assert obj["messages"][1]["role"] == "assistant", f"assistant role missing at line {i}"
            for m in obj["messages"]:
                assert isinstance(m["content"], str) and m["content"].strip(), f"empty content at line {i}"
            n += 1
    print(f"{path}: {n} lines OK ✅")

preview(PEDS_CHAT); full_schema_check(PEDS_CHAT)

Preview pediatric_notes_train_chat.jsonl:
{"messages": [{"role": "user", "content": "Patient: 4-year-old with fever, decreased appetite, cough for 3 days.\n\nSummarize:"}, {"role": "assistant", "content": "History: 4-year-old presents with 3-day history of fever, cough, decreased appetite. No vomiting or diarrhea. Mild irritability noted.\n
{"messages": [{"role": "user", "content": "Patient: 7-year-old with persistent headache and mild photophobia.\n\nSummarize:"}, {"role": "assistant", "content": "History: 7-year-old with 3-day headache, mild photophobia, no nausea or vomiting.\nPhysical Exam: Vitals normal, neurologic exam normal.\nA
------------------------------------------------------------
pediatric_notes_train_chat.jsonl: 20 lines OK ✅
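With only 20 examples, it may be worth holding a few out so that a validation loss can be reported alongside the training loss. The following is a minimal sketch, not part of the run documented here: `split_jsonl_lines`, the 20% hold-out fraction, and the stand-in records are all assumptions for illustration.

```python
import json
import random

def split_jsonl_lines(lines, holdout_fraction=0.2, seed=0):
    """Shuffle JSONL lines deterministically and split into (train, validation)."""
    records = [ln for ln in lines if ln.strip()]  # drop blank lines
    rng = random.Random(seed)                     # fixed seed keeps the split reproducible
    rng.shuffle(records)
    n_val = max(1, round(len(records) * holdout_fraction))
    return records[n_val:], records[:n_val]

# Hypothetical stand-in records in the chat `messages` schema (not the real dataset).
demo = [
    json.dumps({"messages": [
        {"role": "user", "content": f"case {i}"},
        {"role": "assistant", "content": f"note {i}"},
    ]})
    for i in range(20)
]

train, val = split_jsonl_lines(demo)
print(len(train), len(val))  # 16 4
```

The held-out lines could then be written to their own JSONL file, uploaded with `purpose="fine-tune"`, and passed to `client.fine_tuning.jobs.create` through its `validation_file` parameter.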


### Upload Training Files

In [58]:
# Open the file in a context manager so the handle is closed after the upload.
with open(PEDS_CHAT, "rb") as f:
    peds_file = client.files.create(file=f, purpose="fine-tune")
print("Pediatrics file ID:", peds_file.id)

Pediatrics file ID: file-DtuyfEq5PMQpBo5Lswfxo2


### Create the Fine-Tuned Models

In [60]:
SNAPSHOT = "gpt-4o-mini-2024-07-18"

peds_job = client.fine_tuning.jobs.create(
    training_file=peds_file.id,
    model=SNAPSHOT,
    suffix="peds-notes-v1"
)

print("Pediatrics job ID:", peds_job.id)

Pediatrics job ID: ftjob-V5pS9TMMgyiRfWva7yRkna2E


### Track the Build (Status + Events)

In [62]:
def job_status(job_id):
    return client.fine_tuning.jobs.retrieve(job_id).status

def list_events(job_id, limit=100):
    ev = client.fine_tuning.jobs.list_events(job_id, limit=limit)
    for e in ev.data:
        print(f"{e.level:>5} | {e.message}")

print("Pediatrics status:", job_status(peds_job.id)); list_events(peds_job.id, 50)

Pediatrics status: validating_files
 info | Validating training file: file-DtuyfEq5PMQpBo5Lswfxo2
 info | Created fine-tuning job: ftjob-V5pS9TMMgyiRfWva7yRkna2E


### Poll Until Completion

In [64]:
def wait_until_complete(job_id, poll_sec=15, max_min=45):
    print(f"Polling {job_id} every {poll_sec}s (max {max_min} min)")
    start = time.time()
    while True:
        s = job_status(job_id)
        print("Status:", s)
        list_events(job_id, 10)
        if s in ("succeeded","failed","cancelled"):
            print("Final status:", s)
            break
        if time.time() - start > max_min * 60:
            print("Timed out.")
            break
        time.sleep(poll_sec)

# wait_until_complete(peds_job.id)
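Polling at a fixed interval works, but a fine-tuning job can sit in the queue for a while, so a growing interval cuts down on API calls. The following is a minimal sketch of that idea, assuming a hypothetical helper `poll_with_backoff` that takes any zero-argument status function; the wait times and timeout are illustrative defaults, not recommendations from OpenAI.

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}

def poll_with_backoff(get_status, first_wait=5.0, max_wait=60.0, factor=2.0,
                      timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal status or the timeout passes.

    The wait between calls grows by `factor` each time, capped at `max_wait`.
    Returns the last status seen.
    """
    start = clock()
    wait = first_wait
    while True:
        status = get_status()
        if status in TERMINAL or clock() - start > timeout:
            return status
        sleep(wait)
        wait = min(wait * factor, max_wait)

# Dry run with a fake status sequence; sleeps are recorded instead of performed.
fake = iter(["validating_files", "queued", "running", "succeeded"])
waits = []
final = poll_with_backoff(lambda: next(fake), sleep=waits.append)
print(final, waits)  # succeeded [5.0, 10.0, 20.0]
```

In this notebook it could be called as `poll_with_backoff(lambda: job_status(peds_job.id))`.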

### Document Metrics (Parse Events for Loss/Steps)

In [66]:
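import time, re

# Event messages look like "Step 6/100: training loss=1.70", so match
# "training loss" (and tolerate "train_loss" in case the wording differs).
LOSS_RE = re.compile(r"train(?:ing)?[ _]loss\s*=\s*([0-9.]+)", re.I)
STEP_RE = re.compile(r"Step\s+([0-9]+)/([0-9]+)", re.I)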

def job_status(job_id):
    return client.fine_tuning.jobs.retrieve(job_id).status

def list_events(job_id, limit=100):
    return client.fine_tuning.jobs.list_events(job_id, limit=limit).data

def extract_metrics_from_events(events):
    steps, losses = [], []
    for e in events:
        msg = getattr(e, "message", "") or ""
        m_step = STEP_RE.search(msg)
        if m_step:
            steps.append((int(m_step.group(1)), int(m_step.group(2))))
        m_loss = LOSS_RE.search(msg)
        if m_loss:
            losses.append(float(m_loss.group(1)))
    return steps, losses

def wait_until_complete(job_id, poll_sec=15, max_min=60, show_tail=6):
    """Polls the job until it reaches succeeded/failed/cancelled.
    Prints only on status change or new events."""
    print(f"⏳ Waiting for {job_id} ...")
    start = time.time()
    last_status = None
    seen_event_ids = set()

    while True:
        j = client.fine_tuning.jobs.retrieve(job_id)
        status = j.status
        if status != last_status:
            print(f"Status → {status}")
            last_status = status

        # Print only new events. Events arrive newest-first, so the slice below
        # takes the oldest unseen ones; anything not printed stays unseen and is
        # picked up on a later poll.
        evs = list_events(job_id, limit=100)
        new_evs = [e for e in evs if e.id not in seen_event_ids]
        for e in new_evs[-show_tail:]:
            print(f"{e.level:>5} | {e.message}")
            seen_event_ids.add(e.id)

        if status in ("succeeded", "failed", "cancelled"):
            print(f"✅ Final status: {status}")
            return j  # return full job object

        if time.time() - start > max_min * 60:
            print("⏰ Timed out; returning latest job object.")
            return j

        time.sleep(poll_sec)

def summarize_metrics(job_id, label):
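    # Events come back newest-first (see the order in the log output), so
    # reverse them to make "First" and "Last" chronological.
    evs = list(reversed(list_events(job_id, limit=500)))
    steps, losses = extract_metrics_from_events(evs)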
    print(f"\n=== {label} Metrics ===")
    if steps:
        cur, total = steps[-1]
        print(f"Steps reported: {len(steps)} (last {cur} of {total})")
    else:
        print("No step reports yet.")
    if losses:
        print(f"Loss points: {len(losses)} | First: {losses[0]:.3f} | Last: {losses[-1]:.3f} | Min: {min(losses):.3f}")
    else:
        print("No loss reports yet.")

def test_model_from_job(job, prompt, label):
    if job.status != "succeeded":
        print(f"{label}: model not ready (status = {job.status})")
        return
    model = job.fine_tuned_model
    print(f"{label} model: {model}")
    r = client.chat.completions.create(
        model=model,
        messages=[{"role":"user","content":prompt}]
    )
    print("\n--- Output ---\n")
    print(r.choices[0].message.content)

# 🔸 Run the blocking wait for the job, then summarize + test
peds_done = wait_until_complete(peds_job.id, poll_sec=20, max_min=90)

summarize_metrics(peds_job.id, "Pediatrics")

test_model_from_job(
    peds_done,
    "Patient: 6-year-old with 2 days of cough and 101°F fever. Summarize:",
    "Pediatrics"
)

⏳ Waiting for ftjob-V5pS9TMMgyiRfWva7yRkna2E ...
Status → validating_files
 info | Validating training file: file-DtuyfEq5PMQpBo5Lswfxo2
 info | Created fine-tuning job: ftjob-V5pS9TMMgyiRfWva7yRkna2E
Status → running
 info | Fine-tuning job started
 info | Files validated, moving job to queued state
 info | Step 6/100: training loss=1.70
 info | Step 5/100: training loss=1.52
 info | Step 4/100: training loss=2.33
 info | Step 3/100: training loss=2.11
 info | Step 2/100: training loss=2.24
 info | Step 1/100: training loss=2.49
 info | Step 12/100: training loss=1.14
 info | Step 11/100: training loss=0.77
 info | Step 10/100: training loss=0.85
 info | Step 9/100: training loss=0.77
 info | Step 8/100: training loss=1.20
 info | Step 7/100: training loss=1.38
 info | Step 18/100: training loss=0.49
 info | Step 17/100: training loss=0.58
 info | Step 16/100: training loss=0.94
 info | Step 15/100: training loss=0.72
 info | Step 14/100: training loss=0.84
 info | Step 13/100: 

### Evaluate the Fine-Tuned Models

In [68]:
def try_infer(job_id, prompt):
    j = client.fine_tuning.jobs.retrieve(job_id)
    if j.status != "succeeded":
        print("Model not ready. Status:", j.status)
        return
    model = j.fine_tuned_model
    print("Model:", model)
    resp = client.chat.completions.create(model=model, messages=[{"role":"user","content":prompt}])
    print("\n--- Output ---\n")
    print(resp.choices[0].message.content)

try_infer(peds_job.id, "Patient: 6-year-old with 2 days of cough and 101°F fever. Summarize:")

Model: ft:gpt-4o-mini-2024-07-18:personal:peds-notes-v1:CYD9olYs

--- Output ---

History: 6-year-old with a 2-day history of cough and fever 101°F. No vomiting or diarrhea.

Physical Exam: Mild erythematous pharynx, lungs clear, otherwise normal.

Assessment: Viral upper respiratory infection.

Plan: Supportive care, fluids, antipyretics as needed, return if symptoms worsen.
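The success criteria listed earlier (structured output, brevity) can be checked mechanically. The following is a minimal sketch, assuming the four section labels seen in the output above and a hypothetical 120-word limit; `check_note` is an illustrative helper and does not judge clinical relevance.

```python
import re

# Section labels as they appear in the fine-tuned model's output above.
SECTIONS = ("History", "Physical Exam", "Assessment", "Plan")

def check_note(text, max_words=120):
    """Return (ok, problems): all sections present, in order, and under max_words."""
    problems = []
    last_pos = -1
    for name in SECTIONS:
        # The label must start a line and be followed by text on that same line.
        m = re.search(rf"^{re.escape(name)}:[ \t]*\S", text, flags=re.M)
        if not m:
            problems.append(f"missing section: {name}")
        elif m.start() < last_pos:
            problems.append(f"section out of order: {name}")
        else:
            last_pos = m.start()
    n_words = len(text.split())
    if n_words > max_words:  # max_words is an assumed threshold
        problems.append(f"too long: {n_words} words")
    return (not problems, problems)

sample = """History: 6-year-old with a 2-day history of cough and fever 101°F. No vomiting or diarrhea.

Physical Exam: Mild erythematous pharynx, lungs clear, otherwise normal.

Assessment: Viral upper respiratory infection.

Plan: Supportive care, fluids, antipyretics as needed, return if symptoms worsen.
"""

print(check_note(sample))                    # (True, [])
print(check_note("Assessment: Viral URI."))  # reports the three missing sections
```

Running a check like this over a set of held-out prompts would give a rough pass rate for the "structured outputs" criterion.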


### Track and Document Metrics

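Using `client.fine_tuning.jobs.list_events()` and `client.fine_tuning.jobs.retrieve()`, I can track training progress and metrics such as per-step training loss, job status, and completion. In the log above, training loss fell from 2.49 at step 1 to 0.49 at step 18, which is how I documented that the fine-tuning process was working as expected.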
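Because events are listed newest-first, it helps to key the loss values by step number before reading a trend off them. The following is a minimal sketch that works on plain message strings; `loss_curve` is a hypothetical helper, and the regular expression assumes the message wording seen in this run's log.

```python
import re

# Matches messages such as "Step 6/100: training loss=1.70".
EVENT_RE = re.compile(r"Step\s+(\d+)/(\d+):\s*training loss=([0-9]+(?:\.[0-9]+)?)")

def loss_curve(messages):
    """Return [(step, loss), ...] sorted by step, skipping messages without a loss."""
    points = {}
    for msg in messages:
        m = EVENT_RE.search(msg)
        if m:
            points[int(m.group(1))] = float(m.group(3))
    return sorted(points.items())

# Messages copied from the log above, in the order they were printed.
log = [
    "Fine-tuning job started",
    "Step 6/100: training loss=1.70",
    "Step 5/100: training loss=1.52",
    "Step 4/100: training loss=2.33",
    "Step 3/100: training loss=2.11",
    "Step 2/100: training loss=2.24",
    "Step 1/100: training loss=2.49",
    "Step 13/100: ",  # truncated line: no loss value, so it is skipped
]

curve = loss_curve(log)
print(curve[0], curve[-1])  # (1, 2.49) (6, 1.7)
```

With live data, the input would be `[e.message for e in list_events(job_id)]`, using the version of `list_events` that returns the event list.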

### Reflection 

This milestone demonstrated the complete process of fine-tuning a generative AI model using OpenAI’s platform to support clinical documentation in pediatrics. By preparing structured training data in chat format, validating the dataset schema, and tracking model performance through fine-tuning endpoints, the project successfully produced a model capable of generating concise, SOAP-style pediatric clinical summaries. The workflow reinforced the importance of data quality, iterative testing, and the monitoring of training metrics as key components of responsible AI development in healthcare.

While only the pediatric model reached completion during this phase, the results highlight the strong potential for expanding this framework to other medical domains. The next step will involve fine-tuning additional specialty models, beginning with Oncology and followed by Cardiology and Neurology, so that each model is optimized for its specialty’s terminology and documentation requirements. Ultimately, this multi-specialty expansion could form the foundation for an intelligent, domain-adaptive documentation assistant that integrates seamlessly into EHR systems like Epic, improving both clinical efficiency and note accuracy.
