# Assignment 7: Clinical NLP with LLMs and Embeddings

Extract structured data from clinical notes using LLM prompt engineering, then build a semantic search system using sentence embeddings.

**Dataset:** 75 synthetic discharge summaries from [Asclepius-Synthetic-Clinical-Notes](https://huggingface.co/datasets/aisc-team-a1/Asclepius-Synthetic-Clinical-Notes) (Kweon et al., 2023) in `asclepius_notes.json`.

## Setup

In [1]:
%pip install -q -r requirements.txt

# Clear state after installing packages. If you re-run cells out of order later, re-run this cell first.
%reset -f

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
import random
import numpy as np
from dotenv import load_dotenv

os.makedirs("output", exist_ok=True)
load_dotenv()
print("Setup complete!")

Setup complete!


### API Key

Part 1 requires an [OpenRouter](https://openrouter.ai) API key (OpenAI keys also work). Add the key from the class forum to `.env` (not `example.env`). It should look like:

```bash
OPENROUTER_API_KEY=sk-...
```
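
A quick way to confirm the key was picked up, without printing the secret itself, is to report only whether the variable is set and how long it is. This is a minimal sketch; `key_status` is a hypothetical helper, not part of the assignment code, and it assumes `load_dotenv()` has already run.

```python
import os


def key_status(name="OPENROUTER_API_KEY"):
    """Describe whether an environment variable is set, without revealing its value."""
    value = os.environ.get(name, "")
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"


print(key_status())
```

If this reports that the key is not set, check that the file is named `.env`, that it sits next to the notebook, and re-run the setup cell.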

### Helper Functions (modify at your own risk)

In [3]:
# --- LLM client setup (do not modify) ---

def get_client():
    """Initialize the LLM client based on available API keys."""
    from openai import OpenAI

    if os.environ.get("OPENROUTER_API_KEY"):
        client = OpenAI(
            api_key=os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )
        return client, "openrouter"

    if os.environ.get("OPENAI_API_KEY"):
        return OpenAI(), "openai"

    raise ValueError(
        "No API key found. Set OPENROUTER_API_KEY or OPENAI_API_KEY in .env"
    )


def call_llm(prompt, provider, client):
    """Send a prompt to the LLM and return the response text."""
    model = "openai/gpt-4o-mini" if provider == "openrouter" else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a medical information extraction assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
        max_tokens=500,
    )
    return response.choices[0].message.content


def get_device():
    """Detect the best available device for local model inference."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

### Load Data

In [4]:
with open("asclepius_notes.json") as f:
    asclepius = json.load(f)

print(f"Loaded {len(asclepius)} synthetic clinical notes")
print(f"Keys: {list(asclepius[0].keys())}")

Loaded 75 synthetic clinical notes
Keys: ['patient_id', 'note']


In [5]:
print(asclepius[0]["note"][:500] + "...")

Discharge Summary

Patient Name: N/A
Date of Admission: N/A
Date of Discharge: N/A

Hospital Course:

This patient was admitted with left back pain which had been persistent for over 6 months and had recently aggravated. Radiological examination revealed a cavitary lesion in the left lower lobe of the lung that was indicative of pulmonary chronic inflammation. The lesion was confirmed to be hamartoma after histology revealed the diagnosis. 
Subsequently, a left lower lobectomy was performed. The...


---

## Part 1: Clinical Entity Extraction

Use LLM prompt engineering to extract structured medical data from clinical notes.

In [6]:
# Select 4 notes for extraction
random.seed(2026)
sample = random.sample(asclepius, 4)
notes_p1 = [s["note"] for s in sample]

print(f"Selected {len(notes_p1)} notes for extraction")
for i, n in enumerate(notes_p1, 1):
    print(f"\n--- Note {i} ({len(n)} chars) ---")
    print(n[:150] + "...")

Selected 4 notes for extraction

--- Note 1 (1624 chars) ---
Hospital Course:

The patient was a 27-year-old pregnant woman who presented with recurrent panic attacks and other anxiety symptoms. On evaluation, i...

--- Note 2 (1119 chars) ---
Discharge Summary:

Patient Name: [Redacted]
Age: 35
Sex: Male

Clinical Course:

The patient was admitted to our hospital after reporting an episode ...

--- Note 3 (1459 chars) ---
Hospital Course Summary:

Patient was admitted for multiple warty lesions with severe pruritus on the lower legs and dorsa of feet. The lesions had be...

--- Note 4 (1250 chars) ---
Hospital Course:
The patient, a 23-year-old male with no known comorbidities, presented with complaints of epigastric pain and frequent bilious vomiti...


### `build_prompt`

Build a prompt that instructs the LLM to extract structured data from a clinical note.

In [7]:
# Implement build_prompt
def build_prompt(note, few_shot=False):
    
    schema = json.dumps({
        "diagnosis": "<Primary diagnosis as a string>",
        "medications": ["<List of medications mentioned in the note>"],
        "lab_values": {"<Lab test name>": "<Value and units>"},
        "confidence": "<Float value between 0.0 and 1.0>"
    }, indent=2)

    example = ""

    if few_shot:
        example = """
        
Example Note 1:

    Note: The patient was a 27-year-old pregnant woman who presented with recurrent panic attacks and other anxiety symptoms. 
    Labs showed Hemoglobin A1c of 5.6% and elevated cortisol levels. 
    She was diagnosed with Generalized Anxiety Disorder and prescribed Sertraline 50mg daily.

    Output:{
    "diagnosis": "Generalized Anxiety Disorder",
    "medications": ["Sertraline 50mg daily"],
    "lab_values": {
        "Hemoglobin A1c": "5.6%",
        "Cortisol": "elevated"
    },
    "confidence": 0.95
    }
        
Example Note 2:

    Note: A 34-year-old woman presented with a two-week history of fatigue, joint pain, and a butterfly-shaped facial rash. 
    ANA was positive at 1:640, anti-dsDNA was elevated. 
    She was started on hydroxychloroquine and low-dose prednisone.

    Output:{
    "diagnosis": "Systemic Lupus Erythematosus",
    "medications": ["Hydroxychloroquine", "Low-dose prednisone"],
    "lab_values": {
        "ANA": "positive at 1:640",
        "Anti-dsDNA": "elevated"
    },
    "confidence": 0.92
    }"""
        
    return f"""Extract structured medical information from the following clinical note.
Return ONLY a JSON object matching this schema:
{schema}

Guidelines:
- If "medications" are REDACTED or unnamed, return "Medication REDACTED" or "Medication Unnamed" accordingly
- For "lab_values" allow both qualitative and quantitative values for test results (e.g. "5.6%", "elevated", "normal")
- "confidence" between 0.0 and 1.0, with 0.0 being no confidence and 1.0 being complete confidence in the accuracy of the extracted information. 
- Set "confidence" higher when field values are explicitly stated, and lower when they are implied or uncertain.
- Penalize "confidence" for note length, as longer notes contain more complex information and more room for error.
- If any field cannot be extracted, return "None" for that field.

{example}

Extract from this note:
{note}"""
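
Hand-written few-shot examples are easy to get subtly wrong: a stray comma or missing brace would show the model invalid JSON. One option, sketched below, is to build the example text from Python dicts with `json.dumps`, so the output half of each example is valid JSON by construction. `render_example` is a hypothetical helper introduced here for illustration; it is not part of `build_prompt` above.

```python
import json


def render_example(note, expected, index=1):
    """Format one few-shot example; `expected` is a dict matching the schema."""
    return (
        f"Example Note {index}:\n\n"
        f"Note: {note}\n\n"
        f"Output:\n{json.dumps(expected, indent=2)}\n"
    )


example_text = render_example(
    "She was started on hydroxychloroquine and low-dose prednisone.",
    {
        "diagnosis": "Systemic Lupus Erythematosus",
        "medications": ["Hydroxychloroquine", "Low-dose prednisone"],
        "lab_values": {},
        "confidence": 0.92,
    },
    index=2,
)
print(example_text)
```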

### `parse_json_response`

Extract a JSON object from LLM response text, which may contain markdown code fences or other wrapping.

In [8]:
# Implement parse_json_response
def parse_json_response(text):

    # Handle clean JSON strings
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Handle JSON wrapped in ```json ... ``` markdown blocks
    if "```" in text:
        parts = text.split("```")
        # Odd-indexed parts are the contents between pairs of fences
        for block in parts[1::2]:
            block = block.strip().removeprefix("json").strip()
            try:
                return json.loads(block)
            except json.JSONDecodeError:
                continue
    
    # Find JSON within surrounding text (look for outermost { and })
    start_curly = text.find("{")
    end_curly = text.rfind("}")
    if start_curly >= 0 and end_curly > start_curly:
        try:
            return json.loads(text[start_curly:end_curly + 1])
        except json.JSONDecodeError:
            pass
    
    # If all parsing attempts fail, return None
    return None
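
The outermost-brace fallback above can fail when the response contains extra braces after the JSON, or more than one brace-delimited chunk. An alternative, sketched here under the assumption that the first decodable object is the one wanted, is to try `json.JSONDecoder.raw_decode` at each `{` in turn. `first_json_object` is a hypothetical name, not part of the assignment's required interface.

```python
import json


def first_json_object(text):
    """Return the first decodable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


print(first_json_object('Sure! {"diagnosis": "Anaphylaxis"} Let me know if {anything} is unclear.'))
```

In that example the trailing `{anything}` would make the outermost-brace slice invalid JSON, while scanning from each opening brace still recovers the object.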

### `validate_response`

Check that a parsed response dict contains all required keys.

In [9]:
# Implement validate_response
def validate_response(response):
    if not isinstance(response, dict):
        return False
    fields = {"diagnosis", "medications", "lab_values", "confidence"}
    return all(field in response for field in fields)
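
Key presence alone does not catch a response where, say, `medications` comes back as a single string. If stricter checking is wanted, a type-aware validator could look like the sketch below. The expected types are an assumption read off the schema in `build_prompt`, and `find_type_problems` is a hypothetical helper rather than something the tests require.

```python
REQUIRED_TYPES = {
    "diagnosis": str,
    "medications": list,
    "lab_values": dict,
    "confidence": (int, float),
}


def find_type_problems(response):
    """Return a list of human-readable problems; an empty list means none found."""
    if not isinstance(response, dict):
        return ["response is not a dict"]
    problems = []
    for key, expected in REQUIRED_TYPES.items():
        if key not in response:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int, so reject it explicitly
        elif isinstance(response[key], bool) or not isinstance(response[key], expected):
            problems.append(f"wrong type for {key}: {type(response[key]).__name__}")
    confidence = response.get("confidence")
    if isinstance(confidence, (int, float)) and not isinstance(confidence, bool):
        if not 0.0 <= confidence <= 1.0:
            problems.append("confidence outside [0.0, 1.0]")
    return problems


print(find_type_problems({"diagnosis": "x", "medications": "aspirin",
                          "lab_values": {}, "confidence": 1.5}))
```

Note that a prompt which asks for the string "None" in place of missing fields would trip the `medications` and `lab_values` checks, so the prompt and the validator need to agree on how missing data is represented.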

### `extract_entities`

Orchestrate the full extraction pipeline: get client, build prompt, call LLM, parse, validate, return.

In [10]:
# Implement extract_entities
def extract_entities(note, few_shot=False):
    client, provider = get_client()
    prompt = build_prompt(note, few_shot=few_shot)
    raw = call_llm(prompt, provider=provider, client=client)
    parsed = parse_json_response(raw)
    # Return None when the response cannot be parsed or is missing required keys
    if validate_response(parsed):
        return parsed
    return None
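
Because `extract_entities` returns `None` on failure, a caller can wrap it in a small retry loop. The sketch below is one possible shape, with hypothetical names (`extract_with_retry`, `run_once`); it takes any zero-argument callable so it can be tried without an API key. With `temperature=0` a retry will often return the same text, so this mainly helps with transient failures.

```python
import time


def extract_with_retry(run_once, max_attempts=3, delay_seconds=0.0):
    """Call run_once() until it returns something other than None, or give up."""
    for attempt in range(1, max_attempts + 1):
        result = run_once()
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


# Stand-in for a flaky extraction: fails twice, then succeeds.
calls = []

def fake_extraction():
    calls.append(1)
    return {"diagnosis": "Anaphylaxis"} if len(calls) == 3 else None

print(extract_with_retry(fake_extraction))
```

In the notebook this could be called as `extract_with_retry(lambda: extract_entities(note, few_shot=True))`.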

### Test extraction

In [11]:
results_p1 = []
for i, note in enumerate(notes_p1, 1):
    result = extract_entities(note, few_shot=True)
    print(f"--- Note {i} ---")
    if result:
        print(json.dumps(result, indent=2))
        results_p1.append(result)
    else:
        print("Extraction failed")
    print()

--- Note 1 ---
{
  "diagnosis": "Generalized Anxiety Disorder",
  "medications": [
    "Medication Unnamed"
  ],
  "lab_values": {
    "Thyroid Profile": "normal",
    "Ultrasound": "normal",
    "Electrocardiogram": "normal",
    "Blood Pressure": "normal"
  },
  "confidence": 0.85
}

--- Note 2 ---
{
  "diagnosis": "Anaphylaxis",
  "medications": [
    "Medication Unnamed"
  ],
  "lab_values": {
    "None": "None"
  },
  "confidence": 0.75
}

--- Note 3 ---
{
  "diagnosis": "Hypertrophic Lichen Planus",
  "medications": [
    "Medication REDACTED"
  ],
  "lab_values": {
    "Routine Hematological and Biochemical Investigations": "normal"
  },
  "confidence": 0.85
}

--- Note 4 ---
{
  "diagnosis": "Type II choledochal cyst/double GB/Type VI choledochal cyst",
  "medications": [
    "Medication Unnamed"
  ],
  "lab_values": {
    "Imaging studies": "hypoechoic lesion located around the porta hepatis"
  },
  "confidence": 0.85
}



### Save Part 1 results (do not modify)

In [12]:
with open("output/extraction_results.json", "w") as f:
    json.dump(results_p1, f, indent=2)

print(f"Saved {len(results_p1)} extraction results to output/extraction_results.json")

Saved 4 extraction results to output/extraction_results.json


---

## Part 2: Semantic Search

Build a semantic search system that finds clinical notes by meaning rather than keywords, using sentence embeddings and cosine similarity.

This part runs locally — no API key needed.
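
For reference, cosine similarity compares the direction of two vectors and ignores their length. Writing $\mathbf{q}$ for a query embedding and $\mathbf{d}$ for a note embedding (notation introduced here, not used elsewhere in the notebook):

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
$$

The value lies between $-1$ and $1$, and higher values mean the two texts point in more similar directions in embedding space.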

In [13]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2", device=get_device())
print(f"Model loaded on {get_device()}")

Model loaded on mps


In [14]:
# Use all 75 notes for the search corpus
notes_p2 = [n["note"] for n in asclepius]
print(f"{len(notes_p2)} notes in search corpus")

75 notes in search corpus


### `embed_notes`

Generate embeddings for a list of notes using the sentence transformer model.

In [15]:
# Implement embed_notes
def embed_notes(notes):
    return model.encode(notes)

### `find_similar`

Search notes by meaning using cosine similarity.

In [16]:
# Implement find_similar
def find_similar(query, notes, embeddings, top_k=2):
    query_embedding = model.encode([query])
    cosine_score = cosine_similarity(query_embedding, embeddings)[0]
    score_descending = np.argsort(cosine_score)[::-1]
    return [{"note": notes[i], "score": float(cosine_score[i])} for i in score_descending[:top_k]]
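
To see what the scoring and ranking steps do without loading a model, the same idea can be run on a few made-up 2-D vectors using NumPy alone. This is an illustrative sketch: the vectors are invented, `cosine_scores` is a hypothetical helper, and it assumes no vector has zero length.

```python
import numpy as np


def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q


docs = np.array([
    [1.0, 0.0],   # points the same way as the query
    [0.0, 1.0],   # nearly orthogonal to the query
    [1.0, 1.0],   # halfway between
])
query = np.array([2.0, 0.1])

scores = cosine_scores(query, docs)
ranking = np.argsort(scores)[::-1]   # indices from highest to lowest score
print(ranking.tolist(), np.round(scores, 3).tolist())
```

The query's length (about 2) does not affect the ranking, since both sides are normalized before the dot product.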

### Run the search pipeline

In [17]:
embeddings = embed_notes(notes_p2)
print(f"Embeddings: {embeddings.shape}")

queries = [
    "heart attack symptoms",
    "infectious disease with fever",
    "respiratory illness",
]

for q in queries:
    print(f"\nQuery: '{q}'")
    results = find_similar(q, notes_p2, embeddings, top_k=2)
    for i, r in enumerate(results, 1):
        print(f"  {i}. (score: {r['score']:.3f}) {r['note'][:80]}...")

Embeddings: (75, 384)

Query: 'heart attack symptoms'
  1. (score: 0.386) Discharge Summary

Patient: 49-year-old Hispanic woman
Admission Date: March 202...
  2. (score: 0.367) Discharge Summary:

Patient: 88-year-old female

Admission: Coronary Care Unit

...

Query: 'infectious disease with fever'
  1. (score: 0.412) Discharge Summary

Patient: 44-year-old male with end-stage renal disease caused...
  2. (score: 0.359) Hospital Course: 

The patient, a 9-month-old baby girl from Cameroon, was admit...

Query: 'respiratory illness'
  1. (score: 0.508) Discharge Summary:

Patient Name: Not Provided
Age: 71
Sex: Female

Admission Da...
  2. (score: 0.490) Discharge Summary

Patient: 52-year-old male internist with a positive SARS-CoV-...


### Save Part 2 results (do not modify)

In [18]:
search_results = find_similar("heart attack symptoms", notes_p2, embeddings, top_k=3)
with open("output/search_results.json", "w") as f:
    json.dump(search_results, f, indent=2)

print(f"Saved {len(search_results)} search results to output/search_results.json")

Saved 3 search results to output/search_results.json


---

## Validation

In [19]:
print("Run 'python -m pytest .github/tests/ -v' in your terminal to check your work.")

Run 'python -m pytest .github/tests/ -v' in your terminal to check your work.


---

## Part 3: Build a Tiny LLM *(optional, not graded)*

Train a character-level transformer to generate new text from a dataset of short strings. This mirrors the microGPT demo from lecture — same architecture, different data, using PyTorch's built-in modules instead of writing everything from scratch.

**Choose your dataset** (or use both!):

| Dataset | File | Items | Description |
|:---|:---|:---|:---|
| D&D Spells | `dnd_spells.lst` | 518 | Official spell names from Dungeons & Dragons |
| Ice Cream | `icecream_flavors.lst` | 450 | Ice cream flavor names from a [CMU student survey](https://www.cs.cmu.edu/~15110-f23/slides/all-icecream.csv) |

The code below uses D&D spells — swap the filename and variable names if you prefer ice cream.

In [20]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### Load and prepare data

In [21]:
# Choose your dataset: "dnd_spells.lst" or "icecream_flavors.lst"
datafile = "dnd_spells.lst"

with open(datafile) as f:
    lines = f.read().strip().split("\n")
items = [line.strip() for line in lines[1:] if line.strip()]  # skip header

text = "\n".join(items)
chars = sorted(set(text))
vocab_size = len(chars)

# Character <-> integer mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
print(f"{len(items)} items from {datafile}")
print(f"{len(chars)} unique characters, {len(data)} total tokens")
print(f"Vocabulary: {''.join(chars)}")

519 items from dnd_spells.lst
55 unique characters, 7348 total tokens
Vocabulary: 
 '-/ABCDEFGHIJKLMNOPQRSTUVWZabcdefghijklmnopqrstuvwxyz
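
The character mappings are easiest to understand on a tiny input. The sketch below rebuilds them for a two-line toy string, using `toy_` prefixed names so it does not overwrite the notebook's `stoi`, `itos`, `encode`, or `decode`.

```python
toy_text = "Fire Bolt\nFireball"
toy_chars = sorted(set(toy_text))

# Each distinct character gets an integer id, and the reverse map undoes it
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}


def toy_encode(s):
    return [toy_stoi[c] for c in s]


def toy_decode(ids):
    return "".join(toy_itos[i] for i in ids)


ids = toy_encode("Fire")
print(ids, "->", repr(toy_decode(ids)))
```

Encoding a character that never appeared in the training text raises `KeyError`, which is why the vocabulary is built from the whole dataset before any encoding happens.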


### Define the model

This is a minimal GPT: token embeddings + position embeddings → transformer decoder → output head. Read through the code, then run the cell.

In [22]:
block_size = 32   # context window (characters)
n_embd = 64       # embedding dimension
n_head = 4        # attention heads
n_layer = 2       # transformer blocks
dropout = 0.1


class CharGPT(nn.Module):
    def __init__(self):
        super().__init__()
        # Each character gets a learnable vector of size n_embd
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        # Each position (0..block_size-1) also gets a learnable vector
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.drop = nn.Dropout(dropout)

        # Stack of transformer decoder layers — this is where attention happens
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=n_embd,
            nhead=n_head,
            dim_feedforward=4 * n_embd,
            dropout=dropout,
            batch_first=True,
        )
        self.transformer = nn.TransformerDecoder(decoder_layer, num_layers=n_layer)

        self.ln = nn.LayerNorm(n_embd)
        # Project from embedding space back to vocabulary size (one logit per character)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)                                    # (B, T, n_embd)
        pos = self.pos_emb(torch.arange(T, device=idx.device))    # (T, n_embd)
        x = self.drop(tok + pos)                                   # (B, T, n_embd)

        # Causal mask: prevents each position from attending to future positions
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=idx.device)
        x = self.transformer(x, x, tgt_mask=mask, memory_mask=mask)
        x = self.ln(x)
        logits = self.head(x)  # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss


device = get_device()
char_model = CharGPT().to(device)
print(f"CharGPT: {sum(p.numel() for p in char_model.parameters()):,} parameters on {device}")

CharGPT: 142,775 parameters on mps
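
The causal mask is the piece that makes this a left-to-right language model. As far as I know, `generate_square_subsequent_mask` returns a float matrix with `-inf` above the diagonal and `0` elsewhere; the NumPy sketch below builds an equivalent matrix and shows its effect on attention weights. The helper names are hypothetical, and the all-zero scores are a simplification standing in for real query-key products.

```python
import numpy as np


def causal_mask(T):
    """T x T matrix: 0 on and below the diagonal, -inf above it."""
    return np.triu(np.full((T, T), -np.inf), k=1)


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


T = 4
raw_scores = np.zeros((T, T))            # pretend every pair scores equally
attn = softmax_rows(raw_scores + causal_mask(T))
print(np.round(attn, 2))
```

Row `i` of the result spreads its weight over positions `0..i` only, so position 0 attends just to itself and the last position attends to everything before it.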


### Train

The training loop samples random chunks from the data and teaches the model to predict the next character. Loss should drop below ~2.0 after 2000 steps.
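
"Predict the next character" means the target sequence is the input sequence shifted one position to the left. A plain-string sketch of that shift, using a hypothetical `make_example` helper and a made-up word:

```python
def make_example(sequence, start, block_size):
    """Return (input, target) where target is input shifted by one position."""
    x = sequence[start : start + block_size]
    y = sequence[start + 1 : start + block_size + 1]
    return x, y


x, y = make_example("Fireball", start=0, block_size=4)
for inp, tgt in zip(x, y):
    print(f"after {inp!r} predict {tgt!r}")
```

Here `x` is `"Fire"` and `y` is `"ireb"`; the training loop does the same thing with integer token ids and 32 random start positions per batch.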

In [23]:
optimizer = torch.optim.AdamW(char_model.parameters(), lr=3e-4)
batch_size = 32
steps = 2000

for step in range(steps):
    # Pick random starting positions
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix]).to(device)
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix]).to(device)

    logits, loss = char_model(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0 or step == steps - 1:
        print(f"step {step:4d} | loss {loss.item():.4f}")

step    0 | loss 4.2356
step  500 | loss 2.4813
step 1000 | loss 2.2988
step 1500 | loss 2.0728
step 1999 | loss 1.9269


### Generate

Sample from the trained model at different temperatures. Lower temperature = more conservative (common patterns), higher = more creative (weirder output).
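
Temperature works by dividing the logits before the softmax: values below 1 exaggerate the gaps between logits, and values above 1 flatten them. The NumPy sketch below illustrates this on three made-up logits; `softmax_with_temperature` is a hypothetical helper mirroring the `logits / temperature` step in `generate`.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

The most likely option takes a larger share of the probability as the temperature drops, which is why low-temperature samples stick to common patterns.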

In [24]:
@torch.no_grad()
def generate(model, max_new_tokens=500, temperature=0.8):
    model.eval()
    idx = torch.tensor([[stoi["\n"]]], device=device)
    for _ in range(max_new_tokens):
        context = idx[:, -block_size:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    model.train()
    return decode(idx[0].tolist())


for temp in [0.5, 0.8, 1.2]:
    print(f"\n--- Temperature {temp} ---")
    output = generate(char_model, temperature=temp)
    names = [s.strip() for s in output.split("\n") if s.strip()]
    for name in names[:10]:
        print(f"  {name}")


--- Temperature 0.5 ---
  Prison Spher
  Prind Bof Ward
  Wind Walls
  Prdatic Wof Shad
  Prmon Sphikon
  Pontinen Sprctore
  Plthtinon Smade
  Relentone Ward
  Stoncere Sthad
  Spimal Sphike

--- Temperature 0.8 ---
  Neate Bof Scity
  Seonecting
  Brane Spoum
  Blll Gueade
  Berouint Flam
  Guisp
  Grate Stond
  Troummhe Wall
  Show Mebstr
  Sponake Poof Wasth

--- Temperature 1.2 ---
  Rater
  Selemal
  Fordiath Woand
  Ammm N
  Hof Belllad
  Maghtuch OwecelnssPeant
  Coriol Ppred Mitonss Mars
  Pronjurcert Sthy
  Pumomond Aeal
  Wak's Spre


### Experiment (optional)

Try changing things and see what happens:

- Switch datasets: do ice cream flavors and spell names produce output of different quality?
- Increase `n_layer` to 4 or `n_embd` to 128 — does the model improve? How much slower is training?
- Train for 5000 steps instead of 2000
- What happens at very low temperature (0.2) vs very high (2.0)?