This is a project I did together with Claude Code as part of Google's MedGemma Impact Challenge. While I was intrigued by the open format of the challenge itself (basically: use MedGemma to build an AI application), I was mostly curious how far you can get with Claude Code as a research companion. Verdict: very far, and it's fun! It's much less coding and data scripting, and much more asking the right questions, inspecting results, and discussing next steps.
Every day, millions of people leave a clinical encounter, whether a hospital stay, an ER visit, or a specialist appointment, with documents they likely do not understand, since those documents are written for clinicians rather than patients. Studies show that the majority of patients are ill-equipped to understand the materials they receive and are often alarmed and confused by complex clinical terminology 1, 2.
To fill this gap, an increasing number of patients are turning to AI chatbots to help them understand their medical records. While these tools are powerful, many users doubt their accuracy 3, and consumer AI tools are not bound by healthcare privacy law, meaning sensitive medical data may be used to train models or otherwise exposed 4.
What if patients could get a plain-language summary of their own medical records, directly on their phone, without any of their own data ever leaving the device? This project attempts exactly that: a small on-device language model, trained via knowledge distillation from MedGemma, that translates clinical notes into patient-friendly yet clinically accurate summaries. Because inference runs entirely on-device, patient data is never transmitted — privacy is guaranteed by design.
We post-train an LLM to read clinical notes and summarize them in patient-friendly language while preserving clinical accuracy and validity, all running on-device without an internet connection. We achieve this in three steps:
- DPO training: We use MedGemma to teach Gemma-9B to write patient summaries in a patient-friendly yet accurate way
- Knowledge distillation: We then distill the Gemma-9B-DPO model into a Gemma-2B-DPO model
- Quantization: We compress the Gemma-2B-DPO for mobile deployment (~1.5GB)
Figure 1: How we trained the model. Large teacher models guide a small student to summarize clinical notes accurately. The student is then compressed to fit on a phone.
Our primary objective was to fit an LLM on a mobile device that can summarize clinical notes. While MedGemma's smallest variant (4B) cannot fit on a phone even with aggressive quantization, Gemma-2B can. We therefore chose the Gemma family for our student model, with MedGemma-27B as its medical teacher.
As our source of clinical notes we use the MTSamples dataset, a publicly available collection of ~5,000 anonymized medical transcription samples spanning 40 medical specialties, including Surgery, Cardiology, Orthopedics, and Neurology. Each sample is a real clinical note (operative report, discharge summary, consultation note, etc.) that has been de-identified for research use. We randomly selected 600 notes for training and held out 50 notes, sampled across specialties, for evaluation.
Training and evaluation data (DPO pairs, SFT distillation data, 50 evaluation samples) is available on HuggingFace: dejori/note-explain.
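The split described above can be sketched as follows (a minimal sketch; the `specialty` field name and round-robin sampling are illustrative, not necessarily the exact code we ran):

```python
import random
from collections import defaultdict

def split_notes(notes, n_train=600, n_eval=50, seed=42):
    """Randomly pick training notes, then hold out eval notes
    sampled across specialties."""
    rng = random.Random(seed)
    shuffled = notes[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    remaining = shuffled[n_train:]

    # Group remaining notes by specialty, then round-robin across
    # specialties until we have n_eval held-out notes.
    by_specialty = defaultdict(list)
    for note in remaining:
        by_specialty[note["specialty"]].append(note)
    eval_set = []
    while len(eval_set) < n_eval:
        progressed = False
        for pool in by_specialty.values():
            if pool and len(eval_set) < n_eval:
                eval_set.append(pool.pop())
                progressed = True
        if not progressed:  # ran out of notes before reaching n_eval
            break
    return train, eval_set
```

Shuffling before the split keeps the training set random, while the round-robin pass keeps the small evaluation set from being dominated by the most common specialties.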
We evaluate summaries on 7 criteria, each scored 1-5 by MedGemma 27B (google/medgemma-27b-text-it) as an automated judge:
- Accuracy: Factually correct representation of the original note
- Completeness: All critical medical information preserved in the summary
- Readability: Summary is written in plain, accessible language
- Structure: Summary is clearly organized with sections and bullet points
- Patient-centered: Summary addresses the patient directly ("you/your")
- Actionability: Summary includes clear next steps
- Overall: Holistic quality judgment of the summary (not averaged from above)
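One plausible way to turn per-note 1–5 judge scores into the percentages reported below is to average them across notes and rescale to 0–100. A sketch, assuming the judge replies with one `Criterion: N` line per criterion (an illustrative output convention, not necessarily the exact prompt format used):

```python
import re
from statistics import mean

CRITERIA = ["accuracy", "completeness", "readability", "structure",
            "patient_centered", "actionability", "overall"]

def parse_scores(judge_output):
    """Extract one 1-5 integer per criterion from the judge's reply,
    tolerating spaces, underscores, or hyphens in criterion names."""
    scores = {}
    for crit in CRITERIA:
        pattern = crit.replace("_", "[ _-]")
        m = re.search(rf"{pattern}\s*[:=]\s*([1-5])", judge_output, re.I)
        if m:
            scores[crit] = int(m.group(1))
    return scores

def to_percent(all_scores, criterion):
    """Mean 1-5 score across notes, rescaled to 0-100 (assumed mapping)."""
    return 100 * mean(s[criterion] for s in all_scores) / 5
```

Under this mapping, an overall score of 78.8% corresponds to a mean judge score of 3.94 out of 5.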
Raw evaluation results (JSON) are available in evaluation_results/.
To establish a baseline, we evaluated Gemma-9B and Gemma-2B out of the box on the 50 held-out notes.
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-9B | 78.8% | 91.2% | 75.2% | 99.6% | 97.6% | 69.6% | 71.2% |
| Gemma-2B | 64.8% | 70.4% | 65.2% | 92.0% | 91.2% | 44.4% | 55.6% |
Table 1: Baseline evaluation on 50 held-out clinical notes.
Both models score well on readability and structure, but fall short on patient-centered communication and actionability. The 2B model also shows weaker accuracy and completeness compared to the 9B model. Below are two examples where the 2B model struggles:
| Original | Gemma-9B | Gemma-2B | Scores | Issue |
|---|---|---|---|---|
| "53-year-old female... nonpalpable neoplasm..." | "You had a small lump found in your right breast..." | "The patient had been diagnosed with a suspicious area in her right breast..." | Patient-centered: 5 vs 1 | 2B uses third-person instead of addressing patient |
| "Left lateral malleolus fracture... Plate and screws..." | "broken bone on the outside of your left ankle... Plates and screws were used" | "broken bone on the inside of his ankle... two plates to secure the bone" | Accuracy: 5 vs 2 | 2B mistranslates lateral→inside, fabricates "two plates" |
Table 2: Baseline comparison — Gemma-9B vs Gemma-2B output quality.
We address these gaps by leveraging MedGemma-27B's medical expertise and Gemma-9B's stronger baseline to teach the 2B model.
We use MedGemma-27B to teach Gemma-2B what good patient communication looks like. For each of 600 clinical notes, Gemma-9B generates 5 candidate summaries with varied temperatures (0.5–0.9). We use the larger 9B model here because it produces higher-quality candidates, giving MedGemma-27B more meaningful variation to score. MedGemma-27B evaluates each candidate on the 7 criteria, and we pair high-accuracy outputs (accuracy ≥4) with low-accuracy ones (≤3) from the same note, creating ~1,400 preference pairs, like the one below:
| Clinical Note | ✓ Chosen (Accuracy ≥4) | ✗ Rejected (Accuracy ≤3) | Key Difference |
|---|---|---|---|
| "...T1 N3 M0 cancer of the nasopharynx, status post radiation therapy with 2 cycles of high dose cisplatin..." | "This patient had nasopharynx cancer and completed radiation and chemotherapy in 2006." | "You had surgery, radiation, and chemotherapy to treat your nasopharynx cancer." | Rejected fabricates surgery that never happened |
Table 3: Example DPO pair — accuracy-based selection filters out hallucinated content.
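The pairing step can be sketched as below (whether every high/low combination is kept or only a subset is a pipeline detail; this sketch cross-pairs all of them):

```python
from itertools import product

def build_dpo_pairs(note, candidates):
    """candidates: list of (summary_text, accuracy_score) tuples for one
    note, with accuracy scored 1-5 by the MedGemma judge. Pair every
    high-accuracy summary (>=4) with every low-accuracy one (<=3)."""
    chosen = [s for s, acc in candidates if acc >= 4]
    rejected = [s for s, acc in candidates if acc <= 3]
    return [{"prompt": note, "chosen": c, "rejected": r}
            for c, r in product(chosen, rejected)]
```

With 5 candidates per note, notes whose candidates all land on one side of the threshold contribute no pairs, which is consistent with ending up at roughly 2–3 pairs per note (~1,400 from 600 notes).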
Finally, we apply Direct Preference Optimization (DPO) to teach Gemma-2B to prefer outputs that MedGemma rates highly. We train for 3 epochs with LoRA (rank=16, alpha=32) and 4-bit quantization, using a learning rate of 3e-6 and DPO beta of 0.1:
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-2B + DPO | 73% | 82% | 70% | 96% | 98% | 61% | 60% |
| Gemma-2B baseline | 65% | 70% | 65% | 92% | 91% | 44% | 56% |
| Improvement | +8% | +12% | +5% | +4% | +7% | +17% | +4% |
Table 4: DPO training results on Gemma-2B (evaluated with greedy decoding).
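For reference, the per-pair DPO loss being optimized can be written out numerically; a minimal sketch with the beta = 0.1 used above:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio of the
    chosen response minus that of the rejected response)), where each
    log-ratio is taken relative to the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy has not yet moved from the reference, the margin is zero and the loss sits at log 2; the loss falls as the policy assigns relatively more probability to the chosen (high-accuracy) summary, which is exactly the behavior the judge-filtered pairs reward.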
DPO training significantly improves accuracy (+12%) by reducing fabrications, and patient-centered communication improves notably (+17%).
| Original | Baseline Gemma-2B | After DPO | Scores | Improvement |
|---|---|---|---|---|
| "Left lateral malleolus fracture... Plate and screws..." | "broken bone on the inside of his ankle... two plates" | "broken bone on the outside part of his ankle (lateral malleolus)... screws to hold the bones together" | Accuracy: 2→5, Completeness: 3→4, Readability: 4→5, Structure: 4→5, Patient-centered: 1→2, Actionability: 1→2, Overall: 2→4 | Correctly translates "lateral" to "outside", removes fabricated "two plates" |
Table 5: Before/after DPO — same clinical note from Table 2.
However, the patient-centered score remains weak at 61%: the model still often refers to "a young man" or "the patient" instead of addressing the reader directly with "you/your". This motivated us to try a different approach.
Instead of optimizing Gemma-2B directly, we first optimize Gemma-9B via DPO and then use it as a teacher, generating training examples for supervised fine-tuning of Gemma-2B.
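A minimal sketch of this distillation data step (the `teacher_generate` hook and the prompt wording are illustrative stand-ins for the actual 9B-DPO generation call):

```python
def build_sft_dataset(notes, teacher_generate):
    """Knowledge distillation as supervised fine-tuning: the DPO-tuned
    9B teacher writes a summary for each note, and the 2B student is
    fine-tuned to reproduce it verbatim."""
    prompt = ("Summarize the following clinical note for the patient "
              "in plain, accurate language:\n\n{note}")
    return [{"prompt": prompt.format(note=n),
             "completion": teacher_generate(n)}
            for n in notes]
```

Because the student imitates complete teacher outputs rather than ranking pairs, it absorbs the teacher's surface style (direct "you/your" address, section structure) very efficiently, which matches the results below.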
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-2B distilled | 71% | 75% | 66% | 96% | 93% | 74% | 63% |
| Gemma-2B + DPO | 73% | 82% | 70% | 96% | 98% | 61% | 60% |
| Gemma-2B baseline | 65% | 70% | 65% | 92% | 91% | 44% | 56% |
Table 6: Knowledge distillation results — 2B learns from 9B-DPO outputs (evaluated with greedy decoding).
Key insight: Distillation achieves 74% patient-centered score vs 61% from direct DPO — the 2B model learns the communication style better by imitating 9B-DPO outputs than from preference learning directly. Meanwhile, DPO achieves higher accuracy (82% vs 75%). Returning to our fracture example:
| Original | Baseline Gemma-2B | After Distillation | Scores | Improvement |
|---|---|---|---|---|
| "Left lateral malleolus fracture... Plate and screws..." | "The patient had a broken bone on the inside of his ankle... two plates to secure the bone" | "You had a broken bone (lateral malleolus) on your left ankle... metal plates and screws" | Accuracy: 2→4, Completeness: 3→3, Readability: 4→5, Structure: 4→4, Patient-centered: 1→5, Actionability: 1→1, Overall: 2→3 | Addresses patient directly with "you/your", correctly identifies "lateral" |
Table 7: Before/after distillation — same clinical note from Tables 2 and 5.
The distilled model achieves what DPO alone couldn't: a patient-addressing tone that scores 5/5 on patient-centered communication. Each approach has distinct strengths: DPO reduces fabrications for higher accuracy, while distillation yields better patient-centered communication. Combining both is a promising direction for future work.
The final 2B model is quantized to GGUF format using llama.cpp with Q4_K_M (4-bit quantization), bringing it down to ~1.5GB, small enough to fit in mobile RAM while generating >5 tokens/second (tested on an iPhone 16e).
All model weights are available on HuggingFace: dejori/note-explain
- gemma-2b-distilled/ - Final distilled model (LoRA adapter)
- gguf/gemma-2b-distilled-q4_k_m.gguf - Quantized distilled model for mobile (~1.6GB)
- gguf/gemma-2b-dpo-q4_k_m.gguf - Quantized DPO model for mobile (~1.6GB)
- gemma-2b-dpo/ - DPO-trained 2B model (LoRA adapter)
- gemma-9b-dpo/ - Teacher model for distillation (LoRA adapter)
Source code: ios-app/
Figure 2: The NoteExplain app running on iPhone 16e. Click to watch the demo video. Users scan or photograph medical documents and receive a plain-language summary in seconds — all without internet.
- Not yet production-ready — significant progress, but more work needed before real patient use
- Automated judge (MedGemma-27B) — no human evaluation yet
- English only, trained on MTSamples dataset
- 2B model occasionally omits minor details for readability
- Combine DPO and distillation to achieve both high accuracy and patient-centered communication
- Expand training and evaluation data to more clinical note types and specialties
- Add follow-up questions so patients can ask about specific parts of their summary
- Include confidence indicators that flag uncertain sections
- Link medical terms to trusted sources like MedlinePlus
This project demonstrates that MedGemma can effectively teach smaller models to communicate complex medical information in patient-friendly language. By combining DPO training with knowledge distillation, we achieved significant improvement in patient-centered communication while maintaining clinical accuracy — all in a model small enough to run entirely on a mobile device. We believe this approach points toward a future where patients can understand their own medical records without sacrificing privacy, helping bridge the gap between clinical documentation and patient education, engagement, and empowerment.

