This is a project I did together with Claude Code as part of Google's MedGemma Impact Challenge. While I was intrigued by the open format of the challenge itself (basically: use MedGemma to build an AI application), I was mostly curious how far you can get with Claude Code as a research companion. Verdict: very far, and it's fun! It's much less coding and data scripting, and much more asking the right questions, inspecting results, and discussing next steps.
Every day, millions of people leave a clinical encounter, whether a hospital stay, an ER visit, or a specialist appointment, with documents they likely do not understand, since those documents are written for clinicians rather than patients. Studies show that the majority of patients are ill-equipped to understand the materials they receive and are often alarmed and confused by complex clinical terminology 1, 2.
To fill this gap, an increasing number of patients are turning to AI chatbots to help them understand their medical records. While these tools are powerful, many users doubt their accuracy 3, and consumer AI tools are not bound by healthcare privacy law, meaning sensitive medical data may be used to train models or otherwise exposed 4.
What if patients could get a plain-language summary of their own medical records, directly on their phone, without any of their own data ever leaving the device? This project attempts exactly that: a small on-device language model, trained via knowledge distillation from MedGemma, that translates clinical notes into patient-friendly yet clinically accurate summaries. Because inference runs entirely on-device, patient data is never transmitted — privacy is guaranteed by design.
We post-train an LLM to read clinical notes and summarize them in patient-friendly language while preserving clinical accuracy and validity, all running on-device without an internet connection. We achieve this in three steps:
- DPO training: We use MedGemma to teach Gemma-9B to write patient summaries in a patient-friendly yet accurate way
- Knowledge distillation: We then distill the Gemma-9B-DPO model into a Gemma-2B-DPO model
- Quantization: We compress the Gemma-2B-DPO for mobile deployment (~1.5GB)
Figure 1: How we trained the model. Large teacher models guide a small student to summarize clinical notes accurately. The student is then compressed to fit on a phone.
Our primary objective was to fit an LLM on a mobile device that can summarize clinical notes. While MedGemma's smallest variant (4B) cannot fit on a phone even with aggressive quantization, Gemma-2B can. We therefore chose the Gemma family for our student model, with MedGemma-27B as its medical teacher.
As our source of clinical notes we use the MTSamples dataset, a publicly available collection of ~5,000 anonymized medical transcription samples spanning 40 medical specialties, including Surgery, Cardiology, Orthopedics, and Neurology. Each sample is a real clinical note (operative report, discharge summary, consultation note, etc.) that has been de-identified for research use. We randomly selected 600 notes for training and held out 50 notes, sampled across specialties, for evaluation.
Training and evaluation data (DPO pairs, SFT distillation data, 50 evaluation samples) is available on HuggingFace: dejori/note-explain.
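The split described above can be sketched as follows (a minimal sketch; the `specialty` field name and round-robin sampling are illustrative, not necessarily the exact code we ran):

```python
import random
from collections import defaultdict

def split_notes(notes, n_train=600, n_eval=50, seed=42):
    """Randomly pick training notes, then hold out eval notes
    sampled across specialties."""
    rng = random.Random(seed)
    shuffled = notes[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    remaining = shuffled[n_train:]

    # Group remaining notes by specialty, then round-robin across
    # specialties until we have n_eval held-out notes.
    by_specialty = defaultdict(list)
    for note in remaining:
        by_specialty[note["specialty"]].append(note)
    eval_set = []
    while len(eval_set) < n_eval:
        progressed = False
        for pool in by_specialty.values():
            if pool and len(eval_set) < n_eval:
                eval_set.append(pool.pop())
                progressed = True
        if not progressed:  # ran out of notes before reaching n_eval
            break
    return train, eval_set
```

Shuffling before the split keeps the training set random, while the round-robin pass keeps the small evaluation set from being dominated by the most common specialties.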
We evaluate summaries on 7 criteria, each scored 1-5 by MedGemma 27B (google/medgemma-27b-text-it) as an automated judge:
- Accuracy: Factually correct representation of the original note
- Completeness: All critical medical information preserved in the summary
- Readability: Summary is written in plain, accessible language
- Structure: Summary is clearly organized with sections and bullet points
- Patient-centered: Summary addresses the patient directly ("you/your")
- Actionability: Summary includes clear next steps
- Overall: Holistic quality judgment of the summary (not averaged from above)
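One plausible way to turn per-note 1–5 judge scores into the percentages reported below is to average them across notes and rescale to 0–100. A sketch, assuming the judge replies with one `Criterion: N` line per criterion (an illustrative output convention, not necessarily the exact prompt format used):

```python
import re
from statistics import mean

CRITERIA = ["accuracy", "completeness", "readability", "structure",
            "patient_centered", "actionability", "overall"]

def parse_scores(judge_output):
    """Extract one 1-5 integer per criterion from the judge's reply,
    tolerating spaces, underscores, or hyphens in criterion names."""
    scores = {}
    for crit in CRITERIA:
        pattern = crit.replace("_", "[ _-]")
        m = re.search(rf"{pattern}\s*[:=]\s*([1-5])", judge_output, re.I)
        if m:
            scores[crit] = int(m.group(1))
    return scores

def to_percent(all_scores, criterion):
    """Mean 1-5 score across notes, rescaled to 0-100 (assumed mapping)."""
    return 100 * mean(s[criterion] for s in all_scores) / 5
```

Under this mapping, an overall score of 78.8% corresponds to a mean judge score of 3.94 out of 5.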
Raw evaluation results (JSON) are available in evaluation_results/.
To establish a baseline, we evaluated Gemma-9B and Gemma-2B out of the box on the 50 held-out notes.
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-9B | 78.8% | 91.2% | 75.2% | 99.6% | 97.6% | 69.6% | 71.2% |
| Gemma-2B | 64.8% | 70.4% | 65.2% | 92.0% | 91.2% | 44.4% | 55.6% |
Table 1: Baseline evaluation on 50 held-out clinical notes.
Both models score well on readability and structure, but fall short on patient-centered communication and actionability. The 2B model also shows weaker accuracy and completeness compared to the 9B model. Below are two examples where the 2B model struggles:
| Original | Gemma-9B | Gemma-2B | Scores | Issue |
|---|---|---|---|---|
| "53-year-old female... nonpalpable neoplasm..." | "You had a small lump found in your right breast..." | "The patient had been diagnosed with a suspicious area in her right breast..." | Patient-centered: 5 vs 1 | 2B uses third-person instead of addressing patient |
| "Left lateral malleolus fracture... Plate and screws..." | "broken bone on the outside of your left ankle... Plates and screws were used" | "broken bone on the inside of his ankle... two plates to secure the bone" | Accuracy: 5 vs 2 | 2B mistranslates lateral→inside, fabricates "two plates" |
Table 2: Baseline comparison — Gemma-9B vs Gemma-2B output quality.
We address these gaps by leveraging MedGemma-27B's medical expertise and Gemma-9B's stronger baseline to teach the 2B model.
We use MedGemma-27B to teach Gemma-2B what good patient communication looks like. For each of 600 clinical notes, Gemma-9B generates 5 candidate summaries with varied temperatures (0.5–0.9). We use the larger 9B model here because it produces higher-quality candidates, giving MedGemma-27B more meaningful variation to score. MedGemma-27B evaluates each candidate on the 7 criteria, and we pair high-accuracy outputs (accuracy ≥4) with low-accuracy ones (≤3) from the same note, creating ~1,400 preference pairs, like the one below:
| Clinical Note | ✓ Chosen (Accuracy ≥4) | ✗ Rejected (Accuracy ≤3) | Key Difference |
|---|---|---|---|
| "...T1 N3 M0 cancer of the nasopharynx, status post radiation therapy with 2 cycles of high dose cisplatin..." | "This patient had nasopharynx cancer and completed radiation and chemotherapy in 2006." | "You had surgery, radiation, and chemotherapy to treat your nasopharynx cancer." | Rejected fabricates surgery that never happened |
Table 3: Example DPO pair — accuracy-based selection filters out hallucinated content.
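The pairing step can be sketched as below (whether every high/low combination is kept or only a subset is a pipeline detail; this sketch cross-pairs all of them):

```python
from itertools import product

def build_dpo_pairs(note, candidates):
    """candidates: list of (summary_text, accuracy_score) tuples for one
    note, with accuracy scored 1-5 by the MedGemma judge. Pair every
    high-accuracy summary (>=4) with every low-accuracy one (<=3)."""
    chosen = [s for s, acc in candidates if acc >= 4]
    rejected = [s for s, acc in candidates if acc <= 3]
    return [{"prompt": note, "chosen": c, "rejected": r}
            for c, r in product(chosen, rejected)]
```

With 5 candidates per note, notes whose candidates all land on one side of the threshold contribute no pairs, which is consistent with ending up at roughly 2–3 pairs per note (~1,400 from 600 notes).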
Finally, we apply Direct Preference Optimization (DPO) to teach Gemma-2B to prefer outputs that MedGemma rates highly. We train for 3 epochs with LoRA (rank=16, alpha=32) and 4-bit quantization, using a learning rate of 3e-6 and DPO beta of 0.1:
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-2B + DPO | 73% | 82% | 70% | 96% | 98% | 61% | 60% |
| Gemma-2B baseline | 65% | 70% | 65% | 92% | 91% | 44% | 56% |
| Improvement | +8% | +12% | +5% | +4% | +7% | +17% | +4% |
Table 4: DPO training results on Gemma-2B (evaluated with greedy decoding).
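For reference, the per-pair DPO loss being optimized can be written out numerically; a minimal sketch with the beta = 0.1 used above:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio of the
    chosen response minus that of the rejected response)), where each
    log-ratio is taken relative to the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy has not yet moved from the reference, the margin is zero and the loss sits at log 2; the loss falls as the policy assigns relatively more probability to the chosen (high-accuracy) summary, which is exactly the behavior the judge-filtered pairs reward.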
DPO training significantly improves accuracy (+12%) by reducing fabrications, and patient-centered communication improves notably (+17%).
| Original | Baseline Gemma-2B | After DPO | Scores | Improvement |
|---|---|---|---|---|
| "Left lateral malleolus fracture... Plate and screws..." | "broken bone on the inside of his ankle... two plates" | "broken bone on the outside part of his ankle (lateral malleolus)... screws to hold the bones together" | Accuracy: 2→5, Completeness: 3→4, Readability: 4→5, Structure: 4→5, Patient-centered: 1→2, Actionability: 1→2, Overall: 2→4 | Correctly translates "lateral" to "outside", removes fabricated "two plates" |
Table 5: Before/after DPO — same clinical note from Table 2.
However, the patient-centered score remains weak at 61%: the model still often refers to "a young man" or "the patient" instead of addressing the reader directly with "you/your". This motivated us to try a different approach.
Instead of optimizing Gemma-2B directly, we first optimize Gemma-9B via DPO and then use it as a teacher, generating training examples for supervised fine-tuning of Gemma-2B.
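A minimal sketch of this distillation data step (the `teacher_generate` hook and the prompt wording are illustrative stand-ins for the actual 9B-DPO generation call):

```python
def build_sft_dataset(notes, teacher_generate):
    """Knowledge distillation as supervised fine-tuning: the DPO-tuned
    9B teacher writes a summary for each note, and the 2B student is
    fine-tuned to reproduce it verbatim."""
    prompt = ("Summarize the following clinical note for the patient "
              "in plain, accurate language:\n\n{note}")
    return [{"prompt": prompt.format(note=n),
             "completion": teacher_generate(n)}
            for n in notes]
```

Because the student imitates complete teacher outputs rather than ranking pairs, it absorbs the teacher's surface style (direct "you/your" address, section structure) very efficiently, which matches the results below.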
| Model | Overall | Accuracy | Completeness | Readability | Structure | Patient-Centered | Actionability |
|---|---|---|---|---|---|---|---|
| Gemma-2B distilled | 71% | 75% | 66% | 96% | 93% | 74% | 63% |
| Gemma-2B + DPO | 73% | 82% | 70% | 96% | 98% | 61% | 60% |
| Gemma-2B baseline | 65% | 70% | 65% | 92% | 91% | 44% | 56% |
Table 6: Knowledge distillation results — 2B learns from 9B-DPO outputs (evaluated with greedy decoding).
Key insight: Distillation achieves 74% patient-centered score vs 61% from direct DPO — the 2B model learns the communication style better by imitating 9B-DPO outputs than from preference learning directly. Meanwhile, DPO achieves higher accuracy (82% vs 75%). Returning to our fracture example:
| Original | Baseline Gemma-2B | After Distillation | Scores | Improvement |
|---|---|---|---|---|
| "Left lateral malleolus fracture... Plate and screws..." | "The patient had a broken bone on the inside of his ankle... two plates to secure the bone" | "You had a broken bone (lateral malleolus) on your left ankle... metal plates and screws" | Accuracy: 2→4, Completeness: 3→3, Readability: 4→5, Structure: 4→4, Patient-centered: 1→5, Actionability: 1→1, Overall: 2→3 | Addresses patient directly with "you/your", correctly identifies "lateral" |
Table 7: Before/after distillation — same clinical note from Tables 2 and 5.
The distilled model achieves what DPO alone couldn't: a patient-addressing tone that scores 5/5 on patient-centered communication. Each approach has distinct strengths: DPO reduces fabrications for higher accuracy, while distillation yields better patient-centered communication. Combining both is a promising direction for future work.
The final 2B model is quantized to GGUF format using llama.cpp with Q4_K_M (4-bit quantization), bringing it down to ~1.5GB, small enough to fit in mobile RAM while generating >5 tokens/second (tested on an iPhone 16e).
All model weights are available on HuggingFace: dejori/note-explain
- gemma-2b-distilled/ - Final distilled model (LoRA adapter)
- gguf/gemma-2b-distilled-q4_k_m.gguf - Quantized distilled model for mobile (~1.6GB)
- gguf/gemma-2b-dpo-q4_k_m.gguf - Quantized DPO model for mobile (~1.6GB)
- gemma-2b-dpo/ - DPO-trained 2B model (LoRA adapter)
- gemma-9b-dpo/ - Teacher model for distillation (LoRA adapter)
Source code: ios-app/
Figure 2: The NoteExplain app running on iPhone 16e. Click to watch the demo video. Users scan or photograph medical documents and receive a plain-language summary in seconds — all without internet.
- Not yet production-ready — significant progress, but more work needed before real patient use
- Automated judge (MedGemma-27B) — no human evaluation yet
- English only, trained on MTSamples dataset
- 2B model occasionally omits minor details for readability
- Combine DPO and distillation to achieve both high accuracy and patient-centered communication
- Expand training and evaluation data to more clinical note types and specialties
- Add follow-up questions so patients can ask about specific parts of their summary
- Include confidence indicators that flag uncertain sections
- Link medical terms to trusted sources like MedlinePlus
This project demonstrates that MedGemma can effectively teach smaller models to communicate complex medical information in patient-friendly language. By combining DPO training with knowledge distillation, we achieved significant improvement in patient-centered communication while maintaining clinical accuracy — all in a model small enough to run entirely on a mobile device. We believe this approach points toward a future where patients can understand their own medical records without sacrificing privacy, helping bridge the gap between clinical documentation and patient education, engagement, and empowerment.

