<a href="https://colab.research.google.com/github/basavarajmullur/Spring-Boot-JdbcTemplate/blob/master/notebooks/quick_start_with_hugging_face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

!pip install -q transformers accelerate bitsandbytes
!pip install -q fastapi uvicorn pyngrok pillow


## Setup

To complete this tutorial, you'll need to have a runtime with [sufficient resources](https://ai.google.dev/gemma/docs/core#sizes) to run the MedGemma model.

You can try out MedGemma 4B for free in Google Colab using a T4 GPU:

1. In the upper-right of the Colab window, select **‚ñæ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

**Note**: To run the demo with MedGemma 27B in Google Colab, you will need a runtime with an A100 GPU.

### Get access to MedGemma

Before you get started, make sure that you have access to MedGemma models on Hugging Face:

1. If you don't already have a Hugging Face account, you can create one for free by clicking [here](https://huggingface.co/join).
2. Head over to the [MedGemma model page](https://huggingface.co/google/medgemma-1.5-4b-it) and accept the usage conditions.

### Step 1: Authenticate with Hugging Face


In [1]:
from huggingface_hub import login
login()

### Step 2: Install dependencies

In [2]:
!pip install -q \
  fastapi \
  uvicorn \
  transformers \
  accelerate \
  bitsandbytes \
  pillow==10.4.0 \
  torch torchvision \





## Step 3: Load MedGemma

In [3]:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/medgemma-4b-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto"
)

model.eval()
print("‚úÖ MedGemma loaded")


The image processor of type `Gemma3ImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 
`torch_dtype` is deprecated! Use `dtype` instead!


Loading weights:   0%|          | 0/883 [00:00<?, ?it/s]

‚úÖ MedGemma loaded


## Step 4: Install cloudflared

In [4]:
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
!chmod +x cloudflared-linux-amd64

## Step 5: Streaming Server

In [5]:
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from PIL import Image
import base64, io, traceback, os, json, uuid, datetime
import torch
from threading import Thread

from transformers import (
    AutoProcessor,
    AutoModelForImageTextToText,
    TextIteratorStreamer
)

# =========================
# MODEL SETUP
# =========================

MODEL_ID = "google/medgemma-4b-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

model.generation_config.do_sample = False

# =========================
# FASTAPI
# =========================

app = FastAPI(title="ClinIQ ‚Äì MedGemma Streaming API")

# =========================
# AUDIT LOGGING
# =========================

AUDIT_DIR = "audit_logs"
os.makedirs(AUDIT_DIR, exist_ok=True)

def audit_log(prompt: str, output: str):
    """
    Best-effort audit logging.
    NEVER raises exceptions.
    """
    try:
        record = {
            "id": str(uuid.uuid4()),
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "model": MODEL_ID,
            "prompt": prompt,
            "output": output,
        }
        path = os.path.join(AUDIT_DIR, f"{record['id']}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(record, f, indent=2)
    except Exception as e:
        print("‚ö†Ô∏è Audit log failed:", str(e))

# =========================
# REQUEST MODEL
# =========================

class AnalyzeRequest(BaseModel):
    prompt: str
    image_base64: str
    max_tokens: int = 512

# =========================
# HELPERS
# =========================

def base64_to_pil(image_base64: str) -> Image.Image:
    if image_base64.startswith("data:"):
        image_base64 = image_base64.split(",", 1)[1]

    image_bytes = base64.b64decode(image_base64)
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    image = image.resize((512, 512))  # ‚ö° speed + stability
    return image

SYSTEM_PROMPT = (
    "You are a clinical decision support assistant for healthcare professionals.\n"
    "You may discuss differential considerations but must not claim diagnostic certainty.\n"
    "Respond ONLY with valid JSON."
)

# =========================
# STREAMING GENERATOR
# =========================

def stream_medgemma(image: Image.Image, user_prompt: str, max_tokens: int):

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_PROMPT}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image", "image": image},
            ],
        },
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    streamer = TextIteratorStreamer(
        processor,
        skip_prompt=True,
        skip_special_tokens=True,
        timeout=30.0,
    )

    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=0.0,
        top_p=0.9,
    )

    # Run generation in background thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Accumulate full output for audit
    full_output = ""

    for token in streamer:
        full_output += token
        yield token

    # üîí Audit AFTER streaming completes
    audit_log(user_prompt, full_output)

# =========================
# STREAMING ENDPOINT
# =========================

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    try:
        image = base64_to_pil(req.image_base64)

        return StreamingResponse(
            stream_medgemma(
                image=image,
                user_prompt=req.prompt,
                max_tokens=req.max_tokens,
            ),
            media_type="text/plain",
        )

    except Exception as e:
        print("‚ùå ANALYZE FAILED")
        traceback.print_exc()
        raise HTTPException(status_code=422, detail=str(e))


Loading weights:   0%|          | 0/883 [00:00<?, ?it/s]



## Step 6: Run FastAPI server



In [6]:
import logging
import uvicorn
from threading import Thread

# -----------------------
# Logging setup
# -----------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s"
)

logger = logging.getLogger("cliniq")

def start_api():
    logger.info("Starting FastAPI server on 127.0.0.1:8000")

    uvicorn.run(
        app,
        host="127.0.0.1",
        port=8000,
        log_level="info",
        access_log=True
    )

    logger.info("Uvicorn process exited")

Thread(target=start_api).start()


## Step 7 Expose via Cloudflare Tunnel

In [None]:
import subprocess
import re

process = subprocess.Popen(
    [
        "./cloudflared-linux-amd64",
        "tunnel",
        "--no-autoupdate",
        "--protocol", "http2",        # ‚ùå no QUIC
        "--url", "http://127.0.0.1:8000"
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

for line in process.stdout:
    print(line, end="")
    if "trycloudflare.com" in line:
        print("\nüåç COPY THIS URL ‚Üë‚Üë‚Üë\n")


INFO:     Started server process [53853]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)


2026-02-06T16:17:41Z INF Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee, are subject to the Cloudflare Online Services Terms of Use (https://www.cloudflare.com/website-terms/), and Cloudflare reserves the right to investigate your use of Tunnels for violations of such terms. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps
2026-02-06T16:17:41Z INF Requesting new quick Tunnel on trycloudflare.com...

üåç COPY THIS URL ‚Üë‚Üë‚Üë

2026-02-06T16:17:45Z INF +--------------------------------------------------------------------------------------------+
2026-02-06T16:17:45Z INF |  Your quick Tunnel has been created! Visit it at (it may take some time to be reachable):  |
2026-02-06T16:17:45Z INF |  https://blah-tent-jobs