# Gemini API: Schema‑constrained extraction from PDF (LLM‑native)

This notebook shows an **LLM‑native** pipeline using the **standalone Gemini API**:

1. Upload the PDF via the **Files API**
2. Call `generate_content` with **Structured Outputs** (JSON Schema constrained)
3. Validate the response against the same JSON Schema
4. Save canonical JSON per report (ready to index/search)

> **Prereq:** set `GEMINI_API_KEY` in your environment.


In [None]:
%pip -q install google-genai jsonschema

In [None]:
import json
import os

from jsonschema import validate
from google import genai
from google.genai import types


## 1) JSON Schema (canonical report format)

We define a canonical schema aligned to your template:
- report metadata (title, date, program)
- sections 1–7
- subsection(s) (e.g., `Incentive Long-Term Goal`)
- tables:
  - Program Basics (key/value pairs)
  - Financial Analysis Summary (matrix)

The schema lives in a JSON file on disk, so the same file can be reused for validation and other tooling.
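
The schema file itself is not reproduced in this notebook. As a rough illustration of the shape described above, here is a hypothetical sketch. Only `text`, `kind`, `columns`, `rows`, and `citations` are named in the extraction prompt later on; every other field name is an assumption, and the real file may well differ.

```python
import json

# Hypothetical table object: "kind" distinguishes key/value tables from matrices.
TABLE = {
    "type": "object",
    "properties": {
        "heading": {"type": "string"},
        "kind": {"type": "string", "enum": ["kv_pairs", "matrix"]},
        "columns": {"type": "array", "items": {"type": "string"}},
        "rows": {"type": "array", "items": {"type": "object"}},
        "citations": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["heading", "kind", "rows"],
}

# Hypothetical section object: numbered heading, free text, nested tables.
SECTION = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "heading": {"type": "string"},
        "text": {"type": "string"},
        "tables": {"type": "array", "items": TABLE},
        "citations": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["number", "heading"],
}

# Top-level report: metadata plus the list of sections.
SKETCH_SCHEMA = {
    "title": "ProgramReport",
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
        "program": {"type": "string"},
        "sections": {"type": "array", "items": SECTION},
    },
    "required": ["title", "sections"],
}

print(json.dumps(sorted(SKETCH_SCHEMA["properties"])))
```

Subsections could be modeled the same way as sections, nested one level down. They are left out here to keep the sketch short.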


In [None]:
SCHEMA_PATH = "/mnt/data/gemini_program_report_schema.json"
with open(SCHEMA_PATH, "r", encoding="utf-8") as f:
    REPORT_SCHEMA = json.load(f)

# Quick sanity check: show the schema title and its top-level properties
REPORT_SCHEMA["title"], list(REPORT_SCHEMA["properties"].keys())


## 2) Upload PDF to Gemini Files API

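Upload each PDF once, then reuse the returned file URI (`uploaded.uri`) across extraction calls. Uploaded files are temporary: the Files API deletes them after 48 hours, so re-upload once a stored URI has expired.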


In [None]:
API_KEY = os.environ.get("GEMINI_API_KEY")
if not API_KEY:
    raise RuntimeError("Set GEMINI_API_KEY in your environment.")

client = genai.Client(api_key=API_KEY)


In [None]:
PDF_PATH = "/mnt/data/Report - Sample.pdf"  # change to your file path
uploaded = client.files.upload(file=PDF_PATH)
uploaded.uri


## 3) Schema‑constrained extraction call

We ask Gemini to:
- read the PDF
- extract sections/subsections and tables
- **return JSON that conforms to our JSON Schema**

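Key settings:
- `response_mime_type="application/json"`: return JSON text rather than prose
- `response_schema=REPORT_SCHEMA`: constrain the output to the schema (Gemini supports only a subset of JSON Schema keywords here, so keep the schema simple)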
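
Even with these settings, the response text can be empty or cut off mid-JSON, for example when generation is blocked or the output token limit is reached. A minimal sketch of a defensive parser follows; `parse_json_response` is a hypothetical helper, not part of the SDK.

```python
import json
from typing import Any, Optional


def parse_json_response(text: Optional[str]) -> Any:
    """Parse model output as JSON, raising a descriptive error when it is unusable."""
    if text is None or not text.strip():
        raise ValueError(
            "Model returned no text; inspect the response's candidates and finish reason."
        )
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Truncated output typically fails near the end of the string.
        raise ValueError(
            f"Model output is not valid JSON at char {exc.pos}: {exc.msg}"
        ) from exc
```

It could replace the bare `json.loads(resp.text)` call, as in `data = parse_json_response(resp.text)`.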


In [None]:
MODEL = "gemini-2.5-flash"  # choose an appropriate Gemini model

system_instruction = """You are an information extraction engine.
Extract the report into the provided JSON Schema exactly.
Rules:
- Use section numbers exactly as printed (e.g., "1", "2", ..., "7").
- Preserve headings verbatim.
- For text sections: put the content in `text`.
- For tables:
  - Program Basics: kind="kv_pairs" and rows as objects with key/value pairs.
  - Financial Analysis Summary: kind="matrix", columns must match the table headings,
    rows must be objects keyed by those column names.
- Include `citations` with page numbers where each section/subsection/table was found.
- Output MUST be valid JSON and conform to schema; no extra keys.
"""

prompt = """Extract this report into the JSON schema."""

resp = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(
            role="user",
            parts=[
                # The google-genai Part constructors take keyword-only arguments
                types.Part.from_uri(file_uri=uploaded.uri, mime_type="application/pdf"),
                types.Part.from_text(text=prompt),
            ],
        )
    ],
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        response_mime_type="application/json",
        response_schema=REPORT_SCHEMA,
        temperature=0,
    ),
)

# Gemini returns JSON text in resp.text for structured outputs
data = json.loads(resp.text)
data.keys()


## 4) Validate + save canonical JSON
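
Schema validation confirms types and required keys, but a cross-field rule such as "matrix rows are keyed by the table's column names" can be awkward to express in the schema itself. Below is a minimal sketch of an extra check. It assumes each table is a dict with `columns` (a list of strings) and `rows` (a list of dicts). `check_matrix_table` is a hypothetical helper, and the sample table is made-up data.

```python
from typing import Any


def check_matrix_table(table: dict[str, Any]) -> list[str]:
    """Return a list of problems for a kind="matrix" table; an empty list means consistent."""
    problems: list[str] = []
    columns = table.get("columns") or []
    if len(set(columns)) != len(columns):
        problems.append("duplicate column names")
    expected = set(columns)
    for i, row in enumerate(table.get("rows") or []):
        keys = set(row)
        missing = sorted(expected - keys)
        extra = sorted(keys - expected)
        if missing:
            problems.append(f"row {i}: missing {missing}")
        if extra:
            problems.append(f"row {i}: unexpected {extra}")
    return problems


# Made-up example: the second row lacks the "Cost" column.
sample_table = {
    "kind": "matrix",
    "columns": ["Year", "Cost"],
    "rows": [{"Year": "2024", "Cost": "10"}, {"Year": "2025"}],
}
print(check_matrix_table(sample_table))
```

Running a check like this after `validate` catches tables that are schema-valid but internally inconsistent.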

In [None]:
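# Raises jsonschema.ValidationError if the extracted data violates the schema
validate(instance=data, schema=REPORT_SCHEMA)

OUT_JSON = "/mnt/data/extracted_report_gemini.json"
with open(OUT_JSON, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII characters readable in the output file
    json.dump(data, f, indent=2, ensure_ascii=False)

OUT_JSON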


## 5) Batch extraction for many reports

The helper below processes a single PDF and writes one JSON file for it; loop over a list of paths to process a batch.
Tip: keep a local cache of uploaded file URIs to avoid re-uploading the same file.
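
One way to build such a cache is sketched below. Entries are keyed by a hash of the file contents and expire shortly before the Files API's 48-hour retention window. `cached_upload`, the cache file layout, and the 47-hour margin are all assumptions for illustration. The upload itself is passed in as a function, so the sketch does not depend on the SDK.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

# Assumption: stay just under the Files API's 48-hour retention.
CACHE_TTL_SECONDS = 47 * 3600


def file_digest(path: str) -> str:
    """SHA-256 of the file contents, so renamed copies share one cache entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_upload(path: str, cache_path: str, upload_fn: Callable[[str], str]) -> str:
    """Return a cached URI for `path` if still fresh; otherwise upload and record it."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text(encoding="utf-8")) if cache_file.exists() else {}
    key = file_digest(path)
    entry = cache.get(key)
    now = time.time()
    if entry and now - entry["uploaded_at"] < CACHE_TTL_SECONDS:
        return entry["uri"]
    uri = upload_fn(path)
    cache[key] = {"uri": uri, "uploaded_at": now}
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return uri
```

With the client from this notebook, a call might look like `cached_upload(pdf_path, "upload_cache.json", lambda p: client.files.upload(file=p).uri)`. A cached URI is only usable with the API key that uploaded the file, and it becomes stale if the file is deleted manually.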


In [None]:
from pathlib import Path

def extract_one(pdf_path: str, out_dir: str, model: str = MODEL) -> str:
    """Upload one PDF, extract it into schema-conformant JSON, and return the output path."""
    pdf_path = str(pdf_path)
    up = client.files.upload(file=pdf_path)

    resp = client.models.generate_content(
        model=model,
        contents=[
            types.Content(
                role="user",
                parts=[
                    # The google-genai Part constructors take keyword-only arguments
                    types.Part.from_uri(file_uri=up.uri, mime_type="application/pdf"),
                    types.Part.from_text(text="Extract this report into the JSON schema."),
                ],
            )
        ],
        config=types.GenerateContentConfig(
            system_instruction=system_instruction,
            response_mime_type="application/json",
            response_schema=REPORT_SCHEMA,
            temperature=0,
        ),
    )
    obj = json.loads(resp.text)
    validate(instance=obj, schema=REPORT_SCHEMA)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = str(Path(out_dir) / (Path(pdf_path).stem + ".json"))
    with open(out_path, "w") as f:
        json.dump(obj, f, indent=2)
    return out_path

# Example:
# pdfs = ["/path/to/r1.pdf", "/path/to/r2.pdf"]
# for p in pdfs:
#     print(extract_one(p, out_dir="/path/to/out_json"))
