In [None]:
import os

from google import genai
from google.genai import types

from deep_statutes.config import STATUTES_DATA_DIR
from deep_statutes.states import ABBREV_TO_US_STATE

# About

This notebook asks Gemini to describe the hierarchical structure and formatting of a PDF file.

A good way to understand the structure of a state's statutes is to run this notebook on an `example.pdf` produced by running `uv run pdf-sample` on a directory of PDFs.


In [2]:
api_key = os.getenv('GEMINI_API_KEY')
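
`os.getenv` returns `None` when the variable is unset, so a missing key would otherwise surface only later, further from its cause. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of this repository.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the cell above could read `api_key = require_env('GEMINI_API_KEY')`.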

In [3]:
state = 'co'
input_path = STATUTES_DATA_DIR / state / 'example.pdf'

In [4]:
state_name = ABBREV_TO_US_STATE[state.upper()]

prompt = f"""
This is a PDF that contains a subset of the pages of statutes of {state_name}.

Could you please tell me the following about the organization of this document: 
(a) does it have headers or footers, 
(b) are there any tables or other non-text elements that interrupt the flow of text,
(c) what is the hierarchy of headers (e.g. "Title", "Chapter", "Article", "Part"), and 
(d) are there any formatting exceptions that would make finding headers using a simple heuristic (e.g. regex) difficult?
"""

In [5]:
client = genai.Client(api_key=api_key)
file = client.files.upload(file=input_path)

model_pro = "gemini-2.5-pro-exp-03-25"

generate_content_config = types.GenerateContentConfig(
    response_mime_type='text/plain',
)

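# Pass the config defined above; previously it was created but never used.
for chunk in client.models.generate_content_stream(
    model=model_pro,
    contents=[
        file,
        prompt,
    ],
    config=generate_content_config,
):
    # chunk.text can be None for chunks without text parts, so fall back to ''.
    print(chunk.text or '', end='')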

Okay, let's break down the organizational structure of the provided Colorado Revised Statutes pages based on the OCR text:

**(a) Headers or Footers:**

*   **Yes.**
*   **Header:** Typically includes "Colorado Revised Statutes 2024" at the top left or center of the page.
*   **Footer:** Consistently includes "Colorado Revised Statutes 2024", the page number (e.g., "Page X of Y"), and "Uncertified Printout" at the bottom of the page.
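
The header and footer strings reported above could be filtered out of extracted page text before searching for headings. This is a minimal sketch, assuming the text is already available as one string per page and that each header or footer string sits on its own line (real extractions may merge them onto one line). The name `strip_page_furniture` is hypothetical.

```python
import re

# Header/footer strings from the description above; assumed to occupy whole lines.
_FURNITURE = re.compile(
    r"^\s*(?:Colorado Revised Statutes 2024|Page \d+ of \d+|Uncertified Printout)\s*$"
)


def strip_page_furniture(page_text: str) -> str:
    """Drop lines that consist solely of a known header or footer string."""
    kept = [line for line in page_text.splitlines() if not _FURNITURE.match(line)]
    return "\n".join(kept)
```

Matching whole lines only keeps statutory text that merely mentions "Colorado Revised Statutes 2024" mid-sentence.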

**(b) Tables or Non-Text Elements:**

*   Based on the provided OCR text, there are **no tables, images, or other significant non-text elements** interrupting the flow of the main statutory text. The content consists primarily of text, including headers, section numbers, section titles, statutory language, editor's notes, source notes, cross-references, and law reviews.

**(c) Hierarchy of Headers:**

The document follows a clear hierarchical structure, generally in this order from highest to lowest level:

1.  **TITLE [Number]** (e.g., `TITLE 1`, `TITLE 