# Regulatory Document Navigation

**Problem**: How can we surface and summarize the most relevant sections of lengthy, technical regulatory documents to support timely public engagement?

**Stakeholder Quote:**
> "Out of the 400+ pages, I'm interested in about 5. If I didn't know what I was looking for, I never would have found the sections that impact public-health data exchange."

This notebook demonstrates:
- Finding relevant dockets by topic
- Identifying key documents within dockets
- Using comments as relevance signals
- Tracking upcoming comment deadlines

In [1]:
import duckdb
import pandas as pd

R2_BASE_URL = "https://pub-5fc11ad134984edf8d9af452dd1849d6.r2.dev"

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
print("✓ Ready")

✓ Ready


## 1. Find Dockets by Topic

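Discovery is the first challenge: finding the dockets that relate to a topic. The search below is a case-insensitive substring match on docket titles, so it can return off-topic hits (for example, a pilot medical docket from the FAA under "health").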

In [2]:
# Search by topic keywords
topics = {
    "health": ["health", "medical", "hospital", "patient", "healthcare"],
    "environment": ["climate", "emissions", "pollution", "environmental", "air quality"],
    "finance": ["banking", "securities", "investment", "financial", "credit"],
    "technology": ["data", "privacy", "cybersecurity", "artificial intelligence", "broadband"],
    "labor": ["worker", "wage", "employment", "workplace", "safety"]
}

topic = "health"  # Change this
keywords = topics[topic]
keyword_clause = " OR ".join([f"LOWER(title) LIKE '%{kw}%'" for kw in keywords])

results = conn.execute(f"""
    SELECT docket_id, agency_code, title, docket_type, modify_date
    FROM read_parquet('{R2_BASE_URL}/dockets.parquet')
    WHERE ({keyword_clause})
    ORDER BY modify_date DESC
    LIMIT 20
""").fetchdf()

print(f"Recent {topic} dockets:")
results

Recent health dockets:


|  | docket_id | agency_code | title | docket_type | modify_date |
|---|---|---|---|---|---|
| 0 | HHS-ONC-2025-0005 | HHS | Health Data, Technology, and Interoperability:... | Rulemaking | 2026-01-14T15:17:40Z |
| 1 | CMS-2025-1822 | CMS | Medicare and Medicaid Programs: Hospital Condi... | Rulemaking | 2026-01-14T14:42:59Z |
| 2 | CMS-2025-1560 | CMS | Announcement of Application from a Hospital Re... | Rulemaking | 2026-01-14T14:41:37Z |
| 3 | CMS-2025-1823 | CMS | Medicaid Program: Prohibition on Federal Medic... | Rulemaking | 2026-01-14T12:28:51Z |
| 4 | FAA-2025-0562 | FAA | Pilot Medical Disclosure Decision Making Model... | Nonrulemaking | 2026-01-13T13:10:39Z |
| 5 | FDA-2026-P-0028 | FDA | Requests that the FDA conduct “[a] study to de... | Nonrulemaking | 2026-01-09T17:28:56Z |
| 6 | FDA-2025-P-2162 | FDA | Requests that absent new peer-reviewed studies... | Nonrulemaking | 2026-01-08T18:05:51Z |
| 7 | FDA-2025-P-0701 | FDA | Request that the FDA classify and regulate Nar... | Nonrulemaking | 2026-01-08T18:02:41Z |
| 8 | FDA-2025-N-4679 | FDA | Circulatory System Devices Panel of the Medica... | Nonrulemaking | 2026-01-08T17:14:29Z |
| 9 | FDA-2025-N-2195 | FDA | Agency Information Collection Activities; Prop... | Nonrulemaking | 2026-01-08T17:07:23Z |
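
The keyword clause above is built by string interpolation, and later cells retarget it with `keyword_clause.replace('title', 'd.title')`, which would also rewrite any keyword that happens to contain the word "title". A minimal sketch of an alternative, assuming a hypothetical helper named `build_keyword_clause` (not part of the notebook) that takes the column name as a parameter and escapes single quotes:

```python
def build_keyword_clause(keywords, column="title"):
    """Return an OR-joined, case-insensitive LIKE clause for `column`.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    parts = []
    for kw in keywords:
        # Double single quotes so a keyword such as "children's" cannot
        # terminate the SQL string literal early.
        escaped = kw.lower().replace("'", "''")
        parts.append(f"LOWER({column}) LIKE '%{escaped}%'")
    return "(" + " OR ".join(parts) + ")"


clause = build_keyword_clause(["health", "children's"], column="d.title")
print(clause)
```

The helper does not escape the LIKE wildcards `%` and `_`, so a keyword containing them would still match more broadly than intended.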


## 2. Find Open Comment Periods

Timing is critical: find the documents whose comment period is still open before the window for engagement closes.

In [3]:
# Documents with open comment periods
open_comments = conn.execute(f"""
    SELECT 
        d.document_id,
        d.agency_code,
        d.title,
        d.document_type,
        d.comment_end_date,
        dk.title as docket_title
    FROM read_parquet('{R2_BASE_URL}/documents.parquet') d
    LEFT JOIN read_parquet('{R2_BASE_URL}/dockets.parquet') dk
        ON d.docket_id = dk.docket_id
    WHERE d.comment_end_date IS NOT NULL
      AND TRY_CAST(d.comment_end_date AS DATE) > CURRENT_DATE
      AND ({keyword_clause.replace('title', 'd.title')})
    ORDER BY d.comment_end_date ASC
    LIMIT 15
""").fetchdf()

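from IPython.display import display

print(f"Open comment periods for {topic}:")
if len(open_comments) == 0:
    print("  No open comment periods found for this topic")
else:
    # A bare expression inside an if/else block is not auto-rendered by the
    # notebook, so display the DataFrame explicitly.
    display(open_comments)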

Open comment periods for health:
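
Once open comment periods are found, it helps to know how much time is left. The sketch below assumes deadlines are ISO-8601 strings ending in `Z`, as in the `comment_end_date` values shown elsewhere in this notebook. The `days_until` function and the fixed reference date are illustrative names and values, not part of the dataset or the notebook.

```python
from datetime import datetime, timezone


def days_until(deadline_iso, now=None):
    """Whole days (rounded down) from `now` until a UTC deadline.

    Negative values mean the comment period has already closed.
    """
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat accepts it on older Python versions too.
    deadline = datetime.fromisoformat(deadline_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (deadline - now).days


# Fixed reference date so the example is reproducible.
reference = datetime(2023, 6, 30, tzinfo=timezone.utc)
print(days_until("2023-07-06T03:59:59Z", now=reference))  # 6
```

A deadline such as `2023-07-06T03:59:59Z` is 11:59:59 PM US Eastern Daylight Time on July 5. Comparing it as a UTC calendar date, as the query above does, can therefore make a period look open one day longer than it is for commenters in US time zones.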


## 3. High-Engagement Dockets

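Dockets with many comments signal strong public interest. Because the topic filter is a substring match on titles, the ranking can include off-topic dockets such as "Conservation and Landscape Health".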

In [4]:
# Most commented dockets on this topic
high_engagement = conn.execute(f"""
    SELECT 
        d.docket_id,
        d.agency_code,
        d.title,
        COUNT(c.comment_id) as comment_count
    FROM read_parquet('{R2_BASE_URL}/dockets.parquet') d
    LEFT JOIN read_parquet('{R2_BASE_URL}/comments.parquet') c
        ON d.docket_id = c.docket_id
    WHERE ({keyword_clause.replace('title', 'd.title')})
    GROUP BY d.docket_id, d.agency_code, d.title
    ORDER BY comment_count DESC
    LIMIT 15
""").fetchdf()

print(f"High-engagement {topic} dockets:")
high_engagement

High-engagement health dockets:


|  | docket_id | agency_code | title | comment_count |
|---|---|---|---|---|
| 0 | BLM-2023-0001 | BLM | Conservation and Landscape Health | 8200872 |
| 1 | FDA-2019-N-5959 | FDA | Medication Guides: Patient Medication Informa... | 3590420 |
| 2 | FWS-HQ-NWRS-2022-0106 | FWS | National Wildlife Refuge System; Biological In... | 3032520 |
| 3 | HHS-OCR-2023-0006 | HHS | HIPAA Privacy Rule to Support Reproductive Hea... | 937654 |
| 4 | DEA-2023-0029 | DEA | Telemedicine Prescribing of Controlled Substan... | 544830 |
| 5 | FDA-2023-N-2177 | FDA | Medical Devices; Laboratory Developed Tests | 324096 |
| 6 | FDA-2023-N-3902 | FDA | Banned Devices; Proposal to Ban Electrical Sti... | 286803 |
| 7 | HHS-OS-2022-0012 | HHS | Nondiscrimination in Health Programs and Activ... | 221676 |
| 8 | HHS-OCR-2019-0007 | HHS | Nondiscrimination in Health and Health Educati... | 155966 |
| 9 | CMS-2018-0135 | CMS | Patient Protection and Affordable Care Act; Ex... | 149472 |
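
These counts deserve a sanity check. The document listing in Section 4 shows the same `document_id` several times, so the source files can contain repeated rows. If `dockets.parquet` or `comments.parquet` repeats rows as well (not verified here), the join multiplies matches and `COUNT(c.comment_id)` overstates the total. The sketch below reproduces the effect on toy data. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without network access, and the table contents are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dockets (docket_id TEXT, title TEXT);
    CREATE TABLE comments (comment_id TEXT, docket_id TEXT);
    -- The docket row appears twice, as a repeated snapshot row might.
    INSERT INTO dockets VALUES ('D-1', 'Example docket'), ('D-1', 'Example docket');
    INSERT INTO comments VALUES ('C-1', 'D-1'), ('C-2', 'D-1'), ('C-3', 'D-1');
""")

# Same join and grouping shape as the high-engagement query above.
naive, distinct = db.execute("""
    SELECT COUNT(c.comment_id), COUNT(DISTINCT c.comment_id)
    FROM dockets d
    LEFT JOIN comments c ON d.docket_id = c.docket_id
    GROUP BY d.docket_id, d.title
""").fetchone()

print(naive, distinct)  # 6 3
```

`COUNT(DISTINCT c.comment_id)` is also valid DuckDB SQL, so the same change could be tried in the query above to see whether the totals shrink.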


## 4. Documents Within a Docket

Navigate the documents to find what matters.

In [5]:
# Select a docket to explore
docket_id = high_engagement['docket_id'].iloc[0] if len(high_engagement) > 0 else "EPA-HQ-OAR-2021-0317"

docs = conn.execute(f"""
    SELECT 
        document_id,
        document_type,
        title,
        posted_date,
        comment_start_date,
        comment_end_date
    FROM read_parquet('{R2_BASE_URL}/documents.parquet')
    WHERE docket_id = '{docket_id}'
    ORDER BY posted_date DESC
""").fetchdf()

print(f"Documents in {docket_id}:")
docs

Documents in BLM-2023-0001:


|  | document_id | document_type | title | posted_date | comment_start_date | comment_end_date |
|---|---|---|---|---|---|---|
| 0 | BLM-2023-0001-154334 | Supporting & Related Material | Conservation Landscape Health Categorical Excl... | 2024-06-11T04:00:00Z |  |  |
| 1 | BLM-2023-0001-154333 | Supporting & Related Material | Final Economic Analysis Conservation_Lanscape ... | 2024-06-11T04:00:00Z |  |  |
| 2 | BLM-2023-0001-154335 | Rule | Conservation and Landscape Health | 2024-05-09T04:00:00Z | 2024-05-09T04:00:00Z |  |
| 3 | BLM-2023-0001-154332 | Rule | New Document created by Little, Chandra (BLM) | 2024-05-09T04:00:00Z | 2024-05-09T04:00:00Z |  |
| 4 | BLM-2023-0001-154332 | Rule | Conservation and Landscape Health -- Final Rule | 2024-05-09T04:00:00Z | 2024-05-09T04:00:00Z |  |
| 5 | BLM-2023-0001-154331 | Proposed Rule | Conservation and Landscape Health; Extension o... | 2023-06-20T04:00:00Z | 2023-06-20T04:00:00Z | 2023-07-06T03:59:59Z |
| 6 | BLM-2023-0001-0001 | Proposed Rule | Conservation and Landscape Health | 2023-04-03T04:00:00Z | 2023-04-03T04:00:00Z | 2023-07-06T03:59:59Z |
| 7 | BLM-2023-0001-0001 | Proposed Rule | Conservation and Landscape Health | 2023-04-03T04:00:00Z | 2023-04-03T04:00:00Z | 2023-06-21T03:59:59Z |
| 8 | BLM-2023-0001-0001 | Proposed Rule | Conservation and Landscape Health | 2023-04-03T04:00:00Z | 2023-04-03T04:00:00Z | 2023-06-21T03:59:59Z |
| 9 | BLM-2023-0001-0001 | Proposed Rule | Conservation and Landscape Health | 2023-04-03T04:00:00Z | 2023-04-03T04:00:00Z | 2023-06-21T03:59:59Z |
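
The listing repeats some `document_id` values, sometimes with different `comment_end_date` values (for example, `BLM-2023-0001-0001` appears with both a June and a July deadline). One possible way to collapse these is to keep a single row per document and prefer the latest deadline. This is a sketch under the assumption that the later date reflects an extension. The `latest_per_document` function is a hypothetical helper, and the rows are copied from the output above with only the relevant fields.

```python
def latest_per_document(rows):
    """Keep one row per document_id, preferring the latest comment_end_date."""
    best = {}
    for row in rows:
        key = row["document_id"]
        # ISO-8601 UTC strings sort chronologically; a missing date sorts first.
        end = row.get("comment_end_date") or ""
        if key not in best or end > (best[key].get("comment_end_date") or ""):
            best[key] = row
    return list(best.values())


rows = [
    {"document_id": "BLM-2023-0001-0001", "comment_end_date": "2023-06-21T03:59:59Z"},
    {"document_id": "BLM-2023-0001-0001", "comment_end_date": "2023-07-06T03:59:59Z"},
    {"document_id": "BLM-2023-0001-154335", "comment_end_date": None},
]
for row in latest_per_document(rows):
    print(row["document_id"], row["comment_end_date"])
```

Rows that share an ID but differ in other fields, such as the two titles recorded for `BLM-2023-0001-154332`, would need a separate rule for choosing which version to keep.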


## 5. Comment-Based Importance Signals

Comments can indicate which aspects are most contentious.

In [6]:
# Take the five longest comments as a rough sample of key concerns
comments = conn.execute(f"""
    SELECT title, LEFT(comment, 300) as excerpt
    FROM read_parquet('{R2_BASE_URL}/comments.parquet')
    WHERE docket_id = '{docket_id}'
      AND comment IS NOT NULL
      AND LENGTH(comment) > 200
    ORDER BY LENGTH(comment) DESC
    LIMIT 5
""").fetchdf()

print("Key concerns raised in comments:")
for _, row in comments.iterrows():
    print(f"\n--- {row['title']} ---")
    print(row['excerpt'][:250] + "...")

Key concerns raised in comments:

--- Comment on FR Doc # 2023-06310 ---
Comment Letter on BLMProposed Rule<br/>I am Jim Zornes, former Forest Supervisor, District Ranger and Regional Planning Director with the USFS.  Here are my comments:<br/><span style='padding-left: 30px'></span>1<span style='padding-left: 30px'></spa...

--- Comment on FR Doc # 2023-06310 ---
Please accept my comments and suggestions for &ldquo;additional regulatory text&rdquo; relative to the protection of ACECs and important resources for which they are designated.<br/><br/>Subpart 1610&mdash;Resource Management Planning<br/>Section 161...

--- Comment on FR Doc # 2023-06310 ---
Please accept my comments and suggestions for Section 6101.4 - Definitions. This is my first submittal, of many, due to the limitation of words.<br/><br/>Section 6101.4&mdash;Definitions<br/>&ldquo;The proposed rule would define the term &ldquo;casua...

--- Comment on FR Doc # 2023-06310 ---
Date: June 30, 2023<br/>To: The Bureau of Lan
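
The excerpts above still contain HTML tags and entities (`<br/>`, `&ldquo;`, `&mdash;`), and several cite specific sections of the rule. A minimal sketch of turning that into a relevance signal: clean each comment, then count how many comments mention each section number. The function names, the `Section N.N` pattern, and the sample strings are assumptions for illustration. The samples are modeled on the excerpts above, and `Section 9999.1` is a placeholder, not a real citation.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
# Assumed citation shape: the word "Section" followed by a dotted number.
SECTION_RE = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def clean_comment(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    # Decode entities after removing tags so an encoded "&lt;" is never
    # mistaken for the start of a tag.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def section_mentions(comments):
    """Count how many comments mention each section number at least once."""
    counts = Counter()
    for raw in comments:
        counts.update(set(SECTION_RE.findall(clean_comment(raw))))
    return counts


sample = [
    "Suggestions for Section 6101.4 - Definitions.<br/><br/>Section 6101.4&mdash;Definitions",
    "Comments on &ldquo;casual use&rdquo; in Section 6101.4 and Section 9999.1.",
]
print(section_mentions(sample).most_common())
```

Counting each section once per comment keeps a single long letter that repeats the same citation from dominating the ranking.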

## 6. Future: Document Summarization Pipeline

To truly help users navigate long documents, we'd need:
1. Document text extraction (PDF/HTML parsing)
2. Section identification
3. LLM-based summarization
4. Topic tagging
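
Steps 2 and 4 can be prototyped without any external services once plain text is available. The sketch below splits text at section headings and tags each section with the topics whose keywords it contains. It assumes headings look like `§ 1.1 Purpose.` at the start of a line, and the sample text and function names are invented for illustration. Text extraction (step 1) and LLM-based summarization (step 3) are not shown.

```python
import re

# Assumed heading shape: "§ <number> <heading text>" at the start of a line.
HEADING_RE = re.compile(r"^§\s*(\d+(?:\.\d+)*)\s+(.+)$", re.MULTILINE)


def split_sections(text):
    """Split plain text into (number, heading, body) tuples at each heading."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), m.group(2).strip(), text[m.end():end].strip()))
    return sections


def tag_sections(sections, topics):
    """Map each section number to the topics whose keywords appear in it."""
    tagged = {}
    for number, heading, body in sections:
        haystack = f"{heading} {body}".lower()
        tagged[number] = sorted(
            topic for topic, kws in topics.items() if any(k in haystack for k in kws)
        )
    return tagged


text = """§ 1.1 Purpose.
This part sets out reporting duties for hospital data exchange.
§ 1.2 Definitions.
Terms used in this part relate to emissions monitoring.
"""
topics = {"health": ["hospital", "patient"], "environment": ["emissions"]}
print(tag_sections(split_sections(text), topics))
```

Substring tagging has the same precision limits as the title search in Section 1, so it is best treated as a first-pass filter ahead of any summarization step.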

In [7]:
print("""
📄 Document Navigation Pipeline (Future Work)
=============================================

1. Extract Document Text
   - Parse PDFs using PyMuPDF or pdfplumber
   - Extract HTML content
   - Handle embedded images/tables

2. Structure Analysis
   - Identify sections/headings
   - Parse table of contents
   - Map regulatory structure

3. AI Summarization
   - Generate section summaries
   - Extract key requirements
   - Identify stakeholder impacts

4. Topic Indexing
   - Tag sections with topics
   - Enable topic-based navigation
   - Cross-reference similar sections across rules

5. User Interface
   - Searchable document viewer
   - "Skip to relevant sections" feature
   - Comment density overlay
""")


📄 Document Navigation Pipeline (Future Work)

1. Extract Document Text
   - Parse PDFs using PyMuPDF or pdfplumber
   - Extract HTML content
   - Handle embedded images/tables

2. Structure Analysis
   - Identify sections/headings
   - Parse table of contents
   - Map regulatory structure

3. AI Summarization
   - Generate section summaries
   - Extract key requirements
   - Identify stakeholder impacts

4. Topic Indexing
   - Tag sections with topics
   - Enable topic-based navigation
   - Cross-reference similar sections across rules

5. User Interface
   - Searchable document viewer
   - "Skip to relevant sections" feature
   - Comment density overlay

