# AI RFI call & comment reader

This notebook inventories the call/outcome documents you asked to download and
loads the normalized AI comment corpus that already lives in `data/app_data/ai_corpus.db`.
Use it as a starting point for further analysis or summarization work.

In [48]:
from __future__ import annotations

from pathlib import Path
from zipfile import ZipFile

import pandas as pd
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text
from tqdm.auto import tqdm

from reader_utils import (
    resolve_project_root,
    get_project_paths,
    connect_export_db,
    read_text_blob,
)

PROJECT_ROOT = resolve_project_root()
PATHS = get_project_paths(PROJECT_ROOT)
EXPORT_DB_PATH = PATHS["export_db_path"]

print(f"Project root: {PROJECT_ROOT}")
print(f"Export DB   : {EXPORT_DB_PATH}")


Project root: /Users/spangher/Projects/stanford-research/rfi-research/regulations-demo
Export DB   : /Users/spangher/Projects/stanford-research/rfi-research/regulations-demo/data/app_data/ai_corpus.db


## Call / response documents

Each row below records one of the call notices, reports, or combined response
packages you listed. Paths stay relative to the repository root so collaborators
on other machines can resolve them easily.
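
As a minimal sketch of that convention, the helper below joins a repository-relative path onto a root and rejects absolute paths, which would not resolve on another machine. The name `resolve_repo_path` and the example root are hypothetical and are not part of `reader_utils`.

```python
from __future__ import annotations

from pathlib import Path


def resolve_repo_path(project_root: Path, rel_path: str | Path) -> Path:
    """Join a repository-relative path onto the project root."""
    rel = Path(rel_path)
    if rel.is_absolute():
        # Absolute paths tie the index to one machine, so refuse them early.
        raise ValueError(f"expected a repository-relative path, got {rel}")
    return project_root / rel


# Hypothetical root, for illustration only.
example_root = Path("/tmp/regulations-demo")
example = resolve_repo_path(example_root, "data/app_data/ai_corpus.db")
print(example.as_posix())
```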

In [49]:
CALL_RESPONSE_DOCS = (
    pd.read_csv('ai_call_response_docs_list.csv', index_col=0)
    .assign(path=lambda df: df['path'].apply(Path))
    .to_dict(orient='records')
)

In [50]:
def build_call_response_index(records: list[dict]) -> pd.DataFrame:
    rows: list[dict] = []
    for record in records:
        rel_path = record["path"]
        abs_path = (PROJECT_ROOT / rel_path).resolve()
        exists = abs_path.exists()
        size_mb = abs_path.stat().st_size / 1_000_000 if exists else None
        rows.append({
            **{k: v for k, v in record.items() if k != "path"},
            "path": rel_path.as_posix(),
            "exists": exists,
            "size_mb": size_mb,
            "suffix": rel_path.suffix.lower(),
        })
    return pd.DataFrame(rows)

call_response_df = build_call_response_index(CALL_RESPONSE_DOCS)
call_response_df.sort_values(["jurisdiction", "collection_id", "document_role"]).reset_index(drop=True).head(3)

Unnamed: 0,jurisdiction,collection_id,document_role,title,url,path,exists,size_mb,suffix
0,California,PR-02-2023,call_overview,CPPA CCPA updates landing page,https://cppa.ca.gov/regulations/ccpa_updates.html,data/comments/cppa_admt/PR-02-2023/raw/ccpa_up...,True,0.050189,.html
1,California,PR-02-2023,fsor_and_uid,CPPA FSOR & UID (PDF),https://cppa.ca.gov/regulations/pdf/ccpa_updat...,data/comments/cppa_admt/PR-02-2023/raw/ccpa_up...,True,0.584846,.pdf
2,California,PR-02-2023,fsor_appendix_a,CPPA FSOR Appendix A (PDF),https://cppa.ca.gov/regulations/pdf/ccpa_updat...,data/comments/cppa_admt/PR-02-2023/raw/ccpa_up...,True,5.017119,.pdf


In [60]:
def preview_document(row: pd.Series, max_chars: int | None = 800) -> str:
    path = PROJECT_ROOT / Path(row["path"])
    if not path.exists():
        return "<missing file>"
    suffix = path.suffix.lower()
    if suffix in {".html", ".htm"}:
        html = path.read_text(encoding="utf-8", errors="ignore")
        soup = BeautifulSoup(html, "html.parser")
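        # Cap each string at twice the preview length; keep everything when max_chars is None.
        per_string_limit = None if max_chars is None else max_chars * 2
        text = " ".join(s[:per_string_limit] for s in soup.stripped_strings)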
        return text[:max_chars]
    if suffix == ".zip":
        with ZipFile(path) as zf:
            members = sorted(zf.namelist())[:10]
        return "ZIP contains: " + ", ".join(members)
    if suffix == ".pdf":
        size_mb = path.stat().st_size / 1_000_000
        if size_mb > 5:
            return f"<PDF {size_mb:.1f} MB – load pages manually as needed>"
        try:
            text = extract_text(str(path), maxpages=2)
            return text[:max_chars]
        except Exception as exc:  # pragma: no cover
            return f"<pdf parse error: {exc}>"
    return path.read_text(encoding="utf-8", errors="ignore")[:max_chars]

call_response_df["preview"] = call_response_df.apply(preview_document, max_chars=None, axis=1)

In [61]:
call_response_df[["jurisdiction", "collection_id", "document_role", "title", "preview"]]

Unnamed: 0,jurisdiction,collection_id,document_role,title,preview
0,US Federal,NTIA-2023-0005,call_overview,NTIA – AI Accountability policy report landing...,AI Accountability Policy Report | National Tel...
1,US Federal,NTIA-2023-0005,summary_report,NTIA – AI Accountability final report (PDF),NTIA \n\nArtificial Intelligence \nAccountabil...
2,US Federal,NTIA-2023-0009,summary_report,NTIA – Dual use / open model report (PDF),Dual-Use Foundation \n Models with Widely \n ...
3,US Federal,OMB-2023-0020,policy_guidance,OMB M-24-10 memo (PDF),EXECUTIVE OFFICE OF THE PRESIDENT \nO F F I ...
4,US Federal,OMB-2023-0020,supporting_material,OMB docket attachment 2 (PDF),"Guidance on Advancing Governance, Innovation..."
5,US Federal,AI-RMF-2ND-DRAFT-2022,call_overview,NIST AI Risk Management Framework hub,AI Risk Management Framework | NIST Skip to ma...
6,US Federal,AI-RMF-2ND-DRAFT-2022,response_summary,NIST AI RMF response summary (PDF),Summary Analysis of Responses to the NIST Arti...
7,US Federal,AI-RMF-2ND-DRAFT-2022,final_framework,NIST AI 100-1 (PDF),NIST AI 100-1\n\nArtificial Intelligence Risk ...
8,US Federal,90-FR-9088,call_notice,Federal Register notice – AI Action Plan RFI,Federal Register :: Request Access Request Acc...
9,US Federal,90-FR-9088,responses_pdf,NITRD combined responses bundle (PDF),<PDF 658.8 MB – load pages manually as needed>


In [62]:
import prompt_utils as p

In [63]:
IDENTIFICATION_PROMPT = """You are a helpful legal assistant. 
Please tell me if this text CONTAINS government responses to submitted comments. Do NOT identify texts 
that simply mention the comment-gathering process. The text must specifically AND directly respond to user comments.
Answer with just "yes" or "no":

<text>
{input_text}
</text>

Your response:
"""

In [64]:
all_responses = await p.process_batch(
    call_response_df['preview'], 
    prompt_template=IDENTIFICATION_PROMPT, 
    model='gpt-5-mini', 
    concurrency=50
)

  0%|          | 0/34 [00:00<?, ?it/s]

In [157]:
! open ../data/comments/nitrd_ai_rfi/90-FR-9088/raw/90-fr-9088-combined-responses.pdf

In [160]:
call_response_df.sort_values('size_mb', ascending=False)

Unnamed: 0,jurisdiction,collection_id,document_role,title,url,path,exists,size_mb,suffix,preview,contains_response
9,US Federal,90-FR-9088,responses_pdf,NITRD combined responses bundle (PDF),https://files.nitrd.gov/90-fr-9088/90-fr-9088-...,data/comments/nitrd_ai_rfi/90-FR-9088/raw/90-f...,True,658.839315,.pdf,<PDF 658.8 MB – load pages manually as needed>,no
10,US Federal,90-FR-9088,responses_zip,NITRD combined responses bundle (ZIP),https://files.nitrd.gov/90-fr-9088/90-fr-9088-...,data/comments/nitrd_ai_rfi/90-FR-9088/raw/90-f...,True,616.016643,.zip,"ZIP contains: 90-fr-9088-combined-responses/, ...",no
15,California,PR-02-2023,fsor_appendix_a,CPPA FSOR Appendix A (PDF),https://cppa.ca.gov/regulations/pdf/ccpa_updat...,data/comments/cppa_admt/PR-02-2023/raw/ccpa_up...,True,5.017119,.pdf,<PDF 5.0 MB – load pages manually as needed>,no
1,US Federal,NTIA-2023-0005,summary_report,NTIA – AI Accountability final report (PDF),https://www.ntia.gov/sites/default/files/publi...,data/comments/regulations_gov/NTIA-2023-0005/r...,True,3.047281,.pdf,NTIA \n\nArtificial Intelligence \nAccountabil...,no
21,European Union,AI-ACT-2021-ADOPTION-FEEDBACK,impact_assessment_part_1,AI Act impact assessment – part 1 (PDF),https://artificialintelligenceact.eu/wp-conten...,data/comments/eu_have_your_say_playwright/AI-A...,True,2.614941,.pdf,EN \n\n EN \n\n EUROPEAN COMMISSION Brusse...,no
30,United Kingdom,government_consultations_ai-regulation-a-pro-i...,response_pdf_print,Government response (print-ready PDF),https://assets.publishing.service.gov.uk/media...,data/comments/gov_uk/government_consultations_...,True,2.402245,.pdf,A pro-innovation approach \nto AI regulation\n...,no
7,US Federal,AI-RMF-2ND-DRAFT-2022,final_framework,NIST AI 100-1 (PDF),https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.1...,data/comments/nist_airmf/AI-RMF-2ND-DRAFT-2022...,True,1.946127,.pdf,NIST AI 100-1\n\nArtificial Intelligence Risk ...,no
2,US Federal,NTIA-2023-0009,summary_report,NTIA – Dual use / open model report (PDF),https://www.ntia.gov/sites/default/files/publi...,data/comments/regulations_gov/NTIA-2023-0009/r...,True,1.822164,.pdf,Dual-Use Foundation \n Models with Widely \n ...,no
29,United Kingdom,government_consultations_ai-regulation-a-pro-i...,response_pdf_web,Government response (web-ready PDF),https://assets.publishing.service.gov.uk/media...,data/comments/gov_uk/government_consultations_...,True,1.745743,.pdf,A pro-innovation approach \nto AI regulation\n...,no
16,California,PR-02-2023,fsor_appendix_b,CPPA FSOR Appendix B (PDF),https://www.cppa.ca.gov/regulations/pdf/ccpa_u...,data/comments/cppa_admt/PR-02-2023/raw/ccpa_up...,True,1.590247,.pdf,FSOR APPENDIX B – SUMMARIES AND RESPONSES TO 1...,yes


In [69]:
call_response_df['contains_response'] = all_responses
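
Later cells filter on `contains_response == 'yes'`, so a reply such as `Yes.` or `"yes"` would be dropped silently. The sketch below shows one way to normalize replies before assigning them. It assumes `process_batch` returns one raw string per input, which the cells above do not confirm, and `normalize_yes_no` is a hypothetical helper name.

```python
from __future__ import annotations


def normalize_yes_no(raw: object) -> str | None:
    """Map a free-form model reply to 'yes', 'no', or None when unclear."""
    if not isinstance(raw, str):
        return None
    # Drop surrounding whitespace, quotes, and trailing punctuation.
    cleaned = raw.strip().strip("\"'.!").lower()
    if cleaned in {"yes", "no"}:
        return cleaned
    return None


replies = ["yes", " No.\n", '"Yes"', "maybe", None]
normalized = [normalize_yes_no(r) for r in replies]
print(normalized)
```

Rows that normalize to `None` can then be reviewed by hand rather than being treated as a "no".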

In [80]:
PULL_CHUNKS_PROMPT = """You are a helpful legal assistant. I will show you a government document.
It contains responses to comments submitted by members of the public.
Please extract ALL the text that directly responds to submitted comments.
Do NOT identify text that simply mentions the comment-gathering process. 
The text must specifically AND directly respond to comments.

Respond in the following format:

[
{{'content_of_comment': <any mention the report makes to identify a specific comment>,
'response_to_comment': <any response made by the agency>,
}},
]

Include one list item per specific comment (or group of comments) if possible. 
Respond only with text extracted from the file, no other information. 

<file>
{input_text}
</file>

Your response:
"""

In [95]:
call_response_df.loc[lambda df: df['contains_response'] == 'yes']['preview'].str.split().str.len()

4       927
16      998
25    32223
28    44284
31     1422
32     7779
33     6092
Name: preview, dtype: int64

In [None]:
import json
import ast
import prompt_utils

In [147]:
comment_responses = call_response_df.loc[lambda df: df['contains_response'] == 'yes']

In [123]:
output_3 = await prompt_utils.process_batch(
    comment_responses['preview'], 
    prompt_template=PULL_CHUNKS_PROMPT,
    model='gpt-5',
    max_attempts=5
)

  0%|          | 0/7 [00:00<?, ?it/s]

In [131]:
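full_output = []
# NOTE: `output` and `output_2` are not defined in the cells shown here.
for output_iter in [output, output_2, output_3]:
    for idx, o in enumerate(output_iter):
        # Use `parsed`, not `p`, so the `prompt_utils as p` alias is not shadowed.
        try:
            parsed = json.loads(o)
        except (json.JSONDecodeError, TypeError):
            # The prompt asks for single-quoted dicts, which are Python literals, not JSON.
            parsed = ast.literal_eval(o)
        if not isinstance(parsed, list):
            parsed = [parsed]
        full_output.append({
            'doc_id': idx, 'output': parsed
        })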
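
The loop above still raises if a reply is neither valid JSON nor a Python literal. A minimal sketch of a parser that never raises follows. The `{"error": ...}` record it returns is an assumed shape, chosen only because the later aggregation cell skips rows whose first item contains an `'error'` key, and `parse_model_list` is a hypothetical name.

```python
from __future__ import annotations

import ast
import json


def parse_model_list(raw: str) -> list[dict]:
    """Parse a model reply as JSON, then as a Python literal; never raise."""
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(raw)
        except (ValueError, SyntaxError):
            # json.JSONDecodeError is a ValueError; literal_eval raises either.
            continue
        if isinstance(parsed, dict):
            return [parsed]
        if isinstance(parsed, list):
            return parsed
    # Assumed error shape, not something prompt_utils is known to produce.
    return [{"error": "unparseable model output", "raw": raw}]


as_json = parse_model_list('[{"content_of_comment": "x", "response_to_comment": "y"}]')
as_literal = parse_model_list("[{'content_of_comment': 'x', 'response_to_comment': 'y',},]")
as_error = parse_model_list("not a list")
```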

In [144]:
full_output_df = pd.DataFrame(full_output)
full_output_df.to_json('ai_extracted_comment_cache.jsonl', lines=True, orient='records')

In [None]:
from more_itertools import flatten

agg_responses = (
    full_output_df
     .loc[lambda df: ~df.apply(lambda x: 'error' in x['output'][0], axis=1)]
     .groupby('doc_id')['output']
     .aggregate(list).apply(lambda x: list(flatten(x)))
     .to_frame('responses_to_comments')
)

## Comment corpus loader

The cells below pull every normalized document for the AI RFIs/RFCs already in
`ai_corpus.db`. Update `TARGET_COLLECTIONS` if you add more dockets later.
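
The queries below build one `?` placeholder per collection ID, so the list can grow without editing the SQL. A small, self-contained sketch of the same pattern against an in-memory SQLite table is shown here. The table contents are invented for illustration, and it is an assumption that `connect_export_db` returns a `sqlite3`-style connection.

```python
import sqlite3

# Toy table standing in for `documents`; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT, collection_id TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("a", "NTIA-2023-0005"), ("b", "NTIA-2023-0005"),
     ("c", "OMB-2023-0020"), ("d", "OTHER")],
)

wanted = ["NTIA-2023-0005", "OMB-2023-0020"]
# One "?" per value: only the placeholder count is interpolated, never the values.
marks = ", ".join("?" for _ in wanted)
sql = (
    "SELECT collection_id, COUNT(*) FROM documents "
    f"WHERE collection_id IN ({marks}) "
    "GROUP BY collection_id ORDER BY collection_id"
)
counts = conn.execute(sql, wanted).fetchall()
conn.close()
print(counts)
```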

In [149]:
TARGET_COLLECTIONS = [
    {"source": "regulations_gov", "collection_id": "NTIA-2023-0005", "label": "NTIA – AI Accountability RFC"},
    {"source": "regulations_gov", "collection_id": "NTIA-2023-0009", "label": "NTIA – Dual Use/Open Model RFC"},
    {"source": "regulations_gov", "collection_id": "OMB-2023-0020", "label": "OMB – M-24-10 draft guidance"},
    {"source": "nist_airmf", "collection_id": "AI-RMF-2ND-DRAFT-2022", "label": "NIST – AI RMF second draft"},
    {"source": "nitrd_ai_rfi", "collection_id": "90-FR-9088", "label": "OSTP/NITRD – AI Action Plan RFI"},
    {"source": "cppa_admt", "collection_id": "PR-02-2023", "label": "CPPA – PR 02-2023"},
    {"source": "eu_have_your_say_playwright", "collection_id": "WHITEPAPER-AI-2020", "label": "EU – AI White Paper"},
    {"source": "eu_have_your_say_playwright", "collection_id": "AI-ACT-2021-ADOPTION-FEEDBACK", "label": "EU – AI Act feedback"},
    {"source": "gov_uk", "collection_id": "ai-white-paper-2023", "label": "UK – AI regulation white paper"},
    {"source": "gov_uk", "collection_id": "government_consultations_ai-regulation-a-pro-innovation-approach-policy-proposals", "label": "UK – AI regulation consultation"},
    {"source": "gov_uk", "collection_id": "government_consultations_copyright-and-artificial-intelligence", "label": "UK IPO – AI & IP consultation"},
]

collection_lookup = pd.DataFrame(TARGET_COLLECTIONS)
collection_ids = collection_lookup["collection_id"].tolist()
placeholders = ", ".join(["?"] * len(collection_ids))

with connect_export_db(EXPORT_DB_PATH) as conn:
    summary_sql = f"""
        SELECT source, collection_id, COUNT(*) AS documents,
               SUM(CASE WHEN text_path IS NOT NULL THEN 1 ELSE 0 END) AS docs_with_text,
               MIN(submitted_at) AS first_submitted_at,
               MAX(submitted_at) AS last_submitted_at
        FROM documents
        WHERE collection_id IN ({placeholders})
        GROUP BY source, collection_id
        ORDER BY source, collection_id
    """
    summary_df = pd.read_sql(summary_sql, conn, params=collection_ids)

summary_df.merge(collection_lookup, on=["source", "collection_id"]).loc[:, [
    "source", "collection_id", "label", "documents", "docs_with_text", "first_submitted_at", "last_submitted_at"
]]

Unnamed: 0,source,collection_id,label,documents,docs_with_text,first_submitted_at,last_submitted_at
0,cppa_admt,PR-02-2023,CPPA – PR 02-2023,592,592,"April 1, 2028","November 8, 2024"
1,nist_airmf,AI-RMF-2ND-DRAFT-2022,NIST – AI RMF second draft,85,85,,
2,nitrd_ai_rfi,90-FR-9088,OSTP/NITRD – AI Action Plan RFI,2,2,2025-02-06,2025-02-06
3,regulations_gov,NTIA-2023-0005,NTIA – AI Accountability RFC,1452,1452,2023-04-13T04:00:00Z,2023-06-27T04:00:00Z
4,regulations_gov,NTIA-2023-0009,NTIA – Dual Use/Open Model RFC,335,335,2024-02-26T05:00:00Z,2024-06-20T04:00:00Z
5,regulations_gov,OMB-2023-0020,OMB – M-24-10 draft guidance,197,197,2023-11-03T04:00:00Z,2023-12-07T05:00:00Z
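
The `submitted_at` column mixes formats (`November 8, 2024`, `2025-02-06`, `2023-04-13T04:00:00Z`), and the CPPA row above reports a first date later than its last date, which is consistent with the values being compared as text. A hedged sketch of parsing the strings into timestamps before taking a minimum or maximum follows. `parse_submitted_at` is a hypothetical helper, and treating every value as UTC is an assumption.

```python
import pandas as pd


def parse_submitted_at(values: pd.Series) -> pd.Series:
    """Parse mixed-format date strings one by one; unparseable entries become NaT."""
    parsed = [
        pd.to_datetime(v, errors="coerce", utc=True) if isinstance(v, str) else pd.NaT
        for v in values
    ]
    return pd.Series(pd.to_datetime(parsed, utc=True), index=values.index)


raw = pd.Series(["November 8, 2024", "2023-04-13T04:00:00Z", "2025-02-06", None])
dates = parse_submitted_at(raw)
earliest, latest = dates.min(), dates.max()
print(earliest, latest)
```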


In [150]:
with connect_export_db(EXPORT_DB_PATH) as conn:
    detail_sql = f"""
        SELECT doc_id, source, collection_id, title, submitter_name, submitted_at, text_path
        FROM documents
        WHERE collection_id IN ({placeholders})
        ORDER BY collection_id, submitted_at
    """
    comments_df = pd.read_sql(detail_sql, conn, params=collection_ids)

print(f"Loaded {len(comments_df):,} normalized documents")

Loaded 2,663 normalized documents


In [151]:
def hydrate_text_blobs(text_paths: pd.Series) -> list[str | None]:
    hydrated: list[str | None] = []
    for raw in tqdm(text_paths, desc="Loading text blobs"):
        if isinstance(raw, str) and raw.strip():
            hydrated.append(read_text_blob(PROJECT_ROOT, raw))
        else:
            hydrated.append(None)
    return hydrated

comments_df["text"] = hydrate_text_blobs(comments_df["text_path"])
comments_df.head()

Loading text blobs:   0%|          | 0/2663 [00:00<?, ?it/s]

Unnamed: 0,doc_id,source,collection_id,title,submitter_name,submitted_at,text_path,text
0,CAIDP-AI-RFI-2025.pdf,nitrd_ai_rfi,90-FR-9088,CAIDP-AI-RFI-2025.pdf,,,data/comments/blobs/a6/a61897bca3e8eb6ba6d4cb1...,Comments to the \n\nUnited States Office of Sc...
1,2025-02305,nitrd_ai_rfi,90-FR-9088,Request for Information on the Development of ...,OSTP & NITRD,2025-02-06,data/comments/blobs/9b/9b6e5baccf22f4e40bfa5eb...,"9088 \n\nFederal Register / Vol. 90, No. 2..."
2,AI_RMF_2nd_draft.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,second draft of the AI Risk Management Framework,second draft of the AI Risk Management Framework,,data/comments/blobs/0b/0b134d057ab06d3a3a014e9...,AI Risk Management Framework: Second Draft \n\...
3,Accenture.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,Accenture,Accenture,,data/comments/blobs/43/4370afc6e8a6099d565f382...,"August 29, 2022 \n\nDr. Laurie E. Locascio \..."
4,Adelin_20Travers.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,Adelin Travers,Adelin Travers,,data/comments/blobs/35/350725d68762a101c3a74ae...,1/19 \n\nComments on the 2nd draft of th...


In [37]:
comment_summary = (
    comments_df
    .assign(has_text=comments_df["text"].notna())
    .groupby(["source", "collection_id"], as_index=False)
    .agg(documents=("doc_id", "count"), docs_with_text=("has_text", "sum"))
    .merge(collection_lookup, on=["source", "collection_id"])
)

comment_summary[["source", "collection_id", "label", "documents", "docs_with_text"]]

Unnamed: 0,source,collection_id,label,documents,docs_with_text
0,cppa_admt,PR-02-2023,CPPA – PR 02-2023,592,592
1,nist_airmf,AI-RMF-2ND-DRAFT-2022,NIST – AI RMF second draft,85,85
2,nitrd_ai_rfi,90-FR-9088,OSTP/NITRD – AI Action Plan RFI,2,2
3,regulations_gov,NTIA-2023-0005,NTIA – AI Accountability RFC,1452,1452
4,regulations_gov,NTIA-2023-0009,NTIA – Dual Use/Open Model RFC,335,335
5,regulations_gov,OMB-2023-0020,OMB – M-24-10 draft guidance,197,197


In [9]:
sample_comments = (
    comments_df
    .dropna(subset=["text"])
    .groupby("collection_id")
    .head(1)
    .loc[:, ["collection_id", "doc_id", "submitter_name", "title", "text"]]
    .assign(text_preview=lambda df: df["text"].str.slice(0, 800))
)

sample_comments[["collection_id", "doc_id", "submitter_name", "text_preview"]]


Unnamed: 0,collection_id,doc_id,submitter_name,text_preview
0,90-FR-9088,CAIDP-AI-RFI-2025.pdf,,Comments to the \n\nUnited States Office of Sc...
2,AI-RMF-2ND-DRAFT-2022,AI_RMF_2nd_draft.pdf,second draft of the AI Risk Management Framework,AI Risk Management Framework: Second Draft \n\...
87,NTIA-2023-0005,FR-2023-07776,NTIA,"Federal Register / Vol. 88, No. 71 / Thurs..."
1539,NTIA-2023-0009,FR-2024-03763,NTIA,"Federal Register / Vol. 89, No. 38 / Monda..."
1874,OMB-2023-0020,FR-2023-24269,OMB,"Federal Register / Vol. 88, No. 212 / Frid..."
2071,PR-02-2023,web_cert.pdf,,Website Accessibility Certification \n\nCalifo...


# Match comments to the responses that reference them

In [178]:
print(comments_df
 .loc[lambda df: df['collection_id'].isin(comment_responses['collection_id'])]
 ['text'].iloc[5]
)

Shalanda Young 
Director Office of Management and Budget  
72517th St NW, Washington, DC 20503  

November 4, 2023  

Re:  AI  Memo  Request  for  Comment:  Comments  of  Merve  Hickok  to  the  Office  of 
Management and Budget (OMB) regarding the Draft Memorandum published in Federal 
Register 88 FR 75625 

Dear Director Young,  

The below is in response to the request for public comments on the draft memorandum 
titled  Advancing  Governance,  Innovation,  and  Risk  Management  for  Agency  Use  of  Artificial 
Intelligence (AI)1. 

As the Founder of AIethicist.org, President  and Research Director at  Center for AI  and 
Digital  Policy  (CAIDP),  and  Data  Ethics  Lecturer  at  University  of  Michigan,  School  of 
Information.  I  welcome  the  opportunity  to  comment  on  The  Notice  seeks  comments  on  the 
Proposed Memorandum for the Heads of Executive Departments and Agencies  (hereinafter the 
“Guidance”).2 My work is focused on AI policy and governance globally. In m

In [179]:
comments_df

Unnamed: 0,doc_id,source,collection_id,title,submitter_name,submitted_at,text_path,text
0,CAIDP-AI-RFI-2025.pdf,nitrd_ai_rfi,90-FR-9088,CAIDP-AI-RFI-2025.pdf,,,data/comments/blobs/a6/a61897bca3e8eb6ba6d4cb1...,Comments to the \n\nUnited States Office of Sc...
1,2025-02305,nitrd_ai_rfi,90-FR-9088,Request for Information on the Development of ...,OSTP & NITRD,2025-02-06,data/comments/blobs/9b/9b6e5baccf22f4e40bfa5eb...,"9088 \n\nFederal Register / Vol. 90, No. 2..."
2,AI_RMF_2nd_draft.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,second draft of the AI Risk Management Framework,second draft of the AI Risk Management Framework,,data/comments/blobs/0b/0b134d057ab06d3a3a014e9...,AI Risk Management Framework: Second Draft \n\...
3,Accenture.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,Accenture,Accenture,,data/comments/blobs/43/4370afc6e8a6099d565f382...,"August 29, 2022 \n\nDr. Laurie E. Locascio \..."
4,Adelin_20Travers.pdf,nist_airmf,AI-RMF-2ND-DRAFT-2022,Adelin Travers,Adelin Travers,,data/comments/blobs/35/350725d68762a101c3a74ae...,1/19 \n\nComments on the 2nd draft of th...
...,...,...,...,...,...,...,...,...
2658,part10_all_comments_combined_redacted_oral_not...,cppa_admt,PR-02-2023,SIRM@SIRM,SIRM@SIRM,"November 22, 2024",/Users/spangher/Projects/stanford-research/rfi...,"SIRM@SIRM\nSuse Califoria\nFebruary 19,2025,\n..."
2659,ccpa_updates_all_written_comments_p2.pdf#L006,cppa_admt,PR-02-2023,APPENDIX,APPENDIX,"November 22, 2024",/Users/spangher/Projects/stanford-research/rfi...,APPENDIX \n\n20\n\nMODIFIED TEXT OF PROPOSED R...
2660,part6_all_comments_combined_redacted_oral_not_...,cppa_admt,PR-02-2023,COVINGTON Covington & Burling LLP,COVINGTON Covington & Burling LLP,"November 8, 2024",/Users/spangher/Projects/stanford-research/rfi...,COVINGTON Covington & Burling LLP\nScoseane\na...
2661,part5_all_comments_combined_redacted_oral_not_...,cppa_admt,PR-02-2023,Black Business Association,Black Business Association,"November 8, 2024",/Users/spangher/Projects/stanford-research/rfi...,Black Business Association\nyp pA)\nee\nsonar ...


In [180]:
comment_responses_parsed = pd.concat([
    comment_responses[['jurisdiction', 'collection_id', 'document_role', 'title']].reset_index(drop=True),
    agg_responses
], axis=1)

In [190]:
MATCHING_PROMPT = """You are an expert legal assistant. I just finished conducting an RFI, and collected a lot of 
comments from members of the public. My department wrote a long response to these comments, and I am now
trying to go back and determine which comments these responses were specifically responding to. 

I will show you all the notes they made and then I will show you a comment. If one of the responses corresponds to the 
specific comment, please answer with the index of the note that refers to it. If none of the responses corresponds to it, 
please respond with -1.

<notes>
{notes}
</notes>

<comment>
{comment}
</comment>

Your response:
"""

In [201]:
comment_responses_parsed_w_comments = (
    comment_responses_parsed
        .merge(comments_df[['collection_id', 'doc_id', 'text']], on='collection_id')
)

In [233]:
SUMM_PROMPT = """Can you summarize the main points of this comment in 2-3 sentences?

{input_text}

Your response:
"""

In [254]:
summaries = await prompt_utils.process_batch(prompt_template=SUMM_PROMPT, texts=comment_responses_parsed_w_comments['text'])

  0%|          | 0/789 [00:00<?, ?it/s]

In [255]:
comment_responses_parsed_w_comments['comment_summary'] = summaries

In [256]:
prompts = (
    comment_responses_parsed_w_comments
         .apply(lambda x: MATCHING_PROMPT.format(
             notes='\n'.join(list(map(lambda y: f"{y[0]}: {y[1]['content_of_comment']}", enumerate(x['responses_to_comments'])))),
             comment=x['comment_summary']
         ), axis=1)
    .tolist()
)

In [257]:
indices_2 = await prompt_utils.process_batch(prompts=prompts, concurrency=50)

  0%|          | 0/789 [00:00<?, ?it/s]

In [263]:
(pd.Series(indices) == pd.Series(indices_2)).value_counts()

True     527
False    262
Name: count, dtype: int64
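
The two matching runs agree on 527 of 789 comments, about 67%. A minimal sketch of computing that rate directly is given below. The helper name `agreement_rate` and the toy inputs are hypothetical, and it assumes both runs return one reply per comment in the same order.

```python
import pandas as pd


def agreement_rate(first: pd.Series, second: pd.Series) -> float:
    """Share of positions where two runs gave the same (whitespace-trimmed) reply."""
    if len(first) != len(second):
        raise ValueError("both runs must cover the same comments")
    a = first.astype(str).str.strip().reset_index(drop=True)
    b = second.astype(str).str.strip().reset_index(drop=True)
    return float((a == b).mean())


# Toy replies, for illustration only.
run_a = pd.Series(["-1", "3", "0", "8"])
run_b = pd.Series(["-1", "3 ", "2", "8"])
rate = agreement_rate(run_a, run_b)
print(f"{rate:.2f}")
```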

In [265]:
pd.Series(indices_2).value_counts()

-1    510
3      96
0      68
8      61
2      18
9      15
1      14
4       5
6       1
7       1
Name: count, dtype: int64

In [266]:
comment_responses_parsed_w_comments['response_index'] = indices
comment_responses_parsed_w_comments['response_index_2'] = indices_2

In [282]:
matched_comment_response_df = (
    comment_responses_parsed_w_comments
         # .assign(response_index=lambda df: df['response_index_2'])
         .loc[lambda df: df['response_index'].str.isdigit()]
         .assign(response_index=lambda df: df['response_index'].astype(int))
         .assign(responses_to_comments=lambda df: df.apply(lambda x: x['responses_to_comments'][x['response_index']], axis=1))
         .reset_index(drop=True)
         .pipe(lambda df: pd.concat([
             df[['collection_id', 'doc_id']], 
             pd.DataFrame(df['responses_to_comments'].tolist()),
             df[['text', 'comment_summary']]
         ], axis=1))
)
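
The `str.isdigit()` filter above drops `-1` replies, but it would also drop a reply with stray whitespace and would let an out-of-range index through to the list lookup. The sketch below shows a stricter parse under the assumption that replies are raw strings. `parse_note_index` is a hypothetical helper, not part of `prompt_utils`.

```python
from __future__ import annotations

import re


def parse_note_index(raw: object, n_notes: int) -> int:
    """Return a valid note index from a model reply, or -1 when none applies."""
    if not isinstance(raw, str):
        return -1
    # Accept a bare integer, optionally padded or followed by a period.
    match = re.fullmatch(r"\s*(-?\d+)\s*\.?\s*", raw)
    if match is None:
        return -1
    index = int(match.group(1))
    # Anything outside the notes list is treated as "no match".
    return index if 0 <= index < n_notes else -1


examples = ["3", " 8\n", "-1", "12", "note 3", None]
parsed_indices = [parse_note_index(e, n_notes=10) for e in examples]
print(parsed_indices)
```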

In [345]:
matched_comment_response_df[['content_of_comment', 'response_to_comment', 'comment_summary']].head(1).to_dict(orient='records')

[{'content_of_comment': 'A number of commenters characterized the definition of NSS as being too broad and expressed concerns that an agency could exempt systems that serve a de minimis national security purpose from implementing the guidance. For example, commenters offered that a potential misuse of the rule would shield AI deployed for rights- impacting or safety-impacting use cases, such as those in law enforcement and immigration.',
  'response_to_comment': 'OMB takes seriously these comments and acknowledges the public’s concern on this topic. OMB strongly believes that issues of AI governance, innovation, and risk for NSS must be managed appropriately. But given the legal authorities under which this guidance is being issued, it would be inappropriate to include NSS within its scope.  In particular, Section 10.1(i) of Executive Order 14110 directs that, “[t]he initial means, instructions, and guidance issued pursuant to subsections 10.1(a)-(h) of this section”—which include this

In [283]:
matched_comment_response_df[[
    'doc_id',
    'content_of_comment', #'response_to_comment', 
    'comment_summary'
]].iloc[5].to_dict()

{'doc_id': 'OMB-2023-0020-0192',
 'content_of_comment': 'A number of commenters characterized the definition of NSS as being too broad and expressed concerns that an agency could exempt systems that serve a de minimis national security purpose from implementing the guidance. For example, commenters offered that a potential misuse of the rule would shield AI deployed for rights- impacting or safety-impacting use cases, such as those in law enforcement and immigration.',
 'comment_summary': 'The comment endorses the Administration’s AI Executive Order and draft OMB guidance but urges that implementation meaningfully center civil rights and equity by building on the AI Bill of Rights—requiring AI systems to be proven safe, effective, and nondiscriminatory before use and adding requirements for notice, explanation, and data privacy. It also calls for adequately resourced agency governance (including empowered CAIOs and labor and civil-rights input), clearer and public rules for waivers and

In [301]:
full_matched_comment_df = pd.concat([
    (comment_responses_parsed_w_comments[['collection_id', 'doc_id', 'text', 'comment_summary']]
         .loc[lambda df: ~df['doc_id'].isin(matched_comment_response_df['doc_id'])]
         .loc[lambda df: df['text'].str.split().str.len() > 50]
         .assign(c=0)
        )
    ,
    matched_comment_response_df.assign(c=1).loc[lambda df: df['text'].str.split().str.len() > 50]
])

In [304]:
full_matched_comment_df['collection_id'].unique()

array(['OMB-2023-0020', 'PR-02-2023'], dtype=object)

In [306]:
comment_responses_parsed_w_comments['collection_id'].drop_duplicates()

0      OMB-2023-0020
197       PR-02-2023
Name: collection_id, dtype: object

In [327]:
TASK_PROMPT = """You are a helpful legal assistant. 
Summarize the call here in 1-2 sentences. Please express your summary to be read by a member of the public
and tell the public exactly what kind of feedback is being requested. (Please request feedback.)

<call>
{input_text}
</call>

Your response:
"""

In [331]:
calls = (
    call_response_df
     .loc[lambda df: df['document_role'].isin(['call_overview', 'policy_guidance'])]
     .loc[lambda df: df['collection_id'].isin(['PR-02-2023', 'OMB-2023-0020'])]
)
tasks = await prompt_utils.process_batch(
    texts=calls['preview'].tolist(),
    prompt_template=TASK_PROMPT,
    model='gpt-5'
)

In [334]:
calls[['collection_id']].assign(call=tasks).reset_index(drop=True)

Unnamed: 0,collection_id,call
0,OMB-2023-0020,OMB has issued guidance directing federal agen...
1,PR-02-2023,The California Privacy Protection Agency has f...


In [337]:
(full_matched_comment_df
 # Join on the only shared column; naming it makes the merge key explicit.
 .merge(calls[['collection_id']].assign(call=tasks).reset_index(drop=True), on='collection_id')
 .to_csv('full_matched_comment_df__omb-pr.csv')
)

In [338]:
full_matched_comment_df

Unnamed: 0,collection_id,doc_id,text,comment_summary,c,content_of_comment,response_to_comment
0,OMB-2023-0020,FR-2023-24269,"Federal Register / Vol. 88, No. 212 / Frid...",The OMB is seeking public comment on a draft m...,0,,
6,OMB-2023-0020,OMB-2023-0020-0041,"Wilson Hall \nRoom 220 \nCharlottesville, VA 2...",The comment urges the OMB to avoid overburdeni...,0,,
7,OMB-2023-0020,OMB-2023-0020-0046,Ms. Clare Martorana \nFederal Chief Informatio...,The Professional Services Council (PSC) welcom...,0,,
8,OMB-2023-0020,OMB-2023-0020-0043,Sam Daniel J Timothy \nAI Research Scientist \...,The comment urges agencies to adapt roles and ...,0,,
9,OMB-2023-0020,OMB-2023-0020-0058,"750 9th Street, NW \nWashington, DC 20001 \nww...",The Blue Cross Blue Shield Association support...,0,,
...,...,...,...,...,...,...,...
319,PR-02-2023,part10_all_comments_combined_redacted_oral_not...,"SIRM@SIRM\nSuse Califoria\nFebruary 19,2025,\n...",SHRM and SHRM California urge the CPPA to adop...,1,Comment expresses concern that the revised def...,The Agency disagrees with this comment. The de...
320,PR-02-2023,ccpa_updates_all_written_comments_p2.pdf#L006,APPENDIX \n\n20\n\nMODIFIED TEXT OF PROPOSED R...,This appendix shows the modified proposed Cali...,1,Comment raises concerns that the removal of AI...,The Agency disagrees with this comment. The re...
321,PR-02-2023,part6_all_comments_combined_redacted_oral_not_...,COVINGTON Covington & Burling LLP\nScoseane\na...,CalChamber contends the CPPA’s draft ADMT regu...,1,Comment expresses concern that the revised def...,The Agency disagrees with this comment. The de...
322,PR-02-2023,part5_all_comments_combined_redacted_oral_not_...,Black Business Association\nyp pA)\nee\nsonar ...,The Black Business Association told the CPPA t...,1,Comment expresses concern that the revised def...,The Agency disagrees with this comment. The de...


# EPA Matched Comments

In [21]:
epa_matched_df = pd.read_csv('full_matched_comment_df__epa.csv', index_col=0)

In [62]:
epa_matched_df['comment_in_response'].dropna().iloc[10]

"One commenter (IV-D-14) believes the particulate matter emission limits proposed for lime manufacturing kilns and coolers do not represent the maximum achievable control technology and are much less stringent than the limits actually required by the Clean Air Act. The commenter notes that the proposed rule discredits performance test data which demonstrate that particulate emissions of less than half the proposed standard for existing plants are routinely achieved by claiming they may not be consistently achievable, but EPA has provided no statistics. The commenter claims that EPA has chosen instead to base the standards on permit limits, but has selectively eliminated from consideration those permits calling for stringent controls which are currently in place. The commenter gives the examples of Continental Lime which is in compliance with a BACT limit for PM emissions of 0.05 lb/ton limestone, and Western Lime which is in compliance with a permit limit for PM emissions of 0.06 lb/to

In [63]:
epa_matched_df.iloc[10].to_dict()

{'index': 'EPA-HQ-OAR-2002-0052',
 'comment_in_response': "One commenter (IV-D-14) believes the particulate matter emission limits proposed for lime manufacturing kilns and coolers do not represent the maximum achievable control technology and are much less stringent than the limits actually required by the Clean Air Act. The commenter notes that the proposed rule discredits performance test data which demonstrate that particulate emissions of less than half the proposed standard for existing plants are routinely achieved by claiming they may not be consistently achievable, but EPA has provided no statistics. The commenter claims that EPA has chosen instead to base the standards on permit limits, but has selectively eliminated from consideration those permits calling for stringent controls which are currently in place. The commenter gives the examples of Continental Lime which is in compliance with a BACT limit for PM emissions of 0.05 lb/ton limestone, and Western Lime which is in com

# Scratch

In [50]:
import pandas as pd 
import glob
nist_airmf_docs_df = pd.read_json('../data/comments/nist_airmf/AI-RMF-2ND-DRAFT-2022/AI-RMF-2ND-DRAFT-2022.meta.jsonl', lines=True)
uk_ai_proposal = '../data/comments/gov_uk/government_consultations_ai-regulation-a-pro-innovation-approach-policy-proposals/government_consultations_ai-regulation-a-pro-innovation-approach-policy-proposals.meta.jsonl'

In [52]:
uk_ai_proposal_df = pd.read_json(uk_ai_proposal, lines=True)

In [16]:
nist_airmf_docs_df.iloc[-1].to_dict()#.iloc[-84]

{'source': 'nist_airmf',
 'collection_id': 'AI-RMF-2ND-DRAFT-2022',
 'doc_id': 'nist_airmf_call.html',
 'title': 'NIST AI RMF Second Draft Call for Comments',
 'submitter': 'NIST',
 'submitter_type': 'agency',
 'org': 'National Institute of Standards and Technology',
 'submitted_at': NaT,
 'language': 'en',
 'urls': {'html': 'https://www.nist.gov/itl/ai-risk-management-framework/comments-2nd-draft-ai-risk-management-framework'},
 'extra': {'source_url': 'https://www.nist.gov/itl/ai-risk-management-framework/comments-2nd-draft-ai-risk-management-framework',
  'document_role': 'call'},
 'kind': 'call'}