
# Regulations.gov Scrape Explorer

Inspect the latest Regulations.gov harvest stored in `data/app_data/ai_corpus.db`. The notebook counts documents per docket, displays a random sample of comments from a chosen docket, and looks up individual documents by ID.


In [49]:
from pathlib import Path
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

plt.style.use('seaborn-v0_8')
sns.set_context('talk')

In [52]:
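def detect_repo_root():
    """Walk up from the working directory to the first folder holding both `data/` and `ai_corpus/`."""
    here = Path.cwd()
    for candidate in [here, *here.parents]:
        if (candidate / 'data').exists() and (candidate / 'ai_corpus').exists():
            return candidate
    # Fall back to the working directory when no marker folders are found.
    return here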

REPO_ROOT = detect_repo_root()
DB_PATH = REPO_ROOT / 'data/app_data/ai_corpus.db'
BLOB_DIR = REPO_ROOT / 'data/comments/blobs'
assert DB_PATH.exists(), f"Database not found: {DB_PATH}"
connection = sqlite3.connect(DB_PATH)

In [106]:

query = """
SELECT
    collection_id,
    doc_id,
    title,
    submitter_name,
    submitter_type,
    submitted_at,
    text_path,
    pdf_path,
    COALESCE(json_extract(raw_meta, '$.document_kind'), 'response') AS document_kind
FROM documents
WHERE source = 'regulations_gov'
"""
reg_docs = pd.read_sql_query(query, connection)
# Defensive no-op: COALESCE in the query already defaults missing kinds to 'response'.
reg_docs['document_kind'] = reg_docs['document_kind'].fillna('response')
reg_docs.head()


index,collection_id,doc_id,title,submitter_name,submitter_type,submitted_at,text_path,pdf_path,document_kind
0,NTIA-2023-0005,NTIA-2023-0005-1453,Questions,,,2023-06-27T04:00:00Z,data/comments/blobs/a6/a63723c0a901bc77b465320...,data/NTIA-2023-0005/raw/NTIA-2023-0005-1453_at...,response
1,NTIA-2023-0005,NTIA-2023-0005-1451,NTIA COMMENTARY AI ACCOUNTABILITY ACEMOGLU JO...,,,2023-06-16T04:00:00Z,data/comments/blobs/1b/1b4888cec04e828f3e065d6...,data/NTIA-2023-0005/raw/NTIA-2023-0005-1451_at...,response
2,NTIA-2023-0005,NTIA-2023-0005-1139,Comment on FR Doc # 2023-07776,,,2023-06-15T04:00:00Z,data/comments/blobs/3a/3a26fd25977a528e94602c7...,data/NTIA-2023-0005/raw/NTIA-2023-0005-1139_at...,response
3,NTIA-2023-0005,NTIA-2023-0005-1092,Comment on FR Doc # 2023-07776,,,2023-06-15T04:00:00Z,data/comments/blobs/d9/d97b79a18462f41ae99d8f9...,data/NTIA-2023-0005/raw/NTIA-2023-0005-1092_at...,response
4,NTIA-2023-0005,NTIA-2023-0005-1121,Comment on FR Doc # 2023-07776,,,2023-06-15T04:00:00Z,data/comments/blobs/03/03b9515ba3dced6bf388862...,data/NTIA-2023-0005/raw/NTIA-2023-0005-1121_at...,response


In [107]:
reg_docs.shape

(11585, 9)

In [108]:
reg_docs['collection_id'].str.split('-').str.get(0).value_counts()

collection_id
AMS      4258
IRS      2202
EPA      2102
NTIA     1787
NOAA      408
OMB       358
FDA       140
NPS       104
SSA        58
APHIS      29
OSHA       27
FMCSA      22
FAR        22
NRC        18
FWS        15
FAA        14
ED          7
GSA         3
CFPB        2
OCC         2
FRA         2
CMS         1
OPM         1
VA          1
DEA         1
HUD         1
Name: count, dtype: int64

In [109]:
reg_docs['document_kind'].value_counts()

document_kind
response    11548
call           37
Name: count, dtype: int64


## Counts per docket

Documents issued by the government, such as the request for comments that opens a docket, are stored with `document_kind = 'call'`; public submissions are stored with `document_kind = 'response'`. The next cell counts both kinds per docket.


In [100]:
counts = (
    reg_docs
    .groupby(['collection_id', 'document_kind'])
    .size()
    .unstack(fill_value=0)
    .rename(columns={
        'call': 'government_updates',
        'response': 'comments'
    })
    .reset_index()
)
counts['total_docs'] = counts['government_updates'] + counts['comments']
counts.sort_values('total_docs', ascending=False).head(10)

document_kind,collection_id,government_updates,comments,total_docs
0,AMS-NOP-24-0023,0,4258,4258
82,IRS-2022-0023,0,2202,2202
88,NTIA-2023-0005,1,1451,1452
18,EPA-HQ-OW-2017-0300,0,746,746
11,EPA-HQ-OPP-2003-0132,0,590,590
84,NOAA-NMFS-2022-0127,0,405,405
9,EPA-HQ-OLEM-2022-0174,0,383,383
89,NTIA-2023-0009,1,334,335
91,OMB-2023-0020,1,196,197
92,OMB-2023-0021,0,161,161
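
The aggregation above adds `government_updates` and `comments`, which assumes that both kinds occur somewhere in the data. The following is a minimal sketch of a variant that still works when one kind is absent, using `reindex` to guarantee both columns. `toy_docs` and `count_by_kind` are illustrative names, not part of the notebook, and the toy frame stands in for `reg_docs`.

```python
import pandas as pd

# Toy stand-in for reg_docs (hypothetical data, not from the database).
toy_docs = pd.DataFrame({
    'collection_id': ['A-1', 'A-1', 'A-1', 'B-2', 'B-2'],
    'document_kind': ['response', 'response', 'call', 'response', 'response'],
})

def count_by_kind(docs: pd.DataFrame) -> pd.DataFrame:
    """Per-docket counts that tolerate a missing 'call' or 'response' kind."""
    counts = (
        docs
        .groupby(['collection_id', 'document_kind'])
        .size()
        .unstack(fill_value=0)
        # reindex guarantees both columns exist even if one kind never occurs
        .reindex(columns=['call', 'response'], fill_value=0)
        .rename(columns={'call': 'government_updates', 'response': 'comments'})
        .reset_index()
    )
    counts['total_docs'] = counts['government_updates'] + counts['comments']
    return counts.sort_values('total_docs', ascending=False)

toy_counts = count_by_kind(toy_docs)
# A frame with no 'call' rows still yields a government_updates column of zeros.
only_comments = count_by_kind(toy_docs[toy_docs['document_kind'] == 'response'])
```

Without the `reindex` step, a harvest containing only public submissions would raise a `KeyError` when `government_updates` is read.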


In [105]:
# Spot-check a single record from the largest docket.
(reg_docs
 .loc[lambda df: df['collection_id']=='AMS-NOP-24-0023']
 .loc[lambda df: df['doc_id'] == 'AMS-NOP-24-0023-3011']
)

index,collection_id,doc_id,title,submitter_name,submitter_type,submitted_at,text_path,pdf_path,document_kind
7947,AMS-NOP-24-0023,AMS-NOP-24-0023-3011,HS: 2026 Sunsets (Misc) - Bobbie,,,2024-10-01T04:00:00Z,/Users/spangher/Projects/stanford-research/rfi...,/Users/spangher/Projects/stanford-research/rfi...,response



## Explore a specific docket

Set `DOCKET_ID` to any `collection_id` from the summary above to list that docket's government updates and sample its comments.
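
A mistyped docket ID only produces a bare "No rows" error. One possible sketch of a friendlier check, which suggests close matches with the standard library's `difflib`, is shown here. `check_docket_id` and `toy_docs` are hypothetical names; in the notebook the frame would be `reg_docs`.

```python
import difflib
import pandas as pd

# Toy stand-in for reg_docs; the IDs are taken from the summary table.
toy_docs = pd.DataFrame({
    'collection_id': [
        'OMB-2023-0020', 'OMB-2023-0021', 'NTIA-2023-0005', 'NTIA-2023-0009',
    ],
})

def check_docket_id(docs: pd.DataFrame, docket_id: str, n_suggestions: int = 3) -> str:
    """Return docket_id if it exists, otherwise raise with close matches."""
    known = sorted(docs['collection_id'].dropna().unique())
    if docket_id in known:
        return docket_id
    suggestions = difflib.get_close_matches(docket_id, known, n=n_suggestions)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ''
    raise ValueError(f"No rows for docket {docket_id}.{hint}")

DOCKET_ID = check_docket_id(toy_docs, 'OMB-2023-0020')
```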


In [62]:

DOCKET_ID = 'OMB-2023-0020'  # change as needed

subset = reg_docs[reg_docs['collection_id'] == DOCKET_ID].copy()
if subset.empty:
    raise ValueError(f"No rows for docket {DOCKET_ID}")
subset_counts = (
    subset['document_kind']
    .value_counts()
    .rename_axis('document_kind')
    .reset_index(name='count')
)
subset_counts


index,document_kind,count
0,response,196
1,call,1


In [64]:
gov_updates = subset.loc[lambda df: df['document_kind'] == 'call'][['doc_id', 'title', 'submitted_at']]
gov_updates.sort_values('submitted_at').head(10)

index,doc_id,title,submitted_at
1983,FR-2023-24269,"Request for Comments on Advancing Governance, ...",2023-11-03T04:00:00Z


In [65]:

comments = (
    subset
        .loc[lambda df: df['document_kind'] == 'response']
        [['doc_id', 'submitter_name', 'submitter_type', 'submitted_at', 'text_path']]
)
comments.head(10)


index,doc_id,submitter_name,submitter_type,submitted_at,text_path
1784,OMB-2023-0020-0197,,,2023-12-07T05:00:00Z,data/comments/blobs/ab/ab8abd75f71dc4bc839505d...
1785,OMB-2023-0020-0196,,,2023-12-07T05:00:00Z,data/comments/blobs/2b/2b26335bd0a4e0adf9aff28...
1786,OMB-2023-0020-0055,,,2023-12-06T05:00:00Z,data/comments/blobs/ef/efd3335f5672a97c8b8e980...
1787,OMB-2023-0020-0041,,,2023-12-06T05:00:00Z,data/comments/blobs/e5/e519b1e28ec9b9c004b11f8...
1788,OMB-2023-0020-0046,,,2023-12-06T05:00:00Z,data/comments/blobs/8c/8cd010d50f0a3ddfd531f96...
1789,OMB-2023-0020-0043,,,2023-12-06T05:00:00Z,data/comments/blobs/9b/9b36c42bf408fe66a24575c...
1790,OMB-2023-0020-0058,,,2023-12-06T05:00:00Z,data/comments/blobs/5c/5c9af63dffc4ad104ffe7c3...
1791,OMB-2023-0020-0039,,,2023-12-06T05:00:00Z,data/comments/blobs/f5/f57a02c717ada2ce9227148...
1792,OMB-2023-0020-0053,,,2023-12-06T05:00:00Z,data/comments/blobs/88/88310054db5ed3fdbc60fca...
1793,OMB-2023-0020-0049,,,2023-12-06T05:00:00Z,data/comments/blobs/68/680b57d2fdc8a9be2bda660...


In [47]:
def resolve_path(relative_path: str) -> Path:
    """Return absolute paths unchanged; resolve relative ones against REPO_ROOT."""
    file_path = Path(relative_path)
    if file_path.is_absolute():
        return file_path
    return REPO_ROOT / relative_path

def read_text(sample_row, max_chars=1200):
    """Return up to max_chars of the row's text file, or a placeholder if it is unavailable."""
    path = sample_row.get('text_path')
    if not path:
        return '<no text path recorded>'
    file_path = resolve_path(path)
    if not file_path.exists():
        return f'<missing text file: {file_path}>'
    text = file_path.read_text(encoding='utf-8', errors='replace')
    return text[:max_chars] + ('\n…' if len(text) > max_chars else '')

# `comments` is built in the docket cell above; re-run that cell after changing DOCKET_ID.
sample_size = min(5, len(comments))
sample = comments.sample(sample_size, random_state=42) if sample_size else pd.DataFrame()

for _, row in sample.iterrows():
    header = f"**{row['doc_id']} - {row['submitter_name'] or 'Unknown submitter'} ({row['submitted_at'] or 'no date'})**"
    body = read_text(row)
    display(Markdown(header))
    # Two trailing spaces force a Markdown line break at each newline.
    display(Markdown(body.replace('\n', '  \n')))

**NTIA-2023-0009-0245 - Unknown submitter (2024-04-03T04:00:00Z)**

Cohere Comments on the National Telecommunications & Information Administration’s Request for   
Comments on Dual Use Foundation Artificial Intelligence Models with Widely Available Model   
Weights    
  
Docket Number NTIA 240216-0052   
March 27, 2024   
  
Cohere appreciates the opportunity to submit comments in response to National Telecommunications &   
Information Administration (NTIA)’s Request for Comments (RFC) on Dual Use Foundation Artificial   
Intelligence Models with Widely Available Model Weights (open foundation models).    
  
Introduction   
  
Cohere is one of the leading enterprise-focused foundation model developers worldwide. Cohere   
empowers every developer and enterprise to build amazing products and capture true business value   
with language AI.    
  
Cohere is the only foundation model developer to have signed each of the White House’s Updated   
Voluntary Commitments1 and the Canadian Federal Government’s Voluntary Code of Conduct on the   
Responsible Development and Management of Advanced Generative AI Systems2 and to also endorse the   
G7’s Hiroshima Process International Code of Conduct for Organizations Developing Advanced AI   
Systems.    
  
Cohere For AI (C4AI) is Cohere's non-profit  
…

**NTIA-2023-0009-0276 - Unknown submitter (2024-04-03T04:00:00Z)**

Concerning the dangers of these models being open source, consider that the dangers must be compared to the baseline dangers of a bad actor simply having access to the internet and a search engine (ex. Google).  
There should be more concern with the data used to train these models, rather than just the model weights or parameters.  
The competitive moat of these large companies capable of building these models comes from the data they have access to and the amount of compute time necessary to train.  
In a recent Wall Street Journal interview with OpenAI’s CTO regarding their Sora text-to-video model, she admitted to using publicly available data and licensed data.  
Which companies are selling or allowing use of their data to OpenAI?  
OpenAI and other companies should be more transparent with their use of data from sources such as Facebook, Instagram, and YouTube.  
There could be data leakage and hallucinations as video output that would violate or encroach on United States citizens’ privacy.  
There are many more things to consider, but the more open source competitor models there are, the more accountable OpenAI and other large companies will have to be, with respect to lawmakers and citiz  
…

**NTIA-2023-0009-0196 - Unknown submitter (2024-04-03T04:00:00Z)**

Open weights prevent ridiculous monopolies that one or a few of the most technically advanced companies could otherwise contrive to instill on a public that is unable to compete. Regulatory capture is not good for the market, or democracy.

**NTIA-2023-0009-0167 - Unknown submitter (2024-04-03T04:00:00Z)**

I'm a former Google Staff Software engineer, from 2008-2018.  
As an opening position statement before answering some of your specific questions, I am in favor of widely available open model weights and am against restrictions on their distributions. The supposed harms are entirely theoretical. I believe there is no way to get broad international cooperation on restrictions.  
As it appears that the civil liberties groups are filing detailed arguments about the free speech and civil rights problems with restricting weights (for example, the Center For Democracy and Technology's open letter to Secretary of Commerce Gina Raimondo), I will instead focus on some of your technical questions instead of Another Comment That's A Blow By Blow Of Junger vs Daley And How Any Restriction On Code Amounts To Prior Restraint And Is Obviously Unconstitutional, and only note that I agree with their conclusions.  
-  
In Question 6-d, you ask if there are concerns around the proliferation of incompatible open licenses. Up until 3 days before the comment deadline, I would have replied that this wasn't much of an issue: license proliferation is annoying to keep track of, but as a model couldn't be combined wi  
…

**NTIA-2023-0009-0177 - Unknown submitter (2024-04-03T04:00:00Z)**

I am extremely concerned about the possibility of restrictions on open model weights. If there is an arbitrary cap beyond which only the largest corporate entities are allowed to progress, those corporate entities will be able to leverage that inherent advantage to exponentially outpace any possible competition. It would be de-facto cementing Google, Facebook, Apple, and Microsoft as the only viable solution when it comes to AI products.  
As a private citizen, I feel as though my right to be secure in my person and papers is under threat and that we are perilously close to trying to outlaw math.


### Inspect a specific document

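Set `DOC_ID` to drill into a single entry. The lookup searches only the docket currently selected by `DOCKET_ID`, so a `DOC_ID` from any other docket raises a `ValueError`.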


In [48]:

DOC_ID = 'EPA-HQ-OW-2017-0300-0895'
if DOC_ID:
    doc_row = subset[subset['doc_id'] == DOC_ID]
    if doc_row.empty:
        raise ValueError(f"Doc {DOC_ID} not found in {DOCKET_ID}")
    display(doc_row)
    row = doc_row.iloc[0]
    if row['document_kind'] == 'response':
        print(read_text(row))
    else:
        print('<government document stored as PDF>')
else:
    print('Set DOC_ID to inspect a specific record.')


ValueError: Doc EPA-HQ-OW-2017-0300-0895 not found in NTIA-2023-0009
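
The error above occurs because the lookup is limited to the selected docket. A minimal sketch of an alternative that searches every docket and reports which one the document belongs to follows. `find_document` and `toy_docs` are hypothetical names; in the notebook the full `reg_docs` frame would take the place of `toy_docs`.

```python
import pandas as pd

# Toy stand-in for reg_docs, using IDs that appear in the outputs above.
toy_docs = pd.DataFrame({
    'collection_id': ['OMB-2023-0020', 'OMB-2023-0020', 'AMS-NOP-24-0023'],
    'doc_id': ['OMB-2023-0020-0197', 'FR-2023-24269', 'AMS-NOP-24-0023-3011'],
    'document_kind': ['response', 'call', 'response'],
})

def find_document(docs: pd.DataFrame, doc_id: str) -> pd.Series:
    """Look up doc_id across every docket and return the first matching row."""
    matches = docs[docs['doc_id'] == doc_id]
    if matches.empty:
        raise ValueError(f"Doc {doc_id} not found in any docket")
    return matches.iloc[0]

row = find_document(toy_docs, 'FR-2023-24269')
print(row['collection_id'], row['document_kind'])
```

Searching the whole frame also handles government documents such as `FR-2023-24269`, whose ID does not start with the docket ID.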