# Trello Card Answers

**1.** Create a data example around search with agency, text, and CFR criteria  
**2.** How would we obtain every Federal Register docket JSON? What would it look like?

---
## 1. Search by Agency, Text, and CFR

**Example:** CMS + FDA, text "medicare", 42 CFR Part 412

In [14]:
import sqlite3
from pathlib import Path

DB_PATH = Path.cwd() / "documents.db"

def search_local(agency_ids=None, search_term=None, cfr_title=None, cfr_part=None, limit=25):
    """Search documents.db by agency, text, and CFR."""
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    conditions, params = ["1=1"], []
    if agency_ids:
        ids = agency_ids if isinstance(agency_ids, (list, tuple)) else [agency_ids]
        conditions.append(f"UPPER(agency_id) IN ({','.join('?'*len(ids))})")
        params.extend([a.upper() for a in ids])
    if search_term:
        t = f"%{search_term}%"
        conditions.append("(document_title LIKE ? OR doc_abstract LIKE ?)")
        params.extend([t, t])
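    # NOTE: both CFR filters are substring matches, so title 42 also matches
    # "142 CFR" and part 412 also matches "4120"; post-filter if exact hits matter.
    if cfr_title is not None:
        conditions.append("cfr_part LIKE ?")
        params.append(f"%{cfr_title} CFR%")
    if cfr_part is not None:
        conditions.append("cfr_part LIKE ?")
        params.append(f"%{cfr_part}%")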
    params.append(limit)
    sql = (
        "SELECT document_id, docket_id, agency_id, document_title, document_type, cfr_part, posted_date "
        f"FROM documents WHERE {' AND '.join(conditions)} LIMIT ?"
    )
    # Close the connection even if the query raises.
    try:
        rows = conn.execute(sql, params).fetchall()
    finally:
        conn.close()
    return [dict(r) for r in rows]

# Run example (requires documents.db: run insert_documents.py after aws s3 sync)
if not DB_PATH.exists():
    print("documents.db not found. Run: aws s3 sync ... then insert_documents.py")
else:
    results = search_local(agency_ids=["CMS", "FDA"], search_term="medicare", cfr_title=42, cfr_part=412, limit=10)
    print(f"Found {len(results)} documents (agency=CMS/FDA, text='medicare', 42 CFR 412):\n")
    for r in results[:5]:
        # cfr_part can be NULL in the database, so fall back to '' before slicing.
        print(f"  {r['document_id']} | {r['agency_id']} | {r['document_type']} | {(r.get('cfr_part') or '')[:50]}")

Found 10 documents (agency=CMS/FDA, text='medicare', 42 CFR 412):

  CMS-2008-0048-0002 | CMS | Rule | 42 CFR 412
  CMS-2008-0048-0001 | CMS | Rule | 42 CFR 412
  CMS-2011-0059-0051 | CMS | Rule | 42 CFR Part 412
  CMS-2011-0059-0050 | CMS | Rule | 42 CFR 412
  CMS-2011-0059-0052 | CMS | Rule | 42 CFR Part 412


**Prereqs:** `documents.db` from Mirrulations data (`aws s3 sync ...` then `insert_documents.py`) or from regulations.gov API via `fetch_agency_documents.py`.
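
The `LIKE` filters in `search_local` are substring matches, so `cfr_part=412` would also accept a value such as `42 CFR 4120`. One possible way to tighten this is to post-filter the returned rows in Python. The sketch below is only an illustration: `cfr_matches` is a hypothetical helper, and it assumes the `cfr_part` column holds free-text values like the `42 CFR 412` and `42 CFR Part 412` strings shown in the output above.

```python
import re

def cfr_matches(cfr_text, title, part):
    """Return True if cfr_text names the given CFR title and part as whole numbers.

    Hypothetical helper: like search_local, it checks title and part
    independently, but refuses matches that are embedded in longer numbers.
    """
    if not cfr_text:
        return False
    # Title must appear as a standalone number directly before "CFR".
    title_ok = re.search(rf"(?<!\d){title}\s+CFR\b", cfr_text, re.IGNORECASE)
    # Part must appear as a standalone number anywhere in the string.
    part_ok = re.search(rf"(?<!\d){part}(?!\d)", cfr_text)
    return bool(title_ok and part_ok)

samples = ["42 CFR 412", "42 CFR Part 412", "42 CFR 4120", "142 CFR 412", None]
for s in samples:
    print(repr(s), cfr_matches(s, 42, 412))
```

It could be applied as `[r for r in results if cfr_matches(r["cfr_part"], 42, 412)]`, at the cost of possibly returning fewer rows than `limit`.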

---
## 2. How to Obtain Every Federal Register Docket JSON

**Use the Mirrulations S3 bucket**, which mirrors regulations.gov (dockets + documents): ~2.2M files, ~9.1 GB.

In [15]:
# Copy and run in terminal to download all docket + document JSON (~9 GB, ~40 min):
print("aws s3 sync s3://mirrulations/raw-data/ data --exclude '*' --include '*/text-*/docket/*.json' --include '*/text-*/documents/*.json' --only-show-errors")

aws s3 sync s3://mirrulations/raw-data/ data --exclude '*' --include '*/text-*/docket/*.json' --include '*/text-*/documents/*.json' --only-show-errors
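
To see which objects the `--exclude`/`--include` chain keeps, the sketch below approximates it in Python. It relies on the documented AWS CLI behavior that everything is included by default and that later filters take precedence; `fnmatchcase` is only an approximation of the CLI's pattern matching, `is_synced` is a hypothetical name, and the `comments/` and `.htm` keys are made-up examples of objects the filters would skip.

```python
from fnmatch import fnmatchcase

# Filters in the same order as on the command line.
FILTERS = [
    ("exclude", "*"),
    ("include", "*/text-*/docket/*.json"),
    ("include", "*/text-*/documents/*.json"),
]

def is_synced(key, filters=FILTERS):
    """Return True if a key (relative to raw-data/) would be downloaded."""
    selected = True  # with no filters, every object is included
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            # The last filter that matches decides the outcome.
            selected = (action == "include")
    return selected

keys = [
    "CMS/CMS-2008-0048/text-CMS-2008-0048/docket/CMS-2008-0048.json",
    "CMS/CMS-2008-0048/text-CMS-2008-0048/documents/CMS-2008-0048-0002.json",
    "CMS/CMS-2008-0048/text-CMS-2008-0048/comments/CMS-2008-0048-0003.json",      # hypothetical
    "CMS/CMS-2008-0048/text-CMS-2008-0048/documents/CMS-2008-0048-0002_content.htm",  # hypothetical
]
for k in keys:
    print(is_synced(k), k)
```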


**Directory layout:**
```
data/<agency>/<docket_id>/text-<docket_id>/
├── docket/<docket_id>(N).json     # docket metadata
└── documents/<document_id>.json  # documents (Rules, etc.)
```
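
Because the agency, docket ID, and record type are all encoded in the path, they can be recovered without opening the file. A minimal sketch, assuming the five-level layout shown above; `parse_mirror_path` is a hypothetical helper, and the file stem is returned unchanged (including any `(N)` suffix on docket files).

```python
from pathlib import PurePosixPath

def parse_mirror_path(path, root="data"):
    """Split data/<agency>/<docket_id>/text-<docket_id>/<kind>/<file>.json into fields."""
    parts = PurePosixPath(path).relative_to(root).parts
    if len(parts) != 5 or not parts[2].startswith("text-"):
        raise ValueError(f"unexpected layout: {path}")
    agency, docket_id, _, kind, filename = parts
    if kind not in ("docket", "documents"):
        raise ValueError(f"unexpected record type: {kind}")
    return {
        "agency": agency,
        "docket_id": docket_id,
        "kind": kind,
        "file_stem": PurePosixPath(filename).stem,
    }

print(parse_mirror_path(
    "data/CMS/CMS-2008-0048/text-CMS-2008-0048/documents/CMS-2008-0048-0002.json"
))
```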

**What it looks like — Docket JSON:**

In [16]:
import json
from pathlib import Path

# Sample: load the first docket JSON found.
# next() stops at the first match instead of listing every file in the mirror.
data_dir = Path.cwd() / "data"
docket_file = next(data_dir.rglob("docket/*.json"), None)
if docket_file:
    d = json.loads(docket_file.read_text(encoding="utf-8"))
    attrs = d["data"]["attributes"]
    print("Docket JSON (abbreviated):")
    print(json.dumps({"id": d["data"]["id"], "agencyId": attrs.get("agencyId"), "title": attrs.get("title")}, indent=2))
else:
    print("No docket JSON found. Run aws s3 sync above.")

Docket JSON (abbreviated):
{
  "id": "CDC-2006-0106",
  "agencyId": "CDC",
  "title": "Agency information collection activities; proposals, submissions, and approvals"
}


**What it looks like — Document JSON** (includes `frDocNum` for Federal Register linkage):

In [17]:
# Sample: load the first document JSON that has an frDocNum (any document type)
from itertools import islice

# islice() checks only the first 20 files without listing the whole mirror.
for p in islice(data_dir.rglob("documents/*.json"), 20):
    d = json.loads(p.read_text(encoding="utf-8"))
    attrs = d.get("data", {}).get("attributes", {})
    if attrs.get("frDocNum"):
        print("Document JSON (with Federal Register link):")
        print(json.dumps({"id": d["data"]["id"], "docketId": attrs.get("docketId"), "documentType": attrs.get("documentType"), "cfrPart": attrs.get("cfrPart"), "frDocNum": attrs.get("frDocNum"), "title": (attrs.get("title") or "")[:60]}, indent=2))
        break
else:
    # for/else: runs only when the loop finishes without hitting break.
    print("No documents with frDocNum in the first 20. Run aws s3 sync above, or scan more files.")

Document JSON (with Federal Register link):
{
  "id": "ADF-2017-0005-0001",
  "docketId": null,
  "documentType": "Notice",
  "cfrPart": null,
  "frDocNum": "2017-21745",
  "title": "Meetings: Board of Directors"
}
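
The `frDocNum` value is what ties a regulations.gov document to its Federal Register entry. As a rough sketch, it can be turned into a Federal Register API lookup URL. The endpoint shape used here (`/api/v1/documents/<document_number>.json`) is an assumption based on the public FederalRegister.gov API and should be checked against its documentation; `fr_document_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Assumed base path of the FederalRegister.gov v1 documents endpoint.
FR_API = "https://www.federalregister.gov/api/v1/documents"

def fr_document_url(fr_doc_num):
    """Build a lookup URL from frDocNum, or return None when it is missing."""
    if not fr_doc_num:
        return None
    return f"{FR_API}/{quote(fr_doc_num.strip(), safe='')}.json"

print(fr_document_url("2017-21745"))
```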


**Alternative:** the regulations.gov API. Paginate `/dockets` and `/documents` per agency. It is rate limited (~1,000 requests/hr), so use it for subsets, not the full corpus.
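
A back-of-the-envelope estimate shows why the API does not suit a full download. The sketch below takes the ~1,000 requests/hr limit and the ~2.2M file count quoted above, and assumes one request per file (an assumption; list endpoints return several records per page, but full detail is typically fetched per record).

```python
def hours_to_fetch(n_requests, requests_per_hour=1000):
    """Hours needed to issue n_requests while staying under the hourly limit."""
    return n_requests / requests_per_hour

def min_interval_seconds(requests_per_hour=1000):
    """Smallest steady delay between requests that respects the limit."""
    return 3600 / requests_per_hour

total_files = 2_200_000  # corpus size quoted for the S3 mirror
hours = hours_to_fetch(total_files)
print(f"{hours:.0f} hours (~{hours / 24:.0f} days), "
      f"one request every {min_interval_seconds():.1f} s")
```

Under these assumptions a full crawl takes roughly three months, compared with about 40 minutes for the S3 sync.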