### 🐶 Load & Inspect Agency Mapping for D.O.G.E.

> The agency-to-regulation mapping is retrieved from the [agencies.json](https://www.ecfr.gov/developers/documentation/api/v1) endpoint of the Electronic Code of Federal Regulations (eCFR).

> The API response carries no explicit timestamp. According to [DOGE's regulations sources](https://doge.gov/regulations), the mapping appears to reflect 2024.

> This mapping includes all top-level agencies (sorted by name), along with their respective child agencies, and serves as the foundation for linking CFR titles, chapters, and parts to their governing authorities.


### 🔎 Inspect API Output via CLI

> For a quick command-line inspection of the eCFR `agencies.json` response:  

> ```bash
> curl -X GET "https://www.ecfr.gov/api/admin/v1/agencies.json" -H "accept: application/json" | jq .
> ```  

> This fetches the agency mapping JSON directly and pretty-prints it using `jq`.  
> Grep for keywords, detect patterns, or understand the structure before processing it in Python.


```json
{
  "agencies": [
    {
      "name": "Department of Agriculture",
      "slug": "agriculture-department",
      "children": [
        {
          "name": "Agricultural Marketing Service",
          "slug": "agricultural-marketing-service",
          "cfr_references": [
            {
              "title": 7,
              "chapter": "I"
            },
            ...  // child cfr_references
          ]     
        },      
        ...     // siblings
      ],
      "cfr_references": [
        {
          "title": 2,
          "chapter": "IV"
        },
        {
          "title": 5,
          "chapter": "LXXIII"
        },
        ...   // parent cfr_references
      ]
    },
    ...   // agencies
  ]
}
```

#### API Structure Overview

> The root key is `agencies`, a list of agency dictionaries. Each agency has:
> - Metadata (`name`, `slug`, etc.)  
> - `cfr_references` → used to extract regulation text  
> - Optional `children` → same structure, no nested children

> Nested hierarchy to process:  
> `agencies` → `children` (if any) → `cfr_references`  

> 🔥🔥 Flattening `agencies` + `children` + their `cfr_references` builds the dataset for downstream analysis.
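
> Before flattening, the "no nested children" assumption can be checked directly. The following is a minimal sketch, assuming the response has the shape shown above. `nesting_depth` is a hypothetical helper and `sample` is a hand-written stand-in for the live response, not real data.

```python
def nesting_depth(agency):
    """Return how many levels of `children` sit below this agency (0 = none)."""
    children = agency.get("children") or []
    if not children:
        return 0
    return 1 + max(nesting_depth(child) for child in children)


# Hand-written stand-in for the API response (illustrative only)
sample = {
    "agencies": [
        {
            "name": "Parent A",
            "cfr_references": [{"title": 2, "chapter": "IV"}],
            "children": [
                {"name": "Child A1", "cfr_references": [{"title": 7, "chapter": "I"}]},
            ],
        },
        {"name": "Parent B", "cfr_references": [{"title": 1, "chapter": "III"}]},
    ]
}

depths = [nesting_depth(agency) for agency in sample["agencies"]]
max_depth = max(depths)
print(depths, max_depth)
```

> A maximum depth of 1 is consistent with a two-level walk (`agencies` → `children`). Anything deeper would suggest a recursive flatten instead.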


In [1]:
import sys
from datetime import datetime

from doge_data_challenge.helpers import init_notebook
from doge_data_challenge.helpers.print_helpers import shorten_path

paths = init_notebook()

# Access paths
archive_path = paths["ARCHIVE_PATH"]
xml_snapshot_path = paths["XML_SNAPSHOT_PATH"]
snapshot_date = paths["SNAPSHOT_DATE"]

# Format today's date
today_str = datetime.today().strftime("%Y-%m-%d")
print(today_str)

print(shorten_path(archive_path))
print(shorten_path(xml_snapshot_path))
print(shorten_path(snapshot_date))

# Show the module search path to confirm the project package is importable
for p in sys.path:
    print(shorten_path(p))


2025-05-01
~/repo/doge-data-challenge/archive
~/repo/doge-data-challenge/data/regulations_xml/2025-04-17
2025-04-17
~/repo/doge_data_challenge
~/repo
~/anaconda3/lib/python310.zip
~/anaconda3/lib/python3.10
~/anaconda3/lib/python3.10/lib-dynload

~/Library/Caches/pypoetry/virtualenvs/doge-data-challenge-t_Z9FBnC-py3.10/lib/python3.10/site-packages
~/repo/doge-data-challenge


In [13]:
import json
import os

import requests

# API endpoint for agency metadata
url = "https://www.ecfr.gov/api/admin/v1/agencies.json"

# Fetch the mapping; raise_for_status() turns any 4xx/5xx response
# (e.g. 404 = endpoint not found) into an exception
try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Request to {url} failed: {e}")
    raise
data = response.json()
# Pretty print the JSON
#print(json.dumps(data, indent=2))

# Ensure the archive directory exists
os.makedirs(archive_path, exist_ok=True)

# Define full path with date-stamped filename
filename = os.path.join(archive_path, f"agencies_snapshot_{today_str}.json")

# Save to file
with open(filename, "w") as f:
    json.dump(data, f, indent=2)
print("Saved snapshot of agency json to", shorten_path(filename))

Saved snapshot of agency json to ~/repo/doge-data-challenge/archive/agencies_snapshot_2025-04-28.json


In [39]:
flattened_rows = []


def make_row(name, short_name, slug, ref):
    """Build one flat row from an agency's identifiers and a single CFR reference."""
    return {"name": name,
            "short_name": short_name,
            "slug": slug,
            "title": ref.get('title'),
            "subtitle": ref.get('subtitle'),
            "chapter": ref.get('chapter'),
            "subchapter": ref.get('subchapter'),
            "part": ref.get('part')}


for agency in data['agencies']:
    parent_name = agency.get('name')
    short_name  = agency.get('short_name')
    slug_name   = agency.get('slug')

    # `or []` covers both a missing key and an explicit None value,
    # so the loops below always receive a list
    parent_cfr_refs = agency.get('cfr_references') or []
    children        = agency.get('children') or []

    # Loop over parent CFR refs
    for ref in parent_cfr_refs:
        flattened_rows.append(make_row(parent_name, short_name, slug_name, ref))

    # Loop over children CFR refs; each child's references are recorded
    # under the parent's name, short_name, and slug
    for child in children:
        for ref in child.get('cfr_references') or []:
            flattened_rows.append(make_row(parent_name, short_name, slug_name, ref))
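
> The cell above files every child reference under its parent's identifiers. If the child's own name is needed later, a variant could keep both. This is a sketch under assumptions: `flatten_with_children`, the `agency_name`, `parent_name`, and `is_child` keys, and the `sample` input are all hypothetical and not part of the notebook's dataset.

```python
def flatten_with_children(agencies):
    """Yield one row per CFR reference, tagging whether it came from a child agency."""
    for agency in agencies:
        parent_name = agency.get("name")
        # Top-level agency references: no parent, is_child is False
        for ref in agency.get("cfr_references") or []:
            yield {"agency_name": parent_name, "parent_name": None, "is_child": False,
                   "title": ref.get("title"), "chapter": ref.get("chapter")}
        # Child references: keep the child's name and remember the parent
        for child in agency.get("children") or []:
            for ref in child.get("cfr_references") or []:
                yield {"agency_name": child.get("name"), "parent_name": parent_name,
                       "is_child": True,
                       "title": ref.get("title"), "chapter": ref.get("chapter")}


# Hand-written stand-in for data['agencies'] (illustrative only)
sample = [
    {
        "name": "Parent A",
        "cfr_references": [{"title": 2, "chapter": "IV"}],
        "children": [
            {"name": "Child A1", "cfr_references": [{"title": 7, "chapter": "I"}]},
        ],
    },
]

rows = list(flatten_with_children(sample))
for row in rows:
    print(row)
```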


In [1]:
import os

import pandas as pd

# Requires `flattened_rows` from the flattening cell above
agencies_df = pd.DataFrame(flattened_rows)

# Preview the result
#agencies_df.head(len(agencies_df))
#print(f"Total CFR references across all agencies and children: {len(agencies_df)}")

# Ensure the archive directory exists
os.makedirs(archive_path, exist_ok=True)

# Define full path with date-stamped filename
filename = os.path.join(archive_path, f"flattened_agencies_list_{today_str}.csv")

# Save data frame to a csv file
agencies_df.to_csv(filename, index=False)
print("Saved data frame to", shorten_path(filename))


In [45]:
agencies_df.head(len(agencies_df))

Unnamed: 0,name,short_name,slug,title,subtitle,chapter,subchapter,part
0,Administrative Conference of the United States,ACUS,administrative-conference-of-the-united-states,1,,III,,
1,Advisory Council on Historic Preservation,ACHP,advisory-council-on-historic-preservation,36,,VIII,,
2,Special Inspector General for Afghanistan Reco...,SIGAR,special-inspector-general-for-afghanistan-reco...,5,,LXXXIII,,
3,African Development Foundation,USADF,african-development-foundation,22,,XV,,
4,African Development Foundation,USADF,african-development-foundation,48,,57,,
...,...,...,...,...,...,...,...,...
482,Department of Veterans Affairs,VA,veterans-affairs-department,38,,I,,
483,Department of Veterans Affairs,VA,veterans-affairs-department,48,,8,,
484,Office of Vice President of the United States,,office-of-vice-president-of-the-united-states,32,,XXVIII,,
485,Water Resources Council,,water-resources-council,18,,VI,,


In [49]:
#####################
#
# EXPERIMENT/PRACTICE
#
#####################
# Skills practiced in this notebook:
# - Processing/flattening JSON
# - Converting to a pandas data frame
# - Building directories and filenames with f-strings so each snapshot gets a unique name
# - Using a lambda function to add a column to a data frame
#
# Design note (not implemented in this cell): a 'grouping_agency' column would unify
# parent and child agencies under a common label for consistent downstream grouping.
# A child agency (is_child == True) would take its parent agency's name; any other
# agency would keep its own name, so a parent-child structure aggregates as one entity.

#group_df = agencies_df[agencies_df['name'] == "Department of Agriculture"]
#group_df.head(len(group_df))

agencies_df['title_title'] = agencies_df.apply(
    lambda x: x['title'] * 2 if x['title'] > 10 else x['title'], 
    axis=1)
agencies_df.head(len(agencies_df))


Unnamed: 0,name,short_name,slug,title,subtitle,chapter,subchapter,part,title_title
0,Administrative Conference of the United States,ACUS,administrative-conference-of-the-united-states,1,,III,,,1
1,Advisory Council on Historic Preservation,ACHP,advisory-council-on-historic-preservation,36,,VIII,,,72
2,Special Inspector General for Afghanistan Reco...,SIGAR,special-inspector-general-for-afghanistan-reco...,5,,LXXXIII,,,5
3,African Development Foundation,USADF,african-development-foundation,22,,XV,,,44
4,African Development Foundation,USADF,african-development-foundation,48,,57,,,96
...,...,...,...,...,...,...,...,...,...
482,Department of Veterans Affairs,VA,veterans-affairs-department,38,,I,,,76
483,Department of Veterans Affairs,VA,veterans-affairs-department,48,,8,,,96
484,Office of Vice President of the United States,,office-of-vice-president-of-the-united-states,32,,XXVIII,,,64
485,Water Resources Council,,water-resources-council,18,,VI,,,36
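
> The `grouping_agency` idea described in the cell comments could be sketched with the same lambda pattern. This assumes the data frame carries `parent_name` and `is_child` columns, which the flattened dataset above does not have; both columns and the three example rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; `parent_name` and `is_child` are assumed columns
df = pd.DataFrame(
    [
        {"name": "Department of Agriculture", "parent_name": None, "is_child": False},
        {"name": "Agricultural Marketing Service",
         "parent_name": "Department of Agriculture", "is_child": True},
        {"name": "Water Resources Council", "parent_name": None, "is_child": False},
    ]
)

# Children take their parent's name; top-level agencies keep their own
df["grouping_agency"] = df.apply(
    lambda row: row["parent_name"] if row["is_child"] else row["name"],
    axis=1,
)
print(df[["name", "grouping_agency"]])
```

> Grouping on `grouping_agency` would then aggregate a parent and its children as one entity.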
