# ECFR Scraper: Getting Started

This notebook explains what the ECFR Scraper does and how to use it from the command line and programmatically.

- Package: `ecfr_scraper` (run with `python -m ecfr_scraper`)
- Console script: `ecfr-scraper` (after `pip install -e .`)
- Modules: `ecfr_scraper.scraper`, `ecfr_scraper.metadata`, `ecfr_scraper.utils`, `ecfr_scraper.cli`

## What it does

- Downloads eCFR title XMLs (Titles 1–50) from <https://www.govinfo.gov>.
- Caches downloads by checksum, so files that have not changed are skipped on later runs.
- Optionally parses XML to structured JSON and computes lexical stats.
- Extracts per-file metadata (XML, ZIP, TXT; PDF is a placeholder).
- Shows progress for multi-title downloads.
- Logs to console and `ecfr_scraper.log`.
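
The checksum caching mentioned above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the hash algorithm (SHA-256 here), the helper names `sha256_of`, `is_unchanged`, and `record_checksum`, and the assumption that `checksums.json` maps file names to digests are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, checksum_db: Path) -> bool:
    """True if the file exists and its digest matches the recorded one."""
    if not path.exists() or not checksum_db.exists():
        return False
    recorded = json.loads(checksum_db.read_text(encoding="utf-8"))
    return recorded.get(path.name) == sha256_of(path)


def record_checksum(path: Path, checksum_db: Path) -> None:
    """Store the file's digest under its file name (assumed layout)."""
    recorded = {}
    if checksum_db.exists():
        recorded = json.loads(checksum_db.read_text(encoding="utf-8"))
    recorded[path.name] = sha256_of(path)
    checksum_db.write_text(json.dumps(recorded, indent=2), encoding="utf-8")
```

A downloader built this way would call `is_unchanged` before re-processing a file and `record_checksum` after a successful download.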

## Prerequisites

- Python 3.9 or later
- From the repo root, install dependencies:

```powershell
pip install -r requirements.txt
```

In [None]:
# Optional: install requirements into the current kernel
%pip install -q -r ../requirements.txt

## Command-line usage

You can run the tool as a module or via the console script (after installing in editable mode).

- Show help:

```powershell
python -m ecfr_scraper -h
```

- Download and parse a single title:

```powershell
python -m ecfr_scraper --title 21 --output .\data --verbose
```

- Download all titles in parallel (8 workers shown):

```powershell
python -m ecfr_scraper --all --workers 8 --output .\data
```

- Generate only metadata (skip JSON parsing):

```powershell
python -m ecfr_scraper --title 12 --metadata-only
```

In [None]:
# Run CLI help from the notebook (safe, quick)
import subprocess
import sys

# sys.executable runs the CLI with the same interpreter as this kernel
subprocess.run([sys.executable, "-m", "ecfr_scraper", "-h"], check=False)

## Programmatic usage (Python API)

Use the `ECFRScraper` class to download titles and parse XML to JSON.
This demo keeps it fast: it downloads one title and skips the parse step unless you set `RUN_PARSE = True`.

In [None]:
import os

from ecfr_scraper.scraper import ECFRScraper
from ecfr_scraper.utils import setup_logging

# Configure logging for this session
setup_logging(verbose=True)

# Choose an output directory (the repo root's data folder, one level above this notebook)
out_dir = os.path.join("..", "data")
scraper = ECFRScraper(output_dir=out_dir)

# Demo: download a single title (1)
xml_path = scraper.download_title_xml(1, out_dir)
print("XML path:", xml_path)

# Optional: parse XML to JSON (can take some time).
# Set this to True to run the parse step.
RUN_PARSE = False
if RUN_PARSE and xml_path:
    data = scraper.parse_xml(xml_path)
    if data:
        # Swap only the file extension, not any ".xml" earlier in the path
        json_path = os.path.splitext(xml_path)[0] + ".json"
        scraper.export_to_json(data, json_path)
        print("JSON saved:", json_path)
    else:
        print("Parse failed.")

## Outputs

For each downloaded title N (the checksum and log files are shared across all titles):

- XML: `./data/titleN.xml`
- Parsed JSON: `./data/titleN.json`
- File metadata: `./data/titleN.xml.metadata.json`
- Global checksums: `./checksums.json`
- Logs: `./ecfr_scraper.log`
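
To check which of the per-title files listed above exist after a run, a small helper along these lines may be handy. The function names `expected_outputs` and `missing_outputs` are hypothetical and not part of `ecfr_scraper`; the file names follow the naming scheme in the list, and `data_dir` is assumed to be whatever directory you passed as the output location.

```python
from pathlib import Path


def expected_outputs(title: int, data_dir: str = "data") -> dict:
    """Map each per-title output kind to its expected path."""
    base = Path(data_dir)
    return {
        "xml": base / f"title{title}.xml",
        "json": base / f"title{title}.json",
        "metadata": base / f"title{title}.xml.metadata.json",
    }


def missing_outputs(title: int, data_dir: str = "data") -> list:
    """Return the output kinds whose files are not on disk yet."""
    return [
        kind
        for kind, path in expected_outputs(title, data_dir).items()
        if not path.exists()
    ]
```

For example, `missing_outputs(21, "data")` lists which of the three files for Title 21 have not been written; after a metadata-only run the `"json"` entry would be expected to remain in that list.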

## Advanced usage

- Download multiple titles in parallel using the CLI `--all` and `--workers`.
- Programmatically, you can call `download_all_titles(out_dir, max_workers=8)` and then pass the returned files to `process_downloaded_files(...)` to parse and export JSON.

In [None]:
# Example (commented to avoid long runs):
# files = scraper.download_all_titles(out_dir, max_workers=8)
# results = scraper.process_downloaded_files(files)
# print(results[:2])  # show first few results
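
If you want finer control than `download_all_titles` offers (for example, a custom subset of titles), the worker-pool idea can be sketched with the standard library. This is an illustrative pattern only, not how the package implements `--workers`; `download_many` and its `fetch` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_many(titles, fetch, max_workers=8):
    """Call fetch(title) for each title on a thread pool.

    Returns a dict mapping each title to fetch's return value, or to the
    exception it raised, so one failure does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, title): title for title in titles}
        for future in as_completed(futures):
            title = futures[future]
            try:
                results[title] = future.result()
            except Exception as exc:  # keep going; record the failure
                results[title] = exc
    return results
```

With the `scraper` and `out_dir` from the earlier cell, a call might look like `download_many([1, 12, 21], lambda n: scraper.download_title_xml(n, out_dir))`. Threads suit this workload because downloads spend most of their time waiting on the network.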

## Extending

- Add more metadata handlers in `MetadataExtractor` (e.g., real PDF parsing with `pypdf` or `pdfminer.six`).
- Make `ECFRScraper.parse_xml` namespace-aware, or have it emit a richer structure.
- Enhance retries/backoff for HTTP requests.
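
As a starting point for the retry/backoff item, here is a minimal sketch of exponential backoff around an arbitrary callable. It does not reflect the scraper's current retry logic; `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be stubbed out in tests.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    The last failure is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A download call could then be wrapped as `with_retries(lambda: scraper.download_title_xml(1, out_dir))`, ideally with `retry_on` narrowed to the network errors the HTTP client actually raises. Note that this only helps if the wrapped call raises on failure rather than returning `None`.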