| license | agpl-3.0 |
|---|---|
| language | |
| library_name | doc2dict |
| tags | |

Immediate goal: slice each agreement's HTML into clauses with hierarchy — text + nesting depth — so that concatenating the slices in document order reconstructs the document faithfully.
That single criterion drives parser quality. Everything else (canonical schema, subdocument detection, classification taxonomies) is built on top of a parser that meets the bar. If concat-of-spans doesn't reproduce the source, the parser gets fixed before anything downstream gets built.
A sequence of parser scripts each emit a JSONL where one line = one parsed clause:

```json
{"idx": 4, "level": 2, "span": "INDEMNIFICATION AGREEMENT\nTHIS INDEMNIFICATION AGREEMENT (the \"Agreement\")..."}
```

- `idx`: corpus row index.
- `level`: the parser's native nesting depth (doc2dict 0-indexed; lexnlp 1-indexed; intentionally not normalized so each parser's view is preserved).
- `span`: heading + body. Concatenating all `span` values for one `idx` in JSONL order should approximate the source document.

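
The reconstruction described above can be sketched as follows. This is a minimal illustration, not the repository's own code: the function name `reconstruct` and the newline joiner are assumptions, and only the `idx` and `span` fields come from the record format shown here.

```python
import json
from collections import defaultdict
from typing import Iterable


def reconstruct(jsonl_lines: Iterable[str], joiner: str = "\n") -> dict[int, str]:
    """Concatenate `span` values per `idx`, preserving JSONL order.

    The joiner is an assumption; the real scripts may join spans differently.
    """
    spans: dict[int, list[str]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        spans[record["idx"]].append(record["span"])
    return {idx: joiner.join(parts) for idx, parts in spans.items()}


# Toy records (made up for illustration, not taken from the corpus).
sample = [
    '{"idx": 4, "level": 0, "span": "INDEMNIFICATION AGREEMENT"}',
    '{"idx": 4, "level": 1, "span": "1. Definitions."}',
    '{"idx": 5, "level": 0, "span": "EMPLOYMENT AGREEMENT"}',
]
docs = reconstruct(sample)
print(docs[4])
```

Grouping by `idx` while keeping arrival order means the sketch never sorts by `level`, which matches the requirement that spans are concatenated in JSONL order.
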
The source-of-truth dump (parse_source_of_truth.py) is the unparsed reference per doc; measure_reconstruction.py produces a parquet with per-doc word coverage and char ratio per parser, so disagreement and content loss are visible per row.
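
The exact metric definitions live in `measure_reconstruction.py` and are not restated here, so the following is only one plausible reading: word coverage as the fraction of source words (counted as a multiset) that also appear in the reconstruction, and char ratio as reconstructed length over source length. The function names and the `\w+` tokenizer are assumptions.

```python
import re
from collections import Counter


def _words(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def word_coverage(source: str, reconstructed: str) -> float:
    """Fraction of source words recovered, counting repeated words separately."""
    src = Counter(_words(source))
    rec = Counter(_words(reconstructed))
    total = sum(src.values())
    if total == 0:
        return 1.0  # nothing to recover
    covered = sum((src & rec).values())  # multiset intersection
    return covered / total


def char_ratio(source: str, reconstructed: str) -> float:
    """Reconstructed length relative to source length."""
    if not source:
        return 1.0
    return len(reconstructed) / len(source)


source = "The Company shall indemnify the Director"
rebuilt = "The Company shall"
print(word_coverage(source, rebuilt), char_ratio(source, rebuilt))
```

Using a multiset rather than a set keeps a parser from getting full credit for a word it recovered once when the source used it many times.
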
| parser | mean word coverage | range | what's missing |
|---|---|---|---|
| doc2dict baseline | 91.5% | 88.8–95.7% | tables + mixed-content children dropped by _collect_direct_text |
| doc2dict + agreement_config | 91.5% | 88.8–95.7% | same body extraction; only header typing differs |
| lexnlp (regex) | 97.6% | 94.1–98.7% | closest to source — minor whitespace artifacts |
lexnlp consistently reconstructs the source near-completely. doc2dict drops roughly 4–11% of content per doc (8.5% on average, per the table above); the gap comes from `_collect_direct_text` not capturing every text leaf (the `_is_text_leaf` heuristic skips tables and mixed-content children). Closing that gap is the immediate parser-quality work.
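
To make the failure mode concrete, here is a small illustration of how collecting only an element's direct text loses mixed-content children and table cells. It does not reproduce doc2dict's `_collect_direct_text` or `_is_text_leaf`, whose internals are not shown in this README; it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: a paragraph with an inline child and a nested table.
snippet = (
    "<div>The Company shall <b>indemnify</b> the Director."
    "<table><tr><td>Cap</td><td>$1,000,000</td></tr></table></div>"
)
root = ET.fromstring(snippet)

# Direct text only: stops at the first child element.
direct_only = (root.text or "").strip()

# Full walk: every text node in document order, including tails and table cells.
full = " ".join(t.strip() for t in root.itertext() if t.strip())

print(repr(direct_only))
print(repr(full))
print(len(direct_only.split()), "of", len(full.split()), "words kept")
```

In this toy case the direct-text view keeps 3 of 8 words; a collector that walks every text node recovers the inline emphasis, the trailing sentence, and the table cells.
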

```
scripts/
  parse_source_of_truth.py        reference baseline: bs4 plain text + full HTML per doc
  parse_doc2dict_baseline.py      doc2dict with no mapping_dict
  parse_doc2dict_with_config.py   doc2dict with the validated EX-10 levels regex
  parse_lexnlp.py                 lexnlp regex section detector (no overrides)
  measure_reconstruction.py       per-doc word_coverage + char_ratio per parser
  compare.py                      side-by-side dumps + body-overlap summary
src/clause_extract/
  canonical_id_parser.py          100% SOT-validated; clause-ID parsing primitive
  agreement_config.py             doc2dict mapping_dict for EX-10 (lexnlp-informed)
  lexnlp_sections_regex.py        AGPLv3 vendored from arthrod/lexpredict-lexnlp
```

The four parser scripts above and measure_reconstruction.py are the immediate concern. Canonicalization, subdocument detection, and the HF dataset push (described in TASKS.md) come after each parser meets the reconstruction bar.
Status of locked artifacts (still valid): the canonical-ID parser is 100% validated against the 973-clause source-of-truth ledger. The subdocument detector v1 design is documented in `docs/DETECTOR.md` and validates at ~75% precision / 90% recall on a hand-verified 100-doc sample.

```bash
git clone <this-repo>
cd clause-extract
uv sync        # creates .venv with all deps including dev
uv run pytest  # run tests (some require HF_TOKEN env var)
```

```bash
export HF_TOKEN=<your-huggingface-token>  # for SOT round-trip + corpus runs
```

```bash
# Phase 0 (validation): canonical-ID parser round-trips the SOT ledger
uv run pytest -m sot tests/test_canonical_id_parser_sot.py

# Phase 1 (parser quality): produce JSONLs and measure reconstruction
HF_TOKEN=hf_xxx uv run scripts/parse_source_of_truth.py --output-dir data/runs/source_of_truth
HF_TOKEN=hf_xxx uv run scripts/parse_doc2dict_baseline.py --output-dir data/runs/doc2dict_baseline --no-truncate
HF_TOKEN=hf_xxx uv run scripts/parse_doc2dict_with_config.py --output-dir data/runs/doc2dict_with_config --no-truncate
HF_TOKEN=hf_xxx uv run scripts/parse_lexnlp.py --output-dir data/runs/lexnlp_baseline --no-truncate
uv run scripts/measure_reconstruction.py \
  --source-of-truth-dir data/runs/source_of_truth \
  --d2d-baseline-dir data/runs/doc2dict_baseline \
  --d2d-config-dir data/runs/doc2dict_with_config \
  --lex-dir data/runs/lexnlp_baseline \
  --output-dir data/runs/reconstruction
```

If you're picking up implementation: read `TASKS.md`. Phase 0 is environment setup; Phase 1 is parser reconstruction quality, that is, pushing every parser's mean word coverage above the agreed bar (default ≥95%). Subsequent phases (canonicalization, schema, HF dataset push) build on a parser that meets the bar.

1. Reconstruction-quality bar: every parser's mean word coverage on the corpus is ≥95% (measured via `measure_reconstruction.py`), with no doc below 80%. Truncation off (`--no-truncate`) for the gate run.
2. `canonical_id_parser` round-trips 973/973 SOT clauses: every clause_id in the human ledger parses, reconstructs to the same string, and its derived parent is either `None` (root) or another clause_id present in the same ledger.
3. Subdocument detector matches hand-verified set on 100-doc sample at ≥80% precision and ≥85% recall. The hand-verified ground truth is in `docs/DETECTOR.md`.
4. End-to-end run on full corpus completes without errors and writes a valid HF dataset.

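
The numeric gates above (mean coverage ≥95% with no doc below 80%; detector precision ≥80% and recall ≥85%) can be expressed as simple threshold checks. The sketch below is illustrative only: the function names are hypothetical, coverages are assumed to be fractions in [0, 1], and detector output is assumed to be comparable as sets of identifiers, which this README does not specify.

```python
from statistics import mean
from typing import Iterable


def meets_reconstruction_bar(
    coverages: Iterable[float], mean_bar: float = 0.95, floor: float = 0.80
) -> bool:
    """Mean coverage must reach mean_bar and no single doc may fall below floor."""
    values = list(coverages)
    return bool(values) and mean(values) >= mean_bar and min(values) >= floor


def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    """Set-based precision and recall; empty denominators yield 0.0."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def meets_detector_bar(
    predicted: set, truth: set, min_precision: float = 0.80, min_recall: float = 0.85
) -> bool:
    precision, recall = precision_recall(predicted, truth)
    return precision >= min_precision and recall >= min_recall


# Toy inputs, not measured results.
print(meets_reconstruction_bar([0.96, 0.97, 0.94]))
print(meets_reconstruction_bar([1.0, 1.0, 1.0, 1.0, 0.79]))
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5}))
```

Checking the per-doc floor separately from the mean matters: the second toy input has a high mean but still fails because one document falls below 80%.
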
| File | Read when |
|---|---|
| `README.md` | Now: landing page |
| `TASKS.md` | First if you're Claude Code: the implementation work plan |
| `docs/GOAL.md` | Goal framing: slice-and-reconstruct as the immediate target, plus the eventual statistical use the corpus serves |
| `docs/SCHEMA.md` | Before implementing the canonicalizer: canonical record fields, types, derivation rules |
| `docs/DECISIONS.md` | Before challenging a design choice: locked decisions with rationale and what would change our mind |
| `docs/DETECTOR.md` | Before touching subdocument detection: algorithm, validation results, known false-positive/negative patterns |
| `docs/HANDOFF.md` | What's locked, what's open, validation gates, where artifacts live |

AGPL-3.0-or-later. The lexnlp_sections_regex.py module is a vendored port of regex-only code from arthrod/lexpredict-lexnlp, which is itself AGPLv3.
For the analytical scope (computing statistics by running this software internally on a private corpus), AGPL is not restrictive — the outputs are statistics, not licensed software. See docs/DECISIONS.md §AGPL stance for the full reasoning and the boundary that matters when this work eventually feeds Cicero.

- `arthrod/new3_results_master22017_274.59mb`: corpus (1,066 EX-10s, HF private)
- `arthrod/clause-prob-source-of-truth`: manually-curated 25-doc ledger (HF private, ground truth for validation gate 2)
- `arthrod/clause-extract-inspection`: environment dump for reviewing session work (HF private)
- `arthrod/lexpredict-lexnlp`: fork of LexPredict's lexnlp, source of the regex section patterns