feat(data): centralize sample-dataset metadata in a manifest (#774) by SaguaroDev · Pull Request #886 · casact/chainladder-python

SaguaroDev · 2026-05-31T00:03:50Z

Closes #774.

Sample-dataset metadata was duplicated across four places: the load_sample if/elif chain, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. They had already drifted: sample_data.md listed 23 of 46 datasets (with names like frieldand_uspp_auto_steady_state and friedland_us_industy_auto_case), and MANIFEST.in listed 22, so the sdist shipped incomplete.

This adds chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency per the issue discussion) and keys everything off it.

load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. This is behavior-preserving: I captured the resolved origin/development/index/columns/cumulative for all 46 bundled samples before and after the refactor and they are byte-identical.

cl.list_samples() is the new utility @henrydingliu asked for. Per the design confirmed on the issue, it returns a DataFrame of name, index, columns, cumulative, and (when include_grain=True, the default) origin/development grain plus period counts. include_grain=False skips loading the data for a fast metadata-only listing. It doubles as the source for the docs table.

MANIFEST.in collapses the 22 hand-listed include lines to one recursive-include chainladder/utils/data *.csv. I built the sdist both ways and confirmed it now contains all 46 CSVs (was 22). Wheels were already complete via pyproject.toml's package-data wildcard; this fixes the sdist.

Tests test_load_sample now iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files (__init__.py, _manifest.py) can't be mistaken for datasets. Added a both-ways sync assertion (manifest set == CSVs on disk) and a test_list_samples test.

docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py. The table is now complete and accurate; rerun the script after adding a dataset.

Full test suite: 717 passed, 12 xfailed (pre-existing), 0 failures.

Note

Low Risk
Refactor is scoped to sample loading and packaging metadata; behavior is intended to be preserved with stronger sync tests, and no auth or production runtime paths are touched.

Overview
Introduces chainladder/utils/data/_manifest.py as the single registry for bundled sample CSV metadata (origin, development, index, columns, cumulative). load_sample now validates keys and builds Triangle instances from that dict instead of a long per-dataset if/elif chain.

Adds list_samples() (exported on the public API) to return a manifest-driven catalog, optionally loading each sample to report origin/development grain and period counts. MANIFEST.in switches from 22 explicit include lines to recursive-include chainladder/utils/data *.csv so sdists ship all bundled CSVs.

Tests assert manifest ↔ on-disk CSV sync, cover list_samples, and add an optional sdist build check that every sample CSV is packaged. docs/library/sample_data.md is regenerated from list_samples() via scripts/regen_sample_data_docs.py; API docs list the new function.

^{Reviewed by Cursor Bugbot for commit 577fc84. Bugbot is set up for automated code reviews on this repo. Configure here.}

) Sample-dataset metadata was duplicated across four places: the long if/elif chain in load_sample, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. Adding a sample meant editing all four by hand, and they had already drifted (sample_data.md listed 23 of 46 datasets with several typo-d names; MANIFEST.in listed 22, so the sdist shipped incomplete). Introduce chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency), and key everything off it: - load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. Verified behavior-preserving: the resolved origin/development/index/columns/cumulative for all 46 bundled samples is byte-identical before and after. - New public cl.list_samples() returns a DataFrame of name, index, columns, cumulative, and (optionally) grain + period counts. Doubles as the source for the docs table. - test_load_sample iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files cannot be mistaken for datasets. Added a both-ways sync assertion (manifest == CSVs on disk) and a test_list_samples test. - MANIFEST.in collapses the 22 hand-listed includes to one recursive-include chainladder/utils/data *.csv. Verified the built sdist now contains all 46 CSVs (was 22). - docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py; the table is now complete and accurate. Closes casact#774.

codecov · 2026-05-31T00:08:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.90%. Comparing base (449b5c1) to head (577fc84).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
- Coverage   87.04%   86.90%   -0.15%     
==========================================
  Files          86       87       +1     
  Lines        4986     4932      -54     
  Branches      646      624      -22     
==========================================
- Hits         4340     4286      -54     
  Misses        456      456              
  Partials      190      190

Flag	Coverage Δ
unittests	`86.90% <100.00%> (-0.15%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

henrydingliu · 2026-05-31T00:26:26Z

@SaguaroDev thanks for all this! will the new utility function list_sample get added to the api reference section of the doc automatically? or do we need to manually add it?

@henrydingliu

…asact#774) Addresses review feedback on casact#886 from @henrydingliu: - Add list_samples to docs/library/api.md autosummary and a matching generated stub so the new utility appears in the API reference. The reference is hand-maintained, not auto-discovered, so this is a manual add. Also dropped a pre-existing duplicate load_sample entry in the same autosummary block. - Add test_sdist_ships_all_samples: builds a source distribution and asserts every sample CSV is present, guarding against MANIFEST.in drifting out of sync again. Self-skips when the build package or a source checkout is unavailable, so it stays out of the fast suite as Henry suggested.

SaguaroDev · 2026-05-31T03:01:00Z

It's manual, not automatic. The API reference (docs/library/api.md) hand-lists each function in an autosummary block, so I added list_samples there plus a matching generated stub, and dropped a duplicate load_sample entry that was already in that block.

On the sdist: added test_sdist_ships_all_samples, which builds a source distribution and asserts every sample CSV is present. It self-skips when the build package or a source checkout isn't available, so it proves completeness in CI without weighing down the fast suite.

Pushed 577fc84.

henrydingliu · 2026-05-31T04:22:47Z

@@ -0,0 +1,6 @@
+chainladder.list\_samples


we've been deleting these stubs in #879. can you confirm if creating list_samples_rst is still necessary?

henrydingliu · 2026-05-31T04:32:30Z

@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

 You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.


actually, now that we have list_samples, would it be cleaner to just change this markdown into a notebook? i.e. just call list_samples in the notebook

SaguaroDev requested review from genedan, henrydingliu, jbogaardt and kennethshsu as code owners May 31, 2026 00:03

henrydingliu reviewed May 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(data): centralize sample-dataset metadata in a manifest (#774)#886

feat(data): centralize sample-dataset metadata in a manifest (#774)#886
SaguaroDev wants to merge 2 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata

SaguaroDev commented May 31, 2026 •

edited by cursor Bot

Loading

Uh oh!

codecov Bot commented May 31, 2026 •

edited

Loading

Uh oh!

henrydingliu commented May 31, 2026

Uh oh!

SaguaroDev commented May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

		You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.

Conversation

SaguaroDev commented May 31, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

henrydingliu commented May 31, 2026

Uh oh!

SaguaroDev commented May 31, 2026

Uh oh!

henrydingliu May 31, 2026

Choose a reason for hiding this comment

Uh oh!

henrydingliu May 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SaguaroDev commented May 31, 2026 •

edited by cursor Bot

Loading

codecov Bot commented May 31, 2026 •

edited

Loading