Skip to content

feat(data): centralize sample-dataset metadata in a manifest (#774)#886

Open
SaguaroDev wants to merge 2 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata
Open

feat(data): centralize sample-dataset metadata in a manifest (#774)#886
SaguaroDev wants to merge 2 commits into
casact:mainfrom
SaguaroDev:774-centralize-sample-metadata

Conversation

@SaguaroDev
Copy link
Copy Markdown
Contributor

@SaguaroDev SaguaroDev commented May 31, 2026

Closes #774.

Sample-dataset metadata was duplicated across four places: the load_sample if/elif chain, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. They had already drifted: sample_data.md listed 23 of 46 datasets (with names like frieldand_uspp_auto_steady_state and friedland_us_industy_auto_case), and MANIFEST.in listed 22, so the sdist shipped incomplete.

This adds chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency per the issue discussion) and keys everything off it.

load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. This is behavior-preserving: I captured the resolved origin/development/index/columns/cumulative for all 46 bundled samples before and after the refactor and they are byte-identical.

cl.list_samples() is the new utility @henrydingliu asked for. Per the design confirmed on the issue, it returns a DataFrame of name, index, columns, cumulative, and (when include_grain=True, the default) origin/development grain plus period counts. include_grain=False skips loading the data for a fast metadata-only listing. It doubles as the source for the docs table.

MANIFEST.in collapses the 22 hand-listed include lines to one recursive-include chainladder/utils/data *.csv. I built the sdist both ways and confirmed it now contains all 46 CSVs (was 22). Wheels were already complete via pyproject.toml's package-data wildcard; this fixes the sdist.

Tests test_load_sample now iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files (__init__.py, _manifest.py) can't be mistaken for datasets. Added a both-ways sync assertion (manifest set == CSVs on disk) and a test_list_samples test.

docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py. The table is now complete and accurate; rerun the script after adding a dataset.

Full test suite: 717 passed, 12 xfailed (pre-existing), 0 failures.


Note

Low Risk
Refactor is scoped to sample loading and packaging metadata; behavior is intended to be preserved with stronger sync tests, and no auth or production runtime paths are touched.

Overview
Introduces chainladder/utils/data/_manifest.py as the single registry for bundled sample CSV metadata (origin, development, index, columns, cumulative). load_sample now validates keys and builds Triangle instances from that dict instead of a long per-dataset if/elif chain.

Adds list_samples() (exported on the public API) to return a manifest-driven catalog, optionally loading each sample to report origin/development grain and period counts. MANIFEST.in switches from 22 explicit include lines to recursive-include chainladder/utils/data *.csv so sdists ship all bundled CSVs.

Tests assert manifest ↔ on-disk CSV sync, cover list_samples, and add an optional sdist build check that every sample CSV is packaged. docs/library/sample_data.md is regenerated from list_samples() via scripts/regen_sample_data_docs.py; API docs list the new function.

Reviewed by Cursor Bugbot for commit 577fc84. Bugbot is set up for automated code reviews on this repo. Configure here.

)

Sample-dataset metadata was duplicated across four places: the long
if/elif chain in load_sample, the tests, docs/library/sample_data.md,
and the per-file include list in MANIFEST.in. Adding a sample meant
editing all four by hand, and they had already drifted (sample_data.md
listed 23 of 46 datasets with several typo-d names; MANIFEST.in listed
22, so the sdist shipped incomplete).

Introduce chainladder/utils/data/_manifest.py as the single source of
truth (a plain Python dict, no new dependency), and key everything off
it:

- load_sample now looks up its Triangle config from the manifest instead
  of the if/elif chain. Verified behavior-preserving: the resolved
  origin/development/index/columns/cumulative for all 46 bundled samples
  is byte-identical before and after.
- New public cl.list_samples() returns a DataFrame of name, index,
  columns, cumulative, and (optionally) grain + period counts. Doubles
  as the source for the docs table.
- test_load_sample iterates the manifest rather than globbing the data
  directory, so adding a sample is a one-entry change and stray non-CSV
  files cannot be mistaken for datasets. Added a both-ways sync assertion
  (manifest == CSVs on disk) and a test_list_samples test.
- MANIFEST.in collapses the 22 hand-listed includes to one
  recursive-include chainladder/utils/data *.csv. Verified the built
  sdist now contains all 46 CSVs (was 22).
- docs/library/sample_data.md is regenerated from cl.list_samples() via
  scripts/regen_sample_data_docs.py; the table is now complete and
  accurate.

Closes casact#774.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.90%. Comparing base (449b5c1) to head (577fc84).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
- Coverage   87.04%   86.90%   -0.15%     
==========================================
  Files          86       87       +1     
  Lines        4986     4932      -54     
  Branches      646      624      -22     
==========================================
- Hits         4340     4286      -54     
  Misses        456      456              
  Partials      190      190              
Flag Coverage Δ
unittests 86.90% <100.00%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@henrydingliu
Copy link
Copy Markdown
Collaborator

@SaguaroDev thanks for all this! will the new utility function list_sample get added to the api reference section of the doc automatically? or do we need to manually add it?

…asact#774)

Addresses review feedback on casact#886 from @henrydingliu:

- Add list_samples to docs/library/api.md autosummary and a matching
  generated stub so the new utility appears in the API reference. The
  reference is hand-maintained, not auto-discovered, so this is a manual
  add. Also dropped a pre-existing duplicate load_sample entry in the
  same autosummary block.
- Add test_sdist_ships_all_samples: builds a source distribution and
  asserts every sample CSV is present, guarding against MANIFEST.in
  drifting out of sync again. Self-skips when the build package or a
  source checkout is unavailable, so it stays out of the fast suite as
  Henry suggested.
@SaguaroDev
Copy link
Copy Markdown
Contributor Author

It's manual, not automatic. The API reference (docs/library/api.md) hand-lists each function in an autosummary block, so I added list_samples there plus a matching generated stub, and dropped a duplicate load_sample entry that was already in that block.

On the sdist: added test_sdist_ships_all_samples, which builds a source distribution and asserts every sample CSV is present. It self-skips when the build package or a source checkout isn't available, so it proves completeness in CI without weighing down the fast suite.

Pushed 577fc84.

@@ -0,0 +1,6 @@
chainladder.list\_samples
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've been deleting these stubs in #879. can you confirm if creating list_samples_rst is still necessary?

@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack

You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, now that we have list_samples, would it be cleaner to just change this markdown into a notebook? i.e. just call list_samples in the notebook

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Centralize sample-dataset metadata (load_sample / tests / docs / MANIFEST.in)

2 participants