feat(data): centralize sample-dataset metadata in a manifest (#774) #886
SaguaroDev wants to merge 2 commits into
Sample-dataset metadata was duplicated across four places: the long if/elif chain in load_sample, the tests, docs/library/sample_data.md, and the per-file include list in MANIFEST.in. Adding a sample meant editing all four by hand, and they had already drifted (sample_data.md listed 23 of 46 datasets with several typo-d names; MANIFEST.in listed 22, so the sdist shipped incomplete).

Introduce chainladder/utils/data/_manifest.py as the single source of truth (a plain Python dict, no new dependency), and key everything off it:

- load_sample now looks up its Triangle config from the manifest instead of the if/elif chain. Verified behavior-preserving: the resolved origin/development/index/columns/cumulative for all 46 bundled samples is byte-identical before and after.
- New public cl.list_samples() returns a DataFrame of name, index, columns, cumulative, and (optionally) grain + period counts. Doubles as the source for the docs table.
- test_load_sample iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files cannot be mistaken for datasets. Added a both-ways sync assertion (manifest == CSVs on disk) and a test_list_samples test.
- MANIFEST.in collapses the 22 hand-listed includes to one recursive-include chainladder/utils/data *.csv. Verified the built sdist now contains all 46 CSVs (was 22).
- docs/library/sample_data.md is regenerated from cl.list_samples() via scripts/regen_sample_data_docs.py; the table is now complete and accurate.

Closes casact#774.
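For readers who have not opened the diff, the manifest lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the entry names (`example_paid`, `example_incurred`), their field values, and the helper `resolve_sample_config` are hypothetical, and the real `_manifest.py` may be shaped differently.

```python
import copy

# Hypothetical manifest: one entry per bundled CSV, holding the keyword
# arguments that the old if/elif chain used to hard-code per dataset.
SAMPLE_MANIFEST = {
    "example_paid": {
        "origin": "origin",
        "development": "development",
        "index": None,
        "columns": ["values"],
        "cumulative": True,
    },
    "example_incurred": {
        "origin": "accident_year",
        "development": "valuation_year",
        "index": ["line_of_business"],
        "columns": ["incurred", "paid"],
        "cumulative": False,
    },
}


def resolve_sample_config(key: str) -> dict:
    """Return the Triangle config for a sample, raising on unknown keys."""
    name = key.lower()
    if name not in SAMPLE_MANIFEST:
        known = ", ".join(sorted(SAMPLE_MANIFEST))
        raise ValueError(f"Unknown sample {key!r}; known samples: {known}")
    # Deep copy so callers cannot mutate the shared manifest entry.
    return copy.deepcopy(SAMPLE_MANIFEST[name])
```

With a lookup like this, adding a dataset is one new dict entry, and an unknown key fails with a message that lists the valid names instead of falling through a chain of branches.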
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #886 +/- ##
==========================================
- Coverage 87.04% 86.90% -0.15%
==========================================
Files 86 87 +1
Lines 4986 4932 -54
Branches 646 624 -22
==========================================
- Hits 4340 4286 -54
Misses 456 456
Partials 190 190
@SaguaroDev thanks for all this! Will the new utility function list_samples get added to the API reference section of the docs automatically, or do we need to add it manually?
…asact#774)

Addresses review feedback on casact#886 from @henrydingliu:

- Add list_samples to docs/library/api.md autosummary and a matching generated stub so the new utility appears in the API reference. The reference is hand-maintained, not auto-discovered, so this is a manual add. Also dropped a pre-existing duplicate load_sample entry in the same autosummary block.
- Add test_sdist_ships_all_samples: builds a source distribution and asserts every sample CSV is present, guarding against MANIFEST.in drifting out of sync again. Self-skips when the build package or a source checkout is unavailable, so it stays out of the fast suite as Henry suggested.
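The core of an sdist check like the one described above might look like the sketch below. It is not the test added in this commit: the helper names `csv_names_in_sdist` and `missing_from_sdist` are hypothetical, it assumes the sdist is a `.tar.gz` whose members sit under a single `<project>-<version>/` prefix, and it only looks at CSVs directly inside the data directory.

```python
import tarfile
from pathlib import PurePosixPath


def csv_names_in_sdist(sdist_path, data_dir="chainladder/utils/data"):
    """Return the stems of CSV files found directly under data_dir in an sdist."""
    target = PurePosixPath(data_dir)
    names = set()
    with tarfile.open(sdist_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if not member.isfile() or len(parts) < 2:
                continue
            # Drop the leading "<project>-<version>/" directory.
            inner = PurePosixPath(*parts[1:])
            if inner.suffix == ".csv" and inner.parent == target:
                names.add(inner.stem)
    return names


def missing_from_sdist(sdist_path, expected_names):
    """Return the expected sample names that the sdist does not contain."""
    return sorted(set(expected_names) - csv_names_in_sdist(sdist_path))
```

A test built on this would first produce the tarball (for example with the `build` package), then assert that `missing_from_sdist(path, manifest_names)` is empty, and skip itself when the build tooling is not installed.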
It's manual, not automatic. The API reference ( On the sdist: added Pushed 577fc84.
@@ -0,0 +1,6 @@
chainladder.list\_samples
we've been deleting these stubs in #879. can you confirm if creating list_samples_rst is still necessary?
@@ -4,54 +4,55 @@ Below is the list of all datasets that come included with the `chainladder` pack
You can load any dataset with `cl.load_sample(...)` such as `cl.load_sample("abc")`.
actually, now that we have list_samples, would it be cleaner to just change this markdown into a notebook? i.e. just call list_samples in the notebook
Closes #774.
Sample-dataset metadata was duplicated across four places: the `load_sample` if/elif chain, the tests, `docs/library/sample_data.md`, and the per-file `include` list in `MANIFEST.in`. They had already drifted: `sample_data.md` listed 23 of 46 datasets (with names like `frieldand_uspp_auto_steady_state` and `friedland_us_industy_auto_case`), and `MANIFEST.in` listed 22, so the sdist shipped incomplete.

This adds `chainladder/utils/data/_manifest.py` as the single source of truth (a plain Python dict, no new dependency per the issue discussion) and keys everything off it.

`load_sample` now looks up its `Triangle` config from the manifest instead of the if/elif chain. This is behavior-preserving: I captured the resolved `origin`/`development`/`index`/`columns`/`cumulative` for all 46 bundled samples before and after the refactor and they are byte-identical.

`cl.list_samples()` is the new utility @henrydingliu asked for. Per the design confirmed on the issue, it returns a DataFrame of name, index, columns, cumulative, and (when `include_grain=True`, the default) origin/development grain plus period counts. `include_grain=False` skips loading the data for a fast metadata-only listing. It doubles as the source for the docs table.

`MANIFEST.in` collapses the 22 hand-listed `include` lines to one `recursive-include chainladder/utils/data *.csv`. I built the sdist both ways and confirmed it now contains all 46 CSVs (was 22). Wheels were already complete via `pyproject.toml`'s `package-data` wildcard; this fixes the sdist.

Tests
`test_load_sample` now iterates the manifest rather than globbing the data directory, so adding a sample is a one-entry change and stray non-CSV files (`__init__.py`, `_manifest.py`) can't be mistaken for datasets. Added a both-ways sync assertion (manifest set == CSVs on disk) and a `test_list_samples` test.

`docs/library/sample_data.md` is regenerated from `cl.list_samples()` via `scripts/regen_sample_data_docs.py`. The table is now complete and accurate; rerun the script after adding a dataset.

Full test suite: 717 passed, 12 xfailed (pre-existing), 0 failures.
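The both-ways sync assertion mentioned under Tests can be illustrated with a small helper. This is a sketch under assumptions rather than the test in the PR: `manifest_disk_drift` is a hypothetical name, and it assumes each sample is one `<name>.csv` file sitting directly in the data directory.

```python
from pathlib import Path


def manifest_disk_drift(manifest_names, data_dir):
    """Compare manifest keys with CSV stems on disk.

    Returns (missing_on_disk, missing_in_manifest), both sorted.
    """
    declared = set(manifest_names)
    # Only *.csv files count as datasets, so __init__.py and
    # _manifest.py in the same directory are ignored.
    on_disk = {path.stem for path in Path(data_dir).glob("*.csv")}
    return sorted(declared - on_disk), sorted(on_disk - declared)
```

A test would then assert `manifest_disk_drift(names, data_dir) == ([], [])`. Checking both directions matters: one catches a manifest entry with no CSV behind it, the other catches a CSV that was added without a manifest entry.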
Note
Low Risk
Refactor is scoped to sample loading and packaging metadata; behavior is intended to be preserved with stronger sync tests, and no auth or production runtime paths are touched.
Overview
Introduces `chainladder/utils/data/_manifest.py` as the single registry for bundled sample CSV metadata (`origin`, `development`, `index`, `columns`, `cumulative`). `load_sample` now validates keys and builds `Triangle` instances from that dict instead of a long per-dataset `if`/`elif` chain.

Adds `list_samples()` (exported on the public API) to return a manifest-driven catalog, optionally loading each sample to report origin/development grain and period counts. `MANIFEST.in` switches from 22 explicit `include` lines to `recursive-include chainladder/utils/data *.csv` so sdists ship all bundled CSVs.

Tests assert manifest ↔ on-disk CSV sync, cover `list_samples`, and add an optional sdist build check that every sample CSV is packaged. `docs/library/sample_data.md` is regenerated from `list_samples()` via `scripts/regen_sample_data_docs.py`; API docs list the new function.

Reviewed by Cursor Bugbot for commit 577fc84.
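As a rough picture of the docs regeneration step, the sketch below renders catalog rows as a Markdown table. It is an assumption-laden illustration, not `scripts/regen_sample_data_docs.py`: the function `render_markdown_table` is hypothetical, and it takes plain dicts instead of the DataFrame that `list_samples()` is described as returning, to stay dependency-free.

```python
def render_markdown_table(rows, columns):
    """Render a list of dict rows as a Markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        # Missing keys render as empty cells rather than raising.
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])
```

Generating the table from the same catalog that `load_sample` reads is what keeps the docs from drifting: there is no second list of dataset names to maintain by hand.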