Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature branch: Rename core + output assets to match new naming protocols #2818

Merged
merged 111 commits into from Dec 16, 2023

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Aug 30, 2023

See #2812 for issue description and progress tracking. This will collect naming PRs, which will all need to be merged in together to avoid major headaches for downstream users.

The PR includes renamed assets for:

  • core EIA 923, EIA 860 and harvested EIA tables
  • core FERC Form 1 tables
  • core EIA 861 tables
  • core FERC Form 714 tables
  • coding tables
  • glue/association tables
  • EPA CEMS parquet files and core assets

It also does a few other things:

  • fixes mixed usage of annual and yearly in core assets
  • reorganizes modules to match _core and core structure.

Not converted:

  • entity_types_eia , which is currently not loaded into the PUDL DB.
  • assets from the XBRL explosion (currently just clean_xbrl_metadata_json)

Steps for merging

  1. Rename all of the tables and merge docs PR into this branch
  2. Run a full build with validation for this branch
  3. Once the build passes and this branch is ready to be merged into dev, create a PR for dev -> main.
  4. Create a tag and push a tag called pre-name-change. Question: Should we do a final full software release or just let people install from git tab?
  5. Squash merge this branch into dev
  6. Once the schedule dev nightly build passes, merge dev -> main and tag a release using our desired release naming convention: vYYYY.MM.DD
  7. Help known users migrate!

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer self-assigned this Aug 30, 2023
@e-belfer e-belfer linked an issue Aug 30, 2023 that may be closed by this pull request
@bendnorman
Copy link
Member

Thanks @e-belfer! I feel pretty good about all of the proposed asset names in the naming convention sheet except for the assets currently in the glue group:

  • We decided to use assn for the asset type for all glue/association tables.
  • I'm not sure what the datasource portion of the asset name should be. EIA EPA crosswalk assets come from EPA. Most of the other current glue assets are hand-crafted excel sheets we create so maybe pudl is a reasonable datasource for these assets? Would it be awkward to have a module named pudl.core.pudl to contain these assets?
  • plants_pudl and utilities_pudl feel more like entity tables assets because they only contain pudl ids and names. If we follow the naming convention the new names would be core_pudl__entity_plants_pudl and core_pudl__entity_utilities_pudl. These names are pretty similar to core_eia__entity_utilities and core_eia__entity_plants. Do you think this will confuse users? Is there another way we can name these?

@e-belfer
Copy link
Member Author

e-belfer commented Sep 1, 2023

* I'm not sure what the `datasource` portion of the asset name should be. EIA EPA crosswalk assets come from EPA. Most of the other current glue assets are hand-crafted excel sheets we create so maybe `pudl` is a reasonable datasource for these assets? Would it be awkward to have a module named `pudl.core.pudl` to contain these assets?

I think the alternative to pudl.core.pudl would be using a name like "manual" or "hand" but it wouldn't well characterize all the machine learning outputs that will likely come to live here. Alternately we could use "cat" to point to Catalyst as the origin of this data, in the event that we start integrating glue tables from other non-agency sources.

* `plants_pudl` and `utilities_pudl` feel more like entity tables assets because they only contain pudl ids and names. If we follow the naming convention the new names would be `core_pudl__entity_plants_pudl` and `core_pudl__entity_utilities_pudl`. These names are pretty similar to `core_eia__entity_utilities` and `core_eia__entity_plants`. Do you think this will confuse users? Is there another way we can name these?

Is there a reason to keep the _pudl other than legacy compatibility? I think they'd be clear mirrors of the EIA tables but containing PUDL-only info in this case.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@e-belfer
Copy link
Member Author

Updated decisions: keep "pudl" as datasource, and leave duplicated PUDL as is.

@e-belfer e-belfer marked this pull request as ready for review September 14, 2023 11:20
@codecov
Copy link

codecov bot commented Sep 14, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (98492e6) 92.7% compared to head (9f3d293) 92.6%.
Report is 12 commits behind head on dev.

Files Patch % Lines
src/pudl/analysis/eia_ferc1_train.py 84.2% 3 Missing ⚠️
src/pudl/etl/cli.py 33.3% 2 Missing ⚠️
src/pudl/metadata/classes.py 90.0% 1 Missing ⚠️
src/pudl/output/pudltabl.py 80.0% 1 Missing ⚠️
src/pudl/transform/eia860.py 66.7% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2818     +/-   ##
=======================================
- Coverage   92.7%   92.6%   -0.0%     
=======================================
  Files        134     134             
  Lines      12597   12611     +14     
=======================================
+ Hits       11672   11682     +10     
- Misses       925     929      +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@bendnorman bendnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just a couple of assets in notebooks that might need to be renamed.

I'm thinking we should use this as the main feature branch for the core/output rename. What do you think @e-belfer?

Comment on lines 111 to 113
" utilities_entity_eia: pd.DataFrame,\n",
" utilities_eia860: pd.DataFrame,\n",
" core_eia860__scd_utilities: pd.DataFrame,\n",
" utilities_eia: pd.DataFrame,\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should utilities_entity_eia be core_eia__entity_utilities and utilities_eia be core_pudl__assn_utilities_eia?

notebooks/work-in-progress/better-heatrates.ipynb Outdated Show resolved Hide resolved
@bendnorman
Copy link
Member

Hmm I'll take a look and update this branch with the changes from dev again.

@bendnorman
Copy link
Member

It looks like retained_earnings_appropriations_ferc1 is a "disabled" FERC table. Should we rename it or just remove it from the metadata?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been deleted from dev. Do we know why it's showing up here now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is strange. I'll look into it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been deleted on dev

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been deleted on dev

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing inside this module is going to be called MCOE maybe we should change the name of the module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! I decided to punt updating references to MCOE everwhere because it's a large change. #2910

@bendnorman bendnorman marked this pull request as draft December 14, 2023 23:23
@bendnorman bendnorman marked this pull request as ready for review December 15, 2023 17:41
@bendnorman
Copy link
Member

The CI and full build passed... should we merge this thing in?

@zaneselvans
Copy link
Member

I think merging it in and then using it ourselves for a while to iron out the kinks before we cut a new release with the new names is probably a good idea. I'm sure there will be issues and inconsistencies that we only discover when using it day-to-day.

@bendnorman
Copy link
Member

Should we point datasette at v2023.12.01 data so datasette continues to have the old names until we do a full release with the new names?

@zaneselvans
Copy link
Member

You mean just stop updating it?

@zaneselvans
Copy link
Member

zaneselvans commented Dec 15, 2023

I dunno. The docs say that Datasette reflects the nightly builds, so we would need to change that. And the latest tag on Read The Docs points at dev so by default the documentation that anyone sees after the merge will reflect the new names.

If we can get on an actual monthly release schedule, then having the public Kaggle and Datasette outputs track the stable release rather than nightly doesn't sound like a bad idea.

@bendnorman
Copy link
Member

Having datasette and docs point at stable seems like the ideal setup. I guess we should leave it on dev for now.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure we will find things to fix and improve as we use it, but I think that's only going to happen through actual usage.

@bendnorman bendnorman merged commit b8fa2b5 into dev Dec 16, 2023
13 checks passed
@bendnorman bendnorman deleted the rename-core-assets branch December 16, 2023 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Apply naming convention to core assets Apply naming convention to output assets
4 participants