Apply new naming convention to PUDL assets #2765

jdangerx · 2023-07-31T20:29:51Z

Pending approval of the design doc, we should actually apply the naming convention.

The scope below is a placeholder, to be fleshed out once the design doc is complete.

Scope

Apply naming convention to raw and intermediate core assets #2777 (7 of 7 tasks)
Apply naming convention to output assets #2788 (9 of 9 tasks)
Create documentation that explains new naming convention to users and contributors #2868 (5 of 5 tasks)
Add warnings to PudlTabl
Apply naming convention to core assets #2812 (128 of 128 tasks; labels: metadata)
Run a full nightly build before merging into dev
Double check everything has been renamed... see Apply naming convention to straggler assets #3052
Iron out association asset names #3030 (13 of 13 tasks)
Update table names in kaggle notebooks #3174
Eia923 2022 final release q4 update Nov 21 #3073 (labels: eia923, new-data, rmi)
Clean up devtools notebooks to use new names #3173
Remove _out tables from datasette and sort tables #3223 (2 of 2 tasks; labels: datasette, output)
bendnorman · 2023-08-09T14:46:53Z

Hey @jdangerx! I was thinking about breaking up the work into two PRs off of dev:

Rename the raw and intermediate core assets
Rename the user-facing tables and documentation

I could also rename the user-facing tables in separate PRs to avoid a monster PR, though the table names on datasette and S3 would follow different naming conventions during the conversion. This might be fine given the tables already follow different conventions! I could start with the output tables, because we have a disclaimer that the current names are temporary, and finish with the core tables, which most users likely rely on. What do you think?

Also, at what point do you think we should whittle down the number of output assets persisted to the database? Maybe before we do an official data release, so users don't get attached to a table only to have it disappear down the line.

jdangerx · 2023-08-09T21:54:12Z

I think splitting this into:

raw_, _core
core
_output, output (I guess you could put the _output as a separate PR earlier in the process, since it's also hidden from users)

Makes sense to me! I'm OK with having discrepancies on datasette vs. S3 since we're pretty clear about the nightly builds being unstable.

I like doing the non-user-facing stuff first as practice for the user-facing stuff - we can see what breaks when we try to publish the data with the new names, before it breaks stuff for more important tables...

I think it's a good idea to move as much stuff from output into _output as possible, before we make output tables accessible to the public. Putting something in output is basically a promise to users that it will stay there - we can always "promote" _output tables to output. Same for core, really, though I think we have less leeway there.
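To make the prefix scheme in this comment concrete, here is a minimal Python sketch that sorts an asset name into a layer and reports whether that layer is meant to be user-facing. The prefix strings are copied from the list above and may not match the final convention; `LAYERS` and `classify_asset` are hypothetical names for illustration, not part of PUDL.

```python
# Hypothetical helper, not part of PUDL. The prefixes follow the comment
# above (raw_, _core, core, _output, output) and may differ from the
# convention that is finally adopted.
LAYERS = [
    # (prefix, layer name, user-facing?)
    ("raw_", "raw", False),
    ("_core", "intermediate core", False),
    ("core", "core", True),
    ("_output", "intermediate output", False),
    ("output", "output", True),
]


def classify_asset(name: str) -> tuple[str, bool]:
    """Return (layer, user_facing) for an asset name, based on its prefix."""
    for prefix, layer, user_facing in LAYERS:
        if name.startswith(prefix):
            return layer, user_facing
    raise ValueError(f"{name!r} does not match any known prefix")
```

Under these assumptions, "promoting" a table from _output to output amounts to dropping the leading underscore, which flips it from hidden to user-facing.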

bendnorman · 2023-08-15T23:17:35Z

Sounds good!

Based on feedback in #2503, people would prefer the changes to happen once. Here is a more detailed plan of how we can roll out these changes:

1. Rename raw and intermediate core assets and merge into dev. None of these assets are persisted to the database, so this will not impact users.
2. Rename output and core assets in a feature branch. This step will include converting some output assets that don't need to be user-facing into intermediate assets that aren't persisted in the database; we don't want to release a bunch of new tables in a tagged version and later remove them. This branch will also add a deprecation warning to PudlTabl. s3://intake.catalyst.coop should be renamed to s3://pudl.catalyst.coop (see Rename AWS bucket #2574) prior to step 3.
3. Before merging the feature branch into dev, merge dev into main, tag a pre-naming-change version, and run a full build. Users can rely on this tagged version as they migrate to the new table names and stop using PudlTabl.
4. Merge the feature branch into dev, merge dev into main, and tag a release using our desired release naming convention: vYYYY.MM.DD.
5. Help users migrate from depending on PudlTabl, old table names, and pinning to dev to relying on the new tables in the tagged pudl.sqlite database in the public S3 bucket.
6. Once core users are migrated, deprecate PudlTabl.
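As a rough illustration of the deprecation warning mentioned in the plan, here is a minimal Python sketch using the standard `warnings` module. `LegacyTableAccessor` is a hypothetical stand-in; this is not how PudlTabl is actually implemented, and the message text is an assumption.

```python
import warnings


class LegacyTableAccessor:
    """Hypothetical stand-in for a class like PudlTabl; not the real implementation."""

    def __init__(self) -> None:
        # stacklevel=2 points the warning at the caller's line rather than
        # at this constructor, so users can see where they need to migrate.
        warnings.warn(
            "LegacyTableAccessor is deprecated and will be removed in a future "
            "release; read tables directly from the database instead.",
            FutureWarning,
            stacklevel=2,
        )
```

The sketch uses `FutureWarning` rather than `DeprecationWarning` because Python's default filters hide most `DeprecationWarning`s that are not triggered from `__main__`, while `FutureWarning` is shown to end users by default.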

How does this plan sound to y'all? @zaneselvans @arengel @grgmiller @gschivley?

bendnorman · 2023-12-04T20:25:41Z

Ok! I think we're going to merge this thing in! Here is a todo list for rolling this out:

Release 2022 data with old names

Open a PR merging the latest commits to pass the nightly builds on dev into main.
The commits are in main. The 2022 data will be in main so we can do a data release!
Create and push a tag called v2023.12.04. This will kick off a full build.
When the build passes we'll have a directory called s3://pudl.catalyst.coop/v2023.12.04.
Create a manual Zenodo data release with the tagged data (not entirely sure what this entails)
Notify known users that we'll be applying the rename to dev, so if they are pinned to data on datasette or pulling from the s3://pudl.catalyst.coop/dev directory, their code will break. To resolve the breakage, they can point their code at the data in s3://pudl.catalyst.coop/v2023.12.04 or replace their references to PUDL tables using this sheet. Users who are still using PudlTabl and are pinned to dev will not be affected, because the PudlTabl methods have not been changed. However, we are planning on removing PudlTabl, and users should migrate to pulling the data directly from the database.
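For users who choose to update their table references, the replacement could be scripted along these lines. This is only a sketch: the names in `OLD_TO_NEW` are placeholders rather than real PUDL tables (the actual mapping lives in the sheet mentioned above), and `rename_tables` is a hypothetical helper.

```python
import re

# Placeholder names for illustration only; the real old-to-new mapping
# comes from the sheet referenced in the notification.
OLD_TO_NEW = {
    "old_table_a": "new_table_a",
    "old_table_b": "new_table_b",
}


def rename_tables(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word occurrences of old table names with their new names."""
    if not mapping:
        return text
    # Try longer names first so the alternation prefers the most specific name.
    names = sorted(mapping, key=len, reverse=True)
    # \b keeps "old_table_a" from matching inside "old_table_a_extra",
    # because "_" counts as a word character.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

For example, `rename_tables("SELECT * FROM old_table_a", OLD_TO_NEW)` returns `"SELECT * FROM new_table_a"`. A text-level replacement like this is blunt, so any rewritten queries or scripts should still be reviewed by hand.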

Release PUDL with new names

Merge "Feature branch: Rename core + output assets to match new naming protocols" #2818 into dev.
Once the nightly builds have a successful run, open a PR from dev into main.
Once the new names are in main, create a tag called v2023.12.{day}, push it, and wait for a build to pass.
Create a manual Zenodo data release once the build passes
Notify our users about the changes and fix breakages (example notebooks... tbd)

Questions

Should we do a code freeze on dev and main while we are working through these release mechanics?
How do we want to name these releases? They will likely happen on different days, so the names will be different, or we could use a suffix, as in v2023.12.04.oldnames and v2023.12.04.
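To make the two naming options in the last question concrete, here is a small Python sketch that builds a vYYYY.MM.DD tag with an optional suffix. `release_tag` is a hypothetical helper for illustration, not part of the PUDL release tooling.

```python
from datetime import date


def release_tag(day: date, suffix: str = "") -> str:
    """Build a vYYYY.MM.DD tag, optionally followed by a dot-separated suffix."""
    tag = f"v{day:%Y.%m.%d}"
    return f"{tag}.{suffix}" if suffix else tag
```

With this helper, `release_tag(date(2023, 12, 4))` gives `"v2023.12.04"` and `release_tag(date(2023, 12, 4), "oldnames")` gives `"v2023.12.04.oldnames"`.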

jdangerx · 2023-12-04T21:31:59Z

Sounds like @zaneselvans and I will get cracking on the data release once #3086 is merged.

bendnorman mentioned this issue Aug 10, 2023

Create design doc for naming convention of PUDL tables #2517

Closed

bendnorman mentioned this issue Aug 28, 2023

Apply naming convention to core assets #2812

Closed

e-belfer assigned bendnorman and e-belfer Aug 28, 2023

bendnorman mentioned this issue Oct 2, 2023

Remove intermediate tables from metadata #2911

Open

bendnorman mentioned this issue Nov 20, 2023

Minimal rename cleanup for Winter #3061

Closed

zaneselvans added this to the v2024.01 milestone Jan 12, 2024

bendnorman closed this as completed Jan 29, 2024

zaneselvans mentioned this issue Jan 30, 2024

Clean up EIA-861 and FERC-714 core assets #3322

Open
