Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dagsterification of ferc1_eia #2938

Merged
merged 16 commits into from
Oct 16, 2023
Merged

dagsterification of ferc1_eia #2938

merged 16 commits into from
Oct 16, 2023

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Oct 12, 2023

PR Overview

  • convert the table to a non-db asset
  • convert the inputs to the assets
  • convert the pudl_out method into a default magic in the rename (idk how this works i just replicated it 🤷🏻)
  • check for PKs
    • there were some dupes bc of the null override file: dedupe the override file.
    • there were more dupes bc there were records in both the null override file & the training data. they all seemed to be labeled with 1:m in the null sheet so I removed them from the training sheet
    • there were even more dupes bc of a weird merge choice that i cleared up & made a comment about
  • add the table metadata in resources (added draft ###ed out)
  • add any new columns: record_id_ferc1
  • convert the non-db asset into an db asset
  • fixed two deprecation warnings bc they were not cute in my logs

Questions

  • do we want alllll of the columns from both plant_parts_eia and plants_all_ferc1? rn i just added em all... i think that's data marty but idk it is a lot. ANSWER: all of em
  • asset_group: it feels weird to make a whole new asset group for this one guy... any input on that? CURRENT ANSWER: a new group named 🥁 ferc1_eia. a group of 1... this seems like our current pattern. i don't love that tbh but i'm going to move forward w/ that and if we come up with a collective good idea around this we should standardize.
  • NAMES: should i try to go with the new names for this table??

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • For major data coverage & analysis changes, run data validation tests
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.
  • Make sure you've included good docstrings.
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.

@cmgosnell cmgosnell linked an issue Oct 12, 2023 that may be closed by this pull request
@cmgosnell cmgosnell added eia923 Anything having to do with EIA Form 923 ferc1 Anything having to do with FERC Form 1 eia860 Anything having to do with EIA Form 860 dagster Issues related to our use of the Dagster orchestrator labels Oct 12, 2023
@zaneselvans
Copy link
Member

Hmm, I had thought that this was just going to be glue / crosswalk table that provided the necessary FERC & EIA columns for merging together data that had both FERC & EIA planty records.

@cmgosnell
Copy link
Member Author

cmgosnell commented Oct 13, 2023

Hmm, I had thought that this was just going to be glue / crosswalk table that provided the necessary FERC & EIA columns for merging together data that had both FERC & EIA planty records.

this is my one question about the pr. i haven't been in y'all's convos about the intention for this table so idk what the intention was here.

I believe could pretty easily make a skinny glue table and a wide marty table.

@zaneselvans
Copy link
Member

If the big wide table is what we've been generating, then I assume that's what RMI is depending on downstream, in which case we need to continue providing it in some form. How does this data currently get used? It must have a bajillion columns.

Regardless of that, it also seems like the glue table would be a generally useful thing. You could use it to stick together FERC + EIA data in a variety of contexts without needing to read this monster table. What columns would be part of the glue?

@cmgosnell
Copy link
Member Author

I added all of the extra columns a while back while working in the rmi-ferc1-eia repo because I kept going back and merging in more columns from the EIA or FERC data. If the whole point of this merge is to connect the financial data in FERC and the operational data in EIA... it seems like we should probably publish this big table. Especially because the EIA merge is a merge from the plant_parts_eia, which is not super hard but requires some understanding of the ppe.

I think the skinny table could just have: ["record_id_eia", "record_id_ferc1", "match_type"] (match_type is the only truly new column from the matching process)

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to talk more about the structure of the glue.

src/pudl/package_data/glue/ferc1_eia_null.csv Outdated Show resolved Hide resolved
src/pudl/output/pudltabl.py Show resolved Hide resolved
src/pudl/metadata/resources/glue.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/glue.py Outdated Show resolved Hide resolved
@cmgosnell
Copy link
Member Author

cmgosnell commented Oct 13, 2023

okay @zaneselvans and i chatted about table design and landed on keeping the wide table because the record_id_eia is only linkable to the plant_parts_eia and making users interact with that table to get the EIA data is... probably not a good idea. This table is at the "level" of the ferc records which necessarily means it includes a variety of EIA ppe records (i.e. aggregations of EIA generators).

in an ideal world we would follow up with this table with an addition table that is at the level of EIA generators (or EIA generator-owners). it would be a 1:m association table with one FERC1 record being linked to many EIA generator records. and the FERC1 data could be proportionally allocated across those generators. we did a lot of that work over in this repo.

@zaneselvans
Copy link
Member

On the Naming of Things, I do think that getting a more descriptive table name in here would be great. Under the new naming convention what would it be? What "data source" are we using for the composite analytical tables? pudl? Or nothing at all? Would out__yearly_plants_all_ferc1_plant_parts_eia_assn be appropriate? Is there a better way to indicate what the two sides of the table are? I guess it's not strictly an association table.

On the asset groups, to me it feels like plant_parts_eia, mega_gens and ferc1_eia are all part of the same constellation. Maybe just add this module to the list (currently of length 1) of modules that's used to build the plant parts asset group?

Should we create an issue to get the already-existing FERC1 data allocated to EIA generator-ownership slices table migrated over from the RMI repo and/or the coarse-grained primary-fuel based linkage for the remaining 20% of FERC1 records that aren't getting associated in the current process.

Do you want me to take a stab at a table description, and then you can tell me how it's wrong because I don't totally understand this table?

@cmgosnell cmgosnell marked this pull request as ready for review October 16, 2023 15:25
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dazhong says the Zenodo fix CI should be done in about 15 minutes!

src/pudl/etl/__init__.py Outdated Show resolved Hide resolved
Comment on lines +97 to +99
"sources": ["eia860", "eia923"],
"etl_group": "outputs",
"field_namespace": "eia",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this table really have no primary key?

Copy link
Member Author

@cmgosnell cmgosnell Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm sure it could have some gross composite key but i just copied this over when migrating it. so it did not beforehand and i think determining one and enforcing it feels oos for this pr.

@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (8ba81e9) 88.5% compared to head (d1da494) 88.5%.
Report is 6 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2938     +/-   ##
=======================================
- Coverage   88.5%   88.5%   -0.1%     
=======================================
  Files         90      91      +1     
  Lines      10797   10805      +8     
=======================================
+ Hits        9564    9569      +5     
- Misses      1233    1236      +3     
Files Coverage Δ
src/pudl/analysis/classify_plants_ferc1.py 92.4% <ø> (ø)
src/pudl/analysis/ferc1_eia_record_linkage.py 96.1% <100.0%> (ø)
src/pudl/metadata/classes.py 86.4% <100.0%> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
...udl/metadata/resources/ferc1_eia_record_linkage.py 100.0% <100.0%> (ø)
src/pudl/output/pudltabl.py 88.5% <ø> (-0.6%) ⬇️
src/pudl/workspace/datastore.py 77.6% <100.0%> (+0.5%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zaneselvans zaneselvans merged commit b939d28 into dev Oct 16, 2023
10 of 11 checks passed
@zaneselvans zaneselvans deleted the dag-the-ferc1-eia branch October 16, 2023 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 ferc1 Anything having to do with FERC Form 1
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Convert FERC1-EIA record linkage outputs to Dagster assets
2 participants