Dagsterize mega_generators and plant_parts_eia #2714
Conversation
@zaneselvans @cmgosnell as briefly discussed on Slack, the primary issue with getting these tables in the DB is that …
I'm not entirely sure where we stand on the harvesting process (#509 seems relevant). I guess my vote would be option 1 or 2, and then do option 3 when a more thorough harvesting revamp is conducted?
I tried a quick fix to rework the harvesting process to harvest the owner utilities, and the results seem pretty good. I was able to harvest the missing ~1300 utilities but mysteriously lost ~400 utilities. Investigating where they went.
#509 just retains columns that we can/should be harvesting using the current process. I could be wrong, but I don't think that the current harvesting process is capable of harvesting the same entity ID / value (like …). We realized in retrospect that leaving the (operator) …

I think the "automatically detect columns to harvest based on FKs" solution suggested in #1393 probably isn't the right approach, since we harvest IDs from columns that don't end up in the final table, and so would not have any FK relationships defined -- as should be the case for this table (since to normalize the ownership table, we should be removing the operator …).

So I think the simple solution here is just to preemptively drop the operator ID and hope we don't lose any actually novel utility IDs, and the more involved solution is to allow the specification of multiple columns that should be harvested, mapping them to a particular destination column. But in either case I think we should be able to close that other longstanding issue, get a bunch of new utility IDs in, and fix some currently missing FK relationships!

How did your attempted rework approach it? Any idea what happened with the lost utilities? The broader harvesting revamp has sadly been in the icebox for ages now.
@zaneselvans I had similar thoughts to what you said above. The column mapping solution seemed to work pretty well.
Basically what you said above: I added a dictionary to the EIA entities mapping from column names to the "standard" name of the column, i.e. …

This added 1228 new utilities. The missing ~400 IDs that I previously mentioned were just because I had forgotten to recreate the intermediate pickled assets with the new 2022 data, but all the IDs are showing up now. There are some slight changes to the non-ID columns with the owner utilities included: 206/15301 records in the …
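A rough sketch of how such a mapping dictionary could be applied during harvesting (column names follow the discussion; the actual PUDL implementation differs, so treat this as illustrative only):

```python
import pandas as pd

# Hypothetical mapped-column dictionary: the owner utility ID column gets
# renamed to the "standard" entity ID column so both can be harvested.
map_cols_dict = {"owner_utility_id_eia": "utility_id_eia"}

# Toy ownership-style input with operator and owner IDs in the same records:
own = pd.DataFrame(
    {
        "utility_id_eia": [1, 2],        # operator utility IDs
        "owner_utility_id_eia": [3, 1],  # owner utility IDs, same entity type
    }
)

# Harvest the standard-name column plus a renamed copy of the mapped columns:
operator_ids = own[["utility_id_eia"]]
owner_ids = own[list(map_cols_dict)].rename(columns=map_cols_dict)
harvested = pd.concat([operator_ids, owner_ids]).drop_duplicates()
```

Here the owner-only ID (3) survives the harvest instead of being dropped along with the `owner_utility_id_eia` column.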
For each entity type (key), the ID columns, static columns, annual columns, and mapped
columns.

Mapped columns allow for harvesting the same entity ID / value relationship
from multiple columns in the same input dataframe. This is useful if an entity has a
relationship with another entity of the same kind, for example owner and operator
utilities showing up in the same ownership table records. The mapped column dictionary
maps from column names of the second group of entity ID / value columns to the standard
names for the entity ID / value columns.

Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.
I tried to explain the change in this docstring, but maybe this doesn't make sense. Not sure what the best explanation of this behavior is that would generalize to potential future scenarios.
I'm not understanding the last bit:
Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.
I think the example I was thinking of wouldn't actually come up, because it would require a table input that's not so well normalized/clean. I think I can take this part out, but this is what I was originally thinking:
- There are the columns `owner_utility_id_eia` and `utility_id_eia`, but the owner and operator utility share a column, maybe `state` or something. Perhaps we're only looking at CA utilities, so there's one state column shared between owner and operator.
- In this case, you'd want to harvest `state` for both the owner and operator utility, but with the owner utility you don't need to rename the `state` column, so you map it to itself: `state: state`.
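As a concrete (hypothetical) example of the dictionary described in that scenario, covering both the renamed ID column and the identity-mapped shared column:

```python
# Hypothetical mapped-column dictionary: the owner utility's ID column is
# renamed to the standard name, while the shared "state" column maps to
# itself so it is still included when harvesting the owner relationship.
map_cols_dict = {
    "owner_utility_id_eia": "utility_id_eia",
    "state": "state",  # identity mapping: column name is already standard
}
```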
Glad it was just some stale inputs causing the loss of IDs. I don't totally understand the "all columns to be harvested" note in your docstring, or exactly what context the remapped columns are going to be used in -- is it going to try and harvest those columns for the entity in every table? Is there a simpler solution here that deals with the poor normalization of the table first, and renames all of the …

Is there any reason not to redefine the final ownership table here to get rid of the unusual names and the …
In my current implementation, all of the multiple-ID-column relationships from the same table are defined explicitly, and there is no automatic searching for multiple copies of entity attribute columns in a single table. Rather than dropping …

The post-harvesting table won't have a separate column name for owner utilities, just the …
src/pudl/analysis/plant_parts_eia.py
Outdated
"mcoe": AssetIn(key="mcoe_generators_yearly"),
"own_eia860": AssetIn(key="denorm_ownership_eia860"),
},
io_manager_key=None,
Is this all that needs to happen so that this table isn't written to the DB?
There's a default IO manager which will be used if this is None, which could be set explicitly instead. But unless it's set to `pudl_sqlite_io_manager` it won't be written to the DB.
pudl.helpers.cleanstrings_snake,
["record_id_eia", "appro_record_id_eia"],
)
.set_index("record_id_eia")
You can't write a table to the DB where the index is a non-numeric column, right? I ended up just making `record_id_eia` a column and then setting it to be the index in the FERC to EIA match.
I could be wrong, but I think the index column is ignored entirely when writing to the DB. The schema for the table, which is defined in the metadata, needs to indicate what set of columns comprise the primary key, if there is one. If there's no natural primary key then I think it'll use an autoincrementing integer key, which is what happens with the `fuel_receipts_costs` table(s).
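A minimal sketch of that behavior (table and column names here are made up): with `index=False`, the pandas index never reaches the database, and the primary key comes from the table schema rather than from pandas.

```python
import sqlite3

import pandas as pd

# A frame with a non-numeric string index, like a record_id_eia:
df = pd.DataFrame(
    {"record_id_eia": ["plant_1_2020", "plant_2_2020"], "capacity_mw": [10.0, 20.0]}
).set_index("record_id_eia")

con = sqlite3.connect(":memory:")
# The primary key lives in the table schema, not in the dataframe:
con.execute("CREATE TABLE demo (record_id_eia TEXT PRIMARY KEY, capacity_mw REAL)")

# reset_index() turns the string index back into an ordinary column, and
# index=False keeps pandas from writing its own index column:
df.reset_index().to_sql("demo", con, if_exists="append", index=False)
rows = con.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```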
Looks good! My biggest comments are that I'm not sure the assets need to be in a factory function, and I'm not sure how/if the `map_cols_dict` is being used.
src/pudl/transform/eia.py
Outdated
# A dictionary of columns representing additional data to be harvested,
# whose names should map to an ID, static, or annual column name.
(
    ENTITIES[entity.value]["map_cols_dict"]
This should be equivalent:
ENTITIES[entity.value].get("map_cols_dict")
I'm also not sure where this is being used, doesn't seem to be assigned to anything.
Ha oops that's supposed to be assigned to something, must have messed with it for testing and not fixed it.
Ah makes sense. I think this can still be changed to just:
map_cols_dict = ENTITIES[entity.value].get("map_cols_dict")
Ah right! Good call!
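For reference, the practical difference between the two lookups discussed above (the structure of `ENTITIES` here is hypothetical): `.get()` returns `None` for entities that define no mapped columns, where `[...]` would raise `KeyError`.

```python
# Hypothetical ENTITIES structure; "plants" defines no map_cols_dict.
entities = {"plants": {"id_cols": ["plant_id_eia"]}}

# Safe lookup: returns None instead of raising KeyError for the missing key.
map_cols_dict = entities["plants"].get("map_cols_dict")
```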
src/pudl/analysis/plant_parts_eia.py
Outdated
@@ -365,6 +366,54 @@
]


def plant_parts_eia_asset_factory(
Does this need to be in a factory, or could these two assets just stand on their own?
Changed to stand alone assets.
"fraction_owned",
"data_maturity",
],
"primary_key": [
    "report_date",
    "plant_id_eia",
    "generator_id",
    "owner_utility_id_eia",
    "utility_id_eia",
Since `utility_id_eia` in this table refers to the owner rather than the operator, we may want to give it a resource-specific metadata override at the bottom of `pudl/metadata/fields.py` so the description highlights this difference.
@zschira For some reason the harvesting of the owner utilities got lost somewhere and wasn't in the initial commits I pushed, so I just added that back in. Basically it uses the …

I'm not sure why …

In addition to fixing that test, I need to update the release notes and merge dev. Also @bendnorman, I'm not sure if you want to get in here in the context of naming conventions, but tagging you just in case.
I have a few questions about the mega_gens and ppl assets: …
@zaneselvans Here is the …
It looks like the fraction of nuclear generation has just declined a little bit? Previously we were checking that it was 20% +/- 2%, but now it seems to be at 17.75%, which isn't sooo far off, and it looks like there was 1 big retirement in 2022 and another big retirement in 2021, so maybe that's all it is. Here's the NEI stats through 2021. And the failure here is just looking at 1 year of data, so it's going to be more volatile. My intuition is that either reducing the expected value to 19 +/- 2 or expanding the allowed range to 20 +/- 3 would be reasonable and fix the issue.
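The proposed tolerance change amounts to something like the following standalone sketch (PUDL's actual validation helper is not shown; function and parameter names are made up):

```python
def in_expected_range(frac_pct: float, expected: float = 20.0, tol: float = 3.0) -> bool:
    """Check that a generation fraction (in percent) is within expected +/- tol."""
    return abs(frac_pct - expected) <= tol

# The observed 17.75% nuclear fraction passes with a widened 20 +/- 3 check,
# but fails the original 20 +/- 2 check.
passes_new = in_expected_range(17.75)
passes_old = in_expected_range(17.75, tol=2.0)
```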
@zaneselvans Ok great, that's along the lines of what I thought was happening; I was just suspicious of why this PR (adding ownership utilities) would change anything with the nuclear fraction. I'll try playing with the range of the expected value.
)


@asset(
It would be nice to be able to generate the PPE for only a subset of years; how have we been doing that type of thing? Add `context` with a start and end year? Ideally you'd be able to run the fast ETL and only get those 2 years of PPE.
What happens if you try and generate the PPE in the fast ETL right now? It'll only have the last couple of years of data available in the database to work with won't it?
I guess I mean if you have full data in the database (with all the years) but you're making changes to the PPE and want to generate a new PPE with a couple years of data, but don't want to wait the full 45 min.
Ah, okay. Hmm. A 45 minute asset that runs at the end of the DAG is going to be a bottleneck too. I wonder if there's anything we can do to split it up or speed it up to avoid extending the whole ETL by 45 minutes and also make it quick and easy to regenerate the whole thing on the fly in development. I'm not sure how to twiddle input parameters like this on a per-asset per-run basis but I assume it's possible.
It probably wouldn't be hard to parallelize. Currently it loops through each plant part and aggregates records to that part level, then concatenates into one dataframe. This loop could be parallelized. But probably worth making that a separate issue/PR at this point.
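The loop-over-plant-parts idea could be parallelized along these lines (a sketch only: the part names and aggregation function are stand-ins, and for CPU-bound pandas work a process pool might be the better choice than threads):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative plant part levels; the real list lives in plant_parts_eia.py.
PLANT_PARTS = ["plant", "plant_gen", "plant_unit", "plant_technology"]

def aggregate_part(part: str) -> str:
    # Stand-in for the real per-part aggregation, which returns a dataframe
    # of records aggregated to that plant part level.
    return f"records_for_{part}"

# Aggregate each part independently, preserving input order, then the
# results would be concatenated into one output table.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(aggregate_part, PLANT_PARTS))
```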
Agree, sounds like a good follow-up issue. Too many 45-minute assets will definitely get frustrating!
This looks good to me once the tests are passing and @bendnorman's questions are addressed.
Unfortunately @bendnorman I think these are probably @cmgosnell questions...
I don't know the answer to any of these.
I suspect the answer here is NO -- this table is incredibly difficult to work with and explain to users, and I think once we've created a much simpler join table that allows the FERC and EIA data to be merged, that's the output that should end up in the DB and be used to pull the FERC financial info into our huge generator attributes table.
Yes, sort of. The primary key for mega gens is …
As Zane says above, probably not. I think the only reason you'd want this table is to make the FERC to EIA connection.
It's the …
Codecov Report: All modified lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##              dev   #2714   +/-  ##
=======================================
- Coverage    88.6%   88.6%   -0.1%
=======================================
  Files          90      90
  Lines      10833   10808     -25
=======================================
- Hits        9600    9576     -24
+ Misses      1233    1232      -1

☔ View full report in Codecov by Sentry.
PR Overview
PR Checklist