
Dagsterize mega_generators and plant_parts_eia #2714

Merged

25 commits merged into dev on Sep 26, 2023

Conversation

katie-lamb
Member

@katie-lamb katie-lamb commented Jul 5, 2023

PR Overview

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@katie-lamb katie-lamb added the ccai Tasks related to CCAI grant for entity matching label Jul 5, 2023
@katie-lamb katie-lamb added this to the 2023 Spring milestone Jul 5, 2023
@katie-lamb katie-lamb self-assigned this Jul 5, 2023
@katie-lamb katie-lamb marked this pull request as draft July 5, 2023 20:15
@katie-lamb katie-lamb changed the base branch from main to dagsterize-mcoe July 5, 2023 20:15
Base automatically changed from dagsterize-mcoe to dev August 11, 2023 22:13
@zaneselvans zaneselvans linked an issue Aug 14, 2023 that may be closed by this pull request
@katie-lamb
Member Author

katie-lamb commented Aug 15, 2023

@zaneselvans @cmgosnell as briefly discussed on Slack, the primary issue with getting these tables in the DB is that mega_gens has null values in utility_id_eia which is part of its primary key. The null values arise because the utility owner IDs are not properly harvested (see #1393 ). I see 3 potential solutions, ordered from easiest and least satisfactory to hardest and most satisfactory:

  • Don't write the mega gens table to the DB. Just put plant_parts_eia in there. How useful is the mega gens table as a standalone output? Or is the PPL really the more useful output?
  • The "janky" solution that Zane proposes here, where we rename owner_utility_id_eia to utility_id_eia and utility_id_eia to operator_utility_id_eia (then the ownership relationship would be captured / harvested, and we could drop the operator_utility_id_eia column from the table altogether). I think we're talking about the original ownership_eia860 table here.
  • Rework the harvesting process slightly: "it should be possible to harvest selectively based on name and the foreign key relationships. Like any column that is already identified as referring to an entity table column (like utility_id_eia) even if it doesn't have exactly the same name as the column it refers to (e.g. owner_utility_id_eia) should still be considered a source for harvested values."

I'm not entirely sure where we stand on the harvesting process (#509 seems relevant). I guess my vote would be option 1 or 2 and then do option 3 when a more thorough harvesting revamp is conducted?
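Option 2 above amounts to a pre-harvest column swap. A minimal sketch, with a purely illustrative dataframe (the real ownership table has many more columns):

```python
import pandas as pd

# Hypothetical ownership records: operator IDs in utility_id_eia,
# owner IDs in owner_utility_id_eia, values invented for illustration.
own = pd.DataFrame({
    "utility_id_eia": [101, 102],          # operator IDs
    "owner_utility_id_eia": [201, 202],    # owner IDs
    "fraction_owned": [1.0, 0.5],
})
own = (
    own.rename(columns={
        "utility_id_eia": "operator_utility_id_eia",
        "owner_utility_id_eia": "utility_id_eia",
    })
    # If the operator can always be recovered from the plant ID and date,
    # the operator column could then be dropped entirely:
    .drop(columns=["operator_utility_id_eia"])
)
print(list(own.columns))  # ['utility_id_eia', 'fraction_owned']
```

After the swap, the owner IDs sit in the conventionally named column and would be picked up by the existing harvesting process.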

@katie-lamb
Member Author

I tried a quick fix to rework the harvesting process to harvest the owner utilities and results seem pretty good. I was able to harvest the missing ~1300 utilities but mysteriously lost ~400 utilities. Investigating where they went.

@zaneselvans
Member

zaneselvans commented Aug 16, 2023

#509 just retains columns that we can/should be harvesting using the current process.

I could be wrong, but I don't think that the current harvesting process is capable of harvesting the same entity ID / value (like utility_id_eia) from multiple columns in the same input dataframe, even though that's clearly a kind of data that can exist, since an entity can have a relationship with another entity of the same kind, as we see here with the owners and operators showing up in the same records.

We realized in retrospect that leaving the (operator) utility_id_eia in this table is not well normalized, since you can look up the operator based on the date and the plant ID. So one simple solution would be to just drop the operator ID column prior to harvesting, rename the owner IDs to be utility_id_eia and then all those owner IDs would get harvested, but we wouldn't be capturing any unique utility IDs that appeared only in the operator ID column in this table (of which there should be none but... you never know).

I think the "automatically detect columns to harvest based on FKs" solution suggested in #1393 probably isn't the right approach, since we harvest IDs from columns that don't end up in the final table, and so would not have any FK relationships defined -- as should be the case for this table (since to normalize the ownership table, we should be removing the operator utility_id_eia column and only keeping the owner ID column... which should probably be renamed to utility_id_eia as per our convention more broadly).

So I think the simple solution here is just to preemptively drop the operator ID and hope we don't lose any actually novel utility IDs, and the more involved solution is to allow the specification of multiple columns that should be harvested, mapping them to a particular destination column.

But anyway in either case I think we should be able to close that other longstanding issue, get a bunch of new Utility IDs in, and fix some currently missing FK relationships!

How did your attempted rework approach it? Any idea what happened with the lost utilities?

The broader harvesting revamp has been kind of in the icebox for ages now sadly.

@katie-lamb
Member Author

katie-lamb commented Aug 16, 2023

@zaneselvans I had similar thoughts to what you said above. The column mapping solution seemed to work pretty well.

How did your attempted rework approach it? Any idea what happened with the lost utilities?

Basically what you said above, where I added a dictionary to the EIA entities mapping from column names to the "standard" name of the column, i.e. owner_utility_id_eia to utility_id_eia.

This added 1228 new utilities. The missing 400 IDs that I previously mentioned were just because I had forgotten to recreate the intermediate pickled assets with the new 2022 data, but all the IDs are showing up now.

There are some slight changes to the non-ID columns with the owner utilities included. 206/15,301 records in the utility_entity_eia table have a different utility_name_eia, but for the most part these seem like superficial changes, e.g. "Banning, City of" vs. "City of Banning". This is a histogram of the string similarity/edit distance between the new and old names. Most of the records with 0 similarity are cases where one of the names is null. Most records with a score > 0.4 still share a token or two. I've also included this for street_address in the utilities_eia860 table, for which 151/125,280 rows changed. All the rest of the columns are unchanged, except for 1 or 2 records with a changed state, city, or zip code.

(Histograms attached: jaro_winkler_utility_name_sim, levenshtein_utility_name, jaro_winkler_street_address, levenshtein_street_address.)
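The name-comparison step described above can be sketched as follows. The original analysis used Jaro-Winkler and Levenshtein scores; this sketch substitutes the stdlib difflib ratio as a stand-in, and the column names are assumptions:

```python
import difflib
import pandas as pd

# Illustrative old/new name pairs, including a null and the
# "Banning, City of" vs "City of Banning" example from above.
old_new = pd.DataFrame({
    "utility_name_eia_old": ["Banning, City of", None],
    "utility_name_eia_new": ["City of Banning", "Some Utility"],
})

def name_sim(a, b):
    """Return 0.0 when either name is null, else a [0, 1] similarity."""
    if pd.isna(a) or pd.isna(b):
        return 0.0
    return difflib.SequenceMatcher(None, a, b).ratio()

old_new["name_sim"] = [
    name_sim(a, b)
    for a, b in zip(old_new["utility_name_eia_old"], old_new["utility_name_eia_new"])
]
# old_new["name_sim"].hist() would produce a histogram like those attached.
```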

Comment on lines 303 to 314
For each entity type (key), the ID columns, static columns, annual columns, and mapped
columns.

Mapped columns allow for harvesting the same entity ID / value relationship
from multiple columns in the same input dataframe. This is useful if an entity has a
relationship with another entity of the same kind, for example owner and operator utilities
showing up in the same ownership table records. The mapped column dictionary maps from
the column names of the second group of entity ID / value columns to the standard names
for the entity ID / value columns.

Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.
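A hypothetical sketch of what an ENTITIES entry with a mapped-columns dict might look like; the key names and column lists here are illustrative, not the actual PUDL schema:

```python
import pandas as pd

ENTITIES = {
    "utilities": {
        "id_cols": ["utility_id_eia"],
        "static_cols": ["utility_name_eia"],
        "annual_cols": ["street_address", "city", "state", "zip_code"],
        # Harvest a second utility relationship from the ownership table by
        # renaming owner IDs to the standard column name before harvesting.
        # A column mapping to itself (e.g. "state": "state") would mark a
        # shared attribute column harvested for this relationship as well.
        "map_cols_dict": {"owner_utility_id_eia": "utility_id_eia"},
    },
}

def apply_map_cols(df: pd.DataFrame, entity: str) -> pd.DataFrame:
    """Rename mapped columns to their standard names prior to harvesting."""
    map_cols = ENTITIES[entity].get("map_cols_dict", {})
    return df.rename(columns=map_cols)

own = pd.DataFrame({"owner_utility_id_eia": [201], "fraction_owned": [1.0]})
renamed = apply_map_cols(own, "utilities")
```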
Member Author

I tried to explain the change in this docstring, but maybe this doesn't make sense. Not sure what the best explanation of this behavior is that would generalize to potential future scenarios.

Member

I'm not understanding the last bit:

Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.

Member Author

I think the example I was thinking of wouldn't actually come up, because it would require a table input that's not so well normalized/clean. I think I can take this part out, but this is what I was originally thinking:

  • There are the columns owner_utility_id_eia, utility_id_eia, but the owner and operator utility share a column, maybe state or something. Perhaps we're only looking at CA utilities so there's one state column shared between owner and operator
  • In this case, you'd want to harvest state for both the owner and operator utility, but with the owner utility you don't need to rename the state column, so you map it to itself state:state.

@zaneselvans
Member

Glad it was just some stale inputs causing the loss of IDs. I don't totally understand the "all columns to be harvested" note in your docstring or exactly what context the remapped columns are going to be used in -- is it going to try and harvest those columns for the entity in every table?

Is there a simpler solution here that deals with the poor normalization of the table first, and renames all of the owner columns to just use the normal column names, since there's no reason to have the operator ID reference in the table, except to make sure that we get all of the utility_id_eia values that show up anywhere? I.e. can we just look at harvesting multiple ID columns and not need to worry about potentially finding multiple copies of entity attribute columns in a single table (although I can see that being more general, in cases where we have multiple entities being referenced in the same table, potentially each with their own associated attributes).

Is there any reason not to redefine the final ownership table here to get rid of the unusual names and the utility_id_eia column that refers to the operator (post-harvesting)?

@katie-lamb
Member Author

except to make sure that we get all of the utility_id_eia values that show up anywhere? I.e. can we just look at harvesting multiple ID columns and not need to worry about potentially finding multiple copies of entity attribute columns in a single table (although I can see that being more general, in cases where we have multiple entities being referenced in the same table, potentially each with their own associated attributes).

In my current implementation, all of the multiple-ID-column relationships within a table are defined explicitly, and there is no automatic searching for multiple copies of entity attribute columns in a single table. Rather than dropping utility_id_eia and potentially missing some utility IDs, I thought this was a more general solution that could be easily applied to different tables. Christina also mentioned the goal of not dropping columns pre-harvesting.

Is there any reason not to redefine the final ownership table here to get rid of the unusual names and the utility_id_eia column that refers to the operator (post-harvesting)?

The post-harvesting table won't have a separate column name for owner utilities, just the utility_id_eia column that has both owner and operator utilities in it.

@katie-lamb katie-lamb marked this pull request as ready for review September 11, 2023 20:07
"mcoe": AssetIn(key="mcoe_generators_yearly"),
"own_eia860": AssetIn(key="denorm_ownership_eia860"),
},
io_manager_key=None,
Member Author

Is this all that needs to happen so that this table isn't written to the DB?

Member

There's a default IO manager which will be used if this is None, which could be set explicitly instead. But unless it's set to pudl_sqlite_io_manager it won't be written to the DB.
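A sketch of the two choices being discussed; the asset and input names follow the snippet in this PR, but the body is a placeholder:

```python
from dagster import AssetIn, asset

@asset(
    ins={
        "mcoe": AssetIn(key="mcoe_generators_yearly"),
        "own_eia860": AssetIn(key="denorm_ownership_eia860"),
    },
    # Leaving io_manager_key as None falls back to the default IO manager;
    # only setting it to "pudl_sqlite_io_manager" would write the output
    # to the PUDL SQLite DB.
    io_manager_key="pudl_sqlite_io_manager",
)
def mega_generators_eia(mcoe, own_eia860):
    ...
```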

pudl.helpers.cleanstrings_snake,
["record_id_eia", "appro_record_id_eia"],
)
.set_index("record_id_eia")
Member Author

You can't write a table to the DB where the index is a non-numeric column, right? I ended up just making record_id_eia a column and then setting it to be the index in the FERC to EIA match

Member

I could be wrong but I think the index column is ignored entirely when writing to the DB. The schema for the table, which is defined in the metadata, needs to indicate what set of columns comprise the primary key if there is one. If there's no natural primary key then I think it'll use an autoincrementing integer key, which is what happens with the fuel_receipts_costs table(s).
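The reset-the-index approach can be sketched with plain pandas and SQLite; table and column names are illustrative:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"record_id_eia": ["a_1", "b_2"], "capacity_mw": [1.0, 2.0]})
df = df.set_index("record_id_eia")

with sqlite3.connect(":memory:") as conn:
    # reset_index() turns record_id_eia back into an ordinary column;
    # index=False keeps pandas from writing a separate index column.
    df.reset_index().to_sql("plant_parts_eia", conn, index=False)
    cols = [row[1] for row in conn.execute("PRAGMA table_info(plant_parts_eia)")]
print(cols)  # ['record_id_eia', 'capacity_mw']
```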

Member

@zschira zschira left a comment

Looks good! My biggest comments are that I'm not sure the assets need to be in a factory function, and I'm not sure how/if the map_cols_dict is being used.

# A dictionary of columns representing additional data to be harvested,
# whose names should map to an ID, static, or annual column name.
(
ENTITIES[entity.value]["map_cols_dict"]
Member

This should be equivalent:

ENTITIES[entity.value].get("map_cols_dict")

I'm also not sure where this is being used, doesn't seem to be assigned to anything.

Member Author

Ha oops that's supposed to be assigned to something, must have messed with it for testing and not fixed it.

Member

Ah makes sense. I think this can still be changed to just:

map_cols_dict = ENTITIES[entity.value].get("map_cols_dict")

Member Author

Ah right! Good call!

@@ -365,6 +366,54 @@
]


def plant_parts_eia_asset_factory(
Member

Does this need to be in a factory, or could these two assets just stand on their own?

Member Author

Changed to stand alone assets.

src/pudl/transform/eia860.py (outdated, resolved)
src/pudl/metadata/resources/__init__.py (resolved)
"fraction_owned",
"data_maturity",
],
"primary_key": [
"report_date",
"plant_id_eia",
"generator_id",
"owner_utility_id_eia",
"utility_id_eia",
Member

Since utility_id_eia in this table refers to the owner rather than the operator we may want to give it a resource-specific metadata override at the bottom of pudl/metadata/fields.py so the description highlights this difference.

@katie-lamb
Member Author

@zschira For some reason the owner-utility harvesting changes got lost somewhere and weren't in the initial commits I pushed, so I just added them back in. Basically it uses the map_cols_dict to rename columns and harvest additional relationships from the tables.

I'm not sure why test_nuclear_fraction is failing. Somehow more nuclear gens were added? I'll poke into that later.

In addition to fixing that test, I need to update the release notes and merge dev.

Also @bendnorman I'm not sure if you want to get in here in the context of naming conventions. But tagging you just in case.

@bendnorman
Member

I have a few questions about the mega_gens and ppl assets:

  • Are there suitable natural keys for these tables?
  • Does every generator in mcoe_generators_monthly exist in mega_gens?
  • Eventually, should the ppl asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?
  • Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

@katie-lamb
Member Author

@zaneselvans Here is the test_nuclear_fraction failure in case it seems familiar to you. Otherwise I'll compare the tables to dev and see what's changed.

@zaneselvans
Member

It looks like the fraction of nuclear generation has just declined a little bit? Previously we were checking that it was 20% +/- 2%, but now it seems to be at 17.75%, which isn't sooo far off, and it looks like there was 1 big retirement in 2022 and another big retirement in 2021, so maybe that's all it is. Here are the NEI stats through 2021. And the failure here is just looking at 1 year of data, so it's going to be more volatile. My intuition is that either reducing the expected value to 19 +/- 2 or expanding the allowed range to 20 +/- 3 would be reasonable and fix the issue.
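The tolerance tweak being proposed can be sketched like this; the function name and thresholds are illustrative, not the actual PUDL validation test:

```python
def check_nuclear_fraction(actual_pct, expected_pct=20.0, tolerance_pct=2.0):
    """Return True if the nuclear generation share is within tolerance."""
    return abs(actual_pct - expected_pct) <= tolerance_pct

# 17.75% fails the old 20 +/- 2 bound but passes a widened 20 +/- 3:
assert not check_nuclear_fraction(17.75, 20.0, 2.0)
assert check_nuclear_fraction(17.75, 20.0, 3.0)
```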

@katie-lamb
Member Author

@zaneselvans Ok great, that's along the lines of what I thought was happening, I was just suspicious of why this PR (adding ownership utilities) would change anything with the nuclear fraction.

I'll try playing with the range of the expected value.

)


@asset(
Member Author

It would be nice to be able to generate the PPE for only a subset of years, how have we been doing that type of thing? Add context with a start and end year? Ideally you'd be able to run the fast ETL and only get those 2 years of PPE.

Member

What happens if you try and generate the PPE in the fast ETL right now? It'll only have the last couple of years of data available in the database to work with won't it?

Member Author

I guess I mean if you have full data in the database (with all the years) but you're making changes to the PPE and want to generate a new PPE with a couple years of data but don't want to wait the full 45 min

Member

Ah, okay. Hmm. A 45 minute asset that runs at the end of the DAG is going to be a bottleneck too. I wonder if there's anything we can do to split it up or speed it up to avoid extending the whole ETL by 45 minutes and also make it quick and easy to regenerate the whole thing on the fly in development. I'm not sure how to twiddle input parameters like this on a per-asset per-run basis but I assume it's possible.
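One way the year-restriction idea could look, as a hedged sketch: how the parameters get plumbed in (Dagster run config, an env var, etc.) is left abstract, and the function and column names are assumptions:

```python
import pandas as pd

def subset_years(gens: pd.DataFrame, start_year=None, end_year=None) -> pd.DataFrame:
    """Filter generator records to a year range before the expensive PPE build."""
    years = pd.to_datetime(gens["report_date"]).dt.year
    mask = pd.Series(True, index=gens.index)
    if start_year is not None:
        mask &= years >= start_year
    if end_year is not None:
        mask &= years <= end_year
    return gens[mask]

gens = pd.DataFrame({"report_date": ["2020-01-01", "2021-01-01", "2022-01-01"]})
recent = subset_years(gens, start_year=2021)
```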

Member Author

It probably wouldn't be hard to parallelize. Currently it loops through each plant part and aggregates records to that part level, then concatenates into one dataframe. This loop could be parallelized. But probably worth making that a separate issue/PR at this point.
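The parallelization idea could look something like this sketch; the part names and placeholder aggregation are illustrative, and CPU-bound work might prefer a process pool over the thread pool shown here:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

PLANT_PARTS = ["plant", "plant_unit", "plant_gen", "plant_prime_mover"]

def aggregate_part(part: str) -> pd.DataFrame:
    """Placeholder for the per-plant-part aggregation step."""
    return pd.DataFrame({"plant_part": [part], "n_records": [1]})

# Run the per-part aggregations concurrently, then concatenate once.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(aggregate_part, PLANT_PARTS))
plant_parts_eia = pd.concat(parts, ignore_index=True)
print(len(plant_parts_eia))  # 4
```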

Member

Agree, sounds like a good follow-up issue. Too many 45-minute assets will definitely get frustrating!

Member

@zschira zschira left a comment

This looks good to me once the tests are passing, and @bendnorman's questions are addressed


@zaneselvans
Member

Unfortunately @bendnorman I think these are probably @cmgosnell questions...

Are there suitable natural keys for these tables?
Does every generator in mcoe_generators_monthly exist in mega_gens?
Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

I don't know the answer to any of these.

Eventually, should the PPE asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?

I suspect the answer here is NO -- this table is incredibly difficult to work with and explain to users, and I think once we've created a much simpler join table that allows the FERC and EIA data to be merged, that's the output that should end up in the DB and be used to pull the FERC financial info into our huge generator attributes table.

@katie-lamb
Member Author

@bendnorman

  • Are there suitable natural keys for these tables?

Yes, sort of. The primary key for mega gens is "plant_id_eia", "generator_id", "report_date", "utility_id_eia", "ownership_record_type", but there are still 126 records with null utility_id_eia. The key for plant_parts_eia is record_id_eia, which is a string concatenation of plant ID, report year, plant part, ownership record type, utility ID, and retirement status.
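An illustrative sketch of how a record_id_eia-style key concatenates the fields listed above; the separator, field order, and example values are assumptions, not the actual PUDL format:

```python
def make_record_id(plant_id, report_year, plant_part, ownership, utility_id, status):
    """Join the key fields into a single string record ID."""
    fields = [plant_id, report_year, plant_part, ownership, utility_id, status]
    return "_".join(str(f) for f in fields)

rid = make_record_id(3, 2020, "plant_gen", "owned", 195, "retired")
print(rid)  # 3_2020_plant_gen_owned_195_retired
```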

  • Does every generator in mcoe_generators_monthly exist in mega_gens?

mcoe_generators_yearly is actually the input to mega_generators_eia, but yes, every generator in the yearly MCOE table exists in the mega generators table.

  • Eventually, should the ppl asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?

As Zane says above, probably not. I think the only reason you'd want this table is to make the FERC to EIA connection.

  • Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

It's the mcoe_generators_yearly table that's really the input here, but yes, it's essentially the MCOE split out by ownership fraction and with a couple FERC-y columns added on to help with the eventual match to FERC.

@codecov

codecov bot commented Sep 26, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (f6cbcb5) 88.6% compared to head (9615220) 88.6%.
Report is 28 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2714     +/-   ##
=======================================
- Coverage   88.6%   88.6%   -0.1%     
=======================================
  Files         90      90             
  Lines      10833   10808     -25     
=======================================
- Hits        9600    9576     -24     
+ Misses      1233    1232      -1     
Files Coverage Δ
src/pudl/analysis/ferc1_eia.py 98.0% <100.0%> (ø)
src/pudl/analysis/plant_parts_eia.py 96.5% <100.0%> (+<0.1%) ⬆️
src/pudl/helpers.py 86.3% <100.0%> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/output/eia860.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 89.1% <ø> (-1.1%) ⬇️
src/pudl/transform/eia.py 97.3% <100.0%> (+<0.1%) ⬆️
src/pudl/transform/eia860.py 88.8% <100.0%> (+<0.1%) ⬆️

... and 2 files with indirect coverage changes


@katie-lamb katie-lamb merged commit a64b03e into dev Sep 26, 2023
10 of 11 checks passed
@katie-lamb katie-lamb deleted the dagsterize-ppe branch September 26, 2023 17:02
Successfully merging this pull request may close these issues.

Convert FERC1-EIA record linkage outputs to Dagster assets