
Dagsterize mega_generators and plant_parts_eia #2714

Merged

25 commits merged into dev on Sep 26, 2023

Conversation

katie-lamb
Member

@katie-lamb katie-lamb commented Jul 5, 2023

PR Overview

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@katie-lamb katie-lamb added the ccai Tasks related to CCAI grant for entity matching label Jul 5, 2023
@katie-lamb katie-lamb added this to the 2023 Spring milestone Jul 5, 2023
@katie-lamb katie-lamb self-assigned this Jul 5, 2023
@katie-lamb katie-lamb marked this pull request as draft July 5, 2023 20:15
@katie-lamb katie-lamb changed the base branch from main to dagsterize-mcoe July 5, 2023 20:15
Base automatically changed from dagsterize-mcoe to dev August 11, 2023 22:13
@zaneselvans zaneselvans linked an issue Aug 14, 2023 that may be closed by this pull request
@katie-lamb
Member Author

katie-lamb commented Aug 15, 2023

@zaneselvans @cmgosnell as briefly discussed on Slack, the primary issue with getting these tables in the DB is that mega_gens has null values in utility_id_eia which is part of its primary key. The null values arise because the utility owner IDs are not properly harvested (see #1393 ). I see 3 potential solutions, ordered from easiest and least satisfactory to hardest and most satisfactory:

  • Don't write the mega gens table to the DB. Just put plant_parts_eia in there. How useful is the mega gens table as a standalone output? Or is the PPL really the more useful output?
  • The "janky" solution that Zane proposes here, where we rename owner_utility_id_eia to utility_id_eia and utility_id_eia to operator_utility_id_eia (then the ownership relationship would be captured / harvested, and we could drop the operator_utility_id_eia column from the table altogether). I think we're talking about the original ownership_eia860 table here.
  • Rework the harvesting process slightly: "it should be possible to harvest selectively based on name and the foreign key relationships. Like any column that is already identified as referring to an entity table column (like utility_id_eia) even if it doesn't have exactly the same name as the column it refers to (e.g. owner_utility_id_eia) should still be considered a source for harvested values."

I'm not entirely sure where we stand on the harvesting process (#509 seems relevant). I guess my vote would be option 1 or 2 and then do option 3 when a more thorough harvesting revamp is conducted?
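Option 2 above amounts to a pre-harvest column swap. A minimal sketch, with a purely illustrative dataframe (the real ownership table has many more columns):

```python
import pandas as pd

# Hypothetical ownership records: operator IDs in utility_id_eia,
# owner IDs in owner_utility_id_eia, values invented for illustration.
own = pd.DataFrame({
    "utility_id_eia": [101, 102],          # operator IDs
    "owner_utility_id_eia": [201, 202],    # owner IDs
    "fraction_owned": [1.0, 0.5],
})
own = (
    own.rename(columns={
        "utility_id_eia": "operator_utility_id_eia",
        "owner_utility_id_eia": "utility_id_eia",
    })
    # If the operator can always be recovered from the plant ID and date,
    # the operator column could then be dropped entirely:
    .drop(columns=["operator_utility_id_eia"])
)
print(list(own.columns))  # ['utility_id_eia', 'fraction_owned']
```

After the swap, the owner IDs sit in the conventionally named column and would be picked up by the existing harvesting process.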

@katie-lamb
Member Author

I tried a quick fix to rework the harvesting process to harvest the owner utilities and results seem pretty good. I was able to harvest the missing ~1300 utilities but mysteriously lost ~400 utilities. Investigating where they went.

@zaneselvans
Member

zaneselvans commented Aug 16, 2023

#509 just retains columns that we can/should be harvesting using the current process.

I could be wrong, but I don't think that the current harvesting process is capable of harvesting the same entity ID / value (like utility_id_eia) from multiple columns in the same input dataframe, even though that's clearly a kind of data that can exist, since an entity can have a relationship with another entity of the same kind, as we see here with the owners and operators showing up in the same records.

We realized in retrospect that leaving the (operator) utility_id_eia in this table is not well normalized, since you can look up the operator based on the date and the plant ID. So one simple solution would be to just drop the operator ID column prior to harvesting, rename the owner IDs to be utility_id_eia and then all those owner IDs would get harvested, but we wouldn't be capturing any unique utility IDs that appeared only in the operator ID column in this table (of which there should be none but... you never know).

I think the "automatically detect columns to harvest based on FKs" solution suggested in #1393 probably isn't the right approach, since we harvest IDs from columns that don't end up in the final table, and so would not have any FK relationships defined -- as should be the case for this table (since to normalize the ownership table, we should be removing the operator utility_id_eia column and only keeping the owner ID column... which should probably be renamed to utility_id_eia as per our convention more broadly).

So I think the simple solution here is just to preemptively drop the operator ID and hope we don't lose any actually novel utility IDs, and the more involved solution is to allow the specification of multiple columns that should be harvested, mapping them to a particular destination column.

But anyway in either case I think we should be able to close that other longstanding issue, get a bunch of new Utility IDs in, and fix some currently missing FK relationships!

How did your attempted rework approach it? Any idea what happened with the lost utilities?

The broader harvesting revamp has been kind of in the icebox for ages now sadly.

@katie-lamb
Member Author

katie-lamb commented Aug 16, 2023

@zaneselvans I had similar thoughts to what you said above. The column mapping solution seemed to work pretty well.

How did your attempted rework approach it? Any idea what happened with the lost utilities?

Basically what you said above, where I added a dictionary to the EIA entities mapping from column names to the "standard" name of the column, i.e. owner_utility_id_eia to utility_id_eia.

This added 1228 new utilities. The missing 400 IDs that I previously mentioned were just because I had forgotten to recreate the intermediate pickled assets with the new 2022 data, but all the IDs are showing up now.

There are some slight changes to the non-ID columns with the owner utilities included. 206/15,301 records in the utility_entity_eia table have a different utility_name_eia, but for the most part these seem like superficial changes, e.g. "Banning, City of" vs. "City of Banning". This is a histogram of the string similarity/edit distance between the new and old names. Most of the records with 0 similarity are cases where one of the names is null. Most records with a score > 0.4 still share a token or two. I've also included this for street_address in the utilities_eia860 table, for which 151/125,280 rows changed. All the rest of the columns are unchanged, except for 1 or 2 records with a changed state, city, or zip code.

(Histograms attached: jaro_winkler_utility_name_sim, levenshtein_utility_name, jaro_winkler_street_address, levenshtein_street_address.)
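The name-comparison step described above can be sketched as follows. The original analysis used Jaro-Winkler and Levenshtein scores; this sketch substitutes the stdlib difflib ratio as a stand-in, and the column names are assumptions:

```python
import difflib
import pandas as pd

# Illustrative old/new name pairs, including a null and the
# "Banning, City of" vs "City of Banning" example from above.
old_new = pd.DataFrame({
    "utility_name_eia_old": ["Banning, City of", None],
    "utility_name_eia_new": ["City of Banning", "Some Utility"],
})

def name_sim(a, b):
    """Return 0.0 when either name is null, else a [0, 1] similarity."""
    if pd.isna(a) or pd.isna(b):
        return 0.0
    return difflib.SequenceMatcher(None, a, b).ratio()

old_new["name_sim"] = [
    name_sim(a, b)
    for a, b in zip(old_new["utility_name_eia_old"], old_new["utility_name_eia_new"])
]
# old_new["name_sim"].hist() would produce a histogram like those attached.
```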

Comment on lines 303 to 314
For each entity type (key), the ID columns, static columns, annual columns, and mapped
columns.

Mapped columns allow for harvesting the same entity ID / value relationship
from multiple columns in the same input dataframe. This is useful if an entity has a
relationship with another entity of the same kind, for example owner and operator utilities
showing up in the same ownership table records. The mapped column dictionary maps from
the column names of the second group of entity ID / value columns to the standard names
for the entity ID / value columns.

Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.
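A hypothetical sketch of what an ENTITIES entry with a mapped-columns dict might look like; the key names and column lists here are illustrative, not the actual PUDL schema:

```python
import pandas as pd

ENTITIES = {
    "utilities": {
        "id_cols": ["utility_id_eia"],
        "static_cols": ["utility_name_eia"],
        "annual_cols": ["street_address", "city", "state", "zip_code"],
        # Harvest a second utility relationship from the ownership table by
        # renaming owner IDs to the standard column name before harvesting.
        # A column mapping to itself (e.g. "state": "state") would mark a
        # shared attribute column harvested for this relationship as well.
        "map_cols_dict": {"owner_utility_id_eia": "utility_id_eia"},
    },
}

def apply_map_cols(df: pd.DataFrame, entity: str) -> pd.DataFrame:
    """Rename mapped columns to their standard names prior to harvesting."""
    map_cols = ENTITIES[entity].get("map_cols_dict", {})
    return df.rename(columns=map_cols)

own = pd.DataFrame({"owner_utility_id_eia": [201], "fraction_owned": [1.0]})
renamed = apply_map_cols(own, "utilities")
```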
Member Author

I tried to explain the change in this docstring, but maybe this doesn't make sense. Not sure what the best explanation of this behavior is that would generalize to potential future scenarios.

Member

I'm not understanding the last bit:

Mapped column dictionaries must include all columns that are to be harvested
for that relationship, even if a column name maps to itself.

Member Author

I think the example I was thinking of wouldn't actually come up, because it would require a table input that's not so well normalized/clean. I think I can take this part out, but this is what I was originally thinking:

  • There are the columns owner_utility_id_eia, utility_id_eia, but the owner and operator utility share a column, maybe state or something. Perhaps we're only looking at CA utilities so there's one state column shared between owner and operator
  • In this case, you'd want to harvest state for both the owner and operator utility, but with the owner utility you don't need to rename the state column, so you map it to itself state:state.

@zaneselvans
Member

Glad it was just some stale inputs causing the loss of IDs. I don't totally understand the "all columns to be harvested" note in your docstring or exactly what context the remapped columns are going to be used in -- is it going to try and harvest those columns for the entity in every table?

Is there a simpler solution here that deals with the poor normalization of the table first, and renames all of the owner columns to just use the normal column names, since there's no reason to have the operator ID reference in the table, except to make sure that we get all of the utility_id_eia values that show up anywhere? I.e. can we just look at harvesting multiple ID columns and not need to worry about potentially finding multiple copies of entity attribute columns in a single table (although I can see that being more general, in cases where we have multiple entities being referenced in the same table, potentially each with their own associated attributes).

Is there any reason not to redefine the final ownership table here to get rid of the unusual names and the utility_id_eia column that refers to the operator (post-harvesting)?

@katie-lamb
Member Author

except to make sure that we get all of the utility_id_eia values that show up anywhere? I.e. can we just look at harvesting multiple ID columns and not need to worry about potentially finding multiple copies of entity attribute columns in a single table (although I can see that being more general, in cases where we have multiple entities being referenced in the same table, potentially each with their own associated attributes).

In my current implementation, all of the multiple-ID-column relationships within a table are defined explicitly, and there is no automatic searching for multiple copies of entity attribute columns in a single table. Rather than dropping utility_id_eia and potentially missing some utility IDs, I thought this was a more general solution that could be easily applied to different tables. Christina also mentioned the goal of not dropping columns pre-harvesting.

Is there any reason not to redefine the final ownership table here to get rid of the unusual names and the utility_id_eia column that refers to the operator (post-harvesting)?

The post-harvesting table won't have a separate column name for owner utilities, just the utility_id_eia column that has both owner and operator utilities in it.

@katie-lamb katie-lamb marked this pull request as ready for review September 11, 2023 20:07
"mcoe": AssetIn(key="mcoe_generators_yearly"),
"own_eia860": AssetIn(key="denorm_ownership_eia860"),
},
io_manager_key=None,
Member Author

Is this all that needs to happen so that this table isn't written to the DB?

Member

There's a default IO manager which will be used if this is None, which could be set explicitly instead. But unless it's set to pudl_sqlite_io_manager it won't be written to the DB.
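A sketch of the two choices being discussed; the asset and input names follow the snippet in this PR, but the body is a placeholder:

```python
from dagster import AssetIn, asset

@asset(
    ins={
        "mcoe": AssetIn(key="mcoe_generators_yearly"),
        "own_eia860": AssetIn(key="denorm_ownership_eia860"),
    },
    # Leaving io_manager_key as None falls back to the default IO manager;
    # only setting it to "pudl_sqlite_io_manager" would write the output
    # to the PUDL SQLite DB.
    io_manager_key="pudl_sqlite_io_manager",
)
def mega_generators_eia(mcoe, own_eia860):
    ...
```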

pudl.helpers.cleanstrings_snake,
["record_id_eia", "appro_record_id_eia"],
)
.set_index("record_id_eia")
Member Author

You can't write a table to the DB where the index is a non-numeric column, right? I ended up just making record_id_eia a column and then setting it to be the index in the FERC to EIA match

Member

I could be wrong but I think the index column is ignored entirely when writing to the DB. The schema for the table, which is defined in the metadata, needs to indicate what set of columns comprise the primary key if there is one. If there's no natural primary key then I think it'll use an autoincrementing integer key, which is what happens with the fuel_receipts_costs table(s).
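The reset-the-index approach can be sketched with plain pandas and SQLite; table and column names are illustrative:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"record_id_eia": ["a_1", "b_2"], "capacity_mw": [1.0, 2.0]})
df = df.set_index("record_id_eia")

with sqlite3.connect(":memory:") as conn:
    # reset_index() turns record_id_eia back into an ordinary column;
    # index=False keeps pandas from writing a separate index column.
    df.reset_index().to_sql("plant_parts_eia", conn, index=False)
    cols = [row[1] for row in conn.execute("PRAGMA table_info(plant_parts_eia)")]
print(cols)  # ['record_id_eia', 'capacity_mw']
```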

Member

@zschira zschira left a comment

Looks good! My biggest comments are that I'm not sure the assets need to be in a factory function, and I'm not sure how/if the map_cols_dict is being used.

# A dictionary of columns representing additional data to be harvested,
# whose names should map to an ID, static, or annual column name.
(
ENTITIES[entity.value]["map_cols_dict"]
Member

This should be equivalent:

ENTITIES[entity.value].get("map_cols_dict")

I'm also not sure where this is being used, doesn't seem to be assigned to anything.

Member Author

Ha oops that's supposed to be assigned to something, must have messed with it for testing and not fixed it.

Member

Ah makes sense. I think this can still be changed to just:

map_cols_dict = ENTITIES[entity.value].get("map_cols_dict")

Member Author

Ah right! Good call!

@@ -365,6 +366,54 @@
]


def plant_parts_eia_asset_factory(
Member

Does this need to be in a factory, or could these two assets just stand on their own?

Member Author

Changed to stand alone assets.

src/pudl/transform/eia860.py (outdated, resolved)
src/pudl/metadata/resources/__init__.py (resolved)
"fraction_owned",
"data_maturity",
],
"primary_key": [
"report_date",
"plant_id_eia",
"generator_id",
"owner_utility_id_eia",
"utility_id_eia",
Member

Since utility_id_eia in this table refers to the owner rather than the operator we may want to give it a resource-specific metadata override at the bottom of pudl/metadata/fields.py so the description highlights this difference.

@katie-lamb
Member Author

@zschira For some reason the owner-utility harvesting changes got lost somewhere and weren't in the initial commits I pushed, so I just added them back in. Basically it uses the map_cols_dict to rename columns and harvest additional relationships from the tables.

I'm not sure why test_nuclear_fraction is failing. Somehow more nuclear gens were added? I'll poke into that later.

In addition to fixing that test, I need to update the release notes and merge dev.

Also @bendnorman I'm not sure if you want to get in here in the context of naming conventions. But tagging you just in case.

@bendnorman
Member

I have a few questions about the mega_gens and ppl assets:

  • Are there suitable natural keys for these tables?
  • Does every generator in mcoe_generators_monthly exist in mega_gens?
  • Eventually, should the ppl asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?
  • Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

@katie-lamb
Member Author

@zaneselvans Here is the test_nuclear_fraction failure in case it seems familiar to you. Otherwise I'll compare the tables to dev and see what's changed.

@zaneselvans
Member

It looks like the fraction of nuclear generation has just declined a little bit? Previously we were checking that it was 20% +/- 2%, but now it seems to be at 17.75%, which isn't sooo far off, and it looks like there was 1 big retirement in 2022 and another big retirement in 2021, so maybe that's all it is. Here are the NEI stats through 2021. And the failure here is just looking at 1 year of data, so it's going to be more volatile. My intuition is that either reducing the expected value to 19 +/- 2 or expanding the allowed range to 20 +/- 3 would be reasonable and fix the issue.
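The tolerance tweak being proposed can be sketched like this; the function name and thresholds are illustrative, not the actual PUDL validation test:

```python
def check_nuclear_fraction(actual_pct, expected_pct=20.0, tolerance_pct=2.0):
    """Return True if the nuclear generation share is within tolerance."""
    return abs(actual_pct - expected_pct) <= tolerance_pct

# 17.75% fails the old 20 +/- 2 bound but passes a widened 20 +/- 3:
assert not check_nuclear_fraction(17.75, 20.0, 2.0)
assert check_nuclear_fraction(17.75, 20.0, 3.0)
```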

@katie-lamb
Member Author

@zaneselvans Ok great, that's along the lines of what I thought was happening, I was just suspicious of why this PR (adding ownership utilities) would change anything with the nuclear fraction.

I'll try playing with the range of the expected value.

)


@asset(
Member Author

It would be nice to be able to generate the PPE for only a subset of years, how have we been doing that type of thing? Add context with a start and end year? Ideally you'd be able to run the fast ETL and only get those 2 years of PPE.

Member

What happens if you try and generate the PPE in the fast ETL right now? It'll only have the last couple of years of data available in the database to work with won't it?

Member Author

I guess I mean if you have full data in the database (with all the years) but you're making changes to the PPE and want to generate a new PPE with a couple years of data but don't want to wait the full 45 min

Member

Ah, okay. Hmm. A 45 minute asset that runs at the end of the DAG is going to be a bottleneck too. I wonder if there's anything we can do to split it up or speed it up to avoid extending the whole ETL by 45 minutes and also make it quick and easy to regenerate the whole thing on the fly in development. I'm not sure how to twiddle input parameters like this on a per-asset per-run basis but I assume it's possible.
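One way the year-restriction idea could look, as a hedged sketch: how the parameters get plumbed in (Dagster run config, an env var, etc.) is left abstract, and the function and column names are assumptions:

```python
import pandas as pd

def subset_years(gens: pd.DataFrame, start_year=None, end_year=None) -> pd.DataFrame:
    """Filter generator records to a year range before the expensive PPE build."""
    years = pd.to_datetime(gens["report_date"]).dt.year
    mask = pd.Series(True, index=gens.index)
    if start_year is not None:
        mask &= years >= start_year
    if end_year is not None:
        mask &= years <= end_year
    return gens[mask]

gens = pd.DataFrame({"report_date": ["2020-01-01", "2021-01-01", "2022-01-01"]})
recent = subset_years(gens, start_year=2021)
```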

Member Author

It probably wouldn't be hard to parallelize. Currently it loops through each plant part and aggregates records to that part level, then concatenates into one dataframe. This loop could be parallelized. But probably worth making that a separate issue/PR at this point.
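The parallelization idea could look something like this sketch; the part names and placeholder aggregation are illustrative, and CPU-bound work might prefer a process pool over the thread pool shown here:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

PLANT_PARTS = ["plant", "plant_unit", "plant_gen", "plant_prime_mover"]

def aggregate_part(part: str) -> pd.DataFrame:
    """Placeholder for the per-plant-part aggregation step."""
    return pd.DataFrame({"plant_part": [part], "n_records": [1]})

# Run the per-part aggregations concurrently, then concatenate once.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(aggregate_part, PLANT_PARTS))
plant_parts_eia = pd.concat(parts, ignore_index=True)
print(len(plant_parts_eia))  # 4
```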

Member

Agree, sounds like a good follow-up issue. Too many 45-minute assets will definitely get frustrating!

Member

@zschira zschira left a comment

This looks good to me once the tests are passing, and @bendnorman's questions are addressed


@zaneselvans
Member

Unfortunately @bendnorman I think these are probably @cmgosnell questions...

Are there suitable natural keys for these tables?
Does every generator in mcoe_generators_monthly exist in mega_gens?
Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

I don't know the answer to any of these.

Eventually, should the PPE asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?

I suspect the answer here is NO -- this table is incredibly difficult to work with and explain to users, and I think once we've created a much simpler join table that allows the FERC and EIA data to be merged, that's the output that should end up in the DB and be used to pull the FERC financial info into our huge generator attributes table.

@katie-lamb
Member Author

@bendnorman

  • Are there suitable natural keys for these tables?

Yes, sort of. The primary key for mega gens is "plant_id_eia", "generator_id", "report_date", "utility_id_eia", "ownership_record_type", but there are still 126 records with null utility_id_eia. The key for plant_parts_eia is record_id_eia, which is a string concatenation of plant ID, report year, plant part, ownership record type, utility ID, and retirement status.
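An illustrative sketch of how a record_id_eia-style key concatenates the fields listed above; the separator, field order, and example values are assumptions, not the actual PUDL format:

```python
def make_record_id(plant_id, report_year, plant_part, ownership, utility_id, status):
    """Join the key fields into a single string record ID."""
    fields = [plant_id, report_year, plant_part, ownership, utility_id, status]
    return "_".join(str(f) for f in fields)

rid = make_record_id(3, 2020, "plant_gen", "owned", 195, "retired")
print(rid)  # 3_2020_plant_gen_owned_195_retired
```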

  • Does every generator in mcoe_generators_monthly exist in mega_gens?

mcoe_generators_yearly is actually the input to mega_generators_eia, but yes, every generator in the yearly MCOE table exists in the mega generators table.

  • Eventually, should the ppl asset not live in the database if it's only used to link EIA and FERC plant parts? Is it a reasonable table for users to access?

As Zane says above, probably not. I think the only reason you'd want this table is to make the FERC to EIA connection.

  • Is the mega_gens table the mcoe_generators_monthly table but split out by ownership fractions?

It's the mcoe_generators_yearly table that's really the input here, but yes, it's essentially the MCOE split out by ownership fraction and with a couple FERC-y columns added on to help with the eventual match to FERC.

@codecov

codecov bot commented Sep 26, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (f6cbcb5) 88.6% compared to head (9615220) 88.6%.
Report is 28 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2714     +/-   ##
=======================================
- Coverage   88.6%   88.6%   -0.1%     
=======================================
  Files         90      90             
  Lines      10833   10808     -25     
=======================================
- Hits        9600    9576     -24     
+ Misses      1233    1232      -1     
Files Coverage Δ
src/pudl/analysis/ferc1_eia.py 98.0% <100.0%> (ø)
src/pudl/analysis/plant_parts_eia.py 96.5% <100.0%> (+<0.1%) ⬆️
src/pudl/helpers.py 86.3% <100.0%> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/output/eia860.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 89.1% <ø> (-1.1%) ⬇️
src/pudl/transform/eia.py 97.3% <100.0%> (+<0.1%) ⬆️
src/pudl/transform/eia860.py 88.8% <100.0%> (+<0.1%) ⬆️

... and 2 files with indirect coverage changes


@katie-lamb katie-lamb merged commit a64b03e into dev Sep 26, 2023
10 of 11 checks passed
@katie-lamb katie-lamb deleted the dagsterize-ppe branch September 26, 2023 17:02
Successfully merging this pull request may close these issues.

Convert FERC1-EIA record linkage outputs to Dagster assets