incorporate boiler_id into epacamd_eia table #2558

cmgosnell · 2023-05-01T13:41:04Z

PR Overview

note: this pr does not YET also update the alembic schema or remove the original epacamd_eia table from the db.

i think there are two high-level steps needed:

remove the duplicate glue db tables. rn there are two crosswalk glue databse tables : epacamd_eia_subplant_ids and epacamd_eia. i don’t think we need to keep them both as database tables if we are preserving the boiler id in the epacamd_eia_subplant_ids table. So I believe we could remove epacamd_eia as a db table.
- I think to do that you’d need to change the io_manager_key of this asset. I believe you can just remove the specific io manager listed here to the default which just makes the asset and stores it as a pickled df like the rest of the non-db assets.
- we might want to think about some naming cleanup as well perhaps epacamd_eia -> clean_epacamd_eia & epacamd_eia_subplant_ids -> epacamd_eia.
- W/ the pre-subplant_id asset converted to a non-db asset, we’ll need to condense the two crosswalk table schemas in metadata.resources.glue.RESOURCE_METADATA
update alembic! alembic was still vv fresh when i tried this and i had no idea how to do it.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

note: this commit does not also update the alembic schema or remove the original epacamd_eia table from the db.

codecov · 2023-05-01T15:26:14Z

Codecov Report

Patch coverage: 80.0% and project coverage change: -0.1 ⚠️

Comparison is base (3f2db6d) 88.4% compared to head (915b798) 88.4%.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #2558     +/-   ##
=======================================
- Coverage   88.4%   88.4%   -0.1%     
=======================================
  Files         88      88             
  Lines      10176   10176             
=======================================
- Hits        9001    8999      -2     
- Misses      1175    1177      +2

Impacted Files	Coverage Δ
src/pudl/metadata/resources/glue.py	`100.0% <ø> (ø)`
src/pudl/output/pudltabl.py	`93.2% <50.0%> (ø)`
src/pudl/etl/glue_assets.py	`98.2% <100.0%> (ø)`

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

aesharpe · 2023-06-29T16:42:07Z

@cmgosnell some thougts:

Taking a closer look at the epacamd_eia_subplant_ids table and seeing this part of the glue_assets.py module:

    epacamd_eia_complete = (
        augement_crosswalk_with_generators_eia860(epacamd_eia, generators_eia860)
        .pipe(augement_crosswalk_with_epacamd_ids, emissions_unit_ids_epacems)
        .pipe(augement_crosswalk_with_bga_eia860, boiler_generator_assn_eia860)
    )

When we augment the crosswalk with the generators_eia860 table we fill empty emissions_unit_id_epa values with existing generator_id values:

emissions_unit_id_epa=lambda x: x.emissions_unit_id_epa.fillna(x.generator_id)

I'm not sure this is the best idea, because many of these generators are renewables and thus don't have an associated smokestack as is implied by the existence of an emissions_unit_id_pudl. For example:

epacamd_eia_subplant_ids[(epacamd_eia_subplant_ids["plant_id_eia"]==141) & (epacamd_eia_subplant_ids["generator_id"]=="PV-3")]

	plant_id_eia	plant_id_epa	subplant_id	unit_id_pudl	emissions_unit_id_epa	generator_id	boiler_id
392	141	141	3	nan	PV-3	PV-3

generators_eia860[(generators_eia860["plant_id_eia"]==141) & (generators_eia860["generator_id"]=="PV-3")].drop_duplicates(["plant_id_eia"])[["plant_id_eia", "generator_id", "technology_description"]]

	plant_id_eia	generator_id	technology_description
31903	141	PV-3	Solar Photovoltaic

What is the impetus for equating emissions_unit_id_epa with generator_id?

…ssions_unit_id_epa with generator id. Also remove the replacement of empty plant_id_epa values with plant_id_eia to avoid confusion about which plants are reporting to EPA and which are not. Also add boiler id to the bga table merge in augement_crosswalk_with_bga_eia860. Noticed that there is an issue with some of the BGA relationships getting dropped during the subplant id creation, but I'm not sure how to fix this yet.

aesharpe · 2023-06-30T18:52:18Z

To address the above comment, instead of filling the NA gaps in emissions_unit_id_epa with generator_id, I created a temporary column called network_x_key that does the same thing. This column is then fed into the sub-plant id creation method instead of emissions_unit_id_epa and then gets dropped due to it's exclusion from the metadata, and the emissions_unit_id_epa column remains intact.

aesharpe · 2023-06-30T18:55:58Z

BGA-to-subplant_id bug

I noticed that we merge the sub-plant id table with the bga table, but the m:m bga boiler-generator relationship is not retained after the sub-plant id creation. I'm not exactly sure why, but I think it has to do with how the network x is currently calculating the sub-plant ids. I have a feeling there are some (hopefully) easy tweaks to fix this, but I'm not super familiar with network x.

for example:

bga_eia860[bga_eia860["plant_id_eia"]==52151]

	plant_id_epa	emissions_unit_id_epa	plant_id_eia	boiler_id	generator_id
4302	52151	1	52151	PB1	GEN1
4303	52151	1	52151	RF1	GEN1
4304	52151	1	52151	PB2	GEN2
4305	52151	1	52151	RF2	GEN2

Vs.

subplant_id[subplant_id["plant_id_epa"]==52151]

	plant_id_eia	plant_id_epa	subplant_id	unit_id_pudl	emissions_unit_id_epa	generator_id	boiler_id
24733	52151	52151	0	1	1	GEN1	RF1
24734	52151	52151	0	2	1	GEN2	RF2

cmgosnell · 2023-07-10T12:51:48Z

I noticed that we merge the sub-plant id table with the bga table, but the m:m bga boiler-generator relationship is not retained after the sub-plant id creation. I'm not exactly sure why, but I think it has to do with how the network x is currently calculating the sub-plant ids. I have a feeling there are some (hopefully) easy tweaks to fix this, but I'm not super familiar with network x.

hm good catch! i was wondering about this. it makes a degree of sense honestly that the boiler_id could be ignored/condensed. Conceptually it makes sense that the subplant id doesn't need to retain the detail about the boilers. from the generation side of things it should mostly care about the generators because that is largely what the epa linked to the smokestacks. So my first inclination would not be to go in and change the network_x process. (it might be good to drop the boiler_id column before/right after the subplant_id making so this doesn't confuse anyone in the future.

I thiiiink this means that if we want the boilers we need a 1:m merge of the bga table after the fact. since rn we have two version of the epacamd_eia glue table it would be good to ponder whether or not we should make a specific denorm_{whatever the table name is} table to receive these boiler ids. and somewhat related but not necessary to make this new table is whether the epacamd_eia should be dropped as a db table - like does it provide any unique information from the subplant_id & this new potential denorm_ table with the boiler_ids

…e subplant_id table. Add code to merge the boiler_id back into the subplant_id table after the network x process because some boilers get dropped.

aesharpe · 2023-07-18T17:33:03Z

Restarted the migrations because I was getting the error raise ValueError("Constraint must have a name") when I tried to create a new migration for this data off of the current dev branch. Conversations on slack in the technical questions channel indicated that this was the best course of action (for now).

aesharpe · 2023-07-18T17:38:47Z

src/pudl/etl/glue_assets.py

+    # the network x process uses unit_id_pudl and generator_id. During processing,
+    # the boiler_id data are truncated and we only retain unique values for generator_id
+    # and unit_id_pudl. This step adds the lost boiler_id info back into the table.
+    subplant_ids_updated = pd.merge(
+        subplant_ids_updated.drop(columns="boiler_id").assign(
+            unit_id_pudl=lambda x: x.unit_id_pudl.astype(
+                "float"
+            )  # necessary step for tests
+        ),
+        boiler_generator_assn_eia860[
+            [
+                "plant_id_eia",
+                "generator_id",
+                "boiler_id",
+                "unit_id_pudl",
+            ]
+        ].drop_duplicates(),
+        on=["plant_id_eia", "generator_id", "unit_id_pudl"],
+        how="outer",
+    ).drop_duplicates()
+


The reason that I added this chunk of code was because the network x process was dropping some of the boiler_ids (see comments above). This isn't pretty, but it remerges the subplant id table with the bga table so that we retain all the boiler-generator relationships.

There are two main reasons it's not ideal:

we already have a function called augement_crosswalk_with_bga_eia860() so it's weird / duplicative that we are merging the bga table with the subplant id table for a second time. (Can't find a way around this for now without changing how the network X is working which we agreed wasn't a good idea for right now).

The pytest test/unit/transform/glue_test.py::test_epacamd_eia_subplant_ids was failing without the line that converts the unit_id_pudl field to a float. This step makes the tests pass, but it causes the asset build to fail -.- grr: TypeError: Cannot interpret 'Int64Dtype()' as a data type. I don't know how to get around it other than adding this column dtype conversion step into the tests rather than here. But I think that would cause the same error in the tests. I would prefer to avoid this dtype issue if possible! It wasn't an issue before this merge was added which I don't really understand.

on your second point, my first instinct is to add a convert_dtypes() step in the unit tests for all input dfs. hard agree that this convert to float is not ideal.

and on #1... maybe we should slightly generalize the augment_crosswalk_with_bga_eia860. it looks like i? or you? idk added boiler_id into that merge. but if the nextwork_x step doesn't care about boilers/effectively drops boiler detail then they shouldn't be in the table pre-network_x anyway.

We could give augment_crosswalk_with_bga_eia860 an idx: list[str] arg that is either just ["plant_id_eia", "generator_id"] or ["plant_id_eia", "generator_id", "boiler_id"] and always boiler_generator_assn_eia860[idx + ["unit_id_pudl"]].drop_duplicates()

I don't think this function is used more than once so we could also just call it augment_crosswalk_with_unit_id_pudl and remove boiler_id from the merge cols

aesharpe · 2023-07-18T18:40:00Z

denorm_{whatever the table name is}

I think it would be really confusing to have a version of the subplant table with all the boilers and a version with just some or none. My vote is for a subplant table with all the boilers.

whether the epacamd_eia should be dropped as a db table - like does it provide any unique information from the subplant_id & this new potential denorm_ table with the boiler_ids

Technically it shouldn't provide anything new. If it does, we probably did something wrong. I think now that erroneous filling of emissions_unit_id_pudl has stopped it makes sense to just keep this subplant id table.

It probably makes sense then to change the names to something like epacamd_eia_annual for the crosswalk we are removing from the database and keep epacamd_eia_subplant_ids.

… fail in the emacamd_eia_subplant_ids function and add a convert_dtypes() function to the boiler_generator_assn_eia860_test table instead

…he metadata for the subplant_id table

aesharpe · 2023-07-18T20:58:46Z

I just removed the epacamd_eia table from the db. The only novel information it contained that is not encapsulated in the subplant_id table is the year that the crosswalk data was created with. The epacamd_eia table had a date column but the subplant_id table just takes the unique values form both years and calls it a day.

…ut removing that table from the database

…Replace epacamd_eia validation test with epacamd_eia_subplant_id validation test

cmgosnell

this all looks good! two hopefully small requests for tweaks. one here and one in comment land.

if we are removing the old epacamd_eia table (which i think is the right call!) can we rename the epacamd_eia_subplant_id -> epacamd_eia? its just such a nicer name.

cmgosnell · 2023-07-20T16:05:40Z

src/pudl/etl/glue_assets.py

-            ["plant_id_eia", "generator_id", "unit_id_pudl"]
+            [
+                "plant_id_eia",
+                "generator_id",
+                "unit_id_pudl",
+                "boiler_id",
+            ]
        ].drop_duplicates(),
        how="outer",
-        on=["plant_id_eia", "generator_id"],
-        validate="m:1",
+        on=["plant_id_eia", "generator_id", "boiler_id"],
+        validate="m:m",
    )


since this merge is happening pre-network x, it seems like we should probably revert these changes no? so there is no boiler_id pre-network_x.

I don't think it really matters? Because we need to merge with the bga table for the unit_id which is used in the network x process. We could drop the boiler_id more explicitly before the process, but it wouldn't make a huge difference.

incorporate boiler_id into epacamd_eia table

74d9083

note: this commit does not also update the alembic schema or remove the original epacamd_eia table from the db.

cmgosnell assigned cmgosnell and aesharpe and unassigned cmgosnell Jun 5, 2023

aesharpe added 2 commits June 27, 2023 15:45

Merge branch 'dev' into boiler_id_w_subplant

391de53

Merge branch 'dev' into boiler_id_w_subplant

77a4726

aesharpe added 3 commits June 29, 2023 19:15

Reset migrations

c68a758

Merge branch 'dev' into boiler_id_w_subplant

8eead72

aesharpe marked this pull request as ready for review June 30, 2023 19:05

aesharpe linked an issue Jul 3, 2023 that may be closed by this pull request

integrate new entergy deprish study + manual overrides/gap filling of deprish<>eia catalyst-cooperative/rmi-ferc1-eia#276

Open

3 tasks

aesharpe added 2 commits July 14, 2023 11:24

Merge branch 'dev' into boiler_id_w_subplant

3e3575b

Restart the migrations from scratch plus the boiler_id addition to th…

eca0ba0

…e subplant_id table. Add code to merge the boiler_id back into the subplant_id table after the network x process because some boilers get dropped.

aesharpe reviewed Jul 18, 2023

View reviewed changes

aesharpe added 2 commits July 18, 2023 14:53

Remove the janky dtype conversion that was causing the asset build to…

ab4e5d5

… fail in the emacamd_eia_subplant_ids function and add a convert_dtypes() function to the boiler_generator_assn_eia860_test table instead

Remove the epacamd_eia crosswalk table from the database and update t…

1729282

…he metadata for the subplant_id table

aesharpe added 2 commits July 18, 2023 18:00

Remove newly bad references to the epacamd_eia table and add note abo…

a4598ae

…ut removing that table from the database

Replace epacamd_eia output table with epacamd_eia_subplant_id table. …

915b798

…Replace epacamd_eia validation test with epacamd_eia_subplant_id validation test

cmgosnell commented Jul 20, 2023

View reviewed changes

Base automatically changed from dev to main January 5, 2024 04:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorporate boiler_id into epacamd_eia table #2558

incorporate boiler_id into epacamd_eia table #2558

cmgosnell commented May 1, 2023 •

edited by aesharpe

Loading

codecov bot commented May 1, 2023 •

edited

Loading

aesharpe commented Jun 29, 2023 •

edited

Loading

aesharpe commented Jun 30, 2023

aesharpe commented Jun 30, 2023 •

edited

Loading

cmgosnell commented Jul 10, 2023

aesharpe commented Jul 18, 2023

aesharpe Jul 18, 2023 •

edited

Loading

cmgosnell Jul 18, 2023

cmgosnell Jul 18, 2023

aesharpe Jul 18, 2023

aesharpe commented Jul 18, 2023

aesharpe commented Jul 18, 2023

cmgosnell left a comment

cmgosnell Jul 20, 2023

aesharpe Jul 24, 2023

incorporate boiler_id into epacamd_eia table #2558

Are you sure you want to change the base?

incorporate boiler_id into epacamd_eia table #2558

Conversation

cmgosnell commented May 1, 2023 • edited by aesharpe Loading

PR Overview

PR Checklist

codecov bot commented May 1, 2023 • edited Loading

Codecov Report

aesharpe commented Jun 29, 2023 • edited Loading

aesharpe commented Jun 30, 2023

aesharpe commented Jun 30, 2023 • edited Loading

BGA-to-subplant_id bug

cmgosnell commented Jul 10, 2023

aesharpe commented Jul 18, 2023

aesharpe Jul 18, 2023 • edited Loading

Choose a reason for hiding this comment

cmgosnell Jul 18, 2023

Choose a reason for hiding this comment

cmgosnell Jul 18, 2023

Choose a reason for hiding this comment

aesharpe Jul 18, 2023

Choose a reason for hiding this comment

aesharpe commented Jul 18, 2023

aesharpe commented Jul 18, 2023

cmgosnell left a comment

Choose a reason for hiding this comment

cmgosnell Jul 20, 2023

Choose a reason for hiding this comment

aesharpe Jul 24, 2023

Choose a reason for hiding this comment

cmgosnell commented May 1, 2023 •

edited by aesharpe

Loading

codecov bot commented May 1, 2023 •

edited

Loading

aesharpe commented Jun 29, 2023 •

edited

Loading

aesharpe commented Jun 30, 2023 •

edited

Loading

aesharpe Jul 18, 2023 •

edited

Loading