Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorporate boiler_id into epacamd_eia table #2558

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented May 1, 2023

PR Overview

note: this pr does not YET also update the alembic schema or remove the original epacamd_eia table from the db.

i think there are two high-level steps needed:

  • remove the duplicate glue db tables. rn there are two crosswalk glue databse tables : epacamd_eia_subplant_ids and epacamd_eia. i don’t think we need to keep them both as database tables if we are preserving the boiler id in the epacamd_eia_subplant_ids table. So I believe we could remove epacamd_eia as a db table.
    • I think to do that you’d need to change the io_manager_key of this asset. I believe you can just remove the specific io manager listed here to the default which just makes the asset and stores it as a pickled df like the rest of the non-db assets.
    • we might want to think about some naming cleanup as well perhaps epacamd_eia -> clean_epacamd_eia & epacamd_eia_subplant_ids -> epacamd_eia.
    • W/ the pre-subplant_id asset converted to a non-db asset, we’ll need to condense the two crosswalk table schemas in metadata.resources.glue.RESOURCE_METADATA
  • update alembic! alembic was still vv fresh when i tried this and i had no idea how to do it.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

note: this commit does not also update the alembic schema or remove
the original epacamd_eia table from the db.
@codecov
Copy link

codecov bot commented May 1, 2023

Codecov Report

Patch coverage: 80.0% and project coverage change: -0.1 ⚠️

Comparison is base (3f2db6d) 88.4% compared to head (915b798) 88.4%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2558     +/-   ##
=======================================
- Coverage   88.4%   88.4%   -0.1%     
=======================================
  Files         88      88             
  Lines      10176   10176             
=======================================
- Hits        9001    8999      -2     
- Misses      1175    1177      +2     
Impacted Files Coverage Δ
src/pudl/metadata/resources/glue.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 93.2% <50.0%> (ø)
src/pudl/etl/glue_assets.py 98.2% <100.0%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@cmgosnell cmgosnell assigned cmgosnell and aesharpe and unassigned cmgosnell Jun 5, 2023
@aesharpe
Copy link
Member

aesharpe commented Jun 29, 2023

@cmgosnell some thougts:

Taking a closer look at the epacamd_eia_subplant_ids table and seeing this part of the glue_assets.py module:

    epacamd_eia_complete = (
        augement_crosswalk_with_generators_eia860(epacamd_eia, generators_eia860)
        .pipe(augement_crosswalk_with_epacamd_ids, emissions_unit_ids_epacems)
        .pipe(augement_crosswalk_with_bga_eia860, boiler_generator_assn_eia860)
    )

When we augment the crosswalk with the generators_eia860 table we fill empty emissions_unit_id_epa values with existing generator_id values:

emissions_unit_id_epa=lambda x: x.emissions_unit_id_epa.fillna(x.generator_id)

I'm not sure this is the best idea, because many of these generators are renewables and thus don't have an associated smokestack as is implied by the existence of an emissions_unit_id_pudl. For example:

epacamd_eia_subplant_ids[(epacamd_eia_subplant_ids["plant_id_eia"]==141) & (epacamd_eia_subplant_ids["generator_id"]=="PV-3")]
plant_id_eia plant_id_epa subplant_id unit_id_pudl emissions_unit_id_epa generator_id boiler_id
392 141 141 3 nan PV-3 PV-3
generators_eia860[(generators_eia860["plant_id_eia"]==141) & (generators_eia860["generator_id"]=="PV-3")].drop_duplicates(["plant_id_eia"])[["plant_id_eia", "generator_id", "technology_description"]]
plant_id_eia generator_id technology_description
31903 141 PV-3 Solar Photovoltaic

What is the impetus for equating emissions_unit_id_epa with generator_id?

…ssions_unit_id_epa with generator id. Also remove the replacement of empty plant_id_epa values with plant_id_eia to avoid confusion about which plants are reporting to EPA and which are not.

Also add boiler id to the bga table merge in augement_crosswalk_with_bga_eia860.

Noticed that there is an issue with some of the BGA relationships getting dropped during the subplant id creation, but I'm not sure how to fix this yet.
@aesharpe
Copy link
Member

To address the above comment, instead of filling the NA gaps in emissions_unit_id_epa with generator_id, I created a temporary column called network_x_key that does the same thing. This column is then fed into the sub-plant id creation method instead of emissions_unit_id_epa and then gets dropped due to it's exclusion from the metadata, and the emissions_unit_id_epa column remains intact.

@aesharpe
Copy link
Member

aesharpe commented Jun 30, 2023

BGA-to-subplant_id bug

I noticed that we merge the sub-plant id table with the bga table, but the m:m bga boiler-generator relationship is not retained after the sub-plant id creation. I'm not exactly sure why, but I think it has to do with how the network x is currently calculating the sub-plant ids. I have a feeling there are some (hopefully) easy tweaks to fix this, but I'm not super familiar with network x.

for example:

bga_eia860[bga_eia860["plant_id_eia"]==52151]
plant_id_epa emissions_unit_id_epa plant_id_eia boiler_id generator_id
4302 52151 1 52151 PB1 GEN1
4303 52151 1 52151 RF1 GEN1
4304 52151 1 52151 PB2 GEN2
4305 52151 1 52151 RF2 GEN2

Vs.

subplant_id[subplant_id["plant_id_epa"]==52151]
plant_id_eia plant_id_epa subplant_id unit_id_pudl emissions_unit_id_epa generator_id boiler_id
24733 52151 52151 0 1 1 GEN1 RF1
24734 52151 52151 0 2 1 GEN2 RF2

@aesharpe aesharpe marked this pull request as ready for review June 30, 2023 19:05
@cmgosnell
Copy link
Member Author

I noticed that we merge the sub-plant id table with the bga table, but the m:m bga boiler-generator relationship is not retained after the sub-plant id creation. I'm not exactly sure why, but I think it has to do with how the network x is currently calculating the sub-plant ids. I have a feeling there are some (hopefully) easy tweaks to fix this, but I'm not super familiar with network x.

hm good catch! i was wondering about this. it makes a degree of sense honestly that the boiler_id could be ignored/condensed. Conceptually it makes sense that the subplant id doesn't need to retain the detail about the boilers. from the generation side of things it should mostly care about the generators because that is largely what the epa linked to the smokestacks. So my first inclination would not be to go in and change the network_x process. (it might be good to drop the boiler_id column before/right after the subplant_id making so this doesn't confuse anyone in the future.

I thiiiink this means that if we want the boilers we need a 1:m merge of the bga table after the fact. since rn we have two version of the epacamd_eia glue table it would be good to ponder whether or not we should make a specific denorm_{whatever the table name is} table to receive these boiler ids. and somewhat related but not necessary to make this new table is whether the epacamd_eia should be dropped as a db table - like does it provide any unique information from the subplant_id & this new potential denorm_ table with the boiler_ids

…e subplant_id table. Add code to merge the boiler_id back into the subplant_id table after the network x process because some boilers get dropped.
@aesharpe
Copy link
Member

Restarted the migrations because I was getting the error raise ValueError("Constraint must have a name") when I tried to create a new migration for this data off of the current dev branch. Conversations on slack in the technical questions channel indicated that this was the best course of action (for now).

Comment on lines 348 to 368
# the network x process uses unit_id_pudl and generator_id. During processing,
# the boiler_id data are truncated and we only retain unique values for generator_id
# and unit_id_pudl. This step adds the lost boiler_id info back into the table.
subplant_ids_updated = pd.merge(
subplant_ids_updated.drop(columns="boiler_id").assign(
unit_id_pudl=lambda x: x.unit_id_pudl.astype(
"float"
) # necessary step for tests
),
boiler_generator_assn_eia860[
[
"plant_id_eia",
"generator_id",
"boiler_id",
"unit_id_pudl",
]
].drop_duplicates(),
on=["plant_id_eia", "generator_id", "unit_id_pudl"],
how="outer",
).drop_duplicates()

Copy link
Member

@aesharpe aesharpe Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that I added this chunk of code was because the network x process was dropping some of the boiler_ids (see comments above). This isn't pretty, but it remerges the subplant id table with the bga table so that we retain all the boiler-generator relationships.

There are two main reasons it's not ideal:

  1. we already have a function called augement_crosswalk_with_bga_eia860() so it's weird / duplicative that we are merging the bga table with the subplant id table for a second time. (Can't find a way around this for now without changing how the network X is working which we agreed wasn't a good idea for right now).

  2. The pytest test/unit/transform/glue_test.py::test_epacamd_eia_subplant_ids was failing without the line that converts the unit_id_pudl field to a float. This step makes the tests pass, but it causes the asset build to fail -.- grr: TypeError: Cannot interpret 'Int64Dtype()' as a data type. I don't know how to get around it other than adding this column dtype conversion step into the tests rather than here. But I think that would cause the same error in the tests. I would prefer to avoid this dtype issue if possible! It wasn't an issue before this merge was added which I don't really understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on your second point, my first instinct is to add a convert_dtypes() step in the unit tests for all input dfs. hard agree that this convert to float is not ideal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and on #1... maybe we should slightly generalize the augment_crosswalk_with_bga_eia860. it looks like i? or you? idk added boiler_id into that merge. but if the nextwork_x step doesn't care about boilers/effectively drops boiler detail then they shouldn't be in the table pre-network_x anyway.

We could give augment_crosswalk_with_bga_eia860 an idx: list[str] arg that is either just ["plant_id_eia", "generator_id"] or ["plant_id_eia", "generator_id", "boiler_id"] and always boiler_generator_assn_eia860[idx + ["unit_id_pudl"]].drop_duplicates()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this function is used more than once so we could also just call it augment_crosswalk_with_unit_id_pudl and remove boiler_id from the merge cols

@aesharpe
Copy link
Member

denorm_{whatever the table name is}

I think it would be really confusing to have a version of the subplant table with all the boilers and a version with just some or none. My vote is for a subplant table with all the boilers.

whether the epacamd_eia should be dropped as a db table - like does it provide any unique information from the subplant_id & this new potential denorm_ table with the boiler_ids

Technically it shouldn't provide anything new. If it does, we probably did something wrong. I think now that erroneous filling of emissions_unit_id_pudl has stopped it makes sense to just keep this subplant id table.

It probably makes sense then to change the names to something like epacamd_eia_annual for the crosswalk we are removing from the database and keep epacamd_eia_subplant_ids.

… fail in the emacamd_eia_subplant_ids function and add a convert_dtypes() function to the boiler_generator_assn_eia860_test table instead
@aesharpe
Copy link
Member

I just removed the epacamd_eia table from the db. The only novel information it contained that is not encapsulated in the subplant_id table is the year that the crosswalk data was created with. The epacamd_eia table had a date column but the subplant_id table just takes the unique values form both years and calls it a day.

…Replace epacamd_eia validation test with epacamd_eia_subplant_id validation test
Copy link
Member Author

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks good! two hopefully small requests for tweaks. one here and one in comment land.

if we are removing the old epacamd_eia table (which i think is the right call!) can we rename the epacamd_eia_subplant_id -> epacamd_eia? its just such a nicer name.

Comment on lines -392 to 417
["plant_id_eia", "generator_id", "unit_id_pudl"]
[
"plant_id_eia",
"generator_id",
"unit_id_pudl",
"boiler_id",
]
].drop_duplicates(),
how="outer",
on=["plant_id_eia", "generator_id"],
validate="m:1",
on=["plant_id_eia", "generator_id", "boiler_id"],
validate="m:m",
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this merge is happening pre-network x, it seems like we should probably revert these changes no? so there is no boiler_id pre-network_x.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it really matters? Because we need to merge with the bga table for the unit_id which is used in the network x process. We could drop the boiler_id more explicitly before the process, but it wouldn't make a huge difference.

Base automatically changed from dev to main January 5, 2024 04:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

integrate new entergy deprish study + manual overrides/gap filling of deprish<>eia
2 participants