Duplicate generators in analysis.allocate_gen_fuel #3633

Open
grgmiller opened this issue May 14, 2024 · 1 comment
Labels
analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. bug Things that are just plain broken. eia923 Anything having to do with EIA Form 923

Comments

@grgmiller
Collaborator

Describe the bug

As noted in singularity-energy#3:
When running the pudl.analysis.allocate_gen_fuel pipeline for 2016 and 2017, we were getting a TypeError at group_duplicate_keys(), because this function was trying to groupby().sum() non-numeric columns like generator_retirement_date.

group_duplicate_keys() will only work if we drop any datetime and boolean columns before calling it, and we would also need to consider carefully whether any of the frac columns should be summed.
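To illustrate the point above, here is a minimal sketch of one way duplicate keys could be collapsed without summing non-numeric columns. This is not the actual group_duplicate_keys() implementation: the helper name sum_duplicate_keys, the toy data, and the operational column are hypothetical, and the sketch does not settle whether the frac columns should be summed.

```python
import pandas as pd


def sum_duplicate_keys(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper: collapse duplicate keys, summing only numeric columns."""
    # select_dtypes(include="number") leaves out datetime and boolean columns.
    numeric_cols = [
        c for c in df.select_dtypes(include="number").columns if c not in key_cols
    ]
    other_cols = [
        c for c in df.columns if c not in key_cols and c not in numeric_cols
    ]
    # Sum the numeric columns; keep the first value of everything else.
    agg = {c: "sum" for c in numeric_cols}
    agg.update({c: "first" for c in other_cols})
    return df.groupby(key_cols, as_index=False).agg(agg)


# Toy data (made up for illustration): one generator appears twice.
df = pd.DataFrame(
    {
        "plant_id_eia": [56846, 56846, 1],
        "generator_id": ["GTG1", "GTG1", "A"],
        "net_generation_mwh": [10.0, 5.0, 3.0],
        "generator_retirement_date": pd.to_datetime(
            ["2030-01-01", "2030-01-01", None]
        ),
        "operational": [False, False, True],  # hypothetical boolean column
    }
)

collapsed = sum_duplicate_keys(df, ["plant_id_eia", "generator_id"])
```

Taking the first value of the non-numeric columns is only an assumption made for this sketch. If duplicated rows can disagree on a date or a flag, that choice would hide the disagreement.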

We only ran into this issue with group_duplicate_keys() because there were duplicate keys in the dataframe, so where were the duplicate keys being introduced in the first place?

It turns out that when creating the gen_assoc table with associate_generator_tables(), one of the steps is remove_inactive_generators(), which removes certain generators by creating six different dataframes with different generators based on their operating status: existing, retiring_generators, retired_plants, proposed_generators, proposed_plants, and unassociated_plants. These six dataframes are then concat'ed together. Previously our assumption was that these six dataframes should be non-overlapping. However, it turns out that this is not always the case.

For example, in 2016, plant 56846 generator GTG1 ended up in both proposed_generators and proposed_plants, which was causing it to be duplicated.

We fix this by simply adding .drop_duplicates() after these six dataframes are concat'ed together. This fixes the issue that we were experiencing in 2016 and 2017.
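A minimal sketch of the overlap and the fix, using toy dataframes rather than the real remove_inactive_generators() code. Only three of the six dataframes are shown, and every row other than plant 56846 generator GTG1 is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for three of the six status-based dataframes.
existing = pd.DataFrame({"plant_id_eia": [1], "generator_id": ["A"]})
proposed_generators = pd.DataFrame(
    {"plant_id_eia": [56846], "generator_id": ["GTG1"]}
)
# GTG1 also shows up here, so the dataframes overlap.
proposed_plants = pd.DataFrame(
    {"plant_id_eia": [56846, 56846], "generator_id": ["GTG1", "GTG2"]}
)

combined = pd.concat([existing, proposed_generators, proposed_plants])
deduped = combined.drop_duplicates()

print(len(combined), len(deduped))
```

Note that drop_duplicates() with no arguments only removes rows that are identical in every column. Two rows that share a key but differ in some other column would both survive.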

For now, we will leave group_duplicate_keys() alone even though it does not work. It effectively acts as an error check: if there are ever any duplicate keys, it will raise a TypeError like the one we saw for 2016 and 2017.
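If an explicit check were preferred over relying on that TypeError, it could look something like the sketch below. The function assert_unique_keys is hypothetical and does not exist in pudl; the key columns are passed in by the caller.

```python
import pandas as pd


def assert_unique_keys(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    """Hypothetical guard: raise if any key combination appears more than once."""
    # keep=False marks every row that belongs to a duplicated key.
    dupes = df[df.duplicated(subset=key_cols, keep=False)]
    if not dupes.empty:
        keys = dupes[key_cols].drop_duplicates().to_dict(orient="records")
        raise ValueError(f"Duplicate keys found: {keys}")
    return df


# Toy data for illustration only.
unique_df = pd.DataFrame({"plant_id_eia": [1, 2], "generator_id": ["A", "B"]})
duplicated_df = pd.DataFrame(
    {"plant_id_eia": [56846, 56846], "generator_id": ["GTG1", "GTG1"]}
)

assert_unique_keys(unique_df, ["plant_id_eia", "generator_id"])  # passes silently
```

Calling assert_unique_keys(duplicated_df, ["plant_id_eia", "generator_id"]) raises a ValueError that names the offending key, which is easier to diagnose than a TypeError from a sum over a datetime column.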

For the pudl team: I'm not sure why you all were not hitting this error when running this same pipeline in pudl, but we were hitting it in OGE. Maybe it is because we are running this pipeline for a single year at a time, so this specific generator is showing up with a proposed status in these years, but maybe not in your version if you are running all years at once.

Bug Severity

  • Medium: With some effort, I can work around the bug.

To Reproduce

We are still using the 2023.12.02 version of pudl in OGE, but as far as I can tell, the analysis.allocate_gen_fuel code has not really changed since that version, so you should be able to reproduce this by running this pipeline for 2016 or 2017 data.

Expected behavior

A clear and concise description of what you expected to happen, or what you expected the data to look like.

Software Environment?

PUDL v.2023.12.02

Additional context

I didn't open a PR because our forked version of pudl is pretty far behind your current version, and since this seems to be a one-line change, I thought it may be faster for someone on the pudl team to introduce this. I also wasn't sure how you want to deal with the group_duplicate_keys() function - remove it, or modify it. I think that if duplicate keys actually existed in the data, this function would not currently behave as expected.

@grgmiller grgmiller added the bug Things that are just plain broken. label May 14, 2024
@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. labels May 14, 2024
@zaneselvans
Member

@cmgosnell does any of this ring a bell to you? I think you've got the most familiarity with these systems.

@grgmiller we basically only ever run the full ETL across all years of data, or 1-2 recent years in our integration testing, so there are many, many possible combinations of ETL parameters that could have unknown / unexpected behavior.

Projects
Status: New
Development

No branches or pull requests

2 participants