Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address issue where 861 ETL fails w/o all years of data #1671

Merged

Conversation

arengel
Copy link
Collaborator

@arengel arengel commented Jun 3, 2022

Proposing a minor change to how balancing authority fixes are applied so that not all years of data are required to make the fixes work. This should address #828.

In the existing process, when I try do something like this:

pudl_out = PudlTabl(...)
pudl_out.etl_eia861(eia861_settings=Eia861Settings(years=sorted(range(2005, 2021)))

I get an error in these lines of code because there are values in BA_ID_NAME_FIXES.index that are not in df.index.

df.loc[
BA_ID_NAME_FIXES.index, "balancing_authority_id_eia"
] = BA_ID_NAME_FIXES.balancing_authority_id_eia

I played around a little bit and I think replacing that with the merge I suggest here addresses the problem and all the tests appear to pass when I run them locally.

Given the small size of the change, I didn't think any additional documentation beyond the brief comment was needed but happy to add if that would be helpful.

update to `pudl.transform.eia861.balancing_authority` to address catalyst-cooperative#828 and allow the 861 ETL to run w/o all years of data
@zaneselvans zaneselvans linked an issue Jun 3, 2022 that may be closed by this pull request
@zaneselvans zaneselvans self-requested a review June 4, 2022 00:47
@arengel
Copy link
Collaborator Author

arengel commented Jun 6, 2022

It looks like the same failing test across the three ci-tests. It's tests.integration.output_test.test_plant_parts_eia_filled and specifically when it goes to get fuel cost data from the EIA API and gets a Forbidden error. Since I don't think my changes affected what is exercised by this test, I wonder if this has to do with this being a PR from a fork.

I looked a little deeper and noticed that the beginning of 'Run PyTest with Tox' looks different on this PR than #1656. Here it starts:

Run conda run -n pudl-test tox
  conda run -n pudl-test tox
  shell: /usr/bin/bash -e {0}
  env:
   CONDA_PKGS_DIR: /home/runner/conda_pkgs_dir
   API_KEY_EIA: 
ERROR conda.cli.main_run:execute(41): `conda run tox` failed. (See above for error)
GLOB sdist-make: /home/runner/work/pudl/pudl/setup.py

Whereas on on #1656 it looks like this:

Run conda run -n pudl-test tox
  conda run -n pudl-test tox
  shell: /usr/bin/bash -e {0}
  env:
    CONDA_PKGS_DIR: /home/runner/conda_pkgs_dir
    API_KEY_EIA: ***
GLOB sdist-make: /home/runner/work/pudl/pudl/setup.py

The missing API_KEY_EIA would explain the failure, though I'm not sure how to fix this issue or if I even can. On the ci-notify action, this PR is missing the SLACK_WEBHOOK_URL which could be why that fails.

@zaneselvans
Copy link
Member

Hey @arengel, yes, the tests are failing on here because you don't have access to the repository github secrets as an outside contributor.

The existing tests only check whether all years can be processed together, so while this change seems like it's fine, and doesn't break the tests, we don't really know how the system will behave when only partial sets of years are processed. In general within PUDL we've been ensuring that both all the data can be processed, and that just the most recent year of data can be processed (since we typically just process the most recent year in the integration tests). Though in the case of the EIA-861 we've only tested that all of the data can be processed.

I pulled down this branch and experimented with a few different combinations of years, and unsurprisingly most combinations failed. Just doing 2020, or 2019-2020, or 2005-2020 all failed for reasons other than the problem fixed by this PR, sometimes in other tables. One could also restrict the list of tables that are being processed, but the BA table in particular draws BA IDs and state associations from across a large number of tables to construct the final table, so you won't get the same outputs necessarily if you don't process all of the tables.

Is there a reason why you're trying to avoid processing all of the years of data? On my machine doing all the EIA-861 tables for all the years takes ~3 minutes.

@arengel
Copy link
Collaborator Author

arengel commented Jun 8, 2022

Thanks for taking a look at this @zaneselvans! And also for taking the time to test and explain the implications of what I was proposing, I definitely have a better understanding of how PUDL works now.

To answer your question, I was looking for a way to centrally control the years that we get when pulling 861 data in a way similar to how setting start_date and end_date in PudlTabl controls the years we get for 860 or 923. So that part of it is more organization than anything else. The reason I was looking at this in first place had to do with simplifying the process of validating the way we combine a number of different 861 tables.

Given that the goal here is convenience and not performance, we don’t want to affect the processing at all. We certainly don’t want the output to change based on the set of years we pull out or the set of tables processed. It’s now clear that even if using Eia861Settings to set the years worked (as in didn’t raise errors) it’s not what we’re looking for. Instead we’ll select the years we want from the data we get from PudlTabl.

I’ll leave this PR open in case this change adds useful flexibility for PUDL but we definitely don’t need it for the Hub.

@zaneselvans
Copy link
Member

Yeah I think the right thing to do is just select the years you want from the table after the whole thing is returned.

However, this PR still seems like an improvement -- there's no reason that the BA function should totally crash and burn just because some of the fixes are outside the range of years being processed, and I've tested that it produces exactly the same output as the current code when all years are processed, so I'm inclined to go ahead and merge it, even if it doesn't satisfy your original desire.

@zaneselvans zaneselvans merged commit 9562cdb into catalyst-cooperative:dev Jun 8, 2022
@arengel arengel deleted the eia861_ba_fix branch June 8, 2022 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

861 failing w/o all years of data on balancing authority fixes
2 participants