Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transform eia861 short form #3565

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

Nancy9ice
Copy link

@Nancy9ice Nancy9ice commented Apr 14, 2024

Closes #3540.

What problem does this address?
This addresses the lack of the eia861 short_form table

What did you change?

To-do list

  • The column mapping that lets us extract this data from the spreadsheets should be updated to use the same naming conventions as the rest of PUDL. E.g. ba_code should be balancing_authority_code_eia. Most of these columns should already be defined in pudl.metadata.fields. The column mappings are stored in src/pudl/package_data/eia861/column_maps/short_form_eia861.csv
  • A new core_eia861__yearly_short_form() asset will need to be defined in pudl.transform.eia861
  • Columns will need to be cleaned up to have appropriate NA values and to conform to the appropriate data types. See the other transform functions for examples. It looks like there are a number of boolean columns which currently contain Y or N. There may also be columns where only certain values are valid, which need to be cleaned up (e.g. do all state abbreviations actually correspond to a real state?)
  • The new short_form table should be integrated into the processes downstream which compile the table of balancing authorities
  • Either the 2019 short form data will need to be extracted from the other tables where it was reported and added to this one, or the contents of this table will need to be added to all those other tables. EDIT-- this now lives here: Add 2019 data to EIA861 Short Form table #3654
  • If we're adding a new table to the PUDL DB, it will need to have a schema defined in pudl.metadata.resources.eia861

Testing

How did you make sure this worked? How can a reviewer verify this?

I confirmed that this worked after I ran the 'make pytest-coverage' command. @zaneselvans please could you run the build-deploy-pudl GitHub Action so that I'll see if there's any other thing to do in the code?

@zaneselvans
Copy link
Member

Note that 150+ auto-generated RST files that document the API have been commited in this PR. They're not supposed to be part of the repository, and are built as an intermediate step between the Python modules and the HTML output of the docs.

@Nancy9ice
Copy link
Author

Note that 150+ auto-generated RST files that document the API have been commited in this PR. They're not supposed to be part of the repository, and are built as an intermediate step between the Python modules and the HTML output of the docs.

Okay, I'll backtrack the changes and update the PR

This reverts commit 574dbbc.

remove index files
@Nancy9ice Nancy9ice force-pushed the transform-eia861-short-form branch 2 times, most recently from 6d95595 to a33045a Compare April 16, 2024 07:37
@Nancy9ice
Copy link
Author

@zaneselvans I have backtracked the changes and removed the index files that were added before

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @Nancy9ice !

See my comments about column names etc. (Many of which are old things we did poorly and would love for you to help fix).

Other Updates:

  • Add a note to our release_notes.rst file explaining the addition of this table
  • Add this table to the list of foreign_key_rule exceptions in the schema for core_eia860__scd_utilities in metadata/resources/eia860

src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Outdated Show resolved Hide resolved
],
"primary_key": [
"utility_id_eia",
"state",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is state definitely a primary key?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is state definitely a primary key?

'state' is not the primary key independently. It is one of the variables that contribute to the compound primary key. All the fields stated under the 'primary_key' all come together to form the compound primary key. Am I wrong about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I just wanted to check! (I looked at it in a notebook and it seems like balancing_authority_code_eia doesn't need to be part of the primary key (the tables are the same length if you drop duplicates without balancing_authority_code_eia) however, it's possible that in the future this column could be part of the primary key? @cmgosnell what do you think here?

src/pudl/transform/eia861.py Outdated Show resolved Hide resolved
@aesharpe
Copy link
Member

Also @Nancy9ice take a look at #3540 if you haven't already and make sure to address the issues there too.

@aesharpe aesharpe added new-data Requests for integration of new data. eia861 Anything having to do with EIA Form 861 labels Apr 17, 2024
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@aesharpe
Copy link
Member

aesharpe commented Apr 23, 2024

Hi @Nancy9ice, thanks for these edits (and sorry it's taken a minute to respond)--

Generally speaking we like to keep our PRs pretty small. When I was suggesting changing the boolean column names, I meant those pertaining to the short_form table not the whole of PUDL. But it's awesome that you gave that a shot!! If they all work nicely it will be a great addition to PUDL. However, it's harder to test for failures when there are lots of different types of changes in one PR. I would recommend just changing the column names in the short_form table and then splitting out the other boolean column name changes into a separate PR. Once you have the column changes in a separate PR, see if you can spin up Dagster and run the ETL. That will tell you where the errors are!

That way we can test run the whole ETL on the column name changes and not have to worry about short form table errors getting in the way (or vice versa).

@Nancy9ice
Copy link
Author

Hi @Nancy9ice, thanks for these edits (and sorry it's taken a minute to respond)--

Generally speaking we like to keep our PRs pretty small. When I was suggesting changing the boolean column names, I meant those pertaining to the short_form table not the whole of PUDL. But it's awesome that you gave that a shot!! If they all work nicely it will be a great addition to PUDL. However, it's harder to test for failures when there are lots of different types of changes in one PR. I would recommend just changing the column names in the short_form table and then splitting out the other boolean column name changes into a separate PR. Once you have the column changes in a separate PR, see if you can spin up Dagster and run the ETL. That will tell you where the errors are!

That way we can test run the whole ETL on the column name changes and not have to worry about short form table errors getting in the way (or vice versa).

Okay. I have done the edits. This time, I touched only the columns associated with the short-form table. That is, green_pricing, demand_side_management, water_heater, net_metering, and time_responsive_programs. I also ensured that these column changes were reflected in the other tables that bore these columns too. I'll await your feedback as regards these changes while working on the other changes you suggested under this pull request.

@zaneselvans
Copy link
Member

My guess is the right way to do this is pull the short form data out of all the other tables where it got stashed in 2019, and integrate it with the other short form data. I think the short form is a simplified / less extensive version of a lot of the data and probably doesn't match the schema / meaning of the full form exactly. This would result in the other tables being more uniform across all the years, and the short form data being all in a single place with a single format. But would also probably mean needing to touch all of the existing transforms to exclude the short form data, and then creating another set of short-form-only transforms that feeds into something that combines the different sources of short form data together into a unified table. Which does sound a bit messy.

src/pudl/metadata/resources/eia861.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia861.py Show resolved Hide resolved
src/pudl/transform/eia861.py Outdated Show resolved Hide resolved
src/pudl/transform/eia861.py Show resolved Hide resolved
balancing_authority_code_eia,-1,4,4,5,5,5,5,-1,5,5,5
sales_revenue,4,5,5,6,6,6,6,-1,6,6,6
sales_mwh,5,6,6,7,7,7,7,-1,7,7,7
customers,6,7,7,8,8,8,8,-1,8,8,8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this would have consequences for other tables and so we don't want to deal with it right now, but I think this should probably be num_customers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the src.pudl.metadata.fields, there's the customers key in the JSON fields. There's no num_customers or total_customers. Please check to confirm that this is correct so I'll know if I still need to change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "customers key"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nancy9ice what I meant here is that I think we originally did a poor job of naming this column, or that it was named before we had thought about our naming conventions, so I was suggesting that we take this opportunity to rename customers which is pretty vague, to something that provides a bit more context about the column, like customers_num or num_customers though searching through fields.py for other instances of _num and num_ it doesn't seem as if we have established a strong convention right now, so maybe we should just ignore this for now and deal with it in a future round of renaming / deeper EIA-861 integration.

@aesharpe
Copy link
Member

Hey @Nancy9ice I added one little comment, merged in main, and did some alembic gymnastics. What do you think about @zaneselvans's comment above regarding the 2019 data?

@zaneselvans
Copy link
Member

The fact that the 2019 short form data is interwoven with all of the other data is unfortunate, and seems like it'll take a bit of work to untangle. Probably just getting a table with all the other short form data available in the database is a good incremental step forward, and we should write up a separate issue about removing all the 2019 short form data from the other tables and integrating it into the dedicated short-form table to be addressed in the future.

@aesharpe
Copy link
Member

aesharpe commented May 22, 2024

I can write up an issue for the last legs here and move forward with merging this. Might be a good_first_issue for folks.

@Nancy9ice
Copy link
Author

Hey @Nancy9ice I added one little comment, merged in main, and did some alembic gymnastics. What do you think about @zaneselvans's comment above regarding the 2019 data?

It's a good idea to have a separate issue for that because it would require series of steps just like this issue

@zaneselvans
Copy link
Member

It looks like there's some alembic migration / database schema issues in here that need to be fixed. I tried running the tests locally and got the same multiple heads problem that's causing the unit tests to fail in CI.

@Nancy9ice
Copy link
Author

It looks like there's some alembic migration / database schema issues in here that need to be fixed. I tried running the tests locally and got the same multiple heads problem that's causing the unit tests to fail in CI.

Can I get the full error so I'll see what's wrong?

@zaneselvans
Copy link
Member

@Nancy9ice I think Austen has fixed the alembic issue, but for future reference all of the checks that run on a PR provide "Details" links to the logs generated by that check, and it's a good idea to click through and take a look at what happened when one of them fails.

I'm wondering if you've installed our git pre-commit hooks since it looks like the Ruff linter is failing within the pre-commit.ci check, and that (along with the unit tests) should be getting run by the pre-commit hooks automatically whenever you try to make a commit locally.

@zaneselvans
Copy link
Member

zaneselvans commented May 23, 2024

EDIT sorry, this was my bad. I hadn't rematerialized the furthest upstream extracted EIA-861 assets and there were some column name changes.

Now I'm getting a KeyError when attempting to materialize the new core_eia861__yearly_short_form table.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_eia861__yearly_short_form":

...

KeyError: 'balancing_authority_code_eia'

  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_utils/__init__.py", line 468, in iterate_with_context
    next_output = next(iterator)
                  ^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/execution/plan/compute_generator.py", line 141, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
             ^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/execution/plan/compute_generator.py", line 129, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
                                                                    ^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/transform/eia861.py", line 1247, in core_eia861__yearly_short_form
    raw_sf["balancing_authority_code_eia"] = raw_sf[
                                             ^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err

...

@Nancy9ice
Copy link
Author

@Nancy9ice I think Austen has fixed the alembic issue, but for future reference all of the checks that run on a PR provide "Details" links to the logs generated by that check, and it's a good idea to click through and take a look at what happened when one of them fails.

I'm wondering if you've installed our git pre-commit hooks since it looks like the Ruff linter is failing within the pre-commit.ci check, and that (along with the unit tests) should be getting run by the pre-commit hooks automatically whenever you try to make a commit locally.

Oh, I think I skipped that. I'll add that now.

@aesharpe
Copy link
Member

Here's the issue for the 2019 data! #3654

@aesharpe
Copy link
Member

Now I'm getting a KeyError when attempting to materialize the new core_eia861__yearly_short_form table.

I was able to successfully materialize the core_eia861__yearly_short_form table and all it's predecessors. Not sure what might be going on here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
eia861 Anything having to do with EIA Form 861 new-data Requests for integration of new data.
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

Transform the EIA-861 short form table
3 participants