Data validation errors after integrating eia860m 2020-11 #943

Closed
zaneselvans opened this issue Mar 9, 2021 · 1 comment · Fixed by #1103
@zaneselvans
Member

After simplifying our test suite setup (issue #942), I ran the data validation tests to make sure they still worked with the new setup. A few tables had more rows than expected, I think because of the integration of the eia860m data through November 2020. These included plants_eia860, utilities_eia860, pu_eia860, and generators_eia860, all of which would be expected to change with the addition of new generators.

However, there were some other data validation failures that don't really make sense and should be tracked down: a null distributed_generation column in the MCOE output, and too many records in the generation_fuel_eia923 table:

FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-1551264-1250340-104195] - ValueError: Too many records (128817>109404.75) in dataframe gf_eia923
FAILED test/validate/mcoe_test.py::test_no_null_cols_mcoe[eia_annual-mcoe-all] - ValueError: Null column: distributed_generation found in dataframe mcoe
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-1551264-1250340-104195] - ValueError: Too many records (1545804>1312857.0) in dataframe gf_eia923
FAILED test/validate/mcoe_test.py::test_no_null_cols_mcoe[eia_monthly-mcoe-all] - ValueError: Null column: distributed_generation found in dataframe mcoe
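
For reference, here is a minimal sketch of the kind of upper-bound row check that produces the "Too many records" errors above. It is not PUDL's actual implementation: the function name `check_max_rows` and its parameters are hypothetical, and the 5% margin is an assumption inferred from the reported limits, which match the last and middle numbers in the test IDs scaled by 1.05 (104195 × 1.05 = 109404.75 and 1250340 × 1.05 = 1312857.0).

```python
def check_max_rows(n_rows, expected_rows, margin=0.05, df_name=""):
    """Raise ValueError if n_rows exceeds expected_rows by more than margin.

    Hypothetical helper: the name, signature, and 5% default margin are
    assumptions made for illustration, not PUDL's real API.
    """
    max_rows = expected_rows * (1 + margin)
    if n_rows > max_rows:
        raise ValueError(
            f"Too many records ({n_rows}>{max_rows}) in dataframe {df_name}"
        )
    return n_rows


# Row counts taken from the eia_annual gf_eia923 failure above.
try:
    check_max_rows(128817, 104195, df_name="gf_eia923")
except ValueError as err:
    print(err)
```

Under this reading, adding rows for newly integrated data means either the expected row counts need updating or the extra rows point to a real problem, which is why the gf_eia923 counts needed investigating.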
@zaneselvans zaneselvans added the testing Writing tests, creating test data, automating testing, etc. label Mar 9, 2021
@cmgosnell cmgosnell self-assigned this Mar 16, 2021
@cmgosnell cmgosnell assigned aesharpe and unassigned cmgosnell Jun 7, 2021
@aesharpe aesharpe added the cli Scripts and other command line interfaces to PUDL. label Jul 5, 2021
@zaneselvans
Member Author

I added a list of deprecated columns to the mcoe null columns check, since there are some generator_eia860 columns which only have data prior to 2008, the earliest year for which we can calculate the MCOE / fuel costs based on our current methods. This fixes the distributed_generation error.

The generation_fuel row counts have also been investigated and updated.

Both of these changes are part of PR #1103.

zaneselvans added a commit that referenced this issue Jul 29, 2021
Created a list of deprecated columns which are skipped in the mcoe null
column validation, since they only exist in 2006 and earlier data, and
we can only calculate the MCOE for 2008 and later.

Updated the null column validation to check all columns before raising
an exception, so you can get a list of all of them in the error message.

Updated release notes to remove this piece of errata.

Closes #943
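
The two changes described in the commit message (skipping deprecated columns, and checking every column before raising) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from PR #1103: the function names are hypothetical, a plain dict of column name to list of values stands in for the dataframe, and every column other than distributed_generation is made up for the example.

```python
def find_null_columns(columns, deprecated=()):
    """Return the names of columns whose values are all None.

    Columns listed in `deprecated` are skipped. An empty column also
    counts as null, because all() of an empty sequence is True.
    """
    return [
        name
        for name, values in columns.items()
        if name not in deprecated and all(v is None for v in values)
    ]


def check_no_null_columns(columns, deprecated=(), df_name=""):
    """Raise one ValueError naming every entirely-null column."""
    null_cols = find_null_columns(columns, deprecated)
    if null_cols:
        raise ValueError(f"Null columns found in {df_name}: {null_cols}")


# Toy stand-in for the mcoe output; the values are illustrative only.
mcoe = {
    "plant_id_eia": [1, 2, 3],
    "distributed_generation": [None, None, None],
    "capacity_mw": [10.0, None, 25.5],
}

# Passes once distributed_generation is treated as deprecated.
check_no_null_columns(
    mcoe, deprecated=("distributed_generation",), df_name="mcoe"
)
```

Collecting the offending columns into a list first means a single test run reports all of them, rather than stopping at the first null column it finds.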