Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate v0.5.0 #1345

Merged
merged 4 commits into from
Nov 11, 2021
Merged

Validate v0.5.0 #1345

merged 4 commits into from
Nov 11, 2021

Conversation

zaneselvans
Copy link
Member

There are several changes here beyond the number of rows that we expect to see in each table which I would like to get some other eyeballs on quickly, including addition of new output tables and the way that we are looking for primary keys that uniquely identify records in the aggregated output tables.

* Updated the expected row counts for EIA, FERC, and MCOE output tables.
* Added gfn_eia923 (nuclear generation fuel) to outputs with expected row counts
* Added all_plants_ferc1 to outputs with expected row counts
* Added bf_eia923, gf_eia923, and gfn_eia923 to outputs expected to have unique primary
  keys (based on work we've done to clean those tables up)
* Changed aggregation of gfn_eia923 (monthly/annual) such that it includes the
  `nuclear_unit_id` field in the groupby.
* Alphabetized all of the lists of tables so that when we change them it's clearer which
  thing has been changed.
* Updated Datasette DB descriptions to include the old and new years that have been
  added.
* Un-hid some hidden tables from Datasette DB description since they are more
  informative now, with the coding setup we've got.
@codecov
Copy link

codecov bot commented Nov 11, 2021

Codecov Report

Merging #1345 (0e305b8) into dev (5d91dd4) will increase coverage by 0.06%.
The diff coverage is 62.50%.

❗ Current head 0e305b8 differs from pull request most recent head 2d0bc91. Consider uploading reports for the commit 2d0bc91 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1345      +/-   ##
==========================================
+ Coverage   83.23%   83.29%   +0.06%     
==========================================
  Files          62       62              
  Lines        6731     6737       +6     
==========================================
+ Hits         5602     5611       +9     
+ Misses       1129     1126       -3     
Impacted Files Coverage Δ
src/pudl/output/eia923.py 95.60% <62.50%> (-1.55%) ⬇️
src/pudl/analysis/timeseries_cleaning.py 88.84% <0.00%> (+0.44%) ⬆️
src/pudl/workspace/datastore.py 70.45% <0.00%> (+1.62%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d91dd4...2d0bc91. Read the comment docs.

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o/s of the # of rows, it looks like you alphabetized all of the output tables in the tests, added the nuclear gf tables and added the if nuclear in the gf output functions... It is hard to tell which tables you changed the PKs for but I looked through all of the ones you have here now and they all look reasonable. I made a comment/potentially suggestion for the future but it all looks reasonable enough for now.

("plants_eia860", "all"),
("utils_eia860", "all"),
("pu_eia860", "all"),
("bf_eia923", "all"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these just reorganized to get whatever failure you were getting to show up early?? or are they all now just alphabetized?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I alphabetized these lists, and also added new output tables to them, like gfn_eia923 and all_plants_ferc1

"report_year",
"utility_id_ferc1",
"plant_name_ferc1",
"capacity_mw" # Why does having capacity here make sense???
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there are duplicates based on the "report_year", "utility_id_ferc1", "plant_name_ferc1" and the cpacity distinguishes them... this makes me wonder why we are compiling all of these lists of columns that are in effect the PKs of the tables. Why wouldn't we just pull these directly from the metadata. And adjust when necessary bc obviously these are not all analogous to db tables.

The intention of this comment is to maybe lead to an issue for a future enhancement - not to lead to changes in this PR/release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we didn't have many of the primary keys actually encoded. Yes, now that we've got them available in the metadata, I think this can be done automatically (for the EIA stuff, not for FERC 1 which is still keyless).

In these comments I'm questioning whether checking for uniqueness even makes sense -- these tables don't have natural primary keys, and I don't know what exactly the argument is for using capacity to fake one.

src/pudl/output/eia923.py Show resolved Hide resolved
Copy link
Member

@bendnorman bendnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me. Thank you for catching the missing code for the gfn table.

@zaneselvans zaneselvans merged commit 7bb74e9 into dev Nov 11, 2021
@zaneselvans zaneselvans deleted the validate-v0.5.0 branch November 11, 2021 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants