-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate the raw ferc1 tables in the settings #2168
Conversation
Codecov ReportBase: 85.5% // Head: 85.5% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## dev #2168 +/- ##
=====================================
Coverage 85.5% 85.5%
=====================================
Files 73 73
Lines 8882 8911 +29
=====================================
+ Hits 7599 7627 +28
- Misses 1283 1284 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is entangling the FERC to SQLite settings and the PUDL ETL settings in a way that we haven't before. What is the desired entanglement?
IF the settings include both PUDL ETL and FERC to SQLite settings, THEN we are going to require that the FERC to SQLite settings provide any tables which are going to be needed by the PUDL ETL?
If the settings file does NOT include any FERC to SQLite settings at all, then we can't check this, and we're going to assume that the user already has the necessary FERC DBs?
A new and maybe unexpected error that could result from this: If a user already has complete FERC DBs and is using a settings file that specifies only a subset of the FERC to SQLite tables to control a PUDL ETL, they would get errors even though their PUDL ETL would have worked fine. Is this the behavior we want? Seems like it would be annoying but rare.
Thinking about whether you should fill in required tables if they are missing, it seems like we could construct a complete ferc_to_sqlite_settings
object based just on the contents of the PUDL Ferc1Settings -- including both the required years and tables, and then not need to specify those settings at all or worry about them being inconsistent with the PUDL ETL settings which would be great!
I think just quietly filling in missing tables as needed will result in inconsistencies between what's in the settings file, and what's actually getting processed -- we'll forget to add a table to the settings and it won't break anything, and later someone will be debugging and it will be very confusing to see specific settings which don't actually control the process they claim to control.
So I think we should either go fully automatic and construct the required FERC to SQLite settings based on the PUDL ETL settings, or raise an exception and tell the user to fix their manually compiled settings if they are internally inconsistent.
src/pudl/settings.py
Outdated
@root_validator(pre=False) | ||
def raw_table_validation_ferc1(cls, field_values): | ||
"""Require the presence of raw FERC1 tables needed for PUDL table creation.""" | ||
# only check if we are actually loading any pudl tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this comment. It seems like you're checking whether any FERC DBF or XBRL tables are being loaded into their respective DBs, not whether any PUDL tables are being loaded. Do we want to do this validation if no FERC 1 tables are being loaded into PUDL? In that case then any set of FERC to SQLite settings would be valid, since no tables are required downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If any PUDL tables are being loaded, we use that list of PUDL tables to check for the required raw tables (via the table map).
- If we have any ferc1_to_sqlite settings, we check if those settings include the required tables
Right now, when we check if we have the required tables, we use a difference
- not a symetrical_difference
. We want to know if there are required tables that aren't in our ferc_to_sqlite settings. This means we could load a PUDL db with zero ferc-derived tables alongside ferc_to_sqlite tables.
thanks @zaneselvans ! i think I agree that the quiet add the table would end up being weird with a discrepancy between the settings files and the dbs created. fully automaticI personally lean slightly away from going fully automatic - always generating the list of ferc_to_sqlite tables, which could include:
This would probably preclude loading only specific ferc_to_sqlite tables with a non-PUDL, unless you compiled a list of PUDL tables that you weren't actually trying to load to fake out the settings. assert-onlyThe assert-only method sounds easier - its basically already implemented... but it would require a little more awareness on the users part of the dependencies of these two settings whether or not they are loading the ferc_to_sqlite dbs.
I think the main weirdness in the assert-only version is when your loading only a PUDL db. You will either need to pass in NO ferc_to_sqlite dbf/xbrl settings OR fully valid ones. imo we can clearly communicate that in the failure message. |
For more information, see https://pre-commit.ci
…pudl into raw_ferc1_settings
After talking with @zaneselvans , we decided to go with the assert-only option w/ mega descriptive error message (hopefully! lol any suggestions there would be helpful). |
Overview
EtlSettings
. Now, the setup step will fail if you are trying to load PUDL tables but didn't specify the needed raw tables (except when you specified no XBRL tables bc that is equivalent to all tables).pudl.__init__.py
bc circular references & this dict is not being used in, settings, extract and transform.PR Checklist
Before requesting a review of your pull request, please make sure you've done the
following:
dev
(or the appropriate upstream branch) intoyour branch and resolved any merge conflicts. You may need to do this several
times over the course of a PR as
dev
changes frequently.Running Tests with Tox
for details on how to run the full test suite locally if you need to debug a
particular failure.
descriptive enough for developers and users to understand your code.
unit tests.
questions you'd like reviewers to answer, known issues, solutions you're
unsatisfied with, or other things that deserve special attention from the
reviewer.
release notes
to reflect your changes. Make sure to reference the PR and any related issues.
will catch unexpected data issues.
data validation tests
pass locally on a fresh DB.