Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate the raw ferc1 tables in the settings #2168

Merged
merged 17 commits into from
Jan 5, 2023
Merged

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Dec 30, 2022

Overview

  • Whyyy tho: we have a list of raw tables in settings files which dictates which raw FERC tables get slurped into the raw ferc dbs.... which is a little flaky and manual. We kept adding the wrong tables into these lists which was junking up our builds.
  • What: I added a validation step into the EtlSettings. Now, the setup step will fail if you are trying to load PUDL tables but didn't specify the needed raw tables (except when you specified no XBRL tables bc that is equivalent to all tables).
    • I needed to pull the dictionary of pudl tables to raw dbf and xbrl tables into the top level pudl.__init__.py bc circular references & this dict is not being used in, settings, extract and transform.
  • moar??: Should we just add the raw tables needed for the PUDL tables listed?
  • ... What if your loading just the PUDL db?... The settings object (imu) doesn't know whether you are actually running the raw FERC pre-extract sqlite db creation... To me this is an argument for actually adding the raw tables that are needed into the FERC1 settings object instead of having it fail. Or perhaps you should be able to pass it either no raw ferc settings or valid ones, whether or not you are planning on running the raw ferc1 etl.

PR Checklist

Before requesting a review of your pull request, please make sure you've done the
following:

  • Merge the most recent version of dev (or the appropriate upstream branch) into
    your branch and resolved any merge conflicts. You may need to do this several
    times over the course of a PR as dev changes frequently.
  • Verify that all of the CI checks on your PR are passing. See
    Running Tests with Tox
    for details on how to run the full test suite locally if you need to debug a
    particular failure.
  • Ensure that the docstrings for any new modules, classes, functions, or methods are
    descriptive enough for developers and users to understand your code.
  • If you've added new functions or classes, ensure that they have at least basic
    unit tests.
  • Do your own review of the PR. Add comments highlighting areas where you have
    questions you'd like reviewers to answer, known issues, solutions you're
    unsatisfied with, or other things that deserve special attention from the
    reviewer.
  • Update the
    release notes
    to reflect your changes. Make sure to reference the PR and any related issues.
  • If you've added new analyses, make sure they include defensive sanity checks that
    will catch unexpected data issues.
  • If you expanded data coverage or changed the outputs, ensure that the full
    data validation tests
    pass locally on a fresh DB.

src/pudl/helpers.py Outdated Show resolved Hide resolved
src/pudl/helpers.py Outdated Show resolved Hide resolved
@cmgosnell cmgosnell marked this pull request as ready for review January 3, 2023 17:17
@codecov
Copy link

codecov bot commented Jan 3, 2023

Codecov Report

Base: 85.5% // Head: 85.5% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (1cede68) compared to base (bc71c5e).
Patch coverage: 97.6% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2168   +/-   ##
=====================================
  Coverage   85.5%   85.5%           
=====================================
  Files         73      73           
  Lines       8882    8911   +29     
=====================================
+ Hits        7599    7627   +28     
- Misses      1283    1284    +1     
Impacted Files Coverage Δ
src/pudl/settings.py 96.0% <96.0%> (-0.1%) ⬇️
src/pudl/extract/ferc1.py 86.0% <100.0%> (ø)
src/pudl/helpers.py 87.4% <100.0%> (+0.2%) ⬆️
src/pudl/transform/ferc1.py 95.4% <100.0%> (-0.1%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is entangling the FERC to SQLite settings and the PUDL ETL settings in a way that we haven't before. What is the desired entanglement?

IF the settings include both PUDL ETL and FERC to SQLite settings, THEN we are going to require that the FERC to SQLite settings provide any tables which are going to be needed by the PUDL ETL?

If the settings file does NOT include any FERC to SQLite settings at all, then we can't check this, and we're going to assume that the user already has the necessary FERC DBs?

A new and maybe unexpected error that could result from this: If a user already has complete FERC DBs and is using a settings file that specifies only a subset of the FERC to SQLite tables to control a PUDL ETL, they would get errors even though their PUDL ETL would have worked fine. Is this the behavior we want? Seems like it would be annoying but rare.

Thinking about whether you should fill in required tables if they are missing, it seems like we could construct a complete ferc_to_sqlite_settings object based just on the contents of the PUDL Ferc1Settings -- including both the required years and tables, and then not need to specify those settings at all or worry about them being inconsistent with the PUDL ETL settings which would be great!

I think just quietly filling in missing tables as needed will result in inconsistencies between what's in the settings file, and what's actually getting processed -- we'll forget to add a table to the settings and it won't break anything, and later someone will be debugging and it will be very confusing to see specific settings which don't actually control the process they claim to control.

So I think we should either go fully automatic and construct the required FERC to SQLite settings based on the PUDL ETL settings, or raise an exception and tell the user to fix their manually compiled settings if they are internally inconsistent.

src/pudl/helpers.py Outdated Show resolved Hide resolved
src/pudl/helpers.py Outdated Show resolved Hide resolved
src/pudl/helpers.py Outdated Show resolved Hide resolved
src/pudl/__init__.py Outdated Show resolved Hide resolved
@root_validator(pre=False)
def raw_table_validation_ferc1(cls, field_values):
"""Require the presence of raw FERC1 tables needed for PUDL table creation."""
# only check if we are actually loading any pudl tables
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this comment. It seems like you're checking whether any FERC DBF or XBRL tables are being loaded into their respective DBs, not whether any PUDL tables are being loaded. Do we want to do this validation if no FERC 1 tables are being loaded into PUDL? In that case then any set of FERC to SQLite settings would be valid, since no tables are required downstream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If any PUDL tables are being loaded, we use that list of PUDL tables to check for the required raw tables (via the table map).
  • If we have any ferc1_to_sqlite settings, we check if those settings include the required tables

Right now, when we check if we have the required tables, we use a difference - not a symetrical_difference. We want to know if there are required tables that aren't in our ferc_to_sqlite settings. This means we could load a PUDL db with zero ferc-derived tables alongside ferc_to_sqlite tables.

src/pudl/settings.py Outdated Show resolved Hide resolved
@cmgosnell
Copy link
Member Author

cmgosnell commented Jan 4, 2023

thanks @zaneselvans !

i think I agree that the quiet add the table would end up being weird with a discrepancy between the settings files and the dbs created.

fully automatic

I personally lean slightly away from going fully automatic - always generating the list of ferc_to_sqlite tables, which could include:

  • converting the ferc_to_sqlite tables in the settings file a all or required_only flag, which would either load all of the tables or just load those that are required for the PUDL tables.
    • Question: you noted that we can pass an empty list or a None to the XBRL ferc_to_sqlite converter... can we do that with the DBF converter as well? If not, we'll need to store the list of DBF raw tables somewhere (currently in the settings file)

This would probably preclude loading only specific ferc_to_sqlite tables with a non-PUDL, unless you compiled a list of PUDL tables that you weren't actually trying to load to fake out the settings.

assert-only

The assert-only method sounds easier - its basically already implemented... but it would require a little more awareness on the users part of the dependencies of these two settings whether or not they are loading the ferc_to_sqlite dbs.

  • IF the settings include both PUDL ETL and FERC to SQLite settings, THEN we are requiring that the FERC to SQLite settings provide any tables which are going to be needed by the PUDL ETL.
    • This is basically already required, but the failure happens... wayyyyy down the line in the transform step.
  • If the settings file does NOT include any FERC to SQLite settings at all, THEN we're going to assume that the user already has the necessary FERC DBs.

I think the main weirdness in the assert-only version is when your loading only a PUDL db. You will either need to pass in NO ferc_to_sqlite dbf/xbrl settings OR fully valid ones. imo we can clearly communicate that in the failure message.

@cmgosnell
Copy link
Member Author

After talking with @zaneselvans , we decided to go with the assert-only option w/ mega descriptive error message (hopefully! lol any suggestions there would be helpful).

src/pudl/settings.py Outdated Show resolved Hide resolved
src/pudl/settings.py Outdated Show resolved Hide resolved
@cmgosnell cmgosnell self-assigned this Jan 5, 2023
@cmgosnell cmgosnell added ferc1 Anything having to do with FERC Form 1 settings labels Jan 5, 2023
@zaneselvans zaneselvans self-requested a review January 5, 2023 15:53
@cmgosnell cmgosnell merged commit 3ce0ff1 into dev Jan 5, 2023
@cmgosnell cmgosnell deleted the raw_ferc1_settings branch January 5, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 settings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants