Validate the raw ferc1 tables in the settings #2168

cmgosnell · 2022-12-30T17:20:37Z

Overview

Whyyy tho: we have a list of raw tables in settings files which dictates which raw FERC tables get slurped into the raw ferc dbs.... which is a little flaky and manual. We kept adding the wrong tables into these lists which was junking up our builds.
What: I added a validation step into the EtlSettings. Now, the setup step will fail if you are trying to load PUDL tables but didn't specify the needed raw tables (except when you specified no XBRL tables bc that is equivalent to all tables).
- I needed to pull the dictionary of pudl tables to raw dbf and xbrl tables into the top level pudl.__init__.py bc circular references & this dict is not being used in, settings, extract and transform.
moar??: Should we just add the raw tables needed for the PUDL tables listed?
... What if your loading just the PUDL db?... The settings object (imu) doesn't know whether you are actually running the raw FERC pre-extract sqlite db creation... To me this is an argument for actually adding the raw tables that are needed into the FERC1 settings object instead of having it fail. Or perhaps you should be able to pass it either no raw ferc settings or valid ones, whether or not you are planning on running the raw ferc1 etl.

PR Checklist

Before requesting a review of your pull request, please make sure you've done the
following:

Merge the most recent version of dev (or the appropriate upstream branch) into
your branch and resolved any merge conflicts. You may need to do this several
times over the course of a PR as dev changes frequently.
Verify that all of the CI checks on your PR are passing. See
Running Tests with Tox
for details on how to run the full test suite locally if you need to debug a
particular failure.
Ensure that the docstrings for any new modules, classes, functions, or methods are
descriptive enough for developers and users to understand your code.
If you've added new functions or classes, ensure that they have at least basic
unit tests.
Do your own review of the PR. Add comments highlighting areas where you have
questions you'd like reviewers to answer, known issues, solutions you're
unsatisfied with, or other things that deserve special attention from the
reviewer.
Update the
release notes
to reflect your changes. Make sure to reference the PR and any related issues.
If you've added new analyses, make sure they include defensive sanity checks that
will catch unexpected data issues.
If you expanded data coverage or changed the outputs, ensure that the full
data validation tests
pass locally on a fresh DB.

src/pudl/helpers.py

codecov · 2023-01-03T17:25:04Z

Codecov Report

Base: 85.5% // Head: 85.5% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (1cede68) compared to base (bc71c5e).
Patch coverage: 97.6% of modified lines in pull request are covered.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2168   +/-   ##
=====================================
  Coverage   85.5%   85.5%           
=====================================
  Files         73      73           
  Lines       8882    8911   +29     
=====================================
+ Hits        7599    7627   +28     
- Misses      1283    1284    +1

Impacted Files	Coverage Δ
src/pudl/settings.py	`96.0% <96.0%> (-0.1%)`	⬇️
src/pudl/extract/ferc1.py	`86.0% <100.0%> (ø)`
src/pudl/helpers.py	`87.4% <100.0%> (+0.2%)`	⬆️
src/pudl/transform/ferc1.py	`95.4% <100.0%> (-0.1%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

zaneselvans

It seems like this is entangling the FERC to SQLite settings and the PUDL ETL settings in a way that we haven't before. What is the desired entanglement?

IF the settings include both PUDL ETL and FERC to SQLite settings, THEN we are going to require that the FERC to SQLite settings provide any tables which are going to be needed by the PUDL ETL?

If the settings file does NOT include any FERC to SQLite settings at all, then we can't check this, and we're going to assume that the user already has the necessary FERC DBs?

A new and maybe unexpected error that could result from this: If a user already has complete FERC DBs and is using a settings file that specifies only a subset of the FERC to SQLite tables to control a PUDL ETL, they would get errors even though their PUDL ETL would have worked fine. Is this the behavior we want? Seems like it would be annoying but rare.

Thinking about whether you should fill in required tables if they are missing, it seems like we could construct a complete ferc_to_sqlite_settings object based just on the contents of the PUDL Ferc1Settings -- including both the required years and tables, and then not need to specify those settings at all or worry about them being inconsistent with the PUDL ETL settings which would be great!

I think just quietly filling in missing tables as needed will result in inconsistencies between what's in the settings file, and what's actually getting processed -- we'll forget to add a table to the settings and it won't break anything, and later someone will be debugging and it will be very confusing to see specific settings which don't actually control the process they claim to control.

So I think we should either go fully automatic and construct the required FERC to SQLite settings based on the PUDL ETL settings, or raise an exception and tell the user to fix their manually compiled settings if they are internally inconsistent.

src/pudl/helpers.py

src/pudl/__init__.py

zaneselvans · 2023-01-04T16:19:06Z

src/pudl/settings.py

+    @root_validator(pre=False)
+    def raw_table_validation_ferc1(cls, field_values):
+        """Require the presence of raw FERC1 tables needed for PUDL table creation."""
+        # only check if we are actually loading any pudl tables


I'm confused by this comment. It seems like you're checking whether any FERC DBF or XBRL tables are being loaded into their respective DBs, not whether any PUDL tables are being loaded. Do we want to do this validation if no FERC 1 tables are being loaded into PUDL? In that case then any set of FERC to SQLite settings would be valid, since no tables are required downstream.

If any PUDL tables are being loaded, we use that list of PUDL tables to check for the required raw tables (via the table map).

If we have any ferc1_to_sqlite settings, we check if those settings include the required tables

Right now, when we check if we have the required tables, we use a difference - not a symetrical_difference. We want to know if there are required tables that aren't in our ferc_to_sqlite settings. This means we could load a PUDL db with zero ferc-derived tables alongside ferc_to_sqlite tables.

src/pudl/settings.py

cmgosnell · 2023-01-04T17:27:02Z

thanks @zaneselvans !

i think I agree that the quiet add the table would end up being weird with a discrepancy between the settings files and the dbs created.

fully automatic

I personally lean slightly away from going fully automatic - always generating the list of ferc_to_sqlite tables, which could include:

converting the ferc_to_sqlite tables in the settings file a all or required_only flag, which would either load all of the tables or just load those that are required for the PUDL tables.
- Question: you noted that we can pass an empty list or a None to the XBRL ferc_to_sqlite converter... can we do that with the DBF converter as well? If not, we'll need to store the list of DBF raw tables somewhere (currently in the settings file)

This would probably preclude loading only specific ferc_to_sqlite tables with a non-PUDL, unless you compiled a list of PUDL tables that you weren't actually trying to load to fake out the settings.

assert-only

The assert-only method sounds easier - its basically already implemented... but it would require a little more awareness on the users part of the dependencies of these two settings whether or not they are loading the ferc_to_sqlite dbs.

IF the settings include both PUDL ETL and FERC to SQLite settings, THEN we are requiring that the FERC to SQLite settings provide any tables which are going to be needed by the PUDL ETL.
- This is basically already required, but the failure happens... wayyyyy down the line in the transform step.
If the settings file does NOT include any FERC to SQLite settings at all, THEN we're going to assume that the user already has the necessary FERC DBs.

I think the main weirdness in the assert-only version is when your loading only a PUDL db. You will either need to pass in NO ferc_to_sqlite dbf/xbrl settings OR fully valid ones. imo we can clearly communicate that in the failure message.

For more information, see https://pre-commit.ci

…pudl into raw_ferc1_settings

cmgosnell · 2023-01-04T22:07:50Z

After talking with @zaneselvans , we decided to go with the assert-only option w/ mega descriptive error message (hopefully! lol any suggestions there would be helpful).

src/pudl/settings.py

cmgosnell added 3 commits December 30, 2022 11:37

initial raw table settings valitions

c5c755f

lol forgot this one (CIRCULAR DEPENDENCY ISSUE)

4538f53

docs updates and import changes

31dad97

cmgosnell requested a review from zaneselvans December 30, 2022 17:22

zaneselvans mentioned this pull request Dec 30, 2022

Transform f1_elc_op_mnt_expn #2162

Merged

cmgosnell added 2 commits January 3, 2023 11:46

add a helper flatten method to remove more_itertools pck

9078ff4

Merge branch 'dev' into raw_ferc1_settings

19c7a2a

cmgosnell commented Jan 3, 2023

View reviewed changes

src/pudl/helpers.py Outdated Show resolved Hide resolved

cmgosnell commented Jan 3, 2023

View reviewed changes

src/pudl/helpers.py Outdated Show resolved Hide resolved

moar docs

6804158

cmgosnell marked this pull request as ready for review January 3, 2023 17:17

cmgosnell added 2 commits January 4, 2023 10:47

enable a no raw ferc settings w/ pudl tables and update tests

d33fdb7

moar docs

c866cc7

zaneselvans reviewed Jan 4, 2023

View reviewed changes

cmgosnell and others added 6 commits January 4, 2023 12:35

move the ferc1 table map back into extract!

f0c07d3

edit docs

33f7b4b

Merge branch 'dev' into raw_ferc1_settings

4282bf4

[pre-commit.ci] auto fixes from pre-commit.com hooks

f3cdaa5

For more information, see https://pre-commit.ci

jfc docs didn't like:

a9d3df3

Merge branch 'raw_ferc1_settings' of github.com:catalyst-cooperative/…

112eb0a

…pudl into raw_ferc1_settings

zaneselvans reviewed Jan 4, 2023

View reviewed changes

src/pudl/settings.py Outdated Show resolved Hide resolved

src/pudl/settings.py Outdated Show resolved Hide resolved

cmgosnell self-assigned this Jan 5, 2023

cmgosnell added ferc1 Anything having to do with FERC Form 1 settings labels Jan 5, 2023

cmgosnell added 3 commits January 5, 2023 10:39

update bail early conditional

00aac69

Merge branch 'dev' into raw_ferc1_settings

5a72d30

make bnail early conditional a lil clearer plz ty

1cede68

zaneselvans self-requested a review January 5, 2023 15:53

zaneselvans approved these changes Jan 5, 2023

View reviewed changes

cmgosnell merged commit 3ce0ff1 into dev Jan 5, 2023

cmgosnell deleted the raw_ferc1_settings branch January 5, 2023 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate the raw ferc1 tables in the settings #2168

Validate the raw ferc1 tables in the settings #2168

cmgosnell commented Dec 30, 2022 •

edited

codecov bot commented Jan 3, 2023 •

edited

zaneselvans left a comment

zaneselvans Jan 4, 2023

cmgosnell Jan 4, 2023

cmgosnell commented Jan 4, 2023 •

edited

cmgosnell commented Jan 4, 2023

Validate the raw ferc1 tables in the settings #2168

Validate the raw ferc1 tables in the settings #2168

Conversation

cmgosnell commented Dec 30, 2022 • edited

Overview

PR Checklist

codecov bot commented Jan 3, 2023 • edited

Codecov Report

zaneselvans left a comment

Choose a reason for hiding this comment

zaneselvans Jan 4, 2023

Choose a reason for hiding this comment

cmgosnell Jan 4, 2023

Choose a reason for hiding this comment

cmgosnell commented Jan 4, 2023 • edited

fully automatic

assert-only

cmgosnell commented Jan 4, 2023

cmgosnell commented Dec 30, 2022 •

edited

codecov bot commented Jan 3, 2023 •

edited

cmgosnell commented Jan 4, 2023 •

edited