accumulation of any changes in the xbrl validation tests #2071

cmgosnell · 2022-11-16T16:11:50Z

PR Checklist

Before requesting a review of your pull request, please make sure you've done the
following:

- Merge the most recent version of dev (or the appropriate upstream branch) into your branch and resolve any merge conflicts. You may need to do this several times over the course of a PR as dev changes frequently.
- Verify that all of the CI checks on your PR are passing. See Running Tests with Tox for details on how to run the full test suite locally if you need to debug a particular failure.
- Ensure that the docstrings for any new modules, classes, functions, or methods are descriptive enough for developers and users to understand your code.
- If you expanded data coverage or changed the outputs, ensure that the full data validation tests pass locally on a fresh DB.
- If you've added new functions or classes, ensure that they have at least basic unit tests.
- If you've added new analyses, make sure they include defensive sanity checks that will catch unexpected data issues.
- Update the release notes to reflect your changes. Make sure to reference the PR and any related issues.
- Do your own review of the PR. Add comments highlighting areas where you have questions you'd like reviewers to answer, known issues, solutions you're unsatisfied with, or other things that deserve special attention from the reviewer.

cmgosnell · 2022-11-16T16:20:34Z

Current error from tox -e nuke

test/integration/datasette_metadata_test.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pudl.metadata.classes.DatasetteMetadata'>, data_source_ids = ['pudl', 'ferc1', 'eia860', 'eia860m', 'eia923']
xbrl_ids = ['ferc1_xbrl', 'ferc2_xbrl', 'ferc6_xbrl', 'ferc60_xbrl', 'ferc714_xbrl'], extra_etl_groups = ['entity_eia', 'glue', 'static_eia', 'static_ferc1']
pudl_settings = {'censusdp1tract_db': 'sqlite:////Users/christinagosnell/code/pudl_work/sqlite/censusdp1tract.sqlite', 'data_dir': '/U.../v3s4wmyd75v5gjb636r5bvl80000gq/T/pytest-of-christinagosnell/pytest-36/pudl0/sqlite/ferc1_xbrl_datapackage.json'), ...}

    @classmethod
    def from_data_source_ids(
        cls,
        data_source_ids: Iterable[str] = [
            "pudl",
            "ferc1",
            "eia860",
            "eia860m",
            "eia923",
        ],
        xbrl_ids: Iterable[str] = [
            "ferc1_xbrl",
            "ferc2_xbrl",
            "ferc6_xbrl",
            "ferc60_xbrl",
            "ferc714_xbrl",
        ],
        extra_etl_groups: Iterable[str] = [
            "entity_eia",
            "glue",
            "static_eia",
            "static_ferc1",
        ],
        pudl_settings: dict = {},
    ) -> "DatasetteMetadata":
        """Construct a dictionary of DataSources from data source names.

        Create dictionary of first and last year or year-month for each source.

        Args:
            data_source_ids: ids of data sources currently included in Datasette
            xbrl_ids: ids of data converted XBRL data to be included in Datasette
            extra_etl_groups: ETL groups with resources that should be included
            pudl_settings: Dictionary of settings.
        """
        # Compile a list of DataSource objects for use in the template
        data_sources = [DataSource.from_id(ds_id) for ds_id in data_source_ids]

        # Instantiate all possible resources in a Package:
        pkg = Package.from_resource_ids()
        # Grab a list of just the resources we want to output:
        resources = [
            res
            for res in pkg.resources
            if res.etl_group in data_source_ids + extra_etl_groups
        ]

        # Get XBRL based resources
        for xbrl_id in xbrl_ids:
            # Read JSON Package descriptor from file
>           with open(pudl_settings[f"{xbrl_id}_datapackage"]) as f:
E           FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/p8/v3s4wmyd75v5gjb636r5bvl80000gq/T/pytest-of-christinagosnell/pytest-36/pudl0/sqlite/ferc1_xbrl_datapackage.json'

.env_tox/lib/python3.10/site-packages/pudl/metadata/classes.py:1915: FileNotFoundError
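
The traceback shows `open()` failing because the `ferc1_xbrl_datapackage.json` descriptor does not exist in the pytest temporary directory. As a rough sketch of how one might list the missing descriptors before building the metadata: `find_missing_datapackages` is a hypothetical helper, not part of PUDL, and only the `f"{xbrl_id}_datapackage"` key pattern comes from the traceback.

```python
from pathlib import Path


def find_missing_datapackages(pudl_settings: dict, xbrl_ids: list[str]) -> list[str]:
    """Return the XBRL ids whose datapackage descriptor is absent on disk.

    Hypothetical helper for illustration only; not part of PUDL.
    """
    missing = []
    for xbrl_id in xbrl_ids:
        # Same settings key pattern that appears in the traceback.
        path = pudl_settings.get(f"{xbrl_id}_datapackage")
        if path is None or not Path(path).is_file():
            missing.append(xbrl_id)
    return missing
```

A test fixture could call this first and skip or fail with a clearer message when the list is non-empty.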

cmgosnell · 2022-11-16T18:05:26Z

current validation failure:

missing small plants errors:

FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plants_all_ferc1-all] - ValueError: Table plants_small_ferc1 not found
FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plants_small_ferc1-all] - ValueError: Table plants_small_ferc1 not found
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-51456] - ValueError: Table plants_small_ferc1 not found
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_small_ferc1-15584] - ValueError: Table plants_small_ferc1 not found
FAILED test/validate/ferc1_test.py::test_record_id_dupes[plants_small_ferc1] - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) near "plants_small_ferc1": syntax error

number of rows errors

FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-20573] - ValueError: fbp_ferc1: found 25430 rows, expected 20573. Off by 23.609%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-31265] - ValueError: fuel_ferc1: found 48280 rows, expected 31265. Off by 54.422%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_hydro_ferc1-6782] - ValueError: plants_hydro_ferc1: found 6797 rows, expected 6782. Off by 0.221%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_pumped_storage_ferc1-710] - ValueError: plants_pumped_storage_ferc1: found 544 rows, expected 710. Off by -23.380%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-28380] - ValueError: plants_steam_ferc1: found 30701 rows, expected 28380. Off by 8.178%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-pu_ferc1-6946] - ValueError: pu_ferc1: found 7423 rows, expected 6946. Off by 6.867%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-190228] - ValueError: purchased_power_ferc1: found 197523 rows, expected 190228. Off by 3.835%, allowed margin of 0.000%
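
The "Off by" percentages in the row-count failures above are the relative deviation of the found count from the expected count. A minimal sketch of that arithmetic, using two of the counts listed; `row_count_deviation` is a hypothetical name, not the function the test suite uses.

```python
def row_count_deviation(found: int, expected: int) -> float:
    """Relative deviation of the found row count from the expected one."""
    return (found - expected) / expected


# Row counts copied from the failures listed above.
fbp_dev = row_count_deviation(found=25430, expected=20573)
pumped_dev = row_count_deviation(found=544, expected=710)

print(f"fbp_ferc1: {fbp_dev:.3%}")  # fbp_ferc1: 23.609%
print(f"plants_pumped_storage_ferc1: {pumped_dev:.3%}")  # plants_pumped_storage_ferc1: -23.380%
```

With an allowed margin of 0.000%, any non-zero deviation fails, so the expected counts have to be updated whenever the data legitimately changes.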

column renames/removals -> output failures

FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plant_in_service_ferc1-all] - KeyError: "['amount_type'] not in index"
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-26933] - KeyError: "['amount_type'] not in index"
FAILED test/validate/ferc1_test.py::test_unique_rows_ferc1[ferc1_annual-plant_in_service_ferc1-unique_subset4] - KeyError: "['amount_type'] not in index"
FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-fbp_ferc1-all] - ValueError: Null columns found in fbp_ferc1: ['other_cost_per_mmbtu']

data errors

FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.38%) have mismatched primary fuel types.
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.
FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.4921309412014603) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2239720675660863) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
(!!new!!) FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: nuclear cost v mmbtu corrcoef is below 0.9: 0.09498177057189967

Dupes (is this test still appropriate for the xbrl data)?

FAILED test/validate/ferc1_test.py::test_unique_rows_ferc1[ferc1_annual-plants_hydro_ferc1-unique_subset2] - ValueError: Found 1 dupes of ['report_year', 'utility_id_ferc1', 'plant_name_ferc1', 'capacity_mw'] in dataframe plants_hydro_ferc1
FAILED test/validate/ferc1_test.py::test_record_id_dupes[purchased_power_ferc1] - AssertionError: 1 duplicate record_ids found in purchased_power_ferc1: ['purchased_power_326_2021_c001421_cheyenne_light_fuel__power_os'].

codecov · 2022-11-16T19:32:06Z

Codecov Report

Base: 85.0% // Head: 85.1% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (4aad2f3) compared to base (eda9a3c).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files

@@               Coverage Diff                @@
##           xbrl_integration   #2071   +/-   ##
================================================
  Coverage              85.0%   85.1%           
================================================
  Files                    72      72           
  Lines                  8162    8183   +21     
================================================
+ Hits                   6945    6968   +23     
+ Misses                 1217    1215    -2

Impacted Files	Coverage Δ
src/pudl/validate.py	`49.4% <ø> (ø)`
src/pudl/analysis/classify_plants_ferc1.py	`92.4% <100.0%> (+0.4%)`	⬆️
src/pudl/glue/ferc1_eia.py	`96.3% <100.0%> (-0.1%)`	⬇️
src/pudl/output/ferc1.py	`100.0% <100.0%> (+2.3%)`	⬆️
src/pudl/transform/ferc1.py	`94.7% <100.0%> (+0.1%)`	⬆️
src/pudl/output/pudltabl.py	`88.2% <0.0%> (+0.2%)`	⬆️

zaneselvans · 2022-11-16T21:23:14Z

The duplicates in the hydro and purchased power tables are coming from the DBF data, not the XBRL data. We've punted on figuring out why they are duplicates, but I think they actually should be addressed (and there are only a tiny handful of them)

cmgosnell · 2022-11-17T15:47:51Z

> The duplicates in the hydro and purchased power tables are coming from the DBF data, not the XBRL data. We've punted on figuring out why they are duplicates, but I think they actually should be addressed (and there are only a tiny handful of them)

The purchased power duplicate is a 2021 (XBRL) dupe and appears to be two legitimate records. We do not necessarily expect the purchased power table to have a unique record_id in the XBRL data, so we removed it from this test.
The hydro table had a real dupe from 2019. We've added a targeted removal of the one duplicate that had less data in it.

zaneselvans · 2022-11-17T17:35:32Z

@cmgosnell some findings...

FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
- I think this is just due to the fact that coal capacity factors are plummeting. I'd revise the threshold to 50%.
FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.38%) have mismatched primary fuel types.
- The problem here is that a lot of missing values are empty strings, and "oil" != "" so the number of mismatches is artificially high.
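
To illustrate the empty-string point: comparing `"oil"` with `""` counts as a mismatch, while treating `""` as missing and comparing only rows where both values are known gives a lower rate. The sketch below uses made-up records, and the column names `primary_fuel_by_cost` and `primary_fuel_by_mmbtu` are assumptions for illustration.

```python
import pandas as pd

# Made-up records: two fuel-type columns that should agree.
df = pd.DataFrame(
    {
        "primary_fuel_by_cost": ["oil", "gas", "oil", "coal"],
        "primary_fuel_by_mmbtu": ["oil", "", "", "gas"],
    }
)

# Naive comparison: every empty string counts as a mismatch.
naive_rate = (df["primary_fuel_by_cost"] != df["primary_fuel_by_mmbtu"]).mean()

# Treat empty strings as missing and compare only rows where both are known.
known = df.replace("", pd.NA).dropna()
strict_rate = (known["primary_fuel_by_cost"] != known["primary_fuel_by_mmbtu"]).mean()

print(naive_rate, strict_rate)
```

In this toy frame the naive rate is 3 of 4 rows, while only 1 of the 2 fully known rows is a real mismatch.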

cmgosnell · 2022-11-17T18:54:11Z

and then there were 7:

=============================================================================================== short test summary info ================================================================================================
FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.37%) have mismatched primary fuel types.
FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: oil cost v mmbtu corrcoef is below 0.9: 0.8898862970859709
FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.6009842831755097) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-fbp_ferc1-all] - ValueError: Null columns found in fbp_ferc1: ['other_fraction_cost', 'other_fraction_mmbtu', 'other_cost_per_mmbtu']
FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2483937433974757) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.
============================================================== 7 failed, 250 passed, 124 skipped, 7 xfailed, 1 xpassed, 2 warnings in 1825.64s (0:30:25) ===============================================================
[1]    90996 exit 1     pytest test/validate --live-dbs

zaneselvans · 2022-11-17T19:49:58Z

I think the unexpectedly large low-end tail in natural gas prices may be due to some unfixed unit errors. There's a cluster of fuel prices reported between 0-0.5 USD/Mcf that looks suspiciously similar to the main distribution between 0-5 USD/Mcf. So maybe it was reported in USD / hundred cubic feet, rather than USD / thousand cubic feet?
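
If the low cluster really was reported per hundred cubic feet, multiplying those values by 10 would move them onto the USD/Mcf scale. A rough sketch of that check on made-up prices; the `SUSPECT_BELOW` cutoff is a hypothetical value chosen for illustration, not a vetted threshold.

```python
import pandas as pd

# Made-up gas prices, nominally in USD/Mcf.
gas_cost_per_unit = pd.Series([0.25, 0.31, 2.5, 3.1, 4.0, 0.40])

# Hypothetical cutoff separating the suspicious low cluster from the rest.
SUSPECT_BELOW = 0.5

suspect = gas_cost_per_unit < SUSPECT_BELOW

# A price per hundred cubic feet times 10 is a price per thousand cubic feet.
rescaled = gas_cost_per_unit.where(~suspect, gas_cost_per_unit * 10)

print(rescaled.describe())
```

If the rescaled cluster lines up with the main distribution, that supports the unit-error explanation. It does not prove it, so any real fix would need to be checked against the source filings.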

cmgosnell · 2022-11-17T19:57:23Z

and then there were 4:

FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: nuclear cost v mmbtu corrcoef is below 0.9: 0.09498177057189967
FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.6009842831755097) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2483937433974757) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.

zaneselvans · 2022-11-17T20:21:24Z

I think the oil price distribution actually just includes more lower priced records now. Probably as a result of all the fuel type / units cleaning. There's a little spike at ~0 that's probably bad data, but removing it doesn't make much of a difference in where the 10% cutoff is. So I would say drop the 10% cutoff from 4.0 USD/MMBtu to 3.0 USD/MMBtu on this one:

 FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.4921309412014603) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
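
The check behind this failure compares a low quantile of the distribution with a fixed lower bound. A minimal sketch of that kind of check on made-up values; `check_low_tail` is a hypothetical function and is not the `pudl.validate` implementation.

```python
import numpy as np


def check_low_tail(values, quantile: float, lower_bound: float) -> float:
    """Raise if the given quantile of values falls below lower_bound."""
    found = float(np.quantile(values, quantile))
    if found < lower_bound:
        raise ValueError(
            f"{quantile:.0%} quantile ({found}) is below lower bound ({lower_bound})"
        )
    return found


# Made-up oil costs in USD/MMBtu: 0.35, 0.70, ..., 35.0
values = np.arange(1, 101) * 0.35

# Passes with the relaxed bound of 3.0; the same data fails with a bound of 4.0.
print(check_low_tail(values, quantile=0.10, lower_bound=3.0))
```

Lowering the bound from 4.0 to 3.0 accepts the new distribution while still catching a collapse of the low tail toward zero.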

zaneselvans · 2022-11-17T20:23:33Z

On the cost correlations, I think you can just remove nuclear and other from the fuel types that are checked. Nuclear fuel reporting is a mess in FERC 1. There's really no good data to use. And we eliminated all of the other fuel records.
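
A sketch of what dropping fuels from the correlation check could look like: the check loops over a list of fuel types, so removing nuclear and other from that list skips them. The `{fuel}_fraction_cost` and `{fuel}_fraction_mmbtu` column pattern follows the column names in the test output above; `low_correlation_fuels` and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def low_correlation_fuels(df: pd.DataFrame, fuels: list[str], min_corr: float = 0.9) -> dict:
    """Return {fuel: corrcoef} for fuels whose cost and mmbtu fractions correlate poorly."""
    failures = {}
    for fuel in fuels:
        pair = df[[f"{fuel}_fraction_cost", f"{fuel}_fraction_mmbtu"]].dropna()
        corr = np.corrcoef(pair.iloc[:, 0], pair.iloc[:, 1])[0, 1]
        if corr < min_corr:
            failures[fuel] = corr
    return failures


# Made-up data: gas fractions track each other, nuclear fractions do not.
df = pd.DataFrame(
    {
        "gas_fraction_cost": [0.1, 0.5, 0.9, 0.3],
        "gas_fraction_mmbtu": [0.12, 0.48, 0.88, 0.33],
        "nuclear_fraction_cost": [0.1, 0.5, 0.9, 0.3],
        "nuclear_fraction_mmbtu": [0.9, 0.2, 0.6, 0.1],
    }
)

print(low_correlation_fuels(df, ["gas", "nuclear"]))  # nuclear is flagged
print(low_correlation_fuels(df, ["gas"]))  # nothing flagged once nuclear is excluded
```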

cmgosnell · 2022-11-17T20:46:28Z

src/pudl/transform/ferc1.py

+        eachother, except one have nulls in the capex columns. Surgically remove the
+        record with the nulls.
+        """
+        if 2019 in df.report_year.unique():


I made this run only if 2019 is being processed, but I could change the assertion below to only yell if len(possible_dupes) != 2 and 2019 in df.report_year.unique()

test/conftest.py

test/validate/ferc1_test.py

zaneselvans

Only the tiniest of changes requested!

zaneselvans · 2022-11-17T20:56:02Z

src/pudl/analysis/classify_plants_ferc1.py

+    if float_cols:
+        df.loc[:, float_cols] = df.loc[:, float_cols].replace(0, np.nan)
+    return df


Not any worse than it was before but... it would be better if we could distinguish the original nan/0.0 values from each other and end up with the same values. We should make an issue to ensure that we get the same values out of the whole steam plant classifier that we put into it, even if the NULL values need to be filled for the classification process to run.
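
One way to get the same values out as went in is to remember which cells were null before filling and restore exactly those cells afterwards, instead of replacing every 0 with NaN. A minimal sketch under stated assumptions: `fill_then_restore` is hypothetical, and `process` stands in for the classifier, which is assumed to keep the row index and the float columns intact.

```python
import numpy as np
import pandas as pd


def fill_then_restore(df: pd.DataFrame, float_cols: list[str], process) -> pd.DataFrame:
    """Fill nulls with 0.0 for processing, then restore only the original nulls."""
    # Remember where the nulls were before filling.
    null_mask = df[float_cols].isna()
    filled = df.copy()
    filled[float_cols] = filled[float_cols].fillna(0.0)
    out = process(filled)
    # Put NaN back only where the input had NaN; genuine zeros are kept.
    out[float_cols] = out[float_cols].mask(null_mask)
    return out


df = pd.DataFrame({"capacity_mw": [0.0, np.nan, 5.0]})
out = fill_then_restore(df, ["capacity_mw"], process=lambda d: d)
print(out)
```

Here the original 0.0 survives as 0.0 and only the original NaN comes back as NaN.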

zaneselvans · 2022-11-17T21:00:31Z

src/pudl/transform/ferc1.py

+        eachother, except one have nulls in the capex columns. Surgically remove the
+        record with the nulls.
+        """
+        if 2019 in df.report_year.unique():


I think only having the assertion in the conditional is better -- we want these kinds of spot fixes to run and work regardless of what data is being processed, and having the whole thing enclosed in a conditional suggests that it will not work if 2019 isn't in the data.
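
A sketch of the shape being suggested here: the spot fix always runs, and only the sanity assertion depends on 2019 being in the data. `drop_known_2019_dupe`, its arguments, and the `capex_total` column are hypothetical; this is not the code in the PR.

```python
import numpy as np
import pandas as pd


def drop_known_2019_dupe(df: pd.DataFrame, key_cols: list[str], null_col: str) -> pd.DataFrame:
    """Drop the 2019 duplicate record that has a null in null_col."""
    is_dupe = (df["report_year"] == 2019) & df.duplicated(subset=key_cols, keep=False)
    possible_dupes = df[is_dupe]
    # Only the assertion is conditional on 2019 being processed.
    if 2019 in df["report_year"].unique():
        assert len(possible_dupes) == 2, f"Expected 2 dupes, found {len(possible_dupes)}"
    # With no 2019 data this index is empty and the drop is a no-op.
    to_drop = possible_dupes[possible_dupes[null_col].isna()].index
    return df.drop(index=to_drop)


keys = ["report_year", "utility_id_ferc1", "plant_name_ferc1", "capacity_mw"]
df = pd.DataFrame(
    {
        "report_year": [2019, 2019, 2020],
        "utility_id_ferc1": [1, 1, 1],
        "plant_name_ferc1": ["a", "a", "a"],
        "capacity_mw": [10.0, 10.0, 10.0],
        "capex_total": [100.0, np.nan, 120.0],
    }
)
print(drop_known_2019_dupe(df, keys, "capex_total"))
```

Calling the same function on a frame without 2019 returns it unchanged and never reaches the assertion.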

test/validate/fuel_ferc1_test.py

remove xfail from the output test

65c078c

cmgosnell changed the title ~~remove xfail from the output test~~ accumulation of any changes in the xbrl validation tests Nov 16, 2022

include all generated outputs in the live_dbs overwrite in conftest

32ff9f3

remove amount_type column from plant_in_service output table

a3d2627

cmgosnell added 4 commits November 16, 2022 16:17

add straggler small plants into mapping sheet (WHYYY)

a0c9e31

fix sorting error in ferc glue plants

0400037

add small plants into full settings file

74847e9

change PK's and expected rows in ferc tests

494b91c

Base automatically changed from dbf-xbrl-mapping-dupes to xbrl_integration November 16, 2022 21:51

cmgosnell added 3 commits November 16, 2022 17:27

add bespoke drop duplicate

d88fb1f

skip drop 2019 dupe if 2019 not being loaded

4c8e2c8

Merge branch 'xbrl_integration' into nuke

77a7260

lol fix the 2019 exception and FIX OFFICE LIBRE'S AUTOCORRECT OF A DASH MF

d6629e4

nulls others and such

cb02642

validation updates

ee1e675

cmgosnell commented Nov 17, 2022

test/conftest.py

zaneselvans mentioned this pull request Nov 17, 2022

Bad natural gas prices in fuel_ferc1 #2073

Open

cmgosnell commented Nov 17, 2022

test/validate/ferc1_test.py

zaneselvans self-requested a review November 17, 2022 20:49

only revent fillna in outputs, not steam table

f8fdccc

zaneselvans requested changes Nov 17, 2022

cmgosnell added 2 commits November 17, 2022 16:21

enable targeted drop dupes

ebcde93

bb validation fixes

4aad2f3

cmgosnell marked this pull request as ready for review November 17, 2022 22:00

zaneselvans self-requested a review November 17, 2022 22:00

zaneselvans approved these changes Nov 17, 2022

cmgosnell merged commit 801223b into xbrl_integration Nov 17, 2022

cmgosnell deleted the nuke branch November 17, 2022 22:42
