accumulation of any changes in the xbrl validation tests #2071

Merged
cmgosnell merged 16 commits into xbrl_integration from nuke
Nov 17, 2022

cmgosnell
Member

@cmgosnell cmgosnell commented Nov 16, 2022

PR Checklist

Before requesting a review of your pull request, please make sure you've done the
following:

  • Merge the most recent version of dev (or the appropriate upstream branch) into
    your branch and resolve any merge conflicts. You may need to do this several
    times over the course of a PR as dev changes frequently.
  • Verify that all of the CI checks on your PR are passing. See
    Running Tests with Tox
    for details on how to run the full test suite locally if you need to debug a
    particular failure.
  • Ensure that the docstrings for any new modules, classes, functions, or methods are
    descriptive enough for developers and users to understand your code.
  • If you expanded data coverage or changed the outputs, ensure that the full
    data validation tests
    pass locally on a fresh DB.
  • If you've added new functions or classes, ensure that they have at least basic
    unit tests.
  • If you've added new analyses, make sure they include defensive sanity checks that
    will catch unexpected data issues.
  • Update the
    release notes
    to reflect your changes. Make sure to reference the PR and any related issues.
  • Do your own review of the PR. Add comments highlighting areas where you have
    questions you'd like reviewers to answer, known issues, solutions you're
    unsatisfied with, or other things that deserve special attention from the
    reviewer.

@cmgosnell cmgosnell changed the title from "remove xfail from the output test" to "accumulation of any changes in the xbrl validation tests" Nov 16, 2022
@cmgosnell
Member Author

Current error from tox -e nuke

test/integration/datasette_metadata_test.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'pudl.metadata.classes.DatasetteMetadata'>, data_source_ids = ['pudl', 'ferc1', 'eia860', 'eia860m', 'eia923']
xbrl_ids = ['ferc1_xbrl', 'ferc2_xbrl', 'ferc6_xbrl', 'ferc60_xbrl', 'ferc714_xbrl'], extra_etl_groups = ['entity_eia', 'glue', 'static_eia', 'static_ferc1']
pudl_settings = {'censusdp1tract_db': 'sqlite:////Users/christinagosnell/code/pudl_work/sqlite/censusdp1tract.sqlite', 'data_dir': '/U.../v3s4wmyd75v5gjb636r5bvl80000gq/T/pytest-of-christinagosnell/pytest-36/pudl0/sqlite/ferc1_xbrl_datapackage.json'), ...}

    @classmethod
    def from_data_source_ids(
        cls,
        data_source_ids: Iterable[str] = [
            "pudl",
            "ferc1",
            "eia860",
            "eia860m",
            "eia923",
        ],
        xbrl_ids: Iterable[str] = [
            "ferc1_xbrl",
            "ferc2_xbrl",
            "ferc6_xbrl",
            "ferc60_xbrl",
            "ferc714_xbrl",
        ],
        extra_etl_groups: Iterable[str] = [
            "entity_eia",
            "glue",
            "static_eia",
            "static_ferc1",
        ],
        pudl_settings: dict = {},
    ) -> "DatasetteMetadata":
        """Construct a dictionary of DataSources from data source names.

        Create dictionary of first and last year or year-month for each source.

        Args:
            data_source_ids: ids of data sources currently included in Datasette
            xbrl_ids: ids of data converted XBRL data to be included in Datasette
            extra_etl_groups: ETL groups with resources that should be included
            pudl_settings: Dictionary of settings.
        """
        # Compile a list of DataSource objects for use in the template
        data_sources = [DataSource.from_id(ds_id) for ds_id in data_source_ids]

        # Instantiate all possible resources in a Package:
        pkg = Package.from_resource_ids()
        # Grab a list of just the resources we want to output:
        resources = [
            res
            for res in pkg.resources
            if res.etl_group in data_source_ids + extra_etl_groups
        ]

        # Get XBRL based resources
        for xbrl_id in xbrl_ids:
            # Read JSON Package descriptor from file
>           with open(pudl_settings[f"{xbrl_id}_datapackage"]) as f:
E           FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/p8/v3s4wmyd75v5gjb636r5bvl80000gq/T/pytest-of-christinagosnell/pytest-36/pudl0/sqlite/ferc1_xbrl_datapackage.json'

.env_tox/lib/python3.10/site-packages/pudl/metadata/classes.py:1915: FileNotFoundError
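
The traceback shows `from_data_source_ids` opening `pudl_settings[f"{xbrl_id}_datapackage"]` for each XBRL ID without first checking that the file exists, so the run stops at the first missing descriptor. A minimal sketch of a pre-check that reports every missing descriptor at once follows. `missing_datapackages` is a hypothetical helper, not part of PUDL, and it assumes only the `<xbrl_id>_datapackage` key convention visible in the traceback.

```python
from pathlib import Path


def missing_datapackages(pudl_settings: dict, xbrl_ids: list[str]) -> dict[str, str]:
    """Return {xbrl_id: path} for each datapackage descriptor that is absent.

    Hypothetical helper: assumes settings use the "<xbrl_id>_datapackage" key
    seen in the traceback. An ID with no such key is reported with an empty path.
    """
    missing = {}
    for xbrl_id in xbrl_ids:
        path = pudl_settings.get(f"{xbrl_id}_datapackage", "")
        # Report both unset keys and paths that do not point at a file.
        if not path or not Path(path).is_file():
            missing[xbrl_id] = str(path)
    return missing
```

Calling this before building the metadata would turn a single `FileNotFoundError` into a complete list of the descriptors that still need to be generated.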

@cmgosnell
Member Author

cmgosnell commented Nov 16, 2022

current validation failure:

missing small plants errors:

  • FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plants_all_ferc1-all] - ValueError: Table plants_small_ferc1 not found
  • FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plants_small_ferc1-all] - ValueError: Table plants_small_ferc1 not found
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-51456] - ValueError: Table plants_small_ferc1 not found
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_small_ferc1-15584] - ValueError: Table plants_small_ferc1 not found
  • FAILED test/validate/ferc1_test.py::test_record_id_dupes[plants_small_ferc1] - sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) near "plants_small_ferc1": syntax error

number of rows errors

  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-20573] - ValueError: fbp_ferc1: found 25430 rows, expected 20573. Off by 23.609%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-31265] - ValueError: fuel_ferc1: found 48280 rows, expected 31265. Off by 54.422%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_hydro_ferc1-6782] - ValueError: plants_hydro_ferc1: found 6797 rows, expected 6782. Off by 0.221%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_pumped_storage_ferc1-710] - ValueError: plants_pumped_storage_ferc1: found 544 rows, expected 710. Off by -23.380%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-28380] - ValueError: plants_steam_ferc1: found 30701 rows, expected 28380. Off by 8.178%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-pu_ferc1-6946] - ValueError: pu_ferc1: found 7423 rows, expected 6946. Off by 6.867%, allowed margin of 0.000%
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-190228] - ValueError: purchased_power_ferc1: found 197523 rows, expected 190228. Off by 3.835%, allowed margin of 0.000%
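
The percentages in these messages are the relative difference between the found and expected row counts. The sketch below reproduces that arithmetic from the figures in the log. `check_row_count` is a hypothetical name, and this is a reconstruction from the error text rather than PUDL's own implementation.

```python
def check_row_count(table: str, found: int, expected: int, margin: float = 0.0) -> float:
    """Return the relative row-count error, raising if it exceeds ``margin``."""
    discrepancy = (found - expected) / expected
    if abs(discrepancy) > margin:
        raise ValueError(
            f"{table}: found {found} rows, expected {expected}. "
            f"Off by {discrepancy:.3%}, allowed margin of {margin:.3%}"
        )
    return discrepancy
```

For fbp_ferc1, (25430 - 20573) / 20573 is about 0.23609, which matches the reported 23.609%. With a margin of zero, any change in row count fails, so the expected counts have to be updated whenever the new XBRL years legitimately add records.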

column renames/removals -> output failures

  • FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-plant_in_service_ferc1-all] - KeyError: "['amount_type'] not in index"
  • FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-26933] - KeyError: "['amount_type'] not in index"
  • FAILED test/validate/ferc1_test.py::test_unique_rows_ferc1[ferc1_annual-plant_in_service_ferc1-unique_subset4] - KeyError: "['amount_type'] not in index"
  • FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-fbp_ferc1-all] - ValueError: Null columns found in fbp_ferc1: ['other_cost_per_mmbtu']

data errors

  • FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.38%) have mismatched primary fuel types.
  • FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
  • FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.
  • FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.4921309412014603) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
  • FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2239720675660863) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
  • (!!new!!) FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: nuclear cost v mmbtu corrcoef is below 0.9: 0.09498177057189967

Dupes (is this test still appropriate for the XBRL data?)

  • FAILED test/validate/ferc1_test.py::test_unique_rows_ferc1[ferc1_annual-plants_hydro_ferc1-unique_subset2] - ValueError: Found 1 dupes of ['report_year', 'utility_id_ferc1', 'plant_name_ferc1', 'capacity_mw'] in dataframe plants_hydro_ferc1
  • FAILED test/validate/ferc1_test.py::test_record_id_dupes[purchased_power_ferc1] - AssertionError: 1 duplicate record_ids found in purchased_power_ferc1: ['purchased_power_326_2021_c001421_cheyenne_light_fuel__power_os'].
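
As a rough illustration of what the unique-subset check is testing, the sketch below counts rows that repeat an already-seen combination of key columns. It uses plain Python rather than pandas so that it runs without dependencies. The function names are hypothetical, and counting surplus rows (one per repeat) is an assumption that happens to agree with "Found 1 dupes" for a duplicated pair.

```python
from collections import Counter


def count_dupes(records: list[dict], subset: list[str]) -> int:
    """Count surplus rows: each repeat of an already-seen key adds one."""
    seen = Counter(tuple(rec[col] for col in subset) for rec in records)
    return sum(n - 1 for n in seen.values() if n > 1)


def check_unique(records: list[dict], subset: list[str], name: str) -> None:
    """Raise if any combination of ``subset`` values occurs more than once."""
    n_dupes = count_dupes(records, subset)
    if n_dupes:
        raise ValueError(f"Found {n_dupes} dupes of {subset} in dataframe {name}")
```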

@codecov

codecov bot commented Nov 16, 2022

Codecov Report

Base: 85.0% // Head: 85.1% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (4aad2f3) compared to base (eda9a3c).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           xbrl_integration   #2071   +/-   ##
================================================
  Coverage              85.0%   85.1%           
================================================
  Files                    72      72           
  Lines                  8162    8183   +21     
================================================
+ Hits                   6945    6968   +23     
+ Misses                 1217    1215    -2     
| Impacted Files | Coverage Δ |
|---|---|
| src/pudl/validate.py | 49.4% <ø> (ø) |
| src/pudl/analysis/classify_plants_ferc1.py | 92.4% <100.0%> (+0.4%) ⬆️ |
| src/pudl/glue/ferc1_eia.py | 96.3% <100.0%> (-0.1%) ⬇️ |
| src/pudl/output/ferc1.py | 100.0% <100.0%> (+2.3%) ⬆️ |
| src/pudl/transform/ferc1.py | 94.7% <100.0%> (+0.1%) ⬆️ |
| src/pudl/output/pudltabl.py | 88.2% <0.0%> (+0.2%) ⬆️ |


@zaneselvans
Member

The duplicates in the hydro and purchased power tables are coming from the DBF data, not the XBRL data. We've punted on figuring out why they are duplicates, but I think they actually should be addressed (and there are only a tiny handful of them)

Base automatically changed from dbf-xbrl-mapping-dupes to xbrl_integration November 16, 2022 21:51
@cmgosnell
Member Author

> The duplicates in the hydro and purchased power tables are coming from the DBF data, not the XBRL data. We've punted on figuring out why they are duplicates, but I think they actually should be addressed (and there are only a tiny handful of them)

  • The purchased power duplicate is a 2021 (XBRL) dupe and appears to be two legitimate records. We do not necessarily expect the purchased power table to have a unique record_id in the XBRL data, so we removed it from this test.
  • The hydro table had a real dupe from 2019. We've added a targeted removal of the one duplicate that had less data in it.

@zaneselvans
Member

@cmgosnell some findings...

  • FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
    • I think this is just due to the fact that coal capacity factors are plummeting. I'd revise the threshold to 50%.
  • FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.38%) have mismatched primary fuel types.
    • The problem here is that a lot of missing values are empty strings, and "oil" != "" so the number of mismatches is artificially high.
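
To make the empty-string problem concrete, the sketch below treats blank labels as missing before comparing fuel types. `mismatch_fraction` is a hypothetical helper, and excluding records with a missing label from the denominator is an assumption about how the test ought to behave, not a description of the existing test.

```python
def mismatch_fraction(pairs):
    """Share of comparable records whose two fuel labels disagree.

    ``pairs`` holds (label_a, label_b) tuples. None and blank strings are
    treated as missing, and records with a missing label are skipped.
    """

    def clean(label):
        if label is None or label.strip() == "":
            return None
        return label

    cleaned = [(clean(a), clean(b)) for a, b in pairs]
    comparable = [(a, b) for a, b in cleaned if a is not None and b is not None]
    if not comparable:
        return 0.0
    return sum(a != b for a, b in comparable) / len(comparable)
```

With a naive `a != b` comparison, a pair such as `("oil", "")` counts as a mismatch even though one side simply has no data, which is how the mismatch rate ends up artificially high.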

@cmgosnell
Member Author

and then there were 9:

=============================================================================================== short test summary info ================================================================================================
FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (8.37%) have mismatched primary fuel types.
FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: oil cost v mmbtu corrcoef is below 0.9: 0.8898862970859709
FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.6009842831755097) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
FAILED test/validate/ferc1_test.py::test_no_null_cols_ferc1[ferc1_annual-fbp_ferc1-all] - ValueError: Null columns found in fbp_ferc1: ['other_fraction_cost', 'other_fraction_mmbtu', 'other_cost_per_mmbtu']
FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2483937433974757) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.5694444444444444) is below lower bound (0.64) in validation entitled Capability Ratio (tails)
FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.
============================================================== 7 failed, 250 passed, 124 skipped, 7 xfailed, 1 xpassed, 2 warnings in 1825.64s (0:30:25) ===============================================================
[1]    90996 exit 1     pytest test/validate --live-dbs

@zaneselvans
Member

I think the unexpectedly large low-end tail in natural gas prices may be due to some unfixed unit errors. There's a cluster of fuel prices reported between 0-0.5 USD/Mcf that looks suspiciously similar to the main distribution between 0-5 USD/Mcf. So maybe it was reported in USD / hundred cubic feet, rather than USD / thousand cubic feet?

image
image
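
If the hundred-cubic-feet hypothesis holds, the fix would be a factor-of-ten rescaling of the suspect cluster, since 1 Mcf is ten times 100 cubic feet. The sketch below shows only that arithmetic. The function name is hypothetical, the 0 to 0.5 USD/Mcf window comes from the comment above, and the hypothesis itself is unconfirmed, so this is not a recommended cleaning rule.

```python
def rescale_suspect_prices(prices, low=0.0, high=0.5, factor=10.0):
    """Multiply prices strictly between ``low`` and ``high`` by ``factor``.

    Assumes values in that window were reported per hundred cubic feet and
    converts them to per thousand cubic feet (Mcf). Other values pass through.
    """
    return [p * factor if low < p < high else p for p in prices]
```

Comparing the rescaled cluster against the main distribution would be one way to test whether the two really do have the same shape.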

@cmgosnell
Member Author

and then there were 4:

FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mmbtu_cost_correlation[ferc1_annual] - ValueError: nuclear cost v mmbtu corrcoef is below 0.9: 0.09498177057189967
FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.6009842831755097) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)
FAILED test/validate/fuel_ferc1_test.py::test_vs_bounds[ferc1_annual-gas_cost_per_unit] - ValueError: 5% quantile (1.2483937433974757) is below lower bound (2.0) in validation entitled Cost per unit burned (Gas)
FAILED test/validate/plants_steam_ferc1_test.py::test_plant_id_clash[ferc1_annual] - AssertionError: Found 34 plant_id_ferc1 values associated with 68 non-unique plant_id_pudl values.

@zaneselvans
Member

I think the oil price distribution actually just includes more lower priced records now. Probably as a result of all the fuel type / units cleaning. There's a little spike at ~0 that's probably bad data, but removing it doesn't make much of a difference in where the 10% cutoff is. So I would say drop the 10% cutoff from 4.0 USD/MMBtu to 3.0 USD/MMBtu on this one:

 FAILED test/validate/fbp_ferc1_test.py::test_vs_bounds[ferc1_annual-oil_cost_per_mmbtu] - ValueError: 10% quantile (3.4921309412014603) is below lower bound (4.0) in validation entitled Oil cost per MMBTU (Tails)

image
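
The bound check behind this failure can be sketched as a quantile compared against a fixed floor. The version below is unweighted and uses linear interpolation between order statistics; the real validation in PUDL may weight records or differ in other ways, so treat the names and details as assumptions.

```python
def quantile(values, q):
    """Quantile of ``values`` using linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def check_lower_bound(values, q, bound, title):
    """Raise if the ``q`` quantile of ``values`` falls below ``bound``."""
    found = quantile(values, q)
    if found < bound:
        raise ValueError(
            f"{q:.0%} quantile ({found}) is below lower bound ({bound}) "
            f"in validation entitled {title}"
        )
    return found
```

A 10% quantile of about 3.49 fails against a bound of 4.0 but passes against 3.0, which is the change proposed above.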

@zaneselvans
Member

On the cost correlations, I think you can just remove nuclear and other from the fuel types that are checked. Nuclear fuel reporting is a mess in FERC 1. There's really no good data to use. And we eliminated all of the other fuel records.
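
A minimal sketch of that change: compute the cost versus MMBtu correlation per fuel, but only for the fuels that remain in the checked list. The function names and the input layout are hypothetical, and the list `coal`, `gas`, `oil` is an assumption about what is left once nuclear and other are removed.

```python
from math import sqrt

# Assumption: the fuels still checked after dropping "nuclear" and "other".
CHECKED_FUELS = ("coal", "gas", "oil")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def check_cost_mmbtu_correlation(by_fuel, min_corr=0.9):
    """Check each fuel in CHECKED_FUELS; other keys in ``by_fuel`` are ignored.

    ``by_fuel`` maps a fuel name to a (costs, mmbtus) pair of sequences.
    """
    for fuel in CHECKED_FUELS:
        if fuel not in by_fuel:
            continue
        costs, mmbtus = by_fuel[fuel]
        corr = pearson(costs, mmbtus)
        if corr < min_corr:
            raise ValueError(f"{fuel} cost v mmbtu corrcoef is below {min_corr}: {corr}")
```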

each other, except one has nulls in the capex columns. Surgically remove the
record with the nulls.
"""
if 2019 in df.report_year.unique():
Member Author

I made this run only if 2019 is being processed, but I could change the assertion below to only yell if len(possible_dupes) != 2 and 2019 in df.report_year.unique().

Member

I think only having the assertion in the conditional is better -- we want these kinds of spot fixes to run and work regardless of what data is being processed, and having the whole thing enclosed in a conditional suggests that it will not work if 2019 isn't in the data.
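
The structure being suggested can be sketched as follows: the removal always runs and is a no-op when nothing matches, while only the sanity check is gated on 2019 being present. This uses plain row dicts instead of a DataFrame so that it runs without dependencies. The function name, the `dupe_key` argument, and the capex column names are hypothetical; `possible_dupes` and the 2019 condition come from the thread.

```python
def drop_known_hydro_dupe(records, dupe_key, capex_cols):
    """Drop the null-capex copy of one known duplicated record.

    ``records`` is a list of row dicts; ``dupe_key`` maps column names to the
    values identifying the known pair (hypothetical values in this sketch).
    """
    possible_dupes = [
        rec for rec in records
        if all(rec.get(col) == val for col, val in dupe_key.items())
    ]
    # Only the sanity check depends on 2019 being present in the data.
    if 2019 in {rec["report_year"] for rec in records}:
        assert len(possible_dupes) == 2, f"Expected 2 records, found {len(possible_dupes)}"
    # The removal itself always runs; with fewer than two matches it drops nothing.
    drop_ids = {
        id(rec) for rec in possible_dupes
        if len(possible_dupes) > 1 and any(rec.get(col) is None for col in capex_cols)
    }
    return [rec for rec in records if id(rec) not in drop_ids]
```

Written this way, processing a subset of years that excludes 2019 exercises the same code path, and the assertion still catches the case where the expected pair has changed.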

@zaneselvans zaneselvans self-requested a review November 17, 2022 20:49
Member

@zaneselvans zaneselvans left a comment

Only the tiniest of changes requested!

Comment on lines +552 to +554
if float_cols:
    df.loc[:, float_cols] = df.loc[:, float_cols].replace(0, np.nan)
return df
Member

Not any worse than it was before but... it would be better if we could distinguish the original nan/0.0 values from each other and end up with the same values. We should make an issue to ensure that we get the same values out of the whole steam plant classifier that we put into it, even if the NULL values need to be filled for the classification process to run.
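
One way to get the same values out as went in is to remember where the nulls were before filling them, then restore exactly those positions afterwards. The sketch below shows the idea on a plain list of floats. `fill_then_restore` is a hypothetical helper, and it assumes the processing step returns values in the same order and length as its input.

```python
import math


def fill_then_restore(values, process):
    """Fill NaNs with 0.0 for ``process``, then put NaNs back where they were.

    Genuine 0.0 values are left as 0.0, unlike a blanket replace of 0 with NaN.
    """
    null_mask = [math.isnan(v) for v in values]
    filled = [0.0 if is_null else v for v, is_null in zip(values, null_mask)]
    processed = process(filled)
    return [math.nan if is_null else v for v, is_null in zip(processed, null_mask)]
```

Replacing every 0 with NaN after the fact, as in the lines under review, cannot tell an original 0.0 from a filled null, which is the information loss described here.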

test/validate/fuel_ferc1_test.py (resolved conversation)
@cmgosnell cmgosnell marked this pull request as ready for review November 17, 2022 22:00
@zaneselvans zaneselvans self-requested a review November 17, 2022 22:00
@cmgosnell cmgosnell merged commit 801223b into xbrl_integration Nov 17, 2022
@cmgosnell cmgosnell deleted the nuke branch November 17, 2022 22:42