Use all dimensions in XbrlCalculationForestFerc1 and exploded tables #2763

Merged
merged 57 commits into from
Aug 22, 2023

Conversation

zaneselvans
Member

@zaneselvans zaneselvans commented Jul 31, 2023

PR Overview

This PR updates the MetadataExploder, Exploder, and XbrlCalculationForestFerc1 classes to use the additional dimensions that are required to uniquely identify every reported fact, so that each fact can be independently annotated and used to filter the output data.
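
To illustrate why the extra dimensions matter, here is a minimal sketch. It is not code from this PR: `FactKey` is a hypothetical stand-in modeled on the five `NodeId` fields used later in this conversation, and the table and factoid names are made up. It shows that several facts can share a table name and factoid and only become distinguishable once a dimension such as `utility_type` is part of the key.

```python
from typing import NamedTuple, Optional


class FactKey(NamedTuple):
    """Hypothetical key with the same five fields as the NodeId used in this PR."""

    table_name: str
    xbrl_factoid: str
    utility_type: Optional[str] = None
    plant_status: Optional[str] = None
    plant_function: Optional[str] = None


# Made-up facts: same table and factoid, different utility_type.
facts = [
    FactKey("example_table_ferc1", "example_factoid", utility_type="electric"),
    FactKey("example_table_ferc1", "example_factoid", utility_type="gas"),
    FactKey("example_table_ferc1", "example_factoid", utility_type="total"),
]

# Keying on (table_name, xbrl_factoid) alone collapses all three facts.
short_keys = {(f.table_name, f.xbrl_factoid) for f in facts}
# Keying on every dimension keeps them distinct.
full_keys = set(facts)

print(len(short_keys), len(full_keys))  # prints "1 3"
```

With only the two-column key, tags or filters applied to one of these facts would unavoidably apply to all three.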

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans added the ferc1 (Anything having to do with FERC Form 1), metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.), and xbrl (Related to the FERC XBRL transition) labels Jul 31, 2023
@zaneselvans zaneselvans marked this pull request as draft July 31, 2023 14:02
@zaneselvans
Member Author

I've added the drop_duplicates() call and, weirdly, it seems like I now have more records than before (1044 total).

I tried selecting all of the "leaf" nodes (records which have values in the parent columns, but all NA calc columns) and it looks like all of the leaves also currently lack any additional dimensions in their parents, which doesn't seem like what we would expect.

new_calcs = MetadataExploder(
    table_names=table_names,
    clean_xbrl_metadata_json=clean_xbrl_metadata_json,
    calculation_components_xbrl_ferc1=calculation_components_xbrl_ferc1,
).calculations

# Leaves are records whose calculation component columns are all NA.
leaves = new_calcs[new_calcs[calc_cols].isna().all(axis="columns")]
leaves[parent_cols + calc_cols + ["weight"]].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239 entries, 15 to 782
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   table_name_parent      239 non-null    string
 1   xbrl_factoid_parent    239 non-null    string
 2   utility_type_parent    0 non-null      string
 3   plant_status_parent    0 non-null      string
 4   plant_function_parent  0 non-null      string
 5   table_name             0 non-null      string
 6   xbrl_factoid           0 non-null      string
 7   utility_type           0 non-null      string
 8   plant_status           0 non-null      string
 9   plant_function         0 non-null      string
 10  weight                 0 non-null      Int64 
dtypes: Int64(1), string(10)
memory usage: 22.6 KB
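
As a toy illustration of the leaf selection described above, here is a small self-contained sketch. The dataframe contents are invented, and `parent_cols` and `calc_cols` are cut down to two columns each as an assumption (the real lists evidently also include the dimension columns shown in the output above).

```python
import pandas as pd

# Assumed, shortened column lists; the real ones also carry the dimensions.
parent_cols = ["table_name_parent", "xbrl_factoid_parent"]
calc_cols = ["table_name", "xbrl_factoid"]

# Made-up calculations: "total" is computed from two parts, and each part
# has a record of its own with no calculation components.
calcs = pd.DataFrame(
    {
        "table_name_parent": ["t1", "t1", "t2", "t2"],
        "xbrl_factoid_parent": ["total", "total", "part_a", "part_b"],
        "table_name": ["t2", "t2", pd.NA, pd.NA],
        "xbrl_factoid": ["part_a", "part_b", pd.NA, pd.NA],
        "weight": [1, 1, pd.NA, pd.NA],
    }
).astype({"weight": "Int64"})

# A leaf has a parent identity but every calculation column is NA.
is_leaf = calcs[calc_cols].isna().all(axis="columns")
leaves = calcs.loc[is_leaf, parent_cols + calc_cols + ["weight"]]

print(leaves["xbrl_factoid_parent"].tolist())  # prints "['part_a', 'part_b']"
```

Building the boolean mask separately from the column selection makes it easy to inspect which records were classified as leaves before narrowing the columns.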

@zaneselvans zaneselvans marked this pull request as ready for review August 11, 2023 04:55
@zaneselvans
Member Author

Oooookay, I've got the whole explosion + filtering based on the calculation forest working again, but there are lots of issues involving the new dimensions that we'll need to untangle. Hopefully many of them can be fixed systematically rather than manually.

Not everything in this PR is pretty. I'm sure it can be improved. But do we want to get this stuff merged in, and then work on fixing the calculations with all the dimensions separately from these changes?

The setup I'm using right now to identify issues looks like this...

Setup Inputs

import importlib.resources

import pandas as pd
from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import (
    Exploder,
    MetadataExploder,
    NodeId,
    XbrlCalculationForestFerc1,
)

tags_csv = (
    importlib.resources.files("pudl.package_data.ferc1")
    / "xbrl_factoid_rate_base_tags.csv"
)
tags_df = (
    pd.read_csv(tags_csv, usecols=["table_name", "xbrl_factoid", "in_rate_base"])
    .drop_duplicates()
    .dropna(subset=["table_name", "xbrl_factoid"], how="any")
    .astype(pd.StringDtype())
)

clean_xbrl_metadata_json = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
metadata_xbrl_ferc1 = defs.load_asset_value(AssetKey("metadata_xbrl_ferc1"))
calculation_components_xbrl_ferc1 = defs.load_asset_value(
    AssetKey("calculation_components_xbrl_ferc1")
)

explosion_args = {
    "income_statement_ferc1": {
        "root_table": "income_statement_ferc1",
        "table_names_to_explode": [
            "income_statement_ferc1",
            "depreciation_amortization_summary_ferc1",
            "electric_operating_expenses_ferc1",
            "electric_operating_revenues_ferc1",
        ],
        "calculation_tolerance": 0.27,
        "seeds": [
            NodeId(
                table_name="income_statement_ferc1",
                xbrl_factoid="net_income_loss",
                utility_type="total",
                plant_status=pd.NA,
                plant_function=pd.NA,
            ),
        ],
        "tags": tags_df,
    },
    "balance_sheet_assets_ferc1": {
        "root_table": "balance_sheet_assets_ferc1",
        "table_names_to_explode": [
            "balance_sheet_assets_ferc1",
            "utility_plant_summary_ferc1",
            "plant_in_service_ferc1",
            "electric_plant_depreciation_functional_ferc1",
        ],
        "calculation_tolerance": 0.81,
        "seeds": [
            NodeId(
                table_name="balance_sheet_assets_ferc1",
                xbrl_factoid="assets_and_other_debits",
                utility_type=pd.NA,
                plant_status=pd.NA,
                plant_function=pd.NA,
            )
        ],
        "tags": tags_df,
    },
    "balance_sheet_liabilities_ferc1": {
        "root_table": "balance_sheet_liabilities_ferc1",
        "table_names_to_explode": [
            "balance_sheet_liabilities_ferc1",
            "retained_earnings_ferc1",
        ],
        "calculation_tolerance": 0.075,
        "seeds": [
            NodeId(
                table_name="balance_sheet_liabilities_ferc1",
                xbrl_factoid="liabilities_and_other_credits",
                utility_type=pd.NA,
                plant_status=pd.NA,
                plant_function=pd.NA,
            )
        ],
        "tags": tags_df,
    },
}

Coordinating Function

def exploded_table(
    root_table: str,
    table_names_to_explode: list[str],
    calculation_tolerance: float,
    seeds: list[NodeId],
    tags: pd.DataFrame,
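) -> dict:
    """Build an Exploder for one root table and collect its outputs for inspection."""
    metadata_xbrl_ferc1 = defs.load_asset_value(
        AssetKey("metadata_xbrl_ferc1")
    )
    calculation_components_xbrl_ferc1 = defs.load_asset_value(
        AssetKey("calculation_components_xbrl_ferc1")
    )

    # NOTE: pudl_engine is not defined in this snippet. It is assumed to be a
    # SQLAlchemy engine connected to the PUDL database, created elsewhere.
    dfs_to_explode = {
        table: pd.read_sql(table, pudl_engine) for table in table_names_to_explode
    }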

    exploder = Exploder(
        root_table=root_table,
        table_names=table_names_to_explode,
        metadata_xbrl_ferc1=metadata_xbrl_ferc1,
        calculation_components_xbrl_ferc1=calculation_components_xbrl_ferc1,
        seed_nodes=seeds,
        tags=tags,
    )
    return {
        "exploder": exploder,
        "exploded_meta": exploder.metadata_exploded,
        "exploded_calcs": exploder.calculations_exploded,
        "forest": exploder.calculation_forest,
        "leafy_meta": exploder.calculation_forest.leafy_meta,
        "root_calcs": exploder.calculation_forest.root_calculations,
        "exploded_data": exploder.boom(
            tables_to_explode=dfs_to_explode,
            calculation_tolerance=calculation_tolerance,
        ),
    }

Run the Explosions

test_explode = {}
for root_table in explosion_args:
    print(f"Exploding: {root_table}")
    test_explode[root_table] = exploded_table(**explosion_args[root_table])

Display Results

for root_table in test_explode:
    test_explode[root_table]["forest"].plot("full_digraph")
    test_explode[root_table]["forest"].plot("forest")

    # display() is the IPython/Jupyter builtin; this loop is meant for a notebook.
    print("\n ======== ORPHAN NODES: ========\n")
    display(pd.DataFrame(test_explode[root_table]["forest"].orphans))

    print("\n ======== PRUNED NODES: ========\n")
    display(pd.DataFrame(test_explode[root_table]["forest"].pruned))
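
For readers unfamiliar with seeding a calculation graph, here is a generic sketch of the idea of keeping only what is reachable from the seed nodes. It is not how XbrlCalculationForestFerc1 is implemented, and it does not claim to reproduce the exact definitions of the orphans and pruned attributes. The edges, node names, and the `reachable` helper are all hypothetical.

```python
from collections import deque

# Hypothetical parent -> children calculation edges.
edges = {
    "net_income": ["revenues", "expenses"],
    "expenses": ["fuel", "labor"],
    "unrelated_total": ["unrelated_part"],
}
seeds = ["net_income"]


def reachable(edges: dict, seeds: list) -> set:
    """Return every node reachable from the seeds via breadth-first search."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


all_nodes = set(edges) | {c for kids in edges.values() for c in kids}
kept = reachable(edges, seeds)
dropped = all_nodes - kept

print(sorted(dropped))  # prints "['unrelated_part', 'unrelated_total']"
```

In this toy graph, anything that cannot be reached from `net_income` falls outside the seeded tree, which is the kind of node one would want listed for review.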

Outputs

Income Statement

image
image

Balance Sheet Assets

image
image

Balance Sheet Liabilities

image
image

@zaneselvans zaneselvans self-assigned this Aug 11, 2023
@zaneselvans zaneselvans changed the title WIP update to MetadataExploder.calculations Update XbrlCalculationForestFerc1 and exploded tables to use all dimensions Aug 11, 2023
@zaneselvans zaneselvans changed the title Update XbrlCalculationForestFerc1 and exploded tables to use all dimensions Use all dimensions in XbrlCalculationForestFerc1 and exploded tables Aug 11, 2023
@zaneselvans zaneselvans merged commit 7f7909e into explode_ferc1 Aug 22, 2023
7 of 8 checks passed
@zaneselvans zaneselvans deleted the dim-trees branch August 22, 2023 22:04
Development

Successfully merging this pull request may close these issues.

Add dimensions and tabular calculations to XBRL Calculation Forests
2 participants