Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

💥 FERC feature branch 💥 : FERC tables post caclulation validation by concat-ing & deduping #2633

Merged
merged 449 commits into from
Sep 21, 2023

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Jun 6, 2023

PR Overview

First pass at #2624

# psuedo code version
class Exploder:
    root_table: str
    tables_to_explode: list[str]

    def boom(dfs: list[pd.DataFrame]):
        forest_map = ForestBuilder().build()
        leaves = forest_map.get_leaves()
        roots = forest_map.get_roots()
        exploded = (
            concat(dfs)
            .pipe(validate_calculations, leaves, roots)
            .pipe(validate_inter_table_calculations)
            .pipe(prune, leaves)
        return exploded

    def validate_calculations(df, leaves, roots):
        df = reconcile_table_calculations(df.loc[leaves_only, roots_only])
        ...
        return df

    def validate_inter_table_calculations
        ...

    def prune(exploded_df):
        leaves = ForestBuilder().get_leaves()
        pruned = (
            exploded_df[exploded_df.xbrl_factoid.isin(leaves)]
            .pipe(remove_totals_from_other_dimensions)
        )
        return pruned

class MetadataExploder:
    tables_to_explode: list[str]

    def boom(self, clean_xbrl_metadata_json):
        return ...

class ForestBuilder:
    factoids_to_forest: list[str]
    metadata_exploded = MetadataExploder().boom()
    
    def build():
        ....
        for factoid in factoids_to_forest:
            tree = XbrlFactoidTreeBuilder(xbrl_factoid).build(metadata_exploded)
        forest = do some stuff here
        return forest

    def get_leaves(forest):
        ....
        return leaves
    

class XbrlFactoidTreeBuilder:
    xbrl_factoid
    table_name
    ...
   
    def build(metadata_exploded):
        ...

To dos:

  • Add some validations

Questions

  • Is there a more sensible design for this overall? I can imagine an Explode class which does some of the upfront metadata compilation and validation for a given set of tables to explode.
  • See first comment below re: dagster halp question.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@cmgosnell cmgosnell changed the title first pass of removing xbrl factoids that show up in multiple tbales 💥 FERC tables post caclulation validation by concat-ing & deduping Jun 6, 2023
src/pudl/output/ferc1.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jun 6, 2023

Codecov Report

Patch coverage: 86.4% and no project coverage change.

Comparison is base (86d528a) 88.5% compared to head (1880d7b) 88.6%.
Report is 1 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff           @@
##             dev   #2633    +/-   ##
======================================
  Coverage   88.5%   88.6%            
======================================
  Files         90      90            
  Lines      10139   10832   +693     
======================================
+ Hits        8982    9599   +617     
- Misses      1157    1233    +76     
Files Changed Coverage Δ
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/ferc1.py 96.6% <ø> (-0.2%) ⬇️
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/output/ferc1.py 89.3% <86.3%> (-10.0%) ⬇️
src/pudl/convert/censusdp1tract_to_sqlite.py 88.5% <100.0%> (ø)
src/pudl/extract/ferc1.py 99.1% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 1115 to 1145
def get_table_level(table_name: str, top_table: str) -> int:
"""Get a table level."""
# we may be able to infer this nesting from the metadata
table_nesting = {
"balance_sheet_assets_ferc1": {
"utility_plant_summary_ferc1": {
"plant_in_service_ferc1": None,
"electric_plant_depreciation_changes_ferc1": None,
},
},
"balance_sheet_liabilities_ferc1": {"retained_earnings_ferc1": None},
"income_statement_ferc1": {
"depreciation_amortization_summary_ferc1": None,
"electric_operating_expenses_ferc1": None,
"electric_operating_revenues_ferc1": None,
},
}
if table_name == top_table:
level = 1
elif table_name in table_nesting[top_table].keys():
level = 2
elif table_name in pudl.helpers.dedupe_n_flatten_list_of_lists(
[values.keys() for values in table_nesting[top_table].values()]
):
level = 3
else:
raise AssertionError(
f"AH we didn't find yer table name {table_name} in the nested group of "
"tables. Be sure all the tables you are trying to explode are related."
)
return level
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i expect this whole level getting setup will be nullified if we get #2625 working. i just needed something to work for now

@zaneselvans
Copy link
Member

Many of the new functions in the Explode section of pudl.output.ferc1 don't appear to be getting called in the tests (at least based on the coverage report). Is that expected? Functions that the coverage report says aren't called include:

  • explode_tables()
  • remove_factoids_from_mutliple_tables()
  • get_table_level()
  • get_table_levels()

This seems weird, since explode_tables() is definitely called in the assets generated by the factory. Are they showing up in your DAG locally? They do show up in mine when I reload. There was a typo in the explode_tables() type hints. Not sure if that would be a problem that keeps it from running though.

I am fixing the typo and making some names more consistent and pushing to see if that gets these functions run / assets materialized in the CI.

We might want to use exploded_ as a prefix for these DB tables like we do for raw_, clean_, norm_ and denorm_ since it's another processing stage which is applied to the modified table.

@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. xbrl Related to the FERC XBRL transition labels Jun 8, 2023
Base automatically changed from xbrl_meta_reshape to dev June 8, 2023 17:19
return value_col


def remove_factoids_from_mutliple_tables(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently assumes factoids are the same if the original names are the same. After #2623 is done we'll need to revisit this assumption, since we may be comparing the transformed name. We'll need a different way to compare those factoids by name, whether by modifying this method or by writing a different one.

return pd.DataFrame(table_levels)


def remove_totals_from_other_dimensions(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other dimensions here refers to the sub-total columns.
Currently this method drops "totals" for any table that actively has sub-dimensions.
We currently aren't correcting for sub-total v. total differences, which we should probably do if we're going to drop these.

@jrea-rmi
Copy link
Collaborator

jrea-rmi commented Jun 20, 2023

Reviewing exploded_balance_sheet_assets_ferc1.pkl sent by @e-belfer on 6/16/2023!

Currently only includes 2020-2021. Is your plan to test just on 2020-2021 then expand? Eventually we'd like to have 2005-2021 (and then -2022 after that's looking good).

The only table_names I see are plant_in_service_ferc1 and utility_plant_summary_ferc1. I also expect to see electric_plant_depreciation_functional_ferc1 and balance_sheet_assets_ferc1 - are those not integrated yet or is it an error that their components are being dropped?

electric_plant_in_service_and_completed_construction_not_classified_electric is a calculated field that's double counting with details of plant_in_service_ferc1. The final exploded tables with not include any row_type_xbrl == 'calculated_value', correct?

row_type_xbrl is NA for accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility_detail - this is a calculated field that should be exploded using electric_plant_depreciation_functional_ferc1, right?

How will users add a plant_function (steam, nuclear, hydro, other) label? I see it included as a column in exploded_income_statement_ferc1.pkl, and this that'll be a common enough use it should either be included the same way or with one step of mapping to add that column.

explosion_tables.append(tbl)
metadata_exploded = self.meta_exploder.boom(clean_xbrl_metadata_json)
exploded = pd.concat(explosion_tables)
# drop any metadata columns coming from the tbls bc we may have edited the
Copy link
Member

@e-belfer e-belfer Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does clean_xbrl_metadata_json contain the calculation corrections added to each table?

cmgosnell and others added 14 commits July 12, 2023 15:49
* Rename "seeds" argument to "seed_nodes"
* Validate some aspects of the exploded metadata as soon as it is created, and give
  useful error messages if we find problems.
* Make the calculation_forest a property of the Exploder class, rather than
  automatically building it upon instantiation to aid in debugging.
…ion_and_depletion_of_plant_utility as in rate base
…ations

Link electric OpEx to income statement table
… checks

note: the tests don't pass! these guys are too off which is to say
something is probably wrong with the code.
zaneselvans and others added 25 commits September 12, 2023 19:06
* Handle multi-dim totals

* Do not treat NA value as total value.

* Runtime check to avoid extra components in total calcs.

* Add some calculation metadata to the inferred total calculations for downstream consumers; get existing tree annotation code working.

There will be new, shiny, better tree annotation code in explode_tree_fixes, but this works for now.

---------

Co-authored-by: Zane Selvans <zane.selvans@catalyst.coop>
Make leafy balance sheet assets & liabilities data
@zaneselvans zaneselvans marked this pull request as ready for review September 20, 2023 21:12
@zaneselvans zaneselvans merged commit be2acf0 into dev Sep 21, 2023
16 of 17 checks passed
@zaneselvans zaneselvans deleted the explode_ferc1 branch September 21, 2023 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. xbrl Related to the FERC XBRL transition
Projects
Archived in project
6 participants