Clean-up XBRL calculation fixes #2728

Merged · cmgosnell merged 20 commits into explode_ferc1 from calc_fix_cleanup on Jul 28, 2023
Conversation

@cmgosnell (Member) commented Jul 17, 2023

PR Overview

#2605

currently working through all of the transforms:

Tasks

  1. balance_sheet_liabilities_ferc1: failing because of dupes in "long_term_portion_of_derivative_instrument_liabilities" & "long_term_portion_of_derivative_instrument_liabilities_hedges"
  2. electric_plant_depreciation_changes_ferc1: instant["name"] = instant["name"] + ["_starting_balance", "_ending_balance"] (see comment)
  3. electric_operating_expenses_ferc1: pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-many merge during reconcile_table_calculations -> calculate_values_from_components

this last one looks like a calc fix problem:

[screenshot of the failing calculation check omitted]
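A hypothetical snippet for digging into that MergeError (not code from this PR); the frame name and key columns are assumptions about what the left side of the failing merge looks like:

import pandas as pd

# Surface the duplicated left-hand merge keys that trip the one-to-many
# validation inside calculate_values_from_components.
calc_idx = ["table_name_parent", "xbrl_factoid_parent"]  # assumed key columns
dupes = calc_components[calc_components.duplicated(subset=calc_idx, keep=False)]
print(dupes.sort_values(calc_idx))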

renaming tasks:

  • cleanup/normalize the columns in metadata vs calc tables
    • decide where to make intra_table_calc_flag... it is necessary/useful in the calc table even in the reconciliation step
  • anything to integrate over in the explosion step?
  • use the generalized calculate_values_from_components over in the explosion land
  • note that the dimensions are kind of incorporated into the first checks in the calc reconciliation step... should we now also do this in the subtotals step? :scratches_head:

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@cmgosnell linked an issue Jul 17, 2023 that may be closed by this pull request
@codecov (bot) commented Jul 17, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.1% ⚠️

Comparison is base (1a6efa4) 88.4% compared to head (b3a7ca9) 88.4%.
Report is 2 commits behind head on explode_ferc1.

❗ Current head b3a7ca9 differs from pull request most recent head 4a4df01. Consider uploading reports for the commit 4a4df01 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##           explode_ferc1   #2728     +/-   ##
===============================================
- Coverage           88.4%   88.4%   -0.1%     
===============================================
  Files                 89      88      -1     
  Lines              10711   10668     -43     
===============================================
- Hits                9478    9434     -44     
- Misses              1233    1234      +1     

see 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@cmgosnell (Member Author) commented Jul 18, 2023

this is just here for posterity because I'm not planning on checking in the code I'm using to jump back and forth with the big calculated_fields_to_fix dict that currently lives in apply_xbrl_calculation_fixes (also, don't mind all of the duplication here...)

import pandas as pd

calc_fix_idx = ["table_name", "xbrl_factoid", "xbrl_factoid_calc"]

# Flatten the nested fix dict into records of the new calculation components.
add_me = []
for table, calcs in calculated_fields_to_fix.items():
    for factoid, fixes in calcs.items():
        for fix in fixes:
            add_me.append(
                pd.json_normalize(fix["calc_component_new"]).assign(
                    xbrl_factoid=factoid, table_name=table
                )
            )
add = (
    pd.concat(add_me)
    .explode("source_tables")
    .rename(
        columns={
            "name": "xbrl_factoid_calc",
            "source_tables": "table_name_calc",
        }
    )
    .dropna(subset=calc_fix_idx)
    .set_index(calc_fix_idx)
)

# Do the same flattening for the components each fix replaces.
delete_me = []
for table, calcs in calculated_fields_to_fix.items():
    for factoid, fixes in calcs.items():
        for fix in fixes:
            delete_me.append(
                pd.json_normalize(fix["calc_component_to_replace"]).assign(
                    xbrl_factoid=factoid, table_name=table
                )
            )
delete = (
    pd.concat(delete_me)
    .rename(columns={"name": "xbrl_factoid_calc"})[calc_fix_idx]
    .dropna()
    .set_index(calc_fix_idx)
)

# Keep deletions that aren't superseded by an addition, then all additions.
fixes = pd.concat([delete.loc[delete.index.difference(add.index)], add]).sort_index()
assert not fixes.index.duplicated().any()

Base automatically changed from calc_component_tbl to explode_ferc1 July 18, 2023 22:12
@zaneselvans added labels on Jul 25, 2023: ferc1 (Anything having to do with FERC Form 1), data-repair (Interpolating or extrapolating data that we don't actually have), metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages), xbrl (Related to the FERC XBRL transition)
calculation_components.intra_table_calc_flag
& calculation_components.xbrl_factoid.notnull() # no nulls bc we have all parents
]
# !!! Add dimensions into the calculation components!!!
Member Author (@cmgosnell):
i ended up adding the implied dimensions into the calculation component table in here because it made the calculation checking simpler and more in line with how we are checking calcs over in output land. I think this really belongs over in process_xbrl_metadata_calculations.
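A minimal sketch of the "implied dimensions" idea (not the PR's actual code; the frame names and the dimension column are assumptions): broadcast the dimension values observed in the processed data onto the calculation components by merging on the table/factoid key.

# Hypothetical sketch: merge observed dimension values into the
# calculation component records. "utility_type" stands in for whatever
# dimension columns the table actually has.
observed_dims = data[["table_name", "xbrl_factoid", "utility_type"]].drop_duplicates()
calc_components_w_dims = calculation_components.merge(
    observed_dims,
    on=["table_name", "xbrl_factoid"],
    how="left",
)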

Member Author (@cmgosnell):
yes, the answer is yes

Member:
Agree!

Member Author (@cmgosnell):
hm, we have a problem: adding in the dimensions requires having the data, because the implied dimensions are gleaned from the processed data. Typically we've done all of the metadata and calculation processing before, and fully independent of, the data processing.

I think because of this interdependency, I'd like to keep this as is.

Member:
I don't know if this would make sense, but in the context of the assets being written into the database, we could have per-table calculation component tables that do depend on the data (taking the data tables as inputs).

But maybe this is duplicative with the all-tables calculation components table, which will have a more complete knowledge of all the dimensional values that are observed?
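A rough sketch of what that per-table idea could look like as a Dagster asset (purely hypothetical; the asset and column names are made up for illustration, and this is not code from the PR):

from dagster import asset
import pandas as pd

# Hypothetical: a per-table calculation component asset that takes the
# processed data table as an input, so observed dimension values can be
# merged in.
@asset
def calc_components_electric_operating_expenses_ferc1(
    electric_operating_expenses_ferc1: pd.DataFrame,
    calculation_components_xbrl_ferc1: pd.DataFrame,
) -> pd.DataFrame:
    """One table's calculation components, with observed dimensions attached."""
    observed = electric_operating_expenses_ferc1[
        ["xbrl_factoid", "utility_type"]  # assumed dimension column
    ].drop_duplicates()
    return (
        calculation_components_xbrl_ferc1
        .query("table_name == 'electric_operating_expenses_ferc1'")
        .merge(observed, on="xbrl_factoid", how="left")
    )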

Member:
But anyway, not a blocker for merging!

@zaneselvans marked this pull request as ready for review July 26, 2023 19:57
Comment on lines 910 to 913
gby_parent = [
f"{col}_parent" if col in ["table_name", "xbrl_factoid"] else col
for col in data_idx
]
Member:
Sorry if we talked about this yesterday and I forgot, but why don't we need to group by all of the parent key columns, including the dimensions? Why is it only using table name and factoid?

Member:
Ah, I think maybe I was thrown off by this conditional comprehension. I don't understand what you expect to have in here now. It seems like a mix of _parent and non-parent columns. Why is that appropriate? Is there a way we can make this more readable?

Member:
Definitely not understanding the nature of the initial merge + groupby.

Member:
I think the part that's confusing me is why we wouldn't want all of the gby_parent values that show up in calc_idx to have the _parent suffix.

Member Author (@cmgosnell) commented Jul 27, 2023:
the only columns that have the _parent suffix are the "table_name" and "xbrl_factoid"
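An illustration of what that means for the comprehension above (the data_idx value here is assumed for the example):

# Only the two ID columns pick up the _parent suffix; dimension columns
# pass through unchanged.
data_idx = ["table_name", "xbrl_factoid", "utility_type", "report_year"]
gby_parent = [
    f"{col}_parent" if col in ["table_name", "xbrl_factoid"] else col
    for col in data_idx
]
# gby_parent == ["table_name_parent", "xbrl_factoid_parent", "utility_type", "report_year"]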

Member Author (@cmgosnell):
we didn't talk about this yesterday, but we did talk about this for a while in my last PR, #2753!

if it would be clearer, I could check for any _parent-suffixed columns in calculation_components and replace the hard-coded ["table_name", "xbrl_factoid"] with that list, something like the sketch below.
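A minimal sketch of that alternative (assuming calculation_components and data_idx are in scope as above):

# Derive the suffixed key columns from the calculation components table
# instead of hard-coding ["table_name", "xbrl_factoid"].
parent_cols = [
    col.removesuffix("_parent")
    for col in calculation_components.columns
    if col.endswith("_parent")
]
gby_parent = [f"{col}_parent" if col in parent_cols else col for col in data_idx]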

@cmgosnell self-assigned this Jul 27, 2023
@cmgosnell requested a review from e-belfer July 27, 2023 17:05
@zaneselvans (Member) left a comment:

I don't really understand the merge + groupby for the validation of the calculations but I guess that's just how it's going to be.


Comment on lines 1025 to 1041
-    source = files(pudl.package_data.ferc1).joinpath("dbf_to_xbrl.csv")
-    with as_file(source) as file:
+    source = importlib.resources.files("pudl.package_data.ferc1").joinpath(
+        "dbf_to_xbrl.csv"
+    )
+    with importlib.resources.as_file(source) as file:
Member:
I did a little poking and I think the most concise / readable way to use the new API for our purposes is probably something like:

mapped_rows = (
    pd.read_csv(importlib.resources.files("pudl.package_data.ferc1") / "table_file_map.csv")
    .set_index(idx_cols)
    .drop(columns=["row_literal"])
)

@cmgosnell merged commit 3448f08 into explode_ferc1 Jul 28, 2023
7 of 8 checks passed
@cmgosnell deleted the calc_fix_cleanup branch July 28, 2023 15:52
Labels
data-repair (Interpolating or extrapolating data that we don't actually have), ferc1 (Anything having to do with FERC Form 1), metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages), xbrl (Related to the FERC XBRL transition)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Use tabular calculation fixes in FERC 1 table transformers
3 participants