Clean-up XBRL calculation fixes #2728
Conversation
Codecov Report
Patch coverage has no change and project coverage change: -0.1%

Additional details and impacted files

@@           Coverage Diff            @@
##        explode_ferc1   #2728   +/- ##
=========================================
- Coverage        88.4%   88.4%   -0.1%
=========================================
  Files              89      88      -1
  Lines           10711   10668     -43
=========================================
- Hits             9478    9434     -44
- Misses           1233    1234      +1

☔ View full report in Codecov by Sentry.
this is just here for posterity bc i'm not planning on checking in the code i am using to jump back and forth between the big calculated_fields_to_fix dict:

import pandas as pd

calc_fix_idx = ["table_name", "xbrl_factoid", "xbrl_factoid_calc"]

# Collect the calculation components being added by each fix.
add_me = []
for table, calcs in calculated_fields_to_fix.items():
    for factoid, fixes in calcs.items():
        for fix in fixes:
            add_me.append(
                pd.json_normalize(fix["calc_component_new"]).assign(
                    xbrl_factoid=factoid, table_name=table
                )
            )
add = (
    pd.concat(add_me)
    .explode("source_tables")
    .rename(
        columns={
            "name": "xbrl_factoid_calc",
            "source_tables": "table_name_calc",
        }
    )
    .dropna(subset=calc_fix_idx)
    .set_index(calc_fix_idx)
)

# Collect the calculation components being replaced/deleted by each fix.
delete_me = []
for table, calcs in calculated_fields_to_fix.items():
    for factoid, fixes in calcs.items():
        for fix in fixes:
            delete_me.append(
                pd.json_normalize(fix["calc_component_to_replace"]).assign(
                    xbrl_factoid=factoid, table_name=table
                )
            )
delete = (
    pd.concat(delete_me)
    .rename(columns={"name": "xbrl_factoid_calc"})[calc_fix_idx]
    .dropna()
    .set_index(calc_fix_idx)
)

# Keep the deletions that aren't also re-added, plus all of the additions.
fixes = pd.concat([delete.loc[delete.index.difference(add.index)], add]).sort_index()
assert not fixes.index.duplicated().any()
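A possible follow-up (not something in this PR): because the delete frame carries only the index columns, rows that came solely from delete end up with no table_name_calc after the concat, so pure deletions could be flagged like this, assuming every added component has a source table:

# Hypothetical: NaN table_name_calc marks a record that is only being deleted.
fixes["is_deletion"] = fixes["table_name_calc"].isna()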
Inject missing dbf-only factoids into XBRL metadata
…dl into calc_fix_cleanup
calculation_components.intra_table_calc_flag
& calculation_components.xbrl_factoid.notnull()  # no nulls bc we have all parents
]
# !!! Add dimensions into the calculation components!!!
I ended up adding the implied dimensions into the calculation component table in here because it made the calculation checking simpler and more in line with how we are checking calcs over in output land. I think this really belongs over in process_xbrl_metadata_calculations.
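In case it helps future readers, the idea is roughly the following. This is a loose sketch, not the actual implementation: data stands in for the processed data table, and the dimension column names are only examples.

# Dimension values actually observed for each (table, factoid) in the processed data.
dimensions = ["utility_type", "plant_status", "plant_function"]
observed_dims = data[["table_name", "xbrl_factoid"] + dimensions].drop_duplicates()

# Broadcast the observed dimensions onto the child side of the calculation
# components so they line up with the dimensioned data records when checking calcs.
calc_components_w_dims = calculation_components.merge(
    observed_dims, on=["table_name", "xbrl_factoid"], how="left"
)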
yes the answer is yes
Agree!
Hm, we have a problem: adding in the dimensions requires having the data, because the implied dimensions are gleaned from the processed data. Typically we've done all of the metadata and calculation processing before, and fully independent of, the data processing.
Because of this interdependency, I'd like to keep this as is.
I don't know if this would make sense, but in the context of the assets being written into the database, we could have per-table calculation component tables that do depend on the data (taking the data tables as inputs).
But maybe this is duplicative with the all-tables calculation components table, which will have a more complete knowledge of all the dimensional values that are observed?
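If that route were ever explored, a per-table asset might look roughly like the sketch below. Everything here is illustrative: the asset names, the upstream calc components asset, and the single dimension column are made up, not part of this PR.

from dagster import asset
import pandas as pd

@asset
def calc_components_balance_sheet_liabilities_ferc1(
    balance_sheet_liabilities_ferc1: pd.DataFrame,
    calculation_components_xbrl_ferc1: pd.DataFrame,
) -> pd.DataFrame:
    """Hypothetical per-table calc components asset that takes the data table as input."""
    dims = ["utility_type"]  # illustrative dimension column
    observed = balance_sheet_liabilities_ferc1[["xbrl_factoid"] + dims].drop_duplicates()
    components = calculation_components_xbrl_ferc1.query(
        "table_name == 'balance_sheet_liabilities_ferc1'"
    )
    return components.merge(observed, on="xbrl_factoid", how="left")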
But anyway, not a blocker for merging!
src/pudl/transform/ferc1.py
Outdated
gby_parent = [
    f"{col}_parent" if col in ["table_name", "xbrl_factoid"] else col
    for col in data_idx
]
Sorry if we talked about this yesterday and I forgot, but why don't we need to group by all of the parent key columns, including the dimensions? Why is it only using table name and factoid?
Ah, I think maybe I was thrown off by this conditional comprehension. I don't understand what you expect to have in here now. It seems like a mix of _parent and non-parent columns. Why is that appropriate? Is there a way we can make this more readable?
Definitely not understanding the nature of the initial merge + groupby.
I think the part that's confusing me is why we wouldn't want all of the gby_parent values that show up in calc_idx to have the _parent suffix.
The only columns that have the _parent suffix are "table_name" and "xbrl_factoid".
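For concreteness, here is what that comprehension produces with a hypothetical data_idx (the dimension columns are just examples):

# Only the parent key columns get the _parent suffix; dimensions pass through as-is.
data_idx = ["table_name", "xbrl_factoid", "utility_type", "plant_status"]
gby_parent = [
    f"{col}_parent" if col in ["table_name", "xbrl_factoid"] else col
    for col in data_idx
]
# gby_parent == ["table_name_parent", "xbrl_factoid_parent", "utility_type", "plant_status"]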
We didn't talk about this yesterday, but we did talk about this for a while in my last PR #2753!
If it would be more clear, I could check for any _parent-suffixed columns in calculation_components and replace the ["table_name", "xbrl_factoid"] list with that.
I don't really understand the merge + groupby for the validation of the calculations but I guess that's just how it's going to be.
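For anyone else reading along, the merge + groupby being discussed is roughly the pattern below. This is a simplified sketch, not the real calculate_values_from_components: the key columns and the weight/value names are assumptions.

# Merge each calculation component onto the child data record it points at,
# weight the child values, then sum them up to the parent keys. Comparing that
# sum against the parent's reported value is the calculation check.
calc_idx = ["table_name", "xbrl_factoid", "utility_type", "plant_status"]
gby_parent = [
    f"{col}_parent" if col in ["table_name", "xbrl_factoid"] else col
    for col in calc_idx
]
calculated = (
    calculation_components.merge(data, on=calc_idx, how="left")
    .assign(weighted_value=lambda df: df["weight"] * df["value"])
    .groupby(gby_parent, as_index=False)["weighted_value"]
    .sum()
)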
src/pudl/transform/ferc1.py
Outdated
source = files(pudl.package_data.ferc1).joinpath("dbf_to_xbrl.csv")
with as_file(source) as file:

source = importlib.resources.files("pudl.package_data.ferc1").joinpath(
    "dbf_to_xbrl.csv"
)
with importlib.resources.as_file(source) as file:
I did a little poking and I think the most concise / readable way to use the new API for our purposes is probably something like:
mapped_rows = (
    pd.read_csv(
        importlib.resources.files("pudl.package_data.ferc1") / "table_file_map.csv"
    )
    .set_index(idx_cols)
    .drop(columns=["row_literal"])
)
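For reference, a self-contained version of that files() / as_file() pattern, reusing the package and CSV names from the diff above (nothing else is assumed):

import importlib.resources

import pandas as pd

# files() returns a Traversable; "/" joins it into a path-like object that
# pandas can read directly when the package lives in a normal directory.
csv_path = importlib.resources.files("pudl.package_data.ferc1") / "dbf_to_xbrl.csv"
dbf_to_xbrl_map = pd.read_csv(csv_path)

# as_file() is the safer route if the package could be imported from a zip:
# it guarantees a concrete filesystem path for the duration of the block.
with importlib.resources.as_file(csv_path) as file:
    dbf_to_xbrl_map = pd.read_csv(file)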
PR Overview
#2605
currently working through all of the transforms:
Tasks
- balance_sheet_liabilities_ferc1: failing bc of dupes in "long_term_portion_of_derivative_instrument_liabilities" & "long_term_portion_of_derivative_instrument_liabilities_hedges"
- electric_plant_depreciation_changes_ferc1: instant["name"] = instant["name"] + ["_starting_balance", "_ending_balance"] (see comment)
- electric_operating_expenses_ferc1: pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-many merge during reconcile_table_calculations -> calculate_values_from_components. This last one looks like a calc fix problem (a quick duplicate-key check is sketched below).
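Not part of the PR, but a hedged way to surface whichever records make the merge keys non-unique before that merge blows up (calc_idx here is an assumed key set, using the parent/child naming from elsewhere in this PR):

calc_idx = ["table_name_parent", "xbrl_factoid_parent", "table_name", "xbrl_factoid"]
dupes = calculation_components[
    calculation_components.duplicated(subset=calc_idx, keep=False)
]
print(dupes.sort_values(calc_idx))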
renaming tasks:
- intra_table_calc_flag ... it is necessary/useful in the calc table even in the reconciliation step
- calculate_values_from_components over in the explosion land

PR Checklist