Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename and test XBRL metadata calculations #2563

Merged
merged 114 commits into from Jun 8, 2023
Merged

Rename and test XBRL metadata calculations #2563

merged 114 commits into from Jun 8, 2023

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented May 3, 2023

PR Overview

Working on sub-task of #2016. Desire here is to:

Tasks

Edit tasklist title
Beta Give feedback Tasklist Tasks, more options

Delete tasklist

Delete tasklist block?
Are you sure? All relationships in this tasklist will be removed.
  1. 7 of 7
    data-cleaning ferc1 metadata xbrl
    cmgosnell
    Edit...
  2. 11 of 12
    ferc1 rmi xbrl
    zaneselvans
    Edit...

Latest income-statement component results

2023-05-08 15:07:49 [    INFO] catalystcoop.pudl.transform.ferc1:4332 income_statement_ferc1: has #4 / 0.01% records that don't calculate exactly
2023-05-08 15:07:49 [    INFO] catalystcoop.pudl.transform.ferc1:4332 depreciation_amortization_summary_ferc1: has #1 / 0.00% records that don't calculate exactly
2023-05-08 15:08:02 [    INFO] catalystcoop.pudl.transform.ferc1:4332 electric_operating_expenses_ferc1: has #161 / 0.16% records that don't calculate exactly
2023-05-08 15:08:02 [    INFO] catalystcoop.pudl.transform.ferc1:4245 18 columns from electric_operating_revenues_ferc1 that show up in calculated fields have been labeled with 'source_table' in the metadata.
2023-05-08 15:08:02 [ WARNING] catalystcoop.pudl.transform.ferc1:4250 electric_operating_revenues_ferc1: All renamed types were not foundin the transformed table. Missing types: ['small_or_commercial_sales', 'large_or_industrial_sales']
2023-05-08 15:08:02 [    INFO] catalystcoop.pudl.transform.ferc1:4290 Skipping calcuated values because they are intrer-table calucations: dict_keys(['small_or_commercial_sales', 'large_or_industrial_sales', 'sales_to_ultimate_consumers', 'other_sales_to_public_authorities', 'sales_to_railroads_and_railways', 'interdepartmental_sales', 'public_street_and_highway_lighting', 'residential_sales', 'sales_for_resale'])
2023-05-08 15:08:02 [    INFO] catalystcoop.pudl.transform.ferc1:4332 electric_operating_revenues_ferc1: has #188 / 42.82% records that don't calculate exactly

Note: I was previously calculating the % as off_records/total_records and recently converted this to check the off_records/calculated_records. So the % went up while the # stayed the same

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@cmgosnell cmgosnell requested a review from zaneselvans May 3, 2023 21:21
@codecov
Copy link

codecov bot commented May 4, 2023

Codecov Report

Patch coverage: 98.6% and project coverage change: +0.2 🎉

Comparison is base (fe395c9) 86.9% compared to head (a218eb2) 87.1%.

❗ Current head a218eb2 differs from pull request most recent head 0f72803. Consider uploading reports for the commit 0f72803 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2563     +/-   ##
=======================================
+ Coverage   86.9%   87.1%   +0.2%     
=======================================
  Files         84      84             
  Lines       9720    9912    +192     
=======================================
+ Hits        8447    8636    +189     
- Misses      1273    1276      +3     
Impacted Files Coverage Δ
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/ferc1.py 96.6% <98.6%> (+0.4%) ⬆️
src/pudl/extract/ferc1.py 99.1% <100.0%> (ø)
src/pudl/glue/ferc1_eia.py 98.4% <100.0%> (-0.1%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

this will be migrated away bc the "missing" names mostly seem to
be references to other tables
@cmgosnell
Copy link
Member Author

I added a check to see if the per table metadata calculation names were showing up in the transformed tables and 7 of the 23 tables don't have all of the names in the proper type column in the table. BUT on a spot check it looks like these all show up in other tables. so they are inter-table references.

these are the table that are failing this check:
image

Bc of this... I think I'd like to pull out these per-table transformer methods into a calcs/metadata class of its own

@cmgosnell
Copy link
Member Author

cmgosnell commented May 4, 2023

Okay I pulled out the metadata work into it's own

income_statement_tables = [
    "income_statement_ferc1",
    "depreciation_amortization_summary_ferc1",
    "electric_operating_expenses_ferc1",
    "electric_operating_revenues_ferc1",
]
income_table_dollar_cols = {
    "income_statement_ferc1": "income",
    "depreciation_amortization_summary_ferc1": "depreciation_amortization_value",
    "electric_operating_expenses_ferc1": "expense",
    "electric_operating_revenues_ferc1": "revenue",
}
# tables = {tbl: defs.load_asset_value(AssetKey(tbl)) for tbl in income_statement_tables}
tables = {tbl: pd.read_sql(tbl, pudl_engine) for tbl in income_statement_tables}
meta_converted = ExplodeMeta(xbrl_meta).convert_metadata(income_statement_tables)

calc_dfs = {}
for table_name in income_statement_tables:
    calculated_values = meta_converted[table_name]
    dollar_value_col = income_table_dollar_cols[table_name]
    table_df = tables[table_name]
    calc_dfs[table_name] = check_table_calcs(table_name, table_df, dollar_value_col, calculated_values)

Results:

2023-05-04 17:46:19 [    INFO] catalystcoop.pudl.transform.ferc1:4092 income_statement_ferc1: has #7097 / 2.23% records that don't calculate exactly
2023-05-04 17:46:19 [    INFO] catalystcoop.pudl.transform.ferc1:4092 depreciation_amortization_summary_ferc1: has #16 / 0.01% records that don't calculate exactly
2023-05-04 17:46:20 [    INFO] catalystcoop.pudl.transform.ferc1:4092 electric_operating_expenses_ferc1: has #7707 / 1.44% records that don't calculate exactly
2023-05-04 17:46:21 [    INFO] catalystcoop.pudl.transform.ferc1:4092 electric_operating_revenues_ferc1: has #197 / 0.28% records that don't calculate exactly

@cmgosnell cmgosnell changed the title very wip-y xbrl calculation metadata rename process Rename and test XBRL metadata calculations May 5, 2023
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@zaneselvans zaneselvans self-requested a review June 5, 2023 21:14
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow there is a lot in here even if we ignore the DB migrations (which unfortunately make it impossible to tell how much there is actually in here).

src/pudl/glue/ferc1_eia.py Show resolved Hide resolved
Comment on lines 95 to 104
for table_name, table_meta in raw_xbrl_metadata_json.items():
for list_of_facts in table_meta.values():
for xbrl_fact in list_of_facts:
# all facts have ``calculations``, but they are empty lists when null
for calc_component in xbrl_fact["calculations"]:
# does the calc component show up in the table? if not, add a label
if calc_component["name"] not in tables_to_fields[table_name]:
calc_component = label_source_table(
calc_component, tables_to_fields
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably define Pydantic classes for the objects that are represented in the JSON metadata, and methods for those classes that know how to perform these operations internal to the objects, otherwise it's all very dependent on the implicit structure of the nested dicts/lists and as those contents evolve, it'll be time consuming to chase down all the places where things break and fix them.

What are the component data structures we need to represent in here?

  • table metadata (a list of fact metadata objects? Can be converted to a dataframe?)
  • fact metadata (what's in here besides calculations? Can be converted to a row in a dataframe?)
  • calculation (a list of calculation components, needs context of what table it's embedded within?)
  • calculation component (name + weight, plus we need to qualify the name with a source table?)

I think we're already partly there since the sub-functions defined above would become methods of one of these classes.

devtools/explosion_ferc1.ipynb Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Show resolved Hide resolved
src/pudl/transform/ferc1.py Show resolved Hide resolved
src/pudl/transform/ferc1.py Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some additional comments in our existing threads and maybe added another one regarding naming / my ongoing confusions.

@cmgosnell cmgosnell merged commit ced19c5 into dev Jun 8, 2023
8 checks passed
@cmgosnell cmgosnell deleted the xbrl_meta_reshape branch June 8, 2023 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. rmi xbrl Related to the FERC XBRL transition
Projects
Archived in project
3 participants