Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix retained_earnings_ferc1 transform #2645

Merged
merged 15 commits into from Jun 16, 2023
Merged

Fix retained_earnings_ferc1 transform #2645

merged 15 commits into from Jun 16, 2023

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Jun 8, 2023

Addresses #1811 and #2016.

Issues addressed in this PR:

  • Drop amount column from retained_earnings_ferc1: At present, there is an amount, starting_balance and ending_balance column. amount and starting_balance originate from the source data, while ending_balance is produced and filled with amount when reconciling duplicate years' data in the condense_double_year_earnings_types_dbf method. At present, this is only filled when the data has overlapping years. I've updated this method to fill all NA values for ending_balance with amount where available. This adds approximately another 6k values into the ending_balance column and resolves the ambiguity surrounding the relationship between the 3 columns.
  • As discussed below, change condense_double_year_earnings_types_dbf into reconcile_double_year_earnings_types_dbf, to retain the _previous_year versions of existing factoids that require starting balances for their calculations.
  • Adds calculations into the retained_earnings_ferc1 table based on the form itself, and checks them to be correct.
  • Manually spot fixes the account numbers associated with fields.

Questions for the reviewer:

  • I could probably fill in the balance column, but I don't have 100% certainty about what's a debit and what's a credit and some factoids could be either. Thoughts here?

Notes about table weirdnesses:

  • Right now our tables either have starting_balance and ending_balance as columns or factoids. When starting balance values are included in calculations in other tables, they are included as factoids and summed in the dollar_value column.
  • Where tables have ending_balance as a value to be calculated (e.g., utility_plant_summary_ferc1), they do not directly calculate it using the starting_balance, even if starting balance is reported (e.g. balance_sheet_assets_ferc1).
  • Post 2004, the retained_earnings_ferc1 is a weirdo because it has starting_balance in 2 places: as a factoid for the table and as a column for each variable. We collapse these two values into one in the condense_double_year_earnings_types_dbf method but this means that we're losing the ability to add this value as a component in the calculations in the ending_balance column. In general, tables mirror the vertical addition structure used by FERC, and we're deviating from this format in this instance by reshaping the data.
  • Our calculation checks right now are only set up to calculate values across one column (either dollar_value or ending_balance). The retained_earnings table has starting_balance and ending_balance columns for all variables, but there are calculations which require the addition of a starting balance and the ending balance of other components to get to the ending balance.

Design decisions:
To deal with the above, we could do one of the following:

  • Add back unappropriated_retained_earnings_start_balance and unappropriated_retained_earnings_ending_balance (etc.) as factoids, reassigning the ending_value column to be a dollar_value column and dropping the starting_balance data. We'd lose half the data in this table this way.
  • Add back unappropriated_retained_earnings_start_balance and unappropriated_retained_earnings_ending_balance (etc.) as factoids, keeping starting_balance and ending_balance and just filling in the values for this column as they presently are in the table (in effect, undo the work from condense_double_year_earnings_types_dbf). This table is frustratingly less clean than it was before, but we can do the calculation checks. Even so, we'll still have to deal with the same issue being present in the xbrl data, where there is only one factoid with a starting and ending balance, possibly by following a similar workaround to that used for the electric_plant_depreciation_changes_ferc1 table.
    • We're going with this option, which most closely mirrors FERC's format.
  • Reorganize the calculations to be able to take a value from a different column. Probably the neatest path forward but will require serious workarounds all the way down and strays from FERC's existing format.

Remaining tasks:

  • Change the behavior of condense_double_year_earnings_type_dbf to keep the filling behavior without dropping the previous year factoids
  • Change xbrl to include current and previous year factoids, mirroring dbf structure
  • Add these as components into calcs
  • Add table into reconcile_table_calculations() method
  • Debug any remaining issues with the calculations
  • Fix missing account numbers?

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer changed the base branch from main to dev June 8, 2023 18:09
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@e-belfer e-belfer added rmi xbrl Related to the FERC XBRL transition labels Jun 8, 2023
@e-belfer e-belfer self-assigned this Jun 8, 2023
@e-belfer e-belfer linked an issue Jun 8, 2023 that may be closed by this pull request
2 tasks
@codecov
Copy link

codecov bot commented Jun 12, 2023

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (36c3a96) 87.1% compared to head (c3f1dbf) 87.2%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2645   +/-   ##
=====================================
  Coverage   87.1%   87.2%           
=====================================
  Files         86      86           
  Lines      10043   10097   +54     
=====================================
+ Hits        8755    8811   +56     
+ Misses      1288    1286    -2     
Impacted Files Coverage Δ
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/extract/eia860.py 100.0% <100.0%> (ø)
src/pudl/transform/ferc1.py 96.7% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@e-belfer
Copy link
Member Author

Plan of action based on discussion with @zaneselvans

  • Change the behavior of condense_double_year_earnings_type_dbf to keep the filling behavior without dropping the previous year factoids
  • Change xbrl to include current and previous year factoids, mirroring dbf structure
  • Add these as components into calcs

@e-belfer
Copy link
Member Author

e-belfer commented Jun 13, 2023

With the updates above, the table fields are able to be calculated and can be integrated into the explosion process. For the full ETL, the table as-is has 7.9% of calculations not summing correctly. This requires further diagnosis.

Update: this issue has been diagnosed. See subsequent commits for resolution.

@e-belfer e-belfer marked this pull request as ready for review June 13, 2023 19:26
@zaneselvans
Copy link
Member

zaneselvans commented Jun 15, 2023

Notes from messing around with the data:

I apologize if I sound like I'm being dense here... I'm just running through the FERC 1 ETL debugging notebook and finally actually getting to see what in the data. These are just my running notes from that process. Maybe we can sit down with the notebook and the custom transform methods briefly? 😵‍💫

  • What's up with there being two different kinds of starting_balance and ending_balance columns in the instant XBRL dataframe? They two sets of columns are the same but one has a _reserve_federal suffix.
  • In the DBF output the prmry_account column looks like it contains FERC account numbers for ~75 rows (though it's usually blank) Is there any analogous information contained in the XBRL data? Is it okay that we're losing this information? By the time the DBF and XBRL data have been combined in the output from transform_start() there are only 11 of these values left. Oh, I see these are the only ones which weren't associated with an _unstructured earnings_type, so I guess that the FERC Account numbers we get from the XBRL metadata take the place of this information for all the structured entries.
  • There are 2 earnings_type values which only appear in the DBF data and not the XBRL data: unappropriated_undistributed_subsidiary_earnings_previous_year and unappropriated_retained_earnings_previous_year. Is that expected or concerning? It looks like they're the only _previous_year values. I notice that they also never end up with an xbrl_factoid_name_original value. Is that okay? The same is true of the calculations field. Is this because they only exist in the DBF data?
  • The _unstructured values only show up in the DBF data and get dumped when we combine XBRL & DBF because that information comes from separate tables in XBRL-land, and we're not dealing with them now, right?
  • I notice that in the finished transformation we still have the amount column, and that the starting_balance and ending_balance values are always NaN when we have a non-null amount. I recall that you're filling in the ending_balance based on amount later on somewhere... is there a reason why it doesn't make sense to do that here in the initial transformation and then drop the amount column? Is there some conditionality that's not clear how to apply without more context from other tables? Ah. I see that we are dropping the amount column in reconcile_double_year_earnings_types but... then it's still around in the finished table.
  • Reading through the reconciliation function, it sounds like we're trying to use the duplicated information available in the _current_year and _previous_year factoids to fill in missing values and check for internal consistency. Once we've done that it seems like it would be desirable to drop the records that don't refer to the current year. What keeps us from doing that? If they represent duplicated information what do we need these values for later on?

At a high level it seems funny that at the end of this transform we're retaining what feels like a few pieces of duplicative information:

  • The amount column (used to fill in starting / ending balances?)
  • The _previous_year rows, even after we've used them to fill in some missing values.

I suspect that this doesn't make sense to me because I don't understand how we need to use these values later on to replicate the XBRL calculations.

@e-belfer
Copy link
Member Author

e-belfer commented Jun 16, 2023

Hopefully this answers all your questions.

What's up with there being two different kinds of starting_balance and ending_balance columns in the instant XBRL dataframe? They two sets of columns are the same but one has a _reserve_federal suffix.

I'm not sure what you mean about two sets of starting/ending balances. I see two components in the raw XBRL instant data, These are two different components that sum to appropriated_retained_earnings_including_reserve_amortization and are reflected in existing calculations. On my local they're getting correctly processed in the first XBRL processing stage.

There are 2 earnings_type values which only appear in the DBF data and not the XBRL data: unappropriated_undistributed_subsidiary_earnings_previous_year and unappropriated_retained_earnings_previous_year. Is that expected or concerning? It looks like they're the only _previous_year values. I notice that they also never end up with an xbrl_factoid_name_original value. Is that okay? The same is true of the calculations field. Is this because they only exist in the DBF data?

DBF has factoids in current_year and previous_year, and XBRL only has the factoid itself. We were previously collapsing the DBF data into one factoid, but we need both for the calculations, so we are now splitting the XBRL data to match based on the timestamps in the original data. I've gone ahead and dropped the current_year suffix from the DBF data so we now have unappropriated_undistributed_subsidiary_earnings and unappropriated_undistributed_subsidiary_earnings_previous_year, e.g.

Re: the metadata - technically these values can be calculated using the same formula, but with one year's prior data. I think for simplicity though, we should relate to them as reported values. We are in practice checking that they are the same as the ending balance for the prior year's data. Lines 4457-59 attempt to add existing metadata associated with DBF previous_year factoids to the converted XBRL data, but this data was originally null. Perhaps we need to assign these values or troubleshoot here.

The _unstructured values only show up in the DBF data and get dumped when we combine XBRL & DBF because that information comes from separate tables in XBRL-land, and we're not dealing with them now, right?

This table only deals with the structured data currently, correct.

I notice that in the finished transformation we still have the amount column, and that the starting_balance and ending_balance values are always NaN when we have a non-null amount. I recall that you're filling in the ending_balance based on amount later on somewhere... is there a reason why it doesn't make sense to do that here in the initial transformation and then drop the amount column? Is there some conditionality that's not clear how to apply without more context from other tables? Ah. I see that we are dropping the amount column in reconcile_double_year_earnings_types but... then it's still around in the finished table.

I'm not sure we are looking at the same table. At the end of reconcile_double_year_earnings_dbf() the amount column gets dropped (line 4396). It also isn't in the table schema and should get dropped by transform_end(), if somehow it's still in the table. I don't see it in my retained_earnings_ferc1() table at the end of the full transform process.

Reading through the reconciliation function, it sounds like we're trying to use the duplicated information available in the _current_year and _previous_year factoids to fill in missing values and check for internal consistency. Once we've done that it seems like it would be desirable to drop the records that don't refer to the current year. What keeps us from doing that? If they represent duplicated information what do we need these values for later on?

We can't drop the factoid because we need it as a starting_balance factoid in order to do the calculation, and we can't reconcile calculations across multiple columns. I've gone ahead and changed the transform that previously purged this value in order to retain it, while using the information in it to fill some potentially missing gaps. Duplicative but necessary.

@e-belfer e-belfer merged commit 8a90965 into dev Jun 16, 2023
10 checks passed
@e-belfer e-belfer deleted the retained_earnings_fix branch June 16, 2023 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rmi xbrl Related to the FERC XBRL transition
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Fix transform for f1_retained_erng xbrl + dbf
2 participants