Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ferc1 respondents using previous years' form -> data filed in wrong row_number #471

Open
Tracked by #1574 ...
zaneselvans opened this issue Nov 15, 2019 · 1 comment
Open
Tracked by #1574 ...
Labels
data-cleaning Tasks related to cleaning & regularizing data during ETL. dbf Data coming from FERC's old Visual FoxPro DBF database file format. ferc1 Anything having to do with FERC Form 1 rmi

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Nov 15, 2019

plant_in_service_ferc1 table headers contain data

In the transform step of the plant_in_service_ferc1 table, there are a handful of cases in which the header rows contain data in the original database. This should never happen. All instances of this phenomenon are in the years 2004-2005, which is immediately after the table underwent a revision (2003). In addition, the numbers reported in the header rows for 2004 and 2005 are almost identical to each other, so it seems like maybe someone was just re-submitting the same form in both years, ignoring the fact that the form itself had changed.

This means that all of these records are probably bad. However, there are only a few dozen of them (out of ~30,000 total plant-years). It might be possible to recover this data, but it's probably not super

raw_dfs = {"plant_in_service_ferc1": pd.read_sql("f1_plant_in_srvce", ferc1_engine)}
transformed_dfs = {}
transformed_dfs = pudl.transform.ferc1.new_plant_in_service(raw_dfs, transformed_dfs)
pis_head = (
    transformed_dfs["plant_in_service_ferc1"]
    .set_index(["utility_id_ferc1", "report_year", "amount_type"])
    .filter(regex=".*_head$")
)
dude = pd.DataFrame()
for field in pis_head:
    bad_df = pis_head[pis_head[field].notnull()]
    dude = pd.concat([dude, bad_df])
(
    dude
    .reset_index()
    .drop_duplicates()
    .set_index(["utility_id_ferc1", "report_year", "amount_type"]).index
)

The indices of the resulting problematic records:

MultiIndex([(  7, 2004, 'starting_balance'),
            (  7, 2005, 'starting_balance'),
            ( 94, 2004, 'starting_balance'),
            ( 94, 2005, 'starting_balance'),
            (121, 2004, 'starting_balance'),
            (121, 2005, 'starting_balance'),
            (145, 2004, 'starting_balance'),
            (145, 2005, 'starting_balance'),
            (159, 2004, 'starting_balance'),
            (166, 2004, 'starting_balance'),
            (187, 2004, 'starting_balance'),
            (187, 2005, 'starting_balance'),
            (  7, 2004,   'ending_balance'),
            (  7, 2005,   'ending_balance'),
            ( 94, 2004,   'ending_balance'),
            ( 94, 2005,   'ending_balance'),
            (121, 2004,   'ending_balance'),
            (121, 2005,   'ending_balance'),
            (145, 2004,   'ending_balance'),
            (145, 2005,   'ending_balance'),
            (159, 2004,   'ending_balance'),
            (166, 2004,   'ending_balance'),
            (187, 2004,   'ending_balance'),
            (187, 2005,   'ending_balance'),
            ( 25, 2004, 'starting_balance'),
            ( 25, 2005, 'starting_balance'),
            ( 95, 2004, 'starting_balance'),
            (101, 2004, 'starting_balance'),
            (101, 2005, 'starting_balance'),
            (132, 2004, 'starting_balance'),
            (132, 2005, 'starting_balance'),
            (167, 2004, 'starting_balance'),
            (167, 2005, 'starting_balance'),
            (247, 2004, 'starting_balance'),
            ( 25, 2004,   'ending_balance'),
            ( 25, 2005,   'ending_balance'),
            ( 95, 2004,   'ending_balance'),
            (101, 2004,   'ending_balance'),
            (101, 2005,   'ending_balance'),
            (132, 2004,   'ending_balance'),
            (132, 2005,   'ending_balance'),
            (167, 2004,   'ending_balance'),
            (167, 2005,   'ending_balance'),
            (247, 2004,   'ending_balance'),
            (153, 2004, 'starting_balance'),
            (153, 2005, 'starting_balance'),
            (153, 2004,   'ending_balance'),
            (153, 2005,   'ending_balance')],
           names=['utility_id_ferc1', 'report_year', 'amount_type'])
``
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 data-cleaning Tasks related to cleaning & regularizing data during ETL. labels Nov 15, 2019
zaneselvans added a commit that referenced this issue Nov 17, 2019
Used the new row unpacking infrastructure to convert rows into columns
corresponding to the various FERC Accounts. This allows all of the rows
to be mapped to the correct fields, even as which row number they are
associated with changes over the years.  The implemented row maps cover
all available years of data (1994-2018).

The six original columns reported in this table represented the starting
balance, additions, retirements, adjustments, transfers and ending
balance for each FERC account. These "columns" have been converted into
an index column, with entries named for each of those types of amounts
(the column is called amount_type).

The FERC Account numbers are used in the column names (acctXXX) and the
various sub-categories of accounts are used as a prefix (e.g.
transmission_acctXXX_...) followed by a 1-3 word descriptive name, taken
from the row literal describing the original entry on the form.

The conversion of rows into columns meant that we had to stop using the
same format for the record_id that identifies where in the original FERC
data an entry in the PUDL DB came from.

A small number of utilities may have used an old version of the FERC
Form 1 in 2004-2005, meaning that some of the row numbers are incorrect
(when compared to other forms filed in those years). See #471 for
details.

Closes #472
@zaneselvans zaneselvans added the dbf Data coming from FERC's old Visual FoxPro DBF database file format. label Dec 21, 2022
@cmgosnell
Copy link
Member

A new instance of this problem shows up in the income_statement_ferc1 table in 2003. I had to add a little cleanup method in transform.ferc1.IncomeStatementFerc1TableTransformer.process_dbf. Docs from that method:

    In 2003, two rows were added to the ``f1_income_stmnt`` dbf table, which bumped
    the starting ``row_number`` of ``f1_incm_stmnt_2`` from 25 to 27. A small
    handful of respondents seem to have not gotten the memo about this this in
    2003 and have information on these row numbers that shouldn't exist at all for
    this table.

These bad rows consist of 33% (!) of the respondents for that year.

@cmgosnell cmgosnell changed the title plant_in_service_ferc1 table headers contain data ferc1 respondents using previous years' form -> data filed in wrong row_number Dec 21, 2022
@cmgosnell cmgosnell added the rmi label Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-cleaning Tasks related to cleaning & regularizing data during ETL. dbf Data coming from FERC's old Visual FoxPro DBF database file format. ferc1 Anything having to do with FERC Form 1 rmi
Projects
None yet
Development

No branches or pull requests

2 participants