ferc1 respondents using previous years' form -> data filed in wrong row_number #471

zaneselvans · 2019-11-15T15:55:59Z

plant_in_service_ferc1 table headers contain data

In the transform step of the plant_in_service_ferc1 table, there are a handful of cases in which the header rows contain data in the original database. This should never happen. All instances of this phenomenon are in the years 2004-2005, which is immediately after the table underwent a revision (2003). In addition, the numbers reported in the header rows for 2004 and 2005 are almost identical to each other, so it seems like maybe someone was just re-submitting the same form in both years, ignoring the fact that the form itself had changed.

This means that all of these records are probably bad. However, there are only a few dozen of them (out of ~30,000 total plant-years). It might be possible to recover this data, but it's probably not super

raw_dfs = {"plant_in_service_ferc1": pd.read_sql("f1_plant_in_srvce", ferc1_engine)}
transformed_dfs = {}
transformed_dfs = pudl.transform.ferc1.new_plant_in_service(raw_dfs, transformed_dfs)
pis_head = (
    transformed_dfs["plant_in_service_ferc1"]
    .set_index(["utility_id_ferc1", "report_year", "amount_type"])
    .filter(regex=".*_head$")
)
dude = pd.DataFrame()
for field in pis_head:
    bad_df = pis_head[pis_head[field].notnull()]
    dude = pd.concat([dude, bad_df])
(
    dude
    .reset_index()
    .drop_duplicates()
    .set_index(["utility_id_ferc1", "report_year", "amount_type"]).index
)

The indices of the resulting problematic records:

MultiIndex([(  7, 2004, 'starting_balance'),
            (  7, 2005, 'starting_balance'),
            ( 94, 2004, 'starting_balance'),
            ( 94, 2005, 'starting_balance'),
            (121, 2004, 'starting_balance'),
            (121, 2005, 'starting_balance'),
            (145, 2004, 'starting_balance'),
            (145, 2005, 'starting_balance'),
            (159, 2004, 'starting_balance'),
            (166, 2004, 'starting_balance'),
            (187, 2004, 'starting_balance'),
            (187, 2005, 'starting_balance'),
            (  7, 2004,   'ending_balance'),
            (  7, 2005,   'ending_balance'),
            ( 94, 2004,   'ending_balance'),
            ( 94, 2005,   'ending_balance'),
            (121, 2004,   'ending_balance'),
            (121, 2005,   'ending_balance'),
            (145, 2004,   'ending_balance'),
            (145, 2005,   'ending_balance'),
            (159, 2004,   'ending_balance'),
            (166, 2004,   'ending_balance'),
            (187, 2004,   'ending_balance'),
            (187, 2005,   'ending_balance'),
            ( 25, 2004, 'starting_balance'),
            ( 25, 2005, 'starting_balance'),
            ( 95, 2004, 'starting_balance'),
            (101, 2004, 'starting_balance'),
            (101, 2005, 'starting_balance'),
            (132, 2004, 'starting_balance'),
            (132, 2005, 'starting_balance'),
            (167, 2004, 'starting_balance'),
            (167, 2005, 'starting_balance'),
            (247, 2004, 'starting_balance'),
            ( 25, 2004,   'ending_balance'),
            ( 25, 2005,   'ending_balance'),
            ( 95, 2004,   'ending_balance'),
            (101, 2004,   'ending_balance'),
            (101, 2005,   'ending_balance'),
            (132, 2004,   'ending_balance'),
            (132, 2005,   'ending_balance'),
            (167, 2004,   'ending_balance'),
            (167, 2005,   'ending_balance'),
            (247, 2004,   'ending_balance'),
            (153, 2004, 'starting_balance'),
            (153, 2005, 'starting_balance'),
            (153, 2004,   'ending_balance'),
            (153, 2005,   'ending_balance')],
           names=['utility_id_ferc1', 'report_year', 'amount_type'])
``

The text was updated successfully, but these errors were encountered:

Used the new row unpacking infrastructure to convert rows into columns corresponding to the various FERC Accounts. This allows all of the rows to be mapped to the correct fields, even as which row number they are associated with changes over the years. The implemented row maps cover all available years of data (1994-2018). The six original columns reported in this table represented the starting balance, additions, retirements, adjustments, transfers and ending balance for each FERC account. These "columns" have been converted into an index column, with entries named for each of those types of amounts (the column is called amount_type). The FERC Account numbers are used in the column names (acctXXX) and the various sub-categories of accounts are used as a prefix (e.g. transmission_acctXXX_...) followed by a 1-3 word descriptive name, taken from the row literal describing the original entry on the form. The conversion of rows into columns meant that we had to stop using the same format for the record_id that identifies where in the original FERC data an entry in the PUDL DB came from. A small number of utilities may have used an old version of the FERC Form 1 in 2004-2005, meaning that some of the row numbers are incorrect (when compared to other forms filed in those years). See #471 for details. Closes #472

cmgosnell · 2022-12-21T21:26:53Z

A new instance of this problem shows up in the income_statement_ferc1 table in 2003. I had to add a little cleanup method in transform.ferc1.IncomeStatementFerc1TableTransformer.process_dbf. Docs from that method:

    In 2003, two rows were added to the ``f1_income_stmnt`` dbf table, which bumped
    the starting ``row_number`` of ``f1_incm_stmnt_2`` from 25 to 27. A small
    handful of respondents seem to have not gotten the memo about this this in
    2003 and have information on these row numbers that shouldn't exist at all for
    this table.

These bad rows consist of 33% (!) of the respondents for that year.

zaneselvans added ferc1 Anything having to do with FERC Form 1 data-cleaning Tasks related to cleaning & regularizing data during ETL. labels Nov 15, 2019

zaneselvans mentioned this issue Nov 15, 2019

Integrate unpacked plant_in_service table into ETL #472

Closed

4 tasks

zaneselvans added the dbf Data coming from FERC's old Visual FoxPro DBF database file format. label Dec 21, 2022

cmgosnell changed the title ~~plant_in_service_ferc1 table headers contain data~~ ferc1 respondents using previous years' form -> data filed in wrong row_number Dec 21, 2022

zaneselvans mentioned this issue Dec 22, 2022

Integrate FERC XBRL data into PUDL #1574

Open

96 tasks

cmgosnell added the rmi label Jan 4, 2023

cmgosnell mentioned this issue Jan 4, 2023

Fix Known High Priority FERC1 Data Issues #2180

Open

4 tasks

zaneselvans mentioned this issue Feb 11, 2023

FERC Form 1 Data Misaligned for 1995 #1666

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ferc1 respondents using previous years' form -> data filed in wrong row_number #471

ferc1 respondents using previous years' form -> data filed in wrong row_number #471

zaneselvans commented Nov 15, 2019 •

edited by cmgosnell

cmgosnell commented Dec 21, 2022

ferc1 respondents using previous years' form -> data filed in wrong row_number #471

ferc1 respondents using previous years' form -> data filed in wrong row_number #471

Comments

zaneselvans commented Nov 15, 2019 • edited by cmgosnell

plant_in_service_ferc1 table headers contain data

cmgosnell commented Dec 21, 2022

zaneselvans commented Nov 15, 2019 •

edited by cmgosnell