# Description
This notebook demonstrates the process for transforming FERC Form 1 data distributed in the new XBRL format to be compatible with historical data. It focuses on recreating the `f1_fuel` table, but the process will be very similar for the rest of the tables.

This notebook requires a SQLite database containing data already extracted from XBRL filings. To extract this data, you can use the CLI included in this repo. There's a zip file in the root directory of this repo that contains a year's worth of filings. To extract data from these filings, first unzip the file, then run the following command:

`xbrl_extract https://ecollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd {path_to_filings} --to-sql notebooks/ferc1.sqlite --entity-id-fact CompanyIdentifier`

Running this command takes quite a while, so I've also included the resulting database for convenience.

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

First, load the relevant table from the raw database. The tables in the raw database correspond directly to the PDF pages of FERC Form 1.
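To see which schedules are available before picking one, the table names can be read directly from the SQLite file. The sketch below uses only the standard-library `sqlite3` module and queries SQLite's built-in `sqlite_master` catalog; the helper name `list_tables` is illustrative and not part of this repo.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables visible on an open SQLite connection, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Each row is a 1-tuple holding the table name.
    return [name for (name,) in rows]
```

Assuming the database sits next to the notebook, `list_tables(sqlite3.connect("ferc1.sqlite"))` would return the schedule names, one of which is the fuel statistics table loaded in the next cell.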

In [2]:
engine = create_engine("sqlite:///ferc1.sqlite")

fuel_stats = pd.read_sql(
    "402 - Schedule - Steam-Electric Generating Plant Statistics (Large Plants) - Fuel Statistics",
    engine,
    parse_dates=["start_date", "end_date"]
)

In [3]:
fuel_stats

Unnamed: 0,entity_id,start_date,end_date,instant,PlantNameAxis,FuelKindAxis,FuelKind,FuelUnit,QuantityOfFuelBurned,FuelBurnedAverageHeatContent,AverageCostOfFuelPerUnitAsDelivered,AverageCostOfFuelPerUnitBurned,AverageCostOfFuelBurnedPerMillionBritishThermalUnit,AverageCostOfFuelBurnedPerKilowattHourNetGeneration,AverageBritishThermalUnitPerKilowattHourNetGeneration
0,C000041,2011-01-01,2012-01-01,False,Barre Peaker-1,GAS-1,GAS,MCF,69687.0,1021.0,4.523,4.523,4.432,0.051,11487.000
1,C000041,2011-01-01,2012-01-01,False,Center Peaker-1,GAS-1,GAS,MCF,75701.0,1016.0,4.556,4.556,4.486,0.051,11406.000
2,C000041,2011-01-01,2012-01-01,False,Four Corners-0,COAL-0,COAL,TONS,2828924.0,8801.0,,31.756,1.804,0.000,0.000
3,C000041,2011-01-01,2012-01-01,False,Four Corners-0,GAS-0,GAS,MCF,171474.0,1080.0,6.726,6.726,6.228,0.018,9685.000
4,C000041,2011-01-01,2012-01-01,False,Grapeland Peaker-1,GAS-1,GAS,MCF,33762.0,1023.0,4.556,4.556,4.454,0.059,13201.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1691,R001160,2011-01-01,2012-01-01,False,Williams-0,Gas-0,Gas,MCF,,,,,,,
1692,R001160,2011-01-01,2012-01-01,False,Williams-0,Oil-0,Oil,Barrels,14116.0,137734.0,129.150,117.054,20.235,,
1693,R001202,2011-01-01,2012-01-01,False,Beluga-0,GAS-0,GAS,MCF,24636639.0,1000.0,4.810,4.810,4.810,0.052,10799.350
1694,R001202,2011-01-01,2012-01-01,False,Bernice Lake-0,GAS-0,GAS,MCF,626306.0,1000.0,4.820,4.820,4.820,0.072,14714.656


Upon close examination, it's clear that this table is very similar to the `f1_fuel` table in the old database. The new table lacks the columns `report_year`, `supplmnt_num`, `row_number`, `row_seq`, and `row_prvlg`, but many of these can be derived from other information, as I'll demonstrate later. There are also many columns whose names are similar but not identical (camel case vs. snake case, fully spelled names vs. abbreviations, some words reordered). These columns, however, appear in the same order as in the old database, which makes mapping between them quite easy.
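Because the correspondence relies on column order, it can help to write the pairs out by name so that a reordered or missing column fails loudly instead of being silently mislabeled. The sketch below is one way to do that. The pairs are simply the positional matches between the XBRL column names and the old column names that appear in this notebook; they have not been independently checked against the form, and `XBRL_TO_LEGACY` and `rename_record` are hypothetical names.

```python
# XBRL column name -> old database column name, paired by position
# (an assumption based on the column order shown in this notebook).
XBRL_TO_LEGACY = {
    "PlantNameAxis": "plant_name",
    "FuelKind": "fuel",
    "FuelUnit": "fuel_unit",
    "QuantityOfFuelBurned": "fuel_quantity",
    "FuelBurnedAverageHeatContent": "fuel_avg_heat",
    "AverageCostOfFuelPerUnitAsDelivered": "fuel_cost_delvd",
    "AverageCostOfFuelPerUnitBurned": "fuel_cost_burned",
    "AverageCostOfFuelBurnedPerMillionBritishThermalUnit": "fuel_cost_btu",
    "AverageCostOfFuelBurnedPerKilowattHourNetGeneration": "fuel_cost_kwh",
    "AverageBritishThermalUnitPerKilowattHourNetGeneration": "fuel_generation",
}


def rename_record(record, mapping=XBRL_TO_LEGACY):
    """Return a new dict holding only the mapped columns, under their old names."""
    missing = [col for col in mapping if col not in record]
    if missing:
        # Fail early rather than produce a partially mapped record.
        raise KeyError(f"record is missing expected XBRL columns: {missing}")
    return {legacy: record[xbrl] for xbrl, legacy in mapping.items()}
```

With pandas, the same dictionary can be passed to `DataFrame.rename(columns=XBRL_TO_LEGACY)` to rename the columns by name rather than by position.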

In [4]:
# There's no respondent_id in the new database, but there is a company identifier, which I've used in place of the respondent_id
# The report_year can be extracted from the start_date
ferc_fuel_dict = {
    'respondent_id': fuel_stats['entity_id'],
    'report_year': fuel_stats['start_date'].dt.year,
    'supplmnt_num': np.zeros(len(fuel_stats), dtype=int),
    'row_number': np.zeros(len(fuel_stats), dtype=int),
    'row_seq': np.zeros(len(fuel_stats), dtype=int),
    'row_prvlg': np.zeros(len(fuel_stats), dtype=int),

}

# Drop first several columns to align the order of the remaining columns
fuel = fuel_stats.drop(['entity_id', 'start_date', 'end_date', 'instant', 'FuelKindAxis'], axis=1)
ferc_fuel_data_cols = ['plant_name', 'fuel', 'fuel_unit', 'fuel_quantity', 'fuel_avg_heat', 'fuel_cost_delvd',
                       'fuel_cost_burned', 'fuel_cost_btu', 'fuel_cost_kwh', 'fuel_generation']

# Change column names
ferc_fuel_dict.update({old_col: fuel[new_col] for old_col, new_col in zip(ferc_fuel_data_cols, fuel.columns)})

ferc_fuel = pd.DataFrame(ferc_fuel_dict)

In [5]:
ferc_fuel

Unnamed: 0,respondent_id,report_year,supplmnt_num,row_number,row_seq,row_prvlg,plant_name,fuel,fuel_unit,fuel_quantity,fuel_avg_heat,fuel_cost_delvd,fuel_cost_burned,fuel_cost_btu,fuel_cost_kwh,fuel_generation
0,C000041,2011,0,0,0,0,Barre Peaker-1,GAS,MCF,69687.0,1021.0,4.523,4.523,4.432,0.051,11487.000
1,C000041,2011,0,0,0,0,Center Peaker-1,GAS,MCF,75701.0,1016.0,4.556,4.556,4.486,0.051,11406.000
2,C000041,2011,0,0,0,0,Four Corners-0,COAL,TONS,2828924.0,8801.0,,31.756,1.804,0.000,0.000
3,C000041,2011,0,0,0,0,Four Corners-0,GAS,MCF,171474.0,1080.0,6.726,6.726,6.228,0.018,9685.000
4,C000041,2011,0,0,0,0,Grapeland Peaker-1,GAS,MCF,33762.0,1023.0,4.556,4.556,4.454,0.059,13201.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1691,R001160,2011,0,0,0,0,Williams-0,Gas,MCF,,,,,,,
1692,R001160,2011,0,0,0,0,Williams-0,Oil,Barrels,14116.0,137734.0,129.150,117.054,20.235,,
1693,R001202,2011,0,0,0,0,Beluga-0,GAS,MCF,24636639.0,1000.0,4.810,4.810,4.810,0.052,10799.350
1694,R001202,2011,0,0,0,0,Bernice Lake-0,GAS,MCF,626306.0,1000.0,4.820,4.820,4.820,0.072,14714.656


Some tables have a column named `OrderNumber` that corresponds to the `row_number`, but this one does not. The `row_number` and `supplmnt_num` can, however, be deduced from the order of the records within each `respondent_id`. For this table, `row_number` increments by one for each record submitted by an individual respondent until 15, then starts over at 1 and increases the `supplmnt_num` by 1.
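The wrap-around rule is plain integer division: the quotient of a record's position within its respondent gives the supplement number, and the remainder plus one gives the row number. Here is a small pure-Python illustration on toy data. The function name `number_rows` is hypothetical, and `rows_per_page` is left as a parameter; the value 3 is used only to keep the example short.

```python
from collections import defaultdict


def number_rows(respondent_ids, rows_per_page):
    """Assign (supplmnt_num, row_number) to each record, in order of appearance.

    row_number counts 1..rows_per_page within a respondent, then wraps back to 1
    while supplmnt_num increases by one.
    """
    seen = defaultdict(int)  # number of records already numbered, per respondent
    numbered = []
    for respondent in respondent_ids:
        supplmnt_num, offset = divmod(seen[respondent], rows_per_page)
        numbered.append((supplmnt_num, offset + 1))
        seen[respondent] += 1
    return numbered


# Toy input: respondent "A" files four records, "B" files one; each page holds three rows.
print(number_rows(["A", "A", "A", "A", "B"], rows_per_page=3))
# prints [(0, 1), (0, 2), (0, 3), (1, 1), (0, 1)]
```

The fourth record for "A" wraps to row 1 of supplement 1, while "B" starts its own count at row 1 of supplement 0.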

In [6]:
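# Loop through unique respondents and set row/supplement numbers based on the rule above.
# NOTE: the modulus used here is 14, while the description above counts up to 15;
# check the old database to confirm which value is correct.
ROWS_PER_SUPPLEMENT = 14
for respondent in ferc_fuel['respondent_id'].unique():
    respondent_inds = ferc_fuel['respondent_id'] == respondent
    record_index = np.arange(respondent_inds.sum())  # 0-based position within this respondent
    ferc_fuel.loc[respondent_inds, 'row_number'] = (record_index % ROWS_PER_SUPPLEMENT) + 1
    ferc_fuel.loc[respondent_inds, 'supplmnt_num'] = record_index // ROWS_PER_SUPPLEMENT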

In [7]:
ferc_fuel

Unnamed: 0,respondent_id,report_year,supplmnt_num,row_number,row_seq,row_prvlg,plant_name,fuel,fuel_unit,fuel_quantity,fuel_avg_heat,fuel_cost_delvd,fuel_cost_burned,fuel_cost_btu,fuel_cost_kwh,fuel_generation
0,C000041,2011,0,1,0,0,Barre Peaker-1,GAS,MCF,69687.0,1021.0,4.523,4.523,4.432,0.051,11487.000
1,C000041,2011,0,2,0,0,Center Peaker-1,GAS,MCF,75701.0,1016.0,4.556,4.556,4.486,0.051,11406.000
2,C000041,2011,0,3,0,0,Four Corners-0,COAL,TONS,2828924.0,8801.0,,31.756,1.804,0.000,0.000
3,C000041,2011,0,4,0,0,Four Corners-0,GAS,MCF,171474.0,1080.0,6.726,6.726,6.228,0.018,9685.000
4,C000041,2011,0,5,0,0,Grapeland Peaker-1,GAS,MCF,33762.0,1023.0,4.556,4.556,4.454,0.059,13201.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1691,R001160,2011,0,2,0,0,Williams-0,Gas,MCF,,,,,,,
1692,R001160,2011,0,3,0,0,Williams-0,Oil,Barrels,14116.0,137734.0,129.150,117.054,20.235,,
1693,R001202,2011,0,1,0,0,Beluga-0,GAS,MCF,24636639.0,1000.0,4.810,4.810,4.810,0.052,10799.350
1694,R001202,2011,0,2,0,0,Bernice Lake-0,GAS,MCF,626306.0,1000.0,4.820,4.820,4.820,0.072,14714.656
