# Description
This notebook demonstrates a generic process for transforming XBRL-based FERC Form 1 filings so that they are compatible with the historical data. See `README` and `extract_fuel` for setup notes. This notebook requires the raw `ferc1` database produced by `ferc1_to_sqlite` in `PUDL`.



In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, MetaData

The old and new databases use different table names, so a mapping between them is needed. This notebook uses the `f1_steam` table for demonstration, but it is meant to be as generic as possible.

In [2]:
# Example of what table map might look like
table_map = {
    "f1_steam": "402 - Schedule - Steam-Electric Generating Plant Statistics (Large Plants)",
}

# Open old db and db with extracted xbrl data
xbrl_db = create_engine("sqlite:///ferc1.sqlite")
old_db = create_engine("sqlite:///dbf_ferc1.sqlite")

Get the column names of the table in both the old and new databases to prepare for mapping between them.

In [3]:
# Get the column names of a table, given a SQLAlchemy engine and the table name
def extract_table_cols(engine, table_name):
    # Pass the engine to reflect(); MetaData(bind=...) was removed in SQLAlchemy 2.0
    metadata = MetaData()
    metadata.reflect(bind=engine)
    return list(metadata.tables[table_name].columns.keys())

# Read in XBRL table
xbrl_steam = pd.read_sql(
    table_map["f1_steam"],
    xbrl_db,
    parse_dates=["start_date", "end_date"]
).drop("index", axis=1)

old_cols = extract_table_cols(old_db, "f1_steam")
new_cols = list(xbrl_steam.columns)

Remove columns with the `Axis` suffix. XBRL uses dimensions to indicate the context of facts, and those dimensions are added as columns to the extracted table with the suffix `Axis`. In many cases such a column essentially duplicates another column of the same name without the suffix, and one of the pair needs to be removed because columns are matched based on order.

In [4]:
def strip_axes(cols):
    update_cols = []
    for col in cols:
        # Slice off the literal suffix; rstrip("Axis") strips a set of characters, not a suffix
        if col.endswith("Axis") and col[:-len("Axis")] in cols:
            continue
        update_cols.append(col)
    return update_cols

new_cols = strip_axes(new_cols)
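
The suffix handling is easy to get subtly wrong, so here is a minimal, self-contained sketch of the same idea on made-up column names. The function name `strip_duplicate_axes` and the column names are hypothetical, chosen only for illustration; the point is that `str.rstrip("Axis")` removes any trailing run of the characters `A`, `x`, `i`, `s` rather than the literal suffix.

```python
def strip_duplicate_axes(cols):
    """Drop 'FooAxis' columns when a plain 'Foo' column is also present."""
    suffix = "Axis"
    present = set(cols)
    kept = []
    for col in cols:
        # Remove the literal suffix by slicing instead of using rstrip
        if col.endswith(suffix) and col[: -len(suffix)] in present:
            continue
        kept.append(col)
    return kept


# Hypothetical column names, for illustration only
example_cols = ["entity_id", "PlantName", "PlantNameAxis", "Status", "StatusAxis", "OtherAxis"]

print(strip_duplicate_axes(example_cols))
# ['entity_id', 'PlantName', 'Status', 'OtherAxis']

# rstrip treats its argument as a character set, so it eats the final "s" of "Status"
print("StatusAxis".rstrip("Axis"))
# Statu
```

`OtherAxis` is kept because there is no plain `Other` column for it to duplicate.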

Attempt to generate a column map based on the column order. This skips the generic columns included in every table, as well as footnote columns.

In [10]:
def create_col_map(old_cols, new_cols):
    default_cols_old = ["respondent_id", "report_year", "spplmnt_num", "row_number", "row_seq", "row_prvlg", "report_prd"]
    old_cols = [col for col in old_cols if col not in default_cols_old and not col.endswith("_f")]
    
    default_cols_new = ["entity_id", "start_date", "end_date", "instant", "OrderNumber"]
    new_cols = [col for col in new_cols if col not in default_cols_new]

    if len(new_cols) != len(old_cols):
        raise ValueError(
            f"Can't generate column map: {len(old_cols)} old columns vs {len(new_cols)} new columns"
        )

    return {old_col: new_col for old_col, new_col in zip(old_cols, new_cols)}

col_map = create_col_map(old_cols, new_cols)
for key, val in col_map.items():
    print(f"{key}: {val}")

plant_name: PlantName
plant_kind: PlantKind
type_const: PlantConstructionType
yr_const: YearPlantOriginallyConstructed
yr_installed: YearLastUnitOfPlantInstalled
tot_capacity: InstalledCapacityOfPlant
peak_demand: NetPeakDemandOnPlant
plant_hours: PlantHoursConnectedToLoad
plnt_capability: NetContinuousPlantCapability
when_not_limited: NetContinuousPlantCapabilityNotLimitedByCondenserWater
when_limited: NetContinuousPlantCapabilityLimitedByCondenserWater
avg_num_of_emp: PlantAverageNumberOfEmployees
net_generation: NetGenerationExcludingPlantUse
cost_land: CostOfLandAndLandRightsSteamProduction
cost_structure: CostOfStructuresAndImprovementsSteamProduction
cost_equipment: CostOfEquipmentSteamProduction
cost_of_plant_to: AssetRetirementCostsSteamProduction
cost_per_kw: CostOfPlant
expns_operations: CostPerKilowattOfInstalledCapacity
expns_fuel: OperationSupervisionAndEngineeringExpense
expns_coolants: FuelSteamPowerGeneration
expns_steam: CoolantsAndWater
expns_steam_othr: SteamExpenses

From inspection, this appears to have done a very good job, with one important exception. The column `asset_retire_cost` is at the end of the old table, but `AssetRetirementCostsSteamProduction` (which is clearly the equivalent column in the new table) is in the middle. This leaves every column from `cost_of_plant_to` onward off by one.
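
To make the off-by-one effect concrete, the following sketch pairs two short, made-up column lists by position. The shortened column names and the helper `move_to_end` are hypothetical and are not part of the notebook's pipeline; they only illustrate how a single displaced column shifts every later pair, and how reordering before zipping restores the alignment.

```python
# Hypothetical, shortened column lists for illustration only
old_example = ["cost_equipment", "cost_of_plant_to", "cost_per_kw", "asset_retire_cost"]
new_example = ["CostOfEquipment", "AssetRetirementCosts", "CostOfPlant", "CostPerKilowatt"]

# Pairing purely by position: everything after the displaced column is shifted
positional = dict(zip(old_example, new_example))
for old, new in positional.items():
    print(f"{old}: {new}")
# cost_equipment: CostOfEquipment
# cost_of_plant_to: AssetRetirementCosts
# cost_per_kw: CostOfPlant
# asset_retire_cost: CostPerKilowatt


def move_to_end(cols, name):
    """Return a copy of cols with the given column moved to the last position."""
    return [col for col in cols if col != name] + [name]


# Reordering the new columns first makes the positional pairing line up again
fixed = dict(zip(old_example, move_to_end(new_example, "AssetRetirementCosts")))
for old, new in fixed.items():
    print(f"{old}: {new}")
# cost_equipment: CostOfEquipment
# cost_of_plant_to: CostOfPlant
# cost_per_kw: CostPerKilowatt
# asset_retire_cost: AssetRetirementCosts
```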

Perhaps a better approach would be to use a string similarity metric to match column names. To do this, I've used [Jaro-Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) similarity as the metric and picked the one-to-one mapping that maximizes the total similarity (equivalently, minimizes the total string distance) across all columns.

In [6]:
import jellyfish
import stringcase
from scipy.optimize import linear_sum_assignment

# Mapping func using string similarity
def create_col_map_similarity(old_cols, new_cols):
    default_cols_old = ["respondent_id", "report_year", "spplmnt_num", "row_number", "row_seq", "row_prvlg", "report_prd"]
    old_cols = [col for col in old_cols if col not in default_cols_old and not col.endswith("_f")]
    
    default_cols_new = ["entity_id", "start_date", "end_date", "instant", "OrderNumber"]
    new_cols = [col for col in new_cols if col not in default_cols_new]
    
    similarity_matrix = np.zeros((len(old_cols), len(new_cols)))
    for i, old_col in enumerate(old_cols):
        for j, new_col in enumerate(new_cols):
            # Convert the new column name from CamelCase to snake_case, then score it against the old name
            # (jaro_winkler_similarity replaces the jaro_winkler function removed from newer jellyfish releases)
            similarity_matrix[i, j] = jellyfish.jaro_winkler_similarity(old_col, stringcase.snakecase(new_col))
    
    # Find the one-to-one mapping that maximizes total similarity (an instance of the assignment problem)
    row_ind, col_ind = linear_sum_assignment(similarity_matrix, maximize=True)
    
    return {old_cols[old_ind]: new_cols[new_ind] for old_ind, new_ind in zip(row_ind, col_ind)}

col_map_similarity = create_col_map_similarity(old_cols, new_cols)
for key, val in col_map_similarity.items():
    print(f"{key}: {val}")

plant_name: PlantName
plant_kind: PlantKind
type_const: NetContinuousPlantCapability
yr_const: MaintenanceOfStructuresSteamPowerGeneration
yr_installed: YearLastUnitOfPlantInstalled
tot_capacity: PlantConstructionType
peak_demand: NetPeakDemandOnPlant
plant_hours: PlantHoursConnectedToLoad
plnt_capability: InstalledCapacityOfPlant
when_not_limited: NetContinuousPlantCapabilityNotLimitedByCondenserWater
when_limited: NetContinuousPlantCapabilityLimitedByCondenserWater
avg_num_of_emp: PlantAverageNumberOfEmployees
net_generation: NetGenerationExcludingPlantUse
cost_land: CostOfLandAndLandRightsSteamProduction
cost_structure: CostOfStructuresAndImprovementsSteamProduction
cost_equipment: CostOfEquipmentSteamProduction
cost_of_plant_to: CostOfPlant
cost_per_kw: CostPerKilowattOfInstalledCapacity
expns_operations: PowerProductionExpensesSteamPower
expns_fuel: FuelSteamPowerGeneration
expns_coolants: MaintenanceOfMiscellaneousSteamPlant
expns_steam: SteamExpensesSteamPowerGeneration
expns_st

This approach does properly match `asset_retire_cost`, but it also mismatches many other columns. A more sophisticated approach might handle these cases, for example one that considers both order and string similarity, or a similarity metric that better accounts for abbreviations and for words appearing in a different order. It may also be easier to just fix these errors manually.
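
As a rough illustration of what combining order and string similarity could look like, here is a minimal sketch that uses only the standard library. It is not the method used in this notebook: `difflib.SequenceMatcher` stands in for Jaro-Winkler, a brute-force search over permutations stands in for `linear_sum_assignment` (so it is only practical for a handful of columns), and the function `blended_col_map`, its `order_weight` parameter, and the shortened column names are all hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations


def to_snake(name):
    """Convert CamelCase to snake_case by inserting '_' before inner capitals."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def blended_col_map(old_cols, new_cols, order_weight=0.5):
    """Pick the one-to-one mapping maximizing similarity minus a position penalty."""
    n = len(old_cols)
    if n != len(new_cols):
        raise ValueError("column lists must have the same length")

    def score(i, j):
        # Stand-in similarity in [0, 1]; the notebook itself uses Jaro-Winkler
        similarity = SequenceMatcher(None, old_cols[i], to_snake(new_cols[j])).ratio()
        # Penalize pairs whose positions are far apart
        order_penalty = abs(i - j) / n
        return (1 - order_weight) * similarity - order_weight * order_penalty

    # Brute force over all permutations; replace with an assignment solver for real tables
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(score(i, j) for i, j in enumerate(perm)),
    )
    return {old_cols[i]: new_cols[j] for i, j in enumerate(best)}


# Hypothetical, shortened column lists for illustration only
old_demo = ["plant_name", "cost_of_plant_to", "cost_per_kw", "asset_retire_cost"]
new_demo = ["PlantName", "AssetRetirementCosts", "CostOfPlant", "CostPerKilowatt"]

for weight in (0.0, 0.5, 1.0):
    print(weight, blended_col_map(old_demo, new_demo, order_weight=weight))
```

With `order_weight=1.0` the score ignores names entirely and the result is the purely positional mapping; with `order_weight=0.0` it ignores position and relies only on the similarity score. Values in between trade the two off, and a suitable weight would have to be found by checking the output against known-good mappings.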

For now I'll manually move `AssetRetirementCostsSteamProduction` to the correct place and shift the other column names that are off by one.

In [7]:
keys = list(col_map.keys())[16:-1]
previous_val = col_map["asset_retire_cost"]

for key in reversed(keys):
    tmp = col_map[key]
    col_map[key] = previous_val
    previous_val = tmp
    
col_map["asset_retire_cost"] = "AssetRetirementCostsSteamProduction"
print(col_map)

{'plant_name': 'PlantName', 'plant_kind': 'PlantKind', 'type_const': 'PlantConstructionType', 'yr_const': 'YearPlantOriginallyConstructed', 'yr_installed': 'YearLastUnitOfPlantInstalled', 'tot_capacity': 'InstalledCapacityOfPlant', 'peak_demand': 'NetPeakDemandOnPlant', 'plant_hours': 'PlantHoursConnectedToLoad', 'plnt_capability': 'NetContinuousPlantCapability', 'when_not_limited': 'NetContinuousPlantCapabilityNotLimitedByCondenserWater', 'when_limited': 'NetContinuousPlantCapabilityLimitedByCondenserWater', 'avg_num_of_emp': 'PlantAverageNumberOfEmployees', 'net_generation': 'NetGenerationExcludingPlantUse', 'cost_land': 'CostOfLandAndLandRightsSteamProduction', 'cost_structure': 'CostOfStructuresAndImprovementsSteamProduction', 'cost_equipment': 'CostOfEquipmentSteamProduction', 'cost_of_plant_to': 'CostOfPlant', 'cost_per_kw': 'CostPerKilowattOfInstalledCapacity', 'expns_operations': 'OperationSupervisionAndEngineeringExpense', 'expns_fuel': 'FuelSteamPowerGeneration', 'expns_coola

Using this generated column map, assemble the new table from the XBRL data. The first step is to handle durations. XBRL uses durations to identify facts: some facts are reported at a single instant, while others cover an actual date range. The historical data has no concept of durations and only provides the report year. The first function below matches rows that have an instant duration with rows that have a compatible date range. According to the [FERC taxonomy guide](https://www.ferc.gov/sites/default/files/2020-05/FERC_Taxonomy_Guide.pdf), the instant date should line up with the end date.

In [8]:
def process_durations(table):
    # Separate facts with an instant duration from facts with a date-range duration
    duration_facts = table.loc[table["instant"] == False, :].drop("instant", axis=1)
    instant_facts = table.loc[table["instant"] == True, :].drop("instant", axis=1)
    
    # Keep only columns that have data (this function assumes the two sets of data columns do not overlap)
    duration_facts = duration_facts.loc[:, ~duration_facts.isnull().all(axis=0)]
    instant_facts = instant_facts.loc[:, ~instant_facts.isnull().all(axis=0)]
    
    indices = ["entity_id", "end_date"] + [col for col in instant_facts.columns if col.endswith("Axis")]
    
    return duration_facts.join(instant_facts.set_index(indices), on=indices)

# Use the generated column map to assemble the table
def create_df(table, col_map):
    df_dict = {
        'respondent_id': table['entity_id'],
        'report_year': table['start_date'].dt.year,
        'supplmnt_num': np.zeros(len(table), dtype=int),
        'row_number': np.zeros(len(table), dtype=int),
        'row_seq': np.zeros(len(table), dtype=int),
        'row_prvlg': np.zeros(len(table), dtype=int),
    }
    
    df_dict.update({old_col: table[new_col] for old_col, new_col in col_map.items()})
    return pd.DataFrame(df_dict)
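
The matching rule inside `process_durations` can be illustrated without a database. The sketch below uses plain dictionaries as a stand-in for the DataFrame join, and it assumes that instant rows carry their date in `end_date`, as the join keys above imply. The respondent ID, the values, and the helper `join_instants` are made up for illustration.

```python
from datetime import date

# Hypothetical facts for one respondent, for illustration only
duration_rows = [
    {
        "entity_id": "C000001",
        "start_date": date(2011, 1, 1),
        "end_date": date(2011, 12, 31),
        "NetGenerationExcludingPlantUse": 1200,
    },
]
instant_rows = [
    {"entity_id": "C000001", "end_date": date(2011, 12, 31), "CostOfPlant": 5000},
    {"entity_id": "C000001", "end_date": date(2010, 12, 31), "CostOfPlant": 4800},
]


def join_instants(duration_rows, instant_rows, keys=("entity_id", "end_date")):
    """Left-join instant facts onto duration facts using the given key columns."""
    lookup = {tuple(row[k] for k in keys): row for row in instant_rows}
    joined = []
    for row in duration_rows:
        # Duration rows without a matching instant row are kept unchanged
        match = lookup.get(tuple(row[k] for k in keys), {})
        joined.append({**match, **row})
    return joined


rows = join_instants(duration_rows, instant_rows)
print(rows[0]["start_date"].year, rows[0]["CostOfPlant"])
# 2011 5000
```

Only the instant fact dated 2011-12-31 is attached to the 2011 duration row; the 2010 instant fact has no matching date range in this toy input and is left out, which mirrors the left join in `process_durations`.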

In [9]:
xbrl_steam = process_durations(xbrl_steam)
xbrl_steam = create_df(xbrl_steam, col_map)
xbrl_steam = xbrl_steam[xbrl_steam["report_year"] == 2011]
xbrl_steam

Unnamed: 0,respondent_id,report_year,supplmnt_num,row_number,row_seq,row_prvlg,plant_name,plant_kind,type_const,yr_const,...,expns_rents,expns_allowances,expns_engnr,expns_structures,expns_boiler,expns_plants,expns_misc_steam,tot_prdctn_expns,expns_kwh,asset_retire_cost
0,C000134,2011,0,0,0,0,,,,,...,,,,,,,,260585981,,
1,C004924,2011,0,0,0,0,Joppa Gas Turbines,Gas Turbine,Outdoor,2000,...,38736,,144123,15378,,64162,,475593,0.2511,
3,C001143,2011,0,0,0,0,,,,,...,,,,,,,,312177112,,
4,C000772,2011,0,0,0,0,,,,,...,9748885,215,5236353,2830890,21289438,6343376,2812036,353842304,,
6,C000772,2011,0,0,0,0,Tecumseh,Steam,Full Outdoor,1925,...,,,708132,211441,3660707,1005591,565449,33871573,0.0303,1015801
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2337,C000041,2011,0,0,0,0,Barre Peaker,Gas Turbine,Outdoor,2007,...,70904,,250960,8112,,375016,259921,2600861,0.4201,
2339,C000041,2011,0,0,0,0,San Onofre Unit 1,Nuclear,Semi-Outdoor,1967,...,,,,,,,,3948789,,
2340,C000041,2011,0,0,0,0,Center Peaker,Gas Turbine,Outdoor,2007,...,92516,,255063,12451,,306394,177089,2546226,0.3777,
2342,C000041,2011,0,0,0,0,San Onofre Unit 2&3,Nuclear,Semi-Outdoor,1983,...,1623600,,20568115,7349250,18677385,15398359,32111535,395408879,0.0279,
