# Manually Overriding FERC-EIA Record Linkage

The FERC-EIA record linkage process requires training data in order to work properly. Training matches also serve as overrides. This notebook helps you check whether the machine learning algorithm did a good job of matching FERC and EIA records. If you confirm a good match (or correct a bad one), this process turns it into training data.

This notebook has three steps:

- [**Step 1: Output Override Tools:**](#verify-tools) Where you create and output the spreadsheets used to conduct the manual overrides.
- [**Step 2: Validate New Training Data:**](#validate) Where you check that the overrides you made are sound.
- [**Step 3: Upload Changes to Training Data:**](#upload-overrides) Where you integrate the overrides into the training data.

## Settings

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pudl_rmi
from pudl_rmi.create_override_spreadsheets import *
                                           
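import pudl
import sqlalchemy as sa
import pandas as pd  # used explicitly in Step 2 (pd.read_csv, pd.read_excel)
import logging
import os  # used explicitly in Step 2 (os.listdir)
import sys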

import warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine,
    freq='AS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)
rmi_out = pudl_rmi.coordinate.Output(pudl_out)

In [None]:
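# Old input format: each utility maps to both its PUDL and EIA IDs.
# The cell in Step 1 redefines specified_utilities with EIA IDs only.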

specified_utilities = {
    # 'Dominion': {'utility_id_pudl': [292, 293, 349],
    #              'utility_id_eia': [17539, 17554, 19876]},
    # 'Evergy': {'utility_id_pudl': [159, 160, 161, 1270, 13243],
    #            'utility_id_eia': [10000, 10005, 56211, 25000]},
    # 'IDACORP': {'utility_id_pudl': [140],
    #             'utility_id_eia': [9191]},
    # 'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
    #          'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190, 11830],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    # 'NextEra': {'utility_id_pudl': [121, 130],
    #             'utility_id_eia': [6452, 7801]},
    # 'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 361, 7],
    #         'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    # 'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
    #             'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    # 'Xcel': {'utility_id_pudl': [224, 302, 272, 11297],
    #          'utility_id_eia': [13781, 13780, 17718, 15466]}
}

<a id='verify-tools'></a>
## Step 1: Output Override Tools

In [80]:
specified_utilities = {
    #'BHE': [12341, 14354, 13407, 17166],
    #'Southern':[7140, 195, 12686, 17622]
    'Dominion': [17539, 17554, 19876, 5248] # 5248...
    #'Entergy': [11241, 814, 12465, 55937, 13478],
    #'Xcel': [13781, 13780, 17718, 15466],
    #'NextEra': [6452, 7801]
    #'IDACORP': [9191]
    #'Evergy': [10000, 10005, 56211, 22500]
}

specified_years = [2020
    # 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
    # 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
]

Run the following function to create Excel files named `<UTILITY>_fix_FERC-EIA_overrides.xlsx` in the `outputs/overrides` directory, based on the utility and year inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing and verifying the FERC-EIA connections.

In [81]:
generate_all_override_spreadsheets(pudl_out, rmi_out, specified_utilities, specified_years)

Generating inputs
Reading the FERC to EIA connection from /Users/austensharpe/Desktop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz
Prepping FERC-EIA table
Reading the EIA plant-parts from /Users/austensharpe/Desktop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz
Prepping Plant Parts Table
Grabbing depreciation study output from /Users/austensharpe/Desktop/Repos/rmi-ferc1-eia/outputs/deprish.pkl.gz
Prepping Deprish Data
Developing outputs for Dominion
Getting utility-year subset for ferc_eia
Getting utility-year subset for ppl
Getting utility-year subset for deprish
Outputing table subsets to tabs
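
To confirm which spreadsheets were written, you can list the output directory. The following is a minimal sketch, not part of `pudl_rmi`: the helper name `find_override_files` is hypothetical, and the temporary directory only stands in for the real `outputs/overrides` folder. It also skips the `~$` lock files that Excel creates while a workbook is open.

```python
import tempfile
from pathlib import Path


def find_override_files(directory):
    """Return override spreadsheets in a directory (hypothetical helper)."""
    pattern = "*_fix_FERC-EIA_overrides.xlsx"
    return sorted(
        path
        for path in Path(directory).glob(pattern)
        # Excel writes "~$<name>.xlsx" lock files while a workbook is open.
        if not path.name.startswith("~$")
    )


# Demonstrate on a throwaway directory instead of the real outputs folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in [
        "Dominion_fix_FERC-EIA_overrides.xlsx",
        "~$Dominion_fix_FERC-EIA_overrides.xlsx",
        "notes.txt",
    ]:
        (tmp_dir / name).touch()
    found_names = [path.name for path in find_override_files(tmp_dir)]

print(found_names)
```

In practice you would pass the path of your own `outputs/overrides` directory instead of the temporary one.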



<a id='validate'></a>
## Step 2: Validate New Training Data

Once you've finished checking the maps, make sure every record you want to validate is set to `verified=TRUE`. Then move the file into the `add_to_training` folder and run the following cells:

In [265]:
# Define function inputs
ferc1_eia_df = rmi_out.ferc1_to_eia()
ppl_df = rmi_out.plant_parts_eia().reset_index()
utils_df = pudl_out.utils_eia860()
training_df = pd.read_csv(pudl_rmi.TRAIN_FERC1_EIA_CSV)
path_to_overrides = pudl_rmi.INPUTS_DIR / "add_to_training" 

override_files = os.listdir(path_to_overrides)
override_files = [file for file in override_files if file.endswith(".xlsx")]

Reading the FERC to EIA connection from /Users/austensharpe/Desktop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz
Reading the EIA plant-parts from /Users/austensharpe/Desktop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz
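
Before validating, it can help to check how many rows in a spreadsheet are actually flagged. This is a minimal sketch under assumptions: it assumes the sheet has a `verified` column, the helper name `count_verified` is hypothetical, and the toy frame stands in for the result of `pd.read_excel(path_to_overrides / file)`.

```python
import pandas as pd


def count_verified(override_df: pd.DataFrame) -> int:
    """Count rows flagged as verified (hypothetical helper)."""
    # Excel values may arrive as booleans or as the text "TRUE"; normalize both.
    flags = override_df["verified"].astype(str).str.strip().str.upper()
    return int((flags == "TRUE").sum())


# Toy stand-in for one override spreadsheet.
example = pd.DataFrame(
    {
        "record_id_ferc1": ["ferc_a", "ferc_b", "ferc_c"],
        "verified": [True, "TRUE", None],
    }
)

n_verified = count_verified(example)
print(f"{n_verified} of {len(example)} rows are marked verified")
```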


In [264]:
ppl_df[
    #(ppl_df["record_id_eia"]=="3283_2006_plant_total_17539")
    (ppl_df["plant_id_eia"]==3281)
    # ppl_df["plant_name_new"].str.contains("Williams")
    & (ppl_df["report_date"].dt.year.isin([2019]))
    #& (ppl_df["utility_id_eia"]==5248)
    #& (ppl_df["capacity_mw"]==613)
    #& (ppl_df["net_generation_mwh"] > 600)
    #& (ppl_df["net_generation_mwh"] < 2000)
    #& (ppl_df["capacity_mw"]> 600)
    #& (ppl_df["capacity_mw"]<100)
    #& (ppl_df["ownership_dupe"]==False)
].sort_values(["report_year", "capacity_mw"])[
    ["true_gran",
     "ownership_dupe",
     "record_id_eia", 
     "plant_id_eia", 
     "utility_id_eia", 
     "report_year", 
     "generator_id", 
     "plant_name_new", 
     "capacity_mw", 
     "net_generation_mwh"]
]

Unnamed: 0,true_gran,ownership_dupe,record_id_eia,plant_id_eia,utility_id_eia,report_year,generator_id,plant_name_new,capacity_mw,net_generation_mwh
158846,True,True,3281_1_2019_plant_gen_owned_17539,3281,17539,2019,1.0,Coit GT 1,19.6,158.0
158847,True,True,3281_2_2019_plant_gen_owned_17539,3281,17539,2019,2.0,Coit GT 2,19.6,158.0
189614,True,False,3281_1_2019_plant_gen_total_17539,3281,17539,2019,1.0,Coit GT 1,19.6,158.0
189615,True,False,3281_2_2019_plant_gen_total_17539,3281,17539,2019,2.0,Coit GT 2,19.6,158.0
2231,True,True,3281_2019_plant_owned_17539,3281,17539,2019,,Coit GT,39.2,316.0
16091,True,False,3281_2019_plant_total_17539,3281,17539,2019,,Coit GT,39.2,316.0
35496,False,True,3281_gt_2019_plant_prime_mover_owned_17539,3281,17539,2019,,Coit GT GT,39.2,316.0
50947,False,False,3281_gt_2019_plant_prime_mover_total_17539,3281,17539,2019,,Coit GT GT,39.2,316.0
66411,False,True,3281_natural_gas_fired_combustion_turbine_2019...,3281,17539,2019,,Coit GT Natural Gas Fired Combustion Turbine,39.2,316.0
81339,False,False,3281_natural_gas_fired_combustion_turbine_2019...,3281,17539,2019,,Coit GT Natural Gas Fired Combustion Turbine,39.2,316.0


In [251]:
ppl_df.columns.tolist()

['record_id_eia',
 'plant_id_eia',
 'report_date',
 'plant_part',
 'generator_id',
 'unit_id_pudl',
 'prime_mover_code',
 'energy_source_code_1',
 'technology_description',
 'ferc_acct_name',
 'utility_id_eia',
 'true_gran',
 'appro_part_label',
 'appro_record_id_eia',
 'capacity_eoy_mw',
 'capacity_factor',
 'capacity_mw',
 'fraction_owned',
 'fuel_cost_per_mmbtu',
 'fuel_cost_per_mwh',
 'fuel_type_code_pudl',
 'heat_rate_mmbtu_mwh',
 'installation_year',
 'net_generation_mwh',
 'operational_status',
 'operational_status_pudl',
 'ownership',
 'ownership_dupe',
 'planned_retirement_date',
 'plant_id_pudl',
 'plant_name_eia',
 'plant_name_new',
 'plant_part_id_eia',
 'record_count',
 'retirement_date',
 'total_fuel_cost',
 'total_mmbtu',
 'utility_id_pudl',
 'report_year',
 'plant_id_report_year']

In [230]:
small = pudl_out.plants_small_ferc1()
small[small["record_id"]=='f1_gnrt_plant_2019_12_186_0_21'].columns

Index(['report_year', 'utility_id_ferc1', 'utility_id_pudl', 'utility_name_ferc1', 'plant_id_pudl', 'plant_name_ferc1', 'record_id', 'capacity_mw', 'capex_per_mw', 'construction_year', 'ferc_license_id', 'fuel_cost_per_mmbtu', 'fuel_type', 'net_generation_mwh', 'opex_fuel', 'opex_maintenance', 'opex_operations', 'opex_total', 'opex_total_nonfuel', 'peak_demand_mw', 'plant_name_clean', 'plant_type', 'total_cost_of_plant'], dtype='object')

In [203]:
utils = pudl_out.utils_eia860()
utils[utils["utility_id_eia"]==60968]
#utils[utils["utility_id_pudl"]==349]

Unnamed: 0,report_date,utility_id_eia,utility_id_pudl,utility_name_eia,address_2,attention_line,city,contact_firstname,contact_firstname_2,contact_lastname,contact_lastname_2,contact_title,contact_title_2,entity_type,phone_extension,phone_extension_2,phone_number,phone_number_2,plants_reported_asset_manager,plants_reported_operator,plants_reported_other_relationship,plants_reported_owner,state,street_address,zip_code,zip_code_4
106633,2021-01-01,60968,6166,Delphinus Community Solar,,,,,,,,,,,,,,,,,,,,,,
106634,2020-01-01,60968,6166,Delphinus Community Solar,,,Scottsdale,,,,,,,Q,,,,,,True,,True,AZ,"17200 N. Perimeter Drive, Suit",85004.0,
106635,2019-01-01,60968,6166,Delphinus Community Solar,,,Sauk Rapids,,,,,,,Q,,,,,,True,,True,MN,3629 Golden Spike Rd NE,56379.0,
106636,2018-01-01,60968,6166,Delphinus Community Solar,,,Sauk Rapids,,,,,,,Q,,,,,,True,,True,MN,3629 Golden Spike Rd NE,56379.0,
106637,2017-01-01,60968,6166,Delphinus Community Solar,,,Sauk Rapids,,,,,,,Q,,,,,,True,,True,MN,3629 Golden Spike Rd NE,56379.0,
106638,2016-01-01,60968,6166,Delphinus Community Solar,,,Sauk Rapids,,,,,,,Q,,,,,,True,,True,MN,3629 Golden Spike Rd NE,56379.0,


In [102]:
plants = pudl_out.plants_eia860()
plants[plants["utility_id_pudl"]==349].plant_name_eia.unique()

<StringArray>
[                                             'Gaston',                                          'Kitty Hawk',                                      'Roanoke Rapids',                                         'Bremo Bluff',                                        'Chesterfield',                                              'Cushaw',                                            'Low Moor',                                       'Northern Neck',                                          'Chesapeake',                                        'Possum Point',                                               'Surry',                                            'Yorktown',                                            'Mt Storm',                                         'Bath County',                                          'North Anna',                                         'Gravel Neck',                                           'Darbytown',                                              'Clov

In [87]:
steam = pudl_out.plants_steam_ferc1()
steam[steam["plant_name_ferc1"].str.contains("utenberg")]
test = steam[(steam["report_year"]==2020) & (steam["utility_id_pudl"]==349)]
test

Unnamed: 0,report_year,utility_id_ferc1,utility_id_pudl,utility_name_ferc1,plant_id_pudl,plant_id_ferc1,plant_name_ferc1,asset_retirement_cost,avg_num_employees,capacity_factor,capacity_mw,capex_equipment,capex_land,capex_per_mw,capex_structures,capex_total,construction_type,construction_year,installation_year,net_generation_mwh,not_water_limited_capacity_mw,opex_allowances,opex_boiler,opex_coolants,opex_electric,opex_engineering,opex_fuel,opex_fuel_per_mwh,opex_misc_power,opex_misc_steam,opex_nonfuel_per_mwh,opex_operations,opex_per_mwh,opex_plants,opex_production_total,opex_rents,opex_steam,opex_steam_other,opex_structures,opex_total_nonfuel,opex_transfer,peak_demand_mw,plant_capability_mw,plant_hours_connected_while_generating,plant_type,record_id,water_limited_capacity_mw
27483,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,15,1416,altavista,2348721.0,21.0,0.586329,58.0,65674610.0,850256.0,1384607.2,11433626.0,80307220.0,conventional,1992.0,1992.0,297902.0,51.0,,3170705.0,,89944.0,70928.0,12471528.0,41.864534,1056670.0,1406720.0,56.937642,684834.0,98.8,514244.0,29433365.0,237084.0,9452780.0,,277928.0,16961837.0,,54.0,51.0,6262.0,steam,f1_steam_2020_12_186_0_1,51.0
27484,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,111,696,chesterfield,1429778000.0,180.0,0.144936,1303.0,1462173000.0,3995551.0,2321308.3,128718931.0,3024665000.0,semioutdoor,1952.0,1969.0,1654341.0,1302.0,,153584000.0,,1015661.0,396705.0,73289296.0,44.301202,17268355.0,5777356.0,49.904259,3840828.0,94.2,1403639.0,155847959.0,310481.0,-102168582.0,,1130220.0,82558663.0,,1000.0,1267.0,3260.0,steam,f1_steam_2020_12_186_0_5,1267.0
27485,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,118,6242,clover,6631512.0,82.0,0.305559,463.5,480718300.0,961205.0,1290393.6,109786404.0,598097400.0,conventional,1995.0,1996.0,1240647.0,881.0,,5563492.0,,980435.0,-122793.0,21526517.0,17.351035,626788.0,129892.0,23.351865,3343320.0,40.7,1900541.0,50497948.0,104874.0,15802949.0,,641933.0,28971431.0,,874.0,877.0,3024.0,steam,f1_steam_2020_12_186_1_1,877.0
27486,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,148,1379,darbytown,217744.0,4.0,0.032283,374.0,85584170.0,2819440.0,252966.4,5988077.0,94609440.0,,1990.0,1990.0,105767.1,387.0,,,,,,2577633.0,24.370845,171656.0,191890.0,24.426089,22584.0,48.8,2013915.0,5161109.0,,121309.0,,62122.0,2583476.0,,285.0,336.0,1054.0,internal_combustion,f1_steam_2020_12_186_1_2,336.0
27487,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,230,1464,gordonsville,273193.0,20.0,0.460347,291.0,56961340.0,,211483.1,4307033.0,61541570.0,,1994.0,1994.0,1173497.0,268.0,,,,,,25479328.0,21.71231,3321434.0,1125579.0,9.206774,,30.9,5122363.0,36283449.0,546067.0,616691.0,,71987.0,10804121.0,,259.0,218.0,6209.0,combined_cycle,f1_steam_2020_12_186_1_3,218.0
27488,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,235,1378,gravel neck,304959.0,3.0,0.040571,415.0,99175000.0,,247987.0,3434631.0,102914600.0,,1970.0,1989.0,147490.6,428.0,,,,,,4355197.0,29.528646,159116.0,210353.0,8.964017,31202.0,38.5,636847.0,5677305.0,,135267.0,,149323.0,1322108.0,,284.0,368.0,1312.0,internal_combustion,f1_steam_2020_12_186_1_4,368.0
27489,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,312,1419,ladysmith,108498.0,11.0,0.165421,796.0,286080800.0,1016458.0,399159.3,30525062.0,317730800.0,,2001.0,2009.0,1153471.0,915.0,,,,,513937.0,22994748.0,19.935266,408874.0,119190.0,4.582886,729149.0,24.5,2973401.0,28280973.0,125472.0,277884.0,,138318.0,5286225.0,,884.0,783.0,3782.0,internal_combustion,f1_steam_2020_12_186_1_5,783.0
27490,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,426,1264,mount storm,25376310.0,204.0,0.324391,1660.0,1604650000.0,754476.0,1087936.9,175194874.0,1805975000.0,conventional,1965.0,1973.0,4717164.0,1676.0,,35373527.0,,2502931.0,3285048.0,153042232.0,32.443695,10297316.0,10262234.0,34.646447,13326693.0,67.1,2989834.0,316475206.0,763364.0,81390323.0,,3241704.0,163432974.0,,1607.0,1629.0,7886.0,steam,f1_steam_2020_12_186_2_2,1629.0
27491,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,421,1046,north anna,-89496260.0,1094.0,0.95303,1673.0,2158620000.0,32180168.0,1477871.6,371175217.0,2472479000.0,,1978.0,1980.0,13967110.0,1673.0,,9171979.0,1397237.0,2307016.0,10167683.0,84190051.0,6.027734,46041090.0,7886409.0,10.908442,50485206.0,16.9,11304599.0,236549502.0,3158281.0,9365607.0,,1074344.0,152359451.0,,1962.0,1673.0,8784.0,nuclear,f1_steam_2020_12_186_2_3,1673.0
27492,2020,186,349,VIRGINIA ELECTRIC AND POWER COMPANY,490,700,possum point,1301588.0,56.0,0.001652,1213.0,18058470.0,321513.0,19300.4,3729784.0,23411360.0,outdoor,1948.0,1975.0,17558.25,1130.0,,-4029712.0,,69733.0,236143.0,1737719.0,98.968781,14832686.0,1063950.0,885.220307,349221.0,984.2,200895.0,17280642.0,2743.0,851692.0,,1965572.0,15542923.0,,615.0,1102.0,58.0,steam,f1_steam_2020_12_186_2_5,1102.0


In [267]:
for file in override_files:
    # Skip the "~$" lock files Excel creates while a workbook is open
    if not file.startswith("~$"):
        print(f"VALIDATING {file} ************** ")
        file_df = pd.read_excel(path_to_overrides / file)

        # Raises an AssertionError if the overrides fail a consistency check
        validate_override_fixes(
            validated_connections=file_df,
            utils_eia860=utils_df,
            ppl=ppl_df,
            ferc1_eia=ferc1_eia_df,
            training_data=training_df,
            expect_override_overrides=True,
        )
    print(" ")

VALIDATING Dominion_fix_FERC-EIA_overrides.xlsx ************** 
Validating overrides
Checking record_id_eia_override_1 consistency for values that don't exist
Checking record_id_ferc1 consistency for values that don't exist
Checking for duplicate override ids
Checking for mismatched utility ids


AssertionError: Found mismatched utilities:       self  other
0    349.0   6484
1    349.0   6484
2    349.0   1498
3    349.0   1498
12   349.0   1498
13   349.0   1498
31   292.0  13492
32   292.0  13492
201  292.0    293
202  292.0    293
203  292.0    293
205  292.0    293
206  292.0    293
207  292.0    293
209  292.0    293
210  292.0    293
211  292.0    293
213  292.0    293
214  292.0    293
215  292.0    293
217  292.0    293
218  292.0    293
219  292.0    293
221  292.0    293
222  292.0    293
223  292.0    293
225  292.0    293
226  292.0    293
227  292.0    293
229  292.0    293
230  292.0    293
231  292.0    293
233  292.0    293
234  292.0    293
235  292.0    293
237  292.0    293
238  292.0    293
239  292.0    293
241  292.0    293
242  292.0    293
243  292.0    293
245  292.0    293
246  292.0    293
247  292.0    293
249  292.0    293
250  292.0    293
251  292.0    293
253  292.0    293
254  292.0    293
255  292.0    293
257  292.0    293
258  292.0    293
259  292.0    293
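
The `self` and `other` columns in the error above look like the output of pandas' `compare` method, which keeps only the rows where two aligned objects differ. That is an assumption about how the check is implemented, not something the log confirms. A minimal sketch with made-up ID values shows how such a table arises and how to read it:

```python
import pandas as pd

# Hypothetical utility IDs for the same four override rows:
# one series from the FERC record, one from the chosen EIA record.
ferc_utils = pd.Series([349, 349, 292, 292], name="utility_id_pudl")
eia_utils = pd.Series([349, 6484, 292, 293], name="utility_id_pudl")

# compare() drops rows that match and labels the two sides "self" and "other".
mismatches = ferc_utils.compare(eia_utils)
print(mismatches)
```

Under this reading, each listed row is an override whose FERC record and EIA record belong to different utilities, and the index points back to the row to review in the spreadsheet.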

<a id='upload-overrides'></a>
## Step 3: Upload Changes to Training Data

When you've finished editing the `<UTILITY>_fix_FERC-EIA_overrides.xlsx` spreadsheet and want to add your changes to the official override CSV, move your file to the `add_to_training` directory and then run the following function.

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, set `expect_override_overrides=True`. Otherwise, the function will check whether you have accidentally tampered with values that have already been matched.

Right now, the module points to a COPY of the training data so that it doesn't overwrite the official version. You'll need to change that later if you want to update the official version.
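
To see ahead of time whether your spreadsheet touches records that are already in the training data, you can compare the FERC record IDs directly. This is a minimal sketch with toy data, not the check `pudl_rmi` performs: only the column names `record_id_ferc1` and `record_id_eia_override_1` come from the validation log, and the ID values are placeholders.

```python
import pandas as pd

# Toy training data and toy overrides with placeholder IDs.
training = pd.DataFrame(
    {
        "record_id_ferc1": ["ferc_a", "ferc_b"],
        "record_id_eia": ["eia_1", "eia_2"],
    }
)
overrides = pd.DataFrame(
    {
        "record_id_ferc1": ["ferc_b", "ferc_c"],
        "record_id_eia_override_1": ["eia_9", "eia_3"],
    }
)

# Overrides whose FERC record already has a match in the training data.
already_trained = overrides[
    overrides["record_id_ferc1"].isin(training["record_id_ferc1"])
]

expect_override_overrides = False
if not expect_override_overrides and not already_trained.empty:
    print("These overrides would replace existing training matches:")
    print(already_trained)
```

If the rows listed are changes you intend to make, that is the case where setting the flag to `True` is appropriate.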

In [None]:
validate_and_add_to_training(
    pudl_out, rmi_out, expect_override_overrides=True
)

In [None]:
rmi_out.ferc1_to_eia(clobber=True)