# Add Overrides to Train FERC-EIA Connector

The FERC-EIA record linkage process requires training data in order to work properly. Training matches also serve as overrides. This notebook helps you check whether the machine learning algorithm did a good job of matching FERC and EIA records. If you verify a good match (or correct a bad one), this process turns it into training data.

This notebook has two purposes: 

1) [**Output override tools to verify connection between EIA and FERC1**](#verify-tools)
2) [**Upload changes to training data**](#upload-overrides)

## Settings

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pudl_rmi
from pudl_rmi.create_override_spreadsheets import *

import pandas as pd  # used explicitly below (pd.read_excel)
import pudl
import sqlalchemy as sa
import logging
import sys

import warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine,
    freq='AS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)
rmi_out = pudl_rmi.coordinate.Output(pudl_out)

## Specify Utilities & Years

In [3]:
# Old format, kept for reference; the next cell reassigns specified_utilities

specified_utilities = {
    # 'Dominion': {'utility_id_pudl': [292, 293, 349],
    #              'utility_id_eia': [17539, 17554, 19876]},
    # 'Evergy': {'utility_id_pudl': [159, 160, 161, 1270, 13243],
    #            'utility_id_eia': [10000, 10005, 56211, 3702, 55329]}, # pudl/eia 359/22500 --> 13243/55329, 1270/3702 --> BAD
    # 'IDACORP': {'utility_id_pudl': [140],
    #             'utility_id_eia': [9191]},
    # 'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
    #          'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190, 11830],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    # 'NextEra': {'utility_id_pudl': [121, 130],
    #             'utility_id_eia': [6452, 7801]},
    # 'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 361, 7],
    #         'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    # 'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
    #             'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    # 'Xcel': {'utility_id_pudl': [224, 302, 272, 11297],
    #          'utility_id_eia': [13781, 13780, 17718, 15466]}
}

In [3]:
specified_utilities = {
    'BHE': [12341, 14354, 13407, 17166],
    'Southern':[7140, 195, 12686, 17622]
}

specified_years = [
    2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
    2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
] 
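
The two cells above express the same selection in two shapes: a nested form that carries both `utility_id_pudl` and `utility_id_eia`, and a flat form that keeps only the EIA IDs. A minimal sketch of converting one into the other, assuming only the EIA IDs are needed downstream. The helper name `flatten_specified_utilities` is hypothetical and not part of `pudl_rmi`; the year list is built with `range` purely as a shorthand for the literal list above.

```python
def flatten_specified_utilities(nested):
    """Map each utility name to its list of EIA utility IDs."""
    return {name: list(ids["utility_id_eia"]) for name, ids in nested.items()}


# Nested values copied from the older cell above.
old_style = {
    "BHE": {"utility_id_pudl": [185, 246, 204, 287],
            "utility_id_eia": [12341, 14354, 13407, 17166]},
    "Southern": {"utility_id_pudl": [123, 18, 190, 11830],
                 "utility_id_eia": [7140, 195, 12686, 17622]},
}

new_style = flatten_specified_utilities(old_style)

# Same 16 years as the literal list: 2005 through 2020 inclusive.
years = list(range(2005, 2021))
```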

<a id='verify-tools'></a>
## 1) Output Override Tools
Run the following function to create one Excel file per utility, named `<UTILITY>_fix_FERC-EIA_overrides.xlsx`, in the `outputs/overrides` directory. The files cover the utilities and years you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing/verifying the FERC-EIA connections.
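
To see which spreadsheets to expect, the file names can be derived from the utility names. This is a small sketch, assuming the notebook runs from the repository root so that `outputs/overrides` resolves as a relative path; `expected_override_paths` is a hypothetical helper, not a `pudl_rmi` function.

```python
from pathlib import Path


def expected_override_paths(utilities, base_dir="outputs/overrides"):
    """Return the spreadsheet path expected for each utility name."""
    return {
        name: Path(base_dir) / f"{name}_fix_FERC-EIA_overrides.xlsx"
        for name in utilities
    }


paths = expected_override_paths(["BHE", "Southern"])
for name, path in paths.items():
    status = "exists" if path.exists() else "not found yet"
    print(f"{name}: {path} ({status})")
```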

In [4]:
generate_override_tools(pudl_out, rmi_out, specified_utilities, specified_years)

Reading the FERC to EIA connection from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz
Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz
Grabbing depreciation study output from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/deprish.pkl.gz

Developing outputs for BHE
Outputing table subsets to tabs

Developing outputs for Southern
Outputing table subsets to tabs



<a id='upload-overrides'></a>
## 2) Upload changes to training data
When you've finished editing `<UTILITY>_fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override CSV, move your file to the `add_to_training` directory and then run the following function.
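
Moving the file can be done by hand or scripted. A minimal sketch using only the standard library; `stage_for_training` is a hypothetical helper, and refusing to replace an existing file of the same name is a design choice made here, not documented behavior of `pudl_rmi`.

```python
import shutil
from pathlib import Path


def stage_for_training(xlsx_path, training_dir):
    """Move a finished override spreadsheet into the training directory."""
    xlsx_path = Path(xlsx_path)
    training_dir = Path(training_dir)
    if not xlsx_path.is_file():
        raise FileNotFoundError(xlsx_path)
    training_dir.mkdir(parents=True, exist_ok=True)
    destination = training_dir / xlsx_path.name
    if destination.exists():
        # Do not silently replace a spreadsheet that is already staged.
        raise FileExistsError(f"{destination} already exists")
    shutil.move(str(xlsx_path), str(destination))
    return destination
```

For example, `stage_for_training("outputs/overrides/BHE_fix_FERC-EIA_overrides.xlsx", "add_to_training")` would stage the BHE spreadsheet, assuming both paths are relative to the repository root.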

**Note:** If you have changed, or marked TRUE, any records that have already been overridden and included in the training data, set `expect_override_overrides = True`. Otherwise, the function will check whether you have accidentally tampered with values that have already been matched.
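
One way to picture this check is as a set intersection between the record IDs in your spreadsheet and those already in the training data. The sketch below illustrates that idea only; it is not the logic inside `validate_override_fixes`, and the function name, argument shapes, and the choice to key on FERC record IDs are all assumptions.

```python
def check_existing_overrides(new_ferc_ids, trained_ferc_ids,
                             expect_override_overrides=False):
    """Return FERC record IDs that are already in the training data.

    Raises AssertionError if any overlap exists and the caller did not
    signal that re-overriding existing records is expected.
    """
    overlap = sorted(set(new_ferc_ids) & set(trained_ferc_ids))
    if overlap and not expect_override_overrides:
        raise AssertionError(
            f"These records are already in the training data: {overlap}"
        )
    return overlap


# Illustrative inputs; the IDs follow the record_id_ferc1 pattern used in this notebook.
trained = ["f1_gnrt_plant_2005_12_134_0_7"]
new = ["f1_gnrt_plant_2005_12_134_0_7", "f1_gnrt_plant_2015_12_157_0_7"]
overlap = check_existing_overrides(new, trained, expect_override_overrides=True)
```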

Right now, the module points to a COPY of the training data so that it doesn't overwrite the official version. You'll need to change that later if you want to update the official version.

In [3]:
training_data
ferc1_eia = rmi_out.grab_ferc1_to_eia()
ppl = rmi_out.grab_plant_part_list().reset_index()
file_path = "/Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/add_to_training/BHE_fix_FERC-EIA_overrides.xlsx"
bhe_overrides = pd.read_excel(file_path)
utils_df = pudl_out.utils_eia860()

Reading the FERC to EIA connection from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz
Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz


In [4]:
true_connections = bhe_overrides[bhe_overrides["verified"]==1].reset_index(drop=True)
only_overrides = (
    true_connections.dropna(subset=["record_id_eia_override_1"])
    .reset_index()
    .copy()
)

In [5]:
validate_override_fixes(
    bhe_overrides, 
    utils_df, 
    ppl, 
    ferc1_eia, 
    training_data, 
    expect_override_overrides=True
)

Validating overrides
Checking eia record id consistency for values that don't exist
Checking ferc record id consistency for values that don't exist
Checking for duplicate override ids


AssertionError: Found record_id_eia_override_1 duplicates:     ['56379_2006_plant_total_12341' '56379_2007_plant_total_12341'
 '56379_2008_plant_total_12341' '2322_gt4_2012_plant_gen_total_13407']
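
The assertion above fires when the same EIA record ID is used as an override more than once. To list the offending IDs before running the validation, a duplicate count is enough. A minimal standard-library sketch; `find_duplicates` is a hypothetical helper and the sample list is made up for illustration, reusing one ID from the error message.

```python
from collections import Counter


def find_duplicates(values):
    """Return values that appear more than once, skipping None and NaN."""
    # v == v is False only for NaN, so this drops missing entries.
    counts = Counter(v for v in values if v is not None and v == v)
    return sorted(v for v, n in counts.items() if n > 1)


override_ids = [
    "56379_2006_plant_total_12341",
    "56379_2006_plant_total_12341",
    "6482_2005_plant_total_14354",
    None,
]
dupes = find_duplicates(override_ids)
```

On a DataFrame column, `Series.duplicated(keep=False)` applied to the non-null values gives the same rows directly.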

In [189]:
dd = ppl.head(5).copy()
dd[dd["report_date"].duplicated(keep=False)]
bb = only_overrides.head(5).copy()

ss = utils_df.groupby("utility_id_pudl").apply(lambda x: x.utility_id_eia.unique().tolist())

In [190]:
id_dict = dict(zip(ss.index, ss))
bb['utility_id_eia_override'] = bb.record_id_eia_override_1.str.extract(r"(\d+$)")
#bb['utility_id_eia_override'] = bb.utility_id_eia_override.astype("Int64")
bb['test'] = bb.utility_id_pudl.map(id_dict)

In [194]:
bb[~bb.apply(lambda x: x.utility_id_eia_override in id_dict[x.utility_id_pudl], axis=1)]

Unnamed: 0,index,verified,used_match_record,signature_1,signature_2,notes,record_id_eia_override_1,record_id_eia_override_2,record_id_eia_override_3,record_id_eia_override_4,record_id_eia_override_5,record_id_eia_override_6,best_match,record_id_ferc1,record_id_eia,true_gran,report_year,match_type,plant_part,ownership,utility_id_eia,utility_id_pudl,utility_name_ferc1,utility_name_eia,plant_id_pudl,unit_id_pudl,generator_id,plant_name_ferc1,plant_name_eia,fuel_type_code_pudl_ferc1,fuel_type_code_pudl_eia,fuel_type_code_pudl_diff,net_generation_mwh_ferc1,net_generation_mwh_eia,net_generation_mwh_pct_diff,capacity_mw_ferc1,capacity_mw_eia,capacity_mw_pct_diff,capacity_factor_ferc1,capacity_factor_eia,capacity_factor_pct_diff,total_fuel_cost_ferc1,total_fuel_cost_eia,total_fuel_cost_pct_diff,total_mmbtu_ferc1,total_mmbtu_eia,total_mmbtu_pct_diff,fuel_cost_per_mmbtu_ferc1,fuel_cost_per_mmbtu_eia,fuel_cost_per_mmbtu_pct_diff,installation_year_ferc1,installation_year_eia,installation_year_diff,utility_id_eia_override,test
0,12,1.0,0.0,,AS,"found match in ppl, but the utility is listed ...",59472_2015_plant_total_57369,,,,,,,f1_gnrt_plant_2015_12_157_0_7,,,2015.0,,,,,287.0,Sierra Pacific Power Company d/b/a NV Energy,,203.0,,,fort churchill solar array,,,,,9801.0,,,19.5,,,,,,,,,,,,,,,,,,57369,[17166]
1,13,1.0,1.0,,AS,ferc mwh vs kwh issue with net gen,6482_2005_plant_total_14354,,,,,,cap,f1_gnrt_plant_2005_12_134_0_7,6482_2005_plant_total_14354,1.0,2005.0,prediction,plant,total,14354.0,246.0,PacifiCorp,PacifiCorp,7670.0,,1.0,cline falls,Cline Falls,,hydro,,1149000.0,1149.0,99.9,1.0,1.0,0.0,,0.131164,,,,,,,,,,,,,,14354,[14354]
2,14,1.0,1.0,,AS,ferc mwh vs kwh issue with net gen,3035_2005_plant_total_14354,,,,,,cap,f1_gnrt_plant_2005_12_134_0_22,3035_2005_plant_total_14354,1.0,2005.0,prediction,plant,total,14354.0,246.0,PacifiCorp,PacifiCorp,4622.0,,1.0,prospect no. 4 2630,Prospect 4,,hydro,,1777000.0,1777.002,99.9,1.0,1.0,0.0,,0.202854,,,,,,,,,,,,,,14354,[14354]
3,15,1.0,1.0,,AS,ferc mwh vs kwh issue with net gen,3659_2005_plant_total_14354,,,,,,cap,f1_gnrt_plant_2005_12_134_0_25,3659_2005_plant_total_14354,1.0,2005.0,prediction,plant,total,14354.0,246.0,PacifiCorp,PacifiCorp,5272.0,,3.0,stairs 597,Stairs,,hydro,,4877000.0,4876.999,99.9,1.0,1.0,0.0,,0.556735,,,,,,,,,,,,,,14354,[14354]
4,16,1.0,1.0,,AS,ferc mwh vs kwh issue with net gen,3041_2005_plant_total_14354,,,,,,cap,f1_gnrt_plant_2005_12_134_0_29,3041_2005_plant_total_14354,1.0,2005.0,prediction,plant,total,14354.0,246.0,PacifiCorp,PacifiCorp,5419.0,,1.0,wallowa falls 308,Wallowa Falls,,hydro,,7936000.0,7936.0,99.9,1.1,1.1,0.0,,0.823578,,,,,,,,,,,,,,14354,[14354]


In [160]:
bb['utility_id_eia_override'] = bb.utility_id_eia_override.astype("Int64")
bb.utility_id_eia_override

0    57369
1    14354
2    14354
3    14354
4    14354
Name: utility_id_eia_override, dtype: Int64

In [117]:
bb[["utility_id_pudl", "test", "record_id_eia_override_1", "utility_id_eia_override"]]

Unnamed: 0,utility_id_pudl,test,record_id_eia_override_1,utility_id_eia_override
0,287.0,[17166],59472_2015_plant_total_57369,57369
1,246.0,[14354],6482_2005_plant_total_14354,14354
2,246.0,[14354],3035_2005_plant_total_14354,14354
3,246.0,[14354],3659_2005_plant_total_14354,14354
4,246.0,[14354],3041_2005_plant_total_14354,14354
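
The table above pairs each override ID with the EIA utility IDs mapped to the row's `utility_id_pudl`. Two things are worth watching when automating this comparison: `str.extract` yields strings, so a membership test against a list of integers is false even when the numbers agree, and some plant-part IDs in this notebook end in `_retired`, which a `\d+$` pattern does not match. A minimal sketch of a stricter check in plain Python; the helper names are hypothetical, and the ID layout (utility ID as the last numeric field, optionally followed by `_retired`) is an assumption based on the IDs shown in this notebook.

```python
import re

# Utility ID is assumed to be the last numeric field, optionally followed by "_retired".
_UTILITY_SUFFIX = re.compile(r"_(\d+)(?:_retired)?$")


def utility_id_from_record_id(record_id_eia):
    """Extract the trailing EIA utility ID as an int, or None if absent."""
    match = _UTILITY_SUFFIX.search(record_id_eia)
    return int(match.group(1)) if match else None


def override_matches_utility(record_id_eia, utility_id_pudl, id_dict):
    """True if the override's utility ID belongs to the given PUDL utility."""
    utility_id_eia = utility_id_from_record_id(record_id_eia)
    return utility_id_eia in id_dict.get(utility_id_pudl, [])


# Mapping values taken from the outputs shown in this notebook.
id_dict = {287: [17166], 246: [14354], 511: [57369]}

checks = [
    override_matches_utility("59472_2015_plant_total_57369", 287, id_dict),
    override_matches_utility("6482_2005_plant_total_14354", 246, id_dict),
    override_matches_utility(
        "2336_gt1_2013_plant_gen_total_17166_retired", 287, id_dict
    ),
]
print(checks)  # [False, True, True]
```

Only the first row fails: its override points at utility 57369, which is mapped to a different `utility_id_pudl` than the FERC record's 287.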


In [115]:
utils_df[utils_df["utility_id_eia"]==57369]

Unnamed: 0,report_date,utility_id_eia,utility_id_pudl,utility_name_eia,address_2,attention_line,city,contact_firstname,contact_firstname_2,contact_lastname,contact_lastname_2,contact_title,contact_title_2,entity_type,phone_extension,phone_extension_2,phone_number,phone_number_2,plants_reported_asset_manager,plants_reported_operator,plants_reported_other_relationship,plants_reported_owner,state,street_address,zip_code,zip_code_4
93830,2021-01-01,57369,511,"Apple, Inc",,,,,,,,,,,,,,,,,,,,,,
93831,2020-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93832,2019-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93833,2018-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93834,2017-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93835,2016-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93836,2015-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93837,2014-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,,,,True,CA,1 Infinite Loop,95014.0,
93838,2013-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,COM,,,,,True,True,True,True,CA,1 Infinite Loop,95014.0,
93839,2012-01-01,57369,511,"Apple, Inc",,,Cupertino,,,,,,,,,,,,,,,,CA,,95014.0,


In [116]:
id_dict[511]

[57369]

In [84]:
bb.apply(lambda x: x.utility_id_eia_override.isin(id_dict[x.utility_id_pudl]))

AttributeError: 'Series' object has no attribute 'utility_id_eia_override'

In [92]:
bb["record_id_eia_override"].isin([])

AttributeError: 'Series' object has no attribute 'signature_1'

In [9]:
ppl[ppl["record_id_eia"]=='6165_2_2012_plant_unit_owned_14354']

Unnamed: 0,record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_eoy_mw,capacity_factor,capacity_mw,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,operational_status_pudl,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,plant_part_id_eia,record_count,retirement_date,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year
2051506,6165_2_2012_plant_unit_owned_14354,6165,2012-01-01,plant_unit,2,2,ST,BIT,Conventional Steam Coal,Steam,14354,True,plant_unit,6165_2_2012_plant_unit_owned_14354,292.98,0.732672,292.98,0.6,1.848913,18.491759,coal,10.001422,,1885558.0,existing,operating,owned,False,NaT,281,Hunter,Hunter 2,6165_2_plant_unit_owned_14354,3.0,NaT,34867280.0,18858260.0,246,2012,281_2012


In [33]:
ppl[ppl["plant_name_new"].str.lower().str.contains("pinon pine")]

Unnamed: 0,record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_eoy_mw,capacity_factor,capacity_mw,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,operational_status_pudl,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,plant_part_id_eia,record_count,retirement_date,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year
648114,7419_2001_plant_owned_17166,7419,2001-01-01,plant,1,,CC,BIT,,,17166,True,plant,7419_2001_plant_owned_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,owned,True,NaT,611,Pinon Pine,Pinon Pine,7419_plant_owned_17166,1.0,NaT,,,287,2001,611_2001
654554,7419_2001_plant_total_17166,7419,2001-01-01,plant,1,,CC,BIT,,,17166,True,plant,7419_2001_plant_total_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,total,False,NaT,611,Pinon Pine,Pinon Pine,7419_plant_total_17166,1.0,NaT,,,287,2001,611_2001
661515,7419_cc_2001_plant_prime_mover_owned_17166,7419,2001-01-01,plant_prime_mover,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_owned_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,owned,True,NaT,611,Pinon Pine,Pinon Pine CC,7419_CC_plant_prime_mover_owned_17166,1.0,NaT,,,287,2001,611_2001
669147,7419_cc_2001_plant_prime_mover_total_17166,7419,2001-01-01,plant_prime_mover,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_total_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,total,False,NaT,611,Pinon Pine,Pinon Pine CC,7419_CC_plant_prime_mover_total_17166,1.0,NaT,,,287,2001,611_2001
690646,7419_bit_2001_plant_prime_fuel_owned_17166,7419,2001-01-01,plant_prime_fuel,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_owned_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,owned,True,NaT,611,Pinon Pine,Pinon Pine BIT,7419_BIT_plant_prime_fuel_owned_17166,1.0,NaT,,,287,2001,611_2001
698093,7419_bit_2001_plant_prime_fuel_total_17166,7419,2001-01-01,plant_prime_fuel,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_total_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,total,False,NaT,611,Pinon Pine,Pinon Pine BIT,7419_BIT_plant_prime_fuel_total_17166,1.0,NaT,,,287,2001,611_2001
724939,7419_1_2001_plant_gen_owned_17166,7419,2001-01-01,plant_gen,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_owned_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,owned,True,NaT,611,Pinon Pine,Pinon Pine 1,7419_1_plant_gen_owned_17166,1.0,NaT,,,287,2001,611_2001
744653,7419_1_2001_plant_gen_total_17166,7419,2001-01-01,plant_gen,1,,CC,BIT,,,17166,False,plant,7419_2001_plant_total_17166,113.2,,113.2,1.0,,,coal,,1996,,existing,operating,total,False,NaT,611,Pinon Pine,Pinon Pine 1,7419_1_plant_gen_total_17166,1.0,NaT,,,287,2001,611_2001


In [51]:
ppl[(ppl["report_year"]==2013) & (ppl["plant_id_pudl"]==611)][[
    "record_id_eia", "plant_id_eia", "plant_id_pudl", "report_date", 
    "utility_id_eia", "utility_id_pudl", "plant_name_new", "capacity_mw",
    "net_generation_mwh", "ownership_dupe", "true_gran"
]].sort_values("capacity_mw").head(50)

Unnamed: 0,record_id_eia,plant_id_eia,plant_id_pudl,report_date,utility_id_eia,utility_id_pudl,plant_name_new,capacity_mw,net_generation_mwh,ownership_dupe,true_gran
2298200,2336_gt2_2013_plant_gen_owned_17166_retired,2336,611,2013-01-01,17166,287,Tracy GT2,12.5,,True,True
2323717,2336_gt2_2013_plant_gen_total_17166_retired,2336,611,2013-01-01,17166,287,Tracy GT2,12.5,,False,True
2323716,2336_gt1_2013_plant_gen_total_17166_retired,2336,611,2013-01-01,17166,287,Tracy GT1,12.5,,False,True
2298199,2336_gt1_2013_plant_gen_owned_17166_retired,2336,611,2013-01-01,17166,287,Tracy GT1,12.5,,True,True
2220962,2336_gt_2013_plant_prime_mover_total_17166_ret...,2336,611,2013-01-01,17166,287,Tracy GT,25.0,,False,False
2263794,2336_dfo_2013_plant_prime_fuel_total_17166_ret...,2336,611,2013-01-01,17166,287,Tracy DFO,25.0,,False,False
2274161,2336_other_2013_plant_ferc_acct_owned_17166_re...,2336,611,2013-01-01,17166,287,Tracy Other,25.0,,True,False
2284307,2336_other_2013_plant_ferc_acct_total_17166_re...,2336,611,2013-01-01,17166,287,Tracy Other,25.0,,False,False
2209813,2336_gt_2013_plant_prime_mover_owned_17166_ret...,2336,611,2013-01-01,17166,287,Tracy GT,25.0,,True,False
2242756,2336_petroleum_liquids_2013_plant_technology_t...,2336,611,2013-01-01,17166,287,Tracy Petroleum Liquids,25.0,,False,False


In [74]:
utils = pudl_out.utils_eia860()
utils[utils["utility_id_eia"]==56369]

Unnamed: 0,report_date,utility_id_eia,utility_id_pudl,utility_name_eia,address_2,attention_line,city,contact_firstname,contact_firstname_2,contact_lastname,contact_lastname_2,contact_title,contact_title_2,entity_type,phone_extension,phone_extension_2,phone_number,phone_number_2,plants_reported_asset_manager,plants_reported_operator,plants_reported_other_relationship,plants_reported_owner,state,street_address,zip_code,zip_code_4
87405,2021-01-01,56369,3523,Truckee Meadows Water Authority,,,,,,,,,,,,,,,,,,,,,,
87406,2020-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87407,2019-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87408,2018-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87409,2017-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87410,2016-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87411,2015-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87412,2014-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,,,,True,NV,P O Box 30013,89520.0,
87413,2013-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,M,,,,,True,True,True,True,NV,P O Box 30013,89520.0,
87414,2012-01-01,56369,3523,Truckee Meadows Water Authority,,,Reno,,,,,,,,,,,,,,,,NV,,89520.0,


In [40]:
validate_and_add_to_training(connects_ferc1_eia, expect_override_overrides = False)

NameError: name 'connects_ferc1_eia' is not defined