# Add Overrides to Train FERC-EIA Connector

The FERC-EIA record linkage process requires training data to work properly. Training matches also serve as overrides. This notebook helps you check whether the machine learning algorithm did a good job of matching FERC and EIA records. When you confirm a good match or correct a bad one, this process turns it into training data.

This notebook has two purposes: 

1) [**Output override tools to verify connection between EIA and FERC1**](#verify-tools)
2) [**Upload changes to training data**](#upload-overrides)

## Settings

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pudl_rmi
from pudl_rmi.create_override_spreadsheets import *
                                           
import pudl
import sqlalchemy as sa
import logging
import sys

import warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine,
    freq='AS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)
rmi_out = pudl_rmi.coordinate.Output(pudl_out)

In [6]:
bb = generate_input_dfs(pudl_out, rmi_out)

Reading the FERC to EIA connection from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz
Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz
Grabbing depreciation study output from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/deprish.pkl.gz



In [18]:
gg = bb["ferc_eia"]
gg.apply(lambda x: is_best_match(x))

AttributeError: 'Series' object has no attribute 'capacity_mw_pct_diff'
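
The error above is consistent with `DataFrame.apply` running column by column, which is its default (`axis=0`): each `x` is then a whole column, not a row, so `x.capacity_mw_pct_diff` does not exist. The internals of `is_best_match` are not shown here, so the following is only a minimal sketch using a hypothetical stand-in function (`is_small_diff`) and made-up values:

```python
import pandas as pd


def is_small_diff(row, threshold=0.05):
    """Hypothetical stand-in for is_best_match; the real logic is not shown."""
    return abs(row.capacity_mw_pct_diff) <= threshold


# Made-up values, for illustration only.
df = pd.DataFrame({
    "record_id_ferc1": ["a", "b", "c"],
    "capacity_mw_pct_diff": [0.01, 0.20, -0.03],
})

# Default axis=0: the function receives each COLUMN as a Series.
try:
    df.apply(is_small_diff)
    column_wise_error = None
except AttributeError as err:
    column_wise_error = str(err)

# axis=1: the function receives each ROW, so column names are attributes.
row_wise = df.apply(is_small_diff, axis=1)

print(column_wise_error)
print(row_wise.tolist())
```

If `is_best_match` does expect a single row, passing `axis=1` to the call above would likely be the fix.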

## Specify Utilities & Years

In [3]:
# The utilities you'd like to review as a dictionary where the format is: 
# {<UTILITY>: {'utility_id_pudl': [1, 2, 3], 'utility_id_eia': [44, 55, 66]}}
# Each key will be output into a different Excel file.

specified_utilities = {
    # 'Dominion': {'utility_id_pudl': [292, 293, 349],
    #              'utility_id_eia': [17539, 17554, 19876]},
    # 'Evergy': {'utility_id_pudl': [159, 160, 161, 1270, 13243],
    #            'utility_id_eia': [10000, 10005, 56211, 3702, 55329]}, # pudl/eia 359/22500 --> 13243/55329, 1270/3702 --> BAD
    # 'IDACORP': {'utility_id_pudl': [140],
    #             'utility_id_eia': [9191]},
    # 'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
    #          'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190, 11830],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    # 'NextEra': {'utility_id_pudl': [121, 130],
    #             'utility_id_eia': [6452, 7801]},
    # 'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 361, 7],
    #         'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    # 'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
    #             'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    # 'Xcel': {'utility_id_pudl': [224, 302, 272, 11297],
    #          'utility_id_eia': [13781, 13780, 17718, 15466]}
}


uu = {
    'BHE': [12341, 14354, 13407, 17166],
    'Southern':[7140, 195, 12686, 17622]
}

# This can be 'all' or a list of any years within the FERC data, ex: [2006, 2007]
# These are the years you would like to consider fixing AND the years you would like to 
# consider for determining largest capacity (the latter is only used when `utilities = largest`).
specified_years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
                   2013, 2014, 2015, 2016, 2017, 2018, 2019] 
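
The `uu` dictionary in the cell above repeats the EIA IDs already listed in `specified_utilities`. A minimal sketch of deriving one from the other, so the two cannot drift apart, might look like this. The helper name `eia_ids_by_utility` is hypothetical and not part of `pudl_rmi`, and the length check assumes the two ID lists are meant to pair up position by position:

```python
# Same structure as `specified_utilities` in the cell above.
specified_utilities = {
    "BHE": {
        "utility_id_pudl": [185, 246, 204, 287],
        "utility_id_eia": [12341, 14354, 13407, 17166],
    },
    "Southern": {
        "utility_id_pudl": [123, 18, 190, 11830],
        "utility_id_eia": [7140, 195, 12686, 17622],
    },
}


def eia_ids_by_utility(utilities):
    """Hypothetical helper: return {utility name: list of EIA utility IDs}."""
    out = {}
    for name, ids in utilities.items():
        # Assumption: each PUDL ID pairs with the EIA ID at the same position.
        if len(ids["utility_id_pudl"]) != len(ids["utility_id_eia"]):
            raise ValueError(f"{name}: PUDL and EIA ID lists differ in length")
        out[name] = list(ids["utility_id_eia"])
    return out


uu = eia_ids_by_utility(specified_utilities)
print(uu)
```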

In [5]:
generate_override_tools(pudl_out, rmi_out, uu, specified_years)

Reading the FERC to EIA connection from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
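
The traceback is not shown, so the exact line inside `generate_override_tools` that fails is unknown. In pandas, this `ValueError` is raised whenever a whole Series is evaluated as a single boolean, for example in an `if` statement or with `and`/`or`. A minimal sketch of the pattern and the usual remedies, with made-up values:

```python
import pandas as pd

# Made-up values, for illustration only.
capacity = pd.Series([10.0, 0.0, 25.0])

try:
    if capacity > 5:  # three comparisons, but `if` needs one answer
        pass
    error_message = None
except ValueError as err:
    error_message = str(err)

# State explicitly how the element-wise result should be reduced.
any_above = bool((capacity > 5).any())
all_above = bool((capacity > 5).all())

print(error_message)
print(any_above, all_above)
```

A similar failure can occur when a function written for one row is handed a whole column, so it may be worth checking what is being passed in.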

In [6]:
for util_name, util_ids in  specified_utilities.items():
    output_override_tools2(pudl_out, rmi_out, util_ids, specified_years, util_name)

TypeError: output_override_tools2() takes 4 positional arguments but 5 were given

In [122]:
util_df[util_df["utility_name_eia"].str.contains("estar")].utility_name_eia.unique().tolist()
#util_df[util_df["city"].str.contains('opeka')].utility_name_eia.unique().tolist()

['Cerestar USA Inc',
 'Questar Gas Management Co',
 'Questar Energy Trading Company',
 'Westar Energy Inc.']

In [96]:
util_df = pudl_out.utils_eia860()

for util, ids in specified_utilities.items():
    print(util.upper())
    for eia_id in ids["utility_id_eia"]:
        df = util_df[util_df["utility_id_eia"]==eia_id]
        print(f'   - EIA ID: {eia_id}')
        print(f'   - PUDL ID: {df.utility_id_pudl.unique().tolist()}')
        print(f'   - UTIL NAME: {df.utility_name_eia.unique().tolist()}')
        print('')

EVERGY
   - EIA ID: 10000
   - PUDL ID: [159]
   - UTIL NAME: ['Evergy Metro']

   - EIA ID: 10005
   - PUDL ID: [160]
   - UTIL NAME: ['Evergy Kansas South, Inc']

   - EIA ID: 56211
   - PUDL ID: [161]
   - UTIL NAME: ['Evergy Missouri West']

   - EIA ID: 3702
   - PUDL ID: [1270]
   - UTIL NAME: ['Clarksdale City of']

   - EIA ID: 55329
   - PUDL ID: [13243]
   - UTIL NAME: ['Westar Energy Inc.']

IDACORP
   - EIA ID: 9191
   - PUDL ID: [140]
   - UTIL NAME: ['Idaho Power Co']

DUKE
   - EIA ID: 5416
   - PUDL ID: [90]
   - UTIL NAME: ['Duke Energy Corp']

   - EIA ID: 6455
   - PUDL ID: [91]
   - UTIL NAME: ['Florida Power Corp']

   - EIA ID: 15470
   - PUDL ID: [92]
   - UTIL NAME: ['PSI Energy Inc']

   - EIA ID: 55729
   - PUDL ID: [93]
   - UTIL NAME: ['Duke Energy Kentucky Inc']

   - EIA ID: 3542
   - PUDL ID: [96]
   - UTIL NAME: ['Cincinnati Gas & Electric Co']

   - EIA ID: 3046
   - PUDL ID: [97]
   - UTIL NAME: ['Carolina Power & Light Co']

BHE
   - EIA ID: 12341
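
Reading the printed output pair by pair is error-prone. As a rough sketch, the position-by-position pairing of PUDL and EIA IDs could be checked against a lookup. The helper `find_mismatched_pairs` is hypothetical, and the lookup here is a plain dictionary copied from the Evergy output above, not the full `util_df` table:

```python
# EIA ID -> PUDL ID, copied from the printed Evergy output above.
eia_to_pudl = {10000: 159, 10005: 160, 56211: 161, 3702: 1270, 55329: 13243}

evergy = {
    "utility_id_pudl": [159, 160, 161, 1270, 13243],
    "utility_id_eia": [10000, 10005, 56211, 3702, 55329],
}


def find_mismatched_pairs(ids, lookup):
    """Hypothetical helper: return (pudl_id, eia_id) pairs the lookup disagrees with."""
    mismatches = []
    for pudl_id, eia_id in zip(ids["utility_id_pudl"], ids["utility_id_eia"]):
        # A missing EIA ID also counts as a mismatch, because .get returns None.
        if lookup.get(eia_id) != pudl_id:
            mismatches.append((pudl_id, eia_id))
    return mismatches


print(find_mismatched_pairs(evergy, eia_to_pudl))
```

A consistent ID pair can still point at the wrong utility: EIA ID 3702 resolves to 'Clarksdale City of' in the output above, so the names still need a manual look.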
   

In [10]:
util_df = pudl_out.utils_eia860()

In [24]:
#pudl_eia = util_df[['utility_id_pudl', 'utility_id_eia']].drop_duplicates()
mm = dict(zip(util_df['utility_id_pudl'], util_df['utility_id_eia']))
deprish_df['utility_id_eia'] = deprish_df.utility_id_pudl.map(mm).astype("Int64")



{3411: 7,
 3412: 8,
 7103: 20,
 411: 21,
 8413: 23,
 4167: 50081,
 1934: 25,
 896: 34,
 413: 35,
 7761: 39,
 3840: 40,
 3841: 42,
 6936: 46,
 3842: 52,
 9340: 54,
 9341: 55,
 7592: 56,
 9342: 59,
 7037: 60,
 3843: 65,
 388: 82,
 379: 84,
 9343: 87,
 1915: 88,
 9344: 97,
 9345: 108,
 415: 109,
 9346: 113,
 6899: 114,
 9347: 116,
 9348: 118,
 9349: 122,
 9350: 123,
 9351: 127,
 8773: 128,
 9352: 134,
 419: 135,
 406: 142,
 9353: 144,
 9354: 146,
 9355: 149,
 397: 150,
 8685: 151,
 9356: 154,
 9357: 155,
 3844: 156,
 9358: 157,
 9359: 162,
 3845: 163,
 402: 164,
 7823: 172,
 8801: 174,
 9360: 176,
 407: 177,
 3846: 178,
 3847: 179,
 9361: 182,
 9362: 183,
 2888: 189,
 9363: 191,
 9364: 192,
 438: 194,
 18: 195,
 8634: 197,
 9365: 198,
 9366: 201,
 9367: 202,
 4328: 204,
 9368: 207,
 9369: 211,
 9370: 212,
 19: 213,
 432: 219,
 9371: 220,
 433: 221,
 897: 222,
 898: 228,
 9372: 229,
 9373: 230,
 9374: 232,
 9375: 240,
 437: 241,
 9376: 243,
 9377: 244,
 7227: 252,
 9378: 253,
 7371: 257,
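
One caveat about building the mapping with `dict(zip(...))`: a dictionary holds one value per key, so if any PUDL ID appears with more than one EIA ID in `util_df` (not verified here), only the last pairing survives. A small sketch with made-up ID pairs shows the effect and one way to spot the affected keys:

```python
from collections import defaultdict

# Toy (utility_id_pudl, utility_id_eia) pairs; the values are made up.
pairs = [(1, 100), (2, 200), (2, 201), (3, 300)]

# dict(zip(...)) keeps only the LAST EIA ID seen for each PUDL ID.
pudl_ids, eia_ids = zip(*pairs)
mapping = dict(zip(pudl_ids, eia_ids))

# Group every EIA ID by PUDL ID to see which keys were collapsed.
grouped = defaultdict(set)
for pudl_id, eia_id in pairs:
    grouped[pudl_id].add(eia_id)
ambiguous = {k: sorted(v) for k, v in grouped.items() if len(v) > 1}

print(mapping)    # {1: 100, 2: 201, 3: 300}
print(ambiguous)  # {2: [200, 201]}
```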
 

## Gather Inputs

In [7]:
plant_parts_df = rmi_out.grab_plant_part_list().reset_index()

Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz


In [8]:
ferc_eia_df = rmi_out.grab_ferc1_to_eia()

Reading the FERC to EIA connection from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/ferc1_eia.pkl.gz


In [9]:
deprish_df = (
    rmi_out.grab_deprish()
    .assign(report_year=lambda x: x.report_date.dt.year.astype("Int64"))
)

Grabbing depreciation study output from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/deprish.pkl.gz


In [15]:
inputs_dict = {
    "ppl": rmi_out.grab_plant_part_list().pipe(_prep_ppl, pudl_out),
    "ferc_eia": rmi_out.grab_ferc1_to_eia().pipe(_prep_ferc_eia, pudl_out),
    "deprish": rmi_out.grab_deprish().pipe(_prep_deprish, pudl_out),
}

True

In [16]:
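li = ["plant_part_list", "ferc1_to_eia", "deprish"]

# Load each table through its matching `grab_*` method on `rmi_out`
# (grab_plant_part_list, grab_ferc1_to_eia, grab_deprish).
li_di = {}
for table_name in li:
    li_di[table_name] = getattr(rmi_out, f"grab_{table_name}")()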

In [8]:
for util_name, util_ids in  specified_utilities.items():
    output_override_sheets(util_name, util_ids, specified_years)

Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz
Prepping Plant Parts Table


KeyError: "['record_id_eia'] not in index"

<a id='verify-tools'></a>
## 1) Output Override Tools
Run the following cell to create Excel files named `<UTILITY>_fix_FERC-EIA_overrides.xlsx` in the `outputs/overrides` directory, one per utility, based on the utility and year inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing and verifying the FERC-EIA connections.
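
Once the cell below has run, a short sketch like the following can confirm which files were written. The helper `expected_override_paths` is hypothetical, and it assumes `outputs/overrides` is resolved relative to the current working directory, which may not match your setup:

```python
from pathlib import Path


def expected_override_paths(utility_names, output_dir="outputs/overrides"):
    """Hypothetical helper: map each utility name to its expected override file."""
    return {
        name: Path(output_dir) / f"{name}_fix_FERC-EIA_overrides.xlsx"
        for name in utility_names
    }


paths = expected_override_paths(["BHE", "Southern"])
for name, path in paths.items():
    status = "exists" if path.exists() else "missing"
    print(name, path.as_posix(), status)
```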

In [24]:
%%time
for util_name, id_dict in specified_utilities.items():
    output_override_tools(
        ferc_eia=ferc_eia_df, 
        ppl=plant_parts_df, 
        deprish=deprish_df,
        util_ids=id_dict,
        util_name=util_name,
        years=specified_years,
        pudl_out=pudl_out
    )

Making override file for BHE
Loading ferc-eia subset
Prepping FERC-EIA table
Getting utility-year subsets
Loading plant part list subset
Prepping Plant Parts Table
Getting utility-year subsets
Loading depreciation subset
Getting utility-year subsets


KeyError: 'utility_id_eia'

<a id='upload-overrides'></a>
## 2) Upload changes to training data
When you've finished editing the `<UTILITY>_fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override csv, move your file to the directory called `add_to_training` and then run the following function. 

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, you will want to set `expect_override_overrides = True`. Otherwise, the function will check to see if you have accidentally tampered with values that have already been matched.

Right now, the module points to a COPY of the training data so it doesn't overwrite the official version. You'll need to change that later if you want to update the official version.
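
How `validate_and_add_to_training` performs its check is not shown in this notebook. As a rough sketch of the idea behind `expect_override_overrides`, the following compares new overrides with existing training matches and reports any FERC record whose EIA match would change. The helper `find_override_conflicts` is hypothetical, and the pairings are illustrative rather than taken from the real training data:

```python
def find_override_conflicts(existing, new):
    """Hypothetical helper: FERC record IDs whose EIA match would change.

    Both arguments map record_id_ferc1 -> record_id_eia.
    """
    return {
        ferc_id: (existing[ferc_id], eia_id)
        for ferc_id, eia_id in new.items()
        if ferc_id in existing and existing[ferc_id] != eia_id
    }


# Illustrative pairings only.
existing_training = {
    "f1_steam_2018_12_186_0_1": "10773_2018_plant_owned_19876",
    "f1_steam_2018_12_186_3_2": "10774_2018_plant_owned_19876",
}
new_overrides = {
    # Changes a match that is already in the training data.
    "f1_steam_2018_12_186_0_1": "10774_2018_plant_owned_19876",
    # Adds a match for a record with no existing entry.
    "f1_hydro_2018_12_144_0_1": "1005_2018_plant_owned_15470",
}

conflicts = find_override_conflicts(existing_training, new_overrides)
expect_override_overrides = False
if conflicts and not expect_override_overrides:
    print(f"Unexpected changes to existing matches: {sorted(conflicts)}")
```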

In [25]:
validate_and_add_to_training(connects_ferc1_eia, expect_override_overrides=False)

Processing fixes in Evergy_fix_FERC-EIA_overrides.xlsx
Validating overrides
Processing fixes in IDACORP_fix_FERC-EIA_overrides.xlsx
Validating overrides
Adding overrides to training data
Combining all new overrides with existing training data


| record_id_eia | record_id_ferc1 | signature | notes | signature_1 |
|---|---|---|---|---|
| 1005_2018_plant_owned_15470 | f1_hydro_2018_12_144_0_1 | | | |
| 10773_2018_plant_owned_19876 | f1_steam_2018_12_186_0_1 | | | |
| 10774_2018_plant_owned_19876 | f1_steam_2018_12_186_3_2 | | | |
| 10_GT_2018_plant_prime_mover_owned_195 | f1_steam_2018_12_2_1_2 | | | |
| 1109_2018_plant_owned_19436 | f1_hydro_2018_12_177_0_2 | | | |
| ... | ... | ... | ... | ... |
| 470_2018_plant_total_15466 | f1_steam_2018_12_145_0_4 | | | |
| 6112_2018_plant_total_15466 | f1_steam_2018_12_145_2_4 | | | |
| 6112_gt_2018_plant_prime_mover_total_15466 | f1_steam_2018_12_145_2_5 | | | |
| 56219_j1_2005_plant_gen_owned_22500 | f1_gnrt_plant_2005_12_191_0_1 | | test | AS |
