# Add Overrides to Train FERC-EIA Connector

This notebook helps you add overrides to the FERC-EIA connection CSV. New connections fill in gaps and improve the program's ability to predict other matches. To check each connection adequately, we provide subsets of *three* different spreadsheets:

1) **The current FERC-EIA connection:** to look for good, bad, and empty links between FERC and EIA records.
2) **The Master Unit List:** to confirm or disprove those connections.
3) **Depreciation data** from our previous work.

Downloading all the files at once will overwhelm Excel, so edits need to be made in segments. This notebook will help you:

1) **Download useful utility-based subsets of each table for review.**
2) **Update the old training data with new verified matches.**

Once you edit the inputs below, run the entire notebook. Next, pick whether you want to [download tools to verify connection between EIA and FERC1](#verify-tools) or [upload changes to training data](#upload-overrides) and run the relevant functions.
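
As a rough illustration of what "utility-based subsets" means here, the sketch below filters a toy table down to a set of utility IDs and years with pandas. It is not the code used by `pudl_rmi`; the helper name `subset_by_utility_and_year` and the `report_year` column name are assumptions made for the example.

```python
import pandas as pd


def subset_by_utility_and_year(df, util_ids, years,
                               id_col="utility_id_pudl",
                               year_col="report_year"):
    """Keep only rows for the given utility IDs and years.

    Hypothetical helper: `year_col` is an assumed column name.
    `years` may be the string 'all' or a list of integer years.
    """
    mask = df[id_col].isin(util_ids)
    if years != "all":
        mask &= df[year_col].isin(years)
    return df[mask]


# Toy data standing in for one of the large tables.
toy = pd.DataFrame({
    "utility_id_pudl": [185, 185, 246, 90],
    "report_year": [2018, 2004, 2019, 2018],
    "record_id": ["a", "b", "c", "d"],
})

subset = subset_by_utility_and_year(
    toy, util_ids=[185, 246, 204, 287], years=[2018, 2019]
)
print(subset)
```

Each per-utility subset stays small enough to open comfortably in a spreadsheet, which is the reason for working in segments.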

## Settings

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pudl_rmi
from pudl_rmi.create_override_spreadsheets import *
from pudl_rmi.connect_ferc1_to_eia import *

import pudl
import sqlalchemy as sa
import logging
import pathlib
import sys

import warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine,
    freq='AS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)

## Specify Utilities & Years

In [3]:
# The utilities you'd like to review, as a dictionary in the format:
# {<UTILITY>: {'utility_id_pudl': [1, 2, 3], 'utility_id_eia': [44, 55, 66]}}
# Each key is written to a separate Excel file.

specified_utilities = {
    # 'Dominion': {'utility_id_pudl': [292, 293, 349],
    #              'utility_id_eia': [17539, 17554, 19876]},
    # 'Evergy': {'utility_id_pudl': [159, 160, 161, 359],
    #            'utility_id_eia': [10000, 10005, 56211, 3702, 22500]},
    # 'IDACORP': {'utility_id_pudl': [140],
    #             'utility_id_eia': [9191]},
    # 'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
    #          'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    # 'Southern': {'utility_id_pudl': [123, 18, 190],
    #              'utility_id_eia': [7140, 195, 12686, 17622]},
    # 'NextEra': {'utility_id_pudl': [121, 130],
    #             'utility_id_eia': [6452, 7801]},
    # 'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 7],
    #         'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    # 'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
    #             'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    # 'Xcel': {'utility_id_pudl': [224, 302, 272],
    #          'utility_id_eia': [13781, 13780, 17718, 15466]},
}

# This can be 'all' or a list of any years within the FERC data, e.g. [2006, 2007].
# These are the years you would like to consider fixing AND the years used to
# determine largest capacity (the latter only applies when `utilities = largest`).
specified_years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
                   2013, 2014, 2015, 2016, 2017, 2018, 2019] 
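
Because a malformed settings dictionary only fails later, a quick sanity check can save a long run. The following is a minimal sketch of such a check based on the format described in the comments above; `check_settings` is a hypothetical helper, not part of `pudl_rmi`.

```python
def check_settings(utilities, years):
    """Raise ValueError if the settings do not follow the documented format.

    Hypothetical helper: each utility must map to a dict holding lists of
    integer IDs under 'utility_id_pudl' and 'utility_id_eia', and `years`
    must be the string 'all' or a list of integer years.
    """
    required = {"utility_id_pudl", "utility_id_eia"}
    for name, ids in utilities.items():
        missing = required - set(ids)
        if missing:
            raise ValueError(f"{name} is missing keys: {sorted(missing)}")
        for key in sorted(required):
            if not all(isinstance(i, int) for i in ids[key]):
                raise ValueError(f"{name}[{key!r}] must be a list of integers")
    if years != "all" and not all(isinstance(y, int) for y in years):
        raise ValueError("years must be 'all' or a list of integer years")


example_utilities = {
    "BHE": {"utility_id_pudl": [185, 246, 204, 287],
            "utility_id_eia": [12341, 14354, 13407, 17166]},
}
check_settings(example_utilities, list(range(2005, 2020)))
```

In the notebook, the same call could be made with `specified_utilities` and `specified_years` right after they are defined.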

## Gather Inputs

In [10]:
rmi_out = pudl_rmi.coordinate.Output(
    pudl_out,
)

plant_parts_df = rmi_out.grab_plant_part_list().reset_index()

Reading the plant part list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/Repos/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz


In [5]:
# FERC-EIA 

inputs = connect_ferc1_to_eia.InputManager(file_path_training, pudl_out, plant_parts_df)
features_all = (connect_ferc1_to_eia.Features(feature_type='all', inputs=inputs)
                 .get_features(clobber=False))
features_train = (connect_ferc1_to_eia.Features(feature_type='training', inputs=inputs)
                  .get_features(clobber=False))
tuner = connect_ferc1_to_eia.ModelTuner(features_train, inputs.get_train_index(), n_splits=10)

matcher = connect_ferc1_to_eia.MatchManager(best=tuner.get_best_fit_model(), inputs=inputs)
matches_best = matcher.get_best_matches(features_train, features_all)

Preparing the FERC1 tables.
loading steam table
loading small gens table
loading hydro table
loading pumped storage table
prepping steam table
prepping hydro tables
combining all tables
Generated 168541 all candidate features.
Generated 1144 training candidate features.
We are about to test hyper parameters of the model while doing k-fold cross validation. This takes a few minutes....
Scores from the best model hyperparameters:
    F-Score:   0.87
    Precision: 0.87
    Accuracy:  0.74

Fit and predict a model w/ the highest scoring hyperparameters.
Get the top scoring match for each FERC1 plant record.
Winning match stats:
    matches vs ferc:      71.23%
    best match v ferc:    62.18%
    best match vs matches:87.29%
    murk vs matches:      9.60%
    ties vs matches:      1.02%

Overridden records:       22.9%
New best match v ferc:    62.71%


In [6]:
connects_ferc1_eia = (
    prettyify_best_matches(
        matches_best, 
        plant_parts_true_df=inputs.plant_parts_true_df,
        plants_ferc1_df=inputs.plants_ferc1_df,
        train_df=inputs.train_df)
    .copy()
)

jsuk there are some FERC-EIA matches that aren't in the steam table but this is because they are linked to retired EIA generators.
Coverage for matches during EIA working years:
    Fuel type: 97.1%
    Tech type: 95.0%

Coverage for all steam table records during EIA working years:
    EIA matches: 71.7

Coverage for all small gen table records during EIA working years:
    EIA matches: 37.9

Coverage for all hydro table records during EIA working years:
    EIA matches: 84.8

Coverage for all pumped storage table records during EIA working years:
    EIA matches: 48.6
Matches with consistency across years of all matches is 79.6%
Matches with completely consistent FERC capacity have a consistency of 93.1%
Matches with consistency across years of overrides matches is 41.8%
Matches with completely consistent FERC capacity have a consistency of 87.2%


In [78]:
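# Depreciation data
# NOTE: as written, this cell fails with the TypeError shown below because
# Extractor does not accept a `file_path` keyword. Check the current
# pudl_rmi.deprish.Extractor signature before running it.

file_path_deprish = pathlib.Path.cwd().parent / 'inputs' / 'depreciation_rmi.xlsx'
sheet_name_deprish = 'Depreciation Studies Raw'
transformer = pudl_rmi.deprish.Transformer(
    pudl_rmi.deprish.Extractor(
        file_path=file_path_deprish,
        sheet_name=sheet_name_deprish
    ).execute())

deprish_df = transformer.execute()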

TypeError: Extractor.__init__() got an unexpected keyword argument 'file_path'

<a id='verify-tools'></a>
## 1) Output Override Tools
Running the following function creates Excel files named `<UTILITY>_fix_FERC-EIA_overrides.xlsx` in the `outputs/overrides` directory, one per utility, based on the utility and year inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing/verifying the FERC-EIA connections.
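
The file naming described above can be sketched as follows, which is handy for checking which override workbooks exist after a run. This is a minimal sketch: `expected_override_paths` is a hypothetical helper, and it assumes `outputs/overrides` is resolved relative to the current working directory, which may not match your checkout.

```python
from pathlib import Path


def expected_override_paths(util_names, out_dir=Path("outputs") / "overrides"):
    """Map each utility name to the override workbook path expected for it.

    Hypothetical helper; `out_dir` is an assumed location.
    """
    return {
        name: Path(out_dir) / f"{name}_fix_FERC-EIA_overrides.xlsx"
        for name in util_names
    }


paths = expected_override_paths(["BHE"])
for name, path in paths.items():
    status = "found" if path.exists() else "not found"
    print(f"{name}: {path} ({status})")
```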

In [13]:
%%time
for util_name, id_dict in specified_utilities.items():
    output_override_tools2(
        ferc_eia=connects_ferc1_eia, 
        mul=plant_parts_df, 
        util_ids=id_dict,
        util_name=util_name,
        years=specified_years,
        pudl_out=pudl_out
    )

Making override file for BHE
Prepping FERC-EIA table
Getting utility-year subsets
Prepping Plant Parts Table
Getting utility-year subsets
Outputing table subsets to tabs

CPU times: user 14.5 s, sys: 5.03 s, total: 19.5 s
Wall time: 21.4 s


<a id='upload-overrides'></a>
## 2) Upload changes to training data
When you've finished editing `<UTILITY>_fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override CSV, move your file to the directory called `add_to_training` and then run the following function.
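
Moving the workbook can be done by hand or scripted. The sketch below shows one way to script it, demonstrated in a temporary directory so no real files are touched. `stage_for_training` is a hypothetical helper, and placing `add_to_training` inside the overrides directory is an assumption; point it at wherever that directory lives in your checkout.

```python
import shutil
import tempfile
from pathlib import Path


def stage_for_training(xlsx_path, staging_dir):
    """Move an edited override workbook into the staging directory.

    Hypothetical helper; it refuses to guess and raises if the staging
    directory does not exist.
    """
    xlsx_path = Path(xlsx_path)
    staging_dir = Path(staging_dir)
    if not staging_dir.is_dir():
        raise FileNotFoundError(f"Staging directory not found: {staging_dir}")
    target = staging_dir / xlsx_path.name
    shutil.move(str(xlsx_path), str(target))
    return target


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    overrides_dir = Path(tmp) / "overrides"
    staging = overrides_dir / "add_to_training"  # assumed location
    staging.mkdir(parents=True)
    edited = overrides_dir / "BHE_fix_FERC-EIA_overrides.xlsx"
    edited.touch()
    moved = stage_for_training(edited, staging)
    moved_ok = moved.exists() and not edited.exists()
```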

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, you will want to set `expect_override_overrides = True`. Otherwise, the function will check to see if you have accidentally tampered with values that have already been matched.
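
To see ahead of time whether your edits touch records that are already in the training data, you can compare record IDs yourself. The sketch below illustrates the idea with pandas; it is not the validation that `validate_and_add_to_training` performs, and `find_reoverridden` is a hypothetical helper that assumes both tables carry a `record_id_ferc1` column.

```python
import pandas as pd


def find_reoverridden(new_overrides, training, key="record_id_ferc1"):
    """Return rows of `new_overrides` whose FERC record is already in `training`.

    Hypothetical helper: assumes both frames have a `key` column.
    """
    return new_overrides[new_overrides[key].isin(training[key])]


# Toy stand-ins for the existing training data and a new batch of overrides.
training = pd.DataFrame({
    "record_id_eia": ["1005_2018_plant_owned_15470"],
    "record_id_ferc1": ["f1_hydro_2018_12_144_0_1"],
})
new_overrides = pd.DataFrame({
    "record_id_eia": ["1005_2018_plant_owned_15470",
                      "10773_2018_plant_owned_19876"],
    "record_id_ferc1": ["f1_hydro_2018_12_144_0_1",
                        "f1_steam_2018_12_186_0_1"],
})

clashes = find_reoverridden(new_overrides, training)
print(clashes)
```

If `clashes` is non-empty and the changes are intentional, that is the situation in which `expect_override_overrides = True` applies.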

Right now, the module points to a COPY of the training data so that it doesn't overwrite the official version. You'll need to change that later if you want to update the official version.

In [25]:
validate_and_add_to_training(connects_ferc1_eia, expect_override_overrides=False)

Processing fixes in Evergy_fix_FERC-EIA_overrides.xlsx
Validating overrides
Processing fixes in IDACORP_fix_FERC-EIA_overrides.xlsx
Validating overrides
Adding overrides to training data
Combining all new overrides with existing training data


| record_id_eia | record_id_ferc1 | signature | notes | signature_1 |
|---|---|---|---|---|
| 1005_2018_plant_owned_15470 | f1_hydro_2018_12_144_0_1 | | | |
| 10773_2018_plant_owned_19876 | f1_steam_2018_12_186_0_1 | | | |
| 10774_2018_plant_owned_19876 | f1_steam_2018_12_186_3_2 | | | |
| 10_GT_2018_plant_prime_mover_owned_195 | f1_steam_2018_12_2_1_2 | | | |
| 1109_2018_plant_owned_19436 | f1_hydro_2018_12_177_0_2 | | | |
| ... | ... | ... | ... | ... |
| 470_2018_plant_total_15466 | f1_steam_2018_12_145_0_4 | | | |
| 6112_2018_plant_total_15466 | f1_steam_2018_12_145_2_4 | | | |
| 6112_gt_2018_plant_prime_mover_total_15466 | f1_steam_2018_12_145_2_5 | | | |
| 56219_j1_2005_plant_gen_owned_22500 | f1_gnrt_plant_2005_12_191_0_1 | | test | AS |
