# Manually Overriding FERC-EIA Record Linkage

The FERC-EIA record linkage process requires training data in order to work properly. Training matches also serve as overrides. This notebook helps you check whether the machine learning algorithm did a good job of matching FERC and EIA records. If you confirm a good match (or correct a bad one), this process will turn it into training data.

This notebook has three steps:

- [**Step 1: Output Override Tools:**](#verify-tools) Where you create and output the spreadsheets used to conduct the manual overrides.
- [**Step 2: Validate New Training Data:**](#validate) Where you check that the overrides you made are sound.
- [**Step 3: Upload Changes to Training Data:**](#upload-overrides) Where you integrate the overrides into the training data.

## Settings

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
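# import pudl_rmi  # don't forget to pip install -e . if you need to
# from pudl_rmi.create_override_spreadsheets import *

import logging
import os  # used below to list the override files
import sys

import pandas as pd  # used below to read the pickles, CSV, and Excel files
import pudl
import sqlalchemy as sa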

import warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine,
    freq='AS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)

from pudl.analysis.ferc1_eia import *

In [None]:
# Old format for specified_utilities; superseded by the definition in Step 1.

specified_utilities = {
    # 'Dominion': {'utility_id_pudl': [292, 293, 349],
    #              'utility_id_eia': [17539, 17554, 19876]},
    # 'Evergy': {'utility_id_pudl': [159, 160, 161, 1270, 13243],
    #            'utility_id_eia': [10000, 10005, 56211, 25000]},
    # 'IDACORP': {'utility_id_pudl': [140],
    #             'utility_id_eia': [9191]},
    # 'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
    #          'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190, 11830],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    # 'NextEra': {'utility_id_pudl': [121, 130],
    #             'utility_id_eia': [6452, 7801]},
    # 'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 361, 7],
    #         'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    # 'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
    #             'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    # 'Xcel': {'utility_id_pudl': [224, 302, 272, 11297],
    #          'utility_id_eia': [13781, 13780, 17718, 15466]}
}

<a id='verify-tools'></a>
## Step 1: Output Override Tools

In [None]:
specified_utilities = {
    #'BHE': [12341, 14354, 13407, 17166],
    #'Southern':[7140, 195, 12686, 17622]
    #'Dominion': [17539, 17554, 19876]
    #'Entergy': [11241, 814, 12465, 55937, 13478],
    #'Xcel': [13781, 13780, 17718, 15466],
    #'NextEra': [6452, 7801]
    #'IDACORP': [9191]
    'Evergy': [10000, 10005, 56211, 22500]
}

specified_years = [2020
    # 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
    # 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
]

Run the following function to generate Excel files named `<UTILITY>_fix_FERC-EIA_overrides.xlsx` in the `outputs/overrides` directory, based on the utilities and years you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing and verifying the FERC-EIA connections.

In [None]:
# NOTE: `rmi_out` is not defined in the cells above; define it before running this cell.
generate_all_override_spreadsheets(pudl_out, rmi_out, specified_utilities, specified_years)

<a id='validate'></a>
## Step 2: Validate New Training Data

Once you've finished checking the matches, make sure every record you want to validate is set to `verified=TRUE`. Then move the file into the `add_to_ferc1_eia_training` folder (the directory that `path_to_overrides` points to below) and run the following cells:
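Before moving a spreadsheet, it can help to confirm which rows are actually flagged as verified. The following is a minimal sketch on toy data, not part of the PUDL tooling: the helper `select_verified` and the record ID column names and values are hypothetical, and only the `verified` column comes from the instructions above. It assumes the flag may arrive from Excel either as a real boolean or as the string `TRUE`.

```python
import pandas as pd


def select_verified(overrides: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose ``verified`` flag reads as TRUE."""
    # Excel may hand back real booleans or the strings "TRUE"/"FALSE",
    # so compare on a normalized upper-case string. Blank cells become False.
    is_verified = overrides["verified"].map(
        lambda value: str(value).strip().upper() == "TRUE"
    ).astype(bool)
    return overrides[is_verified]


# Toy data; the record ID column names and values are hypothetical.
example = pd.DataFrame(
    {
        "record_id_ferc1": ["f1_a", "f1_b", "f1_c"],
        "record_id_eia": ["eia_1", "eia_2", "eia_3"],
        "verified": [True, "TRUE", None],
    }
)
verified_rows = select_verified(example)
print(f"{len(verified_rows)} of {len(example)} rows are marked verified")
```

Rows with a blank or FALSE flag are dropped here, which makes it easy to spot records you meant to verify but missed.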

In [None]:
# Define function inputs
#ferc1_eia_df = pudl_out.ferc1_eia()
#ppl_df = pudl_out.plant_parts_eia().reset_index()

ferc1_eia_df = pd.read_pickle("ferc1_eia.pkl")
ppl_df = pd.read_pickle("ppl.pkl")

In [None]:
utils_df = pudl_out.utils_eia860()
training_df = pd.read_csv("../../src/pudl/package_data/glue/ferc1_eia_train.csv")
path_to_overrides = "../../src/pudl/package_data/glue/add_to_ferc1_eia_training/" 

override_files = os.listdir(path_to_overrides)
override_files = [file for file in override_files if file.endswith(".xlsx")]

In [None]:
# See if there are any values that aren't in the training data yet. 
# This is useful if you're not sure whether some part (or all) of the data 
# has been integrated into the training data yet.
for file in override_files:
    
    print(f"VALIDATING {file} ************** ")
    file_df = pd.read_excel(path_to_overrides + file)
    
    pudl.analysis.ferc1_eia.already_in_training(
        training_data=training_df,
        validated_connections=file_df
    )
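To illustrate the idea behind this check, here is a minimal sketch of one way to find connections that are not yet in the training data. It is not the implementation of `already_in_training`; the helper `not_yet_in_training` is hypothetical, and the column names `record_id_ferc1` and `record_id_eia` are assumptions used only for the toy data.

```python
import pandas as pd

# Assumed key columns identifying a FERC-EIA connection (hypothetical names).
KEYS = ["record_id_ferc1", "record_id_eia"]


def not_yet_in_training(
    training: pd.DataFrame, candidates: pd.DataFrame
) -> pd.DataFrame:
    """Return candidate connections that do not appear in the training data."""
    # A left merge with indicator=True labels each candidate row as
    # "both" (already in training) or "left_only" (new).
    merged = candidates.merge(
        training[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", list(candidates.columns)]


# Toy data with made-up record IDs.
toy_training = pd.DataFrame(
    {"record_id_ferc1": ["f1_a"], "record_id_eia": ["eia_1"]}
)
toy_candidates = pd.DataFrame(
    {"record_id_ferc1": ["f1_a", "f1_b"], "record_id_eia": ["eia_1", "eia_2"]}
)
new_rows = not_yet_in_training(toy_training, toy_candidates)
print(new_rows)
```

An empty result would suggest that the whole file has already been integrated.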

In [None]:
for file in override_files:
    
    print(f"VALIDATING {file} ************** ")
    file_df = pd.read_excel(path_to_overrides + file)
    
    validate_override_fixes(
        validated_connections=file_df,
        utils_eia860=utils_df,
        ppl=ppl_df,
        ferc1_eia=ferc1_eia_df,
        training_data=training_df,
        expect_override_overrides=True,
        allow_mismatched_utilities=True
    )
    
    print(" ")

<a id='upload-overrides'></a>
## Step 3: Upload Changes to Training Data

When you've finished editing `<UTILITY>_fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override CSV, move your file to the `add_to_ferc1_eia_training` directory and then run the following function.

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, set `expect_override_overrides=True`. Otherwise, the function checks whether you have accidentally tampered with values that have already been matched.
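The kind of conflict that this note describes can be sketched as follows. This is an illustration under stated assumptions, not how the PUDL functions implement the check: the helper `find_override_overrides` is hypothetical, as are the column names `record_id_ferc1` and `record_id_eia` and the toy record IDs.

```python
import pandas as pd


def find_override_overrides(
    training: pd.DataFrame, candidates: pd.DataFrame
) -> pd.DataFrame:
    """Return candidates whose FERC record is already matched to a different EIA record."""
    # Join on the FERC record ID so each candidate lines up with its
    # existing training match, if there is one.
    merged = candidates.merge(
        training[["record_id_ferc1", "record_id_eia"]],
        on="record_id_ferc1",
        how="inner",
        suffixes=("_new", "_old"),
    )
    # Keep only rows where the proposed match differs from the existing one.
    return merged[merged["record_id_eia_new"] != merged["record_id_eia_old"]]


# Toy data with made-up record IDs.
toy_training = pd.DataFrame(
    {"record_id_ferc1": ["f1_a", "f1_b"], "record_id_eia": ["eia_1", "eia_2"]}
)
toy_candidates = pd.DataFrame(
    {
        "record_id_ferc1": ["f1_a", "f1_b", "f1_c"],
        "record_id_eia": ["eia_1", "eia_9", "eia_3"],
    }
)
conflicts = find_override_overrides(toy_training, toy_candidates)
print(conflicts)
```

If a listing like this is non-empty and the changes are intentional, that is the situation in which `expect_override_overrides=True` applies.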

Right now, the module points to a COPY of the training data so it doesn't override the official version. You'll need to change that later if you want to update the official version.

In [None]:
validate_and_add_to_training(
    utils_eia860=utils_df,
    ppl=ppl_df,
    ferc1_eia=ferc1_eia_df,
    expect_override_overrides=True, 
    allow_mismatched_utilities=True,
)