# Add Overrides to Train FERC-EIA Connector

This notebook helps you add overrides to the FERC-EIA connection CSV. Adding new connections will fill in gaps and improve the program's ability to predict other matches. To begin this process, you'll need to look at *two* spreadsheets:

1) **The current FERC-EIA connection:** to look for good, bad, and empty links between FERC and EIA records
2) **The Master Unit List:** to confirm or disprove those connections

Downloading both files in their entirety will overwhelm Excel, so we need to make our edits in segments. This notebook will help you:

1) **Download useful utility-based segments of each table for review**
2) **Update the old training data**

## Edit Inputs

It's time to choose what kind of data you'd like to wrangle first. We'll only download data from a specific subset of utilities and years if you say so. If you're not sure which PUDL IDs refer to which utilities, scroll down to section 1.3.

In [81]:
# This can be 'largest' or a list of pudl ids, ex: [1, 2, 3]
specified_utilities = 'largest' 

# You can change this to any integer. This represents the number of utilities you'd like
# to review (only applies when specified_utilities='largest').
specified_amount = 2 

# This can be 'all' or a list of any years within the FERC data, ex: [2006, 2007]
# These are the years you would like to consider fixing AND the years you would like to
# consider for determining largest capacity (the latter is only used when
# `specified_utilities = 'largest'`).
specified_years = [2018] 

## Verify Connections
When you un-comment and run the following function, you'll find two new CSV files in the output directory, created from the inputs specified above. Read the Instruction Manual to learn how to begin fixing/verifying the FERC-EIA connections.

**Warning:** Running this function will REPLACE any override tools you currently have saved (unless you have changed their names). DO NOT run this function if you are in the middle of working on one of the output files.

In [82]:
# output_override_tools(
#     check_connections, 
#     mul, 
#     utilities=specified_utilities,
#     amount=specified_amount,
#     years=specified_years
# )

## Upload Changes
When you've finished editing `ferc_eia_util_subset.csv` and want to add your changes to the official override CSV, you can uncomment and run the following line:

In [83]:
#training_data_out.to_csv(training_path)

----------

## Notebook Setup

In [84]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [85]:
import pandas as pd
import numpy as np
import pudl
import pudl.constants as pc
import pudl.extract.ferc1
import sqlalchemy as sa
import logging
import pathlib
import sys
import copy
from copy import deepcopy
import scipy
import statistics
import yaml

import recordlinkage as rl
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [86]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [87]:
sys.path.append("../")
from pudl.output.ferc1 import *
from pudl_rmi.connect_ferc1_to_eia import *
from pudl_rmi.make_plant_parts_eia import *
import pudl_rmi.connect_ferc1_to_eia
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc_engine = sa.create_engine(pudl_settings['ferc1_db'])
pd.options.display.max_columns = None

In [88]:
relevant_cols_ferc_eia = [
    'record_id_ferc1',
    'record_id_eia',
    'true_gran',
    'report_year',
    'match_type',
    'plant_part',
    'ownership',
    'utility_id_pudl_eia',
    'utility_name_ferc1',
    'plant_id_pudl_eia',
    'unit_id_pudl',
    'generator_id',
    'plant_name_eia',
    'plant_name_ferc1',
    'technology_description',
    'energy_source_code_1',
    'net_generation_mwh_eia',
    'net_generation_mwh_ferc1',
    'capacity_mw_eia',
    'capacity_mw_ferc1',
    'total_fuel_cost_eia',
    'total_fuel_cost_ferc1',
    'installation_year',
    'construction_year',
]

relevant_cols_mul = [
    'record_id_eia',
    'report_year',
    'utility_id_pudl',
    'utility_name_eia',
    'fraction_owned',
    'plant_id_eia',
    'plant_name_new',
    'generator_id',
    'capacity_mw',
    'capacity_factor',
    'net_generation_mwh',
    'installation_year',
    'energy_source_code_1',
    'technology_description',
    'prime_mover_code',
]

## **Part 1:** Generate Override Tools

### 1.1 Get current FERC-EIA & MUL tables
This is going to look a lot like the `connect_ferc1_to_eia.ipynb`.

In [89]:
file_path_training = pathlib.Path().cwd().parent / 'inputs' / 'train_ferc1_to_eia.csv'
file_path_mul = pathlib.Path().cwd().parent / 'outputs' / 'master_unit_list.pkl.gz'
# pudl output object for ferc data
pudl_out = pudl.output.pudltabl.PudlTabl(
    pudl_engine, ferc_engine, freq='AS',
    fill_fuel_cost=True, roll_fuel_cost=True, fill_net_gen=False,
)

In [90]:
inputs = InputManager(file_path_training, file_path_mul, pudl_out)
features_all = (Features(feature_type='all', inputs=inputs)
                .get_features(clobber=False))
features_train = (Features(feature_type='training', inputs=inputs)
                  .get_features(clobber=False))
tuner = ModelTuner(features_train, inputs.get_train_index(), n_splits=10)

matcher = MatchManager(best=tuner.get_best_fit_model(), inputs=inputs)
matches_best = matcher.get_best_matches(features_train, features_all)

Preparing the FERC1 tables.
loading steam table
loading small gens table
loading hydro table
loading pumped storage table
loading fbp table
prepping steam table
prepping hydro tables
combining all tables
Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz
Generated 136066 all candidate features.
Generated 526 training candidate features.
We are about to test hyper parameters of the model while doing k-fold cross validation. This takes a few minutes....
Scores from the best model hyperparameters:
  F-Score:   0.78
  Precision: 0.85
  Accuracy:  0.53
Fit and predict a model w/ the highest scoring hyperparameters.
Get the top scoring match for each FERC1 steam record.
Winning match stats:
        matches vs ferc:      64.98%
        best match v ferc:    56.62%
        best match vs matches:87.13%
        murk vs matches:      4.60%
        ties vs matches:      4.72%
Overridden records:       13.0%
New best match v fe

In [None]:
connects_ferc1_eia = (
    prettyify_best_matches(
        matches_best, 
        plant_parts_true_df=inputs.plant_parts_true_df,
        steam_df=inputs.all_plants_ferc1_df)
    [relevant_cols_ferc_eia].copy()
)

In [None]:
mul = (
    make_plant_parts_eia.get_master_unit_list_eia(file_path_mul)
    .reset_index()[relevant_cols_mul]
    .copy()
)

### 1.2 Get the portion of FERC-EIA subject to review

Here we will exclude the correctly mapped records. The `match_type` column in the `connects_ferc1_eia` table indicates the status of each match. We'll use it to filter for the records we are less sure about, i.e. the ones that aren't already in the override CSV. We'll also limit the columns in the output file to those that will be useful for analysing match correctness.

**Match Types:**

* `prediction`: prediction based on the training data.
* `correct_prediction`: prediction based on the training data that matches the record in the training data.
* `no prediction; training`: not filled in by the prediction algorithm but filled in by the training data.
* `overridden`: incorrectly filled in by the prediction algorithm and corrected by the training data.
* `NaN`: not filled in by the training data or the prediction algorithm.

In [None]:
connects_ferc1_eia.match_type.value_counts(dropna=False)

In [None]:
# Grab the FERC-EIA connections that are either predictions based on the training
# data or have not been filled.

check_connections = (
    connects_ferc1_eia[connects_ferc1_eia['match_type'].isin(
        [np.nan, 'prediction'])].copy()
)

# Add a column to tell whether it's a good match
check_connections.insert(0, "is_correct_match", np.nan)
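
The cell above passes `np.nan` to `isin` to pick up the unmatched records. As a minimal sketch on made-up data (the toy frame and its record IDs are hypothetical, not real PUDL records), the same selection can be spelled out with an explicit `isna()` check, which makes the treatment of missing values visible:

```python
import numpy as np
import pandas as pd

# Toy stand-in for connects_ferc1_eia; all values are invented for illustration.
toy_connects = pd.DataFrame({
    "record_id_ferc1": ["f1_a", "f1_b", "f1_c", "f1_d"],
    "match_type": ["prediction", "correct_prediction", np.nan, "overridden"],
})

# Keep rows that are unverified predictions or that have no match at all.
needs_review = toy_connects[
    toy_connects["match_type"].isna()
    | toy_connects["match_type"].eq("prediction")
].copy()

print(needs_review["record_id_ferc1"].tolist())
```

Only the `prediction` row and the row with a missing `match_type` survive the filter; rows already covered by the training data are left out.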

### 1.3 Get and output utility subsets for editing

Not sure which PUDL ID you need? Use this cell to search for them by name:

In [None]:
util_name_string = 'alabama' # edit this, must be lower case

utils = (
    pudl_out.utils_eia860()[['utility_id_pudl', 'utility_id_eia', 'utility_name_eia']]
    .drop_duplicates()
    .dropna(subset=['utility_name_eia', 'utility_id_pudl'])
    .assign(utility_name_eia=lambda x: x.utility_name_eia.str.lower())
)
utils[utils['utility_name_eia'].str.contains(f"{util_name_string}")]

In [None]:
def prep_inputs(check_connections_df, utilities='largest', amount=5, years='all'):
    """Validate the user inputs and resolve them to a list of PUDL IDs and years."""
    all_plants_ferc1 = pudl_out.all_plants_ferc1().copy()
    max_year = all_plants_ferc1.report_year.max()
    min_year = all_plants_ferc1.report_year.min()

    if years == 'all':
        years = range(min_year, max_year + 1)
    else:
        assert isinstance(years, list), "years must be reported as a list if not 'all'"
        # range() excludes its stop value, so add 1 to keep max_year valid
        assert all(year in range(min_year, max_year + 1) for year in years), \
            "years must be 'all' or a list of valid year integers within the bounds of FERC reporting years"

    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]

    if utilities == 'largest':
        logger.info(f"getting pudl ids for the top {amount} largest utilities")
        # Rank utilities by their total FERC capacity over the selected years
        utilities = (
            check_years
            .groupby(['utility_id_pudl_eia', 'utility_name_ferc1'])['capacity_mw_ferc1']
            .sum()
            .reset_index()
            .sort_values('capacity_mw_ferc1', ascending=False)
            .head(amount)
            .utility_id_pudl_eia
            .tolist()
        )
    else:
        assert isinstance(utilities, list), \
            "if not 'largest', utilities must be presented as a list of PUDL IDs"

    return utilities, years
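
To make the `'largest'` option concrete, here is a minimal sketch of the same ranking on a toy table. The helper name `largest_utilities` and every value in `toy_check` are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the check_connections table; all values are invented.
toy_check = pd.DataFrame({
    "utility_id_pudl_eia": [10, 10, 20, 30, 30],
    "utility_name_ferc1": ["Util A", "Util A", "Util B", "Util C", "Util C"],
    "capacity_mw_ferc1": [500.0, 250.0, 900.0, 100.0, 50.0],
})

def largest_utilities(df, amount):
    """Return the PUDL IDs of the `amount` utilities with the most FERC capacity."""
    return (
        df.groupby(["utility_id_pudl_eia", "utility_name_ferc1"])["capacity_mw_ferc1"]
        .sum()
        .reset_index()
        .sort_values("capacity_mw_ferc1", ascending=False)
        .head(amount)
        .utility_id_pudl_eia
        .tolist()
    )

top_two = largest_utilities(toy_check, amount=2)
print(top_two)
```

Note that `groupby` drops rows whose grouping keys are NaN by default, so unmatched records without a `utility_id_pudl_eia` would not count toward any utility's total.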

In [None]:
def get_ferc_eia_utilities_subset(check_connections_df, utilities, years):
    logger.info("retrieving the ferc-eia connection for the given utilities")
    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]
    util_output = check_years[check_years['utility_id_pudl_eia'].isin(utilities)].copy()
    return util_output
    
def get_mul_subset(mul, utilities, years):
    logger.info("retrieving the MUL for the given utilities")
    mul_years = mul[mul['report_year'].isin(years)]
    # Filter to the requested utilities as well as the requested years
    mul_util = mul_years[mul_years['utility_id_pudl'].isin(utilities)].copy()
    return mul_util
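
Both helpers above are meant to apply the same two filters: one on `report_year` and one on a utility ID column. A minimal sketch of that pattern as a single function follows; `subset_by_year_and_utility` is a hypothetical helper, and the toy table is invented.

```python
import pandas as pd

def subset_by_year_and_utility(df, utility_col, utilities, years):
    """Keep rows whose report_year and utility ID are both in the requested sets."""
    mask = df["report_year"].isin(years) & df[utility_col].isin(utilities)
    return df[mask].copy()

# Toy stand-in for the master unit list; all values are invented.
toy_mul = pd.DataFrame({
    "report_year": [2017, 2018, 2018, 2018],
    "utility_id_pudl": [1, 1, 2, 3],
    "plant_name_new": ["p1", "p1", "p2", "p3"],
})

toy_subset = subset_by_year_and_utility(
    toy_mul, utility_col="utility_id_pudl", utilities=[1, 3], years=[2018]
)
print(toy_subset["plant_name_new"].tolist())
```

Combining the two conditions into one boolean mask and returning the filtered frame directly makes it harder to drop one of the filters by accident.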

In [None]:
def output_override_tools(check_connections_df, mul, utilities='largest', amount=5, years='all'):
    
    utilities, years = prep_inputs(check_connections_df, utilities, amount, years)
    ferc_eia_util_subset = get_ferc_eia_utilities_subset(check_connections_df, utilities, years)
    mul_util_subset = get_mul_subset(mul, utilities, years)
    
    ferc_eia_path = pathlib.Path().cwd().parent / 'outputs' / 'ferc_eia_util_subset.csv'
    mul_path = pathlib.Path().cwd().parent / 'outputs' / 'mul_util_subset.csv'
    
    assert len(mul_util_subset) < 1000000, (
        "Your MUL subset is more than a million rows...this is going to break Excel. "
        "Try entering a smaller utility or year subset"
    )
    
    logger.info("outputting files to csv")
    ferc_eia_util_subset.to_csv(ferc_eia_path)
    mul_util_subset.to_csv(mul_path)
    
    #return ferc_eia_util_subset, mul_util_subset

## **Part 2:** Re-incorporating Matched Records

Now that you've marked the correctly matched records as `TRUE`, we'll want to incorporate those into the permanent override list. All you have to do is save the `ferc_eia_util_subset.csv` file and run the following cells:

### 2.1 Update training data

In [None]:
ferc_eia_path = pathlib.Path().cwd().parent / 'outputs' / 'ferc_eia_util_subset.csv'
training_path = pathlib.Path().cwd().parent / 'inputs' / 'train_ferc1_to_eia.csv'

validated_connections = (
    pd.read_csv(ferc_eia_path)
    .assign(is_correct_match=lambda x: x.is_correct_match.replace({'TRUE':True, np.nan: False}))
)

training_data = pd.read_csv(training_path)
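
Depending on how the spreadsheet was edited and saved, the marks may not all arrive as the exact string `TRUE`. That is an assumption about the editing workflow, not something this notebook guarantees. The sketch below uses a hypothetical helper, `normalize_match_flags`, and invented values to show one tolerant way to read the column while still surfacing anything unexpected:

```python
import numpy as np
import pandas as pd

def normalize_match_flags(flags):
    """Map TRUE-like marks to True and everything else (blanks, NaN) to False."""
    return flags.map(lambda v: v is True or str(v).strip().upper() == "TRUE")

# Invented examples of what an edited is_correct_match column might contain.
toy_flags = pd.Series([True, "TRUE", " true ", np.nan, "maybe"])

normalized = normalize_match_flags(toy_flags)

# Collect non-blank marks that are not TRUE-like so they can be reviewed by hand.
unexpected = [
    v for v in toy_flags.dropna().unique()
    if str(v).strip().upper() != "TRUE"
]

print(normalized.tolist())
print(unexpected)
```

Unlike a strict check, this treats unexpected marks as `False`, so the `unexpected` list should be reviewed before trusting the result.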

In [None]:
# Make sure that there are no rogue descriptions in the is_correct_match field (besides TRUE)
match_language = validated_connections.is_correct_match.unique()
assert len(outliers:=[x for x in match_language if x not in [True, False]]) == 0, \
    f"All correct matches must be marked TRUE; found {outliers}"

# Make it a boolean column
validated_connections.loc[:, "is_correct_match"] = (
    validated_connections.is_correct_match.astype('bool'))

# Get TRUE records
true_connections = validated_connections[validated_connections['is_correct_match']]

# Make sure that the eia and ferc ids haven't been tampered with
assert len(bad_eia := [x for x in true_connections.record_id_eia.unique()
                    if x not in connects_ferc1_eia.record_id_eia.unique()]) == 0, \
    f"Found record_id_eia values that aren't in the existing FERC-EIA connection: {bad_eia}"
assert len(bad_ferc := [x for x in true_connections.record_id_ferc1.unique()
                    if x not in connects_ferc1_eia.record_id_ferc1.unique()]) == 0, \
    f"Found record_id_ferc1 values that aren't in the existing FERC-EIA connection: {bad_ferc}"

# Make sure that these aren't already in the overrides (this should be impossible, but just in case)
assert len(bad_eia := [x for x in true_connections.record_id_eia.unique()
                    if x in training_data.record_id_eia.unique()]) == 0, \
    f"Found record_id_eia values that are already in the training data: {bad_eia}"
assert len(bad_ferc := [x for x in true_connections.record_id_ferc1.unique()
                    if x in training_data.record_id_ferc1.unique()]) == 0, \
    f"Found record_id_ferc1 values that are already in the training data: {bad_ferc}"
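
The ID checks above compare every edited ID against an array of known IDs. As a minimal sketch, the same question can be answered with a set difference, which avoids rescanning the known IDs for each candidate. `find_unknown_ids` and the IDs below are hypothetical.

```python
import pandas as pd

def find_unknown_ids(candidate_ids, known_ids):
    """Return the sorted candidate IDs that do not appear in known_ids."""
    return sorted(set(candidate_ids) - set(known_ids))

# Invented record IDs for illustration only.
known = pd.Series(["eia_1", "eia_2", "eia_3"])
edited = pd.Series(["eia_2", "eia_9", "eia_2"])

unknown = find_unknown_ids(edited, known)
print(unknown)
```

An empty result means every edited ID already exists in the reference table; anything else points at IDs that were likely changed by hand.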

In [None]:
# DataFrame.append was removed in pandas 2.0, so combine the tables with pd.concat
training_data_out = (
    pd.concat([training_data, true_connections[['record_id_eia', 'record_id_ferc1']]])
    .set_index(['record_id_eia', 'record_id_ferc1'])
)
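
The cell above indexes the combined table by the `record_id_eia` / `record_id_ferc1` pair. As an extra safeguard, `set_index` accepts `verify_integrity=True`, which raises a `ValueError` if any pair appears twice. The sketch below demonstrates this on invented data; it is an illustration, not part of the notebook's workflow.

```python
import pandas as pd

# Invented training rows and newly validated rows.
toy_training = pd.DataFrame({
    "record_id_eia": ["eia_1", "eia_2"],
    "record_id_ferc1": ["f1_a", "f1_b"],
})
toy_new = pd.DataFrame({
    "record_id_eia": ["eia_3", "eia_1"],
    "record_id_ferc1": ["f1_c", "f1_a"],  # second pair repeats a training row
})

combined = pd.concat([toy_training, toy_new], ignore_index=True)

# verify_integrity=True makes set_index raise if an ID pair is duplicated.
try:
    combined.set_index(["record_id_eia", "record_id_ferc1"], verify_integrity=True)
    duplicate_found = False
except ValueError:
    duplicate_found = True

# Dropping repeated pairs first lets the index be built safely.
deduplicated = combined.drop_duplicates(["record_id_eia", "record_id_ferc1"])
print(duplicate_found, len(deduplicated))
```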

### 2.2 Export updated data

Don't run this until you're ready! See the "Upload Changes" section at the top of the notebook.