# Add Overrides to Train FERC-EIA Connector

This notebook helps you add overrides to the FERC-EIA connection CSV. New connections fill in gaps and improve the program's ability to predict other matches. To check each connection adequately, we'll provide you with subsets from *three* different spreadsheets:

1) **The current FERC-EIA connection:** to look for good, bad, and empty links between FERC and EIA records.
2) **The Master Unit List:** to confirm or disprove those connections.
3) **Depreciation data** from our previous work.

Downloading all the files at once will overwhelm Excel, so we need to make edits in segments. This notebook will help you:

1) **Download useful utility-based subsets of each table for review.**
2) **Update the old training data with new verified matches.**

Once you edit the inputs below, run the entire notebook. Next, pick whether you want to [download tools to verify connection between EIA and FERC1](#verify-tools) or [upload changes to training data](#upload-overrides) and run the relevant functions.

## Edit Inputs

It's time to choose which subset of the data you'd like to wrangle first. We'll only download data for the utilities and years you specify (we highly recommend limiting both so you don't crash Excel). If you're not sure which PUDL IDs refer to which utilities, scroll down to section 1.3.

Current inputs: 
* Dominion: `[292, 349]`
* Evergy: `[159, 160, 161, 359]`
* IDACORP: `[140]`
* Duke: `[90, 91, 92, 93, 96, 97]`
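
If you switch between these utility groups often, it may help to keep them in one place. The sketch below is only an illustration: the `UTILITY_GROUPS` dict and the `pudl_ids_for` helper are hypothetical names that are not part of this notebook, and the IDs are copied from the list above.

```python
# Hypothetical lookup of the utility groups listed above (names are illustrative).
UTILITY_GROUPS = {
    "dominion": [292, 349],
    "evergy": [159, 160, 161, 359],
    "idacorp": [140],
    "duke": [90, 91, 92, 93, 96, 97],
}


def pudl_ids_for(*group_names):
    """Return a sorted, de-duplicated list of PUDL IDs for the named groups."""
    ids = set()
    for name in group_names:
        ids.update(UTILITY_GROUPS[name.lower()])
    return sorted(ids)


# Use the result as the value of specified_utilities in the input cell.
specified_utilities = pudl_ids_for("Duke")
print(specified_utilities)
```

Passing several names, for example `pudl_ids_for("dominion", "idacorp")`, would combine the groups into a single list.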

In [1]:
# This can be 'largest' or a list of pudl ids, ex: [1, 2, 3]
specified_utilities = [90, 91, 92, 93, 96, 97]

# You can change this to any integer. This represents the number of utilities you'd like
# to review (only applies when specified_utilities='largest').
specified_amount = 2 

# This can be 'all' or a list of any years within the FERC data, ex: [2006, 2007]
# These are the years you would like to consider fixing AND the years used to
# determine largest capacity (the latter only applies when specified_utilities='largest').
specified_years = [2018, 2019] 

**Now, run the entire notebook to prep the rest of the tools!**

<a id='verify-tools'></a>
## Download tools to verify connection between EIA and FERC1
When you un-comment and run the following function, you'll find a new Excel file called `fix_FERC-EIA_overrides.xlsx` in the `outputs` directory, built from the inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing/verifying the FERC-EIA connections.

**Warning:** Running this function will REPLACE any override tools you currently have saved (unless you have changed their name or location). If you are in the middle of working on one of the output files, either DO NOT run this function or move/rename the file you are working on first.

**Note:** If you choose to move the override file you're working on, make sure to *copy* and paste it in a new location. The following function depends on the existence of an Excel file called `fix_FERC-EIA_overrides.xlsx` in the `outputs` directory in order to run.
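
One way to follow the copy-don't-move advice is to make a timestamped copy of the workbook before regenerating it. This is a minimal sketch, not part of the notebook: `backup_override_file` is a hypothetical helper, and the directory you pass in is assumed to be the `outputs` directory described above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_override_file(output_dir, filename="fix_FERC-EIA_overrides.xlsx"):
    """Copy the override workbook to a timestamped sibling file, if it exists.

    Hypothetical helper: the original file is left in place because it is
    copied rather than moved.
    """
    source = Path(output_dir) / filename
    if not source.exists():
        return None
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
    shutil.copy2(source, backup)
    return backup
```

Calling `backup_override_file(pathlib.Path().cwd().parent / 'outputs')` before the function below would leave both the original workbook and a dated copy in the same directory.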

In [2]:
# %%time
# test = output_override_tools(
#     check_connections, 
#     mul, 
#     deprish_df,
#     utilities=specified_utilities,
#     amount=specified_amount,
#     years=specified_years
# )

In [186]:
check_connections[check_connections['plant_id_pudl'].isna()]

Unnamed: 0,verified,used_match_record,signature_1,signature_2,notes,record_id_eia_override_1,record_id_eia_override_2,record_id_eia_override_3,record_id_eia_override_4,record_id_eia_override_5,record_id_eia_override_6,record_id_eia_override_7,record_id_eia_override_8,record_id_eia_override_9,record_id_eia_override_10,record_id_ferc1,record_id_eia,true_gran,report_year,match_type,plant_part,ownership,utility_id_pudl,utility_name_ferc1,plant_id_pudl,unit_id_pudl,generator_id,plant_name_ferc1,plant_name_eia,fuel_type_code_pudl_ferc1,fuel_type_code_pudl_eia,fuel_type_code_pudl_diff,net_generation_mwh_ferc1,net_generation_mwh_eia,net_generation_mwh_pct_diff,capacity_mw_ferc1,capacity_mw_eia,capacity_mw_pct_diff,capacity_factor_ferc1,capacity_factor_eia,capacity_factor_pct_diff,total_fuel_cost_ferc1,total_fuel_cost_eia,total_fuel_cost_pct_diff,total_mmbtu_ferc1,total_mmbtu_eia,total_mmbtu_pct_diff,fuel_cost_per_mmbtu_ferc1,fuel_cost_per_mmbtu_eia,fuel_cost_per_mmbtu_pct_diff,installation_year_ferc1,installation_year_eia,installation_year_diff
3911,,,,,,,,,,,,,,,,f1_gnrt_plant_2018_12_148_0_1,2963_1_2018_plant_unit_total_15474,True,2018,no prediction; training,plant_unit,total,,,,1.0,2.0,,Northeastern 1,,gas,,,72324.21,,,473.0,,,0.017455,,,15663260.0,,,5140535.0,,,3.047011,,,1970,
3960,,,,,,,,,,,,,,,,f1_gnrt_plant_2018_12_294_0_1,55856_2018_plant_total_40211,True,2018,no prediction; training,plant,total,,,,,,,Prairie State Generatng Station,,coal,,,11532420.0,,,1766.0,,,0.745462,,,,,,108427100.0,,,,,,2012,
3961,,,,,,,,,,,,,,,,f1_gnrt_plant_2018_12_294_0_5,55856_2018_plant_total_40211,True,2018,no prediction; training,plant,total,,,,,,,Prairie State Generatng Station,,coal,,,11532420.0,,,1766.0,,,0.745462,,,,,,108427100.0,,,,,,2012,


In [3]:
#test

<a id='upload-overrides'></a>
## Upload changes to training data
When you've finished editing the `fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override csv, move your file to the directory called `overrides` and then uncomment and run the following functions. 

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, you will want to set `expect_override_overrides = True`. Otherwise, the function will check to see if you have accidentally tampered with values that have already been matched.

In [412]:
expect_override_overrides = True

In [413]:
# training_data_out = (
#     combine_new_overrides(expect_override_overrides)
#     .pipe(combine_all_overrides, training_data)
# )

#training_data_out.to_csv(training_path)

----------

## Notebook Setup

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import pandas as pd
import numpy as np
import pudl
import pudl.constants as pc
import pudl.extract.ferc1
import sqlalchemy as sa
import logging
import sys
import copy
from copy import deepcopy
import scipy
import statistics
import yaml
import os
import pathlib

import recordlinkage as rl
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [6]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [7]:
sys.path.append("../")
from pudl.output.ferc1 import *
from pudl_rmi.connect_ferc1_to_eia import *
from pudl_rmi.make_plant_parts_eia import *
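import pudl_rmi.connect_ferc1_to_eia
import pudl_rmi.connect_deprish_to_eia
import pudl_rmi.deprish
from pudl_rmi import make_plant_parts_eia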
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc_engine = sa.create_engine(pudl_settings['ferc1_db'])
pd.options.display.max_columns = None

In [8]:
relevant_cols_ferc_eia = [
    'record_id_ferc1',
    'record_id_eia',
    'true_gran',
    'report_year',
    'match_type',
    'plant_part',
    'ownership',
    'utility_id_pudl_ferc1',
    'utility_name_ferc1',
    'plant_id_pudl_ferc1',
    'unit_id_pudl',
    'generator_id',
    'plant_name_ferc1',
    'plant_name_new',
    'fuel_type_code_pudl_ferc1',
    'fuel_type_code_pudl_eia',
    'net_generation_mwh_ferc1',
    'net_generation_mwh_eia',
    'capacity_mw_ferc1',
    'capacity_mw_eia',
    'capacity_factor_ferc1',
    'capacity_factor_eia',
    'total_fuel_cost_ferc1',
    'total_fuel_cost_eia',
    'total_mmbtu_ferc1',
    'total_mmbtu_eia',
    'fuel_cost_per_mmbtu_ferc1',
    'fuel_cost_per_mmbtu_eia',
    'installation_year_ferc1',
    'installation_year_eia',
]

relevant_cols_mul = [
    'record_id_eia',
    'report_year',
    'utility_id_pudl',
    #'utility_name_eia',
    'true_gran',
    'plant_part',
    'ownership_dupe',
    'fraction_owned',
    'plant_id_eia',
    'plant_id_pudl',
    'plant_name_new',
    'generator_id',
    'capacity_mw',
    'capacity_factor',
    'net_generation_mwh',
    'installation_year',
    'fuel_type_code_pudl',
    'total_fuel_cost',
    'total_mmbtu',
    'fuel_cost_per_mmbtu',
    'heat_rate_mmbtu_mwh',
]

## **Part 1:** Generate Override Tools

### 1.1 Get current FERC-EIA & MUL tables
This is going to look a lot like the `connect_ferc1_to_eia.ipynb` notebook.

In [11]:
file_path_training = pathlib.Path().cwd().parent /'inputs'/'train_ferc1_to_eia.csv'
file_path_mul = pathlib.Path().cwd().parent /'outputs' / 'master_unit_list.pkl.gz'
# pudl output object for ferc data
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS',fill_fuel_cost=True,roll_fuel_cost=True,fill_net_gen=True)

In [12]:
inputs = InputManager(file_path_training, file_path_mul, pudl_out)
features_all = (Features(feature_type='all', inputs=inputs)
                .get_features(clobber=False))
features_train = (Features(feature_type='training', inputs=inputs)
                  .get_features(clobber=False))
tuner = ModelTuner(features_train, inputs.get_train_index(), n_splits=10)

matcher = MatchManager(best=tuner.get_best_fit_model(), inputs=inputs)
matches_best = matcher.get_best_matches(features_train, features_all)

Preparing the FERC1 tables.
loading steam table
loading small gens table
loading hydro table
loading pumped storage table
loading fbp table
prepping steam table
prepping hydro tables
combining all tables
Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz
Generated 46261 all candidate features.
Generated 118 training candidate features.
We are about to test hyper parameters of the model while doing k-fold cross validation. This takes a few minutes....
Scores from the best model hyperparameters:
  F-Score:   0.82
  Precision: 0.88
  Accuracy:  0.62
Fit and predict a model w/ the highest scoring hyperparameters.
Get the top scoring match for each FERC1 steam record.
Winning match stats:
        matches vs ferc:      15.40%
        best match v ferc:    14.12%
        best match vs matches:91.72%
        murk vs matches:      3.39%
        ties vs matches:      3.29%
Overridden records:       100.0%
New best match v fe

In [16]:
file_path_deprish = pathlib.Path().cwd().parent/'inputs'/'depreciation_rmi.xlsx'
sheet_name_deprish='Depreciation Studies Raw'
transformer = pudl_rmi.deprish.Transformer(
    pudl_rmi.deprish.Extractor(
        file_path=file_path_deprish,
        sheet_name=sheet_name_deprish
    ).execute())

Reading the depreciation data from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/inputs/depreciation_rmi.xlsx


In [13]:
connects_ferc1_eia = (
    prettyify_best_matches(
        matches_best, 
        plant_parts_true_df=inputs.plant_parts_true_df,
        steam_df=inputs.all_plants_ferc1_df)
    .copy()
)

Coverage for matches during EIA working years:
    Fuel type: 93.6%
    Tech type: 37.4%

Coverage for all steam table records during EIA working years:
    EIA matches: 23.3

Coverage for all small gen table records during EIA working years:
    EIA matches: 3.3

Coverage for all hydro table records during EIA working years:
    EIA matches: 3.4

Coverage for all pumped storage table records during EIA working years:
    EIA matches: 0.0


In [14]:
mul = (
    make_plant_parts_eia.get_master_unit_list_eia(file_path_mul, pudl_out).reset_index()
    .copy()
)

Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz


In [17]:
deprish_df = transformer.execute()

# of reserve_rate over 1 (100%): 1. Higher #s here may indicate an issue with the original data or the fill_in method
Added 16995 ferc_acct_name's out of 17210 options
aggregating to: ['report_date', 'plant_id_pudl', 'plant_part_name', 'ferc_acct', 'utility_id_pudl', 'data_source', 'line_id', 'utility_name_ferc1']
overriding auto-generated common associations with 472 mannual associations
grabbed 1144 common records
Allocating common records for: plant_balance
   grabbed 1928 common reocrds and 14326 atomic records. of total 14326
The resulting plant_balance allocated is 99.48% of the original
Allocating common records for: book_reserve
   grabbed 1928 common reocrds and 14326 atomic records. of total 14326
The resulting book_reserve allocated is 99.87% of the original
Allocating common records for: unaccrued_balance
   grabbed 1928 common reocrds and 14326 atomic records. of total 14326
The resulting unaccrued_balance allocated is 99.20% of the original
Allocating common records for: 

In [27]:
#mul[mul['record_id_eia']=='6068_2019_plant_owned_56211'].T
test = mul[mul['plant_name_new'].str.contains('olf')]
aa = test[test['report_date'].dt.year==2018]
aa[aa['fuel_type_code_pudl']=='nuclear']

Unnamed: 0,record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_factor,capacity_mw,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,record_count,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year
1702648,210_2018_plant_owned_9961,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,9961,True,plant,210_2018_plant_owned_9961,0.825596,76.062,0.06,,,nuclear,,1985,550097.34,existing,owned,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,,2018,<NA>_2018
1702649,210_2018_plant_owned_10000,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,10000,True,plant,210_2018_plant_owned_10000,0.825596,595.819,0.47,,,nuclear,,1985,4309095.83,existing,owned,False,NaT,653.0,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,159.0,2018,653_2018
1702650,210_2018_plant_owned_10005,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,10005,True,plant,210_2018_plant_owned_10005,0.825596,595.819,0.47,,,nuclear,,1985,4309095.83,existing,owned,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,,2018,<NA>_2018
1712319,210_2018_plant_total_9961,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,9961,True,plant,210_2018_plant_total_9961,0.825596,1267.7,1.0,,,nuclear,,1985,9168289.0,existing,total,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,,2018,<NA>_2018
1712320,210_2018_plant_total_10000,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,10000,True,plant,210_2018_plant_total_10000,0.825596,1267.7,1.0,,,nuclear,,1985,9168289.0,existing,total,False,NaT,653.0,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,159.0,2018,653_2018
1712321,210_2018_plant_total_10005,210,2018-01-01,plant,1,,ST,NUC,Nuclear,Nuclear,10005,True,plant,210_2018_plant_total_10005,0.825596,1267.7,1.0,,,nuclear,,1985,9168289.0,existing,total,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station,1,,,,2018,<NA>_2018
1740756,210_st_2018_plant_prime_mover_owned_9961,210,2018-01-01,plant_prime_mover,1,,ST,NUC,Nuclear,Nuclear,9961,False,plant,210_2018_plant_owned_9961,0.825596,76.062,0.06,,,nuclear,,1985,550097.34,existing,owned,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station ST,1,,,,2018,<NA>_2018
1740757,210_st_2018_plant_prime_mover_owned_10000,210,2018-01-01,plant_prime_mover,1,,ST,NUC,Nuclear,Nuclear,10000,False,plant,210_2018_plant_owned_10000,0.825596,595.819,0.47,,,nuclear,,1985,4309095.83,existing,owned,False,NaT,653.0,Wolf Creek Generating Station,Wolf Creek Generating Station ST,1,,,159.0,2018,653_2018
1740758,210_st_2018_plant_prime_mover_owned_10005,210,2018-01-01,plant_prime_mover,1,,ST,NUC,Nuclear,Nuclear,10005,False,plant,210_2018_plant_owned_10005,0.825596,595.819,0.47,,,nuclear,,1985,4309095.83,existing,owned,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station ST,1,,,,2018,<NA>_2018
1753926,210_st_2018_plant_prime_mover_total_9961,210,2018-01-01,plant_prime_mover,1,,ST,NUC,Nuclear,Nuclear,9961,False,plant,210_2018_plant_total_9961,0.825596,1267.7,1.0,,,nuclear,,1985,9168289.0,existing,total,False,NaT,,Wolf Creek Generating Station,Wolf Creek Generating Station ST,1,,,,2018,<NA>_2018


### 1.2 Curate columns

Here we'll limit the columns in the output file to those that will be useful for analysing match correctness. We'll also add some columns for you to use during the match verification process. All match types are included in the outputs (even those that have been correctly mapped according to the current overrides) just in case there is a discrepancy or error that we want to fix.

**Match Types:**

* `prediction`: prediction based on the training data.
* `correct_prediction`: prediction based on training data that matches the record in the training data.
* `no prediction; training`: not filled in by the prediction algorithm but filled in by the training data.
* `overridden`: incorrectly filled in by the prediction algorithm and corrected by training data.
* `no_match`: a reviewer has found there to be no verified EIA match for the given FERC record.
* `NaN`: not filled in by the training data or the prediction algorithm. 
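
To see how a subset breaks down across these match types, a count that keeps the `NaN` category can be useful. The sketch below uses a made-up toy frame (the record IDs are placeholders, not real data); on the real table you would apply the same `value_counts` call to the `match_type` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FERC-EIA connection table; values are illustrative only.
toy_connections = pd.DataFrame({
    "record_id_ferc1": ["f1", "f2", "f3", "f4", "f5"],
    "match_type": ["prediction", "overridden", "prediction", np.nan, "no_match"],
})

# dropna=False keeps the unmatched (NaN) records in the tally.
match_type_counts = toy_connections["match_type"].value_counts(dropna=False)
print(match_type_counts)
```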

In [164]:
#connects_ferc1_eia[connects_ferc1_eia['plant_name_ferc1']=='jeffrey ener ctr 8%']

In [165]:
# Grab the MUL and FERC-EIA connections that show the comparison between FERC and EIA values
check_connections = connects_ferc1_eia[relevant_cols_ferc_eia].copy()
mul = mul[relevant_cols_mul].copy()

# Add a column to tell whether it's a good match, who verified / made the match,
# and any notes about weirdness.
check_connections.insert(0, "verified", np.nan)
check_connections.insert(1, "used_match_record", np.nan)
check_connections.insert(2, "signature_1", np.nan)
check_connections.insert(3, "signature_2", np.nan)
check_connections.insert(4, "notes", np.nan)
check_connections.insert(6, "record_id_eia_override_1", np.nan)
check_connections.insert(7, "record_id_eia_override_2", np.nan)
check_connections.insert(8, "record_id_eia_override_3", np.nan)
check_connections.insert(9, "record_id_eia_override_4", np.nan)
check_connections.insert(10, "record_id_eia_override_5", np.nan)
check_connections.insert(11, "record_id_eia_override_6", np.nan)
check_connections.insert(12, "record_id_eia_override_7", np.nan)
check_connections.insert(13, "record_id_eia_override_8", np.nan)
check_connections.insert(14, "record_id_eia_override_9", np.nan)
check_connections.insert(15, "record_id_eia_override_10", np.nan)

# put these in the right order to be filled in by pct_diff
check_connections.insert(31, "fuel_type_code_pudl_diff", np.nan)
check_connections.insert(34, "net_generation_mwh_pct_diff", np.nan)
check_connections.insert(37, "capacity_mw_pct_diff", np.nan)
check_connections.insert(40, "capacity_factor_pct_diff", np.nan)
check_connections.insert(43, "total_fuel_cost_pct_diff", np.nan)
check_connections.insert(46, "total_mmbtu_pct_diff", np.nan)
check_connections.insert(49, "fuel_cost_per_mmbtu_pct_diff", np.nan)
check_connections.insert(52, "installation_year_diff", np.nan)

# Fix some column names
check_connections.rename(
    columns={'utility_id_pudl_ferc1': 'utility_id_pudl', 
             'plant_id_pudl_ferc1': 'plant_id_pudl',
             'plant_name_new': 'plant_name_eia'}, inplace=True)

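def pct_diff(df, col):
    """Fill {col}_pct_diff in place with the FERC1-vs-EIA difference as a percent of FERC1.

    Only rows where both the FERC1 and EIA values are positive are filled.
    """
    df.loc[(df[f"{col}_eia"] > 0) & (df[f"{col}_ferc1"] > 0), f"{col}_pct_diff"] = (
        round(((df[f"{col}_ferc1"] - df[f"{col}_eia"]) / df[f"{col}_ferc1"] * 100), 2)
    )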

# Add pct diff columns
for col in ['net_generation_mwh', 'capacity_mw', 'capacity_factor', 
            'total_fuel_cost', 'total_mmbtu', 'fuel_cost_per_mmbtu']:
    pct_diff(check_connections, col)
    
# Add qualitative similarity columns (fuel_type_code_pudl)
check_connections.loc[
    (check_connections.fuel_type_code_pudl_eia.notna())
    & (check_connections.fuel_type_code_pudl_ferc1.notna()),
    "fuel_type_code_pudl_diff"
] = check_connections.fuel_type_code_pudl_eia == check_connections.fuel_type_code_pudl_ferc1

# Add quantitative similarity columns (installation year)
check_connections.loc[:, "installation_year_ferc1"] = check_connections.installation_year_ferc1.astype("Int64")
check_connections.loc[
    (check_connections.installation_year_eia.notna())
    & (check_connections.installation_year_ferc1.notna()),
    "installation_year_diff"
] = check_connections.installation_year_eia - check_connections.installation_year_ferc1

# Move record_id_ferc1
record_id_ferc1 = check_connections.pop('record_id_ferc1')
check_connections.insert(15, "record_id_ferc1", record_id_ferc1) 
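
The percent-difference columns computed above follow `(ferc1 - eia) / ferc1 * 100`, and are only filled when both values are positive. The sketch below reproduces that arithmetic on a toy frame so the sign convention is easy to check; `pct_diff_demo` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd


def pct_diff_demo(df, col):
    """Return a copy of df with a {col}_pct_diff column, mirroring pct_diff."""
    out = df.copy()
    out[f"{col}_pct_diff"] = np.nan
    both_positive = (out[f"{col}_eia"] > 0) & (out[f"{col}_ferc1"] > 0)
    out.loc[both_positive, f"{col}_pct_diff"] = (
        (out[f"{col}_ferc1"] - out[f"{col}_eia"]) / out[f"{col}_ferc1"] * 100
    ).round(2)
    return out


toy = pd.DataFrame({
    "capacity_mw_ferc1": [100.0, 473.0, 0.0],
    "capacity_mw_eia": [90.0, 473.0, 50.0],
})
result = pct_diff_demo(toy, "capacity_mw")
print(result)
```

A positive value means the FERC1 figure is larger than the EIA figure; the third row stays empty because its FERC1 capacity is not positive.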

### 1.3 Get utility and year subsets for editing

Not sure which PUDL ID you need? Use this cell to search for them by name:

In [None]:
util_name_string = 'kansas' # edit this, must be lower case

utils = (
    pudl_out.utils_eia860()[['utility_id_pudl', 'utility_id_eia', 'utility_name_eia', 'state']]
    .drop_duplicates()
    .dropna(subset=['utility_name_eia', 'utility_id_pudl'])
    .assign(utility_name_eia=lambda x: x.utility_name_eia.str.lower())
)
utils[utils['utility_name_eia'].str.contains(f"{util_name_string}")]
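
The search above lower-cases the names first and then matches a lower-case string. An alternative that avoids the lower-case requirement is pandas' `case=False` option, sketched here on a toy frame; the utility names and IDs below are made up for illustration.

```python
import pandas as pd

# Toy utilities table; names and IDs are placeholders, not real PUDL records.
toy_utils = pd.DataFrame({
    "utility_id_pudl": [1, 2, 3],
    "utility_name_eia": ["Kansas Example Power Co", "Example Energy of Ohio", None],
})

# case=False ignores capitalization; na=False treats missing names as non-matches.
hits = toy_utils[
    toy_utils["utility_name_eia"].str.contains("kansas", case=False, na=False)
]
print(hits)
```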

In [None]:
test = pudl_out.plants_steam_ferc1().copy()
test[test.utility_name_ferc1.str.contains('Duke')].drop_duplicates(subset=['utility_name_ferc1'])#.utility_name_ferc1.unique()

#check_connections[check_connections['utility_id_pudl_eia']==159]
#97, 

In [166]:
def prep_inputs(check_connections_df, utilities='largest', amount=5, years='all'):
    """Validate the utility and year inputs and resolve 'largest' and 'all'."""
    all_plants_ferc1 = pudl_out.all_plants_ferc1().copy()
    max_year = all_plants_ferc1.report_year.max()
    min_year = all_plants_ferc1.report_year.min()

    if years == 'all':
        years = range(min_year, max_year + 1)
    else:
        assert isinstance(years, list), "years must be reported as a list if not 'all'"
        assert all(year in range(min_year, max_year + 1) for year in years), \
            "years must be 'all' or a valid year integer within the bounds of FERC reporting years"

    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]

    if utilities == 'largest':
        logger.info(f"getting pudl ids for the top {amount} largest utilities")
        # utility_id_pudl_ferc1 was renamed to utility_id_pudl in section 1.2
        utilities = (
            check_years
            .groupby(['utility_id_pudl', 'utility_name_ferc1'])['capacity_mw_ferc1']
            .sum()
            .reset_index()
            .sort_values('capacity_mw_ferc1', ascending=False)
            .head(amount)
            .utility_id_pudl
            .tolist()
        )
    else:
        assert isinstance(utilities, list), "if not 'largest', utilities must be presented as a list of PUDL IDs"

    return utilities, years
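
The `'largest'` option ranks utilities by their summed FERC1 capacity over the selected years. The sketch below walks through that ranking on a toy frame; the column names follow the ones used in this notebook, but the IDs, names, and capacities are made up.

```python
import pandas as pd

# Toy connection records; all values are illustrative.
toy = pd.DataFrame({
    "utility_id_pudl": [1, 1, 2, 3, 3],
    "utility_name_ferc1": ["A", "A", "B", "C", "C"],
    "capacity_mw_ferc1": [100.0, 50.0, 500.0, 200.0, 25.0],
})

# Sum capacity per utility, sort descending, and keep the top two IDs.
largest = (
    toy.groupby(["utility_id_pudl", "utility_name_ferc1"])["capacity_mw_ferc1"]
    .sum()
    .reset_index()
    .sort_values("capacity_mw_ferc1", ascending=False)
    .head(2)
    .utility_id_pudl
    .tolist()
)
print(largest)
```

Here utility 2 (500 MW) and utility 3 (225 MW) outrank utility 1 (150 MW).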

In [167]:
def get_ferc_eia_utilities_subset(check_connections_df, utilities, years):
    logger.info("retrieving the ferc-eia connection for the given utilities")
    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]
    util_output = check_years[check_years['utility_id_pudl'].isin(utilities)].copy()
    return util_output
    
def get_mul_subset(mul, utilities, years):
    logger.info("retrieving the MUL for the given utilities")
    mul_years = mul[mul['report_year'].isin(years)]
    mul_output = mul_years[mul_years['utility_id_pudl'].isin(utilities)]
    return mul_output

def get_deprish_subset(deprish_df, utilities):
    logger.info("retrieving depreciation data for the given utilities")
    deprish_output = deprish_df[deprish_df['utility_id_pudl'].isin(utilities)]
    return deprish_output

In [168]:
def output_override_tools(check_connections_df, mul, deprish_df, utilities='largest', amount=5, years='all'):
    
    utilities, years = prep_inputs(check_connections_df, utilities, amount, years)
    
    ferc_eia_util_subset = get_ferc_eia_utilities_subset(check_connections_df, utilities, years)
    # Add an Excel formula to used_match_record that compares columns F and Q in each row
    ferc_eia_util_subset = (
        ferc_eia_util_subset.reset_index(drop=True)
        .assign(used_match_record=lambda x: "=(F" + (x.index+2).astype('str') + "=Q" + (x.index+2).astype('str') + ")" )
    )
    
    mul_util_subset = get_mul_subset(mul, utilities, years)
    deprish_util_subset = get_deprish_subset(deprish_df, utilities)
    
    # Create a dict of each df and the tab name you want to give it in the output
    tool_dict = {
        'ferc_eia_util_subset': ferc_eia_util_subset,
        'mul_util_subset': mul_util_subset,
        'deprish_util_subset': deprish_util_subset
    }
    
    output_path = pathlib.Path().cwd().parent / 'outputs' / 'fix_FERC-EIA_overrides.xlsx'
    
    assert len(mul_util_subset) < 500000, (
        "Your MUL subset is more than 500,000 rows...this is going to make Excel "
        "reaaalllllyyy slow. Try entering a smaller utility or year subset"
    )
    
    logger.info("outputting override tools to tabs in fix_FERC-EIA_overrides.xlsx")
    pudl_rmi.connect_deprish_to_eia.save_to_workbook(output_path, tool_dict)
    
    return ferc_eia_util_subset
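
The `used_match_record` column is filled with an Excel formula string for each row, offset by 2 because the DataFrame index starts at 0 and, assuming the sheet is written with a header in row 1, the first data row is Excel row 2. The sketch below shows the strings this produces for a three-row toy frame; it builds them with a list comprehension rather than the string concatenation used above, purely for readability.

```python
import pandas as pd

# Toy subset with a fresh 0-based index, as after reset_index(drop=True).
toy = pd.DataFrame({"record_id_ferc1": ["f1", "f2", "f3"]}).reset_index(drop=True)

# One formula per row: index 0 maps to Excel row 2, index 1 to row 3, and so on.
toy["used_match_record"] = [f"=(F{row}=Q{row})" for row in toy.index + 2]
print(toy["used_match_record"].tolist())
```

When the workbook is opened, each formula should evaluate to `TRUE` or `FALSE` depending on whether the cells in columns F and Q of that row are equal.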

## **Part 2:** Re-incorporating Matched Records

Now that you've marked the correctly matched records as `TRUE`, we'll want to incorporate those into the permanent override list. All you have to do is move the `fix_FERC-EIA_overrides.xlsx` file to the `overrides` directory, run the following cells, and then run...

### 2.1 Update training data

In [None]:
fixed_overrides_path = pathlib.Path().cwd().parent / 'overrides' #/ #'fix_FERC-EIA_overrides.xlsx'
training_path = pathlib.Path().cwd().parent / 'inputs' / 'train_ferc1_to_eia.csv'
training_data = pd.read_csv(training_path)

In [None]:
def validate_override_fixes(validated_connections, expect_override_overrides=False):
    """Process the verified / fixed matches."""
    logger.info("validating override fixes")
    
    # Make sure that there are no rogue descriptions in the is_correct_match field (besides TRUE)
    match_language = validated_connections.is_correct_match.unique()
    assert len(outliers:=[x for x in match_language if x not in [True, False]]) == 0, \
        f"All correct matches must be marked TRUE; found {outliers}"

    # Make it a boolean column
    validated_connections.loc[:, "is_correct_match"] = (
        validated_connections.is_correct_match.astype('bool'))

    # Get TRUE records
    true_connections = validated_connections[validated_connections['is_correct_match']].copy()

    # Make sure that the eia and ferc ids haven't been tampered with
    assert len(bad_eia := [x for x in true_connections.dropna().record_id_eia.unique()
                        if x not in connects_ferc1_eia.record_id_eia.unique()]) == 0, \
        f"Found record_id_eia values that aren't in the existing FERC-EIA connection: {bad_eia}"
    assert len(bad_ferc := [x for x in true_connections.dropna().record_id_ferc1.unique()
                        if x not in connects_ferc1_eia.record_id_ferc1.unique()]) == 0, \
        f"Found record_id_ferc1 values that aren't in the existing FERC-EIA connection: {bad_ferc}"

    if not expect_override_overrides:
        # Make sure that these aren't already in the overrides (this should be impossible, but just in case)
        assert len(bad_eia := [x for x in true_connections.record_id_eia.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_eia.unique()]) == 0,  \
            f"Found record_id_eia values that are already in the existing FERC-EIA training data: {bad_eia}"
        assert len(bad_ferc := [x for x in true_connections.record_id_ferc1.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_ferc1.unique()]) == 0, \
            f"Found record_id_ferc1 values that are already in the existing FERC-EIA training data: {bad_ferc}"
    
    return true_connections
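
The first check above rejects any value in the match column other than true or false. The sketch below illustrates that idea on a toy column, assuming the spreadsheet values arrive as the string `'TRUE'`, blanks, or stray text; depending on how Excel stores the cells they may already be booleans when read.

```python
import numpy as np
import pandas as pd

# Toy reviewer column with one stray entry; values are illustrative.
raw = pd.Series(["TRUE", np.nan, "TRUE", "maybe"])

# Map 'TRUE' to True and blanks to False; leave anything else untouched.
cleaned = raw.map(
    lambda v: True if v == "TRUE" else (False if pd.isna(v) else v)
)

# Anything that is neither True nor False is flagged for the reviewer.
outliers = [v for v in cleaned.unique() if v not in [True, False]]
print(outliers)
```

An empty `outliers` list means every reviewed row is cleanly marked; here the stray `'maybe'` is reported.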

In [None]:
def combine_new_overrides(expect_override_overrides):
    logger.info("combining all new override files")
    all_fixes = pd.DataFrame(columns=['record_id_eia', 'record_id_ferc1', 'signature', 'notes'])
    all_files = os.listdir(fixed_overrides_path)
    files = [file for file in all_files if not file.startswith('.')]
    # Every visible file in the overrides directory must be an Excel workbook
    for file in files:
        assert file.endswith('.xlsx'), (
            "fixing the overrides can only read .xlsx files; "
            "found other file types in the overrides directory")
    for file in files:
        logger.info(f"Processing fixes in {file}")
        file_df = (
            pd.read_excel(
                (fixed_overrides_path / file), 
                sheet_name='ferc_eia_util_subset')
            .rename(columns={'record_id_eia_override': 'record_id_eia', 'record_id_eia': 'record_id_eia_old'})
            .assign(
                is_correct_match=lambda x: x.is_correct_match.replace({'TRUE':True, np.nan: False}))
            .pipe(validate_override_fixes, expect_override_overrides=expect_override_overrides))
        all_fixes = pd.concat([all_fixes, file_df[['record_id_eia', 'record_id_ferc1',
                                                   'signature', 'notes']]])
    return all_fixes

In [None]:
def combine_all_overrides(new_overrides, training_df):
    logger.info("combining all overrides")
    training_data_out = (
        pd.concat(
            [training_df, new_overrides[['record_id_eia', 'record_id_ferc1', 'signature', 'notes']]])
        .set_index(['record_id_eia', 'record_id_ferc1'])
    )
    return training_data_out
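
Before writing the combined training data back to disk, it can be worth checking whether any EIA/FERC ID pair now appears more than once. The sketch below shows one way to do that with `pd.concat` and a duplicated-index check; the toy IDs, signatures, and notes are placeholders, and the check itself is a suggestion rather than something the notebook performs.

```python
import pandas as pd

cols = ["record_id_eia", "record_id_ferc1", "signature", "notes"]

# Toy existing training data and new overrides; all values are illustrative.
toy_training = pd.DataFrame([["eia_a", "ferc_1", "abc", "existing"]], columns=cols)
toy_new = pd.DataFrame(
    [["eia_b", "ferc_2", "xyz", "new"], ["eia_a", "ferc_1", "xyz", "repeat"]],
    columns=cols,
)

combined = (
    pd.concat([toy_training, toy_new])
    .set_index(["record_id_eia", "record_id_ferc1"])
)

# List each ID pair that occurs more than once in the combined index.
repeated = combined.index[combined.index.duplicated(keep=False)].unique().tolist()
print(repeated)
```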

### 2.2 Export updated data

Move your updated version of the `fix_FERC-EIA_overrides.xlsx` file into the directory called `overrides`. This notebook will only process files to supplement the existing training data that are located in that folder.

## Explore

See how well the `connects_ferc1_eia` table predicts back-fillable values.

In [None]:
# # For the override plants, how often are the other records of the same ferc id matched?

# override_ferc_ids = (
#     connects_ferc1_eia[connects_ferc1_eia['match_type'].isin(['overridden', 'correct_prediction'])]
#     .dropna(subset=['plant_id_ferc1'])
# )

# override_ferc_ids.assign(test=lambda x: tt(x.record_id_eia))



# # this only works because there are no duplicate plant_id_ferc1 values that are overrides
# override_dict = dict(zip(override_ferc_ids['plant_id_ferc1'], override_ferc_ids['record_id_eia']))


In [None]:
# #def check(df, ferc_id):
# df = connects_ferc1_eia

# for ferc_id, eia_id in override_dict.items():
#     ferc_id_view = df[
#         (df['plant_id_ferc1']==ferc_id)
#         & (~df['match_type'].isin(['overridden', 'correct_prediction']))]

#     total_records = len(ferc_id_view)
#     total_records_notna = len(ferc_id_view[ferc_id_view['record_id_eia'].notna()])
#     pct_matched = total_records_notna / total_records * 100

#     total_correct_matches = len(ferc_id_view[ferc_id_view['record_id_eia']==eia_id])
#     pct_correct_match = total_correct_matches / total_records * 100
#     pct_correct_of_matches = total_correct_matches / total_records_notna * 100
    
#     print(ferc_id)
#     print(f"total records: {total_records}")
#     print(f"PCT MATCHED: {pct_matched}")
#     print(f"PCT CORRECT: {pct_correct_match}")
#     print(f"PCT OF MATCHES CORRECT: {pct_correct_of_matches}")
#     print("\n")

In [435]:
#use appro_record_id_eia field!