# Add Overrides to Train FERC-EIA Connector

This notebook helps you add overrides to the FERC-EIA connection CSV. New connections fill in gaps and improve the program's ability to predict other matches. To check each connection adequately, we'll provide you with subsets from *three* different spreadsheets:

1) **The current FERC-EIA connection:** to look for good, bad, and empty links between FERC and EIA records.
2) **The Master Unit List:** to confirm or disprove those connections.
3) **Depreciation data** from our previous work.

Downloading all the files at once will overwhelm Excel, so we need to make edits in segments. This notebook will help you:

1) **Download useful utility-based subsets of each table for review.**
2) **Update the old training data with new verified matches.**

Once you edit the inputs below, run the entire notebook. Next, pick whether you want to [download tools to verify connection between EIA and FERC1](#verify-tools) or [upload changes to training data](#upload-overrides) and run the relevant functions.

## Edit Inputs

It's time to choose which subset of the data you'd like to wrangle first. The notebook only limits the download to specific utilities and years if you tell it to (we highly recommend doing so, so you don't crash Excel). If you're not sure which PUDL IDs refer to which utilities, scroll down to section 1.3.

In [2]:
# The utilities you'd like to review as a dictionary where the format is: 
# {<UTILITY>: {'utility_id_pudl': [1, 2, 3], 'utility_id_eia': [44, 55, 66]}}
# Each key will be output into a different excel file.

specified_utilities = {
    'Dominion': {'utility_id_pudl': [292, 293, 349],
                 'utility_id_eia': [17539, 17554, 19876]},
    'Evergy': {'utility_id_pudl': [159, 160, 161, 359],
               'utility_id_eia': [10000, 10005, 56211, 3702, 22500]},
    'IDACORP': {'utility_id_pudl': [140],
                'utility_id_eia': [9191]},
    'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
             'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    'NextEra': {'utility_id_pudl': [121, 130],
                'utility_id_eia': [6452, 7801]},
    'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 7],
            'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
                'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    'Xcel': {'utility_id_pudl': [224, 302, 272],
             'utility_id_eia': [13781, 13780, 17718, 15466]}
}

# This can be 'all' or a list of any years within the FERC data, ex: [2006, 2007]
# These are the years you would like to consider fixing AND the years you would like to 
# consider for determining largest capacity (the latter is only used when `utilities='largest'`).
specified_years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 
                   2013, 2014, 2015, 2016, 2017, 2018, 2019] 
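
Before running the rest of the notebook, it can help to sanity-check the shape of `specified_utilities`. The helper below is only a sketch: `check_utilities_dict` is a hypothetical name, not part of `pudl_rmi`, and it merely confirms that each entry carries both ID lists and that those lists hold integers.

```python
def check_utilities_dict(utilities):
    """Return a list of problems found in a specified_utilities-style dict.

    Hypothetical helper for illustration; not part of pudl_rmi.
    """
    required = ("utility_id_pudl", "utility_id_eia")
    problems = []
    for name, ids in utilities.items():
        for key in required:
            values = ids.get(key)
            if not isinstance(values, list) or not values:
                problems.append(f"{name}: '{key}' must be a non-empty list")
            elif not all(isinstance(v, int) for v in values):
                problems.append(f"{name}: '{key}' must contain only integers")
    return problems


example = {
    "IDACORP": {"utility_id_pudl": [140], "utility_id_eia": [9191]},
    "Broken": {"utility_id_pudl": [1, 2]},  # missing 'utility_id_eia'
}
problems = check_utilities_dict(example)
print(problems)
```

An empty list means the dictionary has the expected structure; it says nothing about whether the IDs themselves are correct.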

<a id='verify-tools'></a>
## 1) Output Override Tools
When you uncomment and run the following function, you'll find Excel files called `<UTILITY>_fix_FERC-EIA_overrides.xlsx` in the `outputs/overrides` directory, created based on the utility and year inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing/verifying the FERC-EIA connections.

In [19]:
# %%time
# for util_name, id_dict in specified_utilities.items():
#     output_override_tools(
#         check_connections, 
#         mul, 
#         deprish_df,
#         util_name=util_name,
#         utilities=id_dict,
#         #amount=specified_amount,
#         years=specified_years,
#     )

<a id='upload-overrides'></a>
## 2) Upload changes to training data
When you've finished editing the `<UTILITY>_fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override csv, move your file to the directory called `add_to_training` and then uncomment and run the following functions. 

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, you will want to set `expect_override_overrides = True`. Otherwise, the function will check to see if you have accidentally tampered with values that have already been matched.
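
To preview whether your file touches records that are already in the training data, a check along the following lines may help. It is a simplified sketch on toy data, not the notebook's own validation: `find_already_trained` is a hypothetical helper and the record IDs are made up.

```python
import pandas as pd


def find_already_trained(new_overrides, training_data):
    """Return FERC record IDs in new_overrides that already have an EIA match in training_data."""
    trained = training_data.dropna(subset=["record_id_eia"])["record_id_ferc1"]
    mask = new_overrides["record_id_ferc1"].isin(trained)
    return sorted(new_overrides.loc[mask, "record_id_ferc1"].unique())


# Toy stand-ins for the training CSV and a reviewed override sheet
training = pd.DataFrame({
    "record_id_eia": ["eia_a", None],
    "record_id_ferc1": ["ferc_1", "ferc_2"],
})
new = pd.DataFrame({"record_id_ferc1": ["ferc_1", "ferc_2", "ferc_3"]})

overlap = find_already_trained(new, training)
print(overlap)
```

If the returned list is non-empty and you did mean to revise those matches, set `expect_override_overrides = True`.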

In [265]:
expect_override_overrides = False

fixed_overrides_path = pathlib.Path().cwd().parent / 'add_to_training'
training_path = pathlib.Path().cwd().parent / 'inputs' / 'train_ferc1_to_eia_copy.csv'
training_data = pd.read_csv(training_path)

In [277]:
training_data_out = (
    validate_and_combine_new_overrides(expect_override_overrides, connects_ferc1_eia, training_data)
    .pipe(add_to_training, training_data)
)


# Only uncomment this when you're ready to replace the old training data.

#training_data_out.to_csv(training_path)

Processing fixes in Evergy_fix_FERC-EIA_overrides.xlsx
Validating overrides
    verified  used_match_record signature_1  signature_2  \
32      True               True          AS          NaN   
33      True               True          AS          NaN   
34      True               True          AS          NaN   

                                          notes     record_id_eia_override_1  \
32  capacity slightly off but no better options  6074_2009_plant_total_56211   
33  capacity slightly off but no better options  6074_2010_plant_total_56211   
34  capacity slightly off but no better options  6074_2011_plant_total_56211   

   record_id_eia_override_2  record_id_eia_override_3 best_match  \
32                      NaN                       NaN    net-gen   
33                      NaN                       NaN    net-gen   
34                      NaN                       NaN    net-gen   

             record_id_ferc1                record_id_eia  true_gran  \
32  f1_steam_2009

In [275]:
test[test['match_type']=='overridden']
training_data[training_data['record_id_ferc1']=='f1_steam_2018_12_191_0_4']

Unnamed: 0,record_id_eia,record_id_ferc1,signature,notes
160,56502_1_2018_plant_gen_owned_22500,f1_steam_2018_12_191_0_4,,


----------

## Notebook Setup

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import pandas as pd
import numpy as np
import pathlib
import pudl
import pudl.constants as pc
import pudl.extract.ferc1
import sqlalchemy as sa
import logging
import sys
import copy
from copy import deepcopy
import scipy
import statistics
import yaml
import os

import recordlinkage as rl
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [5]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [6]:
sys.path.append("../")
from pudl.output.ferc1 import *
from pudl_rmi.connect_ferc1_to_eia import *
from pudl_rmi.make_plant_parts_eia import *
import pudl_rmi.connect_ferc1_to_eia
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc_engine = sa.create_engine(pudl_settings['ferc1_db'])
pd.options.display.max_columns = None

In [7]:
relevant_cols_ferc_eia = [
    'record_id_ferc1',
    'record_id_eia',
    'true_gran',
    'report_year',
    'match_type',
    'plant_part',
    'ownership',
    'utility_id_eia',
    'utility_id_pudl',
    'utility_name_ferc1',
    'plant_id_pudl',
    'unit_id_pudl',
    'generator_id',
    'plant_name_ferc1',
    'plant_name_new',
    'fuel_type_code_pudl_ferc1',
    'fuel_type_code_pudl_eia',
    'net_generation_mwh_ferc1',
    'net_generation_mwh_eia',
    'capacity_mw_ferc1',
    'capacity_mw_eia',
    'capacity_factor_ferc1',
    'capacity_factor_eia',
    'total_fuel_cost_ferc1',
    'total_fuel_cost_eia',
    'total_mmbtu_ferc1',
    'total_mmbtu_eia',
    'fuel_cost_per_mmbtu_ferc1',
    'fuel_cost_per_mmbtu_eia',
    'installation_year_ferc1',
    'installation_year_eia',
]

relevant_cols_mul = [
    'record_id_eia',
    'report_year',
    'utility_id_pudl',
    'utility_id_eia',
    'utility_name_eia', # I add this in from the utils_eia860() table
    'operational_status_pudl',
    'true_gran',
    'plant_part',
    'ownership_dupe',
    'fraction_owned',
    'plant_id_eia',
    'plant_id_pudl',
    'plant_name_new',
    'generator_id',
    'capacity_mw',
    'capacity_factor',
    'net_generation_mwh',
    'installation_year',
    'fuel_type_code_pudl',
    'total_fuel_cost',
    'total_mmbtu',
    'fuel_cost_per_mmbtu',
    'heat_rate_mmbtu_mwh',
]

In [8]:
file_path_training = pathlib.Path().cwd().parent /'inputs'/'train_ferc1_to_eia.csv'
file_path_mul = pathlib.Path().cwd().parent /'outputs' / 'master_unit_list.pkl.gz'
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS',fill_fuel_cost=True,roll_fuel_cost=True,fill_net_gen=True)

## **Part 1:** Generate Override Tools

### 1.1 Get current FERC-EIA & MUL tables
This is going to look a lot like the `connect_ferc1_to_eia.ipynb`.

In [7]:
# to make a new master unit list if necessary
#make_plant_parts_eia.get_master_unit_list_eia(file_path_mul, pudl_out, clobber=True)

Master unit list not found /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gzGenerating a new master unit list. This should take ~10 minutes.
Generating the mega generator table with ownership.
Allocating net generation from the generation_fuel_eia923 to the generator level instead of using the less complete generation_eia923 table.
Removing 4060 generators that retired mid-year out of 459975
No records found with fuel-only records. This is expected.
Ratio calc types: 
   All gens w/in generation table:  71405#, 1.2e+07 MW
   Some gens w/in generation table: 2572#, 1.6e+05 MW
   No gens w/in generation table:   418818#, 2e+07 MW
   GF table records have no PM:     0#
2.253% of records have are partially off from their 'IDX_PM_FUEL' group
gen v fuel table net gen diff:      39.9%
new v fuel table net gen diff:      99.6%
new v fuel table fuel (mmbtu) diff: 99.5%
6.63% of generator records are more that 5% off from the net generation table
filling in

record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_factor,capacity_mw,capacity_mw_eoy,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,operational_status_pudl,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,plant_part_id_eia,record_count,retirement_date,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year
55439_2003_plant_owned_0_proposed,55439,2003-01-01,plant,,,,NG,,,0,True,plant,55439_2003_plant_owned_0_proposed,,183.300,0.0,0.162299,,,gas,,,,proposed,proposed,owned,False,NaT,4540,Tenaska Virginia Generating,Tenaska Virginia Generating,55439_plant_owned_0_proposed,1,NaT,,,,2003,4540_2003
55439_2003_plant_owned_34403_proposed,55439,2003-01-01,plant,,,,NG,,,34403,True,plant,55439_2003_plant_owned_34403_proposed,,9.461,0.0,0.008377,,,gas,,,,proposed,proposed,owned,False,NaT,4540,Tenaska Virginia Generating,Tenaska Virginia Generating,55439_plant_owned_34403_proposed,1,NaT,,,,2003,4540_2003
55439_2003_plant_owned_34434_proposed,55439,2003-01-01,plant,,,,NG,,,34434,True,plant,55439_2003_plant_owned_34434_proposed,,936.639,0.0,0.829324,,,gas,,,,proposed,proposed,owned,False,NaT,4540,Tenaska Virginia Generating,Tenaska Virginia Generating,55439_plant_owned_34434_proposed,1,NaT,,,,2003,4540_2003
56212_2003_plant_owned_6_proposed,56212,2003-01-01,plant,SW2,,WT,WND,,,6,True,plant,56212_2003_plant_owned_6_proposed,,12.444,0.0,0.136000,,,wind,,,,proposed,proposed,owned,False,NaT,4851,Sweetwater Wind 2 LLC,Sweetwater Wind 2 LLC,56212_plant_owned_6_proposed,1,NaT,,,,2003,4851_2003
56212_2003_plant_owned_49882_proposed,56212,2003-01-01,plant,SW2,,WT,WND,,,49882,True,plant,56212_2003_plant_owned_49882_proposed,,12.444,0.0,0.136000,,,wind,,,,proposed,proposed,owned,False,NaT,4851,Sweetwater Wind 2 LLC,Sweetwater Wind 2 LLC,56212_plant_owned_49882_proposed,1,NaT,,,,2003,4851_2003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61257_zv204_2020_plant_gen_total_60865,61257,2020-01-01,plant_gen,ZV204,,PV,SUN,Solar Photovoltaic,Other,60865,False,plant,61257_2020_plant_total_60865,,4.900,4.9,1.000000,,,solar,,,,existing,operating,total,False,NaT,10876,"ZV Solar 2, LLC","ZV Solar 2, LLC ZV204",61257_ZV204_plant_gen_total_60865,1,NaT,,,5833,2020,10876_2020
60549_zv3_2020_plant_gen_total_61119,60549,2020-01-01,plant_gen,ZV3,,PV,SUN,Solar Photovoltaic,Other,61119,False,plant,60549_2020_plant_total_61119,,5.000,5.0,1.000000,,,solar,,,,existing,operating,total,False,NaT,9490,"ZV Solar 3, LLC","ZV Solar 3, LLC ZV3",60549_ZV3_plant_gen_total_61119,1,NaT,,,5830,2020,9490_2020
60220_zwed1_2020_plant_gen_total_60003,60220,2020-01-01,plant_gen,ZWED1,,IC,OBG,Other Waste Biomass,Other,60003,True,plant_gen,60220_zwed1_2020_plant_gen_total_60003,,0.800,0.8,1.000000,,,gas,,2015,,existing,operating,total,False,NaT,7555,Zero Waste Energy Development Co LLC,Zero Waste Energy Development Co LLC ZWED1,60220_ZWED1_plant_gen_total_60003,2,NaT,,,3836,2020,7555_2020
60220_zwed2_2020_plant_gen_total_60003,60220,2020-01-01,plant_gen,ZWED2,,IC,OBG,Other Waste Biomass,Other,60003,True,plant_gen,60220_zwed2_2020_plant_gen_total_60003,,0.800,0.8,1.000000,,,gas,,2015,,existing,operating,total,False,NaT,7555,Zero Waste Energy Development Co LLC,Zero Waste Energy Development Co LLC ZWED2,60220_ZWED2_plant_gen_total_60003,2,NaT,,,3836,2020,7555_2020


In [9]:
inputs = InputManager(file_path_training, file_path_mul, pudl_out)
features_all = (Features(feature_type='all', inputs=inputs)
                .get_features(clobber=False))
features_train = (Features(feature_type='training', inputs=inputs)
                  .get_features(clobber=False))
tuner = ModelTuner(features_train, inputs.get_train_index(), n_splits=10)

matcher = MatchManager(best=tuner.get_best_fit_model(), inputs=inputs)
matches_best = matcher.get_best_matches(features_train, features_all)

Preparing the FERC1 tables.
loading steam table
loading small gens table
loading hydro table
loading pumped storage table
prepping steam table
prepping hydro tables
combining all tables
Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz
Generated 155502 all candidate features.
Generated 945 training candidate features.
We are about to test hyper parameters of the model while doing k-fold cross validation. This takes a few minutes....
Scores from the best model hyperparameters:
  F-Score:   0.87
  Precision: 0.89
  Accuracy:  0.74
Fit and predict a model w/ the highest scoring hyperparameters.
Get the top scoring match for each FERC1 steam record.
Winning match stats:
        matches vs ferc:      69.21%
        best match v ferc:    59.83%
        best match vs matches:86.45%
        murk vs matches:      9.84%
        ties vs matches:      1.15%
Overridden records:       22.3%
New best match v ferc:    60.25%


In [10]:
file_path_deprish = pathlib.Path().cwd().parent/'inputs'/'depreciation_rmi.xlsx'
sheet_name_deprish='Depreciation Studies Raw'
transformer = pudl_rmi.deprish.Transformer(
    pudl_rmi.deprish.Extractor(
        file_path=file_path_deprish,
        sheet_name=sheet_name_deprish
    ).execute())

Reading the depreciation data from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/inputs/depreciation_rmi.xlsx


In [11]:
connects_ferc1_eia = (
    prettyify_best_matches(
        matches_best, 
        plant_parts_true_df=inputs.plant_parts_true_df,
        steam_df=inputs.all_plants_ferc1_df,
        train_df=inputs.train_df)
    .copy()
)

jsuk there are some FERC-EIA matches that aren't in the steam                 table but this is because they are linked to retired EIA generators.
Coverage for matches during EIA working years:
    Fuel type: 97.7%
    Tech type: 32.6%

Coverage for all steam table records during EIA working years:
    EIA matches: 68.0

Coverage for all small gen table records during EIA working years:
    EIA matches: 37.7

Coverage for all hydro table records during EIA working years:
    EIA matches: 81.4

Coverage for all pumped storage table records during EIA working years:
    EIA matches: 47.0
Matches with consistency across years of all matches is 76.9%
Matches with completely consistent FERC capacity have a consistency of 91.9%
Matches with consistency across years of overrides matches is 42.8%
Matches with completely consistent FERC capacity have a consistency of 85.9%


In [12]:
# mul but add utility name
mul = (
    make_plant_parts_eia.get_master_unit_list_eia(file_path_mul, pudl_out).reset_index()
    .merge(
        pudl_out.utils_eia860()[['utility_id_eia', 'utility_name_eia', 'report_date']].copy(), 
        on=['utility_id_eia', 'report_date'],
        how='left',
        validate='m:1')
    .copy()
)

Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz


In [13]:
deprish_df = transformer.execute()

# of reserve_rate over 1 (100%): 1. Higher #s here may indicate an issue with the original data or the fill_in method
2.11% of records have correctable zero net_salvage_rate
2.1% of records have postive net_salvage_rate
0.03% of records have correctable zero net_salvage
2.3% of records have postive net_salvage
Added 16999 ferc_acct_name's out of 17210 options
aggregating to: ['report_date', 'plant_id_pudl', 'plant_part_name', 'ferc_acct', 'utility_id_pudl', 'data_source', 'line_id', 'utility_name_ferc1']
overriding auto-generated common associations with 470 mannual associations
grabbed 1157 common records
grabbed 1923 common reocrds and 14330 atomic records
allocating common for plant_balance
The resulting plant_balance allocated is 99.42% of the original
allocating common for book_reserve
The resulting book_reserve allocated is 99.87% of the original
allocating common for unaccrued_balance
The resulting unaccrued_balance allocated is 100.00% of the original
allocating common for net_

### 1.2 Curate columns

Here we'll limit the columns in the output file to those that will be useful for analysing match correctness. We'll also add some columns for you to use during the match verification process. All match types are included in the outputs (even those that have been correctly mapped according to the current overrides) just in case there is a discrepancy or error that we want to fix.

**Match Types:**

* `prediction`: prediction based on the training data.
* `correct_prediction`: prediction based on training data that matches the record in the training data.
* `no prediction; training`: not filled in by the prediction algorithm but filled in by the training data.
* `overridden`: incorrectly filled in by the prediction algorithm and corrected by training data.
* `no_match`: a reviewer has found there to be no verified EIA match for the given FERC record.
* `NaN`: not filled in by the training data or the prediction algorithm. 
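
A quick way to see how records are distributed across these match types is to count them, keeping the `NaN` group visible. The frame below is a toy stand-in for `check_connections`, with made-up record IDs:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "record_id_ferc1": ["f1", "f2", "f3", "f4"],
    "match_type": ["prediction", "overridden", np.nan, "prediction"],
})

# dropna=False keeps the unmatched (NaN) records in the tally
counts = toy["match_type"].value_counts(dropna=False)
print(counts)

# Records with no prediction and no training match are the gaps left to fill
unmatched = toy[toy["match_type"].isna()]
print(unmatched)
```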

In [14]:
# Grab the MUL and FERC-EIA connections that show the comparison between FERC and EIA values
check_connections = connects_ferc1_eia[relevant_cols_ferc_eia].copy()
mul = mul[relevant_cols_mul].copy()

# Add a column to tell whether it's a good match, who verified / made the match,
# and any notes about weirdness.
check_connections.insert(0, "verified", np.nan)
check_connections.insert(1, "used_match_record", np.nan)
check_connections.insert(2, "signature_1", np.nan)
check_connections.insert(3, "signature_2", np.nan)
check_connections.insert(4, "notes", np.nan)
check_connections.insert(6, "record_id_eia_override_1", np.nan)
check_connections.insert(7, "record_id_eia_override_2", np.nan)
check_connections.insert(8, "record_id_eia_override_3", np.nan)
check_connections.insert(9, "best_match", np.nan)


# put these in the right order to be filled in by pct_diff
check_connections.insert(26, "fuel_type_code_pudl_diff", np.nan)
check_connections.insert(29, "net_generation_mwh_pct_diff", np.nan)
check_connections.insert(32, "capacity_mw_pct_diff", np.nan)
check_connections.insert(35, "capacity_factor_pct_diff", np.nan)
check_connections.insert(38, "total_fuel_cost_pct_diff", np.nan)
check_connections.insert(41, "total_mmbtu_pct_diff", np.nan)
check_connections.insert(44, "fuel_cost_per_mmbtu_pct_diff", np.nan)
check_connections.insert(47, "installation_year_diff", np.nan)

# Fix some column names
check_connections.rename(
    columns={'utility_id_pudl_ferc1': 'utility_id_pudl', 
             'plant_id_pudl_ferc1': 'plant_id_pudl',
             'plant_name_new': 'plant_name_eia'}, inplace=True)

def pct_diff(df, col):
    df.loc[(df[f"{col}_eia"] > 0) & (df[f"{col}_ferc1"] > 0), f"{col}_pct_diff"] = (
        round(((df[f"{col}_ferc1"] - df[f"{col}_eia"]) / df[f"{col}_ferc1"] * 100), 2)
    )

# Add pct diff columns
for col in ['net_generation_mwh', 'capacity_mw', 'capacity_factor', 
            'total_fuel_cost', 'total_mmbtu', 'fuel_cost_per_mmbtu']:
    pct_diff(check_connections, col)
    
# Add qualitative similarity columns (fuel_type_code_pudl)
check_connections.loc[
    (check_connections.fuel_type_code_pudl_eia.notna())
    & (check_connections.fuel_type_code_pudl_ferc1.notna()),
    "fuel_type_code_pudl_diff"
] = check_connections.fuel_type_code_pudl_eia == check_connections.fuel_type_code_pudl_ferc1

# Add quantitative similarity columns (installation year)
check_connections.loc[:, "installation_year_ferc1"] = check_connections.installation_year_ferc1.astype("Int64")
check_connections.loc[
    (check_connections.installation_year_eia.notna())
    & (check_connections.installation_year_ferc1.notna()),
    "installation_year_diff"
] = check_connections.installation_year_eia - check_connections.installation_year_ferc1

# Move record_id_ferc1
record_id_ferc1 = check_connections.pop('record_id_ferc1')
check_connections.insert(9, "record_id_ferc1", record_id_ferc1)

# Add utility name eia
utils = pudl_out.utils_eia860().assign(report_year=lambda x: x.report_date.dt.year)[['utility_id_eia', 'utility_name_eia', 'report_year']]

check_connections = (
    pd.merge(check_connections, utils,
             on=['utility_id_eia', 'report_year'], how='left', validate='m:1')
    .rename(columns={'utility_name_eia': 'utility_name'})
)
check_connections.insert(19, "utility_name_eia", check_connections.utility_name)
check_connections = check_connections.drop(columns=['utility_name'])
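
As a worked example of the percent-difference columns, the sketch below applies the same formula, `(ferc1 - eia) / ferc1 * 100`, to a toy frame. `pct_diff_example` is a hypothetical variant that returns a Series rather than writing into the frame in place, and the capacities are made up.

```python
import pandas as pd


def pct_diff_example(df, col):
    """Percent difference between FERC1 and EIA values, only where both are positive."""
    both_positive = (df[f"{col}_eia"] > 0) & (df[f"{col}_ferc1"] > 0)
    diff = (df[f"{col}_ferc1"] - df[f"{col}_eia"]) / df[f"{col}_ferc1"] * 100
    # Rows that fail the positivity test are left as NaN
    return diff.where(both_positive).round(2)


toy = pd.DataFrame({
    "capacity_mw_ferc1": [100.0, 50.0, 0.0],
    "capacity_mw_eia": [95.0, 55.0, 10.0],
})
toy["capacity_mw_pct_diff"] = pct_diff_example(toy, "capacity_mw")
print(toy)
```

A positive value means FERC reports more than EIA, a negative value means less, and the third row stays empty because the FERC capacity is zero.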

In [15]:
def is_best_match2(df):
    message = []
    if abs(df.capacity_mw_pct_diff) < 6:
        message.append('cap')
    if abs(df.net_generation_mwh_pct_diff) < 6:
        message.append('net-gen')
    if abs(df.installation_year_diff) < 3:
        message.append('inst-y')

    return '_'.join(message)
        

check_connections['best_match'] = check_connections.apply(lambda x: is_best_match2(x), axis=1)
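
To illustrate how the `best_match` labels come out, here is a self-contained sketch with the same thresholds (under 6% for capacity and net generation, under 3 years for installation year) on made-up differences. `label_close_fields` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd


def label_close_fields(row, pct_tol=6, year_tol=3):
    """Join short labels for every field whose FERC/EIA difference is within tolerance."""
    labels = []
    if abs(row["capacity_mw_pct_diff"]) < pct_tol:
        labels.append("cap")
    if abs(row["net_generation_mwh_pct_diff"]) < pct_tol:
        labels.append("net-gen")
    if abs(row["installation_year_diff"]) < year_tol:
        labels.append("inst-y")
    return "_".join(labels)


toy = pd.DataFrame({
    "capacity_mw_pct_diff": [2.0, 40.0, np.nan],
    "net_generation_mwh_pct_diff": [-3.5, 1.0, np.nan],
    "installation_year_diff": [0.0, 10.0, np.nan],
})
toy["best_match"] = toy.apply(label_close_fields, axis=1)
print(toy["best_match"].tolist())
```

Comparisons against `NaN` are always false, so a record with no computed differences ends up with an empty label rather than raising an error.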

### 1.3 Get utility and year subsets for editing

Not sure which PUDL ID you need? Use this cell to search for them by name:
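
A minimal sketch of such a search, assuming a table with `utility_name_ferc1` and `utility_id_pudl` columns like `check_connections`. `find_utility_ids` is a hypothetical helper, and the utility names and IDs in the toy frame are made up:

```python
import pandas as pd


def find_utility_ids(df, name_fragment):
    """Return unique (name, PUDL ID) pairs whose FERC name contains name_fragment, ignoring case."""
    mask = df["utility_name_ferc1"].str.contains(
        name_fragment, case=False, na=False, regex=False
    )
    return (
        df.loc[mask, ["utility_name_ferc1", "utility_id_pudl"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )


toy = pd.DataFrame({
    "utility_name_ferc1": ["Example Power Co", "Example Power Co", "Sample Electric", None],
    "utility_id_pudl": [1, 1, 2, 3],
})
found = find_utility_ids(toy, "power")
print(found)
```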

In [16]:
def prep_inputs(check_connections_df, utilities='largest', amount=5, years='all'):
    
        all_plants_ferc1 = pudl_out.all_plants_ferc1().copy()
        max_year = all_plants_ferc1.report_year.max()
        min_year = all_plants_ferc1.report_year.min()

        if years != 'all':
            assert type(years) == list, "years must be reported as a list if not 'all'"
            assert len([year for year in years if year in range(min_year, max_year+1)]) == len(years), \
                "years must be 'all' or a valid year integer within the bounds of FERC reporting years"
        if years == 'all':
            years = range(min_year, max_year+1)
            
        check_years = check_connections_df[check_connections_df['report_year'].isin(years)]
        
        if utilities == 'largest':
            logger.info(f"getting pudl ids for the top {amount} largest utilities")
            utilities = (
                check_years
                .groupby(['utility_id_pudl_eia', 'utility_name_ferc1'])['capacity_mw_ferc1']
                .sum()
                .reset_index()
                .sort_values('capacity_mw_ferc1', ascending=False)
                .head(amount)
                .utility_id_pudl_eia
                .tolist()
            )
        else:
            assert type(utilities) == dict, "if not 'largest', utilities must be presented as a dict of PUDL IDs and EIA IDs"
            
        return utilities, years

In [17]:
def get_ferc_eia_utilities_subset(check_connections_df, utilities, years):
    logger.info("retrieving the ferc-eia connection for the given utilities")
    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]
    utils_pudl = utilities['utility_id_pudl']
    util_output = check_years[check_years['utility_id_pudl'].isin(utils_pudl)].copy()
    return util_output
    
def get_mul_subset(mul, utilities, years):
    logger.info("retrieving the MUL for the given utilities")
    mul_years = mul[mul['report_year'].isin(years)]
    utils_eia = utilities['utility_id_eia']
    utils_pudl = utilities['utility_id_pudl']
    mul_output = mul_years[(mul_years['utility_id_eia'].isin(utils_eia)) | (mul_years['utility_id_pudl'].isin(utils_pudl))]
    return mul_output

def get_deprish_subset(deprish_df, utilities):
    logger.info("retrieving depreciation data for the given utilities")
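    # `utilities` is a dict of ID lists, so filter on its PUDL IDs (not on the dict's keys)
    utils_pudl = utilities['utility_id_pudl']
    deprish_output = deprish_df[deprish_df['utility_id_pudl'].isin(utils_pudl)]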
    return deprish_output
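
The subsetting logic can be seen on a small example. The sketch below mirrors the MUL filter (keep rows in the chosen years whose EIA ID *or* PUDL ID is listed). `subset_by_utility` is a hypothetical name and the rows are made up, apart from reusing the IDACORP IDs from the inputs above.

```python
import pandas as pd


def subset_by_utility(df, utilities, years):
    """Keep rows in the given years that match either the EIA or the PUDL utility IDs."""
    in_years = df["report_year"].isin(years)
    in_utils = (
        df["utility_id_eia"].isin(utilities["utility_id_eia"])
        | df["utility_id_pudl"].isin(utilities["utility_id_pudl"])
    )
    return df[in_years & in_utils]


toy = pd.DataFrame({
    "utility_id_eia": [9191, 1, 9191, 5],
    "utility_id_pudl": [140, 140, 2, 6],
    "report_year": [2018, 2018, 2004, 2018],
})
utilities = {"utility_id_pudl": [140], "utility_id_eia": [9191]}
subset = subset_by_utility(toy, utilities, years=[2018])
print(subset)
```

The second row is kept through its PUDL ID alone, the third is dropped for its year, and the fourth matches neither ID list.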

In [18]:
def output_override_tools(check_connections_df, mul, deprish_df, util_name, utilities='largest', amount=5, years='all'):
    
    logger.info(f"Making override file for {util_name}")
    
    utilities, years = prep_inputs(check_connections_df, utilities, amount, years)
    
    ferc_eia_util_subset = get_ferc_eia_utilities_subset(check_connections_df, utilities, years)
    # Add some functions to it
    ferc_eia_util_subset = (
        ferc_eia_util_subset.reset_index(drop=True)
        .assign(used_match_record=lambda x: "=(F" + (x.index+2).astype('str') + "=K" + (x.index+2).astype('str') + ")" )
    )
    
    mul_util_subset = get_mul_subset(mul, utilities, years)
    deprish_util_subset = get_deprish_subset(deprish_df, utilities)
    
    # Make sure mul subset isn't too big
    assert len(mul_util_subset) < 500000, (
        "Your MUL subset is more than 500,000 rows...this is going to make Excel "
        "reaaalllllyyy slow. Try entering a smaller utility or year subset"
    )
    
    # Create a dict of each df and the tab name you want to give it in the output
    tool_dict = {
        'ferc_eia_util_subset': ferc_eia_util_subset,
        'mul_util_subset': mul_util_subset,
        'deprish_util_subset': deprish_util_subset
    }
    
    output_path = pathlib.Path().cwd().parent / 'outputs'
    
    # Make sure overrides dir exists
    if not os.path.isdir(output_path / 'overrides'):
        os.mkdir(output_path / 'overrides')
        
    # Enable unique file names and put all files in directory called overrides
    new_output_path = output_path / 'overrides' / f'{util_name}_fix_FERC-EIA_overrides.xlsx'
    
    logger.info("outputting override tools to tabs\n")
    #pudl_rmi.connect_deprish_to_eia.save_to_workbook(output_path, tool_dict)
    
    # Output the file to a folder called overrides. The context manager saves and
    # closes the workbook on exit (ExcelWriter.save() was removed in pandas 2.0).
    with pd.ExcelWriter(new_output_path, engine='xlsxwriter') as writer:
        for tab, df in tool_dict.items():
            df.to_excel(writer, sheet_name=tab, index=False)
    
    
    return ferc_eia_util_subset
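
The `used_match_record` column is filled with one Excel formula per row. Judging by the column order built in section 1.2, column F should hold `record_id_eia_override_1` and column K `record_id_eia`, so each formula reports whether the reviewer's override equals the existing match; treat that column mapping as an assumption and confirm it in your own file. A toy sketch of how the formula strings are built:

```python
import pandas as pd

toy = pd.DataFrame({"record_id_ferc1": ["f1_a", "f1_b", "f1_c"]}).reset_index(drop=True)

# Row 1 of the sheet holds the header, so DataFrame row i lands on sheet row i + 2.
sheet_rows = range(2, len(toy) + 2)
toy["used_match_record"] = [f"=(F{row}=K{row})" for row in sheet_rows]
print(toy["used_match_record"].tolist())
```

Because the formulas are tied to row numbers, sorting or inserting columns in the spreadsheet can change what they compare.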

## **Part 2:** Re-incorporating Matched Records

Now that you've marked the correctly matched records as `TRUE`, we'll want to incorporate those into the permanent override list. All you have to do is move the `fix_FERC-EIA_overrides.xlsx` file to the `add_to_training` directory, run the following cells, and then run...

### 2.1 Update training data

In [276]:
def validate_override_fixes(validated_connections, connects_ferc1_eia, training_data, expect_override_overrides=False):
    """Process the verified and/or fixed matches.
    
    Args: 
        validated_connections (pd.DataFrame): A dataframe read from a file in the add_to_training
            directory that is ready to be validated and subsumed into the training data.
        connects_ferc1_eia (pd.DataFrame): The current FERC-EIA table
        training_data (pd.DataFrame): The existing training data (the current overrides).
        expect_override_overrides (boolean): Whether you expect the tables to have overridden matches
            already in the training data.
    Raises:
        AssertionError: If there are ferc record ids that aren't in the original FERC-EIA connection
        AssertionError: If there are eia override id records that aren't in the original FERC-EIA connection
        AssertionError: If there are eia override id records that don't correspond to the correct report year
        AssertionError: If you didn't expect to override overrides but the data does
    Returns:
        pd.DataFrame: The validated FERC-EIA dataframe you're trying to add to the training data.
    
    """
    logger.info("Validating overrides")
    
    # Make sure that there are no rogue descriptions in the verified field (besides TRUE)
    match_language = validated_connections.verified.unique()
    assert len(outliers:=[x for x in match_language if x not in [True, False, pd.NA]]) == 0, \
        f"All correct matches must be marked TRUE; found {outliers}"

    # Get TRUE records
    true_connections = validated_connections[validated_connections['verified']].copy()

    # Make sure that the eia and ferc ids haven't been tampered with
    assert len(bad_eia := [x for x in true_connections.dropna().record_id_eia_override_1.unique()
                        if x not in connects_ferc1_eia.record_id_eia.unique()]) == 0, \
        f"Found record_id_eia_override_1 values that aren't in the existing FERC-EIA connection: {bad_eia}"
    assert len(bad_ferc := [x for x in true_connections.dropna().record_id_ferc1.unique()
                        if x not in connects_ferc1_eia.record_id_ferc1.unique()]) == 0, \
        f"Found record_id_ferc1 values that aren't in the existing FERC-EIA connection: {bad_ferc}"

    # Make sure the year in the suggested eia id overrides match the year in the report_year column
    year_ser = true_connections.record_id_eia_override_1.str.extract(r'(_20\d{2})')[0].str.replace('_', '')
    year_ser_int = pd.to_numeric(year_ser, errors='coerce').astype('Int64')
    assert len(bad_eia := true_connections['report_year'].compare(year_ser_int)) == 0, \
        f"Found record_id_eia_override_1 values that don't correspond to the right report year: {bad_eia}"

    if not expect_override_overrides:
        # Make sure that these aren't already in the overrides (this should be impossible, but just in case)
        assert len(bad_eia := [x for x in true_connections.record_id_eia_override_1.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_eia.unique()]) == 0,  \
            f"Found record_id_eia_override_1 values that are already in the existing FERC-EIA training data: {bad_eia}"
        assert len(bad_ferc := [x for x in true_connections.record_id_ferc1.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_ferc1.unique()]) == 0, \
            f"Found record_id_ferc1 values that are already in the existing FERC-EIA training data: {bad_ferc}"
    
    return true_connections
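
The report-year check relies on pulling a four-digit year out of each EIA record ID. The sketch below shows that step on two IDs that appear in this notebook's output, plus a missing value. It is a simplified illustration rather than the validation itself, and like the original pattern it only recognises years of the form `20xx`.

```python
import pandas as pd

ids = pd.Series([
    "6074_2009_plant_total_56211",
    "56502_1_2018_plant_gen_owned_22500",
    None,
])

# Capture the first "_20xx" year in each ID; missing IDs become <NA>
years = pd.to_numeric(
    ids.str.extract(r"_(20\d{2})", expand=False), errors="coerce"
).astype("Int64")
print(years.tolist())

# Hypothetical report years for comparison; a missing year counts as a mismatch
report_year = pd.Series([2009, 2017, 2020], dtype="Int64")
mismatch = (report_year != years).fillna(True)
print(mismatch.tolist())
```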

In [269]:
def validate_and_combine_new_overrides(expect_override_overrides, connects_ferc1_eia, training_data):
    """Validate and combine all the override matches into one blob.
    
    Validating and combining the records so you only have to loop through the files once.
    
    Args:
        expect_override_overrides (bool): This value is explicitly assigned at the top of the notebook.
        connects_ferc1_eia (pandas.DataFrame): The current FERC-EIA table
        training_data (pandas.DataFrame): The existing training data.
        
    Returns:
        pandas.DataFrame: A DataFrame with all of the new overrides combined.
    """
    all_overrides = pd.DataFrame(columns=['record_id_eia', 'record_id_ferc1', 'signature_1', 'notes'])
    all_files = os.listdir(fixed_overrides_path)
    good_files = [file for file in all_files if file.endswith('.xlsx')]
    for file in good_files:
        logger.info(f"Processing fixes in {file}")
        file_df = (
            pd.read_excel(
                (fixed_overrides_path / file), 
                sheet_name='ferc_eia_util_subset')
            .assign(
                verified=lambda x: x.verified.astype('boolean').fillna(False))
            .pipe(validate_override_fixes, connects_ferc1_eia, training_data, expect_override_overrides=expect_override_overrides)
            .rename(columns={'record_id_eia': 'record_id_eia_old', 
                             'record_id_eia_override_1': 'record_id_eia'}))
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        all_overrides = pd.concat([all_overrides, file_df[['record_id_eia', 'record_id_ferc1',
                                                           'signature_1', 'notes']]])
    return all_overrides

In [270]:
def add_to_training(new_overrides, training_df):
    """Add the new overrides to the old override sheet."""
    logger.info("Combining all new overrides with existing training data")
    training_data_out = (
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        pd.concat([
            training_df,
            new_overrides[['record_id_eia', 'record_id_ferc1', 'signature_1', 'notes']]])
        .set_index(['record_id_eia', 'record_id_ferc1'])
    )
    return training_data_out