# Add Overrides to Train FERC-EIA Connecter

This notebook is intended to help with adding overrides to the FERC-EIA connection csv. Adding new connections will fill in gaps and improve the program's ability to predict other matches. To adequately check each of the connections, we'll provide you with subsets from *three* different spreadsheets:

1) **The current FERC-EIA connection:** to look for good, bad, and empty links between FERC and EIA records.
2) **The Master Unit List:** to confirm or disprove those connections.
3) **Depreciation data** from our previous work.

Downloading all the files at once will overwhelm Excel, so we need to make edits in segments. This notebook will help you:

1) **Download useful utility-based subsets of each table for review.**
2) **Update the old training data with new verified matches.**

Once you edit the inputs below, run the entire notebook. Next, pick whether you want to [download tools to verify connection between EIA and FERC1](#verify-tools) or [upload changes to training data](#upload-overrides) and run the relevant functions.

## Edit Inputs

It's time to choose which subset of the data you'd like to wrangle first. We'll limit the download to specific utilities and years only if you specify them (we highly recommend this so you don't crash Excel). If you're not sure which PUDL IDs refer to which utilities, scroll down to section 1.3.

Current inputs: 
* Dominion: 
    * utility_id_pudl: `[292, 293, 349]`
    * utility_id_eia: `[17539, 17554, 19876]`
* Evergy: 
    * utility_id_pudl: `[159, 160, 161, 359]`
    * utility_id_eia: `[10000, 10005, 56211, 3702, 22500]`
* IDACORP: 
    * utility_id_pudl: `[140]`
    * utility_id_eia: `[9191]`
* Duke:
    * utility_id_pudl: `[90, 91, 92, 93, 96, 97]`
    * utility_id_eia: `[5416, 6455, 15470, 55729, 3542, 3046]`
* BHE:
    * utility_id_pudl: `[185, 246, 204, 287]`
    * utility_id_eia: `[12341, 14354, 13407, 17166]` 
* Southern:
    * utility_id_pudl: `[123, 18, 190]`
    * utility_id_eia: `[7140, 195, 12686, 17622]`
* NextEra:
    * utility_id_pudl: `[121, 130]`
    * utility_id_eia: `[6452, 7801]`
* AEP:
    * utility_id_pudl: `[29, 301, 144, 275, 162, 7]`
    * utility_id_eia: `[733, 17698, 9324, 15474, 22053, 20521, 343]`
* Entergy:
    * utility_id_pudl: `[107, 106, 311, 113, 110]`
    * utility_id_eia: `[11241, 814, 12465, 55937, 13478]`
* Xcel:
    * utility_id_pudl: `[224, 302, 272]`
    * utility_id_eia: `[13781, 13780, 17718, 15466]`

In [107]:
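# A dict keyed by utility name; each value holds that utility's PUDL and EIA IDs.
# (`prep_inputs` also accepts 'largest', but the download loop below expects a dict.)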
specified_utilities = {
    'Dominion': {'utility_id_pudl': [292, 293, 349],
                 'utility_id_eia': [17539, 17554, 19876]},
    'Evergy': {'utility_id_pudl': [159, 160, 161, 359],
               'utility_id_eia': [10000, 10005, 56211, 3702, 22500]},
    'IDACORP': {'utility_id_pudl': [140],
                'utility_id_eia': [9191]},
    'Duke': {'utility_id_pudl': [90, 91, 92, 93, 96, 97],
             'utility_id_eia': [5416, 6455, 15470, 55729, 3542, 3046]},
    'BHE': {'utility_id_pudl': [185, 246, 204, 287],
            'utility_id_eia': [12341, 14354, 13407, 17166]},
    'Southern': {'utility_id_pudl': [123, 18, 190],
                 'utility_id_eia': [7140, 195, 12686, 17622]},
    'NextEra': {'utility_id_pudl': [121, 130],
                'utility_id_eia': [6452, 7801]},
    'AEP': {'utility_id_pudl': [29, 301, 144, 275, 162, 7],
            'utility_id_eia': [733, 17698, 9324, 15474, 22053, 20521, 343]},
    'Entergy': {'utility_id_pudl': [107, 106, 311, 113, 110],
                'utility_id_eia': [11241, 814, 12465, 55937, 13478]},
    'Xcel': {'utility_id_pudl': [224, 302, 272],
             'utility_id_eia': [13781, 13780, 17718, 15466]}
}

# You can change this to any integer. This represents the number of utilities you'd like
# to review (only applies when specified_utilities='largest').
specified_amount = 2 

# This can be 'all' or a list of any years within the FERC data, ex: [2006, 2007]
# These are the years you would like to consider fixing AND the years you would like to
# consider for determining largest capacity (the latter is only used when utilities='largest').
specified_years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019] 
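
Before running the rest of the notebook, it can help to confirm that `specified_utilities` has the shape the later cells rely on: a dict keyed by utility name, where each value holds a `utility_id_pudl` list and a `utility_id_eia` list. The check below is a minimal sketch; `check_utility_inputs` is a hypothetical helper and is not part of this notebook or of `pudl_rmi`.

```python
def check_utility_inputs(utilities):
    """Return a list of problems found in a specified_utilities-style dict.

    Hypothetical helper for illustration only.
    """
    required_keys = ("utility_id_pudl", "utility_id_eia")
    problems = []
    for name, ids in utilities.items():
        for key in required_keys:
            values = ids.get(key)
            if not isinstance(values, list) or not values:
                problems.append(f"{name}: '{key}' must be a non-empty list")
            elif not all(isinstance(v, int) for v in values):
                problems.append(f"{name}: '{key}' must contain only integers")
    return problems


example = {
    "IDACORP": {"utility_id_pudl": [140], "utility_id_eia": [9191]},
    "Broken": {"utility_id_pudl": "140"},  # wrong type and a missing key
}
for problem in check_utility_inputs(example):
    print(problem)
```

An empty list means the inputs have the expected structure; it does not confirm that the IDs themselves refer to the intended utilities.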

**Now, run the entire notebook to prep the rest of the tools!**

<a id='verify-tools'></a>
## Download tools to verify connection between EIA and FERC1
When you uncomment and run the following function, you'll find a new Excel file called `fix_FERC-EIA_overrides.xlsx` in the `outputs` directory, created based on the inputs you specified above. Read the [Override Instructions](https://docs.google.com/document/d/1nJfmUtbSN-RT5U2Z3rJKfOIhWsRFUPNxs9NKTes0SRA/edit#) to learn how to begin fixing/verifying the FERC-EIA connections.

**Warning:** Running this function will REPLACE any override tools you currently have saved (unless you have changed their name or location). If you are in the middle of working on one of the output files, either DO NOT run this function or move/rename the file you are working on first.

**Note:** If you choose to move the override file you're working on, make sure to *copy* and paste it in a new location. The following function depends on the existence of an Excel file called `fix_FERC-EIA_overrides.xlsx` in the `outputs` directory in order to run.
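
One way to guard against losing work is to copy the existing workbook to a timestamped file before regenerating it. The following is a sketch under that assumption; `backup_override_file` is a hypothetical helper that does not exist in the notebook, and the demonstration runs in a temporary directory so no real files are touched.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_override_file(path):
    """Copy an existing override workbook to a timestamped sibling file.

    Hypothetical helper: returns the backup path, or None if there is
    nothing to back up.
    """
    path = Path(path)
    if not path.exists():
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.stem}_{stamp}{path.suffix}")
    shutil.copy2(path, backup)  # the original file stays in place
    return backup


# Demonstration with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    workbook = Path(tmp) / "fix_FERC-EIA_overrides.xlsx"
    workbook.write_bytes(b"placeholder contents")
    copy_path = backup_override_file(workbook)
    backup_ok = copy_path.exists() and workbook.exists()
    missing = backup_override_file(Path(tmp) / "does_not_exist.xlsx")
print(backup_ok, missing)
```

Because the helper copies rather than moves, the original workbook remains where the download function expects to find it.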

In [111]:
%%time
for util_name, id_dict in specified_utilities.items():
    output_override_tools(
        check_connections, 
        mul, 
        deprish_df,
        util_name=util_name,
        utilities=id_dict,
        amount=specified_amount,
        years=specified_years,
    )

Making override file for Dominion
retreiving the ferc-eia connection for the given utilities
retreiving the MUL for the given utilities
retrieving depreciation data for the given utilities
outputing override tools to tabs

Making override file for Evergy
retreiving the ferc-eia connection for the given utilities
retreiving the MUL for the given utilities
retrieving depreciation data for the given utilities
outputing override tools to tabs

Making override file for IDACORP
retreiving the ferc-eia connection for the given utilities
retreiving the MUL for the given utilities
retrieving depreciation data for the given utilities
outputing override tools to tabs

Making override file for Duke
retreiving the ferc-eia connection for the given utilities
retreiving the MUL for the given utilities
retrieving depreciation data for the given utilities
outputing override tools to tabs

Making override file for BHE
retreiving the ferc-eia connection for the given utilities
retreiving the MUL for the 

In [119]:
test = check_connections[check_connections['plant_id_pudl']==246]
test.match_type.unique()
check_connections[check_connections['match_type'].isna()]

Unnamed: 0,verified,used_match_record,signature_1,signature_2,notes,record_id_eia_override_1,record_id_eia_override_2,record_id_eia_override_3,best_match,record_id_ferc1,record_id_eia,true_gran,report_year,match_type,plant_part,ownership,utility_id_eia,utility_id_pudl,utility_name_ferc1,utility_name_eia,plant_id_pudl,unit_id_pudl,generator_id,plant_name_ferc1,plant_name_eia,fuel_type_code_pudl_ferc1,fuel_type_code_pudl_eia,fuel_type_code_pudl_diff,net_generation_mwh_ferc1,net_generation_mwh_eia,net_generation_mwh_pct_diff,capacity_mw_ferc1,capacity_mw_eia,capacity_mw_pct_diff,capacity_factor_ferc1,capacity_factor_eia,capacity_factor_pct_diff,total_fuel_cost_ferc1,total_fuel_cost_eia,total_fuel_cost_pct_diff,total_mmbtu_ferc1,total_mmbtu_eia,total_mmbtu_pct_diff,fuel_cost_per_mmbtu_ferc1,fuel_cost_per_mmbtu_eia,fuel_cost_per_mmbtu_pct_diff,installation_year_ferc1,installation_year_eia,installation_year_diff
16042,,,,,,,,,,f1_steam_1994_12_1_0_1,,,1994,,,,,,AEP Generating Company,,,,,rockport unit 1,,,,,4668184.0,,,650.0,,,0.819843,,,,,,,,,,,,1984,,
16043,,,,,,,,,,f1_steam_1994_12_1_0_2,,,1994,,,,,,AEP Generating Company,,,,,rockport unit 2,,,,,4451312.0,,,650.0,,,0.781755,,,,,,,,,,,,1989,,
16044,,,,,,,,,,f1_steam_1994_12_1_0_3,,,1994,,,,,,AEP Generating Company,,,,,rockport,,coal,,,9119496.0,,,1300.0,,,0.800799,,,9.996752e+07,,,8.921254e+07,,,1.120555,,,1989,,
16045,,,,,,,,,,f1_steam_1994_12_1_0_4,,,1994,,,,,,AEP Generating Company,,,,,rockport total plant,,coal,,,17793158.0,,,2600.0,,,0.781224,,,1.948474e+08,,,1.739994e+08,,,1.119817,,,1989,,
16046,,,,,,,,,,f1_steam_1996_12_1_0_4,,,1996,,,,,,AEP Generating Company,,,,,rockport total plant,,coal,,,,,,2600.0,,,,,,1.841852e+08,,,1.687690e+08,,,1.091345,,,1989,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49602,,,,,,,,,,f1_pumped_storage_2017_12_519_0_1,,,2017,,,,,,upper michigan energy resources company (pudl ...,,,,,none,,,,,,,,0.0,,,,,,,,,,,,,,,,,
49603,,,,,,,,,,f1_pumped_storage_2018_12_519_0_1,,,2018,,,,,,upper michigan energy resources company (pudl ...,,,,,none,,,,,,,,0.0,,,,,,,,,,,,,,,,,
49604,,,,,,,,,,f1_pumped_storage_2018_12_152_0_1,,,2018,,,,,,Rockland Electric Company,,,,,non-applicable,,,,,,,,0.0,,,,,,,,,,,,,,,,,
49605,,,,,,,,,,f1_pumped_storage_2019_12_152_0_1,,,2019,,,,,,Rockland Electric Company,,,,,non-applicable,,,,,,,,0.0,,,,,,,,,,,,,,,,,


<a id='upload-overrides'></a>
## Upload changes to training data
When you've finished editing the `fix_FERC-EIA_overrides.xlsx` and want to add your changes to the official override csv, move your file to the directory called `overrides` and then uncomment and run the following functions. 

**Note:** If you have changed or marked TRUE any records that have already been overridden and included in the training data, you will want to set `expect_override_overrides = True`. Otherwise, the function will check to see if you have accidentally tampered with values that have already been matched.
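
The check that `expect_override_overrides` switches off amounts to asking whether any newly verified record already appears in the training data. A minimal sketch of that idea with plain Python sets follows; the record IDs are made up, and `find_already_trained` is a hypothetical helper rather than the notebook's `validate_override_fixes`.

```python
def find_already_trained(new_ferc_ids, trained_ferc_ids):
    """Return new FERC record IDs that already exist in the training data."""
    return sorted(set(new_ferc_ids) & set(trained_ferc_ids))


# Made-up record IDs for illustration.
trained = ["f1_steam_2018_12_1_0_1", "f1_steam_2018_12_1_0_2"]
newly_verified = ["f1_steam_2018_12_1_0_2", "f1_steam_2019_12_1_0_1"]

overlap = find_already_trained(newly_verified, trained)
expect_override_overrides = False
if overlap and not expect_override_overrides:
    print(f"Already in the training data: {overlap}")
```

If an overlap like this is intentional (you meant to replace an earlier override), that is the case for setting `expect_override_overrides = True`.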

In [2]:
expect_override_overrides = True

In [5]:
training_data_out = (
    combine_new_overrides(expect_override_overrides)
    .pipe(combine_all_overrides, training_data)
)

#training_data_out.to_csv(training_path)

NameError: name 'combine_new_overrides' is not defined

In [31]:
training_data_out.to_excel('/Users/aesharpe/Desktop/updated_overrides.xlsx')

----------

## Notebook Setup

In [81]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [82]:
import pandas as pd
import numpy as np
import pathlib
import pudl
import pudl.constants as pc
import pudl.extract.ferc1
import sqlalchemy as sa
import logging
import sys
import copy
from copy import deepcopy
import scipy
import statistics
import yaml
import os

import recordlinkage as rl
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [83]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [84]:
sys.path.append("../")
from pudl.output.ferc1 import *
from pudl_rmi.connect_ferc1_to_eia import *
from pudl_rmi.make_plant_parts_eia import *
import pudl_rmi.connect_ferc1_to_eia
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
ferc_engine = sa.create_engine(pudl_settings['ferc1_db'])
pd.options.display.max_columns = None

In [85]:
relevant_cols_ferc_eia = [
    'record_id_ferc1',
    'record_id_eia',
    'true_gran',
    'report_year',
    'match_type',
    'plant_part',
    'ownership',
    'utility_id_eia',
    'utility_id_pudl',
    'utility_name_ferc1',
    'plant_id_pudl',
    'unit_id_pudl',
    'generator_id',
    'plant_name_ferc1',
    'plant_name_new',
    'fuel_type_code_pudl_ferc1',
    'fuel_type_code_pudl_eia',
    'net_generation_mwh_ferc1',
    'net_generation_mwh_eia',
    'capacity_mw_ferc1',
    'capacity_mw_eia',
    'capacity_factor_ferc1',
    'capacity_factor_eia',
    'total_fuel_cost_ferc1',
    'total_fuel_cost_eia',
    'total_mmbtu_ferc1',
    'total_mmbtu_eia',
    'fuel_cost_per_mmbtu_ferc1',
    'fuel_cost_per_mmbtu_eia',
    'installation_year_ferc1',
    'installation_year_eia',
]

relevant_cols_mul = [
    'record_id_eia',
    'report_year',
    'utility_id_pudl',
    'utility_id_eia',
    'utility_name_eia', # I add this in from the utils_eia860() table
    'operational_status_pudl',
    'true_gran',
    'plant_part',
    'ownership_dupe',
    'fraction_owned',
    'plant_id_eia',
    'plant_id_pudl',
    'plant_name_new',
    'generator_id',
    'capacity_mw',
    'capacity_factor',
    'net_generation_mwh',
    'installation_year',
    'fuel_type_code_pudl',
    'total_fuel_cost',
    'total_mmbtu',
    'fuel_cost_per_mmbtu',
    'heat_rate_mmbtu_mwh',
]

## **Part 1:** Generate Override Tools

### 1.1 Get current FERC-EIA & MUL tables
This is going to look a lot like the `connect_ferc1_to_eia.ipynb`.

In [86]:
#make_plant_parts_eia.get_master_unit_list_eia(file_path_mul, pudl_out, clobber=True)

In [87]:
file_path_training = pathlib.Path().cwd().parent /'inputs'/'train_ferc1_to_eia.csv'
file_path_mul = pathlib.Path().cwd().parent /'outputs' / 'master_unit_list.pkl.gz'
# pudl output object for ferc data
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS',fill_fuel_cost=True,roll_fuel_cost=True,fill_net_gen=True)

In [145]:
inputs = InputManager(file_path_training, file_path_mul, pudl_out)
features_all = (Features(feature_type='all', inputs=inputs)
                .get_features(clobber=False))
features_train = (Features(feature_type='training', inputs=inputs)
                  .get_features(clobber=False))
tuner = ModelTuner(features_train, inputs.get_train_index(), n_splits=10)

matcher = MatchManager(best=tuner.get_best_fit_model(), inputs=inputs)
matches_best = matcher.get_best_matches(features_train, features_all)

Preparing the FERC1 tables.
Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz
Generated 133268 all candidate features.
Generated 933 training candidate features.
We are about to test hyper parameters of the model while doing k-fold cross validation. This takes a few minutes....
Scores from the best model hyperparameters:
  F-Score:   0.86
  Precision: 0.9
  Accuracy:  0.71
Fit and predict a model w/ the highest scoring hyperparameters.
Get the top scoring match for each FERC1 steam record.
Winning match stats:
        matches vs ferc:      49.75%
        best match v ferc:    47.00%
        best match vs matches:94.48%
        murk vs matches:      2.95%
        ties vs matches:      1.05%
Overridden records:       19.2%
New best match v ferc:    47.44%


In [146]:
file_path_deprish = pathlib.Path().cwd().parent/'inputs'/'depreciation_rmi.xlsx'
sheet_name_deprish='Depreciation Studies Raw'
transformer = pudl_rmi.deprish.Transformer(
    pudl_rmi.deprish.Extractor(
        file_path=file_path_deprish,
        sheet_name=sheet_name_deprish
    ).execute())

Reading the depreciation data from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/inputs/depreciation_rmi.xlsx


In [147]:
connects_ferc1_eia = (
    prettyify_best_matches(
        matches_best, 
        plant_parts_true_df=inputs.plant_parts_true_df,
        steam_df=inputs.all_plants_ferc1_df,
        train_df=inputs.train_df)
    .copy()
)

jsuk there are some FERC-EIA matches that aren't in the steam                 table but this is because they are linked to retired EIA generators.
Coverage for matches during EIA working years:
    Fuel type: 97.1%
    Tech type: 31.0%

Coverage for all steam table records during EIA working years:
    EIA matches: 66.9

Coverage for all small gen table records during EIA working years:
    EIA matches: 0.2

Coverage for all hydro table records during EIA working years:
    EIA matches: 79.8

Coverage for all pumped storage table records during EIA working years:
    EIA matches: 13.7
Matches with consistency across years of all matches is 79.5%
Matches with completely consistent FERC capacity have a consistency of 92.9%
Matches with consistency across years of overrides matches is 48.9%
Matches with completely consistent FERC capacity have a consistency of 87.2%


In [97]:
mul = (
    make_plant_parts_eia.get_master_unit_list_eia(file_path_mul, pudl_out).reset_index()
    .copy()
)

Reading the master unit list from /Users/aesharpe/Desktop/Work/Catalyst_Coop/rmi-ferc1-eia/outputs/master_unit_list.pkl.gz


In [98]:
# Add utility_name
mul = pd.merge(
    mul, 
    pudl_out.utils_eia860()[['utility_id_eia', 'utility_name_eia', 'report_date']].copy(), 
    on=['utility_id_eia', 'report_date'],
    how='left',
    validate='m:1'
)

In [99]:
deprish_df = transformer.execute()

# of reserve_rate over 1 (100%): 1. Higher #s here may indicate an issue with the original data or the fill_in method
2.11% of records have correctable zero net_salvage_rate
2.1% of records have postive net_salvage_rate
0.03% of records have correctable zero net_salvage
2.3% of records have postive net_salvage
Added 16999 ferc_acct_name's out of 17210 options
aggregating to: ['report_date', 'plant_id_pudl', 'plant_part_name', 'ferc_acct', 'utility_id_pudl', 'data_source', 'line_id', 'utility_name_ferc1']
overriding auto-generated common associations with 470 mannual associations
grabbed 1157 common records
grabbed 1923 common reocrds and 14330 atomic records
allocating common for plant_balance
The resulting plant_balance allocated is 99.42% of the original
allocating common for book_reserve
The resulting book_reserve allocated is 99.87% of the original
allocating common for unaccrued_balance
The resulting unaccrued_balance allocated is 100.00% of the original
allocating common for net_

In [None]:
len(connects_ferc1_eia[connects_ferc1_eia['record_id_ferc1'].duplicated(keep=False)])/2

### 1.2 Curate columns

Here we'll limit the columns in the output file to those that will be useful for analysing match correctness. We'll also add some columns for you to use during the match verification process. All match types are included in the outputs (even those that have been correctly mapped according to the current overrides) just in case there is a discrepancy or error that we want to fix.

**Match Types:**

* `prediction`: prediction based on the training data.
* `correct_prediction`: prediction based on training data that matches record in the training data.
* `no prediction; training`: not filled in by the prediction algorithm but filled in by the training data.
* `overridden`: incorrectly filled in by the prediction algorithm and corrected by training data.
* `no_match`: a reviewer has found there to be no verified EIA match for the given FERC record.
* `NaN`: not filled in by the training data or the prediction algorithm. 
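
To make the distinctions in the list above concrete, here is a rough sketch of how a label could follow from comparing the predicted EIA record with the one in the training data. This is an assumption for illustration: the real labels are assigned inside `pudl_rmi`, `classify_match` is a hypothetical function, and `no_match` is left out because it reflects a reviewer's decision rather than a comparison.

```python
def classify_match(predicted_id, trained_id):
    """Label one FERC record from its predicted and training-data EIA IDs.

    Illustrative only: None stands for "not filled in".
    """
    if predicted_id is None and trained_id is None:
        return None  # appears as NaN in the table
    if trained_id is None:
        return "prediction"
    if predicted_id is None:
        return "no prediction; training"
    if predicted_id == trained_id:
        return "correct_prediction"
    return "overridden"


# Made-up EIA record IDs.
for predicted, trained in [("eia_a", None), ("eia_a", "eia_a"),
                           (None, "eia_a"), ("eia_b", "eia_a"), (None, None)]:
    print(predicted, trained, "->", classify_match(predicted, trained))
```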

In [109]:
# Grab the MUL and FERC-EIA connections that show the comparison between FERC and EIA values
check_connections = connects_ferc1_eia[relevant_cols_ferc_eia].copy()
mul = mul[relevant_cols_mul].copy()

# Add a column to tell whether it's a good match, who verified / made the match,
# and any notes about weirdness.
check_connections.insert(0, "verified", np.nan)
check_connections.insert(1, "used_match_record", np.nan)
check_connections.insert(2, "signature_1", np.nan)
check_connections.insert(3, "signature_2", np.nan)
check_connections.insert(4, "notes", np.nan)
check_connections.insert(6, "record_id_eia_override_1", np.nan)
check_connections.insert(7, "record_id_eia_override_2", np.nan)
check_connections.insert(8, "record_id_eia_override_3", np.nan)
check_connections.insert(9, "best_match", np.nan)


# put these in the right order to be filled in by pct_diff
check_connections.insert(26, "fuel_type_code_pudl_diff", np.nan)
check_connections.insert(29, "net_generation_mwh_pct_diff", np.nan)
check_connections.insert(32, "capacity_mw_pct_diff", np.nan)
check_connections.insert(35, "capacity_factor_pct_diff", np.nan)
check_connections.insert(38, "total_fuel_cost_pct_diff", np.nan)
check_connections.insert(41, "total_mmbtu_pct_diff", np.nan)
check_connections.insert(44, "fuel_cost_per_mmbtu_pct_diff", np.nan)
check_connections.insert(47, "installation_year_diff", np.nan)

# Fix some column names
check_connections.rename(
    columns={'utility_id_pudl_ferc1': 'utility_id_pudl', 
             'plant_id_pudl_ferc1': 'plant_id_pudl',
             'plant_name_new': 'plant_name_eia'}, inplace=True)

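def pct_diff(df, col):
    """Fill `{col}_pct_diff` where both the FERC1 and EIA values are positive."""
    both_positive = (df[f"{col}_eia"] > 0) & (df[f"{col}_ferc1"] > 0)
    df.loc[both_positive, f"{col}_pct_diff"] = round(
        (df[f"{col}_ferc1"] - df[f"{col}_eia"]) / df[f"{col}_ferc1"] * 100, 2
    )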

# Add pct diff columns
for col in ['net_generation_mwh', 'capacity_mw', 'capacity_factor', 
            'total_fuel_cost', 'total_mmbtu', 'fuel_cost_per_mmbtu']:
    pct_diff(check_connections, col)
    
# Add qualitative similarity columns (fuel_type_code_pudl)
check_connections.loc[
    (check_connections.fuel_type_code_pudl_eia.notna())
    & (check_connections.fuel_type_code_pudl_ferc1.notna()),
    "fuel_type_code_pudl_diff"
] = check_connections.fuel_type_code_pudl_eia == check_connections.fuel_type_code_pudl_ferc1

# Add quantitative similarity columns (installation year)
check_connections.loc[:, "installation_year_ferc1"] = check_connections.installation_year_ferc1.astype("Int64")
check_connections.loc[
    (check_connections.installation_year_eia.notna())
    & (check_connections.installation_year_ferc1.notna()),
    "installation_year_diff"
] = check_connections.installation_year_eia - check_connections.installation_year_ferc1

# Move record_id_ferc1
record_id_ferc1 = check_connections.pop('record_id_ferc1')
check_connections.insert(9, "record_id_ferc1", record_id_ferc1)

# Add utility name eia
utils = pudl_out.utils_eia860().assign(report_year=lambda x: x.report_date.dt.year)[['utility_id_eia', 'utility_name_eia', 'report_year']]

check_connections = (
    pd.merge(check_connections, utils,
             on=['utility_id_eia', 'report_year'], how='left', validate='m:1')
    .rename(columns={'utility_name_eia': 'utility_name'})
)
check_connections.insert(19, "utility_name_eia", check_connections.utility_name)
check_connections = check_connections.drop(columns=['utility_name'])
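
As a worked example of the percent-difference columns, the same rule can be applied to a single pair of values: the difference is taken relative to the FERC value, rounded to two decimals, and skipped unless both values are positive. This is a scalar sketch with made-up numbers; `pct_diff_value` is a hypothetical stand-in for the DataFrame-based `pct_diff` in the cell above.

```python
def pct_diff_value(ferc1_value, eia_value):
    """Percent difference relative to the FERC value, or None if undefined."""
    if ferc1_value is None or eia_value is None:
        return None
    # NaN fails both comparisons, so missing values are skipped as well.
    if not (ferc1_value > 0 and eia_value > 0):
        return None
    return round((ferc1_value - eia_value) / ferc1_value * 100, 2)


# Made-up capacities in MW.
print(pct_diff_value(650.0, 640.0))  # FERC slightly higher than EIA
print(pct_diff_value(100.0, 110.0))  # negative when EIA is higher
print(pct_diff_value(650.0, 0.0))    # skipped: EIA value is not positive
```

A positive result means the FERC value is larger than the EIA value; a negative result means the opposite.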

In [120]:
def is_best_match2(row):
    """Label which FERC-EIA comparisons for one record fall within tolerance."""
    message = []
    if abs(row.capacity_mw_pct_diff) < 6:
        message.append('cap')
    if abs(row.net_generation_mwh_pct_diff) < 6:
        message.append('net-gen')
    if abs(row.installation_year_diff) < 3:
        message.append('inst-y')

    return '_'.join(message)


check_connections['best_match'] = check_connections.apply(is_best_match2, axis=1)
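
The `best_match` label is easiest to read with a few concrete inputs. The sketch below reuses the thresholds from the cell above (under 6 percent for capacity and net generation, under 3 years for installation year) on plain numbers; `best_match_label` is a hypothetical helper and the values are made up.

```python
import math


def best_match_label(capacity_pct_diff, net_gen_pct_diff, install_year_diff):
    """Join the names of the comparisons that fall within tolerance.

    NaN never passes because any comparison with NaN is False.
    """
    checks = [
        ("cap", capacity_pct_diff, 6),
        ("net-gen", net_gen_pct_diff, 6),
        ("inst-y", install_year_diff, 3),
    ]
    return "_".join(name for name, value, limit in checks if abs(value) < limit)


print(best_match_label(1.5, -4.2, 0))                # all three within tolerance
print(best_match_label(1.5, math.nan, 12))           # only capacity qualifies
print(repr(best_match_label(40.0, 25.0, math.nan)))  # nothing qualifies
```

An empty string therefore means that none of the three comparisons supports the match, either because the values differ too much or because they are missing.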

In [124]:
check_connections[check_connections['match_type'].isna()].utility_id_pudl.unique()

<IntegerArray>
[<NA>]
Length: 1, dtype: Int64

### 1.3 Get utility and year subsets for editing

Not sure which PUDL ID you need? Use this cell to search for them by name:

In [66]:
def prep_inputs(check_connections_df, utilities='largest', amount=5, years='all'):
    """Validate the utility and year inputs and fill in their defaults."""
    all_plants_ferc1 = pudl_out.all_plants_ferc1().copy()
    max_year = all_plants_ferc1.report_year.max()
    min_year = all_plants_ferc1.report_year.min()

    if years == 'all':
        years = range(min_year, max_year + 1)
    else:
        assert isinstance(years, list), "years must be reported as a list if not 'all'"
        assert all(year in range(min_year, max_year + 1) for year in years), \
            "years must be 'all' or a valid year integer within the bounds of FERC reporting years"

    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]

    if utilities == 'largest':
        logger.info(f"getting pudl ids for the top {amount} largest utilities")
        utilities = (
            check_years
            .groupby(['utility_id_pudl_eia', 'utility_name_ferc1'])['capacity_mw_ferc1']
            .sum()
            .reset_index()
            .sort_values('capacity_mw_ferc1', ascending=False)
            .head(amount)
            .utility_id_pudl_eia
            .tolist()
        )
    else:
        assert isinstance(utilities, dict), \
            "if not 'largest', utilities must be presented as a dict of PUDL IDs and EIA IDs"

    return utilities, years

In [67]:
def get_ferc_eia_utilities_subset(check_connections_df, utilities, years):
    logger.info("retreiving the ferc-eia connection for the given utilities")
    check_years = check_connections_df[check_connections_df['report_year'].isin(years)]
    utils_pudl = utilities['utility_id_pudl']
    util_output = check_years[check_years['utility_id_pudl'].isin(utils_pudl)].copy()
    return util_output
    
def get_mul_subset(mul, utilities, years):
    logger.info("retreiving the MUL for the given utilities")
    mul_years = mul[mul['report_year'].isin(years)]
    utils_eia = utilities['utility_id_eia']
    utils_pudl = utilities['utility_id_pudl']
    mul_output = mul_years[(mul_years['utility_id_eia'].isin(utils_eia)) | (mul_years['utility_id_pudl'].isin(utils_pudl))]
    return mul_output

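def get_deprish_subset(deprish_df, utilities):
    logger.info("retrieving depreciation data for the given utilities")
    # `utilities` is a dict, so select the list of PUDL IDs before filtering
    utils_pudl = utilities['utility_id_pudl']
    deprish_output = deprish_df[deprish_df['utility_id_pudl'].isin(utils_pudl)]
    return deprish_output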

In [101]:
def output_override_tools(check_connections_df, mul, deprish_df, util_name, utilities='largest', amount=5, years='all'):
    
    logger.info(f"Making override file for {util_name}")
    
    utilities, years = prep_inputs(check_connections_df, utilities, amount, years)
    
    ferc_eia_util_subset = get_ferc_eia_utilities_subset(check_connections_df, utilities, years)
    # Add some functions to it
    ferc_eia_util_subset = (
        ferc_eia_util_subset.reset_index(drop=True)
        .assign(used_match_record=lambda x: "=(F" + (x.index+2).astype('str') + "=K" + (x.index+2).astype('str') + ")" )
    )
    
    mul_util_subset = get_mul_subset(mul, utilities, years)
    deprish_util_subset = get_deprish_subset(deprish_df, utilities)
    
    # Make sure mul subset isn't too big
    assert len(mul_util_subset) < 500000, (
        "Your MUL subset has at least 500,000 rows, which will make Excel "
        "really slow. Try entering a smaller utility or year subset."
    )
    
    # Create a dict of each df and the tab name you want to give it in the output
    tool_dict = {
        'ferc_eia_util_subset': ferc_eia_util_subset,
        'mul_util_subset': mul_util_subset,
        'deprish_util_subset': deprish_util_subset
    }
    
    output_path = pathlib.Path().cwd().parent / 'outputs'
    
    # Make sure overrides dir exists
    (output_path / 'overrides').mkdir(exist_ok=True)
        
    # Enable unique file names and put all files in directory called overrides
    new_output_path = output_path / 'overrides' / f'{util_name}_fix_FERC-EIA_overrides.xlsx'
    
    logger.info("outputing override tools to tabs\n")
    #pudl_rmi.connect_deprish_to_eia.save_to_workbook(output_path, tool_dict)
    
    # Output file to a folder called overrides; the context manager saves and closes the workbook
    with pd.ExcelWriter(new_output_path, engine='xlsxwriter') as writer:
        for tab, df in tool_dict.items():
            df.to_excel(writer, sheet_name=tab, index=False)
    
    
    return ferc_eia_util_subset
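
The `used_match_record` column written by `output_override_tools` holds an Excel formula rather than a value. Given the column order built in section 1.2, column F is `record_id_eia_override_1` and column K is `record_id_eia`, so the formula reports whether the reviewer's override equals the predicted match. The sketch below shows how the strings are formed; `match_formula` is a hypothetical helper written for illustration.

```python
def match_formula(row_index, override_col="F", match_col="K"):
    """Build the Excel formula comparing two cells on the same sheet row.

    Data row 0 lands on spreadsheet row 2: Excel rows are 1-indexed and
    row 1 holds the column headers.
    """
    excel_row = row_index + 2
    return f"=({override_col}{excel_row}={match_col}{excel_row})"


formulas = [match_formula(i) for i in range(3)]
print(formulas)
```

The column letters depend on the column order, so they would need to change if columns are added or removed before the two being compared.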

## **Part 2:** Re-incorporating Matched Records

Now that you've marked the correctly matched records as `TRUE`, we'll want to incorporate those into the permanent override list. All you have to do is move the `fix_FERC-EIA_overrides.xlsx` file to the `overrides` directory, run the following cells, and then run...

### 2.1 Update training data

In [69]:
fixed_overrides_path = pathlib.Path().cwd().parent / 'overrides' #/ #'fix_FERC-EIA_overrides.xlsx'
training_path = pathlib.Path().cwd().parent / 'inputs' / 'train_ferc1_to_eia_copy.csv'
training_data = pd.read_csv(training_path)

In [22]:
def validate_override_fixes(validated_connections, expect_override_overrides=False):
    """Process the verified / fixed matches."""
    logger.info("validating override fixes")
    
    # Make sure that there are no rogue descriptions in the verified field (besides TRUE)
    match_language = validated_connections.verified.unique()
    assert len(outliers:=[x for x in match_language if x not in [True, False]]) == 0, \
        f"All correct matches must be marked TRUE; found {outliers}"

    # Make it a boolean column
    validated_connections.loc[:, "verified"] = (
        validated_connections.verified.astype('bool'))

    # Get TRUE records
    true_connections = validated_connections[validated_connections['verified']].copy()
    #print(true_connections.columns.tolist())

    # Make sure that the eia and ferc ids haven't been tampered with
    assert len(bad_eia := [x for x in true_connections.dropna().record_id_eia.unique()
                        if x not in connects_ferc1_eia.record_id_eia.unique()]) == 0, \
        f"Found record_id_eia values that aren't in the existing FERC-EIA connection: {bad_eia}"
    assert len(bad_ferc := [x for x in true_connections.dropna().record_id_ferc1.unique()
                        if x not in connects_ferc1_eia.record_id_ferc1.unique()]) == 0, \
        f"Found record_id_ferc1 values that aren't in the existing FERC-EIA connection: {bad_ferc}"

    if not expect_override_overrides:
        # Make sure that these aren't already in the overrides (this should be impossible, but just in case)
        assert len(bad_eia := [x for x in true_connections.record_id_eia.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_eia.unique()]) == 0,  \
            f"Found record_id_eia values that are already in the existing FERC-EIA training data: {bad_eia}"
        assert len(bad_ferc := [x for x in true_connections.record_id_ferc1.unique()
                            if x in training_data.dropna(subset=['record_id_eia']).record_id_ferc1.unique()]) == 0, \
            f"Found record_id_ferc1 values that are already in the existing FERC-EIA training data: {bad_ferc}"
    
    return true_connections

In [102]:
def combine_new_overrides(expect_override_overrides):
    logger.info("combining all new override files")
    all_fixes = pd.DataFrame(columns=['record_id_eia', 'record_id_ferc1', 'signature_1', 'notes'])
    all_files = os.listdir(fixed_overrides_path)
    files = [file for file in all_files if not file.startswith('.')]
    for file in files:
        # No parentheses around the condition and message: a tuple is always truthy
        assert file.endswith('.xlsx'), (
            "fixing the overrides can only read .xlsx files; "
            "found other file types in the overrides directory")
    for file in files:
        logger.info(f"Processing fixes in {file}")
        file_df = (
            pd.read_excel(
                (fixed_overrides_path / file), 
                sheet_name='ferc_eia_util_subset')
            .rename(columns={'record_id_eia': 'record_id_eia_old', 'record_id_eia_override_1': 'record_id_eia'})
            .assign(
                verified=lambda x: x.verified.replace({'TRUE':True, np.nan: False}))
            .pipe(validate_override_fixes, expect_override_overrides=expect_override_overrides))
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        all_fixes = pd.concat([all_fixes, file_df[['record_id_eia', 'record_id_ferc1',
                                                   'signature_1', 'notes']]])
    return all_fixes

In [103]:
def combine_all_overrides(new_overrides, training_df):
    logger.info("combining all overrides")
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    training_data_out = (
        pd.concat([
            training_df,
            new_overrides[['record_id_eia', 'record_id_ferc1', 'signature_1', 'notes']]])
        .set_index(['record_id_eia', 'record_id_ferc1'])
    )
    return training_data_out

### 2.2 Export updated data

Move your updated version of the `fix_FERC-EIA_overrides.xlsx` file into the directory called `overrides`. This notebook will only process files to supplement the existing training data that are located in that folder.

## Explore

### Check Best Matches

In [320]:
check_connections.best_match.unique()

array(['cap', 'cap_net-gen', 'net-gen', '', 'cap_inst-y', 'inst-y',
       'cap_net-gen_inst-y', 'net-gen_inst-y'], dtype=object)

In [328]:
full_

62.97580661515686

In [342]:
full_len = len(check_connections)
matched_len = len(check_connections[check_connections['match_type'].isna()])

print("Best Match All %:")
print(round(
    (len(check_connections[check_connections['best_match']=='cap_net-gen_inst-y']) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match']=='cap_net-gen_inst-y']) 
    / matched_len * 100), 2), "pct of matched records \n")


print("Best Match Cap %:")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains('cap')]) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains('cap')]) 
    / matched_len * 100), 2), "pct of matched records \n")


print("Best Match Cap, Net Gen %:")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*cap)(?=.*net-gen)')]) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*cap)(?=.*net-gen)')]) 
    / matched_len * 100), 2), "pct of matched records \n")


print("Best Match Cap, Inst Year %:")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*cap)(?=.*inst-y)')]) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*cap)(?=.*inst-y)')]) 
    / matched_len * 100), 2), "pct of matched records \n")

print("Best Match Net Gen %:")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains('net-gen')]) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains('net-gen')]) 
    / matched_len * 100), 2), "pct of matched records \n")

print("Best Match Net Gen, Inst Year %:")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*net-gen)(?=.*inst-y)')]) 
    / full_len * 100), 2), "pct of all records")
print(round(
    (len(check_connections[check_connections['best_match'].str.contains(r'^(?=.*net-gen)(?=.*inst-y)')]) 
    / matched_len * 100), 2), "pct of matched records")

Best Match All %:
17.1 pct of all records
27.16 pct of matched records 

Best Match Cap %:
30.17 pct of all records
47.91 pct of matched records 

Best Match Cap, Net Gen %:
20.59 pct of all records
32.69 pct of matched records 

Best Match Cap, Inst Year %:
21.67 pct of all records
34.41 pct of matched records 

Best Match Net Gen %:
25.11 pct of all records
39.88 pct of matched records 

Best Match Net Gen, Inst Year %:
20.43 pct of all records
32.45 pct of matched records
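
The repeated `print(round(...))` blocks above all compute the same thing: the share of records whose `best_match` label contains a given set of parts. A possible way to condense them is sketched below. It assumes the labels are underscore-joined, as the `unique()` output suggests, and so it splits on `_` instead of using lookahead regexes. The helper name `pct_with_all` and the toy data are hypothetical.

```python
import pandas as pd


def pct_with_all(best_match, parts, denominator):
    """Percent of `denominator` whose best_match label contains every part.

    Labels look like 'cap_net-gen', so splitting on '_' yields the parts.
    """
    tokens = best_match.fillna("").str.split("_")
    hits = tokens.apply(lambda t: set(parts) <= set(t))
    return round(hits.sum() / denominator * 100, 2)


# Toy labels, not taken from the real table.
toy = pd.Series(["cap", "cap_net-gen", "", "cap_net-gen_inst-y", "inst-y"])

pct_cap = pct_with_all(toy, ["cap"], len(toy))
pct_cap_net_gen = pct_with_all(toy, ["cap", "net-gen"], len(toy))
print(pct_cap, pct_cap_net_gen)
```

Passing `full_len` or `matched_len` as `denominator` would give the "all records" and "matched records" variants respectively.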


### Check overrides

In [168]:
test = check_connections[check_connections['match_type'].isin(['override', 'correct prediction'])].copy()
test.to_excel('/Users/aesharpe/Desktop/test.xlsx')

In [231]:
mul[mul['record_id_eia']=='55202_2018_plant_total_19436']
mul[(mul['plant_id_eia']==55202) & (mul['report_year']==2018)].sort_values('capacity_mw')#.capacity_mw.unique()

Unnamed: 0,record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_factor,capacity_mw,capacity_mw_eoy,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,operational_status_pudl,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,record_count,retirement_date,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year,utility_name_eia
28536616,55202_3_2018_plant_gen_owned_19436,55202,2018-01-01,plant_gen,3,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,True,plant_gen,55202_3_2018_plant_gen_owned_19436,0.044016,45.0,45.0,1.0,,,gas,,2000,17350.934211,existing,operating,owned,True,NaT,452,Pinckneyville,Pinckneyville 3,8,NaT,,,334,2018,452_2018,Union Electric Co
28820997,55202_1_2018_plant_gen_total_19436,55202,2018-01-01,plant_gen,1,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,True,plant_gen,55202_1_2018_plant_gen_total_19436,0.044016,45.0,45.0,1.0,,,gas,,2000,17350.934211,existing,operating,total,False,NaT,452,Pinckneyville,Pinckneyville 1,8,NaT,,,334,2018,452_2018,Union Electric Co
28820996,55202_1_2018_plant_gen_total_19436,55202,2018-01-01,plant_gen,1,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,True,plant_gen,55202_1_2018_plant_gen_total_19436,0.044016,45.0,45.0,1.0,,,gas,,2000,17350.934211,existing,operating,total,False,NaT,452,Pinckneyville,Pinckneyville 1,8,NaT,,,334,2018,452_2018,Union Electric Co
28820995,55202_1_2018_plant_gen_total_19436,55202,2018-01-01,plant_gen,1,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,True,plant_gen,55202_1_2018_plant_gen_total_19436,0.044016,45.0,45.0,1.0,,,gas,,2000,17350.934211,existing,operating,total,False,NaT,452,Pinckneyville,Pinckneyville 1,8,NaT,,,334,2018,452_2018,Union Electric Co
28820994,55202_1_2018_plant_gen_total_19436,55202,2018-01-01,plant_gen,1,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,True,plant_gen,55202_1_2018_plant_gen_total_19436,0.044016,45.0,45.0,1.0,,,gas,,2000,17350.934211,existing,operating,total,False,NaT,452,Pinckneyville,Pinckneyville 1,8,NaT,,,334,2018,452_2018,Union Electric Co
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27899736,55202_ng_2018_plant_prime_fuel_owned_19436,55202,2018-01-01,plant_prime_fuel,,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,False,plant,55202_2018_plant_owned_19436,0.044016,380.0,380.0,1.0,,,gas,,2001,146519.000000,existing,operating,owned,True,NaT,452,Pinckneyville,Pinckneyville NG,1,NaT,,,334,2018,452_2018,Union Electric Co
27899735,55202_ng_2018_plant_prime_fuel_owned_19436,55202,2018-01-01,plant_prime_fuel,,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,False,plant,55202_2018_plant_owned_19436,0.044016,380.0,380.0,1.0,,,gas,,2001,146519.000000,existing,operating,owned,True,NaT,452,Pinckneyville,Pinckneyville NG,1,NaT,,,334,2018,452_2018,Union Electric Co
27899734,55202_ng_2018_plant_prime_fuel_owned_19436,55202,2018-01-01,plant_prime_fuel,,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,False,plant,55202_2018_plant_owned_19436,0.044016,380.0,380.0,1.0,,,gas,,2001,146519.000000,existing,operating,owned,True,NaT,452,Pinckneyville,Pinckneyville NG,1,NaT,,,334,2018,452_2018,Union Electric Co
27768900,55202_natural_gas_fired_combustion_turbine_201...,55202,2018-01-01,plant_technology,,,GT,NG,Natural Gas Fired Combustion Turbine,Other,19436,False,plant,55202_2018_plant_total_19436,0.044016,380.0,380.0,1.0,,,gas,,2001,146519.000000,existing,operating,total,False,NaT,452,Pinckneyville,Pinckneyville Natural Gas Fired Combustion Tur...,1,NaT,,,334,2018,452_2018,Union Electric Co


In [234]:
test = mul[mul['plant_id_eia']==55202][['report_year', 'plant_part', 'plant_name_eia', 'capacity_mw', 'utility_id_pudl', 'utility_name_eia']]#.capacity_mw.unique()
test2 = test[test['plant_part']=='plant_gen']
test[test['report_year']==2018].drop_duplicates()
#test

Unnamed: 0,report_year,plant_part,plant_name_eia,capacity_mw,utility_id_pudl,utility_name_eia
26957790,2018,plant,Pinckneyville,380.0,334,Union Electric Co
27271387,2018,plant_prime_mover,Pinckneyville,380.0,334,Union Electric Co
27607821,2018,plant_technology,Pinckneyville,380.0,334,Union Electric Co
27899727,2018,plant_prime_fuel,Pinckneyville,380.0,334,Union Electric Co
28201282,2018,plant_ferc_acct,Pinckneyville,380.0,334,Union Electric Co
28461240,2018,plant_gen,Pinckneyville,45.0,334,Union Electric Co
28567313,2018,plant_gen,Pinckneyville,50.0,334,Union Electric Co


In [233]:
aa = check_connections[check_connections['plant_id_pudl']==452].sort_values(['report_year'])[['report_year','plant_part', 'plant_name_ferc1', 'capacity_mw_ferc1', 'utility_id_eia', 'utility_id_pudl', 'utility_name_ferc1']].copy()
#aa[aa['report_year']==2018]
aa

Unnamed: 0,report_year,plant_part,plant_name_ferc1,capacity_mw_ferc1,utility_id_eia,utility_id_pudl,utility_name_ferc1
7256,2005,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
7943,2006,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
8639,2007,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
9362,2008,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
10125,2009,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
10911,2010,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
11704,2011,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
12476,2012,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
13240,2013,plant,pickneyville,404.8,19436,334,UNION ELECTRIC COMPANY
14013,2014,plant,pickneyville,488.4,19436,334,UNION ELECTRIC COMPANY


### Backfilling

See how well the `connects_ferc1_eia` table predicts back-fillable values.

In [12]:
connects_ferc1_eia

Unnamed: 0,record_id_ferc1,record_id_eia,match_type,plant_name_new,plant_part,report_year,ownership,plant_name_eia,plant_id_eia,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,utility_id_pudl_eia,true_gran,appro_part_label,appro_record_id_eia,record_count,fraction_owned,ownership_dupe,plant_id_pudl_eia,total_fuel_cost_eia,fuel_cost_per_mmbtu_eia,net_generation_mwh_eia,capacity_mw_eia,capacity_factor_eia,total_mmbtu_eia,heat_rate_mmbtu_mwh_eia,fuel_type_code_pudl_eia,installation_year_eia,utility_id_ferc1,utility_id_pudl_ferc1,utility_name_ferc1,plant_id_pudl_ferc1,plant_id_ferc1,plant_name_ferc1,asset_retirement_cost,avg_num_employees,capacity_factor_ferc1,capacity_mw_ferc1,capex_equipment,capex_land,capex_per_mw,capex_structures,capex_total,construction_type,construction_year,installation_year_ferc1,net_generation_mwh_ferc1,not_water_limited_capacity_mw,opex_allowances,opex_boiler,opex_coolants,opex_electric,opex_engineering,opex_fuel,fuel_cost_per_mwh,opex_misc_power,opex_misc_steam,opex_nonfuel,opex_nonfuel_per_mwh,opex_operations,opex_per_mwh,opex_plant,opex_production_total,opex_rents,opex_steam,opex_steam_other,opex_structures,opex_transfer,peak_demand_mw,plant_capability_mw,plant_hours_connected_while_generating,plant_type,water_limited_capacity_mw,total_fuel_cost_ferc1,total_mmbtu_ferc1,fuel_type_code_pudl_ferc1,plant_name_original,ferc_license_id,fuel_cost_per_mmbtu_ferc1,fuel_type,opex_maintenance,opex_total,total_cost_of_plant,capex_facilities,capex_roads,net_capacity_adverse_conditions_mw,net_capacity_favorable_conditions_mw,opex_dams,opex_generation_misc,opex_hydraulic,opex_misc_plant,opex_water_for_power,capex_equipment_electric,capex_equipment_misc,capex_wheels_turbines_generators,energy_used_for_pumping_mwh,net_load_mwh,opex_production_before_pumping,opex_pumped_storage,opex_pumping,heat_rate_mmbtu_mwh_ferc1,plant_id_report_year,plant_id_report_year_util_id,_merge,report_date
0,f1_gnrt_plant_2004_12_115_0_8,2528_2004_plant_total_13511,prediction,Harris Lake,plant,2004,total,Harris Lake,2528,1,,IC,DFO,Petroleum Liquids,Other,13511,213,True,plant,2528_2004_plant_total_13511,1.0,1.0,False,2087,,,78.002,1.7,0.005224,,,oil,2017,115,213,New York State Electric & Gas Corporation,2087,,harris lake,,,0.010207,1.70,,,,,,,1967.0,,152.0,,,,,,,17991.0,118.361842,,,,,,,,,,,,,,1.8,,,internal_combustion,,,,,harris lake,,,diesel,97514.0,26527.0,391459.0,,,,,,,,,,,,,,,,,,,2087_2004,2087_2004_213,both,2004-01-01
1,f1_gnrt_plant_2004_12_115_0_9,8009_2004_plant_total_13511,prediction,Auburn State Street,plant,2004,total,Auburn State Street,8009,1,,GT,NG,Natural Gas Fired Combustion Turbine,Other,13511,213,True,plant,8009_2004_plant_total_13511,1.0,1.0,False,675,,,43.000,7.0,0.000699,,,gas,2000,115,213,New York State Electric & Gas Corporation,675,,auburn gas turbine,,,0.000750,7.00,,,,,,,2000.0,,46.0,,,,,,,,,,,,,,,,,,,,,,7.2,,,gas_turbine,,,,,auburn gas turbine,,,natural gas,1700.0,824614.0,,,,,,,,,,,,,,,,,,,,675_2004,675_2004_213,both,2004-01-01
2,f1_gnrt_plant_2004_12_119_0_1,998_2004_plant_total_13756,prediction,Norway (IN),plant,2004,total,Norway (IN),998,,,HY,WAT,Conventional Hydroelectric,Hydraulic,13756,222,True,plant,998_2004_plant_total_13756,1.0,1.0,False,785,,,28629.000,7.2,0.452670,,,hydro,1923,119,222,Northern Indiana Public Service Company,785,,norway,,,466.870515,7.00,,,867072.0,,,,1923.0,,28628500.0,,,,,,,,,,,,,,,,,,,,,,7.0,,,hydro,,,,,norway,,,hydro,207675.0,99461.0,6069505.0,,,,,,,,,,,,,,,,,,,785_2004,785_2004_222,both,2004-01-01
3,f1_gnrt_plant_2004_12_120_0_3,1926_2004_plant_total_13781,prediction,Red Wing,plant,2004,total,Red Wing,1926,,,ST,MSW,Municipal Solid Waste,Other,13781,224,True,plant,1926_2004_plant_total_13781,1.0,1.0,False,800,,,113279.995,23.0,0.560703,,,waste,1949,120,224,Northern States Power Company (Minnesota),800,,red wing,,,0.569282,23.00,,,1821539.0,,,,1949.0,,114699.0,,,,,,,2458045.0,21.430396,,,,,,,,,,,,,,20.0,,,steam_heat,,,,,red wing,,,"rdf, gas",1645170.0,4658364.0,41979900.0,,,,,,,,,,,,,,,,,,,800_2004,800_2004_224,both,2004-01-01
4,f1_gnrt_plant_2004_12_122_0_3,3341_2004_plant_total_13809,prediction,Clark (SD),plant,2004,total,Clark (SD),3341,1,,IC,DFO,Petroleum Liquids,Other,13809,226,True,plant,3341_2004_plant_total_13809,1.0,1.0,False,2303,,,-60.996,2.7,-0.002572,,,oil,1970,122,226,NorthWestern Corporation,2303,,clark,,,-0.002532,2.75,,,131508.0,,,,1970.0,,-61.0,,,,,,,5777.0,-94.704918,,,,,,,,,,,,,,2.2,,,internal_combustion,,,,,clark,,,oil,903.0,2614.0,361646.0,,,,,,,,,,,,,,,,,,,2303_2004,2303_2004_226,both,2004-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49306,f1_pumped_storage_2017_12_519_0_1,,,,,2017,,,,,,,,,,,,,,,,,,,,,,,,,,,,519,5515,upper michigan energy resources company (pudl ...,14204,,none,,,,0.00,,,0.0,,,unknown,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,14204_2017,14204_2017_5515,right_only,2017-01-01
49307,f1_pumped_storage_2018_12_519_0_1,,,,,2018,,,,,,,,,,,,,,,,,,,,,,,,,,,,519,5515,upper michigan energy resources company (pudl ...,14204,,none,,,,0.00,,,0.0,,,unknown,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,14204_2018,14204_2018_5515,right_only,2018-01-01
49308,f1_pumped_storage_2018_12_152_0_1,,,,,2018,,,,,,,,,,,,,,,,,,,,,,,,,,,,152,281,Rockland Electric Company,11330,,non-applicable,,,,0.00,,,0.0,,,unknown,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,11330_2018,11330_2018_281,right_only,2018-01-01
49309,f1_pumped_storage_2019_12_152_0_1,,,,,2019,,,,,,,,,,,,,,,,,,,,,,,,,,,,152,281,Rockland Electric Company,11330,,non-applicable,,,,0.00,,,0.0,,,unknown,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,11330_2019,11330_2019_281,right_only,2019-01-01


In [15]:
override_ferc_ids = (
    connects_ferc1_eia[connects_ferc1_eia['match_type'].isin(['overridden', 'correct_prediction'])]
    .dropna(subset=['plant_id_ferc1'])
)

override_dict = dict(zip(override_ferc_ids['plant_id_ferc1'], override_ferc_ids['record_id_eia']))
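
`dict(zip(...))` keeps only the last value seen for a repeated key, so the mapping above is safe only if each `plant_id_ferc1` appears once among the override rows. Below is a minimal sketch of a guard that makes this assumption explicit. The helper `build_override_dict` is hypothetical and the IDs in the toy frame are invented.

```python
import pandas as pd


def build_override_dict(df, key="plant_id_ferc1", value="record_id_eia"):
    """Map key -> value, raising if any key appears more than once."""
    dupes = df.loc[df[key].duplicated(keep=False), key].unique().tolist()
    if dupes:
        raise ValueError(f"Duplicate {key} values: {sorted(dupes)}")
    return dict(zip(df[key], df[value]))


# Toy rows with one repeated FERC plant ID.
toy = pd.DataFrame({
    "plant_id_ferc1": [10, 11, 11],
    "record_id_eia": ["a_2018", "b_2018", "c_2018"],
})

# Plain dict(zip(...)) silently drops the (11, 'b_2018') pair.
silent = dict(zip(toy["plant_id_ferc1"], toy["record_id_eia"]))
print(silent)

try:
    build_override_dict(toy)
except ValueError as err:
    print(err)
```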


In [28]:
# Why are there some overridden = NA rows? 
# I checked the training csv and these ferc records have an EIA record associated with them, 
# it's just not showing up. I wonder if it's because these are the retired ones...
override_ferc_ids[override_ferc_ids['record_id_eia'].isna()].record_id_ferc1

17123    f1_steam_2018_12_191_0_1
17287     f1_steam_2018_12_45_2_1
17288     f1_steam_2018_12_45_2_2
17408     f1_steam_2018_12_79_0_1
18215     f1_steam_2018_12_51_0_1
18219    f1_steam_2018_12_182_0_1
18223     f1_steam_2018_12_45_1_3
18224     f1_steam_2018_12_45_1_2
18225     f1_steam_2018_12_45_1_1
18226     f1_steam_2018_12_45_1_5
18227     f1_steam_2018_12_45_1_4
18248     f1_steam_2018_12_56_2_3
Name: record_id_ferc1, dtype: object

In [40]:
# Ya, looks like these might be the retired records! Let's double check all of them...
mul[mul['record_id_eia'].str.contains('617_gt_2018_plant_prime_mover_owned_6452')]

Unnamed: 0,record_id_eia,plant_id_eia,report_date,plant_part,generator_id,unit_id_pudl,prime_mover_code,energy_source_code_1,technology_description,ferc_acct_name,utility_id_eia,true_gran,appro_part_label,appro_record_id_eia,capacity_factor,capacity_mw,capacity_mw_eoy,fraction_owned,fuel_cost_per_mmbtu,fuel_cost_per_mwh,fuel_type_code_pudl,heat_rate_mmbtu_mwh,installation_year,net_generation_mwh,operational_status,operational_status_pudl,ownership,ownership_dupe,planned_retirement_date,plant_id_pudl,plant_name_eia,plant_name_new,record_count,retirement_date,total_fuel_cost,total_mmbtu,utility_id_pudl,report_year,plant_id_report_year
2136696,617_gt_2018_plant_prime_mover_owned_6452_retired,617,2018-01-01,plant_prime_mover,,,GT,NG,Natural Gas Fired Combustion Turbine,Other,6452,True,plant_prime_mover,617_GT_2018_plant_prime_mover_owned_6452_retired,,410.4,0.0,1.0,,,gas,,,,retired,retired,owned,True,NaT,461,Port Everglades,Port Everglades GT,2,2016-12-01,,,121,2018,461_2018


In [None]:
# # For the override plants, how often are the other records of the same ferc id matched?

# override_ferc_ids = (
#     connects_ferc1_eia[connects_ferc1_eia['match_type'].isin(['overridden', 'correct_prediction'])]
#     .dropna(subset=['plant_id_ferc1'])
# )

# override_ferc_ids.assign(test=lambda x: tt(x.record_id_eia))



# # this only works because there are no duplicate plant_id_ferc1 values that are overrides
# override_dict = dict(zip(override_ferc_ids['plant_id_ferc1'], override_ferc_ids['record_id_eia']))


In [None]:
# #def check(df, ferc_id):
# df = connects_ferc1_eia

# for ferc_id, eia_id in override_dict.items():
#     ferc_id_view = df[
#         (df['plant_id_ferc1']==ferc_id)
#         & (~df['match_type'].isin(['overridden', 'correct_prediction']))]

#     total_records = len(ferc_id_view)
#     total_records_notna = len(ferc_id_view[ferc_id_view['record_id_eia'].notna()])
#     pct_matched = total_records_notna / total_records * 100

#     total_correct_matches = len(ferc_id_view[ferc_id_view['record_id_eia']==eia_id])
#     pct_correct_match = total_correct_matches / total_records * 100
#     pct_correct_of_matches = total_correct_matches / total_records_notna * 100
    
#     print(ferc_id)
#     print(f"total records: {total_records}")
#     print(f"PCT MATCHED: {pct_matched}")
#     print(f"PCT CORRECT: {pct_correct_match}")
#     print(f"PCT OF MATCHES CORRECT: {pct_correct_of_matches}")
#     print("\n")
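
The commented-out loop above divides by `total_records` and `total_records_notna`, either of which can be zero for a FERC plant ID with no other records. One way to restructure it so that it returns a table and avoids the division error is sketched below. The function name `match_rates` and the toy data are hypothetical, and the comparison logic simply mirrors the loop rather than claiming to be the right measure of back-fill quality.

```python
import pandas as pd


def match_rates(df, override_dict,
                override_types=("overridden", "correct_prediction")):
    """Per FERC plant ID, summarise how the non-override records were matched."""
    rows = []
    for ferc_id, eia_id in override_dict.items():
        view = df[(df["plant_id_ferc1"] == ferc_id)
                  & ~df["match_type"].isin(override_types)]
        total = len(view)
        matched = int(view["record_id_eia"].notna().sum())
        correct = int((view["record_id_eia"] == eia_id).sum())
        rows.append({
            "plant_id_ferc1": ferc_id,
            "total_records": total,
            # None (shown as NaN) when there is nothing to divide by
            "pct_matched": matched / total * 100 if total else None,
            "pct_correct": correct / total * 100 if total else None,
            "pct_of_matches_correct": correct / matched * 100 if matched else None,
        })
    return pd.DataFrame(rows)


# Toy data: plant 1 has three non-override records, plant 2 has none.
toy = pd.DataFrame({
    "plant_id_ferc1": [1, 1, 1, 1, 2],
    "match_type": ["overridden", "prediction", "prediction", None, "overridden"],
    "record_id_eia": ["x", "x", "y", None, "z"],
})
rates = match_rates(toy, {1: "x", 2: "z"})
print(rates)
```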

In [None]:
import pickle

file_path = '/Users/aesharpe/Desktop/master_unit_list.pkl'

with open(file_path, 'rb') as handle:
    mul_test = pickle.load(handle)

In [196]:
import pickle

with open('/Users/aesharpe/Desktop/why_63.pkl', 'rb') as handle:
    why_63 = pickle.load(handle)
    
why_63 = why_63.iloc[:, :8]

In [199]:
gens = pudl_out.gens_eia860()
why_65 = gens[gens['utility_id_eia'].isna()].iloc[:, :8]

In [202]:
why_63.merge(why_65, how='outer', indicator=True).loc[lambda x: x['_merge']=='right_only'].head()

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,_merge
63,2008-01-01,10682,3345,Colorado Power Partners,,1296,Colorado Energy Management,ST,right_only
64,2008-01-01,55320,4501,Wise County Power LP,,3797,"Wise County Power Co., LP",GT3,right_only
65,2009-01-01,675,1556,Larsen Memorial,,1063,Lakeland City of,7,right_only
66,2009-01-01,10682,3345,Colorado Power Partners,,1296,Colorado Energy Management,ST,right_only
67,2009-01-01,50274,3563,Simplot Leasing Don Plant,,3173,Simplot Leasing Corp,1,right_only


In [203]:
why_63.merge(why_65, how='outer', indicator=True).loc[lambda x: x['_merge']=='left_only'].head()

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,_merge
0,2009-01-01,675,1562,Larsen Memorial,,1063,Lakeland City of,7,left_only
1,2009-01-01,10682,3355,Colorado Power Partners,,1296,Colorado Energy Management,ST,left_only
2,2009-01-01,50274,3573,Simplot Leasing Don Plant,,3173,Simplot Leasing Corp,1,left_only
3,2009-01-01,55320,4511,Wise County Power LP,,3797,"Wise County Power Co., LP",GT3,left_only
5,2010-01-01,675,1562,Larsen Memorial,,1063,Lakeland City of,7,left_only


In [204]:
why_63[why_63['plant_name_eia']=='Colorado Power Partners']

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id
288670,2009-01-01,10682,3355,Colorado Power Partners,,1296,Colorado Energy Management,ST
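
The outer merges above join on every shared column, so a row that differs in a single column shows up once as `left_only` and once as `right_only`. In the outputs, the 2009 Colorado Power Partners record has `plant_id_pudl` 3355 in `why_63` and 3345 in `why_65`. A sketch for pinpointing which columns differ follows. It assumes a subset of columns is enough to identify a row; that choice of keys and the helper `diff_columns` are hypothetical, and the toy frames reuse only values visible in the outputs above.

```python
import pandas as pd


def diff_columns(left, right, keys):
    """Join on keys and flag, per row, which shared non-key columns differ."""
    merged = left.merge(right, on=keys, suffixes=("_left", "_right"))
    others = [c for c in left.columns if c not in keys and c in right.columns]
    out = merged[keys].copy()
    for col in others:
        lhs, rhs = merged[f"{col}_left"], merged[f"{col}_right"]
        # Treat NaN on both sides as equal so missing utility IDs are not flagged
        out[f"{col}_differs"] = ~((lhs == rhs) | (lhs.isna() & rhs.isna()))
    return out


keys = ["report_date", "plant_id_eia", "generator_id"]
left = pd.DataFrame({
    "report_date": ["2009-01-01"], "plant_id_eia": [10682],
    "generator_id": ["ST"], "plant_id_pudl": [3355],
    "utility_id_eia": [float("nan")],
})
right = left.assign(plant_id_pudl=[3345])

result = diff_columns(left, right, keys)
print(result)
```

Here only `plant_id_pudl_differs` is flagged, which would point at the PUDL plant ID assignment rather than the EIA data as the source of the mismatch.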
