# Use Splink to match FERC1 plants to EIA plant parts

This notebook walks through how to use Splink to match FERC1 plants to EIA plant parts, as is done in `pudl.analysis.record_linkage.eia_ferc1_record_linkage_model.py`. Splink provides several visualizations during model training that help in understanding the model weights and the input datasets. Those visualizations are not captured in the PUDL module that implements this model, so this companion notebook shows how to use Splink for model development.

The [Splink docs](https://moj-analytical-services.github.io/splink/index.html) include helpful tutorials, and the project's GitHub issues and discussions are also useful places to look.

In [None]:
%load_ext autoreload
%autoreload 3

In [None]:
import jellyfish
import sqlalchemy as sa
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import pandas as pd

import pudl
from pudl.analysis.record_linkage import eia_ferc1_record_linkage as eia_ferc1_model
from pudl.analysis.record_linkage.name_cleaner import CompanyNameCleaner
from pudl.analysis.record_linkage.embed_dataframe import _fill_fuel_type_from_name
from pudl.analysis.record_linkage import eia_ferc1_model_config
from pudl.etl import defs

In [None]:
pudl_engine = sa.create_engine(pudl.workspace.setup.PudlPaths().pudl_db)

# Get model inputs and preprocess

In [None]:
out_ferc1__yearly_all_plants = defs.load_asset_value("out_ferc1__yearly_all_plants")
out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 = defs.load_asset_value("out_ferc1__yearly_steam_plants_fuel_by_plant_sched402")
out_eia__yearly_plant_parts = defs.load_asset_value("out_eia__yearly_plant_parts")

In [None]:
inputs = eia_ferc1_model.get_compiled_input_manager(out_ferc1__yearly_all_plants,
                                                    out_ferc1__yearly_steam_plants_fuel_by_plant_sched402,
                                                    out_eia__yearly_plant_parts)

In [None]:
eia_df, ferc_df = eia_ferc1_model.get_input_dfs(inputs)
train_df = eia_ferc1_model.get_training_data_df(inputs)

In [None]:
plant_name_cleaner = eia_ferc1_model.plant_name_cleaner
utility_name_cleaner = CompanyNameCleaner(legal_term_location=2)

In [None]:
ferc_df["plant_name"] = plant_name_cleaner.apply_name_cleaning(ferc_df["plant_name"])
ferc_df["utility_name"] = utility_name_cleaner.apply_name_cleaning(ferc_df["utility_name"])
ferc_df["fuel_type_code_pudl"] = _fill_fuel_type_from_name(ferc_df, "fuel_type_code_pudl", "plant_name")

In [None]:
eia_df["plant_name"] = plant_name_cleaner.apply_name_cleaning(eia_df["plant_name"])
eia_df["utility_name"] = utility_name_cleaner.apply_name_cleaning(eia_df["utility_name"])
eia_df["fuel_type_code_pudl"] = _fill_fuel_type_from_name(eia_df, "fuel_type_code_pudl", "plant_name")

In [None]:
ferc_df["installation_year"] = pd.to_datetime(ferc_df["installation_year"], format="%Y")
ferc_df["construction_year"] = pd.to_datetime(ferc_df["construction_year"], format="%Y")
eia_df["installation_year"] = pd.to_datetime(eia_df["installation_year"], format="%Y")
eia_df["construction_year"] = pd.to_datetime(eia_df["construction_year"], format="%Y")

In [None]:
def _get_metaphone(row, col_name):
    if pd.isnull(row[col_name]):
        return None
    return jellyfish.metaphone(row[col_name])

In [None]:
eia_df["plant_name_mphone"] = eia_df.apply(_get_metaphone, axis=1, args=("plant_name",))
ferc_df["plant_name_mphone"] = ferc_df.apply(_get_metaphone, axis=1, args=("plant_name",))

In [None]:
eia_df["utility_name_mphone"] = eia_df.apply(_get_metaphone, axis=1, args=("utility_name",))
ferc_df["utility_name_mphone"] = ferc_df.apply(_get_metaphone, axis=1, args=("utility_name",))

In [None]:
cols = eia_ferc1_model.ID_COL + eia_ferc1_model.MATCHING_COLS + eia_ferc1_model.EXTRA_COLS
eia_df = eia_df[cols]
ferc_df = ferc_df[cols]

# Set settings dictionary and create linker

In [None]:
settings_dict = {"link_type": "link_only",
                 "unique_id_column_name": "record_id",
                 "additional_columns_to_retain": ["plant_id_pudl", "utility_id_pudl", "utility_name_mphone", "plant_name_mphone"]}

In [None]:
linker = DuckDBLinker([eia_df, ferc_df], input_table_aliases = ["eia_df", "ferc_df"], settings_dict=settings_dict)

In [None]:
train_table = linker.register_table(train_df, "training_labels", overwrite=True)

# Data Exploration

In [None]:
linker.completeness_chart(cols=eia_ferc1_model.MATCHING_COLS)

Columns with higher cardinality are better for matching. Note the skew in `fuel_type_code_pudl`, which means we'll need to use a term frequency adjustment: agreement on a very common value is weaker evidence of a match than agreement on a rare one.
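
To see why skew matters, consider the chance that two unrelated records agree on a column purely by coincidence. The sketch below uses made-up fuel type counts (not PUDL data) and computes that chance overall, alongside the frequency of each value. It is only an illustration of the motivation, not of how Splink applies the adjustment internally.

```python
from collections import Counter

# Hypothetical fuel type values, invented for illustration (not PUDL data).
values = ["gas"] * 70 + ["coal"] * 20 + ["hydro"] * 8 + ["nuclear"] * 2

counts = Counter(values)
total = sum(counts.values())
freqs = {value: count / total for value, count in counts.items()}

# Chance that two records drawn independently share the same value.
p_random_agreement = sum(f ** 2 for f in freqs.values())

# A value's frequency is also the chance that a random second record agrees
# with a first record holding that value, so rare values are more informative.
for value, f in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{value:8s} frequency={f:.2f}")
print(f"overall chance of coincidental agreement: {p_random_agreement:.4f}")
```

In this toy distribution, agreement on the dominant value happens by chance far more often than agreement on the rarest one, which is the situation a term frequency adjustment is meant to handle.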

In [None]:
linker.profile_columns(eia_ferc1_model.MATCHING_COLS, top_n=10, bottom_n=5)

# Generate blocking rules

Define blocking rules to reduce the search space of potential candidate pairs that the matching model must consider. See `pudl.analysis.record_linkage.eia_ferc1_model_config` for blocking rule definitions.

From the docs:
- "More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would not meet at least one of these blocking criteria. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10, though even more is possible."
- "For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons."
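
As a rough illustration of how blocking shrinks the search space, the sketch below counts candidate pairs between two tiny made-up tables, first with no blocking and then under two blocking rules. The records, column choices, and rules are hypothetical; the rules actually used live in `eia_ferc1_model_config.BLOCKING_RULES`.

```python
from itertools import product

# Toy records invented for illustration; not PUDL data.
eia = [
    {"id": "e1", "plant_name": "barry", "report_year": 2020},
    {"id": "e2", "plant_name": "barry", "report_year": 2021},
    {"id": "e3", "plant_name": "gaston", "report_year": 2020},
    {"id": "e4", "plant_name": "gorgas", "report_year": 2021},
]
ferc = [
    {"id": "f1", "plant_name": "barry", "report_year": 2020},
    {"id": "f2", "plant_name": "gaston", "report_year": 2020},
    {"id": "f3", "plant_name": "gorgas steam", "report_year": 2021},
]


def blocked_pairs(left, right, cols):
    """Return the ID pairs whose values agree on every column in cols."""
    return {
        (left_rec["id"], right_rec["id"])
        for left_rec, right_rec in product(left, right)
        if all(left_rec[c] == right_rec[c] for c in cols)
    }


all_pairs = len(eia) * len(ferc)
rule_1 = blocked_pairs(eia, ferc, ["plant_name"])
rule_2 = blocked_pairs(eia, ferc, ["report_year"])
# A pair becomes a candidate if it satisfies at least one rule.
candidates = rule_1 | rule_2

print(all_pairs, len(rule_1), len(rule_2), len(candidates))
```

The pair `("e4", "f3")` fails the plant name rule because the names differ slightly, but the report year rule still picks it up. That is why several complementary rules are recommended over a single strict one.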

In [None]:
linker.cumulative_num_comparisons_from_blocking_rules_chart(eia_ferc1_model_config.BLOCKING_RULES)

# Define Comparison Levels

In [None]:
print(eia_ferc1_model_config.plant_name_comparison.human_readable_description)

In [None]:
settings_dict.update({
    "comparisons": eia_ferc1_model_config.COMPARISONS,
    "blocking_rules_to_generate_predictions": eia_ferc1_model_config.BLOCKING_RULES,
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "probability_two_random_records_match": 1/len(eia_df) # this parameter can also be estimated if it's unknown
    }
)

Explanation of the `probability_two_random_records_match` calculation:

The EIA dataset has n records and the FERC dataset has m records, where n > m. Each FERC record matches exactly one EIA record, so there are n - m EIA records that don't have a match.

- If I choose a FERC record first, then I have a 1/n chance of choosing the matching EIA record.
- If I choose an EIA record first, then I have an m/n chance of choosing an EIA record that has a FERC match, and then a 1/m chance of choosing the correct matching FERC record. So the probability of choosing two matching records is m/n * 1/m = 1/n.

In either case, the probability is 1/n.
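
A quick numeric check of this argument, using small made-up values for n and m and assuming, as above, that each FERC record has exactly one matching EIA record:

```python
from fractions import Fraction

n_eia = 1000  # hypothetical number of EIA records (n)
m_ferc = 50   # hypothetical number of FERC records (m), with m < n

# Every FERC record has exactly one true EIA match, so there are m true pairs
# among the n * m possible cross-dataset pairs.
true_pairs = m_ferc
possible_pairs = n_eia * m_ferc
prob_match = Fraction(true_pairs, possible_pairs)

# Same answer when picking the EIA record first:
# P(EIA record has a FERC match) * P(picking the right FERC record).
prob_eia_first = Fraction(m_ferc, n_eia) * Fraction(1, m_ferc)

print(prob_match, prob_eia_first)
```

Both routes give 1/n regardless of m, which is why the settings dictionary uses `1/len(eia_df)`.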

In [None]:
linker.load_settings(settings_dict)

# Estimate Model Parameters

Now that we have specified our linkage model, we need to estimate its parameters: `probability_two_random_records_match` (if not specified in the settings dictionary), u, and m.
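
For orientation, the sketch below shows how these three parameters combine in a Fellegi-Sunter style model, which is the framework Splink implements. Here m is the probability of observing a comparison level among true matches, u is the same probability among non-matches, and the match weight of a level is log2(m/u). All numbers and level names are invented for illustration and are not estimates from this model.

```python
import math

# Invented parameter values for illustration only.
prior = 1 / 1000  # stands in for probability_two_random_records_match
levels = {
    # observed comparison level: (m, u)
    "plant_name exact match": (0.80, 0.001),
    "capacity_mw within tolerance": (0.90, 0.05),
    "fuel_type_code_pudl differs": (0.05, 0.60),
}

# The prior expressed as a match weight (log2 of the prior odds).
prior_weight = math.log2(prior / (1 - prior))

# Each observed level adds log2(m / u); positive values favor a match.
weights = {name: math.log2(m / u) for name, (m, u) in levels.items()}
total_weight = prior_weight + sum(weights.values())

# Convert the total weight back to a probability.
odds = 2 ** total_weight
match_probability = odds / (1 + odds)

for name, weight in weights.items():
    print(f"{name}: {weight:+.2f}")
print(f"match probability: {match_probability:.3f}")
```

Levels that are much more common among matches than non-matches get large positive weights, and levels that are more common among non-matches get negative weights.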

In [None]:
linker.estimate_u_using_random_sampling(max_pairs=1e7)

We can estimate m either from training labels or, without labels, using Expectation Maximization.
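
With labeled pairs, m can be estimated by counting: among pairs known to be matches, m for a comparison level is the share of pairs that fall into that level. The sketch below does this for a single comparison, using made-up labeled pairs and a simplified two-level rule. It illustrates the idea only and is not Splink's internal implementation.

```python
from collections import Counter

# Made-up labeled true matches: (EIA plant name, FERC plant name).
labeled_matches = [
    ("barry", "barry"),
    ("gaston", "gaston"),
    ("gorgas", "gorgas steam"),
    ("miller", "miller"),
    ("greene county", "greene co"),
]


def comparison_level(left, right):
    """Simplified, hypothetical comparison with only two levels."""
    return "exact_match" if left == right else "all_other"


level_counts = Counter(
    comparison_level(left, right) for left, right in labeled_matches
)
n_matches = len(labeled_matches)

# m for a level = P(level | the two records are a true match).
m_estimates = {level: count / n_matches for level, count in level_counts.items()}
print(m_estimates)
```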

In [None]:
linker.estimate_m_from_pairwise_labels("training_labels")

In [None]:
# if we do it unsupervised, we need to define training blocking rules
# training_blocking_rule_1 = "l.plant_name = r.plant_name"
# training_session_1 = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule_1)
# training_session_2 = linker.estimate_parameters_using_expectation_maximisation(block_on(["utility_name", "net_generation_mwh"]))
# training_session_3 = linker.estimate_parameters_using_expectation_maximisation(block_on(["capacity_mw", "fuel_type_code_pudl"]))

In [None]:
linker.match_weights_chart()

In [None]:
linker.m_u_parameters_chart()

In [None]:
model_name = "ferc_eia_demo"

In [None]:
# save model settings to a chosen directory
settings = linker.save_model_to_json(f"./model_settings_{model_name}.json", overwrite=True)

# Make Predictions

In [None]:
# predict matches above a certain threshold match probability or match weight
df_preds = linker.predict(threshold_match_probability=.25)

In [None]:
sorted_preds_df = df_preds.as_pandas_dataframe().sort_values(by="match_probability", ascending=False)

In [None]:
best_match_df = sorted_preds_df.rename(columns={"record_id_r": "record_id_ferc1", "record_id_l": "record_id_eia"}).groupby("record_id_ferc1").first()

# Evaluate Results

In [None]:
train_df = train_df.rename(columns={"record_id_r": "record_id_ferc1", "record_id_l": "record_id_eia"})

In [None]:
cols = [col + "_l" for col in eia_ferc1_model.MATCHING_COLS]
cols += [col + "_r" for col in eia_ferc1_model.MATCHING_COLS]
extra_cols = ["plant_id_pudl_l", "plant_id_pudl_r", "utility_id_pudl_l", "utility_id_pudl_r"]
cols.sort()
cols = ["record_id_eia", "match_weight", "match_probability"] + cols + extra_cols
best_match_df = best_match_df[cols].reset_index()

In [None]:
# in training data and the predicted EIA record is the labeled EIA record
def get_true_pos(pred_df, train_df):
    return train_df.merge(
                pred_df,
                how="left",
                on=["record_id_ferc1", "record_id_eia"],
                indicator=True
            )._merge.value_counts()["both"]

# where an incorrect EIA record is predicted for a FERC record
def get_false_pos(pred_df, train_df):
    shared_preds = train_df.merge(
        pred_df,
        how="inner",
        on="record_id_ferc1",
        suffixes=("_true", "_pred")
    )
    return len(shared_preds[shared_preds.record_id_eia_true != shared_preds.record_id_eia_pred])

# in training data but no prediction made
def get_false_neg(pred_df, train_df):
    return train_df.merge(
                pred_df,
                how="left",
                on=["record_id_ferc1"],
                indicator=True
            )._merge.value_counts()["left_only"]

def get_duplicated_eia_plant_part_matches(pred_df):
    return len(pred_df[(pred_df.record_id_eia.notnull()) & (pred_df.record_id_eia.duplicated(keep="first"))])

def get_match_at_threshold(df, threshold):
    return df[df.match_probability >= threshold]
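
The definitions above can be checked on a toy example. The sketch below uses plain dictionaries that map each FERC record ID to an EIA record ID, with invented IDs, and applies the same three definitions: a true positive is a labeled FERC record with the correct EIA prediction, a false positive is a labeled FERC record with a different EIA prediction, and a false negative is a labeled FERC record with no prediction.

```python
# Invented record IDs for illustration: FERC record ID -> EIA record ID.
labels = {"f1": "e1", "f2": "e2", "f3": "e3", "f4": "e4"}
predictions = {"f1": "e1", "f2": "e9", "f3": "e3", "f5": "e5"}

true_pos = sum(1 for ferc_id, eia_id in labels.items()
               if predictions.get(ferc_id) == eia_id)
false_pos = sum(1 for ferc_id, eia_id in labels.items()
                if ferc_id in predictions and predictions[ferc_id] != eia_id)
false_neg = sum(1 for ferc_id in labels if ferc_id not in predictions)

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(true_pos, false_pos, false_neg, precision, recall)
```

Note that `f5` has a prediction but no label, so it does not count toward any of the three totals; the metrics only describe performance on the labeled records.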

In [None]:
thresholds = [.95, .9, .75, .5, .25]
ind = [".95", ".9", ".75", ".5", ".25"]
# filter the best matches once per threshold, then compute each count
matches_at = [get_match_at_threshold(best_match_df, threshold=t) for t in thresholds]
data = {"true_pos": [get_true_pos(df, train_df) for df in matches_at],
        "false_pos": [get_false_pos(df, train_df) for df in matches_at],
        "false_neg": [get_false_neg(df, train_df) for df in matches_at]
       }

stats_df = pd.DataFrame(index=ind, data=data)
stats_df.loc[:, "precision"] = stats_df["true_pos"]/(stats_df["true_pos"] + stats_df["false_pos"])
stats_df.loc[:, "recall"] = stats_df["true_pos"]/(stats_df["true_pos"] + stats_df["false_neg"])

In [None]:
stats_df

In [None]:
ind = [".9", ".75", ".5", ".25"]
data = {"duplicate_eia_plant_part_matches": [
            get_duplicated_eia_plant_part_matches(get_match_at_threshold(best_match_df, threshold=t))
            for t in [.9, .75, .5, .25]
        ]
       }
dupe_df = pd.DataFrame(index=ind, data=data)

In [None]:
dupe_df

In [None]:
best_match_with_overwrites = eia_ferc1_model.get_best_matches(sorted_preds_df, inputs)
connected_df = eia_ferc1_model.get_full_records_with_overwrites(best_match_with_overwrites, inputs)

# Look at matches

In [None]:
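labels_df = inputs.get_train_df().reset_index()
# compare_df is used below but never defined in this notebook. This merge is an
# assumed reconstruction: it presumes labels_df has record_id_ferc1 and
# record_id_eia columns, and yields the _true/_pred columns referenced below.
compare_df = labels_df.merge(best_match_df, how="left", on="record_id_ferc1", suffixes=("_true", "_pred"), indicator=True)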

In [None]:
compare_df._merge.value_counts()

In [None]:
incorrect_matches = compare_df[compare_df.record_id_eia_true != compare_df.record_id_eia_pred]
incorrect_matches[["record_id_ferc1", "record_id_eia_true", "record_id_eia_pred", "match_probability"]].reset_index(drop=True)

In [None]:
i = 0
ferc_id = incorrect_matches.record_id_ferc1.iloc[i]
true_eia_id = incorrect_matches.record_id_eia_true.iloc[i]
pred_eia_id = incorrect_matches.record_id_eia_pred.iloc[i]

In [None]:
rec_true = sorted_preds_df[(sorted_preds_df.record_id_r == ferc_id) & (sorted_preds_df.record_id_l == true_eia_id)]
rec_pred = sorted_preds_df[(sorted_preds_df.record_id_r == ferc_id) & (sorted_preds_df.record_id_l == pred_eia_id)]

In [None]:
rec_true = rec_true.to_dict(orient="records")
linker.waterfall_chart(rec_true, filter_nulls=False)

In [None]:
rec_pred = rec_pred.to_dict(orient="records")
linker.waterfall_chart(rec_pred, filter_nulls=False)