# Use Splink to match FERC1 plants to EIA plant parts

This notebook walks through using splink to match FERC1 plants to EIA plant parts, as is done in `pudl.analysis.record_linkage.eia_ferc1_record_linkage_model.py`. Splink provides several visualizations during the model training process that are helpful for understanding model weights and the input datasets. For now, those visualizations are not captured in the PUDL module that implements this model, so this companion notebook provides additional insight into model development.

The [Splink docs](https://moj-analytical-services.github.io/splink/index.html) include tutorials, and the project's GitHub issues and discussions are also helpful places to look.

In [None]:
%load_ext autoreload
%autoreload 3

In [None]:
import jellyfish
import sqlalchemy as sa
from splink import block_on, DuckDBAPI, Linker, SettingsCreator
from splink.blocking_analysis import (
    count_comparisons_from_blocking_rule,
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
    n_largest_blocks,
)
from splink.exploratory import completeness_chart, profile_columns
import pandas as pd

import pudl
from pudl.analysis.record_linkage import eia_ferc1_record_linkage as eia_ferc1_model
from pudl.analysis.record_linkage.name_cleaner import CompanyNameCleaner
from pudl.analysis.record_linkage.embed_dataframe import _fill_fuel_type_from_name
from pudl.analysis.record_linkage import eia_ferc1_model_config

# Get model inputs and preprocess

Practically speaking, a plant is a collection of one or more generators. Generators have many attributes (e.g. prime mover, primary fuel source, technology type). We can use these generator attributes to group generator records into larger aggregate records which we call "plant parts". A plant part is a record which corresponds to a particular collection of generators that all share an identical attribute and utility owner, e.g. all of the generators with `unit_id=2`, or all of the generators with coal as their primary fuel source.

The EIA data about power plants (from EIA 923 and 860) is reported in tables whose records mostly correspond to generators and plants. FERC 1 is less well organized and includes plants, generators and other plant parts all in the same table without any clear labels. The EIA plant part table is an attempt to create records corresponding to many different plant parts in order to connect specific slices of EIA plants to FERC.

Because generators are often owned by multiple utilities, another dimension of this plant part table involves generating two records for each owner: one for the portion of the plant part they own and one for the plant part as a whole. The portion records are labeled in the ``ownership_record_type`` column as ``owned`` and the total records are labeled as ``total``. This table includes A LOT of duplicative information about EIA plants. It is meant for use as an input into the record linkage between FERC1 plants and EIA.
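To make the owned/total distinction concrete, here is a minimal sketch with made-up generators and ownership shares. The `build_plant_parts` helper, the field names, and all values are hypothetical illustrations rather than the PUDL implementation, and the sketch assumes every generator in the plant has the same ownership split.

```python
from collections import defaultdict

# Hypothetical toy generators for a single plant; values are made up.
generators = [
    {"generator_id": "1", "unit_id": "A", "fuel": "coal", "capacity_mw": 100.0},
    {"generator_id": "2", "unit_id": "A", "fuel": "coal", "capacity_mw": 150.0},
    {"generator_id": "3", "unit_id": "B", "fuel": "gas", "capacity_mw": 50.0},
]
# Hypothetical ownership shares, assumed identical for every generator.
ownership = {"utility_x": 0.6, "utility_y": 0.4}


def build_plant_parts(generators, ownership, attribute):
    """Group generators by one attribute, then emit an owned and a total record per owner."""
    capacity_by_value = defaultdict(float)
    for gen in generators:
        capacity_by_value[gen[attribute]] += gen["capacity_mw"]

    records = []
    for value, capacity in capacity_by_value.items():
        for owner, fraction in ownership.items():
            base = {"plant_part": attribute, "value": value, "owner": owner}
            # The owner's slice of the plant part.
            records.append(
                {**base, "ownership_record_type": "owned", "capacity_mw": capacity * fraction}
            )
            # The plant part as a whole, repeated for each owner.
            records.append(
                {**base, "ownership_record_type": "total", "capacity_mw": capacity}
            )
    return records


fuel_parts = build_plant_parts(generators, ownership, "fuel")
for record in fuel_parts:
    print(record)
```

Two fuel types, two owners, and two record types already produce eight records, which is why the real table contains so much duplicative information.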

In [None]:
# Get a denormalized FERC Form 1 table containing the steam, small generators, hydro, and pumped storage tables
out_ferc1__yearly_all_plants = pd.read_parquet("s3://pudl.catalyst.coop/stable/out_ferc1__yearly_all_plants.parquet")

In [None]:
# Get a table summarizing fuel data by plant, using FERC Form 1 data
out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 = pd.read_parquet("s3://pudl.catalyst.coop/stable/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402.parquet",)

In [None]:
# Get a table with the aggregation of all EIA "plant parts"
out_eia__yearly_plant_parts = pd.read_parquet("s3://pudl.catalyst.coop/stable/out_eia__yearly_plant_parts.parquet")

In [None]:
out_eia__yearly_plant_parts["report_date"] = pd.to_datetime(out_eia__yearly_plant_parts["report_date"])

In [None]:
inputs = eia_ferc1_model.get_compiled_input_manager(out_ferc1__yearly_all_plants,
                                                    out_ferc1__yearly_steam_plants_fuel_by_plant_sched402,
                                                    out_eia__yearly_plant_parts)

Do a little preprocessing so the datasets have the same columns. Also, load in a dataset of manually matched training data, found [here](https://github.com/catalyst-cooperative/pudl/blob/main/src/pudl/package_data/glue/eia_ferc1_train.csv). We'll use this to train and validate the model.

In [None]:
eia_df, ferc_df = eia_ferc1_model.get_input_dfs(inputs)
# we have a dataset of manually matched training data
train_df = eia_ferc1_model.get_training_data_df(inputs)

Normalize plant and utility name strings. Do things like expand legal terms (e.g. llc -> limited liability company), remove punctuation, remove numbers, etc.

This name cleaner is being refactored and will soon be 3x faster.
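As a rough illustration of the kind of normalization described above, the sketch below lowercases a name, drops punctuation and digits, and expands a few legal terms. The `clean_name` function and the `LEGAL_TERMS` table are hypothetical stand-ins; the actual `CompanyNameCleaner` applies many more rules.

```python
import re

# Hypothetical, deliberately tiny legal-term table for illustration only.
LEGAL_TERMS = {
    "llc": "limited liability company",
    "inc": "incorporated",
    "co": "company",
    "corp": "corporation",
}


def clean_name(name):
    """Lowercase, replace punctuation and digits with spaces, and expand legal terms."""
    name = name.lower()
    name = re.sub(r"[^a-z\s]", " ", name)
    words = [LEGAL_TERMS.get(word, word) for word in name.split()]
    return " ".join(words)


print(clean_name("Duke Energy Carolinas, LLC"))
print(clean_name("Plant #2 Power Co."))
```

Normalizing both datasets with the same rules means that superficial differences in spelling or punctuation do not count against an otherwise matching pair.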

In [None]:
plant_name_cleaner = eia_ferc1_model.plant_name_cleaner
utility_name_cleaner = CompanyNameCleaner(legal_term_location=2)

In [None]:
ferc_df["plant_name"] = plant_name_cleaner.apply_name_cleaning(ferc_df["plant_name"])
ferc_df["utility_name"] = utility_name_cleaner.apply_name_cleaning(ferc_df["utility_name"])
ferc_df["fuel_type_code_pudl"] = _fill_fuel_type_from_name(ferc_df, "fuel_type_code_pudl", "plant_name")

In [None]:
eia_df["plant_name"] = plant_name_cleaner.apply_name_cleaning(eia_df["plant_name"])
eia_df["utility_name"] = utility_name_cleaner.apply_name_cleaning(eia_df["utility_name"])
eia_df["fuel_type_code_pudl"] = _fill_fuel_type_from_name(eia_df, "fuel_type_code_pudl", "plant_name")

In [None]:
ferc_df["installation_year"] = pd.to_datetime(ferc_df["installation_year"], format="%Y")
ferc_df["construction_year"] = pd.to_datetime(ferc_df["construction_year"], format="%Y")
eia_df["installation_year"] = pd.to_datetime(eia_df["installation_year"], format="%Y")
eia_df["construction_year"] = pd.to_datetime(eia_df["construction_year"], format="%Y")

We can use metaphones of the plant and utility names as columns for blocking. With splink, metaphones can work better/faster than string similarity.

In [None]:
def _get_metaphone(row, col_name):
    if pd.isnull(row[col_name]):
        return None
    return jellyfish.metaphone(row[col_name])

In [None]:
eia_df["plant_name_mphone"] = eia_df.apply(_get_metaphone, axis=1, args=("plant_name",))
ferc_df["plant_name_mphone"] = ferc_df.apply(_get_metaphone, axis=1, args=("plant_name",),)

In [None]:
eia_df["utility_name_mphone"] = eia_df.apply(_get_metaphone, axis=1, args=("utility_name",))
ferc_df["utility_name_mphone"] = ferc_df.apply(_get_metaphone, axis=1, args=("utility_name",))

In [None]:
cols = eia_ferc1_model.ID_COL + eia_ferc1_model.MATCHING_COLS + eia_ferc1_model.EXTRA_COLS
eia_df = eia_df[cols]
ferc_df = ferc_df[cols]

# Data Exploration

In [None]:
db_api = DuckDBAPI()

In [None]:
completeness_chart(eia_df, db_api=db_api, cols=eia_ferc1_model.MATCHING_COLS)

In [None]:
completeness_chart(ferc_df, db_api=db_api, cols=eia_ferc1_model.MATCHING_COLS)

Columns with higher cardinality are better for matching. Note the skew in `fuel_type_code_pudl`, which means we'll need to use a term frequency adjustment.
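The intuition behind a term frequency adjustment can be sketched with made-up values: if one record has a given fuel type, the chance that an unrelated record shares it is roughly that value's relative frequency, so agreeing on a rare value is stronger evidence than agreeing on a common one. This is a simplified illustration of the idea, not Splink's exact computation, and the `agreement_weights` helper and the counts are hypothetical.

```python
import math
from collections import Counter

# Made-up fuel type values with a deliberate skew toward "gas".
fuel_types = ["gas"] * 70 + ["coal"] * 20 + ["hydro"] * 8 + ["nuclear"] * 2


def agreement_weights(values):
    """Return log2(1 / relative frequency) for each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(total / count) for value, count in counts.items()}


weights = agreement_weights(fuel_types)
# Print values from most common (least informative) to rarest (most informative).
for value, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{value}: {weight:.2f}")
```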

In [None]:
profile_columns(eia_df[eia_ferc1_model.MATCHING_COLS], db_api=DuckDBAPI(), top_n=10, bottom_n=5)

In [None]:
profile_columns(ferc_df[eia_ferc1_model.MATCHING_COLS], db_api=DuckDBAPI(), top_n=10, bottom_n=5)

# Generate blocking rules

Define blocking rules to reduce the search space of potential candidate pairs that the matching model must consider. See `pudl.analysis.record_linkage.eia_ferc1_model_config` for blocking rule definitions.

From the docs:
- "More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would not meet at least one of these blocking critera. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10, though even more is possible."
- "For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons."
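A small sketch shows why blocking shrinks the search space. It counts candidate pairs with and without an exact-match blocking key on made-up records. The `count_blocked_pairs` helper and the `report_year` key are assumptions chosen for illustration; the real rules live in `eia_ferc1_model_config.BLOCKING_RULES`.

```python
from collections import Counter

# Made-up records; only the blocking key matters for counting pairs.
eia_records = [
    {"record_id": "eia_1", "report_year": 2019},
    {"record_id": "eia_2", "report_year": 2019},
    {"record_id": "eia_3", "report_year": 2020},
    {"record_id": "eia_4", "report_year": 2021},
]
ferc_records = [
    {"record_id": "ferc_1", "report_year": 2019},
    {"record_id": "ferc_2", "report_year": 2020},
    {"record_id": "ferc_3", "report_year": 2020},
]


def count_blocked_pairs(left, right, key):
    """Count cross-dataset pairs whose values for `key` are equal."""
    left_counts = Counter(rec[key] for rec in left)
    right_counts = Counter(rec[key] for rec in right)
    # Counter returns 0 for keys that are missing on the right-hand side.
    return sum(n * right_counts[value] for value, n in left_counts.items())


all_pairs = len(eia_records) * len(ferc_records)
blocked_pairs = count_blocked_pairs(eia_records, ferc_records, "report_year")
print(f"all pairs: {all_pairs}, pairs after blocking: {blocked_pairs}")
```

The same per-block arithmetic explains why a few very large blocks can dominate the total number of comparisons.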

In [None]:
br0 = eia_ferc1_model_config.BLOCKING_RULES[0]

In [None]:
count_comparisons_from_blocking_rule(
    table_or_tables=[eia_df, ferc_df],
    blocking_rule=br0,
    link_type="link_only",
    unique_id_column_name='record_id',
    db_api=db_api,
)


In [None]:
result = n_largest_blocks(
    table_or_tables=[eia_df, ferc_df],
    blocking_rule=br0,
    link_type="link_only",
    db_api=db_api,
    n_largest=3
)

result.as_pandas_dataframe()

In [None]:
blocking_rules_for_analysis = eia_ferc1_model_config.BLOCKING_RULES

cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=[eia_df, ferc_df],
    blocking_rules=blocking_rules_for_analysis,
    db_api=db_api,
    unique_id_column_name='record_id',
    link_type="link_only",
)

# Define Model Settings

See the [splink settings guide](https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html) for more on model parameters.

In [None]:
print(eia_ferc1_model_config.plant_name_comparison.get_comparison("duckdb").human_readable_description)

How `probability_two_random_records_match` is calculated:

The EIA dataset has n records and the FERC dataset has m records, where n > m. Each FERC record matches one EIA record, so there are n - m EIA records that don't have a match.

- If I choose a FERC record first, then I have a 1/n chance of choosing the matching EIA record.
- If I choose an EIA record first, then I have an m/n chance of choosing an EIA record that has a FERC match, and then a 1/m chance of choosing the correct matching FERC record. So the probability of choosing two matching records is m/n * 1/m = 1/n.

In either case, the probability is 1/n.
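The 1/n result can be checked by brute force on a toy example. The sketch below enumerates every EIA-FERC pair and counts the matching ones, assuming (as in the argument above) that each FERC record matches a distinct EIA record. The sizes and IDs are made up.

```python
from fractions import Fraction
from itertools import product

n, m = 5, 3  # toy sizes: n EIA records, m FERC records, n > m
eia_ids = list(range(n))
# Each FERC record matches exactly one (distinct) EIA record.
ferc_to_eia = {f"ferc_{i}": i for i in range(m)}

# Every possible cross-dataset pair, and the subset that are true matches.
pairs = list(product(eia_ids, ferc_to_eia))
matching = sum(1 for eia_id, ferc_id in pairs if ferc_to_eia[ferc_id] == eia_id)

probability = Fraction(matching, len(pairs))
print(probability)
```

There are n * m possible pairs and m true matches, so the ratio reduces to 1/n regardless of m.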

In [None]:
settings = SettingsCreator(
    link_type="link_only",
    unique_id_column_name="record_id",
    comparisons=eia_ferc1_model_config.COMPARISONS,
    blocking_rules_to_generate_predictions=eia_ferc1_model_config.BLOCKING_RULES,
    retain_intermediate_calculation_columns=True,
    probability_two_random_records_match=1/len(eia_df) # this parameter can also be estimated if it's unknown
)

linker = Linker([eia_df, ferc_df], settings, db_api=DuckDBAPI())

In [None]:
train_table = linker.table_management.register_table(train_df, "training_labels", overwrite=True)

# Estimate Model Parameters

Now that we have specified our linkage model, we need to estimate its parameters: `probability_two_random_records_match` (if it was not specified in the settings), `u`, and `m`.
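For intuition about how these parameters combine, the sketch below follows the Fellegi-Sunter arithmetic that Splink is built on: the prior odds of a match are multiplied by a Bayes factor m/u for each observed comparison level, and match weights are the base-2 logs of those quantities. The `match_probability` helper and the example m and u values are made up for illustration.

```python
import math


def match_probability(prior, level_params):
    """Combine a prior with (m, u) pairs for the observed comparison levels.

    prior: probability that two random records match.
    level_params: list of (m, u) tuples, one per comparison.
    Returns the total match weight and the resulting match probability.
    """
    weight = math.log2(prior / (1 - prior))  # prior odds as a match weight
    for m, u in level_params:
        weight += math.log2(m / u)  # each comparison adds log2 of its Bayes factor
    probability = 2**weight / (1 + 2**weight)
    return weight, probability


# Hypothetical example: a low prior and two comparisons that both agree.
example_weight, example_probability = match_probability(
    prior=1 / 1001,
    level_params=[(0.9, 0.0009), (0.8, 0.1)],
)
print(f"match weight: {example_weight:.2f}, probability: {example_probability:.3f}")
```

A comparison level with m much larger than u pushes the weight up, while one with m smaller than u pushes it down.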

In [None]:
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)

We can estimate `m` either from the training labels or, without labels, using Expectation Maximization.
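Estimating from labels amounts to asking, among pairs known to be true matches, how often a given comparison level is observed. A minimal sketch for a single exact-match level, using made-up labelled pairs and a hypothetical `estimate_m_exact_match` helper (Splink's own estimator covers every level of every comparison):

```python
# Made-up labelled true matches as (EIA record, FERC record) pairs.
labelled_matches = [
    ({"fuel": "coal"}, {"fuel": "coal"}),
    ({"fuel": "gas"}, {"fuel": "gas"}),
    ({"fuel": "gas"}, {"fuel": "oil"}),
    ({"fuel": "hydro"}, {"fuel": "hydro"}),
    ({"fuel": None}, {"fuel": "coal"}),  # nulls are excluded from the estimate
]


def estimate_m_exact_match(labelled_matches, column):
    """Share of labelled true matches whose non-null values agree exactly."""
    comparable = [
        (left, right)
        for left, right in labelled_matches
        if left[column] is not None and right[column] is not None
    ]
    agree = sum(1 for left, right in comparable if left[column] == right[column])
    return agree / len(comparable)


m_exact = estimate_m_exact_match(labelled_matches, "fuel")
print(m_exact)
```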

In [None]:
linker.training.estimate_m_from_pairwise_labels("training_labels")

In [None]:
# if we want this to be unsupervised, we need to define training blocking rules
# training_blocking_rule_1 = "l.plant_name = r.plant_name"
# training_session_1 = linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule_1)
# training_session_2 = linker.training.estimate_parameters_using_expectation_maximisation(block_on("utility_name", "net_generation_mwh"))
# training_session_3 = linker.training.estimate_parameters_using_expectation_maximisation(block_on("capacity_mw", "fuel_type_code_pudl"))

In [None]:
linker.visualisations.match_weights_chart()

In [None]:
linker.visualisations.m_u_parameters_chart()

In [None]:
model_name = "ferc_eia_demo"

In [None]:
# save model settings to a chosen directory
settings = linker.misc.save_model_to_json(f"./model_settings_{model_name}.json", overwrite=True)

# Make Predictions

In [None]:
# predict matches above a certain threshold match probability or match weight
df_preds = linker.inference.predict(threshold_match_probability=.25)

In [None]:
sorted_preds_df = df_preds.as_pandas_dataframe().sort_values(by="match_probability", ascending=False)

In [None]:
best_match_df = sorted_preds_df.rename(columns={"record_id_r": "record_id_ferc1", "record_id_l": "record_id_eia"}).groupby("record_id_ferc1").first()

# Evaluate Results

In [None]:
train_df = train_df.rename(columns={"record_id_r": "record_id_ferc1", "record_id_l": "record_id_eia"})

In [None]:
cols = [col + "_l" for col in eia_ferc1_model.MATCHING_COLS]
cols += [col + "_r" for col in eia_ferc1_model.MATCHING_COLS]
extra_cols = ["plant_id_pudl_l", "plant_id_pudl_r", "utility_id_pudl_l", "utility_id_pudl_r"]
cols.sort()
cols = ["record_id_eia", "match_weight", "match_probability"] + cols + extra_cols
best_match_df = best_match_df[cols].reset_index()

In [None]:
def get_true_pos(pred_df, train_df):
    return train_df.merge(
                pred_df,
                how="left",
                on=["record_id_ferc1", "record_id_eia"],
                indicator=True
            )._merge.value_counts()["both"]

# where an incorrect EIA record is predicted for a FERC record
def get_false_pos(pred_df, train_df):
    shared_preds = train_df.merge(
        pred_df,
        how="inner",
        on="record_id_ferc1",
        suffixes=("_true", "_pred")
    )
    return len(shared_preds[shared_preds.record_id_eia_true != shared_preds.record_id_eia_pred])

# in training data but no prediction made
def get_false_neg(pred_df, train_df):
    return train_df.merge(
                pred_df,
                how="left",
                on=["record_id_ferc1"],
                indicator=True
            )._merge.value_counts()["left_only"]

def get_duplicated_eia_plant_part_matches(pred_df):
    return len(pred_df[(pred_df.record_id_eia.notnull()) & (pred_df.record_id_eia.duplicated(keep="first"))])

def get_match_at_threshold(df, threshold):
    return df[df.match_probability >= threshold]
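The definitions used by these helpers can be checked on a toy example. The sketch below uses plain dictionaries that map a FERC record ID to an EIA record ID. The `score_predictions` helper and the IDs are made up. As in the functions above, only FERC records that appear in the training data are scored.

```python
def score_predictions(true_matches, predicted_matches):
    """Count true positives, false positives, and false negatives against labels.

    Both arguments map a FERC record ID to an EIA record ID.
    """
    # Labelled FERC record with the correct EIA record predicted.
    true_pos = sum(
        1 for ferc_id, eia_id in true_matches.items()
        if predicted_matches.get(ferc_id) == eia_id
    )
    # Labelled FERC record with a different EIA record predicted.
    false_pos = sum(
        1 for ferc_id, eia_id in true_matches.items()
        if ferc_id in predicted_matches and predicted_matches[ferc_id] != eia_id
    )
    # Labelled FERC record with no prediction at all.
    false_neg = sum(1 for ferc_id in true_matches if ferc_id not in predicted_matches)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": true_pos / (true_pos + false_pos),
        "recall": true_pos / (true_pos + false_neg),
    }


true_matches = {"f1": "e1", "f2": "e2", "f3": "e3", "f4": "e4"}
predicted_matches = {"f1": "e1", "f2": "e9", "f3": "e3", "f5": "e5"}
scores = score_predictions(true_matches, predicted_matches)
print(scores)
```

Raising the match probability threshold typically trades recall for precision, which is what the table built in the next cell makes visible.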

In [None]:
thresholds = [.95, .9, .75, .5, .25]
ind = [".95", ".9", ".75", ".5", ".25"]
# filter the best matches once per threshold, then reuse them for each metric
matches_at_threshold = [get_match_at_threshold(best_match_df, threshold=t) for t in thresholds]
data = {
    "true_pos": [get_true_pos(df, train_df) for df in matches_at_threshold],
    "false_pos": [get_false_pos(df, train_df) for df in matches_at_threshold],
    "false_neg": [get_false_neg(df, train_df) for df in matches_at_threshold],
}

stats_df = pd.DataFrame(index=ind, data=data)
stats_df.loc[:, "precision"] = stats_df["true_pos"]/(stats_df["true_pos"] + stats_df["false_pos"])
stats_df.loc[:, "recall"] = stats_df["true_pos"]/(stats_df["true_pos"] + stats_df["false_neg"])

In [None]:
stats_df

In [None]:
thresholds = [.9, .75, .5, .25]
ind = [".9", ".75", ".5", ".25"]
data = {
    "duplicate_eia_plant_part_matches": [
        get_duplicated_eia_plant_part_matches(get_match_at_threshold(best_match_df, threshold=t))
        for t in thresholds
    ]
}
dupe_df = pd.DataFrame(index=ind, data=data)

In [None]:
dupe_df

In [None]:
best_match_with_overwrites = eia_ferc1_model.get_best_matches(sorted_preds_df, inputs)
connected_df = eia_ferc1_model.get_full_records_with_overwrites(best_match_with_overwrites, inputs)

# Look at matches

In [None]:
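labels_df = inputs.get_train_df().reset_index()
compare_df = labels_df.merge(best_match_df, how="left", on="record_id_ferc1", suffixes=("_true", "_pred"), indicator=True)  # assumed definition, inferred from how compare_df is used below; requires record_id_ferc1 and record_id_eia columns in labels_df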

In [None]:
compare_df._merge.value_counts()

In [None]:
incorrect_matches = compare_df[compare_df.record_id_eia_true != compare_df.record_id_eia_pred]
incorrect_matches[["record_id_ferc1", "record_id_eia_true", "record_id_eia_pred", "match_probability"]].reset_index(drop=True)

In [None]:
i = 0
ferc_id = incorrect_matches.record_id_ferc1.iloc[i]
true_eia_id = incorrect_matches.record_id_eia_true.iloc[i]
pred_eia_id = incorrect_matches.record_id_eia_pred.iloc[i]

In [None]:
rec_true = sorted_preds_df[(sorted_preds_df.record_id_r == ferc_id) & (sorted_preds_df.record_id_l == true_eia_id)]
rec_pred = sorted_preds_df[(sorted_preds_df.record_id_r == ferc_id) & (sorted_preds_df.record_id_l == pred_eia_id)]

In [None]:
rec_true = rec_true.to_dict(orient="records")
linker.visualisations.waterfall_chart(rec_true, filter_nulls=False)

In [None]:
rec_pred = rec_pred.to_dict(orient="records")
linker.visualisations.waterfall_chart(rec_pred, filter_nulls=False)