# Predicting which records match

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.


In [None]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install splink

In [1]:
from splink import Linker, DuckDBAPI, splink_datasets

import pandas as pd

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000

## Load estimated model from previous tutorial


In [2]:
import json

# Path to the saved model
file_path = "../../results/saved_model_from_demo.json"

# Load JSON settings from file
with open(file_path, "r", encoding="utf-8") as f:
    settings = json.load(f)

# Initialize the Linker with the loaded settings
linker = Linker(df, settings, db_api=db_api)


## Predicting match weights using the trained model

We use `linker.inference.predict()` to run the model.

Under the hood this will:

- Generate all pairwise record comparisons that match at least one of the `blocking_rules_to_generate_predictions`

- Use the rules specified in the `Comparisons` to evaluate the similarity of the input data

- Use the estimated match weights, applying term frequency adjustments where requested, to produce the final `match_weight` and `match_probability` scores
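
The final step can be pictured as Bayes factor arithmetic. The sketch below is not Splink's implementation. It assumes the usual Fellegi-Sunter form, in which the prior odds are multiplied by each comparison's Bayes factor (the `bf_*` columns) and term frequency adjustment (the `bf_tf_adj_*` columns), and the match weight is the base-2 logarithm of the result. The function name, the factor values and the `prior_odds` value are all hypothetical.

```python
import math


def combine_bayes_factors(prior_odds, bayes_factors):
    """Return (match_weight, match_probability) from prior odds and Bayes factors."""
    # Posterior odds: prior odds scaled by every Bayes factor and adjustment.
    posterior_odds = prior_odds * math.prod(bayes_factors)
    match_weight = math.log2(posterior_odds)
    match_probability = posterior_odds / (1 + posterior_odds)
    return match_weight, match_probability


# Illustrative factors of the kind found in the bf_* and bf_tf_adj_* columns.
factors = [85.0, 0.96, 224.0, 0.46]

# Hypothetical prior odds that two records drawn at random are a match.
weight, probability = combine_bayes_factors(prior_odds=0.003, bayes_factors=factors)
print(f"match_weight={weight:.3f}, match_probability={probability:.6f}")
```

A factor above 1 pushes the score towards a match, a factor below 1 pushes it away, and a factor of exactly 1 leaves the score unchanged.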

Optionally, a `threshold_match_probability` or `threshold_match_weight` can be provided, which will drop any row where the predicted score is below the threshold.
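
The two thresholds express the same kind of cut-off on different scales. As a sketch, assuming the match weight is the base-2 logarithm of the odds of a match (which agrees with the `match_weight` and `match_probability` pairs in this tutorial's output), the conversion looks like this. The helper names are hypothetical and are not part of Splink's API.

```python
import math


def weight_to_probability(match_weight):
    """Convert a match weight (log2 odds) to a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_weight(match_probability):
    """Convert a match probability (strictly between 0 and 1) to a match weight."""
    return math.log2(match_probability / (1.0 - match_probability))


# A weight of 0 corresponds to even odds.
print(round(weight_to_probability(0.0), 3))

# The match weight that corresponds to a probability threshold of 0.2.
print(round(probability_to_weight(0.2), 3))
```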


In [3]:
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)

Blocking time: 0.01 seconds
Predict time: 0.16 seconds

You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained


Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,tf_first_name_r,bf_first_name,bf_tf_adj_first_name,surname_l,surname_r,gamma_surname,tf_surname_l,tf_surname_r,bf_surname,bf_tf_adj_surname,dob_l,dob_r,gamma_dob,bf_dob,city_l,city_r,gamma_city,tf_city_l,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,tf_email_l,tf_email_r,bf_email,bf_tf_adj_email,match_key
0,6.311409,0.987565,8,9,,Evie,-1,,0.008424,1.0,1.0,Dean,Dean,4,0.003663,0.003663,88.870507,1.334963,2015-03-03,2015-03-03,2,223.957757,,Pootsmruth,-1,,0.00123,1.0,1.0,,evihd56@earris-bailey.net,-1,,0.001267,1.0,1.0,1
1,18.868966,0.999998,26,28,Thomas,Thomas,4,0.006017,0.006017,84.821765,0.962892,Gabriel,Gabriel,4,0.004884,0.004884,88.870507,1.001222,1976-09-15,1976-09-15,2,223.957757,Loodon,London,0,0.00123,0.212792,0.462956,1.0,gabriel.t54@nnichls.info,gabriel.t54@nichols.info,3,0.001267,0.002535,212.576644,1.0,1
2,10.984407,0.999507,29,30,Thomas,Thomas,4,0.006017,0.006017,84.821765,0.962892,Gabriel,Gabriel,4,0.004884,0.004884,88.870507,1.001222,1976-08-15,1976-09-15,1,93.268001,,London,-1,,0.212792,1.0,1.0,,gabriel.t54@nlchois.info,-1,,0.001267,1.0,1.0,1
3,21.210697,1.0,37,39,Theodore,Theodore,4,0.012034,0.012034,84.821765,0.481446,Morris,Morris,4,0.004884,0.004884,88.870507,1.001222,1978-08-19,1978-08-19,2,223.957757,Birmingham,Birmingham,1,0.0492,0.0492,10.20126,1.120874,t.m39@brooks-sawyer.com,t.m39@brooks-sawyer.com,4,0.006337,0.006337,252.050601,0.346193,0
4,1.212079,0.698497,42,43,Theodore,Theodore,4,0.012034,0.012034,84.821765,0.481446,Morris,Morris,4,0.004884,0.004884,88.870507,1.001222,1978-09-18,1978-08-19,0,0.460743,Birgmhniam,Birmingham,0,0.00123,0.0492,0.462956,1.0,,t.m39@brooks-sawyer.com,-1,,0.006337,1.0,1.0,1


## Clustering

The result of `linker.inference.predict()` is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

```
A -> B with score 0.9
B -> C with score 0.95
C -> D with score 0.1
D -> E with score 0.99
```

Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

```
ID, Cluster ID
A,  1
B,  1
C,  1
D,  2
E,  2
```

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:


In [4]:
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)

Completed iteration 1, num representatives needing updating: 2
Completed iteration 2, num representatives needing updating: 0


Unnamed: 0,cluster_id,unique_id,first_name,surname,dob,city,email,cluster
0,8,9,Evie,Dean,2015-03-03,Pootsmruth,evihd56@earris-bailey.net,3
1,14,14,Oliver,Griffiths,1991-10-26,Lunton,o.griffiths90@reyes-coleman.com,5
2,22,24,Thoas,Green,1974-10-05,London,thomas.green@clark.org,10
3,26,26,Thomas,Gabriel,1976-09-15,Loodon,gabriel.t54@nnichls.info,11
4,26,30,Thomas,Gabriel,1976-09-15,London,gabriel.t54@nlchois.info,11
5,37,37,Theodore,Morris,1978-08-19,Birmingham,t.m39@brooks-sawyer.com,13
6,37,39,Theodore,Morris,1978-08-19,Birmingham,t.m39@brooks-sawyer.com,13
7,37,43,Theodore,Morris,1978-08-19,Birmingham,t.m39@brooks-sawyer.com,13
8,52,52,Jyayden,Bnennet,2017-01-11,Snawseaa,jb88@king.com,16
9,74,74,Ronni,Begum,2003-10-15,London,r.b80@ellis-berry.com,22
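
To illustrate what connected components does, here is a minimal union-find sketch applied to the conceptual A to E example from earlier in this section. It is not Splink's implementation. The function name and the integer cluster labels are hypothetical, and treating a score equal to the threshold as a link is an assumption of this sketch.

```python
def cluster_at_threshold(scored_pairs, threshold):
    """Group records connected by any pair scoring at or above the threshold."""
    parent = {}

    def find(node):
        # Register unseen records as their own representative.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, score in scored_pairs:
        root_left, root_right = find(left), find(right)
        # Assumption: a score equal to the threshold counts as a link.
        if score >= threshold and root_left != root_right:
            parent[root_right] = root_left

    # Number clusters in order of first appearance.
    labels = {}
    return {
        node: labels.setdefault(find(node), len(labels) + 1)
        for node in list(parent)
    }


pairs = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
example_clusters = cluster_at_threshold(pairs, threshold=0.5)
print(example_clusters)
```

With a threshold of 0.5, the low-scoring C to D link is ignored, so A, B and C fall into one cluster and D and E into another.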


In [5]:
sql = f"""
select *
from {df_predictions.physical_name}
limit 2
"""
linker.misc.query_sql(sql)

Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,tf_first_name_r,bf_first_name,bf_tf_adj_first_name,surname_l,surname_r,gamma_surname,tf_surname_l,tf_surname_r,bf_surname,bf_tf_adj_surname,dob_l,dob_r,gamma_dob,bf_dob,city_l,city_r,gamma_city,tf_city_l,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,tf_email_l,tf_email_r,bf_email,bf_tf_adj_email,match_key
0,6.311409,0.987565,8,9,,Evie,-1,,0.008424,1.0,1.0,Dean,Dean,4,0.003663,0.003663,88.870507,1.334963,2015-03-03,2015-03-03,2,223.957757,,Pootsmruth,-1,,0.00123,1.0,1.0,,evihd56@earris-bailey.net,-1,,0.001267,1.0,1.0,1
1,18.868966,0.999998,26,28,Thomas,Thomas,4,0.006017,0.006017,84.821765,0.962892,Gabriel,Gabriel,4,0.004884,0.004884,88.870507,1.001222,1976-09-15,1976-09-15,2,223.957757,Loodon,London,0,0.00123,0.212792,0.462956,1.0,gabriel.t54@nnichls.info,gabriel.t54@nichols.info,3,0.001267,0.002535,212.576644,1.0,1


!!! note "Further Reading"
    For more on the prediction tools in Splink, please refer to the [Prediction API documentation](https://moj-analytical-services.github.io/splink/api_docs/inference.html).


## Next steps

Now that we have made predictions with the model, we can move on to visualising it to understand how it works.
