# Player Fingerprint Similarity Analysis

This notebook builds binary (or quantized) fingerprints for baseball players based on their career statistics, then computes pairwise similarity scores between all players. The goal is to identify players with statistically similar profiles using the Tanimoto/Jaccard coefficient.

In [1]:
import pandas as pd
import numpy as np
import csv
from diamondfp.fingerprints import binaryfp
from diamondfp.utils.features import generate_quantiles
from diamondfp.scoring import tanimoto

## Preparing the Data

We load player career batting data, then define statistical features of interest with percentile cutoffs. These percentiles act as thresholds for converting continuous stats (like batting average, OBP, etc.) into binary fingerprints. Each player’s fingerprint is a binary vector indicating which statistical thresholds they meet.

In [2]:
df = pd.read_csv("../data/career-batting.csv")

stat_features = {
    "H": [0.5, 0.75, 0.9, 0.95],
    "2B": [0.75, 0.95],
    "3B": [0.75, 0.95],
    "HR": [0.9, 0.99],
    "K%": [0.1, 0.25],
    "BB%": [0.75, 0.99],
    "AVG": [0.5, 0.75, 0.9, 0.95],
    "OBP": [0.5, 0.75, 0.9, 0.95],
    "SLG": [0.5, 0.75, 0.9, 0.95],
    "OPS": [0.5, 0.75, 0.9, 0.95],
}

feat_quants = generate_quantiles(df, stat_features)
df['Fingerprint'] = df.apply(lambda x: binaryfp(x, feat_quants), axis=1)
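
To make the thresholding concrete, here is a minimal sketch of the idea in plain pandas rather than `diamondfp`. The helper names `quantile_cutoffs` and `binary_fingerprint` are hypothetical, the toy values are made up, and the rule "a bit is 1 when the stat is at or above the cutoff" is an assumption. The library may handle stats such as K%, where lower is better, differently.

```python
import pandas as pd

# Toy data standing in for the career batting table (values are made up).
toy = pd.DataFrame(
    {
        "HR": [5, 40, 120, 300, 500],
        "AVG": [0.220, 0.250, 0.270, 0.290, 0.310],
    }
)
toy_features = {"HR": [0.5, 0.9], "AVG": [0.5]}


def quantile_cutoffs(frame, features):
    """Map each stat to the raw values found at the requested quantiles."""
    return {
        stat: [frame[stat].quantile(q) for q in qs]
        for stat, qs in features.items()
    }


def binary_fingerprint(row, cutoffs):
    """One bit per (stat, cutoff): 1 if the player's value reaches the cutoff."""
    return [int(row[stat] >= c) for stat, cs in cutoffs.items() for c in cs]


cutoffs = quantile_cutoffs(toy, toy_features)
toy["Fingerprint"] = toy.apply(lambda r: binary_fingerprint(r, cutoffs), axis=1)
print(toy)
```

Under these assumptions the fingerprint length equals the total number of cutoffs, so the ten features defined above would give 30 bits per player.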

## Converting Fingerprints to Arrays

We convert the fingerprints into a NumPy array for efficient vectorized computation. A bool or uint8 array would be more compact for purely binary data, but float32 keeps the code general enough for non-binary (quantized) fingerprints while using half the memory of NumPy's default 64-bit types.

In [None]:
fp = np.array(df["Fingerprint"].tolist(), dtype=np.float32)  
pkeys = df["playerID"].to_numpy()
players = df["Name"].to_numpy()
player_data = list(zip(players, pkeys, fp))
n = len(fp)
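
As a rough check on the memory trade-off, the sketch below compares array sizes for three dtypes. The sizes are assumptions: roughly 18k players (the figure used in this notebook) and 30 bits per fingerprint, which is what the ten features above would give if each threshold maps to one bit.

```python
import numpy as np

# Assumed sizes: ~18k players, 30 bits per fingerprint.
n_players, n_bits = 18_000, 30

sizes_mb = {}
for dtype in (np.float64, np.float32, np.uint8):
    arr = np.zeros((n_players, n_bits), dtype=dtype)
    sizes_mb[np.dtype(dtype).name] = arr.nbytes / 1e6

for name, mb in sizes_mb.items():
    print(f"{name:>8}: {mb:.2f} MB")
```

At this scale the fingerprint array is only a few megabytes in any of these dtypes, so the dtype choice matters far less than avoiding a full pairwise matrix.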

## Computing Pairwise Similarities

We compute pairwise Tanimoto similarity for all player pairs.

$$
T(v_1, v_2) = \frac{|v_1 \cap v_2|}{|v_1 \cup v_2|}
$$
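
For binary vectors, the intersection counts the bits set in both fingerprints and the union counts the bits set in either. The sketch below is a worked example only: `tanimoto_sketch` is a hypothetical helper, not the `diamondfp.scoring.tanimoto` implementation. It uses the dot-product form, which reduces to the set formula for 0/1 vectors and is one common way to extend the coefficient to real-valued vectors. Whether the library uses the same form is not something this notebook confirms.

```python
import numpy as np


def tanimoto_sketch(v1, v2):
    """Tanimoto similarity via dot products: a.b / (a.a + b.b - a.b)."""
    v1 = np.asarray(v1, dtype=np.float32)
    v2 = np.asarray(v2, dtype=np.float32)
    dot = float(v1 @ v2)
    denom = float(v1 @ v1) + float(v2 @ v2) - dot
    # Two all-zero vectors have an empty union; returning 0.0 is a
    # convention chosen for this sketch.
    return dot / denom if denom > 0 else 0.0


a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
print(tanimoto_sketch(a, b))  # 2 shared bits / 4 bits set in either = 0.5
```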


To stay memory-safe with ~18k players, we avoid building the full n × n matrix. Instead, we compute similarities one pair at a time and write out only the pairs whose similarity exceeds a threshold (0.75 here).

In [None]:
import itertools

with open("../data/mlb-similarities.csv", "w", newline="") as csvfile:
    fieldnames = ["Player 1", "Player 2", "PKey 1", "PKey 2", "Similarity"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # Iterate over all unique player pairs
    for (player1, pkey1, fp1), (player2, pkey2, fp2) in itertools.combinations(
        player_data, 2
    ):
        sim = tanimoto(fp1, fp2)
        if sim > 0.75:
            writer.writerow(
                {
                    "Player 1": player1,
                    "Player 2": player2,
                    "PKey 1": pkey1,
                    "PKey 2": pkey2,
                    "Similarity": sim,
                }
            )
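
The loop above makes one Python-level call per pair, which for ~18k players is roughly 162 million calls. A row-at-a-time variant can push most of that work into NumPy while still holding only one row of similarities in memory. This is a minimal sketch, assuming the dot-product form of the Tanimoto coefficient. `similar_pairs` is a hypothetical helper and is not part of `diamondfp`.

```python
import numpy as np


def similar_pairs(fp, threshold=0.75):
    """Yield (i, j, similarity) for i < j whose score exceeds threshold."""
    fp = np.asarray(fp, dtype=np.float32)
    norms = np.einsum("ij,ij->i", fp, fp)  # squared norm of each row
    for i in range(len(fp) - 1):
        dots = fp[i + 1:] @ fp[i]  # dot of row i with every later row
        denom = norms[i] + norms[i + 1:] - dots
        # Where the union is empty (both rows all zero), leave the score at 0.
        sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
        for offset in np.flatnonzero(sims > threshold):
            yield i, i + 1 + int(offset), float(sims[offset])


toy_fp = np.array([[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1]])
hits = list(similar_pairs(toy_fp))
print(hits)
```

The yielded indices can be mapped back through `players[i]` and `pkeys[i]` when writing rows to the CSV.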


## Querying Similar Players

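Once we have a similarity table, we need a way to look up a player's closest matches. Because each pair is stored only once, the player can appear in either the `Player 1` or the `Player 2` column, so the query checks both.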

In [5]:
def find_similar(df, player_name, tophits=10):
    # Select only relevant rows (no .copy() → avoids extra memory)
    mask1 = df["Player 1"] == player_name
    mask2 = df["Player 2"] == player_name

    # Build results directly without modifying a DataFrame
    players = np.where(mask1, df["Player 2"], df["Player 1"])
    sims = df["Similarity"]

    # Only keep rows where player_name was in either column
    valid = mask1 | mask2
    players = players[valid]
    sims = sims[valid]

    # Turn into DataFrame once (minimal allocations)
    result = pd.DataFrame({"Similar Player": players, "Similarity": sims})

    return result.nlargest(tophits, "Similarity").reset_index(drop=True)

## Example Query

We can now explore the most similar players to a given player, e.g., Bryce Harper:

In [6]:
df_sim = pd.read_csv("../data/mlb-similarities.csv")
find_similar(df_sim, "Bryce Harper")

|   | Similar Player | Similarity |
|---|---|---|
| 0 | Albert Belle | 1.0 |
| 1 | Jeromy Burnitz | 1.0 |
| 2 | Jose Canseco | 1.0 |
| 3 | Rocky Colavito | 1.0 |
| 4 | Nelson Cruz | 1.0 |
| 5 | Carlos Delgado | 1.0 |
| 6 | Jim Edmonds | 1.0 |
| 7 | Edwin Encarnacion | 1.0 |
| 8 | Prince Fielder | 1.0 |
| 9 | Jason Giambi | 1.0 |
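
Every hit above scores 1.0 and the names run alphabetically by surname, which suggests the top ten may have been cut from a larger group of tied players. By default, pandas `nlargest` uses `keep="first"` and stops at `n` rows even when later rows tie. Passing `keep="all"` returns every tied row instead. The sketch below shows the difference on made-up data.

```python
import pandas as pd

# Made-up similarity rows for illustration only.
toy_hits = pd.DataFrame(
    {
        "Similar Player": ["A", "B", "C", "D"],
        "Similarity": [1.0, 1.0, 1.0, 0.8],
    }
)

top_first = toy_hits.nlargest(2, "Similarity")            # default keep="first"
top_all = toy_hits.nlargest(2, "Similarity", keep="all")  # keeps every tie

print(len(top_first), len(top_all))
```

With `keep="all"` the result can contain more than `tophits` rows, which makes it clear when a cutoff falls inside a group of identical fingerprints.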
