# 2025 Cy Young Similarity Search

`diamondfp` makes it possible to compare players across different seasons. One way to apply this is by creating a composite fingerprint that represents typical Cy Young candidates. By comparing this composite against every player in the current season, we can identify which players most closely match the Cy Young profile and may have the strongest case for the award.

**What we will do**
- Select the stats and quantiles that are most relevant to Cy Young decisions
- Generate fingerprints for all Cy Young winners from 2015 to 2024
- Create a composite fingerprint that represents a typical Cy Young profile
- Compare this composite fingerprint to all players from the current season
- Rank the results to see which players have the strongest case for Cy Young

In [1]:
import pandas as pd
import numpy as np
from diamondfp.fingerprints import binnedfp
from diamondfp.utils.features import generate_quantiles
from diamondfp.scoring import tanimoto

### Read in Stats from 2015-2025

In [2]:
df = pd.read_csv("../data/pitching_2015-2025.csv")

def make_name(row):
    last_name, first_name = row["last_name, first_name"].split(", ")
    return f"{first_name} {last_name}"

# Build a "First Last" display name for every row
df["name"] = df.apply(make_name, axis=1)

### Create a DataFrame of Cy Young Winners During This Time

In [3]:
cyy = pd.DataFrame(
    {
        "Season": [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024],
        "AL Cy Young": ["Dallas Keuchel", "Rick Porcello", "Corey Kluber", 
                        "Blake Snell", "Justin Verlander", "Shane Bieber", 
                        "Robbie Ray", "Justin Verlander", "Gerrit Cole", 
                        "Tarik Skubal"],
        "NL Cy Young": ["Jake Arrieta", "Max Scherzer", "Max Scherzer", 
                        "Jacob deGrom", "Jacob deGrom", "Trevor Bauer", 
                        "Corbin Burnes", "Sandy Alcantara", "Blake Snell", 
                        "Chris Sale"],
    }
)
cyy

| | Season | AL Cy Young | NL Cy Young |
|--|--|--|--|
| 0 | 2015 | Dallas Keuchel | Jake Arrieta |
| 1 | 2016 | Rick Porcello | Max Scherzer |
| 2 | 2017 | Corey Kluber | Max Scherzer |
| 3 | 2018 | Blake Snell | Jacob deGrom |
| 4 | 2019 | Justin Verlander | Jacob deGrom |
| 5 | 2020 | Shane Bieber | Trevor Bauer |
| 6 | 2021 | Robbie Ray | Corbin Burnes |
| 7 | 2022 | Justin Verlander | Sandy Alcantara |
| 8 | 2023 | Gerrit Cole | Blake Snell |
| 9 | 2024 | Tarik Skubal | Chris Sale |


### Generate the Quantile List for Chosen Stats

In [4]:
stat_list = list(df.columns)[4:-1]
quant_list = [0.05, 0.1, 0.25, 0.5]
# loop over stat list with quant list to set stat_features dict
stat_features = {stat: quant_list for stat in stat_list}
# the default quantiles cover the lower half of each distribution;
# override the stats where we want the upper half or the middle range instead
stat_features["p_formatted_ip"] = [0.5, 0.75, 0.9, 0.95]
stat_features["pa"] = [0.5, 0.75, 0.9, 0.95]
stat_features["strikeout"] = [0.5, 0.75, 0.9, 0.95]
stat_features["walk"] = [0.5, 0.75, 0.9, 0.95]
stat_features["z_swing_percent"] = [0.5, 0.75, 0.9, 0.95]
stat_features["z_swing_miss_percent"] = [0.5, 0.75, 0.9, 0.95]
stat_features["oz_swing_percent"] = [0.5, 0.75, 0.9, 0.95]
stat_features["oz_swing_miss_percent"] = [0.5, 0.75, 0.9, 0.95]
stat_features["in_zone_percent"] = [0.5, 0.75, 0.9, 0.95]
stat_features["swing_percent"] = [0.25, 0.5, 0.75, 0.9]
stat_features["groundballs_percent"] = [0.25, 0.5, 0.75, 0.9]
stat_features["flyballs_percent"] = [0.25, 0.5, 0.75, 0.9]
stat_features["linedrives_percent"] = [0.25, 0.5, 0.75, 0.9]
stat_features["popups_percent"] = [0.25, 0.5, 0.75, 0.9]
stat_features

{'pa': [0.5, 0.75, 0.9, 0.95],
 'hit': [0.05, 0.1, 0.25, 0.5],
 'strikeout': [0.5, 0.75, 0.9, 0.95],
 'walk': [0.5, 0.75, 0.9, 0.95],
 'k_percent': [0.05, 0.1, 0.25, 0.5],
 'bb_percent': [0.05, 0.1, 0.25, 0.5],
 'batting_avg': [0.05, 0.1, 0.25, 0.5],
 'p_era': [0.05, 0.1, 0.25, 0.5],
 'woba': [0.05, 0.1, 0.25, 0.5],
 'barrel_batted_rate': [0.05, 0.1, 0.25, 0.5],
 'hard_hit_percent': [0.05, 0.1, 0.25, 0.5],
 'avg_best_speed': [0.05, 0.1, 0.25, 0.5],
 'avg_hyper_speed': [0.05, 0.1, 0.25, 0.5],
 'z_swing_percent': [0.5, 0.75, 0.9, 0.95],
 'z_swing_miss_percent': [0.5, 0.75, 0.9, 0.95],
 'oz_swing_percent': [0.5, 0.75, 0.9, 0.95],
 'oz_swing_miss_percent': [0.5, 0.75, 0.9, 0.95],
 'oz_contact_percent': [0.05, 0.1, 0.25, 0.5],
 'out_zone_percent': [0.05, 0.1, 0.25, 0.5],
 'iz_contact_percent': [0.05, 0.1, 0.25, 0.5],
 'in_zone_percent': [0.5, 0.75, 0.9, 0.95],
 'whiff_percent': [0.05, 0.1, 0.25, 0.5],
 'swing_percent': [0.25, 0.5, 0.75, 0.9],
 'groundballs_percent': [0.25, 0.5, 0.75, 0.9],


### Get the Cy Young fingerprints for each Season

In [5]:
seasons = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
cyyfps = []
# Generate fingerprints for each Cy Young for each season
for season in seasons:
    # Make a temporary dataframe for the iterated season
    temp_df = df[df["year"] == season]
    temp_cyy = cyy[cyy["Season"] == season]
    
    # Generate quantiles for iterated season's stats
    feat_quants = generate_quantiles(temp_df, stat_features)

    # Get fingerprints for AL and NL Cy Youngs
    alcyy = temp_df[temp_df["name"] == temp_cyy["AL Cy Young"].squeeze()].squeeze()
    cyyfps.append(binnedfp(alcyy, feat_quants))
    nlcyy = temp_df[temp_df["name"] == temp_cyy["NL Cy Young"].squeeze()].squeeze()
    cyyfps.append(binnedfp(nlcyy, feat_quants))
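The loop above recomputes the quantiles for every season, so each winner is binned relative to his own season's league rather than against fixed raw cutoffs. The sketch below illustrates that idea with plain pandas on made-up ERA values. The helper name `season_thresholds` and the toy numbers are assumptions for illustration only, and this is not how `generate_quantiles` is implemented internally.

```python
import pandas as pd


def season_thresholds(frame, stat_quantiles, year_col="year"):
    """Return {year: {stat: [raw cutoffs]}} computed separately per season.

    Hypothetical helper: it only mimics the idea of per-season quantiles.
    """
    out = {}
    for year, season_df in frame.groupby(year_col):
        out[int(year)] = {
            stat: season_df[stat].quantile(levels).tolist()
            for stat, levels in stat_quantiles.items()
        }
    return out


# Made-up ERA values for two seasons (not real data)
toy = pd.DataFrame(
    {
        "year": [2019] * 5 + [2020] * 5,
        "p_era": [2.0, 3.0, 4.0, 5.0, 6.0, 1.5, 2.5, 3.5, 4.5, 5.5],
    }
)

thresholds = season_thresholds(toy, {"p_era": [0.25, 0.5]})
print(thresholds)
```

In this toy data the same quantile levels map to different raw ERA cutoffs in each season, which is why a fingerprint built this way describes where a pitcher sits within his season rather than his raw stat line.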

### Composite Fingerprint

Now that we have fingerprints for all Cy Young winners over the last 10 seasons, let's create a composite fingerprint: sum the fingerprints, then within each feature group set the bit that the winners hit most often.

In [6]:
# Sum the fingerprints for all Cy Young winners across seasons
summed_fp = np.sum(cyyfps, axis=0)

# Create a composite fingerprint of zeros
composite_fp = np.zeros_like(summed_fp)
# Reshape the summed fingerprint to group features in sets of 4 (length of quant_list)
feat_groups = summed_fp.reshape(-1, len(quant_list))
# Find the index of the max value in each feature group
max_idx = feat_groups.argmax(axis=1)
# Create an array of row indices for the feature groups
rows = np.arange(feat_groups.shape[0])
# Set the max value in each feature group to 1
composite_fp.reshape(-1, len(quant_list))[rows, max_idx] = 1
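To make the sum, reshape, and argmax steps concrete, here is a small self-contained sketch on made-up data. The helper name `composite_from_fingerprints` and the toy fingerprints are illustrative assumptions and not part of `diamondfp`. Like the cell above, it assumes every stat occupies a fixed-width group of bits.

```python
import numpy as np


def composite_from_fingerprints(fps, group_size):
    """Keep, for each feature group, only the bit set most often across fps."""
    summed = np.sum(fps, axis=0)              # how many fingerprints set each bit
    groups = summed.reshape(-1, group_size)   # one row per stat
    winners = groups.argmax(axis=1)           # most common bit in each row
    composite = np.zeros_like(groups)
    composite[np.arange(groups.shape[0]), winners] = 1
    return composite.reshape(-1)


# Three made-up fingerprints covering two stats with four bits each
toy_fps = np.array(
    [
        [1, 0, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 0, 1, 0],
    ]
)

toy_composite = composite_from_fingerprints(toy_fps, group_size=4)
print(toy_composite)  # [1 0 0 0 0 0 1 0]
```

Two properties of `argmax` are worth keeping in mind: ties go to the lowest index, and a group in which no fingerprint set any bit still gets its first bit switched on in the composite.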

### Create and Analyze 2025 Fingerprints

Create the fingerprints for the 2025 pitchers and compare them against the composite Cy Young fingerprint.

In [7]:
df_2025 = df[df["year"] == 2025].copy()

feat_quants = generate_quantiles(df_2025, stat_features)
df_2025["Fingerprint"] = df_2025.apply(lambda x: binnedfp(x, feat_quants), axis=1)

fp = np.array(df_2025["Fingerprint"].tolist(), dtype=np.float32)  
players = df_2025["name"].to_numpy()
n = len(fp)

# Score every 2025 pitcher against the composite Cy Young fingerprint
cyy_race = [
    {"Player": player, "Similarity": tanimoto(player_fp, composite_fp)}
    for player, player_fp in zip(players, fp)
]
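For binary fingerprints, the Tanimoto coefficient is commonly defined as the number of bits set in both vectors divided by the number of bits set in either one. The sketch below implements that textbook definition with NumPy. `tanimoto_binary` is a hypothetical name, and `diamondfp.scoring.tanimoto` may handle edge cases such as all-zero vectors differently.

```python
import numpy as np


def tanimoto_binary(a, b):
    """Shared on-bits divided by on-bits in either vector."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: treat two empty fingerprints as having no similarity
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


# One shared bit out of three distinct on-bits
example_score = tanimoto_binary([1, 0, 1, 0], [1, 1, 0, 0])
print(round(example_score, 3))  # 0.333
```

Under this definition every bit that is set in only one of the two fingerprints enlarges the denominator, so a score drops with each stat where a pitcher lands in a different bin than the composite.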

### Sort the Tanimoto Similarities

Let's look at just the top 10 players based on similarity.

In [8]:
cyy_race = pd.DataFrame(cyy_race)
cyy_race = cyy_race.sort_values(by="Similarity", ascending=False)
cyy_race.head(10)

| | Player | Similarity |
|--|--|--|
| 105 | Paul Skenes | 0.388889 |
| 75 | Tarik Skubal | 0.352941 |
| 9 | Zack Wheeler | 0.307692 |
| 85 | Garrett Crochet | 0.3 |
| 20 | Jacob deGrom | 0.261905 |
| 108 | Yoshinobu Yamamoto | 0.25641 |
| 89 | Spencer Schwellenbach | 0.25641 |
| 107 | Noah Cameron | 0.236842 |
| 10 | Matthew Boyd | 0.232558 |
| 54 | Logan Webb | 0.219512 |


## Results

**AL Cy Young Race**
| Rank | Player | Similarity |
|--|--|--|
| 1st | Tarik Skubal | 0.35 |
| 2nd | Garrett Crochet | 0.30 |
| 3rd | Jacob deGrom | 0.26 |

**NL Cy Young Race**
| Rank | Player | Similarity |
|--|--|--|
| 1st | Paul Skenes | 0.39 |
| 2nd | Zack Wheeler | 0.31 |
| 3rd | Yoshinobu Yamamoto | 0.26 |
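
The ranked DataFrame has no league column, so the AL and NL tables above need a league lookup. Below is a minimal sketch of that split using the top-10 rows from the output. The `league_2025` mapping is hand-entered from 2025 team affiliations and is an assumption of this sketch, not something read from the dataset.

```python
import pandas as pd

# Top-10 rows copied from the ranked output, already sorted by similarity
top10 = pd.DataFrame(
    {
        "Player": [
            "Paul Skenes", "Tarik Skubal", "Zack Wheeler", "Garrett Crochet",
            "Jacob deGrom", "Yoshinobu Yamamoto", "Spencer Schwellenbach",
            "Noah Cameron", "Matthew Boyd", "Logan Webb",
        ],
        "Similarity": [
            0.388889, 0.352941, 0.307692, 0.3, 0.261905,
            0.25641, 0.25641, 0.236842, 0.232558, 0.219512,
        ],
    }
)

# Assumption: hand-entered league for each pitcher's 2025 team
league_2025 = {
    "Paul Skenes": "NL", "Tarik Skubal": "AL", "Zack Wheeler": "NL",
    "Garrett Crochet": "AL", "Jacob deGrom": "AL", "Yoshinobu Yamamoto": "NL",
    "Spencer Schwellenbach": "NL", "Noah Cameron": "AL",
    "Matthew Boyd": "NL", "Logan Webb": "NL",
}
top10["League"] = top10["Player"].map(league_2025)

# Rows are already ranked, so the first three per league form each podium
top3 = top10.groupby("League").head(3)
podiums = {league: group["Player"].tolist() for league, group in top3.groupby("League")}
print(podiums)
```

Yoshinobu Yamamoto and Spencer Schwellenbach have identical scores (0.25641), so third place in the NL is effectively a tie that this sketch breaks by row order.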


**The results are in!** Based on fingerprint similarity, Tarik Skubal appears on track for his second Cy Young, while Paul Skenes is in line for his first. Note that the similarity scores are fairly low (the top score is only 0.39), likely because I included such a large set of pitching statistics. A more streamlined feature set might yield clearer results.