# 2025 MVP Similarity Search

`diamondfp` makes it possible to compare players across different seasons. One way to apply this is by creating a composite fingerprint that represents typical MVP candidates. By comparing this composite against every player in the current season, we can identify which players most closely match the MVP profile and may have the strongest case for the award.

**What we will do**
- Select the stats and quantiles that are most relevant to MVP decisions
- Generate fingerprints for all MVPs from 2015 to 2024
- Create a composite fingerprint that represents a typical MVP profile
- Compare this composite fingerprint to all players from the current season
- Rank the results to see which players have the strongest case for MVP

In [1]:
import pandas as pd
import numpy as np
from diamondfp.fingerprints import binnedfp
from diamondfp.utils.features import generate_quantiles
from diamondfp.scoring import tanimoto

### Read in Stats from 2015-2025

In [None]:
df = pd.read_csv("../data/batting_2015-2025.csv")

def make_name(row):
    last_name, first_name = row["last_name, first_name"].split(", ")
    return f"{first_name} {last_name}"

# Build a "First Last" display name for every row
df["name"] = df.apply(make_name, axis=1)

### Create a DataFrame of MVP Winners During This Time

In [4]:
mvp = pd.DataFrame(
    {
        "Season": [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024],
        "AL MVP": ["Josh Donaldson", "Mike Trout", "Jose Altuve", 
                   "Mookie Betts", "Mike Trout", "José Abreu", 
                   "Shohei Ohtani", "Aaron Judge", "Shohei Ohtani", 
                   "Aaron Judge"],
        "NL MVP": ["Bryce Harper", "Kris Bryant", "Giancarlo Stanton", 
                   "Christian Yelich", "Cody Bellinger", "Freddie Freeman", 
                   "Bryce Harper", "Paul Goldschmidt", "Ronald Acuña Jr.", 
                   "Shohei Ohtani"]
    }
)
mvp

| | Season | AL MVP | NL MVP |
|--|--|--|--|
| 0 | 2015 | Josh Donaldson | Bryce Harper |
| 1 | 2016 | Mike Trout | Kris Bryant |
| 2 | 2017 | Jose Altuve | Giancarlo Stanton |
| 3 | 2018 | Mookie Betts | Christian Yelich |
| 4 | 2019 | Mike Trout | Cody Bellinger |
| 5 | 2020 | José Abreu | Freddie Freeman |
| 6 | 2021 | Shohei Ohtani | Bryce Harper |
| 7 | 2022 | Aaron Judge | Paul Goldschmidt |
| 8 | 2023 | Shohei Ohtani | Ronald Acuña Jr. |
| 9 | 2024 | Aaron Judge | Shohei Ohtani |


### Generate the Quantile List for Chosen Stats

In [5]:
stat_list = list(df.columns)[4:-1]
quant_list = [0.5, 0.75, 0.9, 0.95]
# loop over stat list with quant list to set stat_features dict
stat_features = {stat: quant_list for stat in stat_list}
# for stats where we want lowest values we can reverse the quantiles
stat_features["k_percent"] = [0.05, 0.1, 0.25, 0.5]
stat_features["whiff_percent"] = [0.05, 0.1, 0.25, 0.5] 
stat_features

{'hit': [0.5, 0.75, 0.9, 0.95],
 'home_run': [0.5, 0.75, 0.9, 0.95],
 'k_percent': [0.05, 0.1, 0.25, 0.5],
 'bb_percent': [0.5, 0.75, 0.9, 0.95],
 'batting_avg': [0.5, 0.75, 0.9, 0.95],
 'slg_percent': [0.5, 0.75, 0.9, 0.95],
 'on_base_percent': [0.5, 0.75, 0.9, 0.95],
 'on_base_plus_slg': [0.5, 0.75, 0.9, 0.95],
 'r_total_stolen_base': [0.5, 0.75, 0.9, 0.95],
 'woba': [0.5, 0.75, 0.9, 0.95],
 'sweet_spot_percent': [0.5, 0.75, 0.9, 0.95],
 'barrel_batted_rate': [0.5, 0.75, 0.9, 0.95],
 'hard_hit_percent': [0.5, 0.75, 0.9, 0.95],
 'avg_best_speed': [0.5, 0.75, 0.9, 0.95],
 'avg_hyper_speed': [0.5, 0.75, 0.9, 0.95],
 'whiff_percent': [0.05, 0.1, 0.25, 0.5]}
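
To make the quantile lists more concrete, here is a minimal sketch of the kind of cutoffs they could produce for a single stat. It uses plain NumPy on made-up numbers and is only an assumption about what `generate_quantiles` returns conceptually; it is not the library's implementation, and `toy_home_runs` and `toy_k_percent` are hypothetical.

```python
import numpy as np

# Made-up season totals for illustration only
toy_home_runs = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
toy_k_percent = np.array([12.0, 15.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 33.0])

# "Higher is better" stats: cutoffs sit in the upper half of the distribution
hr_cutoffs = np.quantile(toy_home_runs, [0.5, 0.75, 0.9, 0.95])

# "Lower is better" stats: cutoffs sit in the lower half instead
k_cutoffs = np.quantile(toy_k_percent, [0.05, 0.1, 0.25, 0.5])

print("home_run cutoffs:", hr_cutoffs)
print("k_percent cutoffs:", k_cutoffs)
```

Both lists have four entries, so every stat contributes the same number of bins regardless of which direction counts as good.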

### Get the MVP fingerprints for each Season

In [6]:
seasons = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
mvpfps = []
# Generate fingerprints for each MVP for each season
for season in seasons:
    # Make a temporary dataframe for the iterated season
    temp_df = df[df["year"] == season]
    temp_mvp = mvp[mvp["Season"] == season]
    
    # Generate quantiles for iterated season's stats
    feat_quants = generate_quantiles(temp_df, stat_features)

    # Get fingerprints for AL and NL MVPs
    almvp = temp_df[temp_df["name"] == temp_mvp["AL MVP"].squeeze()].squeeze()
    mvpfps.append(binnedfp(almvp, feat_quants))
    nlmvp = temp_df[temp_df["name"] == temp_mvp["NL MVP"].squeeze()].squeeze()
    mvpfps.append(binnedfp(nlmvp, feat_quants))
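
The loop above matches MVP names against `df["name"]` by exact string equality, so a spelling difference such as an accented versus unaccented name would fail to select a row. The following is a small sketch of a pre-flight check using only the standard library. The helpers `strip_accents` and `unmatched_names` and the roster are hypothetical, and whether the dataset actually stores accents is not something this notebook confirms.

```python
import unicodedata

def strip_accents(name: str) -> str:
    """Return the name with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def unmatched_names(wanted, available):
    """List the wanted names that have no accent-insensitive match in available."""
    available_keys = {strip_accents(n).casefold() for n in available}
    return [n for n in wanted if strip_accents(n).casefold() not in available_keys]

# Made-up roster for illustration only
season_names = ["Jose Abreu", "Freddie Freeman", "Mike Trout"]

print(unmatched_names(["José Abreu", "Freddie Freeman"], season_names))
print(unmatched_names(["Ronald Acuña Jr."], season_names))
```

An empty list means every MVP name can be found once accents and case are ignored; any name that is returned deserves a manual look before fingerprinting.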

### Composite Fingerprint

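Now that we have the fingerprints of all MVP winners over the last 10 seasons, let's create a composite fingerprint. We sum the fingerprints, then for each feature group (the four quantile bins that belong to one stat) keep only the bin that MVPs landed in most often.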

In [None]:
# Sum the fingerprints for all MVPs across seasons
summed_fp = np.sum(mvpfps, axis=0)

# Create a composite fingerprint of zeros
composite_fp = np.zeros_like(summed_fp)
# Reshape the summed fingerprint to group features in sets of 4 (length of quant_list)
feat_groups = summed_fp.reshape(-1, len(quant_list))
# Find the index of the max value in each feature group
max_idx = feat_groups.argmax(axis=1)
# Create an array of row indices for the feature groups
rows = np.arange(feat_groups.shape[0])
# Set the most frequent bin in each feature group to 1
composite_fp.reshape(-1, len(quant_list))[rows, max_idx] = 1
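
To see what the composite step does, here is a toy walk-through of the same NumPy logic on three made-up fingerprints with two features of four bins each. It assumes each feature sets at most one bit per player, which is an assumption about `binnedfp` rather than something shown here.

```python
import numpy as np

n_bins = 4  # same as len(quant_list) in the notebook

# Three made-up fingerprints: two features x four bins each
toy_fps = np.array([
    [0, 0, 0, 1,   0, 1, 0, 0],
    [0, 0, 0, 1,   0, 0, 1, 0],
    [0, 0, 1, 0,   0, 1, 0, 0],
])

# Count how often each bin is set across the three players
toy_sum = toy_fps.sum(axis=0)

# One row per feature, one column per bin
toy_groups = toy_sum.reshape(-1, n_bins)

# Keep only the most frequent bin per feature (argmax takes the first bin on ties)
toy_composite = np.zeros_like(toy_sum)
rows = np.arange(toy_groups.shape[0])
toy_composite.reshape(-1, n_bins)[rows, toy_groups.argmax(axis=1)] = 1

print(toy_sum)        # [0 0 1 2 0 2 1 0]
print(toy_composite)  # [0 0 0 1 0 1 0 0]
```

Because `argmax` returns the first maximum, a tie between two bins resolves to the lower-indexed one; that is worth keeping in mind when only 20 fingerprints feed the composite.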

### Create and Analyze 2025 Fingerprints

Create the fingerprints for the 2025 hitters and compare them against the composite MVP fingerprint.

In [None]:
df_2025 = df[df["year"] == 2025].copy()

feat_quants = generate_quantiles(df_2025, stat_features)
df_2025["Fingerprint"] = df_2025.apply(lambda x: binnedfp(x, feat_quants), axis=1)

fp = np.array(df_2025["Fingerprint"].tolist(), dtype=np.float32)  
players = df_2025["name"].to_numpy()

# Score every 2025 hitter against the composite MVP fingerprint
mvp_race = [
    {"Player": player, "Similarity": tanimoto(player_fp, composite_fp)}
    for player, player_fp in zip(players, fp)
]
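
For readers unfamiliar with the metric, the standard Tanimoto similarity between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. Below is a minimal NumPy sketch of that textbook definition; `tanimoto_sketch` is a hypothetical helper, and it is assumed, not confirmed, that `diamondfp.scoring.tanimoto` computes the same quantity.

```python
import numpy as np

def tanimoto_sketch(a, b) -> float:
    """Textbook Tanimoto similarity for binary vectors: |A and B| / |A or B|."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return float(np.logical_and(a, b).sum() / union)

x = [1, 0, 0, 1, 0, 1]
y = [1, 0, 1, 1, 0, 0]
print(tanimoto_sketch(x, y))  # 0.5 (2 shared bits out of 4 set in either)
```

Under this definition, two fingerprints with 16 set bits each that share 15 of them would score 15 / (16 + 16 - 15) = 15/17, or about 0.882.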

### Sort the Tanimoto Similarities

Let's look at just the top 10 players based on similarity.

In [10]:
mvp_race = pd.DataFrame(mvp_race)
mvp_race = mvp_race.sort_values(by="Similarity", ascending=False)
mvp_race.head(10)

| | Player | Similarity |
|--|--|--|
| 18 | Aaron Judge | 0.882353 |
| 58 | Shohei Ohtani | 0.722222 |
| 56 | Kyle Schwarber | 0.47619 |
| 78 | Juan Soto | 0.318182 |
| 79 | Oneil Cruz | 0.315789 |
| 68 | Cal Raleigh | 0.304348 |
| 91 | Kyle Stowers | 0.24 |
| 93 | Will Smith | 0.208333 |
| 80 | Jonathan Aranda | 0.2 |
| 39 | Pete Alonso | 0.192308 |


## Results

**AL MVP Race**
| Rank | Player | Similarity |
|--|--|--|
| 1st | Aaron Judge | 0.88 |
| 2nd | Cal Raleigh | 0.30 |
| 3rd | Jonathan Aranda | 0.20 |

**NL MVP Race**
| Rank | Player | Similarity |
|--|--|--|
| 1st | Shohei Ohtani | 0.72 |
| 2nd | Kyle Schwarber | 0.48 |
| 3rd | Juan Soto | 0.32 |
| 3rd | Oneil Cruz | 0.32 |


**The results are in!** Based on fingerprint similarity, Aaron Judge is projected to win his 3rd MVP while Shohei Ohtani claims his 4th. This exercise has limits, though. It does not account for defensive value, which could move Cal Raleigh higher on some ballots. It assumes that only hitters will be considered for MVP. It also ignores that MVP voting is decided by baseball writers, who do not always choose the most obvious candidate, particularly when a challenger's season is only slightly less extraordinary than that of a multi-time winner. In that situation, voters may be inclined to reward the unexpected performance over the expected one.