# MLB MVP Predictor
### Author: Adam Bergen
### Date: 10-20-2025

## Description: 
In any sport, the MVP (Most Valuable Player) award is among the most coveted honors for players and among the most debated by fans. In an era with superstars such as Aaron Judge, Cal Raleigh, and Shohei Ohtani, fans often struggle to agree on who is truly worthy. Here, we leave it to the numbers: we look at the stats of previous winners and try to determine which stats actually make a winner.

### Import Data and Related Packages

In [14]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pybaseball as pb

# MVP winners (AL + NL), 2014-2024
mvp_data = [
    # AL
    (2024, "Aaron Judge", "NY Yankees", "OF", "AL"),
    (2023, "Shohei Ohtani", "LA Angels", "DH/SP", "AL"),
    (2022, "Aaron Judge", "NY Yankees", "OF", "AL"),
    (2021, "Shohei Ohtani", "LA Angels", "DH/SP", "AL"),
    (2020, "José Abreu", "Chi White Sox", "1B", "AL"),
    (2019, "Mike Trout", "LA Angels", "CF", "AL"),
    (2018, "Mookie Betts", "Boston", "OF", "AL"),
    (2017, "Jose Altuve", "Houston", "2B", "AL"),
    (2016, "Mike Trout", "LA Angels", "CF", "AL"),
    (2015, "Josh Donaldson", "Toronto", "3B", "AL"),
    (2014, "Mike Trout", "LA Angels", "CF", "AL"),

    # NL
    (2024, "Shohei Ohtani", "LA Dodgers", "DH", "NL"),
    (2023, "Ronald Acuña Jr.", "Atlanta", "OF", "NL"),
    (2022, "Paul Goldschmidt", "St. Louis", "1B", "NL"),
    (2021, "Bryce Harper", "Philadelphia", "OF", "NL"),
    (2020, "Freddie Freeman", "Atlanta", "1B", "NL"),
    (2019, "Cody Bellinger", "LA Dodgers", "RF", "NL"),
    (2018, "Christian Yelich", "Milwaukee", "OF", "NL"),
    (2017, "Giancarlo Stanton", "Miami", "RF", "NL"),
    (2016, "Kris Bryant", "Chi Cubs", "3B", "NL"),
    (2015, "Bryce Harper", "Washington", "RF", "NL"),
    (2014, "Clayton Kershaw", "LA Dodgers", "SP", "NL"),
]

# Convert to DataFrame
mvp_df = pd.DataFrame(mvp_data, columns=["Year", "Player", "Team", "Position", "League"])

### Examine the Data

Here, we examine the structure of the data. Looking at just the first few rows of the 2025 batting and pitching tables gives a sense of which columns are available.

In [None]:
bat_2025 = pb.batting_stats(2025)
pit_2025 = pb.pitching_stats(2025)

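# A notebook cell only renders its last expression, so display both heads explicitly
from IPython.display import display
display(bat_2025.head())
display(pit_2025.head())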

In [15]:
# Batting stats for the 2014-2024 seasons
bat_10years = pb.batting_stats(2014, 2024)

# Rename to merge-friendly column names that match mvp_df
bat_10years.rename(columns={"Name": "Player", "Season": "Year"}, inplace=True)

# Merge with MVP data
merged = pd.merge(bat_10years, mvp_df, on=["Player", "Year"], how="left")

# Add binary MVP target column
merged["MVP"] = merged["League"].notna().astype(int)

# Drop extra columns from the MVP list
merged.drop(columns=["Team_y", "Position", "League"], inplace=True, errors="ignore")
merged.rename(columns={"Team_x": "Team"}, inplace=True)

print("Shape after merge:", merged.shape)
merged.head()


Shape after merge: (1522, 321)


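| | IDfg | Year | Player | Team | Age | G | AB | PA | H | 1B | ... | HardHit | HardHit% | Events | CStr% | CSW% | xBA | xSLG | xwOBA | L-WAR | MVP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15640 | 2024 | Aaron Judge | NYY | 32 | 158 | 559 | 704 | 180 | 85 | ... | 239 | 0.611 | 391 | 0.146 | 0.267 | 0.31 | 0.724 | 0.48 | 11.7 | 1 |
| 1 | 15640 | 2022 | Aaron Judge | NYY | 30 | 157 | 570 | 696 | 177 | 87 | ... | 247 | 0.611 | 404 | 0.169 | 0.287 | 0.305 | 0.706 | 0.463 | 11.4 | 1 |
| 2 | 25764 | 2024 | Bobby Witt Jr. | KCR | 24 | 161 | 636 | 709 | 211 | 123 | ... | 260 | 0.483 | 538 | 0.138 | 0.236 | 0.315 | 0.576 | 0.407 | 10.5 | 0 |
| 3 | 13611 | 2018 | Mookie Betts | BOS | 25 | 136 | 520 | 614 | 180 | 96 | ... | 218 | 0.502 | 434 | 0.22 | 0.27 | 0.309 | 0.607 | 0.431 | 10.4 | 1 |
| 4 | 10155 | 2018 | Mike Trout | LAA | 26 | 140 | 471 | 608 | 147 | 80 | ... | 162 | 0.46 | 352 | 0.201 | 0.261 | 0.294 | 0.603 | 0.435 | 9.5 | 0 |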
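
Because the merge keys on exact `Player` strings, a winner only receives `MVP = 1` if the name is spelled identically in both tables. A winner who is missing from the batting table (a pitcher, for instance) drops out of the left join without any warning. The following is a minimal standard-library sketch of one way to check for both cases. The helper `normalize_name` and the sample key sets are hypothetical; in the notebook the keys would come from `mvp_df` and `bat_10years`.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Strip diacritics and case so 'José Abreu' and 'Jose Abreu' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()

# Hypothetical sample keys, standing in for the real (Player, Year) pairs
mvp_keys = {("José Abreu", 2020), ("Clayton Kershaw", 2014), ("Aaron Judge", 2022)}
stats_keys = {("Jose Abreu", 2020), ("Aaron Judge", 2022)}

# Compare on normalized names so accent differences do not hide a match
normalized_stats = {(normalize_name(n), y) for n, y in stats_keys}
unmatched = sorted(
    (n, y) for n, y in mvp_keys if (normalize_name(n), y) not in normalized_stats
)
print(unmatched)
```

Any pair left in `unmatched` is a winner that the merge would not label, which is worth knowing before trusting the `MVP` column.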


### Select Predictive Features

Here, we select the stats that we believe are most influential in determining a player's chances of winning the MVP.

In [28]:
numeric_cols = [
    "WAR", "wRC+", "OPS", "HR", "RBI", "R", "OBP", "SLG", "AVG",
    "SB", "BB%", "K%", "wOBA", "Age"
]

# Filter out rows missing key stats
data = merged[["Player", "Year"] + numeric_cols + ["MVP"]].dropna()

X = data[numeric_cols]
y = data["MVP"]
meta = data[["Player", "Year"]]  # to keep track of names

## Train-Test Split + MLP Model

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch

X_train, X_test, y_train, y_test, meta_train, meta_test = train_test_split(
    X, y, meta, test_size=0.2, stratify=y, random_state=42
)

# Scale numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to tensors
X_train_t = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
X_test_t = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_t = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)
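
The scaler is fit on the training split only and then reused on the test split, so no test-set information leaks into the mean and standard deviation. The arithmetic behind that step can be sketched for a single feature as follows. The home-run values are made up for illustration, and `statistics.pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
import statistics

train_hr = [10, 20, 30, 40]  # made-up training values
test_hr = [25, 50]           # made-up test values

# Statistics come from the training data only
mean = statistics.fmean(train_hr)
std = statistics.pstdev(train_hr)

# The same mean and std are applied to both splits
train_scaled = [(v - mean) / std for v in train_hr]
test_scaled = [(v - mean) / std for v in test_hr]

print(train_scaled)
print(test_scaled)
```

The scaled test values need not have mean zero, which is expected: they are expressed in units of the training distribution.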


In [30]:
import torch.nn as nn

class MVPNet(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

model = MVPNet(X_train.shape[1])


In [31]:
import torch.optim as optim

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

epochs = 50
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    
    # Pass the scaled tensors, not the DataFrames: model(X_train) raises
    # "TypeError: linear(): argument 'input' (position 1) must be Tensor, not DataFrame"
    outputs = model(X_train_t)
    loss = criterion(outputs, y_train_t)
    
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

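
With its default settings, `nn.BCELoss` averages the binary cross-entropy `-(y*log(p) + (1-y)*log(1-p))` over the batch, where `p` is the sigmoid output. A small standard-library sketch of that formula, using made-up probabilities rather than the model's outputs, suggests why a network that assigns a low probability to every player can still reach a small loss when positives are rare:

```python
import math

def bce(probs, labels):
    """Mean binary cross-entropy over a batch of probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

labels = [1] + [0] * 99            # one positive among 100 rows (made up)

always_low = [0.01] * 100          # predicts "not MVP" for everyone
confident = [0.9] + [0.01] * 99    # also identifies the single positive

loss_low = bce(always_low, labels)
loss_conf = bce(confident, labels)
print(f"{loss_low:.4f} {loss_conf:.4f}")
```

Both losses are small in absolute terms, so the printed training loss alone says little about whether the positives are being found.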

## Evaluation of the Model

In [None]:
model.eval()
with torch.no_grad():
    preds = model(X_test_t)
    probs = preds.numpy().flatten()

# Note: these feature columns hold the standardized values, not the raw stats
df_results = pd.DataFrame(X_test_t.numpy(), columns=numeric_cols)
df_results["MVP_Prob"] = probs
df_results["True_MVP"] = y_test_t.numpy().flatten()

# Merge the metadata (Player, Year)
df_results = pd.concat([meta_test.reset_index(drop=True), df_results], axis=1)

top_candidates = df_results.sort_values("MVP_Prob", ascending=False).head(20)
top_candidates[["Player", "Year", "MVP_Prob", "True_MVP"] + numeric_cols]



Accuracy: 0.987
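
An accuracy near 0.99 is hard to interpret on its own here, because almost every row is a non-MVP season: the merged table has 1,522 player-seasons and the MVP list has only 22 entries, so a model that never predicts an MVP would already be right over 98% of the time. The following is a minimal sketch of comparing accuracy against that baseline and against recall on the positive class. The label and prediction lists are made up and are not the notebook's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives) if positives else 0.0

y_true = [1, 1, 1, 1] + [0] * 296       # 4 MVP seasons out of 300 (made up)
never_mvp = [0] * 300                   # baseline: always predict "not MVP"
model_pred = [1, 0, 0, 0] + [0] * 296   # hypothetical model: finds 1 of 4

baseline_acc = accuracy(y_true, never_mvp)
model_acc = accuracy(y_true, model_pred)
model_recall = recall(y_true, model_pred)
print(baseline_acc, model_acc, model_recall)
```

In the notebook, the same comparison could be run with `y_test` as the labels and `probs >= 0.5` as the predictions; the 0.5 threshold is an assumption, since the cell that produced the accuracy figure is not shown.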
