# Predicting the 2025 World Series (Tutorial)

This beginner-friendly notebook trains a **logistic regression** on past seasons to predict a 2025 World Series champion, and a simple **multinomial logistic regression** to guess the series length.

**What you’ll learn**:
1. Loading data and basic feature engineering
2. Training a logistic regression model for a binary outcome
3. Evaluating with accuracy and ROC-AUC
4. Making 2025 predictions and exporting a CSV
5. A baseline multinomial model for series length


In [4]:
# --- Setup & data loading ---
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from pathlib import Path
import glob

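# Find the uploaded CSV (adjust the pattern as needed)
candidates = sorted(glob.glob('/content/ws_basic_teams_plus_2025*.csv'))
if not candidates:
    # Fail with a clear message instead of an IndexError on candidates[0]
    raise FileNotFoundError('No CSV matching /content/ws_basic_teams_plus_2025*.csv; upload the dataset first.')
path = candidates[0]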
print('Using dataset:', path)
df = pd.read_csv(path)
print(df.shape)
df


Using dataset: /content/ws_basic_teams_plus_2025 (1).csv
(244, 14)


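| | year | teamID | franchID | lgID | name | wins | losses | runs_scored | runs_allowed | nl_winner | al_winner | ws_winner | ws_game_wins | series_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1903 | BOS | BOS | AL | Boston Americans | 91 | 47 | 708 | 504 | 0.0 | 1.0 | 1.0 | 5.0 | 8.0 |
| 1 | 1903 | PIT | PIT | NL | Pittsburgh Pirates | 91 | 49 | 793 | 613 | 1.0 | 0.0 | 0.0 | 3.0 | 8.0 |
| 2 | 1905 | NY1 | SFG | NL | New York Giants | 105 | 48 | 780 | 505 | 1.0 | 0.0 | 1.0 | 4.0 | 5.0 |
| 3 | 1905 | PHA | OAK | AL | Philadelphia Athletics | 92 | 56 | 623 | 488 | 0.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 4 | 1906 | CHA | CHW | AL | Chicago White Sox | 93 | 58 | 570 | 460 | 0.0 | 1.0 | 1.0 | 4.0 | 6.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 2024 | NYA | NYY | AL | New York Yankees | 94 | 68 | 815 | 668 | 0.0 | 1.0 | 0.0 | 1.0 | 5.0 |
| 240 | 2025 | LAN | LAD | NL | Los Angeles Dodgers | 93 | 69 | 825 | 683 | NaN | NaN | NaN | NaN | NaN |
| 241 | 2025 | MIL | MIL | NL | Milwaukee Brewers | 97 | 65 | 806 | 634 | NaN | NaN | NaN | NaN | NaN |
| 242 | 2025 | TOR | TOR | AL | Toronto Blue Jays | 94 | 68 | 798 | 721 | NaN | NaN | NaN | NaN | NaN |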


In [5]:
# --- Minimal feature engineering ---
df['win_pct']  = df['wins'] / (df['wins'] + df['losses'])
df['run_diff'] = df['runs_scored'] - df['runs_allowed']

FEATURES = ['wins', 'losses', 'win_pct', 'run_diff', 'runs_scored', 'runs_allowed']
df[FEATURES + ['year','name']].tail(8)


| | wins | losses | win_pct | run_diff | runs_scored | runs_allowed | year | name |
|---|---|---|---|---|---|---|---|---|
| 236 | 90 | 72 | 0.555556 | 165 | 881 | 716 | 2023 | Texas Rangers |
| 237 | 84 | 78 | 0.518519 | -15 | 746 | 761 | 2023 | Arizona Diamondbacks |
| 238 | 98 | 64 | 0.604938 | 156 | 842 | 686 | 2024 | Los Angeles Dodgers |
| 239 | 94 | 68 | 0.580247 | 147 | 815 | 668 | 2024 | New York Yankees |
| 240 | 93 | 69 | 0.574074 | 142 | 825 | 683 | 2025 | Los Angeles Dodgers |
| 241 | 97 | 65 | 0.598765 | 172 | 806 | 634 | 2025 | Milwaukee Brewers |
| 242 | 94 | 68 | 0.580247 | 77 | 798 | 721 | 2025 | Toronto Blue Jays |
| 243 | 90 | 72 | 0.555556 | 72 | 766 | 694 | 2025 | Seattle Mariners |


## Part A — Champion model (binary logistic regression)
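Train on the rows where `ws_winner` is known, i.e. past World Series participants. Each series contributes one winner and one loser, so the two classes are balanced. This is a simple, tutorial-level baseline.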

In [6]:
hist = df[df['ws_winner'].notna()].copy()
X = hist[FEATURES].values
y = hist['ws_winner'].astype(int).values  # 1=champion, 0=not champion

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

champ_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced', solver='lbfgs')
)
champ_pipe.fit(Xtr, ytr)

yhat  = champ_pipe.predict(Xte)
probs = champ_pipe.predict_proba(Xte)[:, 1]
acc   = accuracy_score(yte, yhat)
auc   = roc_auc_score(yte, probs)

print(f'Validation accuracy: {acc:.3f}')
print(f'Validation ROC-AUC:  {auc:.3f}')
print('\nClassification report:\n')
print(classification_report(yte, yhat, digits=3))


Validation accuracy: 0.550
Validation ROC-AUC:  0.574

Classification report:

              precision    recall  f1-score   support

           0      0.548     0.567     0.557        30
           1      0.552     0.533     0.542        30

    accuracy                          0.550        60
   macro avg      0.550     0.550     0.550        60
weighted avg      0.550     0.550     0.550        60
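
A single 60-row hold-out split gives a noisy estimate, so it can help to average the metric over several folds. The sketch below shows stratified 5-fold cross-validation with ROC-AUC. It runs on synthetic data so that it is self-contained: `X_demo`, `y_demo`, and the way the labels are generated are assumptions for illustration, not the notebook's dataset. In the notebook you would pass `X` and `y` from the cell above instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y): 240 rows, 6 features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(240, 6))
# Labels depend weakly on the first feature plus noise (an arbitrary choice).
y_demo = (X_demo[:, 0] + rng.normal(scale=2.0, size=240) > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("Per-fold ROC-AUC:", np.round(auc_scores, 3))
print(f"Mean ROC-AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single split's score can move by chance.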



## Predict 2025 champion and export probabilities
Score all 2025 teams and sort by the model's probability of winning the World Series.

In [8]:
df_2025 = df[df['year'] == 2025].copy()
X_2025 = df_2025[FEATURES].values
df_2025['p_win_ws'] = champ_pipe.predict_proba(X_2025)[:, 1]

df_2025_sorted = df_2025[['name','lgID','wins','losses','win_pct','run_diff','p_win_ws']].sort_values('p_win_ws', ascending=False)
df_2025_sorted.reset_index(drop=True, inplace=True)
df_2025_sorted.head(10)  # not displayed: a notebook echoes only a cell's last expression

# Save probabilities to CSV
out_probs_csv = '/content/ws2025_team_win_probs.csv'
df_2025_sorted.to_csv(out_probs_csv, index=False)
print('Saved:', out_probs_csv)


Saved: /content/ws2025_team_win_probs.csv
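
Because the model scores each team independently, the `p_win_ws` values across the 2025 field need not sum to 1. One simple option, sketched below, is to rescale them so they can be read as shares of a single title. The team names and raw scores here are hypothetical placeholders, not the notebook's output, and `p_win_ws_norm` is a column name introduced only for this example.

```python
import pandas as pd

# Hypothetical raw scores for illustration only; not the notebook's output.
demo = pd.DataFrame({
    "name": ["Team A", "Team B", "Team C", "Team D"],
    "p_win_ws": [0.60, 0.55, 0.45, 0.40],
})

# Rescale so the probabilities across the field sum to 1.
demo["p_win_ws_norm"] = demo["p_win_ws"] / demo["p_win_ws"].sum()
print(demo)
```

Rescaling preserves the ranking; it changes only the scale on which the scores are read.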


## Part B — Series length model (multinomial logistic regression)
This **baseline** predicts the World Series length using only single-team features. In reality, series length is a head-to-head outcome; this is intentionally simple for tutorial purposes.

In [9]:
ws_participants = df[df['series_length'].notna()].copy()
y_len = ws_participants['series_length'].astype(int)

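len_pipe = make_pipeline(
    StandardScaler(),
    # With the default lbfgs solver, more than two classes are fit as a multinomial (softmax) model;
    # the explicit multi_class argument is deprecated in recent scikit-learn releases.
    LogisticRegression(max_iter=1000)
)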
len_pipe.fit(ws_participants[FEATURES].values, y_len.values)

# Pick the most likely 2025 champion and predict series length for that team
predicted_idx = df_2025['p_win_ws'].idxmax()
predicted_champ_row = df.loc[predicted_idx]
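# The row is an object-dtype Series (mixed columns), so cast the features to float explicitly
champ_features = predicted_champ_row[FEATURES].to_numpy(dtype=float).reshape(1, -1)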
series_len_proba = len_pipe.predict_proba(champ_features)[0]
len_classes = len_pipe.named_steps['logisticregression'].classes_
pred_series_len = int(len_classes[np.argmax(series_len_proba)])

summary = {
    'Predicted 2025 champion': predicted_champ_row['name'],
    'Predicted league': predicted_champ_row['lgID'],
    'Predicted series length (games)': pred_series_len,
}
summary




{'Predicted 2025 champion': 'Milwaukee Brewers',
 'Predicted league': 'NL',
 'Predicted series length (games)': 7}
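
The summary reports only the most likely length. It can be more informative to look at the full class distribution by pairing `classes_` with the `predict_proba` output. The sketch below does this on synthetic data: `X_demo`, the random labels in `y_demo`, and `demo_pipe` are assumptions for illustration. In the notebook the same pairing would use `len_classes` and `series_len_proba`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 6 features, random series lengths (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
y_demo = rng.choice([4, 5, 6, 7], size=200)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_pipe.fit(X_demo, y_demo)

# Pair each class label with its predicted probability for one row.
classes = demo_pipe.named_steps["logisticregression"].classes_
proba = demo_pipe.predict_proba(X_demo[:1])[0]
length_dist = pd.Series(proba, index=classes, name="probability")
print(length_dist.round(3))
```

If the distribution is nearly flat, the argmax carries little information, which is worth knowing before reporting a single number.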

### What to try next
- Add playoff-only filters (division winners, wild cards)
- Add opponent-strength or head-to-head features
- Use recent-form features (last 30 games)
- Model the series length using *both* teams’ features
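
As a starting point for the last item, the sketch below reshapes the per-team table into one row per series, with the AL and NL participants' features side by side. It uses four rows copied from the table shown earlier. The `_al`, `_nl`, and `_gap` column names and the choice of absolute differences are assumptions for illustration, and the sketch assumes exactly one AL and one NL participant per year.

```python
import pandas as pd

# Four World Series participants taken from the dataset preview above.
teams = pd.DataFrame({
    "year": [1903, 1903, 1905, 1905],
    "lgID": ["AL", "NL", "NL", "AL"],
    "name": ["Boston Americans", "Pittsburgh Pirates",
             "New York Giants", "Philadelphia Athletics"],
    "wins": [91, 91, 105, 92],
    "losses": [47, 49, 48, 56],
    "runs_scored": [708, 793, 780, 623],
    "runs_allowed": [504, 613, 505, 488],
    "series_length": [8, 8, 5, 5],
})
teams["win_pct"] = teams["wins"] / (teams["wins"] + teams["losses"])
teams["run_diff"] = teams["runs_scored"] - teams["runs_allowed"]

feats = ["win_pct", "run_diff"]
al = teams[teams["lgID"] == "AL"].set_index("year")
nl = teams[teams["lgID"] == "NL"].set_index("year")

# One row per series: AL and NL features side by side, joined on year.
matchups = al[feats].add_suffix("_al").join(nl[feats].add_suffix("_nl"))

# Absolute gaps as a rough measure of how lopsided the matchup looks.
for f in feats:
    matchups[f + "_gap"] = (matchups[f + "_al"] - matchups[f + "_nl"]).abs()

matchups["series_length"] = al["series_length"]
print(matchups)
```

Each series then appears once in the training data rather than twice, and the label is attached to the matchup rather than to a single team.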
