## Import Necessary Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob

## Data Cleaning / Prep

In [2]:
plays = pd.read_csv('plays.csv')
players = pd.read_csv('players.csv')
player_play = pd.read_csv('player_play.csv')
games = pd.read_csv('games.csv')


In [4]:
# Restrict each table to Kansas City Chiefs (KC) rows

# Filter plays data for Kansas City Chiefs
kc_plays = plays[plays['possessionTeam'] == 'KC']
#kc_plays = plays

# Filter player_play data for Kansas City Chiefs
kc_player_play = player_play[player_play['teamAbbr'] == 'KC']

# Filter games data for Kansas City Chiefs
kc_games = games[(games['homeTeamAbbr'] == 'KC') | (games['visitorTeamAbbr'] == 'KC')]

# Create a copy of kc_plays to avoid SettingWithCopyWarning
kc_plays = kc_plays.copy()

# Step 1: Score Differential
# Note: this is home minus visitor, so it equals KC's lead only when KC is the home team
kc_plays['scoreDifferential'] = kc_plays['preSnapHomeScore'] - kc_plays['preSnapVisitorScore']
kc_plays['isLeading'] = (kc_plays['scoreDifferential'] > 0).astype(int)

# Step 2: Field Position
kc_plays['isRedZone'] = (kc_plays['absoluteYardlineNumber'] <= 20).astype(int)

# Step 3: Down and Distance
kc_plays['shortYardage'] = (kc_plays['yardsToGo'] <= 2).astype(int)
kc_plays['thirdAndLong'] = ((kc_plays['down'] == 3) & (kc_plays['yardsToGo'] > 7)).astype(int)

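# Step 4: Game Context
kc_plays['fourthQuarter'] = (kc_plays['quarter'] == 4).astype(int)
# Flags plays with under 3:00 on the clock (minutes field <= 2) in quarter 2 or later;
# cast to int so the flag matches the other 0/1 indicator columns
kc_plays['twoMinuteDrill'] = (
    (kc_plays['quarter'] >= 2)
    & (kc_plays['gameClock'].str.split(':').str[0].astype(int) <= 2)
).astype(int)
# Target label: plays without a passResult are treated as runs
kc_plays['playType'] = kc_plays['passResult'].apply(lambda x: 'run' if pd.isnull(x) else 'pass')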
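
The score features above are computed as home minus visitor, so they describe Kansas City's lead only in home games. Below is a minimal sketch of a KC-perspective version, together with a stricter two-minute flag (at most 2:00 left in the 2nd or 4th quarter). It assumes that `plays` and `games` share a `gameId` key, which this notebook does not show, and that `gameClock` is an `MM:SS` string. The helper names `kc_score_differential` and `two_minute_flag` and the demo rows are hypothetical.

```python
import pandas as pd


def kc_score_differential(kc_plays: pd.DataFrame, games: pd.DataFrame) -> pd.Series:
    """Pre-snap score margin from KC's point of view (positive = KC leads).

    Assumes both frames carry a `gameId` column (an assumption, not confirmed here).
    """
    home_team = kc_plays['gameId'].map(games.set_index('gameId')['homeTeamAbbr'])
    home_margin = kc_plays['preSnapHomeScore'] - kc_plays['preSnapVisitorScore']
    # Keep the home margin when KC is at home, flip the sign when KC is visiting
    return home_margin.where(home_team == 'KC', -home_margin)


def two_minute_flag(kc_plays: pd.DataFrame) -> pd.Series:
    """1 when at most 2:00 remains in the 2nd or 4th quarter, else 0."""
    clock = kc_plays['gameClock'].str.split(':', expand=True).astype(int)
    seconds_left = clock[0] * 60 + clock[1]
    end_of_half = kc_plays['quarter'].isin([2, 4])
    return (end_of_half & (seconds_left <= 120)).astype(int)


# Tiny made-up example: KC hosts game 1 and visits game 2
games_demo = pd.DataFrame({
    'gameId': [1, 2],
    'homeTeamAbbr': ['KC', 'DET'],
    'visitorTeamAbbr': ['LV', 'KC'],
})
plays_demo = pd.DataFrame({
    'gameId': [1, 2, 2],
    'preSnapHomeScore': [14, 10, 10],
    'preSnapVisitorScore': [7, 17, 3],
    'quarter': [2, 3, 4],
    'gameClock': ['01:45', '01:45', '02:30'],
})

plays_demo['scoreDifferential'] = kc_score_differential(plays_demo, games_demo)
plays_demo['twoMinuteDrill'] = two_minute_flag(plays_demo)
print(plays_demo[['gameId', 'scoreDifferential', 'twoMinuteDrill']])
```

Swapping these in would change the feature values, so the model outputs recorded later in the notebook would no longer correspond to the data.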


1. Accuracy
   - What it means: The share of plays the model predicts correctly, whether pass or run.
   - Example: If the model correctly predicts 85 of 100 plays, accuracy is 85%.
   - Strength: It gives a simple overview of how often the model is right.
   - Weakness: If the dataset has more passes than runs, a model that predicts "pass" most of the time can still score high accuracy while rarely predicting "run" correctly.
2. Precision
   - What it means: Of the plays the model predicts as passes, the share that are actually passes. (The same calculation can be done for runs.)
   - Formula: Precision = (Correct Pass Predictions) / (Total Pass Predictions)
   - Example: If the model predicts 50 plays as passes and 40 of those are actually passes, precision is 40/50 = 80%.
   - Use case: Precision matters when false alarms are costly, that is, when you want to avoid predicting "pass" on a play that is actually a run.
3. Recall
   - What it means: Of all the actual passes, the share the model correctly identifies.
   - Formula: Recall = (Correct Pass Predictions) / (Total Actual Passes)
   - Example: If there are 60 actual passes and the model correctly predicts 40 of them, recall is 40/60 = 66.7%.
   - Use case: Recall matters when a missed pass is costly, so you want the model to catch as many of the actual passes as possible.
4. PRAUC (Precision-Recall Area Under Curve)
   - What it means: PRAUC summarizes precision and recall across every decision threshold for calling a play a "pass," giving a single score.
   - Why it’s useful: It ignores true negatives and scores only how well the positive class is found, which makes it informative on imbalanced data, especially when the positive class is the rare one. Throughout this section, "pass" is treated as the positive class.
   - Better PRAUC: A higher PRAUC means precision stays high as recall increases. A model with no skill scores roughly the share of positive plays in the data, not 0.5.
5. ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)
   - What it means: The ROC curve plots the trade-off between two rates as the decision threshold changes:
     - True Positive Rate (Recall): the share of actual passes that are predicted as passes.
     - False Positive Rate: the share of actual runs that are wrongly predicted as passes.
   - AUC (Area Under Curve): A single score summarizing the ROC curve. A perfect model has an AUC of 1.0, while random guessing has an AUC of 0.5.
   - Why it’s useful: It shows how well the model separates passes from runs overall, regardless of the prediction threshold.
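
The precision and recall examples above (40 of 50 predicted passes correct, 40 of 60 actual passes found) can be checked with scikit-learn. The sketch below builds a made-up sample of 100 plays consistent with those counts; the 40 runs and how they are classified are assumptions added to complete the arithmetic. It also illustrates the accuracy weakness: always predicting "pass" gets a majority of plays right while identifying no runs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up sample: 60 actual passes (1) followed by 40 actual runs (0)
y_true = np.array([1] * 60 + [0] * 40)

# Predictions: 40 passes caught, 20 passes missed, 10 runs mislabeled, 30 runs caught
y_pred = np.array([1] * 40 + [0] * 20 + [1] * 10 + [0] * 30)

accuracy = accuracy_score(y_true, y_pred)     # (40 + 30) / 100
precision = precision_score(y_true, y_pred)   # 40 / (40 + 10)
recall = recall_score(y_true, y_pred)         # 40 / (40 + 20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")

# Always predicting "pass": decent accuracy, but no run is ever identified
always_pass = np.ones_like(y_true)
always_pass_accuracy = accuracy_score(y_true, always_pass)
always_pass_run_recall = recall_score(y_true, always_pass, pos_label=0)
print(f"always-pass accuracy={always_pass_accuracy:.3f}")
print(f"always-pass run recall={always_pass_run_recall:.3f}")
```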

How They Work Together:
- Accuracy: General correctness of predictions.
- Precision: How clean the positive predictions are (few false positives).
- Recall: How completely the true instances are captured (few false negatives).
- PRAUC: How well the model balances precision and recall for one class, especially useful on imbalanced datasets.
- ROC AUC: The overall ability of the model to distinguish pass plays from run plays.

Which metric to prioritize depends on whether missed predictions (low recall) or false alarms (low precision) are more costly for your use case.
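
One detail helps when reading the two area-under-curve scores side by side: an uninformative model scores about 0.5 on ROC AUC, but its PR AUC sits near the share of positive plays rather than 0.5. The sketch below illustrates this with random scores on synthetic labels. The 70% pass share is an arbitrary assumption for illustration, not a figure taken from this dataset. `average_precision_score` is scikit-learn's PR AUC estimate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_plays = 20_000
pass_share = 0.7  # assumed share of passes, for illustration only

# Synthetic labels (1 = pass) and scores that carry no information about them
y_true = (rng.random(n_plays) < pass_share).astype(int)
random_scores = rng.random(n_plays)

roc_auc = roc_auc_score(y_true, random_scores)
pr_auc = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"ROC AUC of random scores: {roc_auc:.3f}")
print(f"PR AUC of random scores:  {pr_auc:.3f}")
print(f"Share of passes:          {prevalence:.3f}")
```

A PR AUC should therefore be judged against the positive-class share of the data it was computed on, not against 0.5.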

In [51]:
import pandas as pd
import numpy as np
import optuna
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score, 
    average_precision_score, classification_report
)
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

# Step 1: Prepare Data
# Assume `kc_plays` is a pre-loaded DataFrame.
features = ['yardsToGo', 'quarter', 'down', 'expectedPoints', 'scoreDifferential', 'isLeading', 
            'isRedZone', 'shortYardage', 'thirdAndLong', 'fourthQuarter', 'twoMinuteDrill', 
            'preSnapHomeScore', 'preSnapVisitorScore', 'offenseFormation']

data = kc_plays[features + ['playType']].dropna()

# One-hot encode offensive formations
encoder = OneHotEncoder(sparse_output=False)
encoded_formations = encoder.fit_transform(data[['offenseFormation']])
formation_columns = [f'formation_{cat}' for cat in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded_formations, columns=formation_columns)

# Merge encoded features back into the dataset
data = pd.concat([data.reset_index(drop=True), encoded_df], axis=1).drop(columns=['offenseFormation'])

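X = data.drop(columns=['playType'])
# Encode the target as 1 = pass, 0 = run: XGBClassifier expects numeric class labels,
# and precision/recall default to pos_label=1
y = (data['playType'] == 'pass').astype(int)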

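# Hold out 20% as the test set, then split the remainder 80/20 into train and validation.
# X_val/y_val are not used below; the Optuna search cross-validates on X_train instead.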
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=28, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=28, stratify=y_train_full)

# Define the Optuna objective function
def objective(trial):
    # Search ranges narrowed around the best trial of an earlier search (values noted inline)
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 3, 5),  # Around 3
        'learning_rate': trial.suggest_float('learning_rate', 0.05, 0.15),  # Around 0.073
        'subsample': trial.suggest_float('subsample', 0.5, 0.8),  # Around 0.613
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.75, 0.95),  # Around 0.852
        'n_estimators': trial.suggest_int('n_estimators', 300, 800),  # Around 727
        'reg_alpha': trial.suggest_float('reg_alpha', 2, 5.5),  # Around 4.999
        'reg_lambda': trial.suggest_float('reg_lambda', 2, 5),  # Around 4.615
    }
    
    # XGBoost model with Stratified K-Fold cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    auc_scores = []
    
    for train_idx, val_idx in skf.split(X_train, y_train):
        X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
        model = xgb.XGBClassifier(**params)
        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], verbose=False)
        
        y_pred = model.predict_proba(X_val_fold)[:, 1]
        auc_scores.append(roc_auc_score(y_val_fold, y_pred))
    
    return np.mean(auc_scores)

# Run Optuna optimization with adjusted ranges
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=200)

# Best hyperparameters
best_params = study.best_params
print("Best Hyperparameters:", best_params)

# Train final model with best hyperparameters on the entire training set
final_model = xgb.XGBClassifier(**best_params)
final_model.fit(X_train_full, y_train_full)

# Cross-validation metrics for the best model
cv_results = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X_train_full, y_train_full):
    X_train_fold, X_val_fold = X_train_full.iloc[train_idx], X_train_full.iloc[val_idx]
    y_train_fold, y_val_fold = y_train_full.iloc[train_idx], y_train_full.iloc[val_idx]
    
    model = xgb.XGBClassifier(**best_params)
    model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], verbose=False)
    
    y_pred = model.predict(X_val_fold)
    y_pred_proba = model.predict_proba(X_val_fold)[:, 1]
    
    metrics = {
        'accuracy': accuracy_score(y_val_fold, y_pred),
        'precision': precision_score(y_val_fold, y_pred),
        'recall': recall_score(y_val_fold, y_pred),
        'roc_auc': roc_auc_score(y_val_fold, y_pred_proba),
        'pr_auc': average_precision_score(y_val_fold, y_pred_proba)
    }
    cv_results.append(metrics)

# Print cross-validation results
cv_df = pd.DataFrame(cv_results)
print("Cross-Validation Results:")
print(cv_df)

# Evaluate on the test set
y_test_pred = final_model.predict(X_test)
y_test_pred_proba = final_model.predict_proba(X_test)[:, 1]

test_metrics = {
    'accuracy': accuracy_score(y_test, y_test_pred),
    'precision': precision_score(y_test, y_test_pred),
    'recall': recall_score(y_test, y_test_pred),
    'roc_auc': roc_auc_score(y_test, y_test_pred_proba),
    'pr_auc': average_precision_score(y_test, y_test_pred_proba)
}

print("Test Set Metrics:")
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")


[I 2024-11-25 23:08:09,445] A new study created in memory with name: no-name-0d30e787-5af5-4419-abad-99c57ee04c95
[I 2024-11-25 23:08:13,418] Trial 0 finished with value: 0.7860006273888012 and parameters: {'max_depth': 7, 'learning_rate': 0.15417281599477328, 'subsample': 0.6690322515116343, 'colsample_bytree': 0.6330844054743027, 'n_estimators': 383, 'reg_alpha': 4.8754621283998025, 'reg_lambda': 3.6994391766528496}. Best is trial 0 with value: 0.7860006273888012.
[I 2024-11-25 23:08:17,071] Trial 1 finished with value: 0.7835808855548543 and parameters: {'max_depth': 7, 'learning_rate': 0.15692814478442135, 'subsample': 0.625959142969496, 'colsample_bytree': 0.679868202069582, 'n_estimators': 378, 'reg_alpha': 3.9163297023966575, 'reg_lambda': 4.9291014942244935}. Best is trial 0 with value: 0.7860006273888012.
[I 2024-11-25 23:08:19,490] Trial 2 finished with value: 0.8023349597732263 and parameters: {'max_depth': 3, 'learning_rate': 0.07660833398474136, 'subsample': 0.754332054608

Best Hyperparameters: {'max_depth': 4, 'learning_rate': 0.054238433629758984, 'subsample': 0.7443763196948939, 'colsample_bytree': 0.7783798981988443, 'n_estimators': 314, 'reg_alpha': 3.6778408489507983, 'reg_lambda': 4.200707950923127}
Cross-Validation Results:
   accuracy  precision    recall   roc_auc    pr_auc
0  0.749804   0.793967  0.796010  0.810695  0.875062
1  0.736078   0.779328  0.790862  0.806612  0.871410
2  0.747451   0.782258  0.811454  0.808091  0.869394
3  0.732444   0.774905  0.790862  0.793309  0.857332
4  0.741467   0.787043  0.789575  0.802835  0.872385
Test Set Metrics:
accuracy: 0.7503
precision: 0.7937
recall: 0.7977
roc_auc: 0.8198
pr_auc: 0.8813
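
The fold-by-fold table above is easier to compare with the test-set metrics once it is condensed to a mean and standard deviation per metric. A minimal sketch follows, re-entering the five printed rows by hand so it runs on its own; inside the notebook the same `agg` call could be applied to the live `cv_df`.

```python
import pandas as pd

# Fold scores copied from the cross-validation output above
cv_df = pd.DataFrame({
    'accuracy':  [0.749804, 0.736078, 0.747451, 0.732444, 0.741467],
    'precision': [0.793967, 0.779328, 0.782258, 0.774905, 0.787043],
    'recall':    [0.796010, 0.790862, 0.811454, 0.790862, 0.789575],
    'roc_auc':   [0.810695, 0.806612, 0.808091, 0.793309, 0.802835],
    'pr_auc':    [0.875062, 0.871410, 0.869394, 0.857332, 0.872385],
})

# Mean and sample standard deviation across the five folds
summary = cv_df.agg(['mean', 'std'])
print(summary.round(4))
```

A standard deviation that is small relative to the mean suggests the score is stable across folds.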


## Random Forest

In [68]:
import pandas as pd
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score, 
    average_precision_score, f1_score, classification_report
)
from sklearn.preprocessing import OneHotEncoder

# Step 1: Prepare Data
# Assume `kc_plays` is a pre-loaded DataFrame.
features = ['yardsToGo', 'quarter', 'down', 'expectedPoints', 'scoreDifferential', 'isLeading', 
            'isRedZone', 'shortYardage', 'thirdAndLong', 'fourthQuarter', 'twoMinuteDrill', 
            'preSnapHomeScore', 'preSnapVisitorScore', 'offenseFormation']

data = kc_plays[features + ['playType']].dropna()

# One-hot encode offensive formations
encoder = OneHotEncoder(sparse_output=False)
encoded_formations = encoder.fit_transform(data[['offenseFormation']])
formation_columns = [f'formation_{cat}' for cat in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded_formations, columns=formation_columns)

# Merge encoded features back into the dataset
data = pd.concat([data.reset_index(drop=True), encoded_df], axis=1).drop(columns=['offenseFormation'])

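X = data.drop(columns=['playType'])
# Encode the target as 1 = pass, 0 = run so that precision, recall, and F1
# (which default to pos_label=1) score the "pass" class
y = (data['playType'] == 'pass').astype(int)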

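# Hold out 20% as the test set, then split the remainder 80/20 into train and validation.
# X_val/y_val are not used below; the Optuna search cross-validates on X_train instead.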
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=28, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=28, stratify=y_train_full)

# Define the Optuna objective function
def custom_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 200, 400),  # Number of trees
        'max_depth': trial.suggest_int('max_depth', 3, 5),  # Depth of each tree
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 6),  # Min samples to split a node
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 3),  # Min samples at leaf node
        'max_features': trial.suggest_float('max_features', 0.35, 0.55)  # Fraction of features for splits
    }
    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=28)
    f1_scores = []
    
    for train_idx, val_idx in skf.split(X_train, y_train):
        X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
        model = RandomForestClassifier(**params, random_state=28, n_jobs=-1, class_weight='balanced')
        model.fit(X_train_fold, y_train_fold)
        
        y_pred = model.predict(X_val_fold)
        f1_scores.append(f1_score(y_val_fold, y_pred))
    
    # Print current trial metrics
    #print(f"Trial {trial.number}: F1 Score = {np.mean(f1_scores):.4f}, Parameters = {params}")
    
    return np.mean(f1_scores)

# Run Optuna optimization with F1 score as the primary metric
study = optuna.create_study(direction='maximize')
study.optimize(custom_objective, n_trials=300)

# Print best trial results
print("\nBest Trial:")
print(f"  Trial Number: {study.best_trial.number}")
print(f"  Best F1 Score: {study.best_trial.value:.4f}")
print(f"  Best Parameters: {study.best_trial.params}")

# Train final model with best hyperparameters on the entire training set
best_params = study.best_params
final_model = RandomForestClassifier(**best_params, random_state=28, n_jobs=-1, class_weight='balanced')
final_model.fit(X_train_full, y_train_full)

# Cross-validation metrics for the best model
cv_results = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X_train_full, y_train_full):
    X_train_fold, X_val_fold = X_train_full.iloc[train_idx], X_train_full.iloc[val_idx]
    y_train_fold, y_val_fold = y_train_full.iloc[train_idx], y_train_full.iloc[val_idx]
    
    model = RandomForestClassifier(**best_params, random_state=42, n_jobs=-1, class_weight='balanced')
    model.fit(X_train_fold, y_train_fold)
    
    y_pred = model.predict(X_val_fold)
    y_pred_proba = model.predict_proba(X_val_fold)[:, 1]
    
    metrics = {
        'accuracy': accuracy_score(y_val_fold, y_pred),
        'precision': precision_score(y_val_fold, y_pred),
        'recall': recall_score(y_val_fold, y_pred),
        'f1_score': f1_score(y_val_fold, y_pred),  # Include F1 score in evaluation
        'roc_auc': roc_auc_score(y_val_fold, y_pred_proba),
        'pr_auc': average_precision_score(y_val_fold, y_pred_proba)
    }
    print(f"Fold Classification Report:\n{classification_report(y_val_fold, y_pred)}")
    cv_results.append(metrics)

# Print cross-validation results
cv_df = pd.DataFrame(cv_results)
print("\nCross-Validation Results:")
print(cv_df)

# Evaluate on the test set
y_test_pred = final_model.predict(X_test)
y_test_pred_proba = final_model.predict_proba(X_test)[:, 1]

test_metrics = {
    'accuracy': accuracy_score(y_test, y_test_pred),
    'precision': precision_score(y_test, y_test_pred),
    'recall': recall_score(y_test, y_test_pred),
    'f1_score': f1_score(y_test, y_test_pred),  # F1 score for the test set
    'roc_auc': roc_auc_score(y_test, y_test_pred_proba),
    'pr_auc': average_precision_score(y_test, y_test_pred_proba)
}

print("\nTest Set Metrics:")
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")

# Print classification report for the test set
print(f"\nTest Set Classification Report:\n{classification_report(y_test, y_test_pred)}")


[I 2024-11-26 00:11:14,953] A new study created in memory with name: no-name-111d1e4a-888c-46c8-b229-8eb1df76c905
[I 2024-11-26 00:11:16,892] Trial 0 finished with value: 0.7933627137666937 and parameters: {'n_estimators': 300, 'max_depth': 5, 'min_samples_split': 5, 'min_samples_leaf': 3, 'max_features': 0.37475264454911156}. Best is trial 0 with value: 0.7933627137666937.
[I 2024-11-26 00:11:18,699] Trial 1 finished with value: 0.790678955453149 and parameters: {'n_estimators': 287, 'max_depth': 4, 'min_samples_split': 5, 'min_samples_leaf': 3, 'max_features': 0.419881790348535}. Best is trial 0 with value: 0.7933627137666937.
[I 2024-11-26 00:11:20,047] Trial 2 finished with value: 0.7791310114521325 and parameters: {'n_estimators': 204, 'max_depth': 3, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 0.4224728911774374}. Best is trial 0 with value: 0.7933627137666937.
[I 2024-11-26 00:11:22,663] Trial 3 finished with value: 0.8051342396282288 and parameters: {'n_estim


Best Trial:
  Trial Number: 28
  Best F1 Score: 0.8132
  Best Parameters: {'n_estimators': 309, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.3663565779347745}
Fold Classification Report:
              precision    recall  f1-score   support

       False       0.58      0.60      0.59        25
        True       0.83      0.81      0.82        59

    accuracy                           0.75        84
   macro avg       0.70      0.71      0.70        84
weighted avg       0.75      0.75      0.75        84

Fold Classification Report:
              precision    recall  f1-score   support

       False       0.48      0.40      0.43        25
        True       0.76      0.81      0.79        59

    accuracy                           0.69        84
   macro avg       0.62      0.61      0.61        84
weighted avg       0.68      0.69      0.68        84

Fold Classification Report:
              precision    recall  f1-score   support

       Fals