## Introduction and Motivation - Predicting Air-Out Probabilities

This project focuses on predicting air-out probabilities for batted balls in Minor League Baseball using data from the 2023 season. The dataset includes detailed play metadata, Trackman hit tracking information, and player-level defensive metrics. The primary goal is to build a model to estimate the likelihood of a batted ball being caught for an out (p_airout) and provide actionable insights for player development and coaching decisions. Additionally, I will evaluate the defensive performance of a specific player to inform coaching strategies and organizational planning.

My first step will be to explore the provided datasets and the data dictionary to understand the meaning and relevance of each feature. I expect key features such as exit speed, spin rate, vertical and horizontal angles, and player handedness to play a significant role in predicting whether a ball will result in an air out. During preprocessing, missing values will need to be handled—numerical features might be imputed with their mean values, and categorical features like bat_side, pitch_side, and level will need to be encoded. Ensuring that the features in the test dataset align perfectly with the training dataset is important to avoid prediction errors.
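The imputation and encoding steps described above can also live inside the preprocessing object itself, so that the means and category levels learned from the training rows are reused unchanged on the test rows. The following is a minimal sketch under stated assumptions, not the notebook's actual preprocessing: the toy frames, their values, and the names `toy_train`, `toy_test`, and `prep` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames standing in for the real train/test data (values are made up)
toy_train = pd.DataFrame({
    "exit_speed": [95.0, 88.0, np.nan, 101.0],
    "bat_side": ["L", "R", "R", "L"],
})
toy_test = pd.DataFrame({
    "exit_speed": [np.nan, 90.0],
    "bat_side": ["R", "S"],  # "S" never appears in the training frame
})

# Numeric columns: fill gaps with the training mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode; unseen levels become an all-zero row
categorical = OneHotEncoder(handle_unknown="ignore")

prep = ColumnTransformer(
    [
        ("num", numeric, ["exit_speed"]),
        ("cat", categorical, ["bat_side"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

train_matrix = prep.fit_transform(toy_train)  # statistics learned here only
test_matrix = prep.transform(toy_test)        # same columns, same statistics
print(test_matrix)
```

Because the imputer and encoder are fitted once on the training frame, the test matrix has the same columns in the same order, which is the alignment the paragraph above calls for.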

Initial model selection will begin with a simple logistic regression to establish a baseline because of its interpretability and ease of use for binary classification problems. Following this, I will fit a random forest model and compare the two models using log loss on a held-out validation set. The predicted probabilities from the better-performing model will then be added to the test CSV file as p_airout.

I expect the random forest model to perform better because batted ball metrics interact with each other, and a random forest can capture these interactions better than a logistic regression model, which only combines the features additively unless interaction terms are added by hand.
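To illustrate the interaction argument, here is a small sketch on synthetic data, not on the baseball dataset. The label depends only on the product of two features, a pure interaction, so a plain logistic regression has almost nothing to work with while a random forest can carve out the pattern. The data, seeds, and the `min_samples_leaf=5` setting are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 when the two features share a sign
rng = np.random.default_rng(0)
features = rng.uniform(-1.0, 1.0, size=(2000, 2))
labels = (features[:, 0] * features[:, 1] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# Additive model: no interaction term, so it cannot separate the classes
linear = LogisticRegression().fit(X_tr, y_tr)
# Tree ensemble: successive splits on both features capture the interaction
forest = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

linear_loss = log_loss(y_va, linear.predict_proba(X_va)[:, 1])
forest_loss = log_loss(y_va, forest.predict_proba(X_va)[:, 1])
print(f"Logistic regression log loss: {linear_loss:.3f}")
print(f"Random forest log loss:       {forest_loss:.3f}")
```

Real batted-ball data is far less extreme than this toy case, so the gap between the two models there has to be measured rather than assumed.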

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [46]:
train_data = pd.read_csv('data-train.csv')
test_data = pd.read_csv('data-test.csv')

In [51]:
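# Work on copies so the raw frames loaded above stay unchanged.
# (The cell that originally created the *_clean frames is not shown; plain copies are assumed.)
train_data_clean = train_data.copy()
test_data_clean = test_data.copy()

# Preprocessing: fill missing spin rates with the training mean in both sets
spin_rate_mean = train_data_clean['hit_spin_rate'].mean()
train_data_clean['hit_spin_rate'] = train_data_clean['hit_spin_rate'].fillna(spin_rate_mean)
test_data_clean['hit_spin_rate'] = test_data_clean['hit_spin_rate'].fillna(spin_rate_mean)

# Define features and target
categorical_features = ['level', 'bat_side', 'pitch_side', 'venue_id']
numerical_features = ['temperature', 'inning', 'top', 'pre_balls', 'pre_strikes',
                      'pre_outs', 'exit_speed', 'hit_spin_rate', 'vert_exit_angle',
                      'horz_exit_angle']

# Ensure categorical consistency between train and test:
# cast to string, then label test values never seen in training as 'unknown'
for col in categorical_features:
    train_data_clean[col] = train_data_clean[col].astype(str)
    test_data_clean[col] = test_data_clean[col].astype(str)
    seen_values = set(train_data_clean[col].unique())
    test_data_clean[col] = test_data_clean[col].where(
        test_data_clean[col].isin(seen_values), 'unknown'
    )

# Build the feature sets only after the casts above so train and test share dtypes
X = train_data_clean[categorical_features + numerical_features]
y = train_data_clean['is_airout']
X_test_aligned = test_data_clean[categorical_features + numerical_features]
X_test = X_test_aligned  # same aligned features under the earlier name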

# Preprocessing pipelines
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=87))
])

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
pipeline.fit(X_train, y_train)

# Validate the model
y_val_pred_proba = pipeline.predict_proba(X_val)[:, 1]
validation_log_loss = log_loss(y_val, y_val_pred_proba)

# Predict on test data
test_data['p_airout'] = pipeline.predict_proba(X_test_aligned)[:, 1]

print(f"Validation Log Loss: {validation_log_loss}")


Validation Log Loss: 0.6192227592771073


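The model's log loss of 0.619 is mediocre: it is only modestly better than the 0.693 (ln 2) that a constant prediction of 0.5 would score, so the logistic regression baseline leaves plenty of room for improvement.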
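The reference values for log loss can be worked out directly. The sketch below uses a made-up label vector, not the real `is_airout` column, to show the score of a constant 0.5 prediction and of a constant prediction equal to the class base rate. The names `y_toy`, `coin_flip`, and `base_rate` are illustrative.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels with 70% positives (not the real is_airout column)
y_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Constant prediction of 0.5 for every row
coin_flip = np.full(len(y_toy), 0.5)
# Constant prediction equal to the observed share of positives
base_rate = np.full(len(y_toy), y_toy.mean())

coin_flip_loss = log_loss(y_toy, coin_flip)
base_rate_loss = log_loss(y_toy, base_rate)

print(f"Constant 0.5 log loss:       {coin_flip_loss:.3f}")
print(f"Constant base-rate log loss: {base_rate_loss:.3f}")
```

When the classes are imbalanced, the base-rate score is the tougher reference, so it is worth computing on the real training labels before judging any model's log loss.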

## Random Forest Model

In [70]:

# Update the pipeline to use Random Forest instead of Logistic Regression
random_forest_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Reuse the same preprocessing
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=87))
])

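# Train the Random Forest model
random_forest_pipeline.fit(X_train, y_train)  # The pipeline applies the preprocessing itself

# Predict on the same aligned test features used for the logistic regression
test_data['p_airout'] = random_forest_pipeline.predict_proba(X_test_aligned)[:, 1]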

# Save the final dataset including all columns
output_path_rf = 'data-test-final-predictions-rf.csv'
test_data.to_csv(output_path_rf, index=False)
print(f"Final dataset saved to: {output_path_rf}")

# Predict probabilities for the validation dataset
y_val_pred_proba = random_forest_pipeline.predict_proba(X_val)[:, 1]

# Calculate log loss for validation
validation_log_loss = log_loss(y_val, y_val_pred_proba)
print(f"Validation Log Loss (Random Forest): {validation_log_loss}")
print(test_data.head())


Final dataset saved to: data-test-final-predictions-rf.csv
Validation Log Loss (Random Forest): 0.2833567139641855
                               pitch_id    gamedate  temperature  level  \
0  001e5980-3d49-11ee-a040-75d2e9a8133a  2023-08-17           77      0   
1  004e75c0-3fb0-11ee-8b6f-dfb7a529cdcb  2023-08-20           93      0   
2  006e64c0-22af-11ee-bd6c-4f4458ae8744  2023-07-14           68      0   
3  0073e030-05a9-11ee-933c-3d57a8ddb0c4  2023-06-07           81      1   
4  0088ee40-535a-11ee-bc12-fb267e2cefd4  2023-09-14           80      0   

   bat_side  pitch_side  inning  top  pre_balls  pre_strikes  pre_outs  \
0         1           0       4    0          3            2         2   
1         0           0       8    1          3            1         0   
2         1           1       6    0          0            1         2   
3         1           1       5    1          2            2         1   
4         1           1       3    1          2            2    

The Random Forest model achieved a validation log loss of 0.283, less than half the logistic regression's 0.619. A log loss this low suggests that the predicted probabilities are both accurate and reasonably calibrated, although calibration was not checked separately here.
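One way to check calibration directly is a reliability table that bins the predicted probabilities and compares each bin's average prediction with the observed rate of positives. The sketch below runs on synthetic data from `make_classification`, not on the baseball dataset. In the notebook the same call would take `y_val` and the validation probabilities instead.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels
X_syn, y_syn = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)[:, 1]

# Observed positive rate and mean predicted probability per bin
observed_rate, mean_predicted = calibration_curve(y_va, proba, n_bins=5)

for predicted, observed in zip(mean_predicted, observed_rate):
    print(f"mean predicted {predicted:.2f} -> observed rate {observed:.2f}")

# Largest gap between prediction and outcome across bins
max_gap = float(np.max(np.abs(observed_rate - mean_predicted)))
print(f"Largest calibration gap: {max_gap:.2f}")
```

Bins where the observed rate sits close to the mean prediction indicate good calibration. Large gaps would argue for a recalibration step before the probabilities are used for player evaluation.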

The predicted probability for every row of the test set is saved in data-test-final-predictions-rf.csv.

With access to more data, several adjustments could be made to improve the model’s performance. Including player-level information, such as individual fielders' defensive metrics or batter tendencies, could provide context for predicting air outs. 

Similarly, adding game context data, such as wind conditions, field dimensions, humidity or weather variability beyond temperature, could refine predictions by accounting for environmental factors that affect trajectories. 

Finally, using advanced ball-tracking metrics that capture real-time positional data for players and ball paths would allow the model to better assess the likelihood of air outs by simulating specific play outcomes. 

## Random Player Defensive Analysis

I will write a one-page report for a coaching audience breaking down the defensive performance and abilities of Player #15411.

In [71]:
#Calculate total plays for Player #15411 at each position
lf_plays = train_data[train_data['lf_id'] == 15411].shape[0]
cf_plays = train_data[train_data['cf_id'] == 15411].shape[0]
rf_plays = train_data[train_data['rf_id'] == 15411].shape[0]
total_plays = lf_plays + cf_plays + rf_plays

#Filter plays where Player #15411 was the first fielder involved
true_opportunities = train_data[train_data['first_fielder'] == 15411]

#Calculate the number of successful air outs
successful_airouts = true_opportunities[true_opportunities['is_airout'] == 1].shape[0]

#Total number of true fielding opportunities
total_opportunities = true_opportunities.shape[0]

#Calculate the air out rate
airout_rate = successful_airouts / total_opportunities if total_opportunities > 0 else 0

# Display the summary
revised_summary = {
    "Total Plays": total_plays,
    "Left Field Plays": lf_plays,
    "Center Field Plays": cf_plays,
    "Right Field Plays": rf_plays,
    "Total True Opportunities": total_opportunities,
    "Successful Airouts": successful_airouts,
    "Airout Rate": airout_rate
}

print(revised_summary)


{'Total Plays': 1121, 'Left Field Plays': 0, 'Center Field Plays': 1121, 'Right Field Plays': 0, 'Total True Opportunities': 237, 'Successful Airouts': 237, 'Airout Rate': 1.0}


Overview:
Player #15411 served exclusively as a center fielder during the 2023 Minor League season, participating in 1,121 total plays at this position. This report breaks down his performance based on actual fielding opportunities—those in which he was the first fielder involved, indicating that he had a realistic chance to make a play.

Total Plays: 1,121
Left Field: 0 plays
Center Field: 1,121 plays
Right Field: 0 plays
True Fielding Opportunities (as First Fielder): 237
Successful Air Outs: 237
Air Out Rate: 100%

The true fielding opportunities reflect only the plays where Player #15411 was the first fielder involved, meaning the balls were fielded by him or were within his reach. Out of these 237 opportunities, Player #15411 successfully converted 100% of them into outs.

Positional Analysis:
Player #15411 was deployed exclusively in center field, without any appearances in left or right field. This suggests the coaching staff views him as a specialized center fielder, responsible for covering significant ground. His ability to convert every fielding opportunity into an out speaks to his excellent positioning, range, and reliability. The absence of appearances in the corner outfield positions likely indicates that the team values his athleticism and defensive instincts at the heart of the outfield, where coverage responsibilities are more demanding.

Strengths and Abilities:
Perfect Air Out Rate (100%): Converting all 237 opportunities into outs demonstrates Player #15411’s ability to consistently make plays on balls within his reach. This reflects a high level of anticipation, field awareness, and reliability in center field.

Areas for Development:
Expand Positional Versatility: Developing experience in left and right field would increase his flexibility, allowing the team to use him in various roles based on matchups and lineup needs.

Work on Difficult Plays Beyond Reach: Although the data shows perfect performance on reachable balls, focusing on drills that improve reaction time and range could extend his impact to more difficult plays, such as high-velocity line drives or balls requiring diving catches.

Conclusion:
Player #15411 demonstrated outstanding defensive performance during the 2023 season, converting 100% of his 237 fielding opportunities into outs. His exclusive deployment in center field reflects the coaching staff’s trust in his athleticism, reliability, and ability to cover ground effectively. With further development in positional versatility and advanced playmaking skills, Player #15411 has the potential to become a special defensive asset.