# CS 363M Machine Learning Project

## Authors
- Hudson Gould (HAG929)
- Cristian Cantu (cjc5844)
- Diego Costa (dc48222)
- Dylan Dang (dad4364)

## Background



In this project, we want to predict whether a given baseball pitch will result in a home run. This is an interesting problem because it could be used to better predict the outcomes of baseball games in advance (at least in terms of the number of home runs). Alternatively, one can calculate the probability of a given pitch being a home run *during* the pitch itself (though the outcome will be evident seconds later).

To do this, we are using data from the UT Baseball 2024 season. Our dataset contains every pitch thrown during UT home games, captured by a TrackMan unit, which tracks and records the 3D characteristics of a baseball in motion.

We want to use this data to predict whether a given pitch will be a home run. We will use information such as pitch velocities, runs scored, and other pitch information to make this prediction. This ML problem is especially interesting because it suffers from a massive class imbalance: far more pitches are NOT home runs than are (reminiscent of the "predicting credit card fraud" problem). This means that our data will have to be carefully pruned and our modeling techniques must be judicious to avoid an excessively high false negative rate.
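To make the imbalance concern concrete, the sketch below uses made-up labels (10 home runs in 1,000 pitches, an assumed ratio, not a figure from our data) and a hypothetical baseline that never predicts a home run. The helper names are illustrative, and the code is plain Python so the cell runs without extra dependencies.

```python
# Hypothetical labels: 1 = home run, 0 = anything else (made-up counts, not project data).
y_true = [1] * 10 + [0] * 990

# A trivial baseline that never predicts a home run.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def false_negative_rate(y_true, y_pred):
    """Fraction of actual positives the model missed: FN / (FN + TP)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return fn / (fn + tp) if (fn + tp) else 0.0


print(f"accuracy            = {accuracy(y_true, y_pred):.3f}")
print(f"false negative rate = {false_negative_rate(y_true, y_pred):.3f}")
```

On these made-up labels the baseline is right 99% of the time yet misses every home run, which is why accuracy alone would be a misleading metric for this problem.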



## Data Preparation

### Import packages

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats
import seaborn as sns
%matplotlib inline

### Data Cleaning

#### Print the head of the data as a cursory look

In [12]:
data = pd.read_csv('data.csv')

# Print the first few rows of the data, then its (rows, columns) shape
print(data.head())

print(data.shape)

                          game_id        Date     Time  PitchNo  Inning  \
0  20240220-HighPointUniversity-1  2024-02-20  60314.0       82       3   
1  20240220-HighPointUniversity-1  2024-02-20  63576.0      185       6   
2  20240220-HighPointUniversity-1  2024-02-20  66446.0      269       8   
3  20240220-HighPointUniversity-1  2024-02-20  64809.0      216       6   
4  20240220-HighPointUniversity-1  2024-02-20  67985.0      308       9   

  inning_half  PAofInning  PitchofPA           Pitcher   PitcherId  ...    z0  \
0         Top           4          2  Olsovsky, Dalton  1000251274  ...  5.41   
1         Top           2          3     Glover, Lucas  1000138461  ...  6.01   
2         Top           3          1      Carter, Noah  1000108939  ...  5.52   
3      Bottom           6          1     Welch, Collin  1000192105  ...  6.23   
4      Bottom           4          3       Lewis, Zach  1000127413  ...  5.18   

    vx0     vy0    vz0    ax0    ay0    az0          catcher  

One notable feature of this data is the enormous number of rows: 1.5 million! Looking at the first column, "game_id", we see that many other schools are represented in this dataset, since it includes games in which neither team is Texas (for example, the head of the data shows a High Point University game that does not involve Texas).
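One way to check which schools appear is to pull the middle token out of each "game_id". The sketch below assumes the IDs always follow the `<date>-<school>-<game number>` shape seen in this notebook's outputs. The helper name `school_from_game_id` is hypothetical, and the sample list is a tiny stand-in for the real column.

```python
from collections import Counter


def school_from_game_id(game_id):
    """Return the middle token of an ID shaped like '<date>-<school>-<n>', or None."""
    if not isinstance(game_id, str):
        return None  # e.g. NaN in a pandas column
    parts = game_id.split("-")
    if len(parts) < 3:
        return None  # does not match the assumed shape
    # Join the middle parts so a school token containing '-' stays intact.
    return "-".join(parts[1:-1])


# IDs that appear in this notebook's outputs, plus a missing value.
sample_ids = [
    "20240220-HighPointUniversity-1",
    "20240220-HighPointUniversity-1",
    "20240220-Texas-1",
    None,
]

# Count how many rows belong to each school token.
counts = Counter(school_from_game_id(g) for g in sample_ids)
print(counts)
```

On the full frame, the same helper could be applied with `data['game_id'].map(school_from_game_id).value_counts()`. Comparing this token exactly, rather than searching for a substring, would also avoid matching a school whose name merely contains another school's name, if any such names occur in the data.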

For this reason, an easy first step is to keep only the rows containing "Texas" in the first column, which gives us a pruned dataset of ONLY Texas home games. We choose to ignore the away games, since gaining insights that improve our home-field advantage is more useful than trying to analyze our performance at six other schools' fields.

In [14]:
texas_data = data[data['game_id'].str.contains("Texas", na=False)]

texas_data.to_csv("texas_data.csv", index=False)

print(texas_data.head())

print(texas_data.shape)

               game_id        Date     Time  PitchNo  Inning inning_half  \
4651  20240220-Texas-1  2024-02-20  78782.0      357       9         Top   
4652  20240220-Texas-1  2024-02-20  72532.0      158       5         Top   
4653  20240220-Texas-1  2024-02-20  75554.0      252       6      Bottom   
4654  20240220-Texas-1  2024-02-20  73889.0      197       5      Bottom   
4655  20240220-Texas-1  2024-02-20  76475.0      281       7      Bottom   

      PAofInning  PitchofPA           Pitcher   PitcherId  ...    z0   vx0  \
4651           3          3      O'Hara, Cade  1000192590  ...  6.04  1.96   
4652           1          2  Hamilton, Hudson      815123  ...  6.12  7.30   
4653           7          3   Gilley, Brayden  1000165200  ...  6.41  1.61   
4654          10          3   Gilley, Brayden  1000165200  ...  6.27  4.20   
4655           2          6      Wilson, Dave  1000306108  ...  5.79  2.06   

         vy0   vz0   ax0    ay0    az0             catcher catcher_id  \
4

After reducing our dataset to only the Texas home games, it has a much more manageable 10,230 rows. This is the dataset on which we will base the rest of our analysis.

This is good enough for preliminary data cleaning. Now we explore the data to better understand our features and what we need to consider when modeling!

### Data Exploration

In [None]:
cols = data.select_dtypes(include=['object', 'category']).columns
cols = cols.drop(['game_id', 'Pitcher', 'PitcherTeam', 'Batter', 'BatterTeam', 'catcher', 'catcher_team'])

max_col_width = max(len(col) for col in cols)

for col in cols:
    print(f'{col:<{max_col_width}}: {data[col].unique()}')

In [None]:
# print number of missing values in each column
missing = data.isnull().sum()
print(missing[missing > 0])

In [None]:
print(data['PlayResult'].value_counts())

In [None]:
plt.figure()
sns.boxplot(x='PlayResult', y='Distance', data=data)
plt.xlabel('Play Results')
plt.ylabel('Distance')
plt.xticks(rotation=90)
plt.show()

In [6]:
data = pd.read_csv('data_10k.csv')

# Ensure IsHomeRun is defined as a binary column
data['IsHomeRun'] = (data['PlayResult'] == 'HomeRun').astype(int)

# Select only the numerical columns (non-numeric columns are left out of the correlation)
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlations with IsHomeRun
home_run_corr = data[numerical_features].corr()['IsHomeRun'].sort_values(ascending=False)

# Display the correlations
pd.set_option('display.max_rows', None)
print(home_run_corr)

IsHomeRun                    1.000000
RunsScored                   0.526636
hit_last_tracked_distance    0.337569
Distance                     0.327059
hit_y                        0.315036
hit_hang_time                0.246433
ExitSpeed                    0.198423
hit_max_height               0.153468
position_110x                0.070841
hit_contact_x                0.068017
Angle                        0.042155
position_110y                0.039385
PitchofPA                    0.037337
Balls                        0.028325
Strikes                      0.025754
PitcherId                    0.016385
HorzBreak                    0.015899
catcher_id                   0.014210
Extension                    0.013043
InducedVertBreak             0.012919
pfxz                         0.012479
VertApprAngle                0.012105
az0                          0.010826
VertBreak                    0.010763
SpinAxis                     0.009017
PAofInning                   0.008517
HorzApprAngl

### Feature Engineering

## Modeling

### Decision Trees (Cristian)

### Neural Nets (Hudson)

### SVM

## Outcome