# CS 363M Machine Learning Project

## Authors
- Hudson Gould (HAG929)
- Cristian Cantu (cjc5844)
- Diego Costa (dc48222)
- Dylan Dang (dad4364)

## Background



In this project, we want to predict wether or not a given baseball pitch will be a home run or not. This is an interesting problem because it could be used to better predict the outcomes of baseball games in advance (at least in terms of number of homeruns). Alternatively, one can calculate the probability of a given pitch being a home run *during* the pitch itself (though the outcome will be evident seconds after).

To do this, we are using data from the UT Baseball 2024 Season. Our dataset contains data from every single pitch during UT home games, taken from a TrackMan brand detection machine which tracks and records 3D characteristic of a baseball in motion.

We want to use this data to predict whether a given pitch will be a home run or not. We will use information such as the pitch velocities, runs scored, and other pitch information to predict this. This ML problem is especially interesting, as it suffers from a massive imbalance of classes - far more of the pitches are NOT homeruns, compared to those that are. (Reminiscent of the "predicting credit card fraud" problem). This means that our data will have to be carefully pruned and our modeling techniques must be jucicious to avoid a too-high false negative rate.



## Data Preparation

### Import packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats
import seaborn as sns
import pandas as pd
%matplotlib inline

### Data Cleaning

##### Print the head of the data as a cursory look

In [None]:
data = pd.read_csv('data.csv')

#Print the head of the data
print(data.head())

print(data.shape)

One notable feature of this data is the enourmous amount of rows - 1.5 Million! Looking at the first column "game_id", we in fact see that multiple other schools are represented in this dataset, since we see names of schools that are not Texas or it's opponent on a given day (for example, the head of this data shows a game from High Point University v Not Texas). 

For this reason, an easy first step is to remove all the rows not containing "Texas" in the first column, to give us a pruned dataset of ONLY Texas home games. We choose to ignore the Away games, as garnering insights to get a better home field advantage is better than trying to analyze our performance at 6 different other school's fields.

In [None]:
texas_data = data[data['game_id'].str.contains("Texas", na=False)]

texas_data.to_csv("texas_data.csv", index=False)

Now that we have our texas_data.csv, we can proceed with our analysis from here!

In [None]:
texas_data = pd.read_csv("texas_data.csv")

print(texas_data.head())

print("Shape of the data: ", end="")
print(texas_data.shape)

num_rows = texas_data.shape[0]

#Figure out how many games this represents
num_games = texas_data['game_id'].nunique()
print("This data represents "+str(num_games)+" games")
print("This means there were an average "+str(num_rows/num_games)+" pitches per game")

After reducing our dataset to only the Texas home games, our dataset has a much more manageable 10230 rows. 

Now we want to take a more in-depth look at all of our features and use both logic and analytical methods to identify features which are not useful and then remove them as part of our feature engineering step. For this purpose, it is crucially important to understand what exactly the 77 given features are exactly.

Here are the features and their meanings: (taken from the TrackMan website)

<details>
<summary>Features</summary>

**Game Information**
- **game_id**: Game ID  
- **Date**: Date of the game  
- **Time**: Time of the pitch  
- **Inning**: Inning of the game  
- **inning_half**: Top or Bottom of the inning  
- **PAofInning**: Plate appearance of the inning  
- **PitchofPA**: Pitch number within the plate appearance  

**Pitcher Information**
- **Pitcher**: Name of the pitcher  
- **PitcherId**: Unique identifier for the pitcher  
- **PitcherThrows**: Pitcher's throwing hand (e.g., right or left)  
- **PitcherTeam**: Team of the pitcher  

**Batter Information**
- **Batter**: Name of the batter  
- **BatterId**: Unique identifier for the batter  
- **BatterSide**: Batter's stance (e.g., right or left)  
- **BatterTeam**: Team of the batter  

**Catcher Information**
- **catcher**: Name of the catcher  
- **catcher_id**: Unique identifier for the catcher  
- **catcher_team**: Team of the catcher  

**Pitch Call and Results**
- **PitchCall**: Umpire call for the pitch (e.g., ball, strike)  
- **PlayResult**: Outcome of the play (e.g., single, out, home run)  
- **KorBB**: Strikeout or base on balls indicator  
- **OutsOnPlay**: Number of outs resulting from the play  
- **RunsScored**: Runs scored on the play  

**Game State**
- **Balls**: Count of balls in the at-bat  
- **Strikes**: Count of strikes in the at-bat  
- **Outs**: Number of outs in the inning  

**Pitch Information**
- **TaggedPitchType**: Categorized pitch type (e.g., fastball, curveball)  
- **RelSpeed**: Release speed of the pitch  
- **SpinRate**: Spin rate of the pitch in revolutions per minute  
- **SpinAxis**: Orientation of the spin axis (degrees)  
- **Tilt**: Clock-style representation of spin axis  
- **InducedVertBreak**: Vertical break due to spin (in inches)  
- **VertBreak**: Total vertical break (in inches)  
- **HorzBreak**: Total horizontal break (in inches)  
- **VertApprAngle**: Vertical approach angle at the plate (degrees)  
- **HorzApprAngle**: Horizontal approach angle at the plate (degrees)  
- **zone_time**: Time to reach the strike zone (seconds)  

**Release Metrics**
- **vert_rel_angle**: Vertical release angle of the pitch (degrees)  
- **horz_rel_angle**: Horizontal release angle of the pitch (degrees)  
- **RelHeight**: Release height of the pitch (feet)  
- **RelSide**: Horizontal release position relative to the rubber (feet)  
- **Extension**: Distance from the mound to the release point (feet)  

**Plate Location**
- **PlateLocHeight**: Height of the pitch as it crosses the plate (feet)  
- **PlateLocSide**: Horizontal location of the pitch at the plate (feet)  

**Hit Information**
- **TaggedHitType**: Categorized hit type (e.g., ground ball, fly ball)  
- **hit_x**: X-coordinate of the hit landing spot (feet)  
- **hit_y**: Y-coordinate of the hit landing spot (feet)  
- **ExitSpeed**: Exit velocity of the ball off the bat (mph)  
- **Angle**: Launch angle of the ball (degrees)  
- **HitSpinRate**: Spin rate of the ball off the bat (rpm)  
- **hit_spin_axis** Spin axis of the ball off the bat (degrees)  
- **Distance**: Total distance of the hit (feet)  
- **hit_last_tracked_distance**: Last tracked distance of the ball (feet)  
- **hit_hang_time**: Time the ball is in the air (seconds)  
- **Direction**: Direction of the hit (e.g., pull, opposite)  
- **Bearing**: Bearing of the hit relative to the field (degrees)  
- **hit_max_height**: Maximum height of the ball (feet)  
- **hit_contact_x**: X-coordinate of the contact point on the bat (inches)  
- **hit_contact_y**: Y-coordinate of the contact point on the bat (inches)  
- **hit_contact_z**: Z-coordinate of the contact point on the bat (inches)  

**Pitch Physics**
- **position_110x**: X-position at 110 feet from release point (feet)  
- **position_110y**: Y-position at 110 feet from release point (feet)  
- **position_110z**: Z-position at 110 feet from release point (feet)  
- **pfxx**: Horizontal movement of the pitch (inches)  
- **pfxz**: Vertical movement of the pitch (inches)  
- **x0**: X-coordinate of the pitch at release (feet)  
- **y0**: Y-coordinate of the pitch at release (feet)  
- **z0**: Z-coordinate of the pitch at release (feet)  
- **vx0**: X-component of velocity at release (mph)  
- **vy0**: Y-component of velocity at release (mph)  
- **vz0**: Z-component of velocity at release (mph)  
- **ax0**: X-component of acceleration (ft/s²)  
- **ay0**: Y-component of acceleration (ft/s²)  
- **az0**: Z-component of acceleration (ft/s²)  
- **EffectiveVelo**: Effective velocity as perceived by the batter (mph)  
- **SpeedDrop**: Velocity drop from release to plate (mph)

<details>

While it may be tempting to immediately remove features such as the inning number, we need to do some critical thinking. The only rows which we should drop outright are those which are either too difficult to process or too variant to be meaningful. For example, pitcher name (which is categorical), and the timestamp are 2 good examples of columns we should just drop. However, info like the inning number is useful and may in fact have a correlation with home runs which should not be glossed over. For example, perhaps pitchers tend to get tired by the 9th inning, and thus give up more home runs. Or alternatively, they "lock in" in the final inning to close out a tight game! We don't really know, so it behooves us to keep it in and let analytics to the thinking.

With all this said, we will first remove those obvious "should not use" features. Note that we choose to not include player information simply because one-hot-encoding all the players would result in too many additional features.

In [40]:
cols_to_drop = [
    "game_id",
    "Date",
    "Time",
    "Pitcher",
    "PitcherId",
    "Batter",
    "BatterId",
    "catcher",
    "catcher_id",
    "catcher_team"
]

data = texas_data.drop(columns=cols_to_drop, errors='ignore')

print(data.shape)

(10230, 67)


Note that the number of columns is now down from 77 to 67, (minus 10, which is the length of cols_to_drop)

As a restatement of our goal, we want to predict wether a given pitch will be a home run given all the data up to (and including) the batter's point of contact. Any information after the fact (like distance, number of runs scored, and the play call) makes it quite easy to infer wether the hit was a home run or not. Thus, we now need to separate all the columns containing after-hit data into a different dataframe. (Not erase it, since it will be useful for accuracy metrics later!)

In [41]:
after_hit_cols = [
    "PitchCall",
    "PlayResult",
    "KorBB",
    "OutsOnPlay",
    "RunsScored",
    "TaggedHitType",
    "hit_x",
    "hit_y",
    "Distance",
    "hit_last_tracked_distance",
    "hit_hang_time",
    "Direction",
    "Bearing",
    "hit_max_height",
    "TaggedPitchType"
]

#Print before size
print(data.shape)

after_hit_data = data[after_hit_cols]
data = data.drop(columns=after_hit_cols)

print(after_hit_data.head())

#Print after sizes to confirm proper split
print()
print("Pre-hit / Training Data Shape: "+str(data.shape))
print("After-hit / Testing Data Shape: "+str(after_hit_data.shape))

(10230, 67)
              PitchCall PlayResult      KorBB  OutsOnPlay  RunsScored  \
0                InPlay        Out  Undefined           1           0   
1                InPlay     Double  Undefined           0           0   
2          StrikeCalled  Undefined  Undefined           0           0   
3            BallCalled  Undefined  Undefined           0           0   
4  FoulBallNotFieldable  Undefined  Undefined           0           0   

  TaggedHitType   hit_x   hit_y  Distance  hit_last_tracked_distance  \
0      fly_ball  136.25  226.40    264.23                     258.01   
1      fly_ball  -74.63  349.69    357.56                     357.56   
2           NaN     NaN     NaN       NaN                        NaN   
3           NaN     NaN     NaN       NaN                        NaN   
4      fly_ball -249.18  214.98    329.11                     325.88   

   hit_hang_time  Direction  Bearing  hit_max_height TaggedPitchType  
0           3.27      23.07    31.04         

Now we have isolated our pre-hit data to just 53 rows, and our post-hit data, containing 15 rows

This is good enough for preliminary data cleaning. Now we explore the data to better understand our features and what we need to consider when modeling!

### Data Exploration

To get a very overall feel for the data, we want to take a look at the correlations between the remaining columns and see how they are related to home runs

In [47]:
#Create an "IsHomeRun" label column
data['IsHomeRun'] = (after_hit_data['PlayResult'] == 'HomeRun').astype(int)

#Check if there are any categorical  columns left
categorical_features = data.select_dtypes(exclude=["number"]).columns
print("Categorical Features:")
print(categorical_features) 
print()

#There are 5, but we will keep those for the feature engineering section. So we temporarily drop them.
# Drop all columns that aren't numerical
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
print("Numerical Features: ")
print(numerical_features)
print()

## Calculate correlations with IsHomeRun
home_run_corr = data[numerical_features].corr()['IsHomeRun'].sort_values(ascending=False)

# Display the correlations
pd.set_option('display.max_rows', None)
print(home_run_corr)

Categorical Features:
Index(['inning_half', 'PitcherThrows', 'PitcherTeam', 'BatterSide',
       'BatterTeam'],
      dtype='object')

Numerical Features: 
Index(['PitchNo', 'Inning', 'PAofInning', 'PitchofPA', 'Balls', 'Strikes',
       'Outs', 'RelSpeed', 'SpinRate', 'SpinAxis', 'Tilt', 'InducedVertBreak',
       'VertBreak', 'HorzBreak', 'VertApprAngle', 'HorzApprAngle',
       'vert_rel_angle', 'horz_rel_angle', 'RelHeight', 'RelSide', 'Extension',
       'PlateLocHeight', 'PlateLocSide', 'zone_time', 'EffectiveVelo',
       'SpeedDrop', 'ExitSpeed', 'Angle', 'HitSpinRate', 'hit_spin_axis',
       'hit_contact_x', 'hit_contact_y', 'hit_contact_z', 'position_110x',
       'position_110y', 'position_110z', 'pfxx', 'pfxz', 'x0', 'y0', 'z0',
       'vx0', 'vy0', 'vz0', 'ax0', 'ay0', 'az0', 'IsHomeRun'],
      dtype='object')

IsHomeRun           1.000000
ExitSpeed           0.208884
position_110x       0.102694
hit_contact_x       0.073320
position_110y       0.062656
Angle            

The results of this are quite interesting - they show that the highest correlated features are the Exit speed, contact positions, and the Pitch of Plate Appearance! Reasonably, they are mostly related to the hit itself rather than the pitch.

### Feature Engineering

Now we get to feature engineering. There are some notable categorical features that we want to one-hot encode or turn into binary variables, to allow convenient modeling. These operations are shown below.


In [None]:
from sklearn.preprocessing import OrdinalEncoder

numerical = [
    "PitcherThrows",
    "PitcherTeam",
    "BatterSide",
    "BatterTeam",
    "TaggedPitchType",
    #
    "inning_half"
]

#Make the handed-ness and Teams a boolean value
data["PitcherThrows"] = data["PitcherThrows"].map({"L":0, "R":1})
data["PitcherTeam"] = (data["PitcherTeam"] == "TEX_LON").astype(int)
data["BatterSide"] = data["BatterSide"].map({"L":0, "R":1})
data["BatterTeam"] = (data["BatterTeam"] == "TEX_LON").astype(int)
data["inning_half"]

#We also want to engineer this new feature indicating the
#handed-ness matchup. Statistically, L/L or R/R favors the pitcher
#while L/R or R/L favors the batter. We want to represent this as 
#a new feature!
data["Sidematchup"] = (data["PitcherThrows"] == data["BatterSide"]).astype(int)




TaggedPitchType
3.0    4040
5.0    2778
4.0    1575
0.0     960
1.0     496
2.0     330
6.0      25
Name: count, dtype: int64

TaggedPitchType
3.0    4040
5.0    2778
4.0    1575
0.0     960
1.0     496
2.0     330
6.0      25
Name: count, dtype: int64
0        3.0
1        3.0
2        3.0
3        1.0
4        3.0
        ... 
10225    3.0
10226    3.0
10227    3.0
10228    3.0
10229    4.0
Name: TaggedPitchType, Length: 10228, dtype: float64
0        3.0
1        3.0
2        3.0
3        1.0
4        3.0
        ... 
10225    3.0
10226    3.0
10227    3.0
10228    3.0
10229    4.0
Name: TaggedPitchType, Length: 10228, dtype: float64


In [None]:
#Analysis on TaggedPitchType shows 9 values
print(data["TaggedPitchType"].value_counts())
print()

#Remove the Knuckleball and Four-seam
knuckle_four_rows = data[data["TaggedPitchType"].isin(["Knuckleball", "Four-Seam"])].index
data = data.drop(knuckle_four_rows)
print(data["TaggedPitchType"].value_counts()) #This shows that we removed the properly

#Encode the TaggedPitchType Ordinally
encoder = OrdinalEncoder()
print(data["TaggedPitchType"])

data["TaggedPitchType"] = encoder.fit_transform(data[["TaggedPitchType"]])
print(data["TaggedPitchType"])

## Modeling

### Decision Trees (Cristian)

### Neural Nets ()

### SVM

## Outcome