# CS 363M Machine Learning Project

## Authors
- Hudson Gould (HAG929)
- Cristian Cantu (cjc5844)
- Diego Costa (dc48222)
- Dylan Dang (dad4364)

## Background



In this project, we want to predict whether a given baseball pitch will result in a home run. This is an interesting problem because such a model could help forecast the outcomes of baseball games in advance (at least in terms of the number of home runs). Alternatively, one could estimate the probability of a home run *during* the pitch itself (though the outcome will be evident seconds later).

To do this, we are using data from the UT Baseball 2024 season. Our dataset contains data from every pitch during UT home games, recorded by a TrackMan tracking system, which tracks and records the 3D characteristics of a baseball in motion.

We want to use this data to predict whether a given pitch will be a home run or not. We will use information such as the pitch velocities, runs scored, and other pitch information to predict this. This ML problem is especially interesting because it suffers from a massive class imbalance: far more pitches are NOT home runs than are (reminiscent of the "predicting credit card fraud" problem). This means that our data will have to be carefully pruned and our modeling techniques must be judicious to avoid an excessively high false negative rate.
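To make the imbalance concern concrete, here is a minimal sketch using made-up labels (a hypothetical 5% home run rate, not a figure taken from the dataset). It shows why plain accuracy is misleading here: a "model" that never predicts a home run still scores 95% accuracy while catching none of the home runs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1,000 batted balls, 50 of which are home runs (5%).
y_true = np.zeros(1000, dtype=int)
y_true[:50] = 1

# A trivial "model" that never predicts a home run.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # 950 of 1000 correct -> 0.95
print(f"Recall:   {recall:.2f}")    # 0 of 50 home runs found -> 0.00
```

For this reason, recall (or the false negative rate) on the home run class is likely a more informative metric than accuracy for this problem.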



## Data Preparation

### Import packages

In [39]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats
import seaborn as sns
import sklearn as sk

# Display the result of every standalone expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

### Data Cleaning

##### Print the head of the data as a cursory look

In [40]:
data = pd.read_csv('data.csv')

print("Shape: ", data.shape)

data.head()

Shape:  (1513439, 77)


Unnamed: 0,game_id,Date,Time,PitchNo,Inning,inning_half,PAofInning,PitchofPA,Pitcher,PitcherId,PitcherThrows,PitcherTeam,Batter,BatterId,BatterSide,BatterTeam,PitchCall,PlayResult,KorBB,OutsOnPlay,RunsScored,Balls,Strikes,Outs,TaggedPitchType,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,TaggedHitType,hit_x,hit_y,ExitSpeed,Angle,HitSpinRate,hit_spin_axis,Distance,hit_last_tracked_distance,hit_hang_time,Direction,Bearing,hit_max_height,hit_contact_x,hit_contact_y,hit_contact_z,position_110x,position_110y,position_110z,pfxx,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0,catcher,catcher_id,catcher_team
0,20240220-HighPointUniversity-1,2024-02-20,60314.0,82,3,Top,4,2,"Olsovsky, Dalton",1000251274,R,HIG_PAN,"Quintero, Adam",1000191973,R,APP_MOU,StrikeCalled,Undefined,Undefined,0,0,0,1,1,Slider,73.64,2557.01,76.94,30600.0,-3.64,-58.59,-24.36,-9.14,-5.51,1.42,-1.23,5.32,2.14,4.72,-1.05,1.78,0.53,70.65,6.94,,,,,,,,,,,,,,,,,,,,11.48,-1.66,-2.0,50.0,5.41,3.28,-106.58,0.58,12.91,22.07,-34.04,"Ruiz, Justin",1000209000.0,HIG_PAN
1,20240220-HighPointUniversity-1,2024-02-20,63576.0,185,6,Top,2,3,"Glover, Lucas",1000138461,R,HIG_PAN,"Boyd, CJ",1000092609,R,APP_MOU,StrikeCalled,Strikeout,Strikeout,0,0,0,2,1,Slider,81.57,2167.91,112.31,35100.0,4.46,-38.74,-6.96,-8.15,-3.64,-1.01,-2.4,6.12,2.08,5.68,-0.74,1.95,0.47,79.68,7.87,,,,,,,,,,,,,,,,,,,,2.71,3.1,-1.87,50.0,6.01,5.23,-118.11,-3.37,3.74,26.71,-27.88,"Ruiz, Justin",1000209000.0,HIG_PAN
2,20240220-HighPointUniversity-1,2024-02-20,66446.0,269,8,Top,3,1,"Carter, Noah",1000108939,R,HIG_PAN,"St. Laurent, Austin",1000075448,R,APP_MOU,HitByPitch,Undefined,Undefined,0,0,0,0,2,Fastball,84.45,2150.3,192.59,45000.0,17.56,-19.68,3.67,-4.43,-0.81,-0.62,-1.49,5.57,2.5,7.13,1.46,3.36,0.44,85.82,7.02,,,,,,,,,,,,,,,,,,,,-2.4,9.98,-2.41,50.0,5.52,3.08,-122.94,-1.77,-3.68,24.28,-16.89,"Grintz, Eric",686456.0,HIG_PAN
3,20240220-HighPointUniversity-1,2024-02-20,64809.0,216,6,Bottom,6,1,"Welch, Collin",1000192105,R,APP_MOU,"Klingler, Charlie",1000101443,L,HIG_PAN,BallCalled,Undefined,Undefined,0,0,0,0,2,ChangeUp,81.24,1484.51,260.46,9900.0,4.5,-39.75,17.67,-6.72,2.56,0.53,-0.56,6.21,1.21,4.98,2.15,3.4,0.48,78.72,7.52,,,,,,,,,,,,,,,,,,,,-8.84,2.55,-1.17,50.0,6.23,0.38,-117.74,-0.37,-12.27,24.59,-28.63,"Church, Braxton",1000192000.0,APP_MOU
4,20240220-HighPointUniversity-1,2024-02-20,67985.0,308,9,Bottom,4,3,"Lewis, Zach",1000127413,R,APP_MOU,"Martinez II, Matthew",1000268640,R,HIG_PAN,BallCalled,Undefined,Undefined,0,0,1,1,2,Fastball,91.84,2421.61,205.21,45900.0,15.83,-17.91,6.87,-7.55,-3.05,-4.16,-4.28,5.52,1.56,6.0,-1.84,0.16,0.42,90.17,8.81,,,,,,,,,,,,,,,,,,,,-4.9,9.83,-1.23,50.0,5.18,9.62,-132.57,-10.2,-8.59,32.05,-14.93,"Church, Braxton",1000192000.0,APP_MOU


One notable feature of this data is the enormous number of rows: 1.5 million! Looking at the "game_id" column, we see that many other schools are represented in this dataset, since some game IDs name schools that are neither Texas nor its opponent on a given day (for example, the head of the data shows a game hosted by High Point University that does not involve Texas).

For this reason, an easy first step is to keep only the rows whose "game_id" contains "Texas", which gives us a pruned dataset of ONLY Texas home games. We choose to ignore the away games, since insights that improve our home-field advantage are more valuable than analyzing our performance at six other schools' fields.

In [41]:
texas_data = data[data['game_id'].str.contains("Texas", na=False)]

texas_data.to_csv("texas_data.csv", index=False)

texas_data.head()

Unnamed: 0,game_id,Date,Time,PitchNo,Inning,inning_half,PAofInning,PitchofPA,Pitcher,PitcherId,PitcherThrows,PitcherTeam,Batter,BatterId,BatterSide,BatterTeam,PitchCall,PlayResult,KorBB,OutsOnPlay,RunsScored,Balls,Strikes,Outs,TaggedPitchType,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,TaggedHitType,hit_x,hit_y,ExitSpeed,Angle,HitSpinRate,hit_spin_axis,Distance,hit_last_tracked_distance,hit_hang_time,Direction,Bearing,hit_max_height,hit_contact_x,hit_contact_y,hit_contact_z,position_110x,position_110y,position_110z,pfxx,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0,catcher,catcher_id,catcher_team
4651,20240220-Texas-1,2024-02-20,78782.0,357,9,Top,3,3,"O'Hara, Cade",1000192590,R,TEX_LON,"LaRue, Dylan",804659,L,HBU_HUS,InPlay,Out,Undefined,1,0,1,1,2,Fastball,88.59,2157.57,209.72,3600.0,19.77,-16.72,10.6,-6.41,0.8,-3.27,-1.08,6.34,0.08,5.33,-0.06,1.88,0.44,86.69,7.6,fly_ball,136.25,226.4,83.61,20.69,2629.59,216.8,264.23,258.01,3.27,23.07,31.04,35.12,2.04,2.06,0.06,99.04,32.26,47.86,-5.68,11.23,0.01,50.0,6.04,1.96,-128.39,-7.87,-9.46,26.93,-13.46,"Schuessler, Kimble",694645.0,TEX_LON
4652,20240220-Texas-1,2024-02-20,72532.0,158,5,Top,1,2,"Hamilton, Hudson",815123,R,TEX_LON,"LaRue, Dylan",804659,L,HBU_HUS,InPlay,Double,Undefined,0,0,0,1,0,Fastball,92.0,2445.0,207.92,3600.0,16.32,-16.59,8.12,-5.28,-1.82,-2.15,-3.28,6.3,2.69,5.9,0.32,2.92,0.41,91.28,7.47,fly_ball,-74.63,349.69,95.86,21.54,2182.83,155.6,357.56,357.56,4.26,-4.79,-12.05,56.0,0.94,2.89,-0.29,109.13,41.28,-13.82,-5.04,9.25,-2.43,50.0,6.12,7.3,-133.46,-5.53,-9.15,27.18,-15.4,"Galvan, Rylan",805025.0,TEX_LON
4653,20240220-Texas-1,2024-02-20,75554.0,252,6,Bottom,7,3,"Gilley, Brayden",1000165200,R,HBU_HUS,"Galvan, Rylan",805025,R,TEX_LON,StrikeCalled,Undefined,Undefined,0,0,1,1,2,Fastball,87.79,2343.51,214.19,4500.0,13.63,-22.44,8.49,-6.08,0.65,-1.87,-0.87,6.57,1.2,5.98,1.09,2.96,0.43,87.21,7.23,,,,,,,,,,,,,,,,,,,,-4.66,7.87,-1.13,50.0,6.41,1.61,-127.56,-4.87,-7.68,25.99,-19.2,"LaRue, Dylan",804659.0,HBU_HUS
4654,20240220-Texas-1,2024-02-20,73889.0,197,5,Bottom,10,3,"Gilley, Brayden",1000165200,R,HBU_HUS,"Galvan, Rylan",805025,R,TEX_LON,BallCalled,Undefined,Undefined,0,0,0,2,1,Curveball,78.07,2156.85,33.84,26100.0,-12.97,-60.08,-9.78,-11.45,-3.68,-0.57,-1.94,6.36,1.56,5.56,-1.07,0.82,0.49,76.3,6.61,,,,,,,,,,,,,,,,,,,,4.32,-5.93,-1.38,50.0,6.27,4.2,-113.24,-3.16,5.52,24.1,-39.73,"LaRue, Dylan",804659.0,HBU_HUS
4655,20240220-Texas-1,2024-02-20,76475.0,281,7,Bottom,2,6,"Wilson, Dave",1000306108,R,HBU_HUS,"Duplantier, Jayden",702979,R,TEX_LON,FoulBallNotFieldable,Undefined,Undefined,0,0,3,2,0,Fastball,90.65,2430.16,174.98,42300.0,11.56,-22.85,-0.92,-5.92,-1.04,-1.69,-0.88,5.95,1.91,5.51,1.01,2.46,0.42,89.28,7.46,fly_ball,-249.18,214.98,94.08,28.56,3171.45,131.32,329.11,325.88,4.19,-36.75,-49.21,63.58,2.04,2.48,-0.98,82.83,50.94,-72.38,0.23,6.62,-1.83,50.0,5.79,2.06,-131.64,-4.69,0.41,27.55,-20.54,"LaRue, Dylan",804659.0,HBU_HUS


Now that we have our texas_data.csv, we can proceed with our analysis from here!

In [42]:
texas_data = pd.read_csv("texas_data.csv")

print("Shape of the data: ", texas_data.shape)

num_rows = texas_data.shape[0]

#Figure out how many games this represents
num_games = texas_data['game_id'].nunique()
print("This data represents", num_games, "games")
print("This means there were an average", num_rows/num_games, "pitches per game")

Shape of the data:  (10230, 77)
This data represents 33 games
This means there were an average 310.0 pitches per game


After reducing our dataset to only the Texas home games, we have a much more manageable 10,230 rows.

Next, we want to take a closer look at all of our features and use both logic and analysis to identify the ones that are not useful, then remove them as part of our feature engineering step. For this, it is crucial to understand what each of the 77 features represents.

Here are the features and their meanings (taken from the TrackMan website):

<details>
<summary>Features</summary>

**Game Information**
- **game_id**: Game ID  
- **Date**: Date of the game  
- **Time**: Time of the pitch  
- **Inning**: Inning of the game  
- **inning_half**: Top or Bottom of the inning  
- **PAofInning**: Plate appearance of the inning  
- **PitchofPA**: Pitch number within the plate appearance  

**Pitcher Information**
- **Pitcher**: Name of the pitcher  
- **PitcherId**: Unique identifier for the pitcher  
- **PitcherThrows**: Pitcher's throwing hand (e.g., right or left)  
- **PitcherTeam**: Team of the pitcher  

**Batter Information**
- **Batter**: Name of the batter  
- **BatterId**: Unique identifier for the batter  
- **BatterSide**: Batter's stance (e.g., right or left)  
- **BatterTeam**: Team of the batter  

**Catcher Information**
- **catcher**: Name of the catcher  
- **catcher_id**: Unique identifier for the catcher  
- **catcher_team**: Team of the catcher  

**Pitch Call and Results**
- **PitchCall**: Umpire call for the pitch (e.g., ball, strike)  
- **PlayResult**: Outcome of the play (e.g., single, out, home run)  
- **KorBB**: Strikeout or base on balls indicator  
- **OutsOnPlay**: Number of outs resulting from the play  
- **RunsScored**: Runs scored on the play  

**Game State**
- **Balls**: Count of balls in the at-bat  
- **Strikes**: Count of strikes in the at-bat  
- **Outs**: Number of outs in the inning  

**Pitch Information**
- **TaggedPitchType**: Categorized pitch type (e.g., fastball, curveball)  
- **RelSpeed**: Release speed of the pitch  
- **SpinRate**: Spin rate of the pitch in revolutions per minute  
- **SpinAxis**: Orientation of the spin axis (degrees)  
- **Tilt**: Clock-style representation of spin axis  
- **InducedVertBreak**: Vertical break due to spin (in inches)  
- **VertBreak**: Total vertical break (in inches)  
- **HorzBreak**: Total horizontal break (in inches)  
- **VertApprAngle**: Vertical approach angle at the plate (degrees)  
- **HorzApprAngle**: Horizontal approach angle at the plate (degrees)  
- **zone_time**: Time to reach the strike zone (seconds)  

**Release Metrics**
- **vert_rel_angle**: Vertical release angle of the pitch (degrees)  
- **horz_rel_angle**: Horizontal release angle of the pitch (degrees)  
- **RelHeight**: Release height of the pitch (feet)  
- **RelSide**: Horizontal release position relative to the rubber (feet)  
- **Extension**: Distance from the mound to the release point (feet)  

**Plate Location**
- **PlateLocHeight**: Height of the pitch as it crosses the plate (feet)  
- **PlateLocSide**: Horizontal location of the pitch at the plate (feet)  

**Hit Information**
- **TaggedHitType**: Categorized hit type (e.g., ground ball, fly ball)  
- **hit_x**: X-coordinate of the hit landing spot (feet)  
- **hit_y**: Y-coordinate of the hit landing spot (feet)  
- **ExitSpeed**: Exit velocity of the ball off the bat (mph)  
- **Angle**: Launch angle of the ball (degrees)  
- **HitSpinRate**: Spin rate of the ball off the bat (rpm)  
- **hit_spin_axis**: Spin axis of the ball off the bat (degrees)  
- **Distance**: Total distance of the hit (feet)  
- **hit_last_tracked_distance**: Last tracked distance of the ball (feet)  
- **hit_hang_time**: Time the ball is in the air (seconds)  
- **Direction**: Direction of the hit (e.g., pull, opposite)  
- **Bearing**: Bearing of the hit relative to the field (degrees)  
- **hit_max_height**: Maximum height of the ball (feet)  
- **hit_contact_x**: X-coordinate of the contact point on the bat (inches)  
- **hit_contact_y**: Y-coordinate of the contact point on the bat (inches)  
- **hit_contact_z**: Z-coordinate of the contact point on the bat (inches)  

**Pitch Physics**
- **position_110x**: X-position at 110 feet from release point (feet)  
- **position_110y**: Y-position at 110 feet from release point (feet)  
- **position_110z**: Z-position at 110 feet from release point (feet)  
- **pfxx**: Horizontal movement of the pitch (inches)  
- **pfxz**: Vertical movement of the pitch (inches)  
- **x0**: X-coordinate of the pitch at release (feet)  
- **y0**: Y-coordinate of the pitch at release (feet)  
- **z0**: Z-coordinate of the pitch at release (feet)  
- **vx0**: X-component of velocity at release (ft/s)  
- **vy0**: Y-component of velocity at release (ft/s)  
- **vz0**: Z-component of velocity at release (ft/s)  
- **ax0**: X-component of acceleration (ft/s²)  
- **ay0**: Y-component of acceleration (ft/s²)  
- **az0**: Z-component of acceleration (ft/s²)  
- **EffectiveVelo**: Effective velocity as perceived by the batter (mph)  
- **SpeedDrop**: Velocity drop from release to plate (mph)

</details>

While it may be tempting to immediately remove features such as the inning number, we need to do some critical thinking. The only columns we should drop outright are those that are either too difficult to process or too variable to be meaningful. For example, the pitcher name (which is categorical with many distinct values) and the timestamp are two good examples of columns we should just drop. However, information like the inning number is useful and may in fact correlate with home runs, so it should not be glossed over. For example, perhaps pitchers tend to get tired by the 9th inning and thus give up more home runs. Or alternatively, they "lock in" in the final inning to close out a tight game! We don't really know, so it behooves us to keep it in and let the analysis do the thinking.

With all this said, we will first remove the obvious "should not use" features. Note that we choose not to include player information simply because one-hot encoding all the players would result in too many additional features.

In [43]:
cols_to_drop = [
    "game_id",
    "Date",
    "Time",
    "Pitcher",
    "PitcherId",
    "Batter",
    "BatterId",
    "catcher",
    "catcher_id",
    "catcher_team"
]

data = texas_data.drop(columns=cols_to_drop, errors='ignore')

data.shape

(10230, 67)

Note that the number of columns is now down from 77 to 67 (a reduction of 10, which is the length of cols_to_drop).
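The decision to drop the player columns rests on their cardinality. As a rough illustration of how that could be checked, the sketch below counts distinct values per categorical column and compares the total to the width of a one-hot encoding. The tiny table and the player names in it are made up for the example.

```python
import pandas as pd

# Hypothetical miniature of the pitch table; the player names are invented.
toy = pd.DataFrame({
    "Pitcher": ["A", "B", "C", "D", "A", "E"],
    "Batter": ["U", "V", "W", "X", "Y", "Z"],
    "PitcherThrows": ["R", "L", "R", "R", "R", "L"],
})

# Number of distinct values in each categorical column.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encoding creates one column per distinct value, so the encoded
# width equals the sum of the cardinalities.
one_hot_width = pd.get_dummies(toy).shape[1]
print("Columns after one-hot encoding:", one_hot_width)
```

Running `nunique()` on the real `Pitcher`, `Batter`, and `catcher` columns before dropping them would show how many extra features a one-hot encoding would have added.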

To restate our goal, we want to predict whether a given pitch will be a home run given all the data up to (and including) the batter's point of contact. Any information recorded after the fact (like distance, number of runs scored, and the play result) makes it quite easy to infer whether the hit was a home run. Thus, we now need to separate all the columns containing after-hit data into a different dataframe. (We do not erase it, since it will be useful for accuracy metrics later!)

In [44]:
# We also want to drop irrelevant records where the ball was not put in play,
# in other words where the PlayResult is neither a hit nor an out
relevant_play_results = ['Single', 'Double', 'Triple', 'HomeRun', 'Out']

# Filter the dataset to keep only relevant play results
data = data[data['PlayResult'].isin(relevant_play_results)]

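# Now let's set aside the after-hit columns (TaggedPitchType is pitch-time
# information, but it is set aside here too and encoded separately later)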
after_hit_cols = [
    "PitchCall",
    "PlayResult",
    "KorBB",
    "OutsOnPlay",
    "RunsScored",
    "TaggedHitType",
    "hit_x",
    "hit_y",
    "Distance",
    "hit_last_tracked_distance",
    "hit_hang_time",
    "Direction",
    "Bearing",
    "hit_max_height",
    "TaggedPitchType"
]

print("Shape:", data.shape)

after_hit_data = data[after_hit_cols]
data = data.drop(columns=after_hit_cols)

after_hit_data.head()

#Print after sizes to confirm proper split
print("Pre-hit / Training Data Shape: ", data.shape)
print("After-hit / Testing Data Shape: ", after_hit_data.shape)

Shape: (1570, 67)


Unnamed: 0,PitchCall,PlayResult,KorBB,OutsOnPlay,RunsScored,TaggedHitType,hit_x,hit_y,Distance,hit_last_tracked_distance,hit_hang_time,Direction,Bearing,hit_max_height,TaggedPitchType
0,InPlay,Out,Undefined,1,0,fly_ball,136.25,226.4,264.23,258.01,3.27,23.07,31.04,35.12,Fastball
1,InPlay,Double,Undefined,0,0,fly_ball,-74.63,349.69,357.56,357.56,4.26,-4.79,-12.05,56.0,Fastball
6,InPlay,Single,Undefined,0,1,fly_ball,-85.1,226.27,241.74,241.74,2.83,-12.55,-20.61,31.09,Curveball
10,InPlay,Single,Undefined,0,0,,,,,,,,,,Slider
15,InPlay,Double,Undefined,0,0,fly_ball,-69.84,364.18,370.81,370.81,3.93,-11.61,-10.86,50.6,ChangeUp


Pre-hit / Training Data Shape:  (1570, 52)
After-hit / Testing Data Shape:  (1570, 15)


We have now split the data into pre-hit data with 52 columns and post-hit data with 15 columns. We also reduced the number of records from 10,230 to 1,570 by keeping only the relevant PlayResults.

This is good enough for preliminary data cleaning. Now we explore the data to better understand our features and what we need to consider when modeling!

### Data Exploration

To get an overall feel for the data, we look at the correlations between the remaining columns and see how they relate to home runs.

In [45]:
# Create an "IsHomeRun" label column
data['IsHomeRun'] = (after_hit_data['PlayResult'] == 'HomeRun').astype(int)

# Check if there are any categorical  columns left
categorical_features = data.select_dtypes(exclude=["number"]).columns
print("Categorical Features: ", categorical_features)

# There are 5; we handle them in the feature engineering section, so for now
# we restrict the correlation analysis to the numerical columns
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
if 'IsHomeRun' not in numerical_features:
    numerical_features.append('IsHomeRun')
print("Numerical Features: ", numerical_features)

# Calculate correlations with IsHomeRun
home_run_corr = pd.DataFrame(data[numerical_features].corr()['IsHomeRun'].sort_values(ascending=False))

# Display the correlations
home_run_corr

Categorical Features:  Index(['inning_half', 'PitcherThrows', 'PitcherTeam', 'BatterSide',
       'BatterTeam'],
      dtype='object')
Numerical Features:  ['PitchNo', 'Inning', 'PAofInning', 'PitchofPA', 'Balls', 'Strikes', 'Outs', 'RelSpeed', 'SpinRate', 'SpinAxis', 'Tilt', 'InducedVertBreak', 'VertBreak', 'HorzBreak', 'VertApprAngle', 'HorzApprAngle', 'vert_rel_angle', 'horz_rel_angle', 'RelHeight', 'RelSide', 'Extension', 'PlateLocHeight', 'PlateLocSide', 'zone_time', 'EffectiveVelo', 'SpeedDrop', 'ExitSpeed', 'Angle', 'HitSpinRate', 'hit_spin_axis', 'hit_contact_x', 'hit_contact_y', 'hit_contact_z', 'position_110x', 'position_110y', 'position_110z', 'pfxx', 'pfxz', 'x0', 'y0', 'z0', 'vx0', 'vy0', 'vz0', 'ax0', 'ay0', 'az0', 'IsHomeRun']


Unnamed: 0,IsHomeRun
IsHomeRun,1.0
ExitSpeed,0.220523
Angle,0.118566
hit_contact_x,0.096372
position_110y,0.093541
az0,0.084719
InducedVertBreak,0.083672
pfxz,0.081276
VertApprAngle,0.074895
VertBreak,0.072639


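The results are quite interesting: among the correlations shown above, the features most correlated with home runs are the exit speed, the launch angle, and the contact position. Reasonably, these mostly describe the hit itself rather than the pitch. Note that even the strongest correlation (ExitSpeed, about 0.22) is weak, so no single feature linearly separates home runs from other batted balls.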
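Because `IsHomeRun` is binary, the Pearson correlation computed by `DataFrame.corr()` is the same quantity as the point-biserial correlation. The sketch below checks this on synthetic data (the feature and label are randomly generated stand-ins, not the notebook's columns) and also obtains a p-value via `scipy.stats.pointbiserialr`, which can help judge whether a weak correlation is distinguishable from noise.

```python
import numpy as np
import scipy.stats as spstats

rng = np.random.default_rng(0)

# Synthetic stand-ins: a rare binary label and an exit-speed-like feature
# that is shifted upward when the label is 1.
is_home_run = (rng.random(500) < 0.1).astype(int)
exit_speed = rng.normal(88, 12, 500) + 8 * is_home_run

# Pearson correlation, as DataFrame.corr() would compute it.
pearson_r = np.corrcoef(exit_speed, is_home_run)[0, 1]

# Point-biserial correlation (binary variable first) with its p-value.
pb_r, p_value = spstats.pointbiserialr(is_home_run, exit_speed)

print(f"Pearson r:        {pearson_r:.4f}")
print(f"Point-biserial r: {pb_r:.4f}")
print(f"p-value:          {p_value:.4g}")
```

The two coefficients agree up to floating-point error; the p-value is the extra piece of information.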

### Feature Engineering

Now we get to feature engineering. There are some notable categorical features that we want to encode numerically (as binary variables, ordinal codes, or one-hot columns) to allow convenient modeling. These operations are shown below.


In [46]:
from sklearn.preprocessing import OrdinalEncoder

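# Categorical columns to convert to numerical form (listed for reference;
# TaggedPitchType now lives in after_hit_data and is encoded in the next cell)
categorical_to_encode = [
    "PitcherThrows",
    "PitcherTeam",
    "BatterSide",
    "BatterTeam",
    "TaggedPitchType",
    "inning_half"
]

# Encode handedness and team membership as binary values
data["PitcherThrows"] = data["PitcherThrows"].map({"L":0, "R":1})
data["PitcherTeam"] = (data["PitcherTeam"] == "TEX_LON").astype(int)
data["BatterSide"] = data["BatterSide"].map({"L":0, "R":1})
data["BatterTeam"] = (data["BatterTeam"] == "TEX_LON").astype(int)
# inning_half is only displayed here; it is one-hot encoded in a later cell
data["inning_half"]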

# We also want to engineer this new feature indicating the
# handed-ness matchup. Statistically, L/L or R/R favors the pitcher
# while L/R or R/L favors the batter. We want to represent this as 
# a new feature!
data["Sidematchup"] = (data["PitcherThrows"] == data["BatterSide"]).astype(int)




0           Top
1           Top
6        Bottom
10       Bottom
15       Bottom
          ...  
10179       Top
10183       Top
10209       Top
10219    Bottom
10229    Bottom
Name: inning_half, Length: 1570, dtype: object

In [47]:
# Remove the Knuckleball and Four-seam
knuckle_four_rows = after_hit_data[after_hit_data["TaggedPitchType"].isin(["Knuckleball", "Four-Seam"])].index
after_hit_data = after_hit_data.drop(knuckle_four_rows)

# Encode the TaggedPitchType Ordinally
encoder = OrdinalEncoder()
after_hit_data["TaggedPitchTypeOrdinal"] = encoder.fit_transform(after_hit_data[["TaggedPitchType"]])
pd.DataFrame(after_hit_data[["TaggedPitchType", "TaggedPitchTypeOrdinal"]])

Unnamed: 0,TaggedPitchType,TaggedPitchTypeOrdinal
0,Fastball,3.0
1,Fastball,3.0
6,Curveball,1.0
10,Slider,5.0
15,ChangeUp,0.0
...,...,...
10179,Fastball,3.0
10183,Slider,5.0
10209,Splitter,6.0
10219,Sinker,4.0
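One caveat with the ordinal encoding above: `OrdinalEncoder` assigns integers in alphabetical order (ChangeUp is 0, Fastball is 3, and so on), which implies an ordering between pitch types that has no baseball meaning. Tree-based models can usually cope with such codes, but models that treat the number as a magnitude may be misled. As a possible alternative, here is a minimal sketch of one-hot encoding the same column with `pd.get_dummies`; the five-row table is a made-up sample, and the `Pitch` prefix is an arbitrary choice.

```python
import pandas as pd

# Small made-up sample of tagged pitch types.
pitch_types = pd.DataFrame(
    {"TaggedPitchType": ["Fastball", "Curveball", "Slider", "ChangeUp", "Fastball"]}
)

# One indicator column per pitch type; each row has exactly one 1.
one_hot = pd.get_dummies(pitch_types["TaggedPitchType"], prefix="Pitch", dtype=int)
print(one_hot)
```

The trade-off is width: one column per pitch type instead of a single ordinal column.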


## Modeling

### Decision Trees (Cristian)

In [48]:
pd.DataFrame(data.isnull().sum())
data = data.drop_duplicates()

Unnamed: 0,0
PitchNo,0
Inning,0
inning_half,0
PAofInning,0
PitchofPA,0
PitcherThrows,0
PitcherTeam,0
BatterSide,0
BatterTeam,0
Balls,0


Now let's deal with null values. Here we drop the two rows that are missing pitch-tracking data and replace the remaining nulls with the column average (for numerical columns) or the mode (for non-numerical columns).

In [49]:
from sklearn.preprocessing import OneHotEncoder
# Candidates for removal: over 800 values are null for the position columns
# and over 500 are null for HitSpinRate. The drop is currently disabled, so
# these columns are kept and mean-imputed below.
position_columns = ['position_110x', 'position_110y', 'position_110z', 'HitSpinRate']
# data = data.drop(columns=position_columns)

# Columns with 2 null values
columns_with_2_nulls = [
    'RelSpeed', 'SpinRate', 'SpinAxis', 'Tilt', 'InducedVertBreak', 'VertBreak', 
    'HorzBreak', 'VertApprAngle', 'HorzApprAngle', 'vert_rel_angle', 'horz_rel_angle', 
    'RelHeight', 'RelSide', 'PlateLocHeight', 'PlateLocSide', 'zone_time', 
    'EffectiveVelo', 'SpeedDrop', 'pfxx', 'pfxz', 'x0', 'y0', 'z0', 'vx0', 'vy0', 'vz0', 
    'ax0', 'ay0', 'az0'
]

# Identify rows with null values in these columns
rows_with_nulls = data[columns_with_2_nulls].isnull().any(axis=1)

# Index of the rows that have a null in any of these columns
common_null_rows = data[rows_with_nulls].index

# If exactly two rows are affected (i.e. the same two rows are null across
# all of these columns), remove them
if len(common_null_rows) == 2:
    data = data.drop(index=common_null_rows)

# Separate numerical and non-numerical columns
numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
non_numerical_columns = data.select_dtypes(exclude=['float64', 'int64']).columns

# Fill missing values for numerical columns with the mean
data[numerical_columns] = data[numerical_columns].fillna(data[numerical_columns].mean())

# Fill missing values for non-numerical columns with the mode
for column in non_numerical_columns:
    data[column] = data[column].fillna(data[column].mode()[0])

# Verify the changes
print("Shape:", data.shape)
pd.DataFrame(data.isnull().sum()).transpose()

# One-hot encode non-numerical columns
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity
encoded_features = encoder.fit_transform(data[non_numerical_columns])
encoded_feature_names = encoder.get_feature_names_out(non_numerical_columns)

# Create a DataFrame with the encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names, index=data.index)

# Concatenate the original numerical columns with the encoded features
data = pd.concat([data[numerical_columns], encoded_df], axis=1)

# Verify the changes
print("Shape:", data.shape)
data.head()

Shape: (1568, 54)


Unnamed: 0,PitchNo,Inning,inning_half,PAofInning,PitchofPA,PitcherThrows,PitcherTeam,BatterSide,BatterTeam,Balls,Strikes,Outs,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,ExitSpeed,Angle,HitSpinRate,hit_spin_axis,hit_contact_x,hit_contact_y,hit_contact_z,position_110x,position_110y,position_110z,pfxx,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0,IsHomeRun,Sidematchup
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Shape: (1568, 54)


Unnamed: 0,PitchNo,Inning,PAofInning,PitchofPA,PitcherThrows,BatterSide,Balls,Strikes,Outs,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,ExitSpeed,Angle,HitSpinRate,hit_spin_axis,hit_contact_x,hit_contact_y,hit_contact_z,position_110x,position_110y,position_110z,pfxx,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0,inning_half_Top,PitcherTeam_1,BatterTeam_1,IsHomeRun_1,Sidematchup_1
0,357,9,3,3,1,0,1,1,2,88.59,2157.57,209.72,3600.0,19.77,-16.72,10.6,-6.41,0.8,-3.27,-1.08,6.34,0.08,5.33,-0.06,1.88,0.44,86.69,7.6,83.61,20.69,2629.59,216.8,2.04,2.06,0.06,99.04,32.26,47.86,-5.68,11.23,0.01,50.0,6.04,1.96,-128.39,-7.87,-9.46,26.93,-13.46,1.0,1.0,0.0,0.0,0.0
1,158,5,1,2,1,0,0,1,0,92.0,2445.0,207.92,3600.0,16.32,-16.59,8.12,-5.28,-1.82,-2.15,-3.28,6.3,2.69,5.9,0.32,2.92,0.41,91.28,7.47,95.86,21.54,2182.83,155.6,0.94,2.89,-0.29,109.13,41.28,-13.82,-5.04,9.25,-2.43,50.0,6.12,7.3,-133.46,-5.53,-9.15,27.18,-15.4,1.0,1.0,0.0,0.0,0.0
6,239,6,4,3,1,0,1,1,0,78.24,2150.31,41.34,27000.0,-10.8,-57.25,-10.83,-8.83,-2.98,1.61,-1.04,6.38,1.57,5.6,-0.3,3.11,0.49,76.84,6.64,81.49,22.02,2165.22,108.7,0.41,2.94,0.39,105.51,30.51,-31.09,5.19,-5.56,-1.48,50.0,6.48,2.47,-113.6,1.26,6.72,22.53,-39.37,0.0,0.0,1.0,0.0,0.0
10,325,8,1,3,1,1,0,2,0,82.99,2213.94,30.76,25200.0,-6.45,-48.14,-4.62,-8.72,-2.67,-0.01,-1.85,5.89,1.95,5.22,-0.17,1.87,0.47,81.11,6.71,89.045465,13.626112,2958.731528,172.812121,1.739239,2.375073,0.011091,103.062161,46.502784,-1.597064,1.8,-2.94,-1.78,50.0,5.86,4.08,-120.45,-1.81,2.64,24.11,-36.48,0.0,0.0,1.0,0.0,1.0
15,101,3,1,1,0,1,0,0,0,80.59,1682.48,88.4,32400.0,1.1,-43.03,-11.69,-7.73,0.96,0.11,3.04,5.22,-2.31,5.32,-0.43,1.74,0.48,78.83,6.81,107.28,18.88,1566.02,186.29,1.93,1.75,0.51,108.05,35.29,-20.6,6.78,0.88,2.04,50.0,5.19,-5.73,-116.85,-1.27,9.35,22.67,-30.97,0.0,0.0,1.0,0.0,0.0
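The mean imputation above is computed over the full table. Once the data is split for modeling, the test rows would then have contributed to the imputed values, which is a mild form of leakage. One common way to avoid this is to learn the column means from the training split only. The sketch below shows the idea with scikit-learn's `SimpleImputer` on randomly generated data; the array shape, the 10% missing rate, and the 25% test split are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic feature matrix with roughly 10% of the values knocked out.
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # means learned from train only
X_test_filled = imputer.transform(X_test)        # the same means reused on test

print("Column means used for imputation:", imputer.statistics_)
```

Placing the imputer inside a scikit-learn `Pipeline` would apply the same logic automatically during cross-validation.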


In [56]:
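# Drop PitchNo (a running pitch counter) and y0 (50.0 in every row shown above),
# then inspect summary statistics for the remaining columns
data = data.drop(columns=['PitchNo', 'y0'], errors='ignore')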
pd.set_option('display.max_columns', None)
data.describe()

# # Determine which columns to round and to what precision
# # For example, let's round columns with large values to the nearest integer
# columns_to_round = ['ExitSpeed', 'Angle', 'HitSpinRate', 'hit_spin_axis', 'hit_contact_x', 'hit_contact_y', 'hit_contact_z']

# # Apply rounding
# data[columns_to_round] = data[columns_to_round].round(0)

# # Verify the changes
# print(data[columns_to_round].head())

Unnamed: 0,Inning,PAofInning,PitchofPA,PitcherThrows,BatterSide,Balls,Strikes,Outs,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,ExitSpeed,Angle,HitSpinRate,hit_spin_axis,hit_contact_x,hit_contact_y,hit_contact_z,position_110x,position_110y,position_110z,pfxx,pfxz,x0,z0,vx0,vy0,vz0,ax0,ay0,az0,inning_half_Top,PitcherTeam_1,BatterTeam_1,IsHomeRun_1,Sidematchup_1
count,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0,1568.0
mean,4.867347,2.940051,3.397321,0.723852,0.549745,1.122449,1.074617,0.985969,86.883316,2185.542895,173.15486,26683.163265,9.824388,-27.761269,0.894362,-6.698909,-0.665453,-1.560453,-0.826543,6.052443,0.684841,5.822329,-0.012398,2.288036,0.440568,85.928418,7.197838,89.045465,13.626112,2958.731528,172.812121,1.739239,2.375073,0.011091,103.062161,46.502784,-1.597064,-0.697066,5.819688,-0.614688,5.909777,1.874751,-126.057239,-4.464018,-1.565159,25.67243,-22.252003,0.494898,0.494898,0.505102,0.049107,0.508291
std,2.537329,1.758263,1.845581,0.447234,0.497678,1.016924,0.813993,0.81455,5.076175,294.868696,63.792099,15920.526048,7.899056,11.703876,10.480841,1.325098,1.927765,1.302731,2.539394,0.493116,1.902486,0.515201,0.568616,0.593034,0.027989,5.276104,0.887488,14.306491,25.856362,1037.492148,61.265306,0.693355,0.554292,0.525694,4.683493,20.5212,25.650599,5.888201,4.328868,1.716035,0.439543,5.347128,7.320798,2.658827,9.564141,3.529866,7.383628,0.500133,0.500133,0.500133,0.216161,0.500091
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,71.69,898.48,3.46,3600.0,-17.82,-66.48,-25.04,-11.12,-5.36,-5.28,-6.51,4.35,-5.16,4.52,-2.04,0.41,0.38,70.48,4.56,21.49,-84.34,679.37,2.33,-0.6,0.47,-1.62,77.49,3.63,-76.75,-12.77,-9.56,-4.14,4.31,-14.99,-141.45,-12.96,-22.76,17.09,-44.68,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,83.2775,2053.645,127.2825,7200.0,4.205,-36.7525,-7.925,-7.6425,-2.1,-2.4225,-2.6225,5.75,-1.11,5.45,-0.4,1.89,0.42,82.08,6.58,83.745,1.8425,2381.675,151.21,1.29,2.04,-0.32,103.062161,43.75,-1.597064,-5.6725,2.74,-1.76,5.62,-2.645,-131.74,-6.1325,-9.7225,23.05,-28.09,0.0,0.0,0.0,0.0,0.0
50%,5.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,87.725,2208.89,191.43,32400.0,11.175,-24.385,2.325,-6.53,-0.785,-1.635,-1.74,6.02,1.47,5.8,-0.01,2.3,0.43,86.79,7.22,90.44,13.626112,2958.731528,172.812121,1.739239,2.375073,0.011091,103.062161,46.502784,-1.597064,-1.43,6.56,-1.33,5.89,3.69,-127.275,-4.485,-2.335,25.635,-21.155,0.0,0.0,1.0,0.0,1.0
75%,7.0,4.0,5.0,1.0,1.0,2.0,2.0,2.0,90.75,2360.275,220.26,39600.0,16.3325,-18.4175,9.6525,-5.69,0.5625,-0.7575,1.2825,6.4,1.95,6.1625,0.38,2.7,0.46,90.1725,7.83,99.17,27.9425,3443.405,195.4175,2.14,2.71,0.32,104.24,46.502784,-1.597064,4.34,9.4325,0.9625,6.24,5.67,-120.89,-2.73,6.505,28.19,-16.2075,1.0,1.0,1.0,0.0,1.0
max,11.0,12.0,12.0,1.0,1.0,3.0,2.0,2.0,97.39,3511.79,335.74,45900.0,25.55,-9.76,22.5,-3.41,6.12,3.16,7.55,7.23,4.52,7.83,1.79,4.18,0.54,97.84,10.4,116.78,87.76,7324.22,357.96,4.31,4.4,1.98,110.0,135.06,78.07,12.96,13.91,4.65,7.07,12.84,-104.0,4.05,21.07,37.22,-8.5,1.0,1.0,1.0,1.0,1.0
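The notebook does not yet show a fitted model, so the following is only a sketch of how a decision tree could be trained on data with a similar class balance. It is not the authors' model. It runs on a synthetic dataset from `make_classification` with about 5% positives, and the hyperparameters (`max_depth=4`, a 25% test split) are arbitrary assumptions. Two choices address the imbalance: `stratify=y` keeps the home run rate similar in both splits, and `class_weight="balanced"` makes errors on the rare class count more heavily.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned table: about 5% positive labels.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# Stratify so that train and test keep a similar positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
tree = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Recall on the positive class:", recall_score(y_test, y_pred))
```

On the real data, the label column (named `IsHomeRun_1` after the one-hot encoding step above) would need to be separated from the features before fitting.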


### Neural Nets ()

### SVM

## Outcome