# Pitch Predictor - Deep Learning Project

Owners: Jason Vasquez & Dylan Skinner

Last Modified: 9/28/2023

CS 674 Advanced Deep Learning Project 1

Dr Wingate, 2023 Fall Semester, BYU

## Introduction

The goal of this project is to use a deep neural network to predict the next pitch in an MLB game from information the batter would know before the pitch is thrown (who is pitching, the number of pitches, the number of outs, etc.).

Data for this project was downloaded from Kaggle, courtesy of Paul Schale (https://www.kaggle.com/datasets/pschale/mlb-pitch-data-20152018). The dataset contains all pitch data from the 2015-2019 MLB seasons, over 3 million pitches in total.

We begin by setting up two baselines against which to compare the accuracy of our neural network. The first baseline is a random draw from the distribution of pitches for the specific pitcher. The second baseline is a shallow machine learning model built with CatBoost.
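
As a rough illustration of the first baseline, the sketch below samples predictions from a single pitcher's pitch distribution. The function name `random_baseline_predict` and the numbers in `example_dist` are made up for illustration and are not taken from the dataset. If the true pitches were drawn independently from the same distribution, the expected accuracy of this baseline would be the sum of the squared probabilities.

```python
import numpy as np

def random_baseline_predict(distribution, n, rng):
    """Draw n pitch predictions from a {pitch_type: probability} mapping."""
    pitch_types = list(distribution.keys())
    probs = np.array([distribution[p] for p in pitch_types], dtype=float)
    probs = probs / probs.sum()  # guard against small rounding errors
    return rng.choice(pitch_types, size=n, p=probs)

# Hypothetical distribution for one pitcher (illustrative values only).
example_dist = {"FF": 0.6, "SL": 0.3, "CH": 0.1}

rng = np.random.default_rng(0)
preds = random_baseline_predict(example_dist, 1000, rng)

# Expected accuracy when the truth follows the same distribution: sum of p_i^2.
expected_accuracy = sum(p ** 2 for p in example_dist.values())
```

Any model worth keeping should beat this number for the pitcher in question.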

In [2]:
#necessary imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import json
import os

## Data Setup

##### Helper Functions

In [7]:
def read_data(file_path):

    # Build each path from file_path instead of changing the working directory.
    def load_csv(name):
        return pd.read_csv(os.path.join(file_path, name))

    # Read in data
    pitches = load_csv('pitches.csv')
    names = load_csv('player_names.csv')
    at_bat = load_csv('atbats.csv')
    games = load_csv('games.csv')

    # Read in more recent data.
    pitches_2019 = load_csv('2019_pitches.csv')
    atbats_2019 = load_csv('2019_atbats.csv')
    games_2019 = load_csv('2019_games.csv')

    # Drop columns as needed so the dataframes can be stacked.
    games.drop(columns=['delay'], inplace=True)

    # Stack dataframes together.
    print(f'pitches.shape {pitches.shape}, at_bat.shape {at_bat.shape}, games.shape {games.shape}')
    pitches = pd.concat([pitches, pitches_2019], ignore_index=True, axis=0)
    at_bat = pd.concat([at_bat, atbats_2019], ignore_index=True, axis=0)
    games = pd.concat([games, games_2019], ignore_index=True, axis=0)
    print(f'pitches.shape {pitches.shape}, at_bat.shape {at_bat.shape}, games.shape {games.shape}')

    # Create column in names with full name.
    names['full_name'] = names['first_name'] + ' ' + names['last_name']

    return pitches, at_bat, games, names

In [8]:
def merge(pitches, at_bat, names):
    # Merge pitches and at_bats together.
    pitches_merge = pitches.merge(at_bat, on='ab_id', validate='m:1')

    # Merge pitches_merge and names together
    pitches_merge = pitches_merge.merge(names, left_on='pitcher_id', right_on='id')
    pitches_merge.drop(['last_name', 'first_name'], axis=1, inplace=True)
    pitches_merge.set_index('pitcher_id', inplace=True)

    return pitches_merge

In [10]:
def save_data(pitches_merge, file_path):
    # Several columns contain the string 'placeholder' in many rows. Convert those to 0.
    pitches_merge['spin_rate'] = pitches_merge.spin_rate.replace('placeholder', 0)
    pitches_merge['spin_dir'] = pitches_merge.spin_dir.replace('placeholder', 0)
    pitches_merge['type_confidence'] = pitches_merge.type_confidence.replace('placeholder', 0)
    pitches_merge['zone'] = pitches_merge.zone.replace('placeholder', 0)
    pitches_merge['x'] = pd.to_numeric(pitches_merge['x'], errors='coerce')
    pitches_merge['y'] = pd.to_numeric(pitches_merge['y'], errors='coerce')
    pitches_merge['code'] = pitches_merge.code.astype(str)
    pitches_merge['type'] = pitches_merge.type.astype(str)
    pitches_merge['top'] = pitches_merge.top.replace({1.0: True, 0.0: False})

    # Save as Parquet. os.path.join works whether or not file_path ends with a slash.
    pitches_merge.to_parquet(os.path.join(file_path, 'pitches_data.parquet'))
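
The placeholder handling in `save_data` can be checked on a small toy column. The sketch below is an alternative, not the code used above: it coerces non-numeric entries to NaN with `pd.to_numeric(..., errors='coerce')`, as `save_data` already does for `x` and `y`, and then fills them with 0. The helper name `clean_placeholder` is hypothetical. One caveat is that genuinely missing values would also become 0.

```python
import pandas as pd

def clean_placeholder(series):
    """Turn a mixed column into floats, mapping non-numeric entries to 0."""
    # errors='coerce' converts anything unparseable (e.g. 'placeholder') to NaN.
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.fillna(0)

# Toy column mimicking a CSV field read in as strings.
raw = pd.Series(["1.5", "placeholder", "2.0"], dtype=object)
cleaned = clean_placeholder(raw)
```

The result has a numeric dtype, which avoids writing a column of mixed strings and numbers.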

##### Read in the Data from downloaded csv files

In [None]:
#change file path to wherever csv data is located
pitches, at_bats, games, names = read_data('/Users/jasonvasquez/Desktop/cs_674/PitchPredictor/csv_data')

##### Merge the pitch data with the at bat data to produce one dataset

In [9]:
pitches_merge = merge(pitches, at_bats, names)
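
To illustrate what `validate='m:1'` in `merge` checks, here is a minimal sketch on toy data. The frames `pitches_toy`, `at_bat_toy`, and `bad_at_bat` are invented for the example. With `m:1`, pandas raises `pd.errors.MergeError` if a key is duplicated on the right-hand side, so a repeated `ab_id` in the at-bat table cannot silently duplicate pitch rows.

```python
import pandas as pd

# Toy frames: many pitches per at-bat, one row per at-bat.
pitches_toy = pd.DataFrame({"ab_id": [1, 1, 2], "pitch_type": ["FF", "SL", "CU"]})
at_bat_toy = pd.DataFrame({"ab_id": [1, 2], "pitcher_id": [10, 20]})

# Valid many-to-one merge: every pitch row picks up its at-bat's columns.
merged = pitches_toy.merge(at_bat_toy, on="ab_id", validate="m:1")

# Duplicate an at-bat row so the right-hand keys are no longer unique.
bad_at_bat = pd.concat([at_bat_toy, at_bat_toy.iloc[[0]]], ignore_index=True)
try:
    pitches_toy.merge(bad_at_bat, on="ab_id", validate="m:1")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Note that `merge` defaults to an inner join, so pitches whose `ab_id` has no match in the at-bat table are dropped.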

##### Save the data as a parquet file for easy reloading

In [11]:
#change filepath for where you want to save data file
save_data(pitches_merge, file_path='/Users/jasonvasquez/Desktop/cs_674/PitchPredictor/')

## Pitch Distributions

Create a JSON file that contains each pitcher in the dataset and their distribution of pitches. This will be useful for our random baseline and will also be used as a feature in our deep learning model.
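
One possible way to compute such distributions is sketched below on toy data. The function name `pitch_distributions` is hypothetical, and the column names `full_name` and `pitch_type` are assumptions about the merged dataframe (`full_name` is created in `read_data`; `pitch_type` is assumed to be the pitch label column). This is a sketch of the idea, not necessarily how the notebook implements it.

```python
import json
import pandas as pd

def pitch_distributions(df, pitcher_col="full_name", pitch_col="pitch_type"):
    """Return {pitcher: {pitch_type: relative frequency}}."""
    # normalize=True gives proportions that sum to 1 within each pitcher.
    freqs = df.groupby(pitcher_col)[pitch_col].value_counts(normalize=True)
    dist = {}
    for (pitcher, pitch), p in freqs.items():
        dist.setdefault(pitcher, {})[pitch] = float(p)
    return dist

# Toy data standing in for the merged pitch dataframe.
toy = pd.DataFrame({
    "full_name": ["A Pitcher"] * 4 + ["B Pitcher"] * 2,
    "pitch_type": ["FF", "FF", "FF", "SL", "CU", "CU"],
})

dist = pitch_distributions(toy)
dist_json = json.dumps(dist, indent=2)  # string that could be written to a .json file
```

Keying by name assumes names are unique; keying by `pitcher_id` instead would avoid collisions between pitchers who share a name.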