For this assessment, I was given every pitch from the 2011 MLB season along with a description of each column in the data. The goal of the assessment is to build a model that predicts the next pitch thrown.

I approached this model by breaking the data down by pitcher and predicting for each pitcher separately. Each pitcher has a unique set of pitches and throws them at different rates. For example, Tim Wakefield, predominantly a knuckleball pitcher who pitched for the Red Sox in 2011, has a completely different pitch profile from Mariano Rivera, a closer for the New York Yankees in 2011 who relied on his famous cutter.
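
One way to see these differences is to compute each pitcher's pitch mix. The sketch below is only an illustration on a hand-made toy table: the rows and counts are invented rather than taken from the 2011 data, and only the `pitcher_id` and `pitch_type` column names come from the dataset.

```python
import pandas as pd

# Toy data: invented rows, only the column names match the dataset
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "pitch_type": ["KN", "KN", "KN", "FF", "FC", "FC", "FC", "FC"],
})

# Share of each pitch type within each pitcher's pitches
pitch_mix = (
    toy.groupby("pitcher_id")["pitch_type"]
       .value_counts(normalize=True)
       .unstack(fill_value=0.0)
)
print(pitch_mix)
```

Each row of `pitch_mix` sums to 1, so two pitchers can be compared directly even if they threw very different numbers of pitches.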

Once I decided to break the data down by pitcher, I needed to find other important categorical and numerical data to add to the model. After studying the data, I decided to include the following columns:
- **pitcher_id** - identify and group by pitcher
- **game_pk** - identifies the game and can be grouped with the pitcher_id to find previous pitches within the game.
- **inning** - identifies the inning the pitcher is throwing in, in case certain pitches are thrown more often earlier in games.
- **top** - whether it is the top or bottom of the inning, which shows if the pitcher is on the home or away team.
- **pcount_at_bat** - identifies the number of pitches within the at bat. The belief is a pitcher may throw different pitches early within an at bat.
- **p_throws/stand** - identifies the hand the pitcher throws with and the side the batter hits from. The thought process is that certain pitches may be thrown more often when the batter and pitcher are on the same side or on opposite sides.
- **balls/strikes/fouls/outs** - the count and number of outs during the at bat. The scenario of the at bat can determine the type of pitches thrown.
- **type** - identifies whether the pitch was a strike, a ball or in play. Used mainly to identify the previous pitch's outcome, to see if a pitcher favors a certain pitch after a strike/ball/etc.
- **pitch_type** - identifies the type of pitch the pitcher threw, important for understanding the types of pitches a pitcher can throw and the pitches thrown in certain scenarios.
- **on_1b/on_2b/on_3b** - whether a player is on 1st, 2nd or 3rd base. Helps in identifying the scenario the pitcher pitches in and the pitches thrown with runners on or off base.
- **home_team_runs/away_team_runs** - identifies the score of the game, to see if a pitcher pitches differently in different score scenarios.
- **zone** - zone where the pitch crossed home plate based on the statcast data.
- **start_speed** - speed at which the ball left the pitcher's hand based on the statcast data. Pitchers may alter speed based on the previous pitch.


In [4]:
## import relevant libraries

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import numpy as np
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'xgboost'

In [None]:
## Read the file into a dataframe to manipulate data
all_pitches = pd.read_csv('pitches', low_memory = False)

In [None]:
## Describe the statistics of certain data columns to make sure there are no extreme outliers, such as start speed above 105, etc.

all_pitches[['pcount_pitcher','start_speed','end_speed','sz_top','sz_bot','pfx_x','pfx_z','px','pz','x0','z0','y0',
            'vx0','vz0','vy0','ax','az','ay','break_length','break_y','break_angle','spin_rate']].describe()

## Clean Data

Once the data has been imported, I wanted to create a function to clean the data and remove unwanted columns. Additionally, I wanted to add previous-pitch data to get a better understanding of pitchers' patterns.
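
The previous-pitch features can be sketched with a grouped shift. This is a minimal illustration on invented rows, and it assumes the rows are already in chronological order within each game; `game_pitcher_id` is the combined game and pitcher key built in the cleaning function.

```python
import pandas as pd

# Toy data: one pitcher (id 10) across two games, rows in pitch order
toy = pd.DataFrame({
    "game_pitcher_id": ["g1_10", "g1_10", "g1_10", "g2_10", "g2_10"],
    "pitch_type": ["FF", "SL", "FF", "CU", "FF"],
})

# shift(1) within each group moves every value down one row;
# the first pitch of each game/pitcher group has no predecessor and becomes NaN
toy["prev_pitch_type"] = toy.groupby("game_pitcher_id")["pitch_type"].shift(1)
print(toy)
```

Because the shift is applied per group, the last pitch of one game never leaks into the first pitch of the next game.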

In [None]:
def clean_data(df):
    
    ## Keep only relevant columns
    keep_columns = ['game_pk','inning','top','pcount_at_bat','pcount_pitcher','pitcher_id','p_throws','batter_id','stand',
           'balls','strikes','fouls','outs','type','pitch_type','on_1b','on_2b','on_3b',
           'home_team_runs','away_team_runs','zone','start_speed']

    ## Select from the df argument rather than the global all_pitches, and copy so the caller's dataframe is not modified
    pitches = df[keep_columns].copy()

    ## remove rows where the pitch_type is unknown
    pitches = pitches[~pd.isna(pitches['pitch_type'])]
    
    ## create a unique identifier for the pitcher and the game so previous pitches can be found with the groupby.
    pitches['game_pitcher_id'] = pitches['game_pk'].astype(str)+"_"+pitches['pitcher_id'].astype(str)
    pitches.drop(['game_pk'],axis = 1, inplace = True)

    ## Find the score differential from the pitcher's point of view, as it may affect pitch selection
    pitches['score_differential'] = -np.power(-1, pitches['top']) * (pitches['home_team_runs'] - pitches['away_team_runs'])
    pitches.drop(['home_team_runs','away_team_runs'],axis = 1, inplace = True)

    ## Convert the players on base to True/False, as we don't care which specific player is on the base.
    ## In a future version of the model, knowing the player on base may influence pitches,
    ## i.e. with a known base stealer on first, the pitcher may throw only fastballs.
    
    pitches['on_1b'] = pitches['on_1b'].notna()
    pitches['on_2b'] = pitches['on_2b'].notna()
    pitches['on_3b'] = pitches['on_3b'].notna()
    
    ## Create a True/False of if the batter and pitcher both hit and pitch from the same side to see if this influences pitch selection
    pitches['same_side_pitch_bat'] = pitches['p_throws'] == pitches['stand']
    pitches.drop(['p_throws','stand'],axis = 1, inplace = True)
    
    ## Find the previous pitch outcome to see if this influences future pitch selection
    ## (groupby(...).shift keeps the original row index, so the result aligns with pitches)
    pitches['prev_pitch_outcome'] = pitches.groupby('game_pitcher_id')['type'].shift(1)
    pitches.drop(['type'],axis = 1, inplace = True)
    
    ## find the previous zone the pitcher threw the ball to see if it influences next pitch
    pitches['prev_zone'] = pitches.groupby('game_pitcher_id')['zone'].shift(1)
    pitches.drop(['zone'],axis = 1, inplace = True)
    
    ## find the speed of the previous pitch to see if it influences the next pitch selection
    pitches['prev_start_speed'] = pitches.groupby('game_pitcher_id')['start_speed'].shift(1)
    pitches.drop(['start_speed'],axis = 1, inplace = True)
    
    ## Find previous pitch type to see if past pitch types in certain scenarios affect pitch selection
    pitches['prev_pitch_type'] = pitches.groupby('game_pitcher_id')['pitch_type'].shift(1)
    
    ## Change the pitch outcome to a number as opposed to category variable
    def pitch_outcome_numeric(pitch):
        if pitch == 'B':
            return 0
        elif pitch == 'S':
            return 1
        else:
            return 2
    pitches['prev_pitch_outcome'] = pitches['prev_pitch_outcome'].apply(pitch_outcome_numeric)

    ## Remove certain unknown pitches or automatic balls, pitch-outs, and where the previous pitch is unknown
    remove_pitches = ['PO','UN','AB','AS','IN']
    pitches = pitches[~pitches['pitch_type'].isin(remove_pitches)]
    pitches = pitches[~pitches['prev_pitch_type'].isin(remove_pitches)]
    pitches = pitches[~pd.isna(pitches['prev_pitch_type'])]
    
    ## Group all fastballs in 1 group due to them being hard to distinguish within statcast
    fastball_pitches = ['FA','FS','FT','FF']

    def mapping_fastballs(pitch):
        if pitch in fastball_pitches:
            return 'FB'
        else:
            return pitch

    pitches['pitch_type'] = pitches['pitch_type'].apply(mapping_fastballs)
    pitches['prev_pitch_type'] = pitches['prev_pitch_type'].apply(mapping_fastballs)
    
    ##reorder and add only relevant columns
    pitches = pitches[['pitcher_id','batter_id','pitch_type','inning','top','pcount_at_bat','pcount_pitcher','balls','strikes','fouls',
                       'outs','on_1b','on_2b','on_3b','same_side_pitch_bat','score_differential','prev_pitch_type','prev_pitch_outcome',
                      'prev_zone','prev_start_speed']]
    return pitches

In [None]:
## clean the pitches
cleaned_pitches = clean_data(all_pitches)
cleaned_pitches.head()
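
The sign convention behind the `score_differential` column in `clean_data` can be checked with a small stand-alone sketch. It assumes, as the cleaning code does, that `top` equal to 1 means the top of the inning (so the home team is pitching); the function below is a hypothetical helper written for illustration, not part of the notebook.

```python
def score_differential(top, home_team_runs, away_team_runs):
    """Run lead of the pitching team (positive means the pitcher's team is ahead)."""
    # Assumption: top == 1 means top of the inning, so the home team is pitching
    sign = 1 if top else -1
    return sign * (home_team_runs - away_team_runs)

# Home team leads 5-2
print(score_differential(1, 5, 2))  # top of the inning: home pitcher is ahead by 3
print(score_differential(0, 5, 2))  # bottom of the inning: away pitcher is behind by 3
```

This gives the same values as the vectorised expression `-np.power(-1, top) * (home_team_runs - away_team_runs)` used in the cleaning function.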

## Create Model
Once the data is cleaned and in an acceptable format, I want to train a model that takes all of the pitch data for a specific pitcher and, based on the other variables, gives the best guess for the next pitch.

Based on the amount of data and the decisions that needed to be made, I decided to use an XGBoost model. With this many interacting variables, a linear model is unlikely to fit well. XGBoost works well for regression, classification and ranking problems. The model builds a sequence of decision trees, each one correcting the errors of the trees before it, and predicts the category or label from the features. In this example, it uses the features from the pitching data to produce an expected pitch type.
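
Whatever classifier is used, its accuracy only means something relative to a naive baseline: always guessing the pitcher's most frequent pitch. The sketch below shows that baseline on an invented sample; `mode_baseline_accuracy` is a hypothetical helper name, and the training function reports a similar "mode accuracy" for each pitcher.

```python
from collections import Counter

def mode_baseline_accuracy(pitch_types):
    """Return the most frequent pitch type and the accuracy (in percent) of always guessing it."""
    counts = Counter(pitch_types)
    most_common_pitch, n = counts.most_common(1)[0]
    return most_common_pitch, round(n / len(pitch_types) * 100, 1)

# Invented sample: 6 fastballs, 3 sliders, 1 changeup
sample = ["FB"] * 6 + ["SL"] * 3 + ["CH"]
print(mode_baseline_accuracy(sample))
```

A model that cannot beat this number for a given pitcher adds nothing over knowing that pitcher's favorite pitch.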

In [2]:
def train_guess_pitch_model(data, cutoff = 1500):
    
    ## Find the number of pitchers who meet the cut off
    pitcher_count_df = data.groupby('pitcher_id')['pcount_pitcher'].count().reset_index().rename(columns = {'pcount_pitcher':'num_of_pitches'})
    pitcher_count_df = pitcher_count_df.query(f'num_of_pitches > {cutoff}')

    ## return pitchers into a list
    pitcher_list = pitcher_count_df['pitcher_id'].to_list()

    ## Loop through each pitcher within the pitcher list
    for i, pitcher in enumerate(pitcher_list):
        ## Find the dataframe of just the pitcher's pitches and drop the pitcher_id column
        ## (drop returns a new dataframe, which avoids modifying a slice of data in place)
        this_pitcher_df = data.query(f'pitcher_id == {pitcher}').drop('pitcher_id', axis = 1)

        ## return a sorted list of the pitch types for the specific pitcher within the loop, so the label mapping is reproducible
        all_pitch_types = sorted(set(this_pitcher_df['prev_pitch_type']) | set(this_pitcher_df['pitch_type']))
        all_pitch_types_count = Counter(this_pitcher_df['prev_pitch_type'])

        ## map pitch type to a number and number to pitch type to be used later
        pitch_map = {pitch: idx for idx, pitch in enumerate(all_pitch_types)}
        pitch_unmap = {v: k for k, v in pitch_map.items()}

        ## Turn pitch type from a label into a number
        this_pitcher_df['pitch_type'] = this_pitcher_df['pitch_type'].apply(lambda x:pitch_map[x])
        this_pitcher_df['prev_pitch_type'] = this_pitcher_df['prev_pitch_type'].apply(lambda x:pitch_map[x])

        ## Create X values which include all variables except for pitch type and y value which is just pitch_type
        X = this_pitcher_df.drop('pitch_type', axis = 1)
        y = this_pitcher_df['pitch_type']

        ## Split the data into test and train data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2500)


        ## Create the XGBoost parameter grid: max tree depth of 2, 5 or 17 and learning rate of .01, .1 or .2
        xgb_params = {"max_depth": (2, 5, 17),
                  "learning_rate": (0.01, 0.1, 0.2)}
    
        ## Run a grid search on the XGBoost model to find the best combination of max depth and learning rate
        ## num_class uses every label in pitch_map (current and previous pitch types), not only the previous pitch types
        xgb_opt = GridSearchCV(XGBClassifier(objective='multi:softprob', num_class=len(all_pitch_types)), 
                           param_grid=xgb_params, cv=5, scoring='accuracy', verbose=0, n_jobs=-1)
        
        ## fit the model and make the predictions for the pitch_type
        xgb_opt.fit(X_train, y_train)
        y_pred = xgb_opt.predict(X_test)
        y_prob = xgb_opt.predict_proba(X_test)
    
        ## return the accuracy for the model and store for print out
        accuracy = round(accuracy_score(y_test, y_pred) * 100, 1)

        # get and store the mode accuracy, which just guesses the most frequently occurring pitch
        mode_accuracy = round(max(all_pitch_types_count.values()) / sum(all_pitch_types_count.values()) * 100., 1)

        # print for every 5th pitcher
        if i % 5 == 0:
            print()
            print(f"Pitcher ID: {pitcher}")
            print(f"Pitcher's pitch map: {pitch_map}")
            print(f"Pitcher's pitch counter: {dict(all_pitch_types_count)}")
            print(f"Number of data points in training: {X_train.shape[0]}")
            print(f"Number of data points in testing: {X_test.shape[0]}")
            print(f"Best params: {xgb_opt.best_params_}")
            print(f"Mode accuracy: {mode_accuracy}")
            print(f"XGBoost accuracy: {accuracy}")
    

In [3]:
train_guess_pitch_model(cleaned_pitches, cutoff=2000)

NameError: name 'cleaned_pitches' is not defined

## Next Steps

I have two main areas for next steps within this project. First, I would like to create a function where the pitcher id is the only input needed and it returns the expected pitch. Other variables such as score, runners on base, pitch number, etc. could be added to give an even more specific prediction. Depending on the use case, I would create a Dash app where users could select this data from dropdown menus or enter it themselves, or even gather real-time data from an MLB API to display the odds of the next pitch.

Second, I would like to add more nuance to the model. For example, I think the batter can influence the types of pitches thrown, although probably less often than people would think: only a few high-end hitters, like Bryce Harper, probably see a different pitch selection than most major league players. Additionally, as I mentioned, the type of player on base may influence the pitches thrown. These are all minor tweaks that will hopefully make the model more efficient and accurate.