<font color=purple > _**Emily Nordhoff - 2022**_ </font>

# Deriving variables

This script derives new variables for a project about hits in MLB. The data was gathered from BaseballSavant.mlb.com and covers every ball batted into play during the 2021 season.

### Contents

    1. Importing data and libraries
    2. Merge with ballpark data
    3. Derive new variables
        3.1 Home or away batter
        3.2 Launch_speed_angle categories
        3.3 Runners on base
        3.4 Scoring plays
    4. Wrap up and Export

## 1. Importing data and libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
path = r'/Users/Emily/Documents/CF Data Analysis Program/Immersion 6/Hits Analysis/'

In [3]:
bip = pd.read_csv(os.path.join(path,'02 data','prepared data','bip_clean.csv'))

In [4]:
parks = pd.read_pickle(os.path.join(path,'02 data','prepared data','mlb_ballparks_clean.pkl'))

In [5]:
bip.head()

Unnamed: 0,pitch_type,game_date,release_speed,player_name,batter,pitcher,events,zone,stand,p_throws,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,FC,2021-04-30,82.7,"Altuve, Jose",514888,642232,field_out,4.0,R,L,...,0,5,5,0,5,0,5,0,Infield shift,Standard
1,FC,2021-04-30,82.4,"Maldonado, Martín",455117,642232,field_out,8.0,R,L,...,0,5,5,0,5,0,5,0,Standard,Strategic
2,CH,2021-04-30,83.8,"Kiermaier, Kevin",595281,621121,field_out,5.0,L,R,...,0,6,0,6,6,0,0,6,Infield shift,Standard
3,KC,2021-04-30,83.8,"Madrigal, Nick",663611,669456,field_out,8.0,R,R,...,3,4,3,4,4,3,3,4,Standard,Standard
4,FF,2021-04-30,89.4,"Reynolds, Bryan",668804,607231,field_out,6.0,L,R,...,1,3,1,3,3,1,1,3,Strategic,Standard


In [6]:
bip.shape

(121707, 50)

In [7]:
bip.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 121707 entries, 0 to 22863
Columns: 50 entries, pitch_type to of_fielding_alignment
dtypes: float64(20), int64(17), object(13)
memory usage: 47.4+ MB


In [8]:
parks.head()

Unnamed: 0,Name,Capacity,Location,Surface,Team,Opened,Distance to center field,Roof type,Team abbrev
0,Chase Field,48405,"Phoenix, Arizona",Artificial turf,Arizona Diamondbacks,1998,407,Retractable,ARI
1,Truist Park,41084,"Cumberland, Georgia",Grass,Atlanta Braves,2017,400,Open,ATL
2,Oriole Park at Camden Yards,45971,"Baltimore, Maryland",Grass,Baltimore Orioles,1992,410,Open,BAL
3,Fenway Park,37755,"Boston, Massachusetts",Grass,Boston Red Sox,1912,390,Open,BOS
4,Wrigley Field,41649,"Chicago, Illinois",Grass,Chicago Cubs,1914,400,Open,CHC


In [9]:
parks.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Columns: 9 entries, Name to Team abbrev
dtypes: int64(3), object(6)
memory usage: 2.2+ KB


## 2. Merge with ballpark data

In [10]:
parks = parks[['Name','Location','Surface','Team','Team abbrev']]

In [11]:
parks.head()

Unnamed: 0,Name,Location,Surface,Team,Team abbrev
0,Chase Field,"Phoenix, Arizona",Artificial turf,Arizona Diamondbacks,ARI
1,Truist Park,"Cumberland, Georgia",Grass,Atlanta Braves,ATL
2,Oriole Park at Camden Yards,"Baltimore, Maryland",Grass,Baltimore Orioles,BAL
3,Fenway Park,"Boston, Massachusetts",Grass,Boston Red Sox,BOS
4,Wrigley Field,"Chicago, Illinois",Grass,Chicago Cubs,CHC


In [12]:
df = bip.merge(parks, how='left', left_on='home_team', right_on='Team abbrev')

In [13]:
df.drop(columns='Team abbrev', inplace=True)
df.rename(columns={'Team':'home_team_name', 'Surface':'park_surface',
                   'Location':'park_location', 'Name':'park_name'}, inplace=True)
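
A left merge like the one above silently keeps rows whose `home_team` has no match in `parks`, and it would duplicate rows if an abbreviation appeared twice. The sketch below shows one way to check for both, using the `validate` and `indicator` options of `merge`. The toy frames and the `'XXX'` abbreviation are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy stand-ins for `bip` and `parks`; the rows are invented for illustration.
toy_bip = pd.DataFrame({'home_team': ['BOS', 'CHC', 'BOS', 'XXX'],
                        'events': ['single', 'field_out', 'double', 'single']})
toy_parks = pd.DataFrame({'Name': ['Fenway Park', 'Wrigley Field'],
                          'Team abbrev': ['BOS', 'CHC']})

# validate='many_to_one' raises MergeError if an abbreviation repeats in toy_parks.
# indicator=True adds a `_merge` column that records whether each row found a match.
checked = toy_bip.merge(toy_parks, how='left', left_on='home_team',
                        right_on='Team abbrev', validate='many_to_one',
                        indicator=True)

# Rows marked 'left_only' had no matching ballpark.
unmatched = checked[checked['_merge'] == 'left_only']
print(len(checked), len(unmatched))
```

If `unmatched` is empty and the row count equals that of the left frame, the merge neither dropped nor duplicated any batted balls.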

## 3. Derive new variables

### 3.1 Home or away batter

There isn't a variable that indicates whether the batter is from the home or away team. We can deduce this from the half of the inning they're hitting in: the away team bats in the top of each inning and the home team bats in the bottom.

In [14]:
df['batter_team'] = np.where(df['inning_topbot'] == 'Top', df['away_team'], df['home_team'])

In [15]:
df['batter_home_away'] = np.where(df['inning_topbot'] == 'Top', 'away', 'home')
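
As a small illustration of the two cells above, the sketch below applies the same `np.where` logic to a toy frame. It assumes the only other value of `inning_topbot` is `'Bot'`, which this notebook does not show, so it also checks for unexpected values (anything that is not `'Top'` would otherwise be labelled as a home batter).

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; 'Bot' is an assumed value for the bottom half.
toy = pd.DataFrame({'inning_topbot': ['Top', 'Bot', 'Top'],
                    'home_team': ['BOS', 'BOS', 'CHC'],
                    'away_team': ['BAL', 'BAL', 'ATL']})

# Away team bats in the top of the inning, home team in the bottom.
toy['batter_team'] = np.where(toy['inning_topbot'] == 'Top',
                              toy['away_team'], toy['home_team'])
toy['batter_home_away'] = np.where(toy['inning_topbot'] == 'Top', 'away', 'home')

# Any value outside the expected pair would be misclassified as 'home'.
unexpected = set(toy['inning_topbot'].unique()) - {'Top', 'Bot'}
print(toy[['batter_team', 'batter_home_away']])
print(unexpected)
```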

### 3.2 Launch_speed_angle categories

The `launch_speed_angle` column stores contact categories as numeric codes (1 to 6) that I spend too much time looking up, so a new `contact` column holds the category names instead. The numeric column can be removed afterward; the cells below leave it in place.

In [16]:
choices = ['Weak', 'Topped', 'Under', 'Flare/Burner', 'Solid Contact', 'Barrel']
col = 'launch_speed_angle'
conditions = [df[col] == 1, df[col] == 2, df[col] == 3, df[col] == 4, df[col] == 5, df[col] == 6]

In [17]:
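# Rows with no launch_speed_angle match none of the conditions; give them an
# explicit label rather than relying on np.select's numeric default of 0
df['contact'] = np.select(conditions, choices, default='Unknown')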

In [18]:
df[['launch_speed_angle','contact']].value_counts()

launch_speed_angle  contact      
2.0                 Topped           38465
3.0                 Under            30574
4.0                 Flare/Burner     29218
6.0                 Barrel            9638
5.0                 Solid Contact     7441
1.0                 Weak              5985
dtype: int64
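
The category counts above sum to 121,321, fewer than the 121,707 rows in the data, which suggests some rows have no `launch_speed_angle` value. As an alternative sketch (not the method used in this notebook), a dictionary passed to `Series.map` performs the same recoding and leaves missing codes as `NaN`. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Code-to-name lookup; keys are floats because the column is stored as float.
contact_names = {1.0: 'Weak', 2.0: 'Topped', 3.0: 'Under',
                 4.0: 'Flare/Burner', 5.0: 'Solid Contact', 6.0: 'Barrel'}

# Toy column with one missing code.
toy = pd.DataFrame({'launch_speed_angle': [2.0, 6.0, np.nan, 4.0]})

# Codes not found in the dictionary (including NaN) become NaN.
toy['contact'] = toy['launch_speed_angle'].map(contact_names)

# dropna=False makes the missing rows visible in the tally.
print(toy['contact'].value_counts(dropna=False))
```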

### 3.3 Runners on base

We don't need to know _which_ players are on base, but it is useful to know whether there is a runner, and on which base(s). Each new `runner_*` column is 1 when the base is occupied and missing (NaN) otherwise. Afterward, drop the original columns.

In [19]:
# 1 if a runner is on first; otherwise the NaN from on_1b is carried over
df['runner_1b'] = np.where(df['on_1b'].notnull(), 1, df['on_1b'])

In [20]:
# Same logic for second base
df['runner_2b'] = np.where(df['on_2b'].notnull(), 1, df['on_2b'])

In [21]:
# Same logic for third base
df['runner_3b'] = np.where(df['on_3b'].notnull(), 1, df['on_3b'])

In [22]:
df.drop(columns=['on_1b', 'on_2b', 'on_3b'], inplace=True)
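
If 0/1 flags are preferred over 1/NaN, for example to count runners or to use the columns in a model, one possible variation is sketched below. It is not what the cells above do. The `runners_on` column is a hypothetical addition, and the toy values are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: a player ID means the base is occupied, NaN means it is empty.
toy = pd.DataFrame({'on_1b': [514888.0, np.nan, 455117.0],
                    'on_2b': [np.nan, np.nan, 595281.0]})

# notna() gives True/False; astype(int) turns that into 1/0.
for base in ['1b', '2b']:
    toy[f'runner_{base}'] = toy[f'on_{base}'].notna().astype(int)

# Hypothetical extra column: total number of runners on the listed bases.
toy['runners_on'] = toy[['runner_1b', 'runner_2b']].sum(axis=1)
print(toy[['runner_1b', 'runner_2b', 'runners_on']])
```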

### 3.4 Scoring plays

Now that we've established whether the batting team is home or away, there's no need for the home/away scores before and after the play. It is still helpful to flag a score change, because an increase in the batting team's score means the hit resulted in one or more runs.

In [23]:
df.drop(columns=['home_score', 'away_score', 'post_home_score', 'post_away_score', 'post_fld_score'], inplace=True)

In [24]:
df.loc[df['post_bat_score'] > df['bat_score'], 'scoring_play'] = True
df.loc[df['post_bat_score'] <= df['bat_score'], 'scoring_play'] = False

In [25]:
df['scoring_play'].value_counts(dropna=False)

False    105728
True      15979
Name: scoring_play, dtype: int64
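
The two `.loc` assignments above can also be written as a single comparison, which yields a boolean column directly. The sketch below shows that form on toy scores. The `runs_scored` column is a hypothetical extra that is not created in this notebook.

```python
import pandas as pd

# Toy scores for the batting team before and after each play.
toy = pd.DataFrame({'bat_score': [0, 3, 1],
                    'post_bat_score': [0, 5, 1]})

# True when the batting team's score went up on the play.
toy['scoring_play'] = toy['post_bat_score'] > toy['bat_score']

# Hypothetical extra: how many runs the play produced.
toy['runs_scored'] = toy['post_bat_score'] - toy['bat_score']
print(toy)
```

A boolean column also takes less memory than the object column produced by assigning `True`/`False` through `.loc`.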

## 4. Wrap up and Export

In [26]:
df.shape

(121707, 53)

In [27]:
df.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 121707 entries, 0 to 121706
Columns: 53 entries, pitch_type to scoring_play
dtypes: float64(20), int64(12), object(21)
memory usage: 50.1+ MB


In [28]:
df.to_csv(os.path.join(path, '02 data', 'prepared data', 'bip_ballparks_merged.csv'))
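
By default `to_csv` writes the row index as an extra, unnamed first column, which reappears as `Unnamed: 0` when the file is read back. The sketch below demonstrates the difference with `index=False`, using an in-memory buffer in place of a real file. Whether to keep the index in the export above is a judgment call, so this is only an illustration.

```python
import io
import pandas as pd

# Toy frame standing in for the merged data.
toy = pd.DataFrame({'events': ['single', 'field_out'],
                    'scoring_play': [True, False]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```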