# Major League Baseball Home Run Exit Velocities

One of the most debated recent developments in MLB is why more home runs were hit in the 2017 season than in any other season in the league's history. The 2017 total surpassed even the record set during the steroid era, and many people want to know how this can be. One area for investigation is the most important item in the game: the baseball. During the 2017 season, numerous major league pitchers complained that the ball felt different, and the season ended as a record-breaking year for home runs.

Thanks to Baseball Savant and the statistics it tracks on every pitch, we can start to analyze whether there is any connection between the baseballs being used and the increase in home runs. Specifically, I plan to draw a conclusion about whether the changes seen in the baseballs across the 2015, 2016 and 2017 MLB seasons have influenced a batter's home run exit velocity.

Using machine learning techniques, I will build multiple regression models to predict a specific batter's home run exit velocity on a particular pitch. After running these models, I will decide which one should be used as the production-level model and examine the influence of each feature. This model will provide insight into whether a batter's exit velocity is influenced by the baseballs being used, and into what else influences exit velocity.

In [155]:
import pandas as pd
import numpy as np
from scipy import stats 

from bs4 import BeautifulSoup
import requests
import json
import time

## Data Cleaning

### Import the Data

The following notebook will load in two datasets for review and cleaning:
1. Each home run hit during the 2015, 2016 and 2017 MLB seasons
2. Stats from a sample of the actual baseballs taken from the 2015, 2016 and 2017 MLB seasons

I will also scrape Baseball Savant for the personal physical stats of each player who hit a home run in the 2015, 2016 and 2017 seasons.
- url : https://baseballsavant.mlb.com/

#### Home Runs

To start, I will review the largest of my three datasets: the statistical data on each home run hit during the 2015, 2016 and 2017 seasons.

In [207]:
hr_df = pd.read_csv('../data/home_runs_15_16_17.csv')
baseballs_df = pd.read_excel('../data/baseballs.xlsx')

In [208]:
hr_df.columns = [x.lower().replace(' ', '_') for x in hr_df.columns]

In [209]:
hr_df.head()

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,CH,6/10/17,84.5,-3.402,6.4696,Aaron Judge,592450,501957,home_run,hit_into_play_score,...,0,0,0,0,0,0,0,0,Infield shift,Standard
1,FF,4/28/17,97.1,-2.8091,5.9279,Aaron Judge,592450,592332,home_run,hit_into_play_score,...,2,9,2,9,9,2,2,9,Standard,Standard
2,CU,6/23/15,86.7,-1.5647,5.3406,Giancarlo Stanton,519317,593372,home_run,hit_into_play_score,...,0,0,0,0,0,0,0,0,Standard,Standard
3,SL,9/28/17,89.5,2.0682,6.1177,Giancarlo Stanton,519317,571521,home_run,hit_into_play_score,...,5,1,5,1,1,5,5,1,Infield shift,Standard
4,SL,6/11/17,84.7,-1.9795,5.686,Aaron Judge,592450,548337,home_run,hit_into_play_score,...,7,3,7,3,3,7,7,3,Infield shift,Standard


In [210]:
hr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16626 entries, 0 to 16625
Data columns (total 89 columns):
pitch_type                         16602 non-null object
game_date                          16626 non-null object
release_speed                      16611 non-null float64
release_pos_x                      16502 non-null float64
release_pos_z                      16502 non-null float64
player_name                        16626 non-null object
batter                             16626 non-null int64
pitcher                            16626 non-null int64
events                             16626 non-null object
description                        16626 non-null object
spin_dir                           0 non-null float64
spin_rate_deprecated               0 non-null float64
break_angle_deprecated             0 non-null float64
break_length_deprecated            0 non-null float64
zone                               16611 non-null float64
des                                16626 non-n

#### Remove unnecessary features (noise)

After reviewing all the available features in this dataset, I will drop the features listed below. An in-depth description of each column can be found at Baseball Savant: https://baseballsavant.mlb.com/

- Dropped because of lack of data:
    - `spin_dir`
    - `spin_rate_deprecated`
    - `break_angle_deprecated`
    - `break_length_deprecated` 
    - `hit_location`
    - `tfs_deprecated`
    - `tfs_zulu_deprecated`
    - `iso_value`
    - `launch_speed_angle`: the values of 1-6 are unbalanced and I cannot find their exact meanings. I already have separate features for both launch angle and launch speed.
- Dropped because of no relation to the actual home run stats (launch speed, launch angle, etc.)
    - `on_3b`
    - `on_2b`
    - `on_1b`
    - `umpire`
    - `events`
    - `description`
    - `des`
    - `game_type`
    - `stand`
    - `type`
    - `balls`
    - `strikes`
    - `outs_when_up`
    - `inning`
    - `inning_topbot`
    - `fielder_2`
    - `sv_id`
    - `pitcher.1`
    - `fielder_2.1`
    - `fielder_3`
    - `fielder_4`
    - `fielder_5`
    - `fielder_6`
    - `fielder_7` 
    - `fielder_8`
    - `fielder_9`
    - `estimated_ba_using_speedangle`
    - `estimated_woba_using_speedangle`
    - `woba_value`
    - `woba_denom` 
    - `babip_value`
    - `at_bat_number`
    - `pitch_number`
    - `home_score` 
    - `away_score` 
    - `bat_score`
    - `fld_score`
    - `post_away_score`
    - `post_home_score`
    - `post_bat_score`
    - `post_fld_score`
    - `if_fielding_alignment`
    - `of_fielding_alignment`
    - `game_pk`
    - `pitch_name` (duplicate to pitch type)
    - `home_team`
    - `away_team`
    - `hc_x`: coordinate of where the ball is fielded
    - `hc_y`: coordinate of where the ball is fielded
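
The "lack of data" group above can also be found programmatically rather than by reading the `info()` output. This is a minimal sketch on toy data, not the notebook's actual selection step; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hr_df (illustrative values only)
toy = pd.DataFrame({
    'launch_speed': [109.7, 107.3, 106.9],
    'spin_dir': [np.nan, np.nan, np.nan],   # entirely missing
    'hit_location': [np.nan, np.nan, 8.0],  # mostly missing
})

# Share of missing values per column (0.0 = complete, 1.0 = empty)
null_share = toy.isnull().mean()

# Columns with no data at all are safe candidates to drop
empty_cols = null_share[null_share == 1.0].index.tolist()
print(null_share)
print(empty_cols)
```

Columns that are only partly missing still need a judgment call, which is why the list above was assembled by hand.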

In [211]:
col = [ 
    'umpire',
    'spin_dir',
    'spin_rate_deprecated',
    'break_angle_deprecated',
    'break_length_deprecated',
    'events',
    'description', 
    'des',
    'game_type', 
    'stand',
    'type', 
    'hit_location', 
    'balls', 
    'strikes', 
    'on_3b', 
    'on_2b', 
    'on_1b', 
    'outs_when_up',
    'inning',
    'inning_topbot',
    'tfs_deprecated', 
    'tfs_zulu_deprecated', 
    'fielder_2', 
    'sv_id', 
    'pitcher.1', 
    'fielder_2.1', 
    'fielder_3', 
    'fielder_4', 
    'fielder_5', 
    'fielder_6', 
    'fielder_7', 
    'fielder_8', 
    'fielder_9', 
    'estimated_ba_using_speedangle', 
    'estimated_woba_using_speedangle', 
    'woba_value', 
    'woba_denom', 
    'babip_value', 
    'at_bat_number', 
    'pitch_number', 
    'home_score', 
    'away_score', 
    'bat_score', 
    'fld_score', 
    'post_away_score', 
    'post_home_score', 
    'post_bat_score', 
    'post_fld_score', 
    'if_fielding_alignment', 
    'of_fielding_alignment',
    'iso_value',
    'game_pk',
    'pitch_name',
    'away_team',
    'home_team',
    'hc_x',
    'hc_y',
    'launch_speed_angle'
]

In [212]:
hr_df.drop(col, axis=1, inplace=True)

#### Removing nulls

I believe the type of pitch a player hit for a home run is an important aspect of their home run stats, so I will drop the rows where the pitch type is null.

In [213]:
col=['pitch_type']
hr_df.dropna(subset=col, inplace=True)

#### Updating datatypes

Ensure each feature is stored with the correct data type.

In [214]:
hr_df['game_date'] = pd.to_datetime(hr_df['game_date'])
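
The raw `game_date` strings look like `6/10/17`, so passing an explicit format is one way to make the parsing unambiguous. A minimal sketch, assuming every date follows the month/day/two-digit-year pattern seen in the preview; `raw_dates` is a toy series, not the full column.

```python
import pandas as pd

# Two date strings in the same style as the game_date preview
raw_dates = pd.Series(['6/10/17', '6/23/15'])

# %m/%d/%y = month/day/two-digit year; mismatched strings raise an error
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y')
print(parsed)
```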

In [215]:
hr_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16602 entries, 0 to 16625
Data columns (total 31 columns):
pitch_type           16602 non-null object
game_date            16602 non-null datetime64[ns]
release_speed        16600 non-null float64
release_pos_x        16491 non-null float64
release_pos_z        16491 non-null float64
player_name          16602 non-null object
batter               16602 non-null int64
pitcher              16602 non-null int64
zone                 16600 non-null float64
p_throws             16602 non-null object
bb_type              16602 non-null object
game_year            16602 non-null int64
pfx_x                16600 non-null float64
pfx_z                16600 non-null float64
plate_x              16600 non-null float64
plate_z              16600 non-null float64
vx0                  16600 non-null float64
vy0                  16600 non-null float64
vz0                  16600 non-null float64
ax                   16600 non-null float64
ay            

In [216]:
hr_df['pitch_type'].isnull().sum()

0

#### Filling in Nulls (Pitch Stats)

Several of the pitch-level stats have missing values. Using a groupby and a helper function, I can fill these values in with the averages of the specific pitcher who threw the pitch.

Steps to fill in these values:
1. Create a dataframe of averages for each pitch statistic, grouped by pitcher and the pitch types they've thrown
2. Create a function that looks up the pitcher by their ID and pitch type in that dataframe and returns their average stat, which is used to fill the missing value in the specific pitch record
3. Apply the function to the pitching features where nulls were identified
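
The same three steps can also be sketched in vectorized form. This is an alternative to the row-wise `apply` used in this notebook, not the method the notebook runs: it swaps in `groupby(...).transform('mean')` plus `fillna`. The `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hr_df with two missing release speeds
toy = pd.DataFrame({
    'pitcher': [1, 1, 1, 2, 2],
    'pitch_type': ['FF', 'FF', 'FF', 'CH', 'CH'],
    'release_speed': [94.0, np.nan, 92.0, 84.0, np.nan],
})

# Per-row average of the pitcher/pitch_type group (NaNs are skipped by mean)
group_mean = toy.groupby(['pitcher', 'pitch_type'])['release_speed'].transform('mean')

# Only the missing entries are replaced; observed values are left untouched
toy['release_speed'] = toy['release_speed'].fillna(group_mean)
print(toy)
```

Because `transform` returns a series aligned to the original rows, no lookup function is needed, which tends to be much faster on a frame of this size.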

In [217]:
hr_df.isnull().sum()[['release_speed', 
                      'release_pos_x', 
                      'release_pos_y',
                      'release_pos_z',
                      'release_spin_rate',
                      'release_extension',
                      'pfx_x',
                      'pfx_z',
                      'plate_x',
                      'vx0',
                      'vy0',
                      'vz0',
                      'ax',
                      'ay',
                      'az',
                      'effective_speed']]

release_speed           2
release_pos_x         111
release_pos_y         111
release_pos_z         111
release_spin_rate    1011
release_extension     305
pfx_x                   2
pfx_z                   2
plate_x                 2
vx0                     2
vy0                     2
vz0                     2
ax                      2
ay                      2
az                      2
effective_speed       330
dtype: int64

In [218]:
hr_df[(hr_df.pitch_type == 'FF') & (hr_df.pitcher == 517414)]

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,zone,p_throws,...,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,release_pos_y
1066,FF,2015-05-09,94.3,-0.7633,6.2443,Justin Upton,457708,517414,8.0,R,...,-14.015,3.5,1.57,402.0,109.7,19.077,94.305,2109.0,6.064,54.4355
2828,FF,2016-07-01,93.4,-0.7002,6.4708,Jarrett Parker,592620,517414,13.0,R,...,-12.238,3.54,1.59,424.0,107.3,32.845,94.11,2114.0,6.337,54.1639
3253,FF,2015-04-15,94.7,-1.9226,6.6718,Justin Upton,457708,517414,2.0,R,...,-10.254,3.49,1.58,395.0,106.9,22.2869,,,,54.2295
3960,FF,2016-05-31,92.5,-2.7624,6.2747,Luis Valbuena,472528,517414,4.0,R,...,-10.515,3.36,1.59,422.0,106.3,27.292,93.546,2360.0,6.383,54.1182
5812,FF,2016-08-02,92.4,-1.2452,6.2537,Jayson Werth,150029,517414,4.0,R,...,-15.011,3.69,1.66,368.0,104.9,19.406,92.23,2275.0,6.004,54.4976
9267,FF,2017-04-04,,,,Brandon Crawford,543063,517414,,R,...,,3.842,1.589,,102.8,30.199,,,,
12014,FF,2017-06-09,94.3,-1.4991,6.1734,Hernan Perez,541650,517414,4.0,R,...,-16.2633,3.3213,1.3925,403.0,100.9,35.227,94.207,2270.0,5.966,54.5331


In [219]:
pitch_avg = hr_df.groupby(['pitcher', 'pitch_type'])[['release_speed', 
                                                      'release_pos_x', 
                                                      'release_pos_y',
                                                      'release_pos_z',
                                                      'release_spin_rate',
                                                      'release_extension',
                                                      'pfx_x',
                                                      'pfx_z',
                                                      'plate_x',
                                                      'vx0',
                                                      'vy0',
                                                      'vz0',
                                                      'ax',
                                                      'ay',
                                                      'az',
                                                      'effective_speed']].mean()

In [220]:
pitch_avg.reset_index(inplace=True)

In [221]:
pitch_avg.head()

Unnamed: 0,pitcher,pitch_type,release_speed,release_pos_x,release_pos_y,release_pos_z,release_spin_rate,release_extension,pfx_x,pfx_z,plate_x,vx0,vy0,vz0,ax,ay,az,effective_speed
0,112526,CH,81.54,-1.39152,55.25098,5.75408,1637.6,5.2492,-1.142933,0.8357,-0.22384,4.45196,-118.45622,-2.98594,-10.6673,22.15458,-25.1512,79.8946
1,112526,FF,91.388235,-0.988865,55.251547,5.992818,2276.5625,5.2575,-0.717337,1.424175,-0.022853,3.560453,-132.811671,-4.633582,-8.148265,29.473624,-16.233365,89.426625
2,112526,FT,87.585106,-1.425957,55.216726,5.713583,2109.446809,5.283426,-1.339216,0.963503,-0.173119,5.304594,-127.276523,-3.549502,-14.556315,26.805947,-22.843936,85.687319
3,112526,SL,81.7,-1.5452,55.437425,5.83525,2309.0,5.06125,0.077197,0.502796,0.172225,3.291075,-118.803012,-1.377837,1.219612,21.471113,-28.5423,79.894375
4,115629,CH,84.3,-1.8307,54.3197,6.3655,1669.0,6.148,-1.051283,1.3653,-0.379,4.877,-122.597,-4.7,-10.074,24.338,-19.731,83.946


In [222]:
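def pitcher_avg(pitch, col):
    """Return this pitcher's average of `col` for the pitch type in the given row."""
    match = pitch_avg[(pitch_avg['pitcher']==pitch.pitcher) & 
                      (pitch_avg['pitch_type']==pitch.pitch_type)][col]
    # .iloc[0] extracts the single matching value (float() on a Series is deprecated)
    return float(match.iloc[0])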

In [223]:
np.isnan(hr_df.loc[9267, 'release_speed'])

True

In [224]:
float(pitcher_avg(hr_df.loc[9267, :], 'release_speed'))

93.59999999999998

In [225]:
def pitch_stats_apply(columns):
    for col in columns:
        hr_df[col] = hr_df.apply(lambda x: pitcher_avg(x, col) if pd.isnull(x[col]) else x[col],1)
    return hr_df

In [226]:
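# Only the averaged stat columns need filling; skip the groupby keys
columns = [c for c in pitch_avg.columns if c not in ('pitcher', 'pitch_type')]
hr_df = pitch_stats_apply(columns)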

In [235]:
# Fill a remaining missing zone by row position (avoids chained assignment)
hr_df.iloc[9248, hr_df.columns.get_loc('zone')] = 4

In [228]:
hr_df['zone'].iloc[9248]

4.0

In [229]:
# Fill a remaining missing zone by row position (avoids chained assignment)
hr_df.iloc[9338, hr_df.columns.get_loc('zone')] = 4

In [230]:
hr_df[(hr_df['pitcher'] == 573186) & (hr_df['pitch_type'] == 'CH')]

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,zone,p_throws,...,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,release_pos_y
7306,CH,2016-06-19,85.5,-1.0559,5.5616,Matt Wieters,446308,573186,4.0,R,...,-24.71,3.59,1.77,,104.0,24.0,82.673,1474.5,5.6465,54.9406
9357,CH,2017-06-22,83.725,-1.016825,5.5545,Robinson Chirinos,455139,573186,4.0,R,...,-23.95425,3.411,1.565,395.0,102.8,30.199,82.673,1474.5,5.6465,54.950825
10697,CH,2016-05-28,83.9,-0.8911,5.3561,Travis Shaw,543768,573186,4.0,R,...,-23.31,3.67,1.71,374.0,101.9,35.761,83.308,1384.0,5.807,54.693
14992,CH,2016-04-19,83.3,-0.9497,5.7024,Matt Wieters,446308,573186,5.0,R,...,-21.347,3.7,1.76,353.0,97.6,27.999,82.038,1565.0,5.486,55.0126
15679,CH,2015-09-30,82.2,-1.1706,5.5979,Steve Pearce,456665,573186,1.0,R,...,-26.45,3.49,1.53,365.0,96.1,34.612,82.673,1474.5,5.6465,55.1571


In [231]:
hr_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16602 entries, 0 to 16625
Data columns (total 31 columns):
pitch_type           16602 non-null object
game_date            16602 non-null datetime64[ns]
release_speed        16602 non-null float64
release_pos_x        16596 non-null float64
release_pos_z        16596 non-null float64
player_name          16602 non-null object
batter               16602 non-null int64
pitcher              16602 non-null int64
zone                 16602 non-null float64
p_throws             16602 non-null object
bb_type              16602 non-null object
game_year            16602 non-null int64
pfx_x                16602 non-null float64
pfx_z                16602 non-null float64
plate_x              16602 non-null float64
plate_z              16600 non-null float64
vx0                  16602 non-null float64
vy0                  16602 non-null float64
vz0                  16602 non-null float64
ax                   16602 non-null float64
ay            

#### Filling in Nulls (Home Run Hit Stats)

With as many nulls as possible now filled in for the missing pitch stats, I will now perform the same steps for the missing home run hit stats.

Steps to fill in these values:
1. Create a dataframe of averages for each batting statistic with missing values, grouped by batter, pitch type and batted ball type
2. Create a function that looks up the batter by their ID, pitch type and batted ball type in that dataframe and returns their average stat, which is used to fill the missing value in the specific home run record
3. Apply the function to the batter features where nulls were identified

In [232]:
batter_stats = hr_df.groupby(['pitch_type', 'batter', 'bb_type'])[['hit_distance_sc']].mean()

In [233]:
batter_stats.reset_index(inplace=True)

In [234]:
batter_stats.head()

Unnamed: 0,pitch_type,batter,bb_type,hit_distance_sc
0,CH,116338,fly_ball,391.666667
1,CH,120074,fly_ball,379.0
2,CH,120074,line_drive,413.666667
3,CH,121347,fly_ball,385.5
4,CH,121347,line_drive,412.0


In [236]:
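def batter_avg(pitch, col):
    """Return this batter's average of `col` for the row's pitch type and batted ball type."""
    match = batter_stats[(batter_stats['batter']==pitch.batter) &
                         (batter_stats['pitch_type']==pitch.pitch_type) &
                         (batter_stats['bb_type']==pitch.bb_type)][col]
    # .iloc[0] extracts the single matching value (float() on a Series is deprecated)
    return float(match.iloc[0])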

In [237]:
def batter_stats_apply(columns):
    for col in columns:
        hr_df[col] = hr_df.apply(lambda x: batter_avg(x, col) 
                                               if pd.isnull(x[col])
                                               else x[col],1)

In [238]:
# Only the averaged stat column needs filling; skip the groupby keys
columns = ['hit_distance_sc']
batter_stats_apply(columns)

In [239]:
hr_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16602 entries, 0 to 16625
Data columns (total 31 columns):
pitch_type           16602 non-null object
game_date            16602 non-null datetime64[ns]
release_speed        16602 non-null float64
release_pos_x        16596 non-null float64
release_pos_z        16596 non-null float64
player_name          16602 non-null object
batter               16602 non-null int64
pitcher              16602 non-null int64
zone                 16602 non-null float64
p_throws             16602 non-null object
bb_type              16602 non-null object
game_year            16602 non-null int64
pfx_x                16602 non-null float64
pfx_z                16602 non-null float64
plate_x              16602 non-null float64
plate_z              16600 non-null float64
vx0                  16602 non-null float64
vy0                  16602 non-null float64
vz0                  16602 non-null float64
ax                   16602 non-null float64
ay            

#### Final missing value fill

I have now filled in as many nulls as possible within the dataset and will drop all rows that still contain null values.
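
Some nulls survive a group-average fill because a group with no observed values has no average to borrow. A minimal sketch on toy data of why a final `dropna` is still needed; the `toy` frame and batter IDs are made up.

```python
import numpy as np
import pandas as pd

# Batter 10 has one observed distance; batter 20 has none
toy = pd.DataFrame({
    'batter': [10, 10, 20],
    'hit_distance_sc': [400.0, np.nan, np.nan],
})

# The mean of an all-missing group is itself NaN
group_mean = toy.groupby('batter')['hit_distance_sc'].transform('mean')
toy['hit_distance_sc'] = toy['hit_distance_sc'].fillna(group_mean)

# Batter 20's row is still null after the fill, so dropna removes it
remaining = toy.dropna()
print(remaining)
```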

In [240]:
hr_df.dropna(inplace=True)

In [241]:
len(hr_df)

16388

### Save the Data

In [242]:
hr_df.to_csv('../data/final_clean_hr.csv')

### Baseball Data

Next, I will review and clean the sampled baseballs from the 2015, 2016 and 2017 MLB seasons.

In [61]:
baseballs_df.columns = [i.lower().replace(' ', '_') for i in baseballs_df.columns]

In [62]:
baseballs_df.head()

Unnamed: 0,ball_code,year,sn,weight_(oz),circumference_(in),avg_seam_height,std_of_seam_height,avg_ccor,avg_ds,#_of_good_shots_before_damage
0,MSCC0051,2014-05-15,731,5.135,9.13,0.03587,0.01049,0.491,12133,
1,MSCC0032,2014-07-15,228,5.149,9.09,0.04403,0.01781,0.489,12468,
2,MSCC0030,2015-04-15,196,5.143,9.06,0.03726,0.0064,0.489,12518,
3,MSCC0045,2015-04-15,351,5.109,9.09,0.04574,0.01216,0.496,13442,
4,MSCC0048,2015-04-15,499,5.192,9.19,0.04862,0.00828,0.474,12394,5.0


#### Dropping unnecessary features (noise)
- Dropping because of their irrelevance to home run statistics
    - `ball_code`
    - `sn`
- Dropping because of their lack of data
    - `#_of_good_shots_before_damage`

In [63]:
col = [
    'ball_code',
    '#_of_good_shots_before_damage',
    'sn'
]

baseballs_df.drop(col, axis=1, inplace=True)

In [64]:
baseballs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 7 columns):
year                  36 non-null datetime64[ns]
weight_(oz)           36 non-null float64
circumference_(in)    36 non-null float64
avg_seam_height       36 non-null float64
std_of_seam_height    36 non-null float64
avg_ccor              36 non-null float64
avg_ds                36 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 2.0 KB


#### Updating datatypes 

Updating features to their correct data types and setting the `year` column as this dataframe's datetime index for efficiency

In [65]:
baseballs_df['year'] = pd.to_datetime(baseballs_df['year'])

In [66]:
baseballs_df.set_index('year', inplace=True)

In [67]:
baseballs_df.head()

year,weight_(oz),circumference_(in),avg_seam_height,std_of_seam_height,avg_ccor,avg_ds
2014-05-15,5.135,9.13,0.03587,0.01049,0.491,12133
2014-07-15,5.149,9.09,0.04403,0.01781,0.489,12468
2015-04-15,5.143,9.06,0.03726,0.0064,0.489,12518
2015-04-15,5.109,9.09,0.04574,0.01216,0.496,13442
2015-04-15,5.192,9.19,0.04862,0.00828,0.474,12394


#### Creating a final average of the baseballs to use for each MLB season

- 2015
- 2016
- 2017
    - It was extremely difficult to find data for the baseballs from the 2017 season. Using a couple of online articles, I was able to construct some averages for 2017.
        - Sources: 
            - https://www.theringer.com/2017/6/14/16044264/2017-mlb-home-run-spike-juiced-ball-testing-reveal-155cd21108bc 
            - https://fivethirtyeight.com/features/juiced-baseballs/
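
The per-season averages can also be computed in one pass by grouping on the year of the datetime index. This is a sketch of an alternative to the per-year selection in this notebook, using a toy frame with made-up weights and a single stat column.

```python
import pandas as pd

# Toy stand-in for baseballs_df, indexed by sample date
toy = pd.DataFrame(
    {'weight_(oz)': [5.10, 5.14, 5.12]},
    index=pd.to_datetime(['2015-04-15', '2015-06-15', '2016-04-15']),
)
toy.index.name = 'year'

# Group on the calendar year of the index and average each stat
season_avg = toy.groupby(toy.index.year).mean()
print(season_avg)
```

Each row of `season_avg` is labeled by its integer year, so one season's averages can be read with `season_avg.loc[2015]`.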

In [68]:
# .loc is needed for partial-string year selection on a DatetimeIndex in current pandas
bb_15 = baseballs_df.loc['2015'][['weight_(oz)',
                      'circumference_(in)',
                      'avg_seam_height',
                      'std_of_seam_height',
                      'avg_ccor',
                      'avg_ds']].mean().to_frame().T
bb_15['year'] = 2015
bb_15

Unnamed: 0,weight_(oz),circumference_(in),avg_seam_height,std_of_seam_height,avg_ccor,avg_ds,year
0,5.120125,9.107917,0.040502,0.011194,0.491333,12740.875,2015


In [69]:
# .loc is needed for partial-string year selection on a DatetimeIndex in current pandas
bb_16 = baseballs_df.loc['2016'][['weight_(oz)',
                      'circumference_(in)',
                      'avg_seam_height',
                      'std_of_seam_height',
                      'avg_ccor',
                      'avg_ds']].mean().to_frame().T
bb_16['year'] = 2016
bb_16

Unnamed: 0,weight_(oz),circumference_(in),avg_seam_height,std_of_seam_height,avg_ccor,avg_ds,year
0,5.1225,9.079,0.038432,0.010468,0.4941,12926.4,2016


In [70]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two seasons instead
mlb_baseballs = pd.concat([bb_15, bb_16], ignore_index=True)
mlb_baseballs.drop('std_of_seam_height', axis=1, inplace=True)
mlb_baseballs

Unnamed: 0,weight_(oz),circumference_(in),avg_seam_height,avg_ccor,avg_ds,year
0,5.120125,9.107917,0.040502,0.491333,12740.875,2015
1,5.1225,9.079,0.038432,0.4941,12926.4,2016


#### The 2017 baseballs

- Weight
    - "The overall weight of the balls also dropped by an average of about a 0.5 grams between groups"
        - 5.12250 - 0.017637 (0.5 grams converted to ounces) = 5.104863
- Avg CCOR (cylindrical coefficient of restitution): the bounciness of the baseball, aka "The Pill"
    - "According to the Kent State researchers, these chemical changes produced a more porous, less dense layer of rubber"
        - 0.494100 - 0.017637 = 0.476463 (this reuses the 0.5 gram offset in ounces as a rough adjustment; it is an assumed value, not a measured CCOR)
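
The weight adjustment above can be checked with a few lines of arithmetic. A small sketch: the conversion constant is the standard number of grams in one avoirdupois ounce, and the variable names are illustrative.

```python
# Standard conversion factor: grams in one avoirdupois ounce
GRAMS_PER_OUNCE = 28.349523125

# 0.5 grams expressed in ounces
drop_oz = 0.5 / GRAMS_PER_OUNCE

# 2016 average weight minus the reported 0.5 gram drop
weight_2016_oz = 5.12250
weight_2017_oz = weight_2016_oz - drop_oz

print(round(drop_oz, 6))
print(round(weight_2017_oz, 6))
```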

In [71]:
bb_17 = pd.DataFrame(
    [[.476463,.038,9.08,5.104863,12926,2017]], columns=['avg_ccor', 
                                                   'avg_seam_height',
                                                   'circumference_(in)',
                                                   'weight_(oz)',
                                                   'avg_ds',
                                                   'year']
)
bb_17

Unnamed: 0,avg_ccor,avg_seam_height,circumference_(in),weight_(oz),avg_ds,year
0,0.476463,0.038,9.08,5.104863,12926,2017


In [72]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the 2017 row instead
mlb_baseballs = pd.concat([mlb_baseballs, bb_17], ignore_index=True, sort=False)

In [73]:
mlb_baseballs

Unnamed: 0,weight_(oz),circumference_(in),avg_seam_height,avg_ccor,avg_ds,year
0,5.120125,9.107917,0.040502,0.491333,12740.875,2015
1,5.1225,9.079,0.038432,0.4941,12926.4,2016
2,5.104863,9.08,0.038,0.476463,12926.0,2017


### Save the Data

In [74]:
mlb_baseballs.to_csv('../data/final_mlb_baseballs.csv')

### Player Personal Stats

Finally, I want to include each batter's personal stats to draw a conclusion about whether a player's size and weight contribute to their home run launch speeds and angles.

#### Let's ensure we receive the appropriate URL status code from Baseball Savant

In order to scrape a website, a successful status code must be received. Using a player at random, I will send a request to the website to ensure a successful connection.
- I want to receive a status code of 200

In [67]:
url = "https://baseballsavant.mlb.com/savant-player/j-d-martinez-502110?stats=career-r-hitting-mlb"
res =  requests.get(url)

In [68]:
if res.status_code == 200:
    print(f'Successful status code received : {res.status_code}')

Successful status code received : 200


#### Players to scrape

Let's ensure we know how many unique players we'll need to scrape.

There are 750 players who have hit home runs since 2015. I will scrape each of these players' personal stats:
- Height
- Weight
- Age

In [75]:
batter_id = hr_df['batter'].unique()

In [76]:
len(batter_id)

750

#### Running the scrape

Using Python's Beautiful Soup library, I will loop through each player's personal page on the Baseball Savant site and obtain their specific personal stats.

In [77]:
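def get_batters_info(batter_id):
    """Scrape each batter's Baseball Savant page and return their personal stats as a dict of lists."""
    attrs = {
        'batter_id': [str(x) for x in batter_id],
        'height':[],
        'weight':[],
        'age':[],
        'position':[],
        'bats':[],
        'throws':[]
    }

    attr_names = [
        'height',
        'weight',
        'age',
        'position',
        'bats',
        'throws'
    ]

    for idx, b in enumerate(batter_id):
        url = "https://baseballsavant.mlb.com/savant-player/{}?stats=career-r-hitting-mlb".format(b)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        # Pair each 'box-text' div, in page order, with the attribute it is expected to hold
        for col_name, attribute in zip(attr_names, soup.find_all('div', {'class':'box-text'})):
            attrs[col_name].append(str(attribute.text))
        # Checkpoint the results every 50 players in case the scrape is interrupted
        if idx % 50 == 0:
            with open('../data/baseball_player7.json', 'w') as f:
                json.dump(attrs, f)
        # Pause between requests to avoid overloading the site
        time.sleep(2)

    # Save the complete results once every player has been scraped
    with open('../data/baseball_player7.json', 'w') as f:
        json.dump(attrs, f)

    return attrs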
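
One fragile point in the loop above is that a page with fewer `box-text` elements than expected would leave the attribute lists with different lengths. Below is a hedged sketch of one way to guard against that, run on a made-up HTML snippet rather than a live page. The helper name `parse_player_boxes` and the sample markup are hypothetical, and the built-in `html.parser` is used in place of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

ATTR_NAMES = ['height', 'weight', 'age', 'position', 'bats', 'throws']

def parse_player_boxes(html):
    """Map the page's 'box-text' divs onto the expected attributes, padding gaps with None."""
    soup = BeautifulSoup(html, 'html.parser')
    values = [div.get_text(strip=True)
              for div in soup.find_all('div', {'class': 'box-text'})]
    # Pad so every player yields exactly one value per attribute
    values += [None] * (len(ATTR_NAMES) - len(values))
    return dict(zip(ATTR_NAMES, values))

# Hypothetical markup with only three of the six expected boxes
sample_html = """
<div class="box-text">6' 7"</div>
<div class="box-text">282</div>
<div class="box-text">26</div>
"""

record = parse_player_boxes(sample_html)
print(record)
```

Returning one record per player keeps every attribute aligned with its `batter_id`, at the cost of having to handle `None` values later.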

#### Webscrape results

The line of code below has been commented out so that it is not run again unless needed. I saved the results of the web scrape previously, so I will load that data in again for use.

In [78]:
# player_info_df = get_batters_info(batter_id)

#### Loading the saved scraped data

In [194]:
player_info_df = pd.read_csv('../data/batter_personal_stats.csv')
player_info_df.drop('Unnamed: 0', axis=1, inplace=True)

In [197]:
player_info_df.head()

Unnamed: 0,batter_id,height,weight,age,position,bats,throws
0,592450,"6' 7""",282,26,RF,R,R
1,519317,"6' 6""",245,28,LF,R,R
2,471865,"6' 1""",220,32,RF,L,L
3,443558,"6' 2""",230,38,DH,R,R
4,121347,"6' 3""",230,43,SS,R,R


In [198]:
# .copy() makes this an independent frame so later column assignments do not warn
final_player_info_df = player_info_df[['batter_id', 'height', 'weight', 'age']].copy()

In [201]:
f'Total number of unique players: {len(final_player_info_df.drop_duplicates())}'

'Total number of unique players: 752'

#### Convert all players' heights to inches

In [202]:
def height(height_str):
    """Convert a height string such as 6' 7" into total inches."""
    feet, inches = height_str.split("' ")
    feet = int(feet)
    inches = int(inches[:-1])  # drop the trailing inch mark
    return 12 * feet + inches

In [None]:
final_player_info_df['height'] = final_player_info_df['height'].map(height)
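
As a quick check of the conversion, the same result can be reproduced with a vectorized regular expression. This is a sketch on sample strings in the scraped format, offered as an alternative to the `map` call rather than a replacement for it; the group names `feet` and `inches` are illustrative.

```python
import pandas as pd

# Heights in the same feet/inches string format as the scraped data
heights = pd.Series(['6\' 7"', '6\' 1"', '5\' 11"'])

# Capture the feet and inches digits into two named columns
parts = heights.str.extract(r"(?P<feet>\d+)'\s*(?P<inches>\d+)").astype(int)

# Total height in inches
total_inches = 12 * parts['feet'] + parts['inches']
print(total_inches.tolist())
```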

### Save Data

In [205]:
player_info_df.to_csv('../data/final_scrape.csv')

In [206]:
final_player_info_df.to_csv('../data/final_clean_player_info.csv')