In [1]:
import pandas as pd
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf

In [22]:
#Read in the .csv stored in the Data folder; I got it from FanGraphs
dfStats = pd.read_csv("Data/OffensiveStats2014-2017.csv")

#This drops the columns I don't need for the regression (columns=, not rows; the positional axis argument no longer works in recent pandas)
dfStats = dfStats.drop(columns=['Name', 'Team', 'SB', 'Off', 'Def', 'playerid', 'R', 'PA'])

#This cleans the data. In BB% and K%, there was a % and a space, which prevents converting the column to a numeric value.
#I first replace the % and space with nothing to get rid of them (first four lines), then I use pandas "to_numeric" to
#change the values into floats (it raises an error if anything still isn't a number), then I divide by 100 to get a
#decimal fraction (0.15 instead of 15%)
dfStats['BB%'] = dfStats['BB%'].str.replace('%', '')
dfStats['K%'] = dfStats['K%'].str.replace('%', '')
dfStats['BB%'] = dfStats['BB%'].str.replace(' ', '')
dfStats['K%'] = dfStats['K%'].str.replace(' ', '')
dfStats[['BB%','K%']] = dfStats[['BB%','K%']].apply(pd.to_numeric)
dfStats['BB%'] = dfStats['BB%']/100
dfStats['K%'] = dfStats['K%']/100

#This renames some columns because the % or + symbol breaks the formula syntax used for the regression
dfStats = dfStats.rename(columns={'K%':'K_percentage', 'BB%':'BB_percentage', 'wRC+':'wRC_plus'})

#This gives the name, length and type of every column
dfStats.info()
#This prints out the first five rows, just to double check things
dfStats.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 14 columns):
G                289 non-null int64
HR               289 non-null int64
RBI              289 non-null int64
BB_percentage    289 non-null float64
K_percentage     289 non-null float64
ISO              289 non-null float64
BABIP            289 non-null float64
AVG              289 non-null float64
OBP              289 non-null float64
SLG              289 non-null float64
wOBA             289 non-null float64
wRC_plus         289 non-null int64
BsR              289 non-null float64
WAR              289 non-null float64
dtypes: float64(10), int64(4)
memory usage: 31.7 KB


Unnamed: 0,G,HR,RBI,BB_percentage,K_percentage,ISO,BABIP,AVG,OBP,SLG,wOBA,wRC_plus,BsR,WAR
0,589,139,373,0.15,0.221,0.278,0.348,0.301,0.413,0.579,0.417,171,23.1,32.9
1,584,140,398,0.128,0.189,0.254,0.296,0.277,0.375,0.531,0.386,147,5.1,28.0
2,626,70,302,0.067,0.099,0.162,0.351,0.334,0.384,0.496,0.376,143,5.3,23.9
3,581,112,394,0.152,0.219,0.236,0.363,0.304,0.413,0.54,0.4,148,15.2,21.7
4,457,94,274,0.123,0.239,0.24,0.346,0.288,0.388,0.527,0.389,143,19.2,21.6


So now we have the data in the form we want. All of the values are numeric (either int or float), and we've dropped the columns that don't help us. We won't change this dataframe anymore. If we want to drop columns later on (I don't know yet if we will), we'll create a new dataframe without them, so that this one always remains the "base" dataframe.
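
As a minimal sketch of how a reduced dataframe could be made later without touching the base one: `drop` returns a new DataFrame by default. The names `dfBase` and `dfReduced` are made up for illustration, and the toy frame only copies three columns from the first two rows shown above.

```python
import pandas as pd

# Toy stand-in for dfStats (three columns from the first two rows of the output above)
dfBase = pd.DataFrame({"G": [589, 584], "HR": [139, 140], "WAR": [32.9, 28.0]})

# drop() returns a new DataFrame; dfBase itself is left unchanged
dfReduced = dfBase.drop(columns=["HR"])

print(list(dfBase.columns))     # ['G', 'HR', 'WAR']
print(list(dfReduced.columns))  # ['G', 'WAR']
```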

$$
WAR = \frac{\text{Batting Runs} + \text{Base Running Runs} + \text{Fielding Runs} + \text{Positional Adjustment} + \text{League Adjustment} + \text{Replacement Runs}}{\text{Runs Per Win}}
$$

That's the calculation for WAR by FanGraphs, which is nice because it doesn't directly use any of the raw stats (PA, HR, R, RBI, BB%, K%, and such). I'm sure it uses these in some of the component calculations, but they aren't explicit inputs, so the response isn't an obvious function of the columns going into the linear regression. So let's run it with this entire dataframe and see what we get.

In [16]:
#This is a function I wrote for class that does greedy forward selection: it picks up to "maxk" columns that improve
#the linear regression the most. On each pass it fits one model per remaining column (added to the ones already
#selected), keeps the column that gives the highest adjusted R-squared, and stops when nothing improves the score or
#maxk columns have been chosen. It then uses the statsmodels.formula.api library to fit and return the final model
def forward_select(df, resp_str , maxk):
    remaining = set(df.columns)
    remaining.remove(resp_str)
    selected = []
    score_crnt = 0.0
    #Add one column per pass until maxk are chosen or no candidate improves the score
    while remaining and len(selected) < maxk:
        score_array = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(resp_str,' + '.join(selected + [candidate]))
            score = smf.ols(formula, df).fit().rsquared_adj
            score_array.append((score, candidate))
        score_array.sort()
        score_new, best_option = score_array.pop()
        #Stop if the best candidate does not raise adjusted R-squared (this also avoids looping forever on a tie)
        if score_new <= score_crnt:
            break
        remaining.remove(best_option)
        selected.append(best_option)
        score_crnt = score_new
    formula = "{} ~ {} + 1".format(resp_str,' + '.join(selected))
    model = smf.ols(formula, df).fit()
    return model
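
To make the selection loop easier to follow, here is a numpy-only sketch of the same greedy idea on synthetic data. It is not the function above: `adjusted_r2`, `forward_select_np`, the column names and the data are all made up for illustration, and it fits with `np.linalg.lstsq` instead of statsmodels.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select_np(X, y, names, maxk):
    """Greedily add the column that raises adjusted R-squared the most."""
    remaining = list(range(X.shape[1]))
    selected = []
    best = 0.0
    while remaining and len(selected) < maxk:
        # Score every remaining column when added to the ones already selected
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best:
            break  # no candidate improves the score
        selected.append(j)
        remaining.remove(j)
        best = score
    return [names[j] for j in selected], best

# Synthetic data (hypothetical): y depends on columns "a" and "c" only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

chosen, score = forward_select_np(X, y, ["a", "b", "c", "d"], maxk=2)
print(chosen, round(score, 3))
```

With `maxk=2` the sketch should pick the two columns that actually drive `y` and stop there, which is the behavior the cap is meant to give.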

In [23]:
model = forward_select(dfStats, "WAR", 5)
model.summary()

0,1,2,3
Dep. Variable:,WAR,R-squared:,0.746
Model:,OLS,Adj. R-squared:,0.742
Method:,Least Squares,F-statistic:,166.2
Date:,"Sun, 24 Dec 2017",Prob (F-statistic):,5.04e-82
Time:,17:48:59,Log-Likelihood:,-714.64
No. Observations:,289,AIC:,1441.0
Df Residuals:,283,BIC:,1463.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-15.8289,1.808,-8.755,0.000,-19.388 -12.270
wRC_plus,0.1821,0.014,13.243,0.000,0.155 0.209
BsR,0.1964,0.018,11.010,0.000,0.161 0.232
G,0.0100,0.003,3.161,0.002,0.004 0.016
K_percentage,-12.6889,3.342,-3.797,0.000,-19.268 -6.110
RBI,0.0103,0.005,2.201,0.029,0.001 0.019

0,1,2,3
Omnibus:,6.375,Durbin-Watson:,1.401
Prob(Omnibus):,0.041,Jarque-Bera (JB):,5.344
Skew:,0.247,Prob(JB):,0.0691
Kurtosis:,2.553,Cond. No.,10900.0


This looks like a complicated model to interpret, but you don't need all the info. R-squared is the fraction of the variation in the response variable that is explained by the linear model (here 0.746). Adj. R-squared (0.742) is the same quantity with a penalty for the number of variables in the regression, so it only goes up when a new variable adds enough to be worth including.
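
As a quick check, the adjusted value can be recomputed from the other numbers in the summary using the standard adjustment formula. This is a sketch that assumes the rounded values printed in the table; the variable names are mine.

```python
# Values read off the summary table above
r2 = 0.746   # R-squared
n = 289      # No. Observations
p = 5        # Df Model (number of predictors)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.742, matching Adj. R-squared in the table
```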
  
The F-statistic tests the regression as a whole: the null hypothesis is that all of the slope coefficients are zero (a t-statistic does the same job for a single coefficient). What is more helpful is the Prob (F-statistic), which is the p-value for that test. Here the p-value is $5.04 \times 10^{-82}$. We're going to assume an alpha level of 0.05 (reject versus fail to reject the null hypothesis), so this means the linear regression as a whole is statistically significant.
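
The F-statistic itself can be recovered from R-squared and the degrees of freedom. This is a sketch using the rounded values from the summary, so it only agrees with the table to about one decimal place; the variable names are mine.

```python
# Values read off the summary table above
r2 = 0.746      # R-squared
n = 289         # No. Observations
p = 5           # Df Model
df_resid = n - p - 1  # Df Residuals (283)

# Explained variation per predictor over unexplained variation per residual degree of freedom
f_stat = (r2 / p) / ((1 - r2) / df_resid)
print(round(f_stat, 1))  # 166.2, matching the F-statistic in the table
```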

So now we interpret. The bottom part gives us the model: multiply each variable named on the left by its coef (coefficient), then add everything together along with the Intercept. The P>|t| is the p-value for each individual variable, so we want it to be lower than 0.05 as well. So our regression model is:  
$$  
WAR = -15.83 + (\text{wRC+} \times 0.182) + (BsR \times 0.196) + (G \times 0.010) + (Kpercentage \times -12.69) + (RBI \times 0.010)  
$$
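
To see the equation in action, here is a sketch that plugs the first row of the `head()` output into the fitted coefficients. The dictionaries are just a convenient layout I chose; the numbers are copied from the tables above.

```python
# Coefficients as printed in the summary table
coef = {
    "Intercept": -15.8289,
    "wRC_plus": 0.1821,
    "BsR": 0.1964,
    "G": 0.0100,
    "K_percentage": -12.6889,
    "RBI": 0.0103,
}

# First row of dfStats.head()
player = {"wRC_plus": 171, "BsR": 23.1, "G": 589, "K_percentage": 0.221, "RBI": 373}
actual_war = 32.9

# Intercept plus each variable times its coefficient
predicted = coef["Intercept"] + sum(coef[name] * value for name, value in player.items())
residual = actual_war - predicted

print(round(predicted, 2), round(residual, 2))
```

For this player the model predicts a WAR of about 26.8, roughly 6 below the actual 32.9, which is a reminder that about a quarter of the variation is left unexplained.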

We can tell from this that, holding the other variables fixed, WAR has a positive association with wRC+, BsR, G, and RBI, and a negative association with K% (surprise). All of the coefficients are statistically significant at the 0.05 level; RBI has the weakest evidence (p = 0.029) but is still significant. This doesn't tell us a ton, so maybe we'll drop a few of these variables and try again when I get more time.
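
One caution when comparing the coefficients: their sizes depend on the units of each column, so the large K% coefficient does not by itself mean a bigger effect. A small sketch of rescaling two of them to more natural steps (the step sizes are my choice, not something from the model):

```python
# Coefficients as printed in the summary table
k_coef = -12.6889   # change in WAR per 1.0 change in K_percentage (a full 100 percentage points)
rbi_coef = 0.0103   # change in WAR per single RBI

# Rescale to steps that are easier to reason about
war_per_k_point = k_coef * 0.01    # one percentage point of strikeout rate
war_per_100_rbi = rbi_coef * 100   # one hundred RBI

print(round(war_per_k_point, 3), round(war_per_100_rbi, 2))
```

On these scales, one extra percentage point of strikeout rate goes with about 0.13 less WAR, and 100 extra RBI go with about 1 more WAR, other variables held fixed.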