Estimators
* Desirable properties: unbiased and efficient
* Examples: mean, conditional means

Linear regression
* Linear regression estimates the slopes and intercepts of the population regression line
* Measures the linear relationship between explanatory variables and a response variable
* The slope of the population regression line is the expected effect on Y of a unit change in X        

The general linear model can be expressed as,

$$y_i = \beta_0 + \beta_1 x_i + \mu_i, i=1...n$$

* y is the dependent variable
* x is the independent variable
* $\beta_0$ = intercept
* $\beta_1$ = slope
* $\mu_i$ = the regression residual or error term

Ordinary Least Squares (OLS)
    
* An estimator of the population regression parameters; the fitted line estimates the conditional mean of y given x
* Minimizes the average squared difference between the actual values of y and the estimated line's predicted values
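
As a rough illustration of the least-squares criterion, the sketch below fits a line to synthetic data using the closed-form solution for a single regressor, then checks that nudging the fitted line only raises the sum of squared residuals (minimizing the sum is equivalent to minimizing the average). The data-generating values 2.0 and 0.5, the seed, and the helper `sse` are assumptions made for this example and are not part of the hockey data.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 0.5 x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Closed-form OLS for a single regressor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

def sse(intercept, slope):
    """Sum of squared residuals for a candidate line."""
    resid = y - (intercept + slope * x)
    return float(np.sum(resid ** 2))

# Compare the fitted line with a few nearby candidate lines
best = sse(b0, b1)
perturbed = [sse(b0 + d0, b1 + d1) for d0 in (-0.1, 0.1) for d1 in (-0.05, 0.05)]

print("intercept, slope:", b0, b1)
print("fitted line has the smallest SSE:", all(p > best for p in perturbed))
```
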

Consider $winpercent_i = \beta_0 + \beta_1 firstgoal_i + \mu_i, i=1...n$    

* The unit of observation is a team-season, for $i=1...30$
* $winpercent_i$: a team's season-level winning percentage
* $firstgoal_i$: the proportion of total season games in which the team scored first
* The intercept, $\beta_0$, (taken literally) is the estimated winning percentage if the team had 0 first-goal games (i.e., $x=0$)
* The parameter, $\beta_1$, is the estimated change in a team's winning percentage for each 1 point increase in the proportion of first-goal games

Specification

* Continuous variables: parameter estimates are slope effects
* Categorical data represented as a series of indicator variables (i.e. fixed effects): parameter estimates are shift effects relative to the intercept/constant
* Binary (1/0) dependent variables: OLS gives a linear probability model
    * Generally obtains unbiased slope effects
    * Can predict outside the 0-1 interval
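
The point about linear probability models predicting outside the 0-1 interval can be seen in a small sketch. The data here are made up for illustration (the outcome is simply 1 whenever a hypothetical regressor `x` is positive), and the fit uses `numpy.linalg.lstsq` rather than the statsmodels calls used later in this notebook.

```python
import numpy as np

# Hypothetical data: the outcome is 1 whenever x is positive, 0 otherwise
x = np.linspace(-3, 3, 61)
y = (x > 0).astype(float)

# OLS with a constant: the design matrix columns are [1, x]
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values are read as probabilities in a linear probability model
fitted = X @ beta

print("intercept, slope:", beta)
print("min and max fitted probability:", fitted.min(), fitted.max())
```

At the extremes of `x` the fitted line runs below 0 and above 1, even though the slope is still a sensible average effect over the middle of the data.
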

Examine the impact of physical characteristics on winning

In [None]:
%matplotlib inline
import os
import sys
import numpy 
import pandas 
import matplotlib.pyplot as plt
import pylab

# imports regression library
# OLS: ordinary least squares (estimation technique used to estimate the linear regression model)
import statsmodels.api as sm
from statsmodels.formula.api import ols
    
# Set some Pandas options
pandas.set_option('display.notebook_repr_html', True)
pandas.set_option('display.max_columns', 40)
pandas.set_option('display.max_rows', 10)
pandas.options.display.float_format = '{:,.4f}'.format

In [None]:
dm = pandas.read_csv('2010game_physical.csv')

In [None]:
# dm.columns.tolist()
# len(dm)
# dm.head()
# dm.tail()
# dm.describe()
# dm.dtypes

In [None]:
dm.head()


In [None]:
dm.describe()

# generate variables

In [None]:
dm['dGoals'] = dm['homeGoals'] - dm['awayGoals'] #regulation score-margin

# home-win indicator, built row by row with apply
dm['homeWin'] = dm.apply(lambda x: 1 if (x['homeTeam'] == x['winteamcode']) else 0, axis=1)
# same indicator, vectorized with numpy.where
dm['ishwin'] = numpy.where(dm['homeTeam']==dm['winteamcode'], 1 , 0)

# differences
dm['dAge'] =  dm['homeAge']-dm['awayAge']
dm['dHeight'] = dm['homeHeight']-dm['awayHeight']
dm['dWeight'] = dm['homeWeight']-dm['awayWeight']

dm['lnDAge'] = numpy.log(dm['homeAge']/dm['awayAge'])
dm['lnDHeight'] = numpy.log(dm['homeHeight']/dm['awayHeight'])
dm['lnDWeight'] = numpy.log(dm['homeWeight']/dm['awayWeight'])

dm['DSalary'] = dm['homeSalary'] - dm['awaySalary']

# logs
dm['lnhsalary'] = numpy.log(dm['homeSalary'])
dm['lnasalary'] = numpy.log(dm['awaySalary'])
dm['lnDSalary'] = numpy.log(dm['homeSalary']/dm['awaySalary'])


In [None]:
plt.hist(dm['homeGoals'])

Estimate the impact of salary on goals scored: $hgoals_i = \beta_0 + \beta_1 hsalary_i  + \mu_i$

In [None]:
# note, a vector of ones is included for the constant/intercept term

Y = dm['homeGoals']
X = sm.add_constant(dm['homeSalary'])

m1 = sm.OLS(Y, X).fit()
m1.summary2()

In [None]:
m1.params

$\beta_0=2.26$ 

$\beta_1=0.014$ 

A one million dollar increase in salary is associated with an increase of 0.014 goals per game.

Taken literally, the constant represents the number of goals scored per game with a home salary of 0 dollars.
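
To make the interpretation concrete, the rounded estimates can be plugged back into the fitted line. This is only a sketch: it assumes salary is measured in millions of dollars (as the one-million-dollar interpretation suggests), and the function name and the example salaries of 50 and 51 million are hypothetical.

```python
# Rounded estimates reported above
beta0 = 2.26   # intercept
beta1 = 0.014  # slope on home salary (assumed to be in millions of dollars)

def predicted_home_goals(salary_millions):
    """Predicted home goals per game from the fitted line."""
    return beta0 + beta1 * salary_millions

# The gap between two teams one million dollars apart equals the slope
gap = predicted_home_goals(51) - predicted_home_goals(50)

print("predicted goals at a salary of 0:", predicted_home_goals(0))
print("change for one extra million:", gap)
```
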

In [None]:
# embed the variables into the equation
temp = sm.OLS(dm['homeGoals'],sm.add_constant(dm['homeSalary'])).fit()
temp.summary2()

Estimate the impact of salary on goals scored: $hgoals_i = \beta_0 + \beta_1 ln(hsalary_i)  + \mu_i$

In [None]:
dm.head()

In [None]:
temp = sm.OLS(dm['homeGoals'],sm.add_constant(dm['lnhsalary'])).fit()
temp.summary()

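Logs transform the data so that changes are measured in percent terms

In this level-log model the coefficient, 0.40, is the change in goals per one-unit change in log salary, so a one percent increase in salary is associated with an increase of about 0.40/100 = 0.004 goals per game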
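
The percent interpretation of a logged regressor can be checked with a little arithmetic. In a level-log model the coefficient is the change in y per one-unit change in ln(x), so a one percent increase in x moves y by beta1 * ln(1.01), which is close to beta1 / 100. The sketch below uses the 0.40 coefficient quoted above; the variable names are illustrative only.

```python
import math

beta1 = 0.40  # coefficient on ln(home salary) quoted in the text

# Effect on goals per game of a 1% increase in salary
exact_change = beta1 * math.log(1.01)

# The usual rule of thumb: divide the coefficient by 100
rule_of_thumb = beta1 / 100

print(exact_change, rule_of_thumb)
```
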

$$hgoals_i = \beta_0 + \beta_1 lnhsalary_i + \beta_2 lnasalary_i + \mu_i$$

In [None]:
temp = sm.OLS(dm['homeGoals'],sm.add_constant(dm[['lnhsalary', 'lnasalary']])).fit()
temp.summary()

In [None]:
temp = sm.OLS(dm['dGoals'],sm.add_constant(dm[['lnhsalary', 'lnasalary']])).fit()
temp.summary()

In [None]:
plt.hist(dm['awayGoals'])

In [None]:
plt.hist(dm['dGoals'])

Estimate the impact of away-team salary on the goals it allows: $hgoals_i = \beta_0 + \beta_1 asalary_i + \mu_i$

In [None]:
temp = sm.OLS(dm['homeGoals'],sm.add_constant(dm['awaySalary'])).fit()
temp.summary2()

# Contest model 

Represent the data relative to a team (e.g. home)

Difference: $$hwin_i = \beta_0 + \beta_1 (hsalary_i - asalary_i) + \mu_i $$
Difference: $$dgoals_i = \beta_0 + \beta_1 (hsalary_i - asalary_i) + \mu_i $$

Log difference: $$hwin_i = \beta_0 + \beta_1 ln(hsalary_i/asalary_i) + \mu_i $$
Log difference: $$dgoals_i = \beta_0 + \beta_1 ln(hsalary_i/asalary_i) + \mu_i $$
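
One way to see how the log-difference form relates to a regression on both logged salaries: since ln(h/a) = ln(h) - ln(a), the log-difference model is the two-regressor model with the coefficients restricted to be equal and opposite. The sketch below illustrates this on synthetic data. The salary distribution, the true coefficient of 0.3, and all variable names are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical home and away salaries (strictly positive)
h = np.exp(rng.normal(4.0, 0.3, size=n))
a = np.exp(rng.normal(4.0, 0.3, size=n))

# The log of the ratio equals the difference of the logs
ln_ratio = np.log(h / a)
same = np.allclose(ln_ratio, np.log(h) - np.log(a))

# Synthetic goal differential driven only by the log salary ratio
d_goals = 0.2 + 0.3 * ln_ratio + rng.normal(0, 0.1, size=n)

# Restricted model: constant + ln(h/a)
X_r = np.column_stack([np.ones(n), ln_ratio])
b_r, *_ = np.linalg.lstsq(X_r, d_goals, rcond=None)

# Unrestricted model: constant + ln(h) + ln(a)
X_u = np.column_stack([np.ones(n), np.log(h), np.log(a)])
b_u, *_ = np.linalg.lstsq(X_u, d_goals, rcond=None)

print("identity holds:", same)
print("restricted slope:", b_r[1])
print("unrestricted slopes:", b_u[1], b_u[2])
```

In this synthetic setup the unrestricted slopes come out close to equal and opposite, which is the restriction the log-difference specification imposes.
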


In [None]:
##Difference
temp = sm.OLS(dm['dGoals'],sm.add_constant(dm['DSalary'])).fit()
temp.summary()

In [None]:
temp = sm.OLS(dm['dGoals'],sm.add_constant(dm['lnDSalary'])).fit()
temp.summary()

A one percent increase in the home-to-away salary ratio increases the goal differential by about 0.22/100 = 0.0022 goals per game

In [None]:
temp = sm.OLS(dm['homeWin'],sm.add_constant(dm['lnDSalary'])).fit()
temp.summary()

A one percent increase in the home-to-away salary ratio increases the probability of a home win by about 0.23 percentage points (0.0023).

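The home team advantage is about 1.7 percentage points of win probability (0.5173 - 0.5000 = 0.0173): the constant is the predicted home-win probability when the two salaries are equal, so that $ln(hsalary_i/asalary_i) = 0$.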