## DSC630 - Week 3
### Chris Goodwin

For this week's project, we are looking at some attendance data for Dodgers games. We want to determine if there is a specific time that they should run promotions. From our initial plots in R, it seems as though we could use more promotions on Mondays and Wednesdays. However, let's look at the data in a different way in Python to determine if that is truly the case. Let us start by loading the data as a pandas dataframe.

In [4]:
import pandas as pd
file_path = "C:/Users/goodw/Downloads/dodgers.csv"
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,month,day,attend,day_of_week,opponent,temp,skies,day_night,cap,shirt,fireworks,bobblehead
0,APR,10,56000,Tuesday,Pirates,67,Clear,Day,NO,NO,NO,NO
1,APR,11,29729,Wednesday,Pirates,58,Cloudy,Night,NO,NO,NO,NO
2,APR,12,28328,Thursday,Pirates,57,Cloudy,Night,NO,NO,NO,NO
3,APR,13,31601,Friday,Padres,54,Cloudy,Night,NO,NO,YES,NO
4,APR,14,46549,Saturday,Padres,57,Cloudy,Night,NO,NO,NO,NO


Similar to what we did in R, we will now create a column called 'promo' that is 'YES' if any of the four promotions (cap, shirt, fireworks, bobblehead) ran for a given game and 'NO' otherwise.

In [5]:
data.loc[(data['cap'] == 'YES') | (data['shirt'] == 'YES') | (data['fireworks'] == 'YES') | (data['bobblehead'] == 'YES'), 'promo'] = 'YES'  
data.loc[(data['cap'] == 'NO') & (data['shirt'] == 'NO') & (data['fireworks'] == 'NO') & (data['bobblehead'] == 'NO'), 'promo'] = 'NO'
data.head()

Unnamed: 0,month,day,attend,day_of_week,opponent,temp,skies,day_night,cap,shirt,fireworks,bobblehead,promo
0,APR,10,56000,Tuesday,Pirates,67,Clear,Day,NO,NO,NO,NO,NO
1,APR,11,29729,Wednesday,Pirates,58,Cloudy,Night,NO,NO,NO,NO,NO
2,APR,12,28328,Thursday,Pirates,57,Cloudy,Night,NO,NO,NO,NO,NO
3,APR,13,31601,Friday,Padres,54,Cloudy,Night,NO,NO,YES,NO,YES
4,APR,14,46549,Saturday,Padres,57,Cloudy,Night,NO,NO,NO,NO,NO
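
The same flag can also be built in a single vectorized step. The sketch below is only an illustration on made-up rows (the `games` frame and its values are hypothetical, not taken from dodgers.csv). One difference from the cell above: any value other than 'YES' is treated as 'NO' here, whereas the two `.loc` assignments would leave such rows empty.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from dodgers.csv.
games = pd.DataFrame({
    "cap": ["NO", "YES", "NO"],
    "shirt": ["NO", "NO", "NO"],
    "fireworks": ["NO", "NO", "YES"],
    "bobblehead": ["NO", "NO", "NO"],
})

promo_cols = ["cap", "shirt", "fireworks", "bobblehead"]

# True where at least one promotion column equals 'YES' for that game.
any_promo = games[promo_cols].eq("YES").any(axis=1)

# Map the boolean flag back to the 'YES'/'NO' labels used in the other columns.
games["promo"] = any_promo.map({True: "YES", False: "NO"})

print(games["promo"].tolist())
```

Keeping the column list in one place (`promo_cols`) makes it easier to add or drop a promotion type later without editing a long boolean expression.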


I first want to fit a model that looks at attendance as a function of month and day of week. This will help us determine which months and/or days might need a boost in attendance. 

To create this model, we will use the ols function from statsmodels. Since month and day of the week are categorical variables, we wrap each of them in C() in the formula so that they are treated as categories rather than numbers. We will then review the summary to see whether any month or day is associated with lower attendance.

In [14]:
from statsmodels.formula.api import ols

fit = ols('attend ~ C(month) + C(day_of_week)', data=data).fit() 

fit.summary()

Dep. Variable:,attend,R-squared:,0.411
Model:,OLS,Adj. R-squared:,0.307
Method:,Least Squares,F-statistic:,3.954
Date:,"Sat, 20 Jun 2020",Prob (F-statistic):,0.000128
Time:,11:31:20,Log-Likelihood:,-823.91
No. Observations:,81,AIC:,1674.0
Df Residuals:,68,BIC:,1705.0
Df Model:,12,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.805e+04,2661.948,14.293,0.000,3.27e+04,4.34e+04
C(month)[T.AUG],3965.9784,2681.525,1.479,0.144,-1384.921,9316.878
C(month)[T.JUL],4768.3867,2868.802,1.662,0.101,-956.218,1.05e+04
C(month)[T.JUN],8753.4054,3057.367,2.863,0.006,2652.524,1.49e+04
C(month)[T.MAY],-1957.7296,2583.531,-0.758,0.451,-7113.086,3197.627
C(month)[T.OCT],-1500.1929,4561.773,-0.329,0.743,-1.06e+04,7602.683
C(month)[T.SEP],-692.4947,2839.495,-0.244,0.808,-6358.619,4973.630
C(day_of_week)[T.Monday],-4991.2625,2826.580,-1.766,0.082,-1.06e+04,649.091
C(day_of_week)[T.Saturday],3314.3441,2717.208,1.220,0.227,-2107.761,8736.449

Omnibus:,1.089,Durbin-Watson:,2.163
Prob(Omnibus):,0.58,Jarque-Bera (JB):,1.008
Skew:,0.266,Prob(JB):,0.604
Kurtosis:,2.877,Cond. No.,8.74


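From this, we can see that some of the categories we anticipated are associated with lower attendance. Each coefficient is measured relative to the model's reference levels, which by default are the alphabetically first categories (April and Friday) and therefore have no row of their own. Monday and Wednesday games have negative coefficients, meaning that attendance on these days is lower than on the reference day once month is accounted for. The Monday estimate is about 4,991 fewer fans per game, although its p-value of 0.082 falls just short of the usual 0.05 threshold. The same goes for the months of May, September, and October, although none of those three coefficients is statistically significant. I am going to ignore October, because there are so few games that I don't think an analysis of it would be reliable. When I did my plotting earlier, I did find it strange that there were no promotions in September.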
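
To make the coefficients more concrete, the sketch below adds them up to get the model's predicted attendance for a given game. It uses only the rows shown in the summary above. The baseline levels (APR and Friday) are an assumption based on the default alphabetical ordering of categories, and the intercept is only as precise as the printed 3.805e+04. The `predict` helper is hypothetical and deliberately fails for days whose coefficients are not listed.

```python
# Coefficients copied from the first model's summary table (rows shown above).
intercept = 38050.0  # printed as 3.805e+04, so rounded to four significant figures

month_coef = {
    "APR": 0.0,  # assumed reference month: contributes nothing beyond the intercept
    "AUG": 3965.9784,
    "JUL": 4768.3867,
    "JUN": 8753.4054,
    "MAY": -1957.7296,
    "OCT": -1500.1929,
    "SEP": -692.4947,
}

day_coef = {
    "Friday": 0.0,  # assumed reference day
    "Monday": -4991.2625,
    "Saturday": 3314.3441,
}


def predict(month, day):
    """Predicted attendance: intercept plus the month and day offsets.

    Raises KeyError for levels that are not listed above.
    """
    return intercept + month_coef[month] + day_coef[day]


print(round(predict("APR", "Friday")))  # baseline game
print(round(predict("JUN", "Monday")))  # June offset plus Monday offset
```

Under these assumptions, a Monday game in June is predicted at roughly 41,800 fans, compared with about 38,050 for the baseline game.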

I now want to create another model that takes into account promotions. 

In [15]:
from statsmodels.formula.api import ols

fit = ols('attend ~ C(month) + C(day_of_week) + C(promo)', data=data).fit() 

fit.summary()

Dep. Variable:,attend,R-squared:,0.525
Model:,OLS,Adj. R-squared:,0.432
Method:,Least Squares,F-statistic:,5.687
Date:,"Sat, 20 Jun 2020",Prob (F-statistic):,7.14e-07
Time:,11:40:33,Log-Likelihood:,-815.24
No. Observations:,81,AIC:,1658.0
Df Residuals:,67,BIC:,1692.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.015e+04,3113.881,9.684,0.000,2.39e+04,3.64e+04
C(month)[T.AUG],2837.7305,2443.342,1.161,0.250,-2039.201,7714.662
C(month)[T.JUL],2802.9283,2642.576,1.061,0.293,-2471.675,8077.532
C(month)[T.JUN],7109.5093,2797.521,2.541,0.013,1525.634,1.27e+04
C(month)[T.MAY],-1571.9027,2340.311,-0.672,0.504,-6243.183,3099.378
C(month)[T.OCT],174.9760,4149.979,0.042,0.966,-8108.418,8458.370
C(month)[T.SEP],-193.9482,2573.013,-0.075,0.940,-5329.703,4941.806
C(day_of_week)[T.Monday],2741.3176,3206.211,0.855,0.396,-3658.306,9140.941
C(day_of_week)[T.Saturday],1.029e+04,3014.575,3.413,0.001,4272.865,1.63e+04

Omnibus:,3.445,Durbin-Watson:,2.258
Prob(Omnibus):,0.179,Jarque-Bera (JB):,3.424
Skew:,0.133,Prob(JB):,0.181
Kurtosis:,3.971,Cond. No.,11.7


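We can see that adding the promotion variable improves the model: R-squared rises from 0.411 to 0.525, adjusted R-squared from 0.307 to 0.432, and AIC falls from 1674 to 1658. Once promotions are accounted for, the coefficients for Monday and Wednesday games turn positive (Monday moves from about -4,991 to about +2,741, though with a p-value of 0.396). Interestingly, however, accounting for promos did not change the month effects much: June remains the only month with a statistically significant coefficient.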
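
As a sanity check on the comparison between the two models, the fit statistics can be recomputed from the values printed in the two summaries. This is a minimal sketch using the standard formulas for adjusted R-squared, AIC, and BIC. The function names are my own, and the inputs are the rounded figures from the output above, so the last digit may differ slightly.

```python
import math

n_obs = 81  # No. Observations in both summaries

# (R-squared, Df Model, Log-Likelihood) copied from the two summaries above.
models = {
    "month + day": (0.411, 12, -823.91),
    "month + day + promo": (0.525, 13, -815.24),
}


def adjusted_r2(r2, n, df_model):
    """Penalize R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - df_model - 1)


def aic(loglik, df_model):
    k = df_model + 1  # slope parameters plus the intercept
    return 2 * k - 2 * loglik


def bic(loglik, n, df_model):
    k = df_model + 1
    return k * math.log(n) - 2 * loglik


results = {}
for name, (r2, df_model, loglik) in models.items():
    results[name] = (
        adjusted_r2(r2, n_obs, df_model),
        aic(loglik, df_model),
        bic(loglik, n_obs, df_model),
    )
    adj, a, b = results[name]
    print(f"{name}: adj R2 = {adj:.3f}, AIC = {a:.0f}, BIC = {b:.0f}")
```

All three measures favor the model that includes promo: adjusted R-squared is higher, and both AIC and BIC are lower even after the penalty for the extra parameter.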

From my analysis in R and Python, I am reasonably confident that running more promotions on Mondays and Wednesdays throughout the entire season would raise attendance on those days.
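
Before acting on this recommendation, it may be worth confirming how promotions are currently spread across the days of the week, since the day and promo effects are entangled in the second model. The sketch below shows one way to tabulate that with `pd.crosstab`. The rows are made up for illustration, and in the notebook the same two lines would be applied to `data` instead of the hypothetical `games` frame.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from dodgers.csv.
games = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Wednesday", "Wednesday", "Saturday", "Saturday"],
    "promo": ["NO", "NO", "NO", "YES", "YES", "YES"],
})

# Count games with and without a promotion for each day of the week.
promo_by_day = pd.crosstab(games["day_of_week"], games["promo"])

# Share of each day's games that had a promotion.
promo_by_day["promo_share"] = promo_by_day["YES"] / promo_by_day.sum(axis=1)

print(promo_by_day)
```

A low `promo_share` on the days with weak attendance would support the idea that those days are under-promoted.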