### Sample notebook: Poisson regression on home run data

`Official`: number of home runs in the 4 official competition games  
`Practice`: average number of home runs over 4 practice games  
`Stadium`: stadium where the competition is held (A or B)  

#### Import libraries

In [11]:
import pandas as pd
import statsmodels.api as sm

#### Parameters  

In [12]:
csv_in = 'homerun.csv'

#### Read CSV file  

In [13]:
df = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=0)
print(df.shape)
df.info()  # info() prints its report itself and returns None
display(df.head())

(100, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
Official    100 non-null int64
Practice    100 non-null float64
Stadium     100 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 2.4+ KB


|   | Official | Practice | Stadium |
|---|----------|----------|---------|
| 0 | 6 | 8.31 | A |
| 1 | 6 | 9.44 | A |
| 2 | 6 | 9.5 | A |
| 3 | 12 | 9.07 | A |
| 4 | 10 | 10.16 | A |


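#### Create dummy variables  
`drop_first=True` drops the first category (stadium A), so the remaining `Stadium_B` column is 1 for stadium B and 0 for stadium A.  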

In [14]:
df_dumm = pd.get_dummies(df, drop_first=True)
display(df_dumm.head())

|   | Official | Practice | Stadium_B |
|---|----------|----------|-----------|
| 0 | 6 | 8.31 | 0 |
| 1 | 6 | 9.44 | 0 |
| 2 | 6 | 9.5 | 0 |
| 3 | 12 | 9.07 | 0 |
| 4 | 10 | 10.16 | 0 |
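
The encoding step can be tried on a toy frame. The sketch below uses made-up rows (not taken from `homerun.csv`) and assumes a reasonably recent pandas. `dtype=int` is an addition here: pandas 2.0 and later return boolean dummy columns by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not from homerun.csv)
toy = pd.DataFrame({
    'Practice': [8.0, 9.5, 10.0],
    'Stadium': ['A', 'B', 'A'],
})

# drop_first=True drops the first category (A), so A becomes the baseline
# and Stadium_B is 1 only for stadium B rows.
# dtype=int keeps the dummies as 0/1 integers instead of booleans.
toy_dumm = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_dumm)
```

Numeric columns such as `Practice` pass through unchanged; only the text column `Stadium` is encoded.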


#### Separate explanatory variables and response variable  

In [15]:
X = df_dumm[['Practice', 'Stadium_B']]  # explanatory variables
y = df_dumm['Official']                 # response variable (count)
X_c = sm.add_constant(X)                # prepend the intercept column 'const'

#### Poisson regression  

In [16]:
model = sm.GLM(y, X_c, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               Official   No. Observations:                  100
Model:                            GLM   Df Residuals:                       97
Model Family:                 Poisson   Df Model:                            2
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -235.29
Date:                Wed, 24 Apr 2019   Deviance:                       84.808
Time:                        14:39:20   Pearson chi2:                     83.8
No. Iterations:                     4   Covariance Type:             nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.2631      0.370      3.417      0.001       0.539       1.988
Practice       0.0801      0.037      2.162      0.0

#### Regression coefficients  

In [17]:
print(results.params)

const        1.263105
Practice     0.080073
Stadium_B   -0.031999
dtype: float64
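
Because the model uses a log link, each coefficient acts multiplicatively on the expected count once it is exponentiated. The sketch below copies the printed coefficients by hand (the `coef` dictionary is only for illustration) and converts them to rate ratios.

```python
import math

# Coefficients copied from the printed results.params above
coef = {'const': 1.263105, 'Practice': 0.080073, 'Stadium_B': -0.031999}

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in that variable, holding the others fixed.
rate_ratio = {name: math.exp(b) for name, b in coef.items() if name != 'const'}

for name, rr in rate_ratio.items():
    print(f'{name}: rate ratio = {rr:.4f} ({(rr - 1) * 100:+.1f}%)')
```

With these values, one more practice home run on average corresponds to roughly 8% more expected official home runs, and stadium B to roughly 3% fewer than stadium A. The ratios say nothing about statistical significance, which has to be read from the summary table.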


#### Fitted model
$\ln(\mathrm{E}[\mathrm{Official}]) = 1.263 + 0.080 \times \mathrm{Practice} - 0.032 \times \mathrm{Stadium\_B}$
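
As a check on the equation, the expected count can be computed by hand: evaluate the linear predictor, then apply the exponential, which is the inverse of the log link. The sketch below uses only the standard library. The helper `predict_official` is a hypothetical name, and the inputs are the two example rows used in the prediction step of this notebook.

```python
import math

# Coefficients as printed by results.params
CONST, B_PRACTICE, B_STADIUM_B = 1.263105, 0.080073, -0.031999

def predict_official(practice, stadium_b):
    """Expected official home runs: exp of the linear predictor."""
    eta = CONST + B_PRACTICE * practice + B_STADIUM_B * stadium_b
    return math.exp(eta)

pred_a = predict_official(10.5, 0)  # stadium A
pred_b = predict_official(11.5, 1)  # stadium B
print(round(pred_a, 2), round(pred_b, 2))
```

These values agree with the `results.predict()` output in the prediction step (about 8.20 and 8.60), up to rounding of the coefficients.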

#### Predict with the fitted model  
NOTE: each row of prediction data must start with a 1 for the constant term, with columns in the same order as the fitted parameters.  
(You can also use `sm.add_constant()` to add this column.)  

In [18]:
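dfX_test = pd.DataFrame(
    [[1, 10.5, 0],   # const, Practice, Stadium_B (stadium A)
     [1, 11.5, 1]],  # const, Practice, Stadium_B (stadium B)
    columns=results.params.index,  # example inputs, same order as the fitted parameters
)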
print('X for prediction:')
display(dfX_test)
y_test = results.predict(dfX_test)
print('Predicted y:')
print(y_test)

X for prediction:


|   | const | Practice | Stadium_B |
|---|-------|----------|-----------|
| 0 | 1 | 10.5 | 0 |
| 1 | 1 | 11.5 | 1 |


Predicted y:
0    8.197812
1    8.601534
dtype: float64
