### Sample notebook: logistic regression on the cake data
An example procedure for logistic regression analysis of the cake data.  

WedSatSun: 1 if the day is a Wednesday, Saturday, or Sunday; 0 otherwise.  
HighTemp: the day's high temperature.  
Sales: 1 if the cake in question was sold that day; 0 otherwise.

#### Import libraries  

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import preprocessing

#### Parameters  

In [2]:
csv_in = 'cake.csv'

#### Read CSV file

In [3]:
df = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=0)
print(df.shape)
print(df.info())
display(df.head())

(21, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   WedSatSun  21 non-null     int64
 1   HighTemp   21 non-null     int64
 2   Sales      21 non-null     int64
dtypes: int64(3)
memory usage: 632.0 bytes
None


   WedSatSun  HighTemp  Sales
0          0        28      1
1          0        24      0
2          1        26      0
3          0        24      0
4          0        23      0


#### Separate explanatory variables and objective variable  
Split the data into the explanatory variables (X) and the objective variable (y).  

In [4]:
y = df['Sales']
X = df[['WedSatSun','HighTemp']]
print('X:', X.shape)
print('y:', y.shape)

X: (21, 2)
y: (21,)


#### Logistic regression (standardized)

In [5]:
# NOTE: preprocessing.scale() returns an ndarray, not a DataFrame,
# so rebuild a DataFrame to keep the column names. y is not scaled.
X_scaled = preprocessing.scale(X)
dfX_scaled = pd.DataFrame(X_scaled, columns=X.columns)
dfX_scaled_c = sm.add_constant(dfX_scaled)
model = sm.Logit(y, dfX_scaled_c)
results = model.fit(disp=0)
print(results.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.362    
Dependent Variable: Sales            AIC:              23.8021  
Date:               2020-04-28 22:05 BIC:              26.9357  
No. Observations:   21               Log-Likelihood:   -8.9010  
Df Model:           2                LL-Null:          -13.955  
Df Residuals:       18               LLR p-value:      0.0063834
Converged:          1.0000           Scale:            1.0000   
No. Iterations:     6.0000                                      
-----------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025  0.975]
-----------------------------------------------------------------
const         -0.7774    0.6219  -1.2500  0.2113  -1.9963  0.4415
WedSatSun      1.2088    0.6139   1.9691  0.0489   0.0056  2.4120
HighTemp       1.2639    0.6891   1.8342  0.0666  -0.0867  2.6144
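
The fit statistics in the header of this summary can be reproduced by hand from the two log-likelihoods. The sketch below uses only the rounded values printed above and assumes the usual definitions: McFadden's pseudo R-squared, AIC = 2k - 2LL, BIC = k ln(n) - 2LL, and a likelihood-ratio test with 2 degrees of freedom. The variable names are illustrative.

```python
import math

# Rounded values copied from the summary above.
ll_model = -8.9010   # Log-Likelihood
ll_null = -13.955    # LL-Null (intercept-only model)
n_obs = 21           # No. Observations
k_params = 3         # const + WedSatSun + HighTemp

pseudo_r2 = 1 - ll_model / ll_null
aic = 2 * k_params - 2 * ll_model
bic = k_params * math.log(n_obs) - 2 * ll_model

# Likelihood-ratio statistic; for 2 degrees of freedom the chi-square
# survival function has the closed form exp(-x / 2).
llr_stat = 2 * (ll_model - ll_null)
llr_pvalue = math.exp(-llr_stat / 2)

print(f"Pseudo R-squared: {pseudo_r2:.3f}")
print(f"AIC: {aic:.3f}")
print(f"BIC: {bic:.3f}")
print(f"LLR p-value: {llr_pvalue:.5f}")
```

The results should agree with the summary up to the rounding of the inputs. Because these statistics depend only on the likelihood, they are the same whether or not the explanatory variables are standardized.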



#### Regression coefficients (standardized)  

In [6]:
print(results.params)

const       -0.777380
WedSatSun    1.208797
HighTemp     1.263881
dtype: float64


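**Contribution to the objective variable:**
**HighTemp (positive) > WedSatSun (positive)**

Because both variables are standardized, the magnitudes of their coefficients can be compared directly: 1.264 for HighTemp versus 1.209 for WedSatSun. The two values are close, so the difference in contribution is small.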
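
Standardization rescales each column to zero mean and unit standard deviation, which is what makes the coefficients above comparable. A minimal sketch using only the standard library, applied to the five HighTemp values displayed by `df.head()`. It assumes the population standard deviation (divisor n), which is the convention scikit-learn's scaler follows. The notebook itself scales all 21 rows, so its actual scaled values differ from these.

```python
import statistics

# HighTemp values from the five rows shown by df.head().
high_temp = [28, 24, 26, 24, 23]

mean = statistics.mean(high_temp)
sd = statistics.pstdev(high_temp)  # population SD (divisor n)

# z-score: subtract the mean, then divide by the standard deviation.
z_scores = [(x - mean) / sd for x in high_temp]

print([round(z, 3) for z in z_scores])
print("mean of z:", round(statistics.mean(z_scores), 10))
print("SD of z:", round(statistics.pstdev(z_scores), 10))
```

Note that pandas' `Series.std()` defaults to the sample standard deviation (divisor n - 1), so scaling by hand with pandas gives slightly different values unless `ddof=0` is passed.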

#### Logistic regression  

In [7]:
X_c = sm.add_constant(X)
model = sm.Logit(y, X_c)
results = model.fit(disp=0)  # disp=0 suppresses the convergence message; disp=1 prints it
print(results.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.362    
Dependent Variable: Sales            AIC:              23.8021  
Date:               2020-04-28 22:05 BIC:              26.9357  
No. Observations:   21               Log-Likelihood:   -8.9010  
Df Model:           2                LL-Null:          -13.955  
Df Residuals:       18               LLR p-value:      0.0063834
Converged:          1.0000           Scale:            1.0000   
No. Iterations:     7.0000                                      
-----------------------------------------------------------------
            Coef.    Std.Err.     z     P>|z|    [0.025    0.975]
-----------------------------------------------------------------
const      -15.2035    7.6567  -1.9856  0.0471  -30.2104  -0.1966
WedSatSun    2.4426    1.2405   1.9691  0.0489    0.0114   4.8739
HighTemp     0.5445    0.2969   1.8342  0.0666   -0.0373   1.1264



#### Regression coefficients  

In [8]:
print(results.params)

const       -15.203506
WedSatSun     2.442647
HighTemp      0.544505
dtype: float64
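
The standardized and unstandardized fits describe the same model: each standardized slope equals the raw slope multiplied by the standard deviation of that column, so their ratio recovers the standard deviation used for scaling. A small check using the coefficients printed in this notebook (the variable names are illustrative):

```python
# Coefficients copied from the two fits in this notebook.
coef_std = {"WedSatSun": 1.208797, "HighTemp": 1.263881}  # standardized X
coef_raw = {"WedSatSun": 2.442647, "HighTemp": 0.544505}  # original units

# beta_std = beta_raw * sd(x)  =>  sd(x) = beta_std / beta_raw
implied_sd = {name: coef_std[name] / coef_raw[name] for name in coef_raw}

for name, sd in implied_sd.items():
    print(f"{name}: implied standard deviation = {sd:.4f}")
```

This is why HighTemp has the smaller raw coefficient but the larger standardized one: one standard deviation of HighTemp spans more than two units. The z statistics and p-values of the two slopes are identical in both summaries.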


#### Obtained model:
$\ln\left(\frac{p}{1-p}\right) = 2.443 \times \mathrm{WedSatSun} + 0.545 \times \mathrm{HighTemp} - 15.204$, where $p = P(\mathrm{Sales} = 1)$ is the probability that the cake is sold.
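
Solving the model equation for the probability gives the logistic (sigmoid) function, p = 1 / (1 + exp(-z)), where z is the right-hand side of the equation. A minimal sketch with the coefficients copied from `results.params`; the helper name `sale_probability` is purely illustrative, and the two example days are taken from the rows shown by `df.head()`.

```python
import math

# Coefficients copied from results.params above.
CONST = -15.203506
B_WEDSATSUN = 2.442647
B_HIGHTEMP = 0.544505

def sale_probability(wed_sat_sun, high_temp):
    """Return P(Sales = 1) for one day (hypothetical helper)."""
    log_odds = CONST + B_WEDSATSUN * wed_sat_sun + B_HIGHTEMP * high_temp
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two example days: (WedSatSun, HighTemp).
for day in [(0, 24), (1, 26)]:
    print(day, round(sale_probability(*day), 3))
```

A probability of exactly 0.5 corresponds to log-odds of 0, which is why a 0.5 cutoff on the probability and a 0 cutoff on the linear predictor give the same classification.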

#### Accuracy (ratio of correct predictions)  

In [9]:
n_data = y.shape[0]
n_correct_pred = ((results.predict(X_c) >= 0.5) == y).sum()
print(n_correct_pred / n_data)

0.8095238095238095
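
The value above is 17 correct predictions out of 21. To make the counting concrete, the sketch below applies the same rule (predict 1 when the probability is at least 0.5, which is the same as log-odds of at least 0) to the five rows displayed by `df.head()` only, so its result is not the figure for the full data. The coefficients are copied from `results.params`.

```python
# (WedSatSun, HighTemp, Sales) for the five rows shown by df.head().
rows = [(0, 28, 1), (0, 24, 0), (1, 26, 0), (0, 24, 0), (0, 23, 0)]

# Coefficients copied from results.params.
const, b_wss, b_temp = -15.203506, 2.442647, 0.544505

n_correct = 0
for wss, temp, sold in rows:
    log_odds = const + b_wss * wss + b_temp * temp
    predicted = 1 if log_odds >= 0 else 0  # same as probability >= 0.5
    n_correct += int(predicted == sold)

accuracy = n_correct / len(rows)
print(f"{n_correct} of {len(rows)} correct, accuracy = {accuracy}")
```

The one misclassified row here is the Wed/Sat/Sun day with a high temperature of 26, where the model predicts a sale but none was recorded. Accuracy measured on the data used for fitting tends to be optimistic compared with accuracy on unseen data.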


#### Odds ratio
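exp(regression coefficient) is the odds ratio: the factor by which the odds of Sales = 1, p / (1 - p), are multiplied when the corresponding variable increases by 1  
and the other variables are held fixed.  
Note that it is the odds that are multiplied by this factor, not the probability itself.  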

In [10]:
print(np.exp(results.params))

const        2.495752e-07
WedSatSun    1.150345e+01
HighTemp     1.723756e+00
dtype: float64
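
Because odds ratios multiply, a change of several units compounds: raising HighTemp by k multiplies the odds by exp(k × coefficient). A short check with the standard library, with coefficients copied from `results.params`; the 5-unit increase is an arbitrary illustration.

```python
import math

# Slopes copied from results.params.
coef = {"WedSatSun": 2.442647, "HighTemp": 0.544505}

# Odds ratio for a one-unit increase in each variable.
odds_ratio = {name: math.exp(b) for name, b in coef.items()}

# Odds multiplier for a k-unit increase in HighTemp: exp(k * b) == OR ** k.
k = 5
multiplier = math.exp(k * coef["HighTemp"])

print(odds_ratio)
print(f"HighTemp +{k}: odds multiplied by {multiplier:.2f}")
```

The very small value printed for const is the odds when both variables are 0, that is, at a high temperature of 0. That is far from the temperatures in the rows shown earlier, so it has little practical meaning on its own.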


#### Predict with the obtained model    
Use the obtained model to make predictions.  
NOTE: each row of X used for prediction must start with a 1 for the constant term, with the columns in the same order as `results.params.index`.  
(`sm.add_constant()` can also be used to add this column.)  

In [11]:
dfX_test = pd.DataFrame([[1, 1, 23],
                         [1, 0, 28],
                        ],
                        columns=results.params.index)  # example
print('X for prediction:')
display(dfX_test)
y_test = results.predict(dfX_test)
print('Predicted y:')
print(y_test >= 0.5)

X for prediction:


   const  WedSatSun  HighTemp
0      1          1        23
1      1          0        28


Predicted y:
0    False
1     True
dtype: bool
