### AI-04 Practice 2  
#### Sample notebook of logistic regression
An example procedure for logistic regression analysis.  

#### Import libraries  

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import preprocessing

#### Parameters  

In [2]:
csv_in = 'infection2.csv'

#### Read CSV file

In [3]:
df = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=0)
print(df.shape)
print(df.info())
display(df.head())

(100, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   factor1   100 non-null    float64
 1   factor2   100 non-null    float64
 2   factor3   100 non-null    float64
 3   infected  100 non-null    int64  
dtypes: float64(3), int64(1)
memory usage: 3.2 KB
None


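|   | factor1 | factor2 | factor3 | infected |
|---|---------|---------|---------|----------|
| 0 | 0.4 | -2.9 | 2.5 | 1 |
| 1 | 1.2 | -0.2 | 0.9 | 0 |
| 2 | 0.2 | -0.1 | 7.1 | 1 |
| 3 | 1.3 | -1.2 | -2.6 | 0 |
| 4 | 1.4 | -0.1 | -2.7 | 0 |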


**Ans1. 100**

#### Separate explanatory variables and objective variable  
Split the data into the explanatory variables (`factor1` to `factor3`) and the objective variable (`infected`).  

In [4]:
y = df['infected']
X = df.loc[:, 'factor1':'factor3']
print('X:', X.shape)
print('y:', y.shape)

X: (100, 3)
y: (100,)


#### Logistic regression (standardized)

In [5]:
# NOTE: preprocessing.scale() returns an ndarray, not a DataFrame,
# so the column names are restored by wrapping the result in a DataFrame.
X_scaled = preprocessing.scale(X)
dfX_scaled = pd.DataFrame(X_scaled, columns=X.columns)
dfX_scaled_c = sm.add_constant(dfX_scaled)
model = sm.Logit(y, dfX_scaled_c)
results = model.fit(disp=0)
print(results.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.291     
Dependent Variable: infected         AIC:              106.0484  
Date:               2020-04-28 22:14 BIC:              116.4690  
No. Observations:   100              Log-Likelihood:   -49.024   
Df Model:           3                LL-Null:          -69.135   
Df Residuals:       96               LLR p-value:      9.5661e-09
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
-------------------------------------------------------------------
            Coef.    Std.Err.      z      P>|z|     [0.025   0.975]
-------------------------------------------------------------------
const       0.1976     0.2498    0.7910   0.4289   -0.2920   0.6873
factor1    -0.2954     0.2550   -1.1583   0.2468   -0.7953   0.2045
factor2     0.0362     0.2523    0.1434   0.8860   -0.4583   0.5306
factor3     1.6796     0
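
As a rough illustration of what the standardization step does, the sketch below reproduces the column-wise z-score by hand with NumPy. The array `X_toy` is hypothetical toy data, not the contents of `infection2.csv`. With its default settings, `preprocessing.scale` centers each column to mean 0 and divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical toy data (NOT from infection2.csv): 4 rows, 2 features.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# Column-wise z-score using the population standard deviation (ddof=0),
# matching the default behavior of sklearn.preprocessing.scale.
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for every column
print(X_toy_scaled.std(axis=0))   # approximately 1 for every column
```

After this transformation every explanatory variable is on the same scale, which is what makes the magnitudes of the fitted coefficients comparable.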

#### Regression coefficients (standardized)  

In [6]:
print(results.params)

const      0.197619
factor1   -0.295404
factor2    0.036174
factor3    1.679615
dtype: float64


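**Contribution to the objective variable, ranked by the absolute value of the standardized coefficient (the sign gives the direction of the effect):**
**factor3 (positive) > factor1 (negative) > factor2 (positive)**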

**Ans2. factor3**
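
The ranking can be checked mechanically. The following minimal sketch copies the standardized coefficients from the output of In [6] (the intercept is left out because it is not an explanatory variable) and sorts them by magnitude. The names `std_coefs` and `ranking` are introduced here for illustration only.

```python
# Standardized coefficients copied from the output of In [6] (intercept omitted).
std_coefs = {'factor1': -0.295404, 'factor2': 0.036174, 'factor3': 1.679615}

# Rank by magnitude; the sign only indicates the direction of the effect.
ranking = sorted(std_coefs, key=lambda name: abs(std_coefs[name]), reverse=True)
print(ranking)  # ['factor3', 'factor1', 'factor2']
```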

#### Logistic regression  

In [7]:
X_c = sm.add_constant(X)
model = sm.Logit(y, X_c)
results = model.fit(disp=0)  # disp=0 suppresses the convergence message; disp=1 prints it
print(results.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.291     
Dependent Variable: infected         AIC:              106.0484  
Date:               2020-04-28 22:14 BIC:              116.4690  
No. Observations:   100              Log-Likelihood:   -49.024   
Df Model:           3                LL-Null:          -69.135   
Df Residuals:       96               LLR p-value:      9.5661e-09
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
-------------------------------------------------------------------
            Coef.    Std.Err.      z      P>|z|     [0.025   0.975]
-------------------------------------------------------------------
const      -0.3367     0.2737   -1.2301   0.2187   -0.8731   0.1998
factor1    -0.2914     0.2516   -1.1583   0.2468   -0.7846   0.2017
factor2     0.0182     0.1267    0.1434   0.8860   -0.2302   0.2666
factor3     0.5371     0

#### Regression coefficients  

In [8]:
print(results.params)

const     -0.336659
factor1   -0.291447
factor2    0.018174
factor3    0.537089
dtype: float64
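
The two fits describe the same model on different scales: because each standardized column is the raw column shifted and divided by its standard deviation, a standardized slope equals the raw slope multiplied by that standard deviation. The sketch below uses only the coefficients printed in In [6] and In [8] to back out the standard deviation implied for each factor. The variable names are illustrative, and the result is an inference from the printed coefficients, not a value computed from the CSV.

```python
import numpy as np

# Slopes for factor1, factor2, factor3 copied from the printed outputs.
raw_coefs = np.array([-0.291447, 0.018174, 0.537089])  # In [8], unstandardized
std_coefs = np.array([-0.295404, 0.036174, 1.679615])  # In [6], standardized

# standardized slope = raw slope * column standard deviation,
# so the ratio recovers the standard deviation implied for each column.
implied_sd = std_coefs / raw_coefs
print(implied_sd.round(2))
```

This also explains why the z statistics and p-values of the slopes are identical in the two summaries: rescaling a variable changes its coefficient and standard error by the same factor.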


**Ans3. Obtained model, where $p$ is the predicted probability that infected = 1:**
$\ln(\frac{p}{1-p}) = (-0.29) * \rm{factor1} + 0.02 * \rm{factor2} + 0.54 * \rm{factor3} - 0.34$
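
To see how the fitted log-odds model turns into a prediction, the sketch below evaluates it by hand for the first row shown by `df.head()` (factor1 = 0.4, factor2 = -2.9, factor3 = 2.5, taken as displayed). The coefficients are copied from the output of In [8]; the variable names are illustrative.

```python
import math

# Coefficients copied from the output of In [8].
const, b1, b2, b3 = -0.336659, -0.291447, 0.018174, 0.537089

# First row as displayed by df.head().
f1, f2, f3 = 0.4, -2.9, 2.5

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds = const + b1 * f1 + b2 * f2 + b3 * f3
prob = 1.0 / (1.0 + math.exp(-log_odds))

print(round(log_odds, 3), round(prob, 3))
```

The resulting probability is above 0.5, so this row is classified as infected, which agrees with its recorded value of 1.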

#### Ratio of correct answers  

In [9]:
n_data = y.shape[0]
n_correct_pred = ((results.predict(X_c) >= 0.5) == y).sum()
print(n_correct_pred / n_data)

0.77


**Ans4. 0.77**  
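
The ratio of correct answers is the share of rows whose thresholded prediction matches the observed label. A minimal sketch of the same calculation on hypothetical toy arrays (not the notebook's data) is shown below; `prob_toy` and `y_toy` are made-up names and values.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (NOT the notebook's data).
prob_toy = np.array([0.91, 0.40, 0.62, 0.08, 0.55])
y_toy = np.array([1, 0, 0, 0, 1])

# Classify as 1 when the predicted probability is at least 0.5.
pred_toy = (prob_toy >= 0.5).astype(int)

# Accuracy = number of matches / number of rows.
accuracy = (pred_toy == y_toy).mean()
print(accuracy)  # 0.8
```

Note that this accuracy is measured on the same data used for fitting, so it may be optimistic compared with accuracy on unseen data.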

#### Odds ratio
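The odds ratio exp(regression coefficient) shows how many times the odds of the objective variable, p / (1 - p) with p the probability that infected = 1, are multiplied when the corresponding explanatory variable increases by 1 and the other variables are held constant.  
Note that it is the odds, not the probability itself, that change by this factor.  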

In [10]:
print(np.exp(results.params))

const      0.714152
factor1    0.747182
factor2    1.018341
factor3    1.711019
dtype: float64


**Ans5. 1.02**  
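
The interpretation of exp(coefficient) can be verified numerically. The sketch below uses the coefficients printed in In [8], computes the odds at one point, increases factor3 by 1 while keeping factor1 and factor2 fixed, and compares the ratio of the two odds with exp(coefficient of factor3). The helper `odds` and the chosen input values are illustrative assumptions.

```python
import math

# Coefficients copied from the output of In [8].
const, b1, b2, b3 = -0.336659, -0.291447, 0.018174, 0.537089

def odds(f1, f2, f3):
    """Odds p / (1 - p) implied by the fitted model, i.e. exp(log-odds)."""
    return math.exp(const + b1 * f1 + b2 * f2 + b3 * f3)

base = odds(0.4, -2.9, 2.5)    # arbitrary starting point
bumped = odds(0.4, -2.9, 3.5)  # factor3 increased by 1, others unchanged

ratio = bumped / base
print(round(ratio, 3), round(math.exp(b3), 3))  # both values agree
```

The same ratio is obtained from any starting point, which is why a single odds ratio per variable summarizes its effect.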