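## Classification with the Diabetes Dataset

**Logistic regression** => Classification
- Dataset: diabetes.csv
- Target: whether diabetes develops
- Requirement: report the result as "The probability of developing diabetes is XX%."

### [0] Loading Modules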

In [106]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### [1] Loading Data

In [55]:
# Load the data
FILE='./diabetes_data/diabetes.csv'
df=pd.read_csv(FILE)

In [57]:
# Inspect the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [60]:
df.head(5)

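| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |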


In [61]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [108]:
# Check for missing values (isnull() does not detect outliers or zero placeholders)
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
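
`isnull()` reports no missing values, but `describe()` shows a minimum of 0 in columns where 0 is unlikely to be a real measurement. One way to surface such placeholders, sketched here on a hypothetical miniature frame (`toy` and its values are made up for illustration), is to mask zeros as `NaN` before counting:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Treat 0 as a missing-value placeholder so that isnull() can see it
masked = toy.replace(0, np.nan)
zero_counts = masked.isnull().sum()
print(zero_counts)
```

The same masking would not be appropriate for Pregnancies, where 0 is a legitimate value.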

### [2] Preparing the Data

In [79]:
# (1) No null values
# (2) Shape check (768 rows x 9 columns)
# (3) Columns 0-7 are the features, column 8 is the target

In [80]:
# (4) Several columns have a minimum of 0 (columns 0-5)
# - Apart from Pregnancies, a value of 0 in the other five columns is not plausible

rep = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in rep:
    print('Rows where {} is 0: {}'.format(col, (df[col] == 0).sum()))

# - Nearly half of the Insulin values are 0, and SkinThickness has more than 200,
#   so dropping these rows is not an option
# - Replace the zeros with the median instead
# - Caution: np.median(df[rep]) pools all five columns into one scalar (45.0 for this data),
#   so every zero is replaced with 45.0, not with the median of its own column
df[rep] = df[rep].replace(0, np.median(df[rep]))

Rows where Glucose is 0: 5
Rows where BloodPressure is 0: 35
Rows where SkinThickness is 0: 227
Rows where Insulin is 0: 374
Rows where BMI is 0: 11
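
The replacement above uses one pooled median for all five columns. If the intent is to fill each column with its own median, a minimal sketch could look like the following. The frame `toy` and its values are hypothetical, and applying this to the real data would change every result later in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, for illustration only
toy = pd.DataFrame({
    "Glucose": [100, 0, 120, 140],
    "Insulin": [0, 0, 90, 110],
})
cols = ["Glucose", "Insulin"]

# Pooled median over every cell, zeros included: a single scalar
global_median = float(np.median(toy[cols].to_numpy()))

# Mask zeros as NaN so they do not drag each column's median down
masked = toy[cols].replace(0, np.nan)
# DataFrame.median() works column by column and skips NaN
col_medians = masked.median()

filled = toy.copy()
filled[cols] = masked.fillna(col_medians)

print("pooled median:", global_median)
print(col_medians)
print(filled)
```

Computing the medians on the training split only, after `train_test_split`, would also keep test-set information out of the imputation.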


In [81]:
# Check the values after the replacement
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.1875 | 71.15625 | 33.83724 | 101.713542 | 32.637109 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.05534 | 13.38027 | 11.385657 | 101.418563 | 7.035022 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 25.0 | 45.0 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 35.0 | 45.0 | 32.4 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 45.0 | 127.25 | 36.825 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [97]:
# Select the features (first 8 columns) and the target (last column)
X = df[df.columns[0:8]]
y = list(df[df.columns[-1]])

train_input, test_input, train_target, test_target = train_test_split(X,
                                                                      y,
                                                                      test_size = 0.2,
                                                                      random_state=42)
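
About 35% of the rows are positive (the mean of Outcome is 0.349), so a plain random split can leave the two sets with slightly different class ratios. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`), not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 25% positive
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive ratio, train:", y_tr.mean())
print("positive ratio, test :", y_te.mean())
```

Switching the notebook to a stratified split would select different rows, so the scores and coefficients reported below would no longer apply.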

### [3] Creating the Model

In [100]:
model = LogisticRegression()

### [4] Training the Model

In [101]:
model.fit(train_input, train_target)

LogisticRegression()

In [102]:
# Model parameters
print('Classes: ', model.classes_)
print('Coefficients: ', model.coef_)
print('Intercept: ', model.intercept_)
print('Number of features: ', model.n_features_in_)

Classes:  [0 1]
Coefficients:  [[ 0.03752674  0.03170197 -0.0186442   0.0082625  -0.00146812  0.08024566
   0.81718359  0.03039383]]
Intercept:  [-7.57954856]
Number of features:  8
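
Each coefficient is a change in log-odds per unit of its feature, so exponentiating it gives an odds ratio. The sketch below copies the rounded coefficients printed above and pairs them with the feature columns in their original order. Because the features are on very different scales, the ratios are not directly comparable with one another:

```python
import math

# Feature order follows df.columns[0:8]; coefficients are the rounded printed values
features = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
coefs = [
    0.03752674, 0.03170197, -0.0186442, 0.0082625,
    -0.00146812, 0.08024566, 0.81718359, 0.03039383,
]

# exp(w) = multiplicative change in the odds for a one-unit increase in the feature
odds_ratios = {name: math.exp(w) for name, w in zip(features, coefs)}

for name, ratio in odds_ratios.items():
    print(f"{name:<26} {ratio:.4f}")
```

A ratio above 1 means the odds of class 1 rise as the feature increases, and a ratio below 1 means they fall, with the other features held fixed.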


### [5] Evaluating the Model

In [105]:
print('Train: ', model.score(train_input, train_target))
print('Test : ', model.score(test_input, test_target))

Train:  0.7638436482084691
Test :  0.7792207792207793
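
`score()` returns accuracy only. Since roughly 65% of all rows are negative, always predicting 0 would already score about 0.65 on the full dataset, so it can help to look at the confusion matrix, precision, and recall as well. The labels below are hypothetical and only show how the `sklearn.metrics` calls fit together:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print("precision:", precision)
print("recall   :", recall)
```

In the notebook, the same calls would presumably take `test_target` and `model.predict(test_input)` as arguments.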


### [6] Prediction

In [104]:
# One new sample, in the same feature order as the training columns
sample = [[3, 101, 40, 45, 78, 26.9, 0.442, 34]]
preY = model.predict(sample)
probaY = model.predict_proba(sample)
print('Prediction: ', preY)
print('Probabilities: ', probaY)
print('Result: The probability of developing diabetes is {}%.'.format(np.round(probaY[0][1] * 100, 3)))

Prediction:  [0]
Probabilities:  [[0.76858615 0.23141385]]
Result: The probability of developing diabetes is 23.141%.
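
The reported probability can be reproduced by hand from the fitted parameters: binary logistic regression computes z = intercept + w · x and passes it through the sigmoid function. The sketch below copies the rounded coefficients and intercept printed in section [4], so the result agrees with `predict_proba` only up to rounding:

```python
import math

# Rounded parameters copied from the model printout
coefs = [
    0.03752674, 0.03170197, -0.0186442, 0.0082625,
    -0.00146812, 0.08024566, 0.81718359, 0.03039383,
]
intercept = -7.57954856

# The sample passed to predict_proba
x = [3, 101, 40, 45, 78, 26.9, 0.442, 34]

# Linear score (log-odds), then the sigmoid to get P(Outcome = 1)
z = intercept + sum(w * xi for w, xi in zip(coefs, x))
p_positive = 1.0 / (1.0 + math.exp(-z))

print("z =", round(z, 4))
print("P(diabetes) =", round(p_positive, 4))
```

Because z is negative, the probability falls below 0.5, which is why `predict` returns class 0 for this sample.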
