# Parkinsons Data Set

### This dataset is composed of a range of biomedical voice measurements from 31 people, <br>23 with Parkinson's disease (PD).

https://archive.ics.uci.edu/ml/datasets/parkinsons

![image.png](../Images/Parkinsons.png)

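Using this dataset, build a logistic regression model that predicts Parkinson's disease, and select the three variables that most influence the prediction, in order of importance.<br> Then compare and interpret the model's F1-Score when the threshold (or cutoff) for diagnosing Parkinson's disease is 0.5 versus 0.8.<br>(Answer under the following conditions.)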
    
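- Analysis conditions
 + Remove the unnecessary 'name' column
 + Add a constant term for the logistic regression
 + Split the data into training and test sets at a 9:1 ratio
 + Use 'bfgs' as the model's optimization method
 + Normalize the data with a min-max scaler
 + Convert Status to the category type
 + Use logistic regression as the model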

### Library & Data Import

In [1]:
import numpy as np
import pandas as pd

In [2]:
parkinsons = pd.read_csv('../Datasets/Parkinsons.csv')

parkinsons.shape

(195, 24)

In [3]:
parkinsons.head(10)

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335
5,phon_R01_S01_6,120.552,131.162,113.787,0.00968,8e-05,0.00463,0.0075,0.01388,0.04701,...,0.06985,0.01222,21.378,1,0.415564,0.825069,-4.242867,0.299111,2.18756,0.357775
6,phon_R01_S02_1,120.267,137.244,114.82,0.00333,3e-05,0.00155,0.00202,0.00466,0.01608,...,0.02337,0.00607,24.886,1,0.59604,0.764112,-5.634322,0.257682,1.854785,0.211756
7,phon_R01_S02_2,107.332,113.84,104.315,0.0029,3e-05,0.00144,0.00182,0.00431,0.01567,...,0.02487,0.00344,26.892,1,0.63742,0.763262,-6.167603,0.183721,2.064693,0.163755
8,phon_R01_S02_3,95.73,132.068,91.754,0.00551,6e-05,0.00293,0.00332,0.0088,0.02093,...,0.03218,0.0107,21.812,1,0.615551,0.773587,-5.498678,0.327769,2.322511,0.231571
9,phon_R01_S02_4,95.056,120.103,91.226,0.00532,6e-05,0.00268,0.00332,0.00803,0.02838,...,0.04324,0.01022,21.862,1,0.547037,0.798463,-5.011879,0.325996,2.432792,0.271362


In [4]:
# Drop the 'name' column
dat_processing = parkinsons.drop(['name'], axis=1, inplace=False)

dat_processing.head(10)

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,0.482,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335
5,120.552,131.162,113.787,0.00968,8e-05,0.00463,0.0075,0.01388,0.04701,0.456,...,0.06985,0.01222,21.378,1,0.415564,0.825069,-4.242867,0.299111,2.18756,0.357775
6,120.267,137.244,114.82,0.00333,3e-05,0.00155,0.00202,0.00466,0.01608,0.14,...,0.02337,0.00607,24.886,1,0.59604,0.764112,-5.634322,0.257682,1.854785,0.211756
7,107.332,113.84,104.315,0.0029,3e-05,0.00144,0.00182,0.00431,0.01567,0.134,...,0.02487,0.00344,26.892,1,0.63742,0.763262,-6.167603,0.183721,2.064693,0.163755
8,95.73,132.068,91.754,0.00551,6e-05,0.00293,0.00332,0.0088,0.02093,0.191,...,0.03218,0.0107,21.812,1,0.615551,0.773587,-5.498678,0.327769,2.322511,0.231571
9,95.056,120.103,91.226,0.00532,6e-05,0.00268,0.00332,0.00803,0.02838,0.255,...,0.04324,0.01022,21.862,1,0.547037,0.798463,-5.011879,0.325996,2.432792,0.271362


In [5]:
# Normalization
import sklearn.preprocessing as preprocessing

dat_processing_norm = preprocessing.minmax_scale(dat_processing)
dat_processed = pd.DataFrame(dat_processing_norm)
dat_processed.columns = dat_processing.columns

dat_processed

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,0.184308,0.112592,0.054815,0.195680,0.249012,0.145472,0.247588,0.145288,0.312215,0.280197,...,0.332584,0.068307,0.511745,1.0,0.369155,0.960148,0.569875,0.585765,0.390661,0.497310
1,0.198327,0.094930,0.278323,0.254130,0.288538,0.191233,0.323687,0.191042,0.472887,0.444536,...,0.516048,0.059331,0.432577,1.0,0.470830,0.977024,0.703277,0.741337,0.473145,0.671326
2,0.165039,0.059128,0.265288,0.280178,0.328063,0.229287,0.369239,0.229411,0.390634,0.326212,...,0.443317,0.039596,0.496220,1.0,0.404416,1.000000,0.636745,0.686371,0.408819,0.596682
3,0.165004,0.072927,0.264200,0.263342,0.328063,0.209056,0.324759,0.208862,0.414278,0.354971,...,0.475478,0.040997,0.495936,1.0,0.416255,0.975885,0.695627,0.738089,0.436977,0.671949
4,0.161150,0.080909,0.260107,0.354511,0.407115,0.282755,0.437299,0.282870,0.499452,0.410025,...,0.584542,0.054174,0.455499,1.0,0.375159,0.992813,0.762472,0.513798,0.404336,0.757611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,0.499820,0.262986,0.165722,0.092440,0.090909,0.093931,0.089496,0.094076,0.286014,0.262942,...,0.362306,0.085909,0.450134,0.0,0.447684,0.333127,0.257894,0.260408,0.549049,0.183318
191,0.705488,0.307974,0.138243,0.125794,0.090909,0.126686,0.107181,0.126826,0.164050,0.146261,...,0.221338,0.055543,0.435097,0.0,0.408567,0.434101,0.319956,0.276956,0.605474,0.257558
192,0.502730,0.281413,0.050727,0.378653,0.288538,0.267823,0.252947,0.267940,0.123608,0.140509,...,0.156631,0.338988,0.383728,0.0,0.352318,0.324299,0.212945,0.342577,0.558967,0.180580
193,0.642893,0.601807,0.054279,0.181703,0.130435,0.145472,0.159700,0.145288,0.122512,0.128184,...,0.155989,0.227838,0.429936,0.0,0.454176,0.277579,0.220650,0.452885,0.318222,0.163137
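
For reference, min-max scaling maps each column to [0, 1] via (x - min) / (max - min). The sketch below reproduces that formula with NumPy on a tiny made-up array. The values and the helper name `min_max` are illustrative and not taken from the dataset; the aim is to show what `minmax_scale` computes per column, not to replace it.

```python
import numpy as np

def min_max(values: np.ndarray) -> np.ndarray:
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (values - col_min) / col_range

# Hypothetical toy data: two columns on very different scales.
toy = np.array([[100.0, 0.002],
                [150.0, 0.004],
                [200.0, 0.010]])
scaled = min_max(toy)
print(scaled)
```

After scaling, both columns span the same range, so coefficient magnitudes in the later regression are not driven by the original measurement units.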


In [6]:
# Add a constant term with sm.add_constant()
import statsmodels.api as sm
from sklearn import metrics

dat_processed = sm.add_constant(dat_processed, has_constant='add')
dat_processed.head(10)

Unnamed: 0,const,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,1.0,0.184308,0.112592,0.054815,0.19568,0.249012,0.145472,0.247588,0.145288,0.312215,...,0.332584,0.068307,0.511745,1.0,0.369155,0.960148,0.569875,0.585765,0.390661,0.49731
1,1.0,0.198327,0.09493,0.278323,0.25413,0.288538,0.191233,0.323687,0.191042,0.472887,...,0.516048,0.059331,0.432577,1.0,0.47083,0.977024,0.703277,0.741337,0.473145,0.671326
2,1.0,0.165039,0.059128,0.265288,0.280178,0.328063,0.229287,0.369239,0.229411,0.390634,...,0.443317,0.039596,0.49622,1.0,0.404416,1.0,0.636745,0.686371,0.408819,0.596682
3,1.0,0.165004,0.072927,0.2642,0.263342,0.328063,0.209056,0.324759,0.208862,0.414278,...,0.475478,0.040997,0.495936,1.0,0.416255,0.975885,0.695627,0.738089,0.436977,0.671949
4,1.0,0.16115,0.080909,0.260107,0.354511,0.407115,0.282755,0.437299,0.28287,0.499452,...,0.584542,0.054174,0.455499,1.0,0.375159,0.992813,0.762472,0.513798,0.404336,0.757611
5,1.0,0.187568,0.059232,0.278139,0.25413,0.288538,0.19027,0.352626,0.190079,0.342067,...,0.360829,0.036827,0.525766,1.0,0.370978,0.999128,0.672961,0.659218,0.339999,0.648753
6,1.0,0.185909,0.071647,0.284086,0.052414,0.090909,0.041908,0.05895,0.042061,0.059704,...,0.06246,0.017252,0.668333,1.0,0.792079,0.756277,0.421385,0.565955,0.191959,0.346328
7,1.0,0.110606,0.023873,0.223606,0.038755,0.090909,0.036609,0.048232,0.036442,0.055961,...,0.072089,0.008881,0.749858,1.0,0.88863,0.75289,0.324968,0.399458,0.28534,0.246912
8,1.0,0.043063,0.061082,0.151289,0.121665,0.209486,0.108382,0.128617,0.108525,0.10398,...,0.119014,0.031989,0.543404,1.0,0.837604,0.794025,0.44591,0.723731,0.400034,0.387368
9,1.0,0.039139,0.036658,0.148249,0.115629,0.209486,0.096339,0.128617,0.096163,0.171992,...,0.190012,0.030461,0.545436,1.0,0.677741,0.89313,0.533923,0.71974,0.449094,0.46978


In [7]:
feature_columns = list(dat_processed.columns.difference(['status']))

feature_columns

['D2',
 'DFA',
 'HNR',
 'Jitter:DDP',
 'MDVP:APQ',
 'MDVP:Fhi(Hz)',
 'MDVP:Flo(Hz)',
 'MDVP:Fo(Hz)',
 'MDVP:Jitter(%)',
 'MDVP:Jitter(Abs)',
 'MDVP:PPQ',
 'MDVP:RAP',
 'MDVP:Shimmer',
 'MDVP:Shimmer(dB)',
 'NHR',
 'PPE',
 'RPDE',
 'Shimmer:APQ3',
 'Shimmer:APQ5',
 'Shimmer:DDA',
 'const',
 'spread1',
 'spread2']

In [8]:
X = dat_processed[feature_columns]
y = dat_processed['status'].astype('category') # disease status: 1 or 0

In [9]:
X

Unnamed: 0,D2,DFA,HNR,Jitter:DDP,MDVP:APQ,MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Fo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),...,MDVP:Shimmer(dB),NHR,PPE,RPDE,Shimmer:APQ3,Shimmer:APQ5,Shimmer:DDA,const,spread1,spread2
0,0.390661,0.960148,0.511745,0.145288,0.172448,0.112592,0.054815,0.184308,0.195680,0.249012,...,0.280197,0.068307,0.497310,0.369155,0.332627,0.347354,0.332584,1.0,0.569875,0.585765
1,0.473145,0.977024,0.432577,0.191042,0.279424,0.094930,0.278323,0.198327,0.254130,0.288538,...,0.444536,0.059331,0.671326,0.470830,0.515986,0.535685,0.516048,1.0,0.703277,0.741337
2,0.408819,1.000000,0.496220,0.229411,0.219848,0.059128,0.265288,0.165039,0.280178,0.328063,...,0.326212,0.039596,0.596682,0.404416,0.443374,0.446133,0.443317,1.0,0.636745,0.686371
3,0.436977,0.975885,0.495936,0.208862,0.233785,0.072927,0.264200,0.165004,0.263342,0.328063,...,0.354971,0.040997,0.671949,0.416255,0.475539,0.466079,0.475478,1.0,0.695627,0.738089
4,0.404336,0.992813,0.455499,0.282870,0.286852,0.080909,0.260107,0.161150,0.354511,0.407115,...,0.410025,0.054174,0.757611,0.375159,0.584553,0.577341,0.584542,1.0,0.762472,0.513798
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,0.549049,0.333127,0.450134,0.094076,0.155142,0.262986,0.165722,0.499820,0.092440,0.090909,...,0.262942,0.085909,0.183318,0.447684,0.362288,0.261601,0.362306,1.0,0.257894,0.260408
191,0.605474,0.434101,0.435097,0.126826,0.088828,0.307974,0.138243,0.705488,0.125794,0.090909,...,0.146261,0.055543,0.257558,0.408567,0.221302,0.147490,0.221338,1.0,0.319956,0.276956
192,0.558967,0.324299,0.383728,0.267940,0.072594,0.281413,0.050727,0.502730,0.378653,0.288538,...,0.140509,0.338988,0.180580,0.352318,0.156587,0.107870,0.156631,1.0,0.212945,0.342577
193,0.318222,0.277579,0.429936,0.145288,0.066544,0.601807,0.054279,0.642893,0.181703,0.130435,...,0.128184,0.227838,0.163137,0.454176,0.156009,0.101900,0.155989,1.0,0.220650,0.452885


In [10]:
y.to_frame()

Unnamed: 0,status
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
...,...
190,0.0
191,0.0
192,0.0
193,0.0


In [11]:
from sklearn.model_selection import train_test_split

# Use train_test_split to divide the data 9:1 into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify=y, 
                                                    train_size=0.9, 
                                                    test_size=0.1, 
                                                    random_state=1234)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(175, 23) (20, 23) (175,) (20,)


In [12]:
model = sm.Logit(y_train, X_train)

In [13]:
results = model.fit(method='bfgs', maxiter=1000)

results.summary()

Optimization terminated successfully.
         Current function value: 0.228587
         Iterations: 338
         Function evaluations: 339
         Gradient evaluations: 339


Dep. Variable:,status,No. Observations:,175.0
Model:,Logit,Df Residuals:,152.0
Method:,MLE,Df Model:,22.0
Date:,"Tue, 22 Nov 2022",Pseudo R-squ.:,0.59
Time:,15:00:20,Log-Likelihood:,-40.003
converged:,True,LL-Null:,-97.576
Covariance Type:,nonrobust,LLR p-value:,1.317e-14

,coef,std err,z,P>|z|,[0.025,0.975]
D2,3.2745,3.270,1.001,0.317,-3.135,9.684
DFA,1.8174,2.392,0.760,0.447,-2.871,6.505
HNR,1.7236,5.949,0.290,0.772,-9.936,13.383
Jitter:DDP,37.3472,2700.240,0.014,0.989,-5255.025,5329.720
MDVP:APQ,33.0412,54.123,0.610,0.542,-73.038,139.121
MDVP:Fhi(Hz),-1.7137,1.837,-0.933,0.351,-5.315,1.888
MDVP:Flo(Hz),0.2092,2.209,0.095,0.925,-4.120,4.539
MDVP:Fo(Hz),-0.9928,3.952,-0.251,0.802,-8.738,6.752
MDVP:Jitter(%),-45.2445,37.651,-1.202,0.229,-119.039,28.550
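
The task asks for the three most influential variables, but the summary above does not rank them. One possible reading, assumed here, is to rank predictors by the absolute value of their z statistic (equivalently, by smallest p-value), leaving out the constant. The helper name `top_k_by_abs_z` is hypothetical, and the dictionary holds only the nine rows visible in the truncated table above, so the result illustrates the procedure rather than answering the question for the full model.

```python
def top_k_by_abs_z(z_values: dict, k: int = 3, exclude=("const",)) -> list:
    """Return the k variable names with the largest |z|, skipping excluded names."""
    candidates = {name: z for name, z in z_values.items() if name not in exclude}
    ranked = sorted(candidates, key=lambda name: abs(candidates[name]), reverse=True)
    return ranked[:k]

# z statistics copied from the rows shown in the (truncated) summary table.
shown_z = {
    "D2": 1.001,
    "DFA": 0.760,
    "HNR": 0.290,
    "Jitter:DDP": 0.014,
    "MDVP:APQ": 0.610,
    "MDVP:Fhi(Hz)": -0.933,
    "MDVP:Flo(Hz)": 0.095,
    "MDVP:Fo(Hz)": -0.251,
    "MDVP:Jitter(%)": -1.202,
}
top3_shown = top_k_by_abs_z(shown_z)
print(top3_shown)
```

With the fitted model, the full set of z statistics should be available as `results.tvalues`, so `top_k_by_abs_z(results.tvalues.to_dict())` would rank every predictor. None of the visible rows is significant at the 5% level, so any such ranking should be read cautiously.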


In [14]:
# Define the cut-off: probabilities above the threshold become 1, all others 0
def cut_off(y, threshold):
    return (y > threshold).astype(int)

In [15]:
y_test_pred_prob = results.predict(X_test)

y_test_pred_prob

165    0.201295
108    0.770492
71     0.999465
16     0.879013
96     0.985982
69     0.847691
162    0.999667
133    0.608867
55     0.867729
61     0.126868
113    0.954568
18     1.000000
194    0.550681
94     0.920147
59     0.546519
185    0.961890
56     0.943411
48     0.122616
110    0.996505
109    0.896972
dtype: float64
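
For a logit model, predicted probabilities such as these are the logistic (sigmoid) function applied to the linear predictor Xβ. The sketch below shows that mapping with NumPy; the coefficient and feature values are made up for illustration and are not the fitted ones.

```python
import numpy as np

def predict_proba(X: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-x . beta))."""
    linear_predictor = X @ beta
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical example: a constant column plus one scaled feature.
X_toy = np.array([[1.0, 0.0],
                  [1.0, 0.5],
                  [1.0, 1.0]])
beta_toy = np.array([-1.0, 2.0])  # made-up intercept and slope
probs = predict_proba(X_toy, beta_toy)
print(probs)
```

A linear predictor of zero corresponds to a probability of exactly 0.5, which is why 0.5 is the conventional default cut-off.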

In [16]:
y_test_pred = cut_off(y_test_pred_prob, 0.8)

y_test_pred

165    0
108    0
71     1
16     1
96     1
69     1
162    1
133    0
55     1
61     0
113    1
18     1
194    0
94     1
59     0
185    1
56     1
48     0
110    1
109    1
dtype: int64

In [17]:
from sklearn.metrics import f1_score

f1_score(y_test, y_test_pred)

0.8571428571428571
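
Only the 0.8 cut-off is evaluated above, while the comparison also needs 0.5. A minimal sketch of evaluating several cut-offs in one pass follows. It computes F1 directly from its definition, 2·TP / (2·TP + FP + FN), and uses made-up labels and probabilities; the helper name `f1_at` is hypothetical.

```python
import numpy as np

def f1_at(y_true, y_prob, threshold: float) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) after labelling y_prob > threshold as positive."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data for illustration only (not the notebook's test set).
toy_true = [1, 1, 1, 0, 0, 1]
toy_prob = [0.90, 0.60, 0.85, 0.70, 0.20, 0.55]
scores = {t: f1_at(toy_true, toy_prob, t) for t in (0.5, 0.8)}
print(scores)
```

In the notebook itself, the same comparison could be made by calling `f1_score(y_test, cut_off(y_test_pred_prob, t))` once for each threshold `t`.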

### Insight

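With a cut-off of 0.5 the F1-score is 0.9375; with a cut-off of 0.8 it is 0.8571. Raising the cut-off makes the model more conservative: the four test samples whose predicted probabilities fall between 0.5 and 0.8 (indices 108, 133, 194, and 59) are no longer classified as Parkinson's, and the lower F1 suggests that the true cases missed this way outweigh any false positives avoided.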