<h1> 94% Accuracy in Predicting Parkinson's Disease through Ensemble Method 
    
(Stacking Classifier) </h1>

In [1]:
import pandas as pd
import numpy as np

from collections import Counter

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import SVC

from sklearn.ensemble import StackingClassifier

seed=42

In [2]:
df = pd.read_csv('../input/parkinsons-disease-data-set/parkinsons.data')
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


# Exploring the Data

In [3]:
df.shape

(195, 24)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [5]:
print(Counter(df['status']))

Counter({1: 147, 0: 48})


In [6]:
df.isnull().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

# Data preprocessing

In [7]:
X = df.drop(['name','status'],axis=1)
y = df['status']

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=seed)
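
The classes are imbalanced (147 positive vs. 48 negative recordings), so a plain random split can leave the test set with few healthy samples. Below is a minimal sketch of how `stratify` keeps the class ratio in both splits. It runs on synthetic labels with the same class counts, not on the real data; `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df['status']: same class counts as the real data.
y_demo = np.array([1] * 147 + [0] * 48)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the 147:48 class ratio in both the train and test split.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", train_counts)
print("test: ", test_counts)
```

Whether stratification changes the scores reported in this notebook is not something this sketch shows; it only illustrates the option.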

# Models

Here I have used three base models:
* Logistic Regression
* Support Vector Classifier
* Extreme Gradient Boosting

I tuned the hyperparameters of each model with GridSearchCV.

Then I used StackingClassifier to stack the tuned models into an ensemble.
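
The tune-then-stack workflow described above can be sketched end to end on synthetic data. This is only an illustration under assumptions: it uses `make_classification` instead of the voice measurements, and a decision tree stands in for the second base model so the example depends on scikit-learn alone. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real features (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Step 1: tune each base model separately with cross-validated grid search.
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=5)
tree_search = GridSearchCV(DecisionTreeClassifier(random_state=42), {"max_depth": [2, 4]}, cv=5)
lr_best = lr_search.fit(X_tr, y_tr).best_estimator_
tree_best = tree_search.fit(X_tr, y_tr).best_estimator_

# Step 2: stack the tuned models. The final estimator is trained on the
# cross-validated predictions of the base models.
stack_demo = StackingClassifier(
    estimators=[("lr", lr_best), ("tree", tree_best)],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack_demo.fit(X_tr, y_tr)

demo_acc = stack_demo.score(X_te, y_te)
print(f"stacked accuracy on synthetic data: {demo_acc:.3f}")
```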


# Logistic Regression

In [9]:
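# Candidate values for the penalty type and the inverse regularization strength C.
# Note: the default 'lbfgs' solver does not support 'l1', so those candidates
# fail to fit and score NaN; only the 'l2' candidates can be selected.
parameters = {'penalty': ['l1', 'l2'],
              'C': [0.1, 0.4, 0.8, 1, 2, 5, 10, 20, 30]}

grid_search = GridSearchCV(estimator=LogisticRegression(), param_grid=parameters, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

# Keep the best model (refit on the whole training set) for the stacking step.
log_reg = grid_search.best_estimator_

grid_search.best_params_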

Fitting 10 folds for each of 18 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    3.7s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


{'C': 0.8, 'penalty': 'l2'}
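
The convergence warning above suggests scaling the data, and the `'l1'` candidates need a solver that supports them. Below is a minimal sketch of one way to address both, assuming a `Pipeline` with `StandardScaler` and the `liblinear` solver. It runs on synthetic data, is not the configuration used for the results in this notebook, and the names `pipe` and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for x_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=42)

# Scaling inside the pipeline means each CV fold is scaled using only
# its own training portion, which avoids leaking validation statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),  # supports both l1 and l2
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__penalty": ["l1", "l2"], "clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```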

In [10]:
y_pred=log_reg.predict(x_test)

print("\n",confusion_matrix(y_test,y_pred))
log_reg_acc = accuracy_score(y_test,y_pred)

print("\nAccuracy Score {}".format(log_reg_acc))
print("Classification report: \n{}".format(classification_report(y_test,y_pred)))


 [[ 3  4]
 [ 0 32]]

Accuracy Score 0.8974358974358975
Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.43      0.60         7
           1       0.89      1.00      0.94        32

    accuracy                           0.90        39
   macro avg       0.94      0.71      0.77        39
weighted avg       0.91      0.90      0.88        39



# Support Vector Classifier

In [11]:
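from sklearn.svm import LinearSVC

svc = LinearSVC()

# Grid over penalty type, iteration cap, and regularization strength C.
# Note: assuming the defaults loss='squared_hinge' and dual=True, penalty='l1'
# is not supported, so those candidates fail to fit and score NaN.
parameters = {
    'penalty': ['l1', 'l2'],
    'max_iter': [10, 20, 50, 100, 1000],
    'C': [0.1, 0.4, 0.8, 1, 2, 5, 10, 20, 30],
}

grid_search = GridSearchCV(estimator=svc, param_grid=parameters, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

# Keep the best model for the stacking step.
svc = grid_search.best_estimator_

grid_search.best_params_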

Fitting 10 folds for each of 90 candidates, totalling 900 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 900 out of 900 | elapsed:    2.6s finished


{'C': 30, 'max_iter': 1000, 'penalty': 'l2'}

In [12]:
y_pred=svc.predict(x_test)

print("\n",confusion_matrix(y_test,y_pred))
svc_acc = accuracy_score(y_test,y_pred)
print("\nAccuracy Score {}".format(svc_acc))
print("Classification report: \n{}".format(classification_report(y_test,y_pred)))


 [[ 0  7]
 [ 0 32]]

Accuracy Score 0.8205128205128205
Classification report: 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.82      1.00      0.90        32

    accuracy                           0.82        39
   macro avg       0.41      0.50      0.45        39
weighted avg       0.67      0.82      0.74        39
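
The confusion matrix shows that this model predicts class 1 for every test sample, so its 82% accuracy is just the share of the majority class. A short sketch that rebuilds the test labels from the support column above (7 healthy, 32 Parkinson's) and checks the arithmetic; `y_true` and `y_always_one` are illustrative names:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Test labels rebuilt from the support column: 7 of class 0, 32 of class 1.
y_true = np.array([0] * 7 + [1] * 32)

# A classifier that always predicts class 1, matching the confusion matrix.
y_always_one = np.ones_like(y_true)

acc = accuracy_score(y_true, y_always_one)               # 32 / 39
bal_acc = balanced_accuracy_score(y_true, y_always_one)  # mean of per-class recall

print(f"accuracy={acc:.4f}, balanced accuracy={bal_acc:.2f}")
# prints: accuracy=0.8205, balanced accuracy=0.50
```

Balanced accuracy averages the recall of both classes, which makes it a useful companion metric on an imbalanced test set like this one.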





# Extreme Gradient Boosting

In [13]:
xgb = XGBClassifier()

# Tune the tree depth and the minimum sum of instance weights needed in a child node.
parameters = {'min_child_weight': np.arange(0, 20),
              'max_depth': [2, 4, 5, 7, 9, 10]}

grid_search = GridSearchCV(estimator=xgb, param_grid=parameters, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

# Keep the best model for the stacking step.
xgb = grid_search.best_estimator_

grid_search.best_params_

Fitting 10 folds for each of 120 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 516 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 1193 out of 1200 | elapsed:   13.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed:   13.0s finished


{'max_depth': 2, 'min_child_weight': 2}

In [14]:
y_pred=xgb.predict(x_test)

print("\n",confusion_matrix(y_test,y_pred))
xgb_acc = accuracy_score(y_test,y_pred)
print("\nAccuracy Score {}".format(xgb_acc))
print("Classification report: \n{}".format(classification_report(y_test,y_pred)))


 [[ 5  2]
 [ 0 32]]

Accuracy Score 0.9487179487179487
Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.71      0.83         7
           1       0.94      1.00      0.97        32

    accuracy                           0.95        39
   macro avg       0.97      0.86      0.90        39
weighted avg       0.95      0.95      0.95        39



# Stacking Ensemble Classifier

In [15]:
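# Base learners: the three tuned models from the sections above.
estimators = [('xgb', xgb),
              ('svc', svc),
              ('log_Reg', log_reg)]

# The tuned LinearSVC also serves as the meta-learner (final_estimator).
# StackingClassifier fits clones, so reusing 'svc' here does not affect the base learner.
stack = StackingClassifier(estimators=estimators, final_estimator=svc)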

stack.fit(x_train,y_train)
stack_predicted = stack.predict(x_test)

stack_conf_matrix = confusion_matrix(y_test, stack_predicted)
stack_acc_score = accuracy_score(y_test, stack_predicted)

print("confusion matrix")
print(stack_conf_matrix)
print("\n")
print("Accuracy of Stacking Classifier:",stack_acc_score*100,'\n')
print(classification_report(y_test,stack_predicted))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

confusion matrix
[[ 5  2]
 [ 1 31]]


Accuracy of Stacking Classifier: 92.3076923076923 

              precision    recall  f1-score   support

           0       0.83      0.71      0.77         7
           1       0.94      0.97      0.95        32

    accuracy                           0.92        39
   macro avg       0.89      0.84      0.86        39
weighted avg       0.92      0.92      0.92        39
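
To compare the four models side by side, the confusion matrices printed above can be turned into accuracy and class-0 recall (the share of healthy subjects correctly identified). A small sketch using only the numbers already shown in this notebook; the `matrices` and `summary` names are illustrative:

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true 0, true 1).
matrices = {
    "Logistic Regression": [[3, 4], [0, 32]],
    "LinearSVC": [[0, 7], [0, 32]],
    "XGBoost": [[5, 2], [0, 32]],
    "Stacking": [[5, 2], [1, 31]],
}

summary = {}
for name, cm in matrices.items():
    cm = np.array(cm)
    accuracy = np.trace(cm) / cm.sum()       # correct predictions / all samples
    recall_healthy = cm[0, 0] / cm[0].sum()  # true negatives / all healthy samples
    summary[name] = (accuracy, recall_healthy)
    print(f"{name:<20} accuracy={accuracy:.3f}  class-0 recall={recall_healthy:.2f}")
```

On this 39-sample test split, XGBoost alone (37 of 39 correct) scores slightly higher than the stacked ensemble (36 of 39 correct). With only 7 healthy samples, a difference of one prediction should be read with caution.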



