# Detect the presence of Parkinson's disease

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

Source - [UCI Repository](https://archive.ics.uci.edu/ml/datasets/Parkinsons)

## Attribute Information:

Matrix column entries (attributes):
1. name - ASCII subject name and recording number
2. MDVP:Fo(Hz) - Average vocal fundamental frequency
3. MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
4. MDVP:Flo(Hz) - Minimum vocal fundamental frequency
5. MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
6. MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
7. NHR,HNR - Two measures of ratio of noise to tonal components in the voice
8. status - Health status of the subject: one (1) for Parkinson's, zero (0) for healthy
9. RPDE,D2 - Two nonlinear dynamical complexity measures
10. DFA - Signal fractal scaling exponent
11. spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation



### 1 Data Sourcing

In [11]:
import pandas as pd
import numpy as np
import os, sys
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [12]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data')
data.head()

,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


Obtain the features (all columns except `name` and `status`) and the labels (`status`) from the dataset.

In [13]:
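# Drop the label column ('status'), then drop the first remaining column ('name', an identifier)
features = data.loc[:, data.columns != 'status'].values[:,1:]
# The label is 1 for Parkinson's and 0 for healthy
labels = data.loc[:, 'status'].values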

Counting the number of 0s and 1s will help us evaluate the imbalance in the labels, if any.

In [14]:
print(labels[labels == 0].shape[0], \
      labels[labels == 1].shape[0])

48 147
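
A slightly more explicit way to inspect class balance is to count every class at once and compute the share of the majority class, which is the accuracy a trivial "always predict the majority" classifier would reach. The sketch below is illustrative only: `labels_demo` is a hypothetical stand-in array built from the two counts printed above, not the notebook's `labels` variable.

```python
import numpy as np

# Hypothetical stand-in for `labels`: 48 zeros and 147 ones,
# matching the two counts printed by the cell above.
labels_demo = np.array([0] * 48 + [1] * 147)

# Count the occurrences of each distinct label in one call.
classes, counts = np.unique(labels_demo, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))

# Share of the most frequent class: the accuracy of a trivial
# classifier that always predicts that class.
majority_share = counts.max() / counts.sum()

print(class_counts)
print(f"majority share: {majority_share:.3f}")
```

Any accuracy reported later is best read against this majority share rather than against 50%.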


The `status` column contains 48 zeros (healthy) and 147 ones (Parkinson's), so the classes are imbalanced at roughly 1:3. Next, we scale each feature to the range -1 to 1 with `MinMaxScaler`.

In [15]:
scaler = MinMaxScaler((-1,1))
x = scaler.fit_transform(features)
y = labels
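
To make the scaling step concrete, here is a minimal NumPy sketch of the linear rescaling that min-max scaling performs on each column: subtract the column minimum, divide by the column range, then stretch and shift onto the target interval. The helper `min_max_scale` and the toy matrix `X_demo` are hypothetical names and made-up values used only for illustration; the notebook itself relies on `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(X, feature_range=(-1.0, 1.0)):
    """Linearly rescale each column of X onto feature_range."""
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant columns, whose range would be zero.
    col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / col_range * (hi - lo) + lo

# Toy matrix (made-up values): two features on very different scales.
X_demo = np.array([[100.0, 0.001],
                   [150.0, 0.003],
                   [200.0, 0.005]])

X_scaled = min_max_scale(X_demo)
print(X_scaled)
```

After scaling, every column's smallest value maps to -1 and its largest to 1, so features measured in Hz and features measured as small ratios end up on a comparable scale.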

### 2 Train Test Split

We split the dataset into training and test sets in an 80:20 ratio.

In [16]:
x_train, x_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 101)
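
One possible variation, shown here as a sketch rather than as the notebook's method, is to split first and fit the scaler on the training portion only, so that no statistics from the test rows influence preprocessing, and to pass `stratify` so both portions keep a similar class ratio. The arrays `X_demo` and `y_demo` are synthetic stand-ins (random features, and the 48/147 label counts reported earlier); the number of features is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 195 rows, 5 random features (arbitrary),
# and labels with the same 48/147 class counts as the dataset.
X_demo = rng.normal(size=(195, 5))
y_demo = np.array([0] * 48 + [1] * 147)

# Stratified 80:20 split keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = MinMaxScaler(feature_range=(-1, 1))
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr.shape, X_te.shape)
print(f"positive share: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

With this ordering, test values can fall slightly outside the -1 to 1 range, which is expected because the scaler has never seen them.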

### 3 Model Building

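We initialize an `XGBClassifier` with its default hyperparameters and fit it on the training set. XGBoost (extreme gradient boosting) is an ensemble method that builds decision trees sequentially, each new tree correcting the errors of the previous ones, and it often performs well on tabular data such as this.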

In [17]:
XGB_model = XGBClassifier()
XGB_model.fit(x_train, Y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

### 4 Model Evaluation

Evaluate the model on the held-out test set by comparing the predicted labels with the actual labels, first as overall accuracy and then as a confusion matrix.

In [18]:
Y_pred = XGB_model.predict(x_test)
print(accuracy_score(Y_test, Y_pred)*100)

92.3076923076923


In [19]:
pd.crosstab(Y_pred, Y_test, rownames = ["Predicted"], colnames = ["Actual"], margins = True)

Predicted \ Actual,0,1,All
0,9,1,10
1,2,27,29
All,11,28,39


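Overall, the XGBoost model reaches 92.3% accuracy on the 39-sample test set (36 correct predictions). The confusion matrix shows 27 true positives, 9 true negatives, 2 false positives, and 1 false negative. The larger number of true positives mainly reflects the class imbalance: 28 of the 39 test samples belong to the Parkinson's class.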
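
Because the classes are imbalanced, accuracy alone can be flattering. As an illustrative follow-up, the sketch below derives a few per-class metrics directly from the four cell counts of the confusion matrix above (rows are predictions, columns are actual labels). The variable names are our own; no model or library is needed.

```python
# Cell counts read from the confusion matrix above
# (rows = predicted, columns = actual).
tn, fn = 9, 1    # predicted 0: actually 0, actually 1
fp, tp = 2, 27   # predicted 1: actually 0, actually 1

total = tn + fn + fp + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on the Parkinson's class
specificity = tn / (tn + fp)   # recall on the healthy class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy = (sensitivity + specificity) / 2

# Accuracy of always predicting the majority class in the test set.
majority_baseline = max(tp + fn, tn + fp) / total

metrics = {
    "accuracy": accuracy,
    "sensitivity": sensitivity,
    "specificity": specificity,
    "precision": precision,
    "f1": f1,
    "balanced_accuracy": balanced_accuracy,
    "majority_baseline": majority_baseline,
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these counts, sensitivity is about 0.96 while specificity is about 0.82, and the majority-class baseline is about 0.72, so the model is noticeably better at recognising Parkinson's recordings than healthy ones. With only 11 healthy samples in the test set, the specificity estimate in particular should be read with caution.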