# Detect Parkinson's - Using XGBoost Classifier
> Using XGBoost to classify Parkinson's disease.

- toc: false 
- badges: true
- comments: false
- categories: [jupyter, sklearn, xgboost]
- author: Venkataramani, Suja

## Overview

XGBoost is well suited to large tabular datasets with many features, both categorical and numerical, where deep learning is not required. Our dataset is quite small, but we use XGBoost here for the purposes of this example.

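XGBoost does not need feature scaling or normalisation. It is an ensemble of decision trees, which split on feature thresholds rather than on distances between data points, so the scale of each feature does not change the splits (unlike scale-sensitive methods such as KNN or PCA).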
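To illustrate why scaling does not matter for tree-based models, here is a minimal sketch in plain Python. It is not XGBoost itself: the helper `best_split` is hypothetical and the toy data is made up. It picks the best threshold for a one-feature decision stump by Gini impurity, then repeats the search after min-max scaling the feature.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, left_indices) for the best single split on one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = order[:k]
        right = order[k:]
        # Weighted impurity of the two child nodes.
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2, sorted(left))
    return best[1], best[2]


# Made-up feature values and labels, for illustration only.
feature = [0.2, 0.5, 0.9, 1.4, 2.0, 2.6]
labels = [0, 0, 0, 1, 1, 1]

# Rescale the feature to the range [0, 1] (min-max normalisation).
f_min, f_max = min(feature), max(feature)
scaled = [(v - f_min) / (f_max - f_min) for v in feature]

threshold_raw, left_raw = best_split(feature, labels)
threshold_scaled, left_scaled = best_split(scaled, labels)

print(threshold_raw, left_raw)
print(threshold_scaled, left_scaled)
```

The two thresholds differ because they are expressed in different units, but the same samples fall on each side of the split, so the tree makes the same decisions either way.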

We begin by separating the dependent variable (y) from the independent variables (X). The `status` column is the label, the dependent variable. Of the 24 columns, dropping `status` and the `name` identifier leaves 22 independent variables. We then split the dataset into a training set and a test set, fit an XGBoost model on the training set, and evaluate it on the test set.

## Method

First, let's download the Parkinson's data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/).

Install XGBoost for Python by running `pip install xgboost` at the command prompt. Then run `import xgboost` in Python to confirm that the installation works.
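As an optional check before running the notebook, the sketch below reports whether the packages used in this post can be imported. The helper `is_installed` is a hypothetical name, built on the standard library's `importlib.util.find_spec`. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the availability of each package the notebook imports.
for name in ("xgboost", "pandas", "sklearn"):
    state = "found" if is_installed(name) else "missing"
    print(f"{name}: {state}")
```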

In [26]:
# Import packages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

In [5]:
# Load data.
pk_data = pd.read_csv(".\\data\\parkinsons.csv")

In [6]:
# Check the number of rows and columns in the dataset.
pk_data.shape

(195, 24)

In [7]:
# Check the first 5 rows.
pk_data.head(5)

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


In [32]:
# Select the feature columns with .loc[rows, columns].
# pk_data.columns != 'status' gives a boolean array that is True for every column except 'status'.
# .values converts the DataFrame into a NumPy array; [:, 1:] then drops the first column ('name').
x = pk_data.loc[:, pk_data.columns != 'status'].values[:, 1:]
# Select the 'status' column as the label array.
y = pk_data.loc[:,'status'].values

In [34]:
# Split the x and y into 80% train data and 20% test data.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 100)

### XGBoost 

XGBoost builds on decision trees. Bagging (training each tree on a random sample drawn with replacement from the data) applied to decision trees is the basis of Random Forest, an ensemble of decision trees that is usually more accurate than a single tree. Boosting takes a different route: the trees are built one after another, and each new tree aims to correct the errors of the trees before it, so the model is a sum of trees. When each new tree is fitted along the gradient of a loss function, so that the errors are minimised by gradient descent, the method is called Gradient Boosting.
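The boosting idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than XGBoost's algorithm: it assumes a squared-error regression task on made-up data, uses one-split trees (stumps), and leaves out regularisation and second-order gradients. The names `fit_stump`, `mse`, `learning_rate` and `n_rounds` are chosen for this example only.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    # Skip the largest value so that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up 1-D regression data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

learning_rate = 0.5
n_rounds = 10

# Start from a constant prediction: the mean of the targets.
base = sum(ys) / len(ys)
preds = [base] * len(ys)
trees = []
history = [mse(ys, preds)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = [y - p for y, p in zip(ys, preds)]
    tree = fit_stump(xs, residuals)
    trees.append(tree)
    # Add a damped copy of the new tree to the ensemble (additive model).
    preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print([round(h, 4) for h in history])
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error in `history` shrinks as trees are added.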

XGBoost adds further optimisations to gradient boosted trees: parallel processing, a regularisation term combined with tree pruning (splits that do not reduce the loss enough to justify the regularisation penalty are removed), sparsity-aware handling of missing values, and built-in cross-validation. Alongside these algorithmic advances, the implementation is optimised for the hardware (for example, cache-aware data access), which results in significant performance improvements.
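The link between regularisation and pruning can be written out. The following is a sketch of the standard formulation in the XGBoost documentation's introduction to boosted trees, not something derived in this post. Here $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ are regularisation parameters:

$$
\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

With $G$ and $H$ denoting the sums of first and second derivatives of the loss over the samples in the left ($L$) and right ($R$) child, the gain from a candidate split is:

$$
\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
$$

A split with negative gain costs more in regularisation than it saves in loss, which makes it a candidate for pruning.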

[XGBoost](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)    
[TowardsDataScience](https://towardsdatascience.com/)  
[Medium](https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d)  
[MachineLearningMastery](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)  
[YouTube-GradientBoost](https://www.youtube.com/watch?v=3CC4N4z3GJc)  
[YouTube-XGBoost](https://www.youtube.com/watch?v=OtD8wVaFm6E)

In [38]:
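# Train an XGBoost classifier with its default hyperparameters.
xgbc = xgb.XGBClassifier()
xgbc.fit(x_train, y_train)

# Predict the labels of the test set and compare them with the true labels.
y_pred = xgbc.predict(x_test)
score = accuracy_score(y_test, y_pred)
print('XGB Accuracy: ', round(score * 100, 3))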

XGB Accuracy:  94.872
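To put the score in context, the accuracy can be converted back into sample counts. This is a small arithmetic sketch based on the 195 rows and the 20% test share used above. It assumes that scikit-learn rounds a fractional test share up to a whole number of samples; the variable names are chosen for this example.

```python
import math

n_samples = 195      # rows in the dataset
test_size = 0.2      # share passed to train_test_split
accuracy = 0.94872   # reported accuracy, as a fraction

# Number of samples in each split.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Number of correct predictions implied by the reported accuracy.
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

print(n_train, n_test)      # training and test set sizes
print(n_correct, n_wrong)   # correct and incorrect test predictions
print(round(n_correct / n_test * 100, 3))
```

With a test set this small, each misclassified sample moves the accuracy by roughly 2.6 percentage points, so the score is best read as a rough indication.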


## Conclusion

Even though the dataset is small, XGBoost performed well, with about **94.9%** accuracy on the test set.