## **Day 5**

### Parkinson's Disease Detection with XGBoost

#### XGBoost


XGBoost (eXtreme Gradient Boosting) is a machine learning algorithm used mainly for regression and classification. It is an ensemble method built on gradient boosting: models, usually decision trees, are added one at a time, and each new model is trained to correct the errors made by the models before it. Models are added until a set number is reached or the results stop improving.
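
To make the "each model corrects the previous errors" idea concrete, here is a minimal sketch of gradient boosting for squared error in plain Python. It is an illustration only, not XGBoost's actual implementation (which also uses regularisation and second-order gradient information). The helper names `fit_stump`, `predict_stump`, and `boost`, and the toy data, are made up for this example.

```python
# Toy gradient boosting with one-split "stumps" as weak learners.
# All names and data here are hypothetical and only illustrate the idea.

def fit_stump(x, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return base, stumps, mse_history

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
base, stumps, mse_history = boost(x, y)
print(mse_history[0], mse_history[-1])  # training error after the first and last round
```

The training error shrinks as stumps are added, which is the behaviour the paragraph above describes.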

#### Data Source:

The data comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/. It has 195 records and 24 columns.


In [160]:
#Import libraries and packages

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [184]:
# Import dataset and check the rows and columns

data = pd.read_csv("parkinsons.data")
data.head()


| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


In [162]:
# To understand the dimensions of data, null values and datatypes

data.info()

# Drop the 'name' column: it is an ID of object dtype and is not needed as a feature

data.drop(columns = ['name'], inplace = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [163]:
# Group together variables and separate labels

variables = data.loc[:,data.columns != 'status']
labels = data.loc[:, 'status']

In [164]:
# Count how many labels are 1 (Parkinson's) and 0 (no Parkinson's)

labels[labels == 1].shape[0], labels[labels == 0].shape[0]

(147, 48)
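
The counts above show an imbalanced dataset, so it helps to know what accuracy a trivial model would reach. The short sketch below uses only the two counts from the output above and computes the share of positive cases and the accuracy of always predicting the majority class. The variable names are chosen for this example.

```python
# Class counts taken from the cell output above: 147 positives, 48 negatives.
n_positive, n_negative = 147, 48
total = n_positive + n_negative

# Share of records labelled 1 (Parkinson's).
positive_share = n_positive / total

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = max(n_positive, n_negative) / total

print(total)                       # 195
print(f"{positive_share:.3f}")     # 0.754
print(f"{baseline_accuracy:.3f}")  # 0.754
```

Any accuracy reported later is best read against this roughly 75% baseline rather than against 50%.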

In [165]:
# Scale the values of all columns between -1 and 1

scaled = MinMaxScaler((-1,1)).fit_transform(variables)
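
For reference, min-max scaling to the range [-1, 1] maps each value x in a column to -1 + 2 * (x - min) / (max - min). The sketch below implements that formula in plain Python. Unlike the cell above, which fits the scaler on all rows before splitting, it learns the minimum and maximum from training rows only and reuses them for test rows, a common way to keep test information out of preprocessing. The function names and the numbers are made up, and the handling of constant columns is an assumption of this sketch, not a description of `MinMaxScaler`.

```python
def fit_min_max(rows):
    """Return column-wise minima and maxima for a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform(rows, minima, maxima, low=-1.0, high=1.0):
    """Scale each column to [low, high] using previously fitted minima/maxima."""
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (hi == lo) is mapped to `low`.
            low + (value - lo) / (hi - lo) * (high - low) if hi > lo else low
            for value, lo, hi in zip(row, minima, maxima)
        ])
    return scaled

# Made-up feature rows, not taken from the dataset.
train_rows = [[100.0, 0.002], [150.0, 0.010], [200.0, 0.006]]
test_rows = [[125.0, 0.012]]

minima, maxima = fit_min_max(train_rows)   # statistics from training rows only
train_scaled = transform(train_rows, minima, maxima)
test_scaled = transform(test_rows, minima, maxima)
print(train_scaled)
print(test_scaled)  # test values may fall outside [-1, 1]
```

Tree-based models are largely insensitive to this kind of monotonic rescaling, so the choice mainly matters if the same preprocessing is reused with other model types.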

In [180]:
# Divide the data into train and test sets using train_test_split with an 80:20 ratio

scaled_train, scaled_test, labels_train, labels_test = train_test_split(scaled, 
                    labels, test_size = 0.2, random_state = 42)
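
With 147 positive and 48 negative records, a purely random split can leave the test set with a different class ratio than the full data. One option is a stratified split, which `train_test_split` supports through its `stratify` argument. The sketch below shows the idea in plain Python so the mechanics are visible. The function `stratified_split` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class contributes about test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labels with the same class counts as the dataset (order is made up).
labels = [1] * 147 + [0] * 48
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)  # 156 39 29
```

Here 29 of the 39 test rows are positive, close to the 75% share in the full data.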

In [181]:
# Run the XGBClassifier model

model = XGBClassifier()
model.fit(scaled_train,labels_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
       importance_type='gain', interaction_constraints='',
       learning_rate=0.300000012, max_delta_step=0, max_depth=6,
       min_child_weight=1, missing=nan, monotone_constraints='()',
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
       validate_parameters=1, verbosity=None)

In [182]:
# Predict labels for the test set and compute the accuracy in percent

predicted_labels = model.predict(scaled_test)
accuracy_score(labels_test, predicted_labels)*100


94.87179487179486
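
The test set holds 39 rows (20% of 195), so the accuracy above is consistent with 37 correct predictions out of 39. Because the classes are imbalanced, accuracy alone can hide how the model does on the smaller class. The sketch below computes precision, recall, and specificity from a confusion matrix in plain Python. The label lists are invented for illustration and are not the notebook's predictions, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def summarize(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Invented labels: six positives, two negatives, one negative misclassified.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]
metrics = summarize(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.875 while specificity is only 0.5, which is the kind of gap that a single accuracy figure does not reveal.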