This dataset comprises a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals. The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. 

The data is in ASCII CSV format. Each row of the CSV file contains one instance, corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is given in the first column.
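To illustrate how the first column ties recordings to a patient, here is a minimal sketch that groups recording names by subject. It assumes names follow the pattern seen in the dataset preview (`phon_R01_S01_1`), with the trailing number being the recording index. `subject_id` is a hypothetical helper, and most of the sample names are made up for illustration.

```python
from collections import Counter


def subject_id(recording_name: str) -> str:
    """Drop the trailing recording index, e.g. 'phon_R01_S01_1' -> 'phon_R01_S01'.

    Assumption: the last underscore-separated field is the recording number.
    """
    return recording_name.rsplit("_", 1)[0]


# Illustrative names only: the first two appear in the dataset preview,
# the remaining ones are invented to show the grouping.
names = [
    "phon_R01_S01_1",
    "phon_R01_S01_2",
    "phon_R01_S02_1",
    "phon_R01_S02_2",
    "phon_R01_S02_3",
]

recordings_per_subject = Counter(subject_id(n) for n in names)
print(recordings_per_subject)
```

On the real DataFrame, the same idea could be applied with `X["name"].map(subject_id).value_counts()` before the `name` column is dropped.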

Our end objective is to differentiate between people who have Parkinson’s and those who don’t, and then to fine-tune the parameters to maximize accuracy on the test set and reduce the false positive rate.

Across 19 different combinations of techniques, the highest accuracy score achieved is 94.9%. All the scores are tabulated at the end.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os
from sklearn.svm import SVC
import numpy as np
from sklearn.preprocessing import Normalizer, MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler, KernelCenterer
from sklearn.decomposition import PCA
from sklearn import manifold

In [2]:
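# Load the dataset. The path is built with os.path.join so that it
# does not depend on the operating system's path separator.
X = pd.read_csv(os.path.join("Datasets", "parkinsons.data"), header=0)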
print(X.head(2))

             name  MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0  phon_R01_S01_1      119.992       157.302        74.997         0.00784   
1  phon_R01_S01_2      122.400       148.650       113.819         0.00968   

   MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer    ...     \
0           0.00007   0.00370   0.00554     0.01109       0.04374    ...      
1           0.00008   0.00465   0.00696     0.01394       0.06134    ...      

   Shimmer:DDA      NHR     HNR  status      RPDE       DFA   spread1  \
0      0.06545  0.02211  21.033       1  0.414783  0.815285 -4.813031   
1      0.09403  0.01929  19.085       1  0.458359  0.819521 -4.075192   

    spread2        D2       PPE  
0  0.266482  2.301442  0.284654  
1  0.335590  2.486855  0.368674  

[2 rows x 24 columns]


In [3]:
print(X.describe())

       MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
count   195.000000    195.000000    195.000000      195.000000   
mean    154.228641    197.104918    116.324631        0.006220   
std      41.390065     91.491548     43.521413        0.004848   
min      88.333000    102.145000     65.476000        0.001680   
25%     117.572000    134.862500     84.291000        0.003460   
50%     148.790000    175.829000    104.315000        0.004940   
75%     182.769000    224.205500    140.018500        0.007365   
max     260.105000    592.030000    239.170000        0.033160   

       MDVP:Jitter(Abs)    MDVP:RAP    MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
count        195.000000  195.000000  195.000000  195.000000    195.000000   
mean           0.000044    0.003306    0.003446    0.009920      0.029709   
std            0.000035    0.002968    0.002759    0.008903      0.018857   
min            0.000007    0.000680    0.000920    0.002040      0.009540   
25%            0.000

In [4]:
print(X.dtypes)

name                 object
MDVP:Fo(Hz)         float64
MDVP:Fhi(Hz)        float64
MDVP:Flo(Hz)        float64
MDVP:Jitter(%)      float64
MDVP:Jitter(Abs)    float64
MDVP:RAP            float64
MDVP:PPQ            float64
Jitter:DDP          float64
MDVP:Shimmer        float64
MDVP:Shimmer(dB)    float64
Shimmer:APQ3        float64
Shimmer:APQ5        float64
MDVP:APQ            float64
Shimmer:DDA         float64
NHR                 float64
HNR                 float64
status                int64
RPDE                float64
DFA                 float64
spread1             float64
spread2             float64
D2                  float64
PPE                 float64
dtype: object


In [5]:
# Checking for null values in the dataframe
print(X[pd.isnull(X).any(axis=1)])
print("Has null values in the data: ",X.isnull().values.any())

Empty DataFrame
Columns: [name, MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA, NHR, HNR, status, RPDE, DFA, spread1, spread2, D2, PPE]
Index: []

[0 rows x 24 columns]
Has null values in the data:  False


In [6]:
# Since our data does not need any cleaning, we will proceed with 
# dropping the name column and splicing out the status column 
# into a variable y and delete it from X.
y = X.status
X.drop(["name", "status"], inplace=True, axis = 1)
print(X.head(2))

   MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  MDVP:Jitter(Abs)  \
0      119.992       157.302        74.997         0.00784           0.00007   
1      122.400       148.650       113.819         0.00968           0.00008   

   MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  MDVP:Shimmer(dB)    ...     \
0   0.00370   0.00554     0.01109       0.04374             0.426    ...      
1   0.00465   0.00696     0.01394       0.06134             0.626    ...      

   MDVP:APQ  Shimmer:DDA      NHR     HNR      RPDE       DFA   spread1  \
0   0.02971      0.06545  0.02211  21.033  0.414783  0.815285 -4.813031   
1   0.04368      0.09403  0.01929  19.085  0.458359  0.819521 -4.075192   

    spread2        D2       PPE  
0  0.266482  2.301442  0.284654  
1  0.335590  2.486855  0.368674  

[2 rows x 22 columns]


In [7]:
#Perform a train/test split. 30% test group size, 
#with a random_state equal to 7.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3, random_state=7)
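It helps to know how many recordings end up in each split, because every accuracy figure in this notebook is a fraction of the test-set size. The arithmetic below is a small sketch. It assumes that a fractional `test_size` is rounded up to a whole number of rows, which is consistent with the best score printed later in this notebook.

```python
import math

n_samples = 195   # rows in the dataset
test_size = 0.3   # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print("train rows:", n_train, "test rows:", n_test)

# Accuracy on the test set can only take values k / n_test,
# so one recording more or less moves the score by this much.
step = 1 / n_test
print(f"one test recording is worth {step:.1%} accuracy")

# 56 correct predictions out of n_test gives the best score reported later.
print(56 / n_test)
```

With a test set this small, differences of one or two percentage points between techniques amount to a single recording.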

Our ultimate aim is to create an SVC classifier with a high accuracy score. We will try the following:
    1. Train a simple SVC classifier without any feature scaling
    2. Perform feature scaling and test the accuracy of our simple SVC classifier
    3. Preprocess the data using PCA and test the accuracy
    4. Preprocess the data using Isomap and test the accuracy

To carry out step 1, uncomment and run only the cell below, and keep everything after it commented out.

In [8]:
# Creating a simple SVC classifier without 
# specifying any parameters and leaving them as default.
# svc = SVC()
# svc.fit(X_train, y_train)
# score = svc.score(X_test, y_test)
# print(score) 

To carry out step 2:
    1. Restart the kernel and run all cells
    2. Make sure that the SVC cell directly above is entirely commented out
    3. Uncomment exactly one of the feature scaling techniques and leave the others commented

Once the data has been scaled, run the SVC model above on the scaled dataset.

Repeat this process six times, selecting a different feature scaling technique each time, and keep a record of the accuracy score printed.

In [9]:
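#
# The features consist of different units mixed in together, so we will perform
# feature scaling. Every scaler is fitted on the training split only.
# Note: Normalizer rescales each sample (row) to unit norm rather than each
# feature, and KernelCenterer is designed for precomputed kernel matrices.

norm = Normalizer().fit(X_train)
maxabs = MaxAbsScaler().fit(X_train)
minmax = MinMaxScaler().fit(X_train)
stand = StandardScaler().fit(X_train)
robust = RobustScaler().fit(X_train)
kernel = KernelCenterer().fit(X_train)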

# X_train = norm.transform(X_train) #79.6 
# X_test = norm.transform(X_test)

# X_train = maxabs.transform(X_train) #79.6 
# X_test = maxabs.transform(X_test)

# X_train = minmax.transform(X_train) #79.6
# X_test = minmax.transform(X_test)

X_train = stand.transform(X_train) #91.5
X_test = stand.transform(X_test) 

# X_train = robust.transform(X_train) #89.83
# X_test = robust.transform(X_test) 

# X_train = kernel.transform(X_train) #81.35
# X_test = kernel.transform(X_test) 
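The comment/uncomment cycle above can also be written as a loop over the scalers. The sketch below is one way to do that, not the procedure used to produce the scores in this notebook. It runs on synthetic data from `make_classification` as a stand-in for the Parkinson's features, so the numbers it prints are not comparable to the ones recorded here. `KernelCenterer` is left out because it targets precomputed kernel matrices and some scikit-learn versions may reject a non-square input.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the real features (195 x 22).
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

scalers = {
    "Normalizer": Normalizer(),
    "MaxAbsScaler": MaxAbsScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
}

scores = {}
for label, scaler in scalers.items():
    scaler.fit(Xtr)  # fit on the training split only
    clf = SVC().fit(scaler.transform(Xtr), ytr)  # default SVC parameters
    scores[label] = clf.score(scaler.transform(Xte), yte)

for label, score in scores.items():
    print(f"{label:15s} {score:.3f}")
```

Because each iteration transforms copies rather than overwriting `X_train` and `X_test`, no kernel restart is needed between scalers.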

To carry out step 3:
    1. Restart the kernel and run all cells
    2. Make sure that the code used for step 1 is commented out
    3. Uncomment exactly one of the feature scaling techniques and leave the others commented

We then pass the scaled values to our PCA model and train our SVC model on the result.

Repeat this process six times, selecting a different feature scaling technique each time, and keep a record of the accuracy score printed.

To carry out step 4, comment out the PCA search block below completely and repeat the same steps with the Isomap block.

#### We create a best-parameter search using nested for-loops. The outer loop iterates a variable C from 0.05 to 2 in increments of 0.05. The inner loop iterates a variable gamma from 0.001 to 0.1 in increments of 0.001. Since Python's built-in range does not accept float steps, we use NumPy's arange instead.

In [10]:
C = 0.05
gamma = 0.001
best_score = 0
c_range = np.arange(0.05,2.05, 0.05)
gamma_range = np.arange(0.001, .101, 0.001)
pca_range = np.arange(4,15,1)
iso_range_neighbors = np.arange(2,6,1)
iso_range_components = np.arange(4,7,1)
n_components = 0
n_neighbors = 0
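Float steps in `np.arange` are subject to floating-point rounding, which is why a value such as 0.7500000000000001 shows up as the best C further down. As a hedged alternative, the sketch below builds the same two grids with `np.linspace`, where the number of points and both end points are stated explicitly. It also counts how many SVC fits the Isomap search performs. `c_grid` and `gamma_grid` are names introduced here for illustration.

```python
import numpy as np

# Same grids as above, defined by end points and a fixed number of values.
c_grid = np.linspace(0.05, 2.0, num=40)         # 0.05, 0.10, ..., 2.00
gamma_grid = np.linspace(0.001, 0.1, num=100)   # 0.001, 0.002, ..., 0.100

print(len(c_grid), c_grid[0], c_grid[-1])
print(len(gamma_grid), gamma_grid[0], gamma_grid[-1])

# Each projection is followed by one SVC fit per (C, gamma) pair.
fits_per_projection = len(c_grid) * len(gamma_grid)

# Integer ranges are exact, so these lengths are safe to rely on.
n_isomap_projections = len(np.arange(2, 6, 1)) * len(np.arange(4, 7, 1))
print("SVC fits in the Isomap search:", n_isomap_projections * fits_per_projection)
```

The fit count is worth keeping in mind when widening any of the ranges, since the total grows multiplicatively.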

In [11]:
# for b in pca_range:
#     pca = PCA(n_components=b)
#     pca.fit(X_train)
#     X_Ltrain = pca.transform(X_train)
#     X_Ltest = pca.transform(X_test)

#     for i in c_range:
#         for j in gamma_range:
#             #
#             # Create an SVC model with the current C and gamma values,
#             # then train and score it.
#             # If the model's score beats the current best_score, we
#             # update best_score and remember the n_components, C and
#             # gamma values that produced it.
#             #
#             model = SVC(C=i, gamma=j)
#             model.fit(X_Ltrain, y_train)
#             score = model.score(X_Ltest, y_test)
#             if(best_score<score):
#                 best_score = score
#                 C = i
#                 gamma = j
#                 n_components = b

# print("Best score:", best_score)
# print("Best C:", C)
# print("Best gamma:", gamma)
# print("Best n_components:", n_components)

In [12]:
for a in iso_range_neighbors:
    for b in iso_range_components:
        iso = manifold.Isomap(n_neighbors = a, n_components = b)
        iso.fit(X_train)
        X_Ltrain = iso.transform(X_train)
        X_Ltest = iso.transform(X_test)

        for i in c_range:
            for j in gamma_range:
                #
                # Create an SVC model with the current C and gamma values,
                # then train and score it.
                # If the model's score beats the current best_score, we
                # update best_score and remember the n_neighbors,
                # n_components, C and gamma values that produced it.
                #
                model = SVC(C=i, gamma=j)
                model.fit(X_Ltrain, y_train)
                score = model.score(X_Ltest, y_test)
                if(best_score<score):
                    best_score = score
                    C = i
                    gamma = j
                    n_components = b
                    n_neighbors = a

print("Best score:", best_score)
print("Best C:", C)
print("Best gamma:", gamma)
print("Best n_components:", n_components)
print("Best n_neighbors:", n_neighbors)

Best score: 0.9491525423728814
Best C: 0.7500000000000001
Best gamma: 0.05
Best n_components: 5
Best n_neighbors: 2
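One caveat of the search above is that the best parameters are picked by their score on the test set, so the reported accuracy is likely to be optimistic. A common alternative, sketched below, is to chain the steps in a `Pipeline` and let `GridSearchCV` choose parameters by cross-validation on the training split, scoring the test split only once at the end. This is not how the numbers in this notebook were obtained. The data is synthetic, and the grid is deliberately coarse with neighbor counts chosen for that synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the real features (195 x 22).
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=7)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

# Scaling, projection and classification are refitted inside every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("iso", Isomap()),
    ("svc", SVC()),
])

# Coarse illustrative grid (assumption), much smaller than the loops above.
param_grid = {
    "iso__n_neighbors": [5, 8],
    "iso__n_components": [4, 5, 6],
    "svc__C": [0.25, 0.75, 1.5],
    "svc__gamma": [0.01, 0.05, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(Xtr, ytr)

print("Best CV accuracy:", search.best_score_)
print("Best parameters:", search.best_params_)
print("Held-out test accuracy:", search.score(Xte, yte))
```

Swapping `Isomap()` for `PCA()` and renaming the grid keys accordingly would give the PCA variant of the same search.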


| Feature Scaling | Default SVC score | SVC after PCA score | SVC after Isomap score |
| --------------- | ----------------- | ------------------- | ---------------------- |
| None            | 81.4%             | NA                  | NA                     |
| Normalizer      | 79.6%             | 79.6%               | 79.7%                  |
| MaxAbsScaler    | 79.6%             | 88.1%               | 88.1%                  |
| MinMaxScaler    | 79.6%             | 88.1%               | 88.1%                  |
| StandardScaler  | 91.5%             | 93.2%               | 94.9%                  |
| RobustScaler    | 89.8%             | 91.5%               | 94.9%                  |
| KernelCenterer  | 81.4%             | 91.5%               | 91.5%                  |