In [43]:
import pandas as pd
import numpy as np
import math  # numpy.math was only an alias of the standard-library module and is gone in NumPy 2.0

In [44]:
data = pd.read_csv('dataset/raw_dataset.csv')

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5875 entries, 0 to 5874
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   subject#       5875 non-null   int64  
 1   age            5875 non-null   int64  
 2   sex            5875 non-null   int64  
 3   test_time      5875 non-null   float64
 4   motor_UPDRS    5875 non-null   float64
 5   total_UPDRS    5875 non-null   float64
 6   Jitter(%)      5875 non-null   float64
 7   Jitter(Abs)    5875 non-null   float64
 8   Jitter:RAP     5875 non-null   float64
 9   Jitter:PPQ5    5875 non-null   float64
 10  Jitter:DDP     5875 non-null   float64
 11  Shimmer        5875 non-null   float64
 12  Shimmer(dB)    5875 non-null   float64
 13  Shimmer:APQ3   5875 non-null   float64
 14  Shimmer:APQ5   5875 non-null   float64
 15  Shimmer:APQ11  5875 non-null   float64
 16  Shimmer:DDA    5875 non-null   float64
 17  NHR            5875 non-null   float64
 18  HNR     

This dataset consists of biomedical voice measurements from 42 people with early-stage Parkinson's disease. They were recruited to a six-month trial of a remote monitoring device for tracking symptom progression at a distance. The recordings were captured automatically in the patients' homes.

The columns include a subject number, the subject's age and sex, the time elapsed since recruitment, the two target scores (motor UPDRS and total UPDRS), and 16 voice measurements. Each row corresponds to one of 5,875 voice recordings from these volunteers. Our main goal is to predict the UPDRS scores (i.e. 'motor_UPDRS' and 'total_UPDRS') from the 16 voice measurements.

The dataset documentation also describes each column as follows:

subject# - Integer that uniquely identifies each subject

age - Subject age

sex - Subject sex: '0' - male, '1' - female

test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.

motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated

total_UPDRS - Clinician's total UPDRS score, linearly interpolated

Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP - Several measures of variation in fundamental frequency

Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA - Several measures of variation in amplitude

NHR,HNR - Two measures of ratio of noise to tonal components in the voice

RPDE - A nonlinear dynamical complexity measure

DFA - Signal fractal scaling exponent

PPE - A nonlinear measure of fundamental frequency variation
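
The column groups listed above can be written down as a small mapping, which also confirms that there are 16 voice measurements. This is only a sketch: the names `voice_feature_groups`, `voice_features` and `targets`, and the group labels, are illustrative and do not come from the dataset documentation.

```python
# Column names as they appear in data.info(); the group labels are illustrative.
voice_feature_groups = {
    "jitter": ["Jitter(%)", "Jitter(Abs)", "Jitter:RAP", "Jitter:PPQ5", "Jitter:DDP"],
    "shimmer": ["Shimmer", "Shimmer(dB)", "Shimmer:APQ3", "Shimmer:APQ5",
                "Shimmer:APQ11", "Shimmer:DDA"],
    "noise": ["NHR", "HNR"],
    "nonlinear": ["RPDE", "DFA", "PPE"],
}

# Flatten the groups into a single list of feature columns.
voice_features = [col for cols in voice_feature_groups.values() for col in cols]

# The two columns we want to predict.
targets = ["motor_UPDRS", "total_UPDRS"]

print(len(voice_features))  # 5 + 6 + 2 + 3 = 16
```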

In [46]:
data.head()

Unnamed: 0,subject#,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,...,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
0,1,72,0,5.6431,28.199,34.398,0.00662,3.4e-05,0.00401,0.00317,...,0.23,0.01438,0.01309,0.01662,0.04314,0.01429,21.64,0.41888,0.54842,0.16006
1,1,72,0,12.666,28.447,34.894,0.003,1.7e-05,0.00132,0.0015,...,0.179,0.00994,0.01072,0.01689,0.02982,0.011112,27.183,0.43493,0.56477,0.1081
2,1,72,0,19.681,28.695,35.389,0.00481,2.5e-05,0.00205,0.00208,...,0.181,0.00734,0.00844,0.01458,0.02202,0.02022,23.047,0.46222,0.54405,0.21014
3,1,72,0,25.647,28.905,35.81,0.00528,2.7e-05,0.00191,0.00264,...,0.327,0.01106,0.01265,0.01963,0.03317,0.027837,24.445,0.4873,0.57794,0.33277
4,1,72,0,33.642,29.187,36.375,0.00335,2e-05,0.00093,0.0013,...,0.176,0.00679,0.00929,0.01819,0.02036,0.011625,26.126,0.47188,0.56122,0.19361


In [47]:
data.drop(['subject#'],axis=1,inplace=True)

Our dataset contains a feature called subject#, which identifies the patient each recording belongs to. Its numeric value carries no meaning (it is a label, not an ordered quantity), so feeding it to our models as a number could confuse training. We therefore need to handle it in a way that makes sense.

We can either one-hot encode this feature or discard it completely. One-hot encoding would add one column per patient (42 in total), which is too many, so we discard it, as done in the cell above.
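
To make the trade-off concrete, the sketch below applies both options to a made-up frame with three subjects. The toy values are for illustration only; on the real data the one-hot option would produce one column per subject.

```python
import pandas as pd

# Toy stand-in for the real data: three subjects, two recordings each.
toy = pd.DataFrame({
    "subject#": [1, 1, 2, 2, 3, 3],
    "age": [72, 72, 58, 58, 65, 65],
})

# Option 1: one-hot encode the identifier (one new column per subject).
encoded = pd.get_dummies(toy, columns=["subject#"], prefix="subject")

# Option 2: discard the identifier entirely, as this notebook does.
dropped = toy.drop(columns=["subject#"])

print(encoded.columns.tolist())
print(dropped.columns.tolist())
```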





Before going further, we must make sure that no feature has missing values. This step matters because any missing values would force us to decide how to handle them: we could either delete the affected rows or impute them with techniques such as Expectation Maximization (EM) or pseudo-EM.

Fortunately, our data has no missing values at all, as the check below confirms, so this is not a concern here.


In [48]:
data.isnull().sum()

age              0
sex              0
test_time        0
motor_UPDRS      0
total_UPDRS      0
Jitter(%)        0
Jitter(Abs)      0
Jitter:RAP       0
Jitter:PPQ5      0
Jitter:DDP       0
Shimmer          0
Shimmer(dB)      0
Shimmer:APQ3     0
Shimmer:APQ5     0
Shimmer:APQ11    0
Shimmer:DDA      0
NHR              0
HNR              0
RPDE             0
DFA              0
PPE              0
dtype: int64
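
Had the check above reported missing values, the two simplest options would be deleting the incomplete rows or imputing the gaps. The sketch below shows both on a made-up frame. Note that it uses median imputation as a simple stand-in and does not implement EM.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps, for illustration only.
toy = pd.DataFrame({
    "HNR": [21.64, np.nan, 23.047, 24.445],
    "PPE": [0.16006, 0.1081, np.nan, 0.33277],
})

# Count missing values per column, as in the cell above.
print(toy.isnull().sum())

# Option 1: delete every row that has at least one missing value.
deleted = toy.dropna()

# Option 2: fill each gap with the column median (a simple alternative to EM).
imputed = toy.fillna(toy.median())
```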

For this work we split the dataset into four variants and treat each one separately.


In [49]:
motor_16_data = data.copy()
total_16_data = data.copy()
motor_18_data = data.copy()
total_18_data = data.copy()

We derive 4 datasets from the raw dataset:
    1 -- All features, to predict motor_UPDRS (motor_18_data)
    2 -- Age and sex dropped, to predict motor_UPDRS (motor_16_data)
    3 -- All features, to predict total_UPDRS (total_18_data)
    4 -- Age and sex dropped, to predict total_UPDRS (total_16_data)

In [50]:
def drop_column(x):
    # Drop the demographic columns from one of the two 16-feature datasets:
    # x == 0 -> motor_16_data, x == 1 -> total_16_data.
    if x == 0:
        motor_16_data.drop(['age', 'sex'], axis=1, inplace=True)
    elif x == 1:
        total_16_data.drop(['age', 'sex'], axis=1, inplace=True)

In [51]:
for i in range(2):
    drop_column(i)
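
The same four variants can also be built without in-place mutation. The sketch below uses a hypothetical helper, `make_variant`, with an illustrative parameter `include_demographics`; neither is part of this notebook, and the toy frame stands in for the real data.

```python
import pandas as pd

def make_variant(df, include_demographics):
    """Return a copy of df, optionally without the 'age' and 'sex' columns."""
    if include_demographics:
        return df.copy()
    # drop() without inplace=True returns a new frame and leaves df untouched.
    return df.drop(columns=["age", "sex"])

# Toy stand-in for the real data.
toy = pd.DataFrame({
    "age": [72, 58],
    "sex": [0, 1],
    "motor_UPDRS": [28.199, 15.0],
    "total_UPDRS": [34.398, 21.371],
    "PPE": [0.16006, 0.1081],
})

variants = {
    "motor_16": make_variant(toy, include_demographics=False),
    "total_16": make_variant(toy, include_demographics=False),
    "motor_18": make_variant(toy, include_demographics=True),
    "total_18": make_variant(toy, include_demographics=True),
}
```

As in the notebook, every variant keeps both UPDRS columns; which one serves as the target is decided later, at modelling time.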

In [52]:
motor_16_data.to_csv("motor_16_data.csv",index=False)
total_16_data.to_csv("total_16_data.csv",index=False)
motor_18_data.to_csv("motor_18_data.csv",index=False)
total_18_data.to_csv("total_18_data.csv",index=False)

In [53]:
m16 = pd.read_csv('motor_16_data.csv')
t16 = pd.read_csv('total_16_data.csv')
m18 = pd.read_csv('motor_18_data.csv')
t18 = pd.read_csv('total_18_data.csv')

Now we analyse all four datasets and choose the algorithm accordingly.

In [54]:
m16.describe()

Unnamed: 0,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP,Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
count,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0
mean,92.863722,21.296229,29.018942,0.006154,4.4e-05,0.002987,0.003277,0.008962,0.034035,0.31096,0.017156,0.020144,0.027481,0.051467,0.03212,21.679495,0.541473,0.65324,0.219589
std,53.445602,8.129282,10.700283,0.005624,3.6e-05,0.003124,0.003732,0.009371,0.025835,0.230254,0.013237,0.016664,0.019986,0.039711,0.059692,4.291096,0.100986,0.070902,0.091498
min,-4.2625,5.0377,7.0,0.00083,2e-06,0.00033,0.00043,0.00098,0.00306,0.026,0.00161,0.00194,0.00249,0.00484,0.000286,1.659,0.15102,0.51404,0.021983
25%,46.8475,15.0,21.371,0.00358,2.2e-05,0.00158,0.00182,0.00473,0.01912,0.175,0.00928,0.01079,0.015665,0.02783,0.010955,19.406,0.469785,0.59618,0.15634
50%,91.523,20.871,27.576,0.0049,3.5e-05,0.00225,0.00249,0.00675,0.02751,0.253,0.0137,0.01594,0.02271,0.04111,0.018448,21.92,0.54225,0.6436,0.2055
75%,138.445,27.5965,36.399,0.0068,5.3e-05,0.00329,0.00346,0.00987,0.03975,0.365,0.020575,0.023755,0.032715,0.061735,0.031463,24.444,0.614045,0.711335,0.26449
max,215.49,39.511,54.992,0.09999,0.000446,0.05754,0.06956,0.17263,0.26863,2.107,0.16267,0.16702,0.27546,0.48802,0.74826,37.875,0.96608,0.8656,0.73173


In [55]:
m18.describe()

Unnamed: 0,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP,...,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
count,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,...,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0
mean,64.804936,0.317787,92.863722,21.296229,29.018942,0.006154,4.4e-05,0.002987,0.003277,0.008962,...,0.31096,0.017156,0.020144,0.027481,0.051467,0.03212,21.679495,0.541473,0.65324,0.219589
std,8.821524,0.465656,53.445602,8.129282,10.700283,0.005624,3.6e-05,0.003124,0.003732,0.009371,...,0.230254,0.013237,0.016664,0.019986,0.039711,0.059692,4.291096,0.100986,0.070902,0.091498
min,36.0,0.0,-4.2625,5.0377,7.0,0.00083,2e-06,0.00033,0.00043,0.00098,...,0.026,0.00161,0.00194,0.00249,0.00484,0.000286,1.659,0.15102,0.51404,0.021983
25%,58.0,0.0,46.8475,15.0,21.371,0.00358,2.2e-05,0.00158,0.00182,0.00473,...,0.175,0.00928,0.01079,0.015665,0.02783,0.010955,19.406,0.469785,0.59618,0.15634
50%,65.0,0.0,91.523,20.871,27.576,0.0049,3.5e-05,0.00225,0.00249,0.00675,...,0.253,0.0137,0.01594,0.02271,0.04111,0.018448,21.92,0.54225,0.6436,0.2055
75%,72.0,1.0,138.445,27.5965,36.399,0.0068,5.3e-05,0.00329,0.00346,0.00987,...,0.365,0.020575,0.023755,0.032715,0.061735,0.031463,24.444,0.614045,0.711335,0.26449
max,85.0,1.0,215.49,39.511,54.992,0.09999,0.000446,0.05754,0.06956,0.17263,...,2.107,0.16267,0.16702,0.27546,0.48802,0.74826,37.875,0.96608,0.8656,0.73173


In [56]:
t16.describe()

Unnamed: 0,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP,Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
count,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0
mean,92.863722,21.296229,29.018942,0.006154,4.4e-05,0.002987,0.003277,0.008962,0.034035,0.31096,0.017156,0.020144,0.027481,0.051467,0.03212,21.679495,0.541473,0.65324,0.219589
std,53.445602,8.129282,10.700283,0.005624,3.6e-05,0.003124,0.003732,0.009371,0.025835,0.230254,0.013237,0.016664,0.019986,0.039711,0.059692,4.291096,0.100986,0.070902,0.091498
min,-4.2625,5.0377,7.0,0.00083,2e-06,0.00033,0.00043,0.00098,0.00306,0.026,0.00161,0.00194,0.00249,0.00484,0.000286,1.659,0.15102,0.51404,0.021983
25%,46.8475,15.0,21.371,0.00358,2.2e-05,0.00158,0.00182,0.00473,0.01912,0.175,0.00928,0.01079,0.015665,0.02783,0.010955,19.406,0.469785,0.59618,0.15634
50%,91.523,20.871,27.576,0.0049,3.5e-05,0.00225,0.00249,0.00675,0.02751,0.253,0.0137,0.01594,0.02271,0.04111,0.018448,21.92,0.54225,0.6436,0.2055
75%,138.445,27.5965,36.399,0.0068,5.3e-05,0.00329,0.00346,0.00987,0.03975,0.365,0.020575,0.023755,0.032715,0.061735,0.031463,24.444,0.614045,0.711335,0.26449
max,215.49,39.511,54.992,0.09999,0.000446,0.05754,0.06956,0.17263,0.26863,2.107,0.16267,0.16702,0.27546,0.48802,0.74826,37.875,0.96608,0.8656,0.73173


In [57]:
t18.describe()

Unnamed: 0,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP,...,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
count,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,...,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0,5875.0
mean,64.804936,0.317787,92.863722,21.296229,29.018942,0.006154,4.4e-05,0.002987,0.003277,0.008962,...,0.31096,0.017156,0.020144,0.027481,0.051467,0.03212,21.679495,0.541473,0.65324,0.219589
std,8.821524,0.465656,53.445602,8.129282,10.700283,0.005624,3.6e-05,0.003124,0.003732,0.009371,...,0.230254,0.013237,0.016664,0.019986,0.039711,0.059692,4.291096,0.100986,0.070902,0.091498
min,36.0,0.0,-4.2625,5.0377,7.0,0.00083,2e-06,0.00033,0.00043,0.00098,...,0.026,0.00161,0.00194,0.00249,0.00484,0.000286,1.659,0.15102,0.51404,0.021983
25%,58.0,0.0,46.8475,15.0,21.371,0.00358,2.2e-05,0.00158,0.00182,0.00473,...,0.175,0.00928,0.01079,0.015665,0.02783,0.010955,19.406,0.469785,0.59618,0.15634
50%,65.0,0.0,91.523,20.871,27.576,0.0049,3.5e-05,0.00225,0.00249,0.00675,...,0.253,0.0137,0.01594,0.02271,0.04111,0.018448,21.92,0.54225,0.6436,0.2055
75%,72.0,1.0,138.445,27.5965,36.399,0.0068,5.3e-05,0.00329,0.00346,0.00987,...,0.365,0.020575,0.023755,0.032715,0.061735,0.031463,24.444,0.614045,0.711335,0.26449
max,85.0,1.0,215.49,39.511,54.992,0.09999,0.000446,0.05754,0.06956,0.17263,...,2.107,0.16267,0.16702,0.27546,0.48802,0.74826,37.875,0.96608,0.8656,0.73173
