# CITS5508 Lab 3: Random Forest Classifier on Parkinsons

Name: Joey Koh<br>
Student number: 21506379  
Date created: 13 April 2020  
Last modified: 20 April 2020  

This notebook goes through the steps of a Random Forest classification project. It addresses the task of predicting Parkinsons status from vocal measurements.<br>

Two different Random Forest classifiers are trained and tested, and the need for data normalisation is investigated.


## 1. Setup, Data Cleaning
Import libraries to be used and bring the data in. Clean the data for use.<br>

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
#Use jupyter's backend to render plots
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier 
from sklearn import metrics
import warnings
#warnings.filterwarnings("ignore")

#Load the data; the file is comma-separated and its header row supplies the column names
raw_data = pd.read_csv("parkinsons.data")
#View first few lines and inspect columns
raw_data.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


### 1.1 Data Cleaning
There are quite a few columns, so let's look at all of them just to confirm we are not missing anything.

In [2]:
#view all columns
with pd.option_context( 'display.max_columns', None):  # more options can be specified also
    print(raw_data)

               name  MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0    phon_R01_S01_1      119.992       157.302        74.997         0.00784   
1    phon_R01_S01_2      122.400       148.650       113.819         0.00968   
2    phon_R01_S01_3      116.682       131.111       111.555         0.01050   
3    phon_R01_S01_4      116.676       137.871       111.366         0.00997   
4    phon_R01_S01_5      116.014       141.781       110.655         0.01284   
..              ...          ...           ...           ...             ...   
190  phon_R01_S50_2      174.188       230.978        94.261         0.00459   
191  phon_R01_S50_3      209.516       253.017        89.488         0.00564   
192  phon_R01_S50_4      174.688       240.005        74.287         0.01360   
193  phon_R01_S50_5      198.764       396.961        74.904         0.00740   
194  phon_R01_S50_6      214.289       260.277        77.973         0.00567   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:P

Together with the data description, this confirms that only the 'name' column is non-numerical. It holds the 'ASCII subject name and recording number' of 31 people, with around 6 recordings per person. As we are trying to use speech signals and not the individual's identity to classify the health status, we can **drop the 'name' column**.

In [3]:
#Drop name column
data = raw_data.drop('name', axis= 1)
print("Are there any undefined values?")
print(data.isnull().values.any()) #ensure no undefined values left to fix


Are there any undefined values?
False


Good, no problems here with NaN values.

## 2. Pipeline: Data Splitting, and Normalisation
Split dataset, extract labels, investigate need for data normalisation, and create pipeline.

In [4]:
#Look at the distributions to decide which kind of feature scaling to use, if any
data.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


Looking at the minimum, quartiles and maximum of each attribute, several attributes likely contain outliers (for example, the maximum of `NHR` is 0.31482 while its 75th percentile is 0.02564). A StandardScaler deals with these better than a MinMaxScaler, so standardisation was used in the pipeline to get normalised data. Standardisation first subtracts the column's mean from every value in the column, then divides by the column's standard deviation, so the resulting distribution has zero mean and unit variance. Because the result is not squeezed into a fixed range set by the extreme values, outliers have less of an impact than with min-max scaling.<br>

**Note:** Feature scaling is included to investigate the effect of normalisation on a Random Forest classifier. The later sections conclude that it is not actually needed.
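
As a small illustration of the two scalers discussed here, the sketch below standardises a column by hand using $z = (x - \mu)/\sigma$ and checks the result against `StandardScaler`. The toy column `x` is made up for the example and is not taken from the Parkinsons data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy column with one large outlier (not from the dataset)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardisation by hand: subtract the column mean, divide by the column std
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler uses the population standard deviation (ddof=0), like np.std's default
scaled = StandardScaler().fit_transform(x)

# MinMaxScaler maps the column onto [0, 1] using its minimum and maximum
minmax = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.ravel())
print(minmax.ravel())
```

In this toy case min-max scaling places the four ordinary values between 0 and roughly 0.03, because the single outlier defines the top of the range. The standardised column is not bounded to a fixed interval.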

In [5]:
#Split dataset at 8:2 ratio
train_data, test_data = train_test_split(data, test_size=0.2, random_state= 42) #with set randomseed
train_data = train_data.reset_index(drop=True) #renumber index
test_data = test_data.reset_index(drop=True) #renumber index
#Ensure no NaN values
print("are there any NaN values created for test data and train data?")
print(test_data.isnull().values.any()," and ",train_data.isnull().values.any(), "for both" )


are there any NaN values created for test data and train data?
False  and  False for both


As checked, there are no undefined values after data splitting.

In [6]:
#Produce normalised dataset and un-normalised dataset for comparison.
def separate_pipeline(df): #separate data into labels and predictors
    df_y = df["status"]  #Extract class labels
    print("For this dataset, these are the number of instances for each class label")
    print(df_y.value_counts(),"\n")   #Count instances for class labels in training set
    df_x = df.drop("status", axis=1) #separate predictors from labels
    return df_y, df_x

def scaler_pipeline(df,df_train_x): #for feature scaling
    std_scaler = StandardScaler()
    std_scaler.fit(df_train_x)
    return std_scaler.transform(df)
    
def main_pipeline(train_dataset, test_dataset): #bringing it all together
    print("For the training dataset:\n")
    train_y, train_x = separate_pipeline(train_dataset)
    print("For the testing dataset:\n")
    test_y, test_x = separate_pipeline(test_dataset)
    train_x_tr = scaler_pipeline(train_x, train_x)
    test_x_tr = scaler_pipeline(test_x, train_x)
    return train_y, train_x, test_y, test_x, train_x_tr, test_x_tr

train_y, train_x, test_y, test_x, train_x_tr, test_x_tr = main_pipeline(train_data, test_data)


For the training dataset:

For this dataset, these are the number of instances for each class label
1    115
0     41
Name: status, dtype: int64 

For the testing dataset:

For this dataset, these are the number of instances for each class label
1    32
0     7
Name: status, dtype: int64 



Good, there are enough instances for each of the 2 classes to train with.<br>
**Note: 0 = healthy, 1 = unhealthy (has Parkinsons)**

## 3. Random Forest Classifier #1
Uses all available cores (`n_jobs=-1`), 500 trees and at most 16 leaf nodes per tree.

In [7]:
randf1_clf = RandomForestClassifier(random_state=0,n_estimators= 500, max_leaf_nodes= 16, n_jobs= -1)

#fit with transformed training data
randf1_clf.fit(train_x_tr,train_y);
#generate test data label predictions
test_y_randf1_tr_pred = randf1_clf.predict(test_x_tr)

#fit with non-normalised training data
randf1_clf.fit(train_x, train_y);
#generate predictions
test_y_randf1_pred = randf1_clf.predict(test_x)


### 3.1 Scores

In [8]:
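def report_scores(actual_y, pred_y): #F1 score function
    #Note: average="micro" returns one pooled score (equal to accuracy for single-label data), not one score per class
    print("F1 scores of prediction for each class:")
    print(metrics.f1_score(actual_y, pred_y, average= "micro"))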

In [9]:
#Get scores
print("For transformed data trained model:")
report_scores(test_y, test_y_randf1_tr_pred)

print("\n For not normalised data trained model:")
report_scores(test_y, test_y_randf1_pred)


For transformed data trained model:
F1 scores of prediction for each class:
0.9487179487179487

 For not normalised data trained model:
F1 scores of prediction for each class:
0.9487179487179487


**Results**: Both F1 scores of 94.87% are **identical**. Let's look at another Random Forest classifier.
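
A note on the metric: `report_scores` calls `f1_score` with `average="micro"`, which pools true positives, false positives and false negatives over both classes and returns a single number. For single-label problems this equals plain accuracy. The sketch below illustrates this on made-up labels (the arrays are hypothetical, not this notebook's predictions) and also shows `average=None`, which does return one F1 score per class.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0])

micro_f1 = metrics.f1_score(y_true, y_pred, average="micro")
accuracy = metrics.accuracy_score(y_true, y_pred)
per_class_f1 = metrics.f1_score(y_true, y_pred, average=None)

print(micro_f1, accuracy)  # the two values coincide
print(per_class_f1)        # one score per class, ordered as [class 0, class 1]
```

Read this way, the reported score of 0.9487 corresponds to 37 of the 39 test recordings being classified correctly.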

## 4. Random Forest Classifier #2

In [10]:
#Now with reduced number of trees and max_features.
randf2_clf = RandomForestClassifier(random_state=0, max_features=0.1, n_estimators=6)

#fit with transformed training data
randf2_clf.fit(train_x_tr,train_y);
#generate test data label predictions
test_y_randf2_tr_pred = randf2_clf.predict(test_x_tr)

#fit with non-normalised training data
randf2_clf.fit(train_x, train_y);
#generate predictions
test_y_randf2_pred = randf2_clf.predict(test_x)


`n_estimators` controls the number of trees; the default is 100. Setting it to 6 limits the number of voters in the ensemble, so the majority vote is noisier and accuracy tends to drop.

`max_features` controls the number of features considered when looking for the best split. A setting of 0.1 means that only 10% of the features are considered at each split. Each tree therefore often misses the most informative feature at a given node and is individually weaker, although the trees also become less correlated with one another.
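
To make the two settings concrete, the sketch below fits the same configuration on synthetic data and inspects the fitted forest. The data comes from `make_classification` with 22 features, an assumption chosen to mirror the number of predictor columns left after dropping `name` and `status`, so the output illustrates the parameters rather than reproducing this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 22 features is an assumption, not read from parkinsons.data
X, y = make_classification(n_samples=200, n_features=22, random_state=0)

clf = RandomForestClassifier(random_state=0, max_features=0.1, n_estimators=6)
clf.fit(X, y)

n_trees = len(clf.estimators_)                         # one fitted tree per estimator
features_per_split = clf.estimators_[0].max_features_  # resolved integer value

print(n_trees)
print(features_per_split)
```

A float `max_features` is interpreted as a fraction of the feature count and truncated to an integer, so with 22 features each split examines only 2 randomly chosen candidates.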



### 4.1 Scores

In [11]:
#Get scores
print("For transformed data trained model:")
report_scores(test_y, test_y_randf2_tr_pred)

print("\n For not normalised data trained model:")
report_scores(test_y, test_y_randf2_pred)


For transformed data trained model:
F1 scores of prediction for each class:
0.8461538461538461

 For not normalised data trained model:
F1 scores of prediction for each class:
0.8461538461538461


**Results**: Both F1 scores of 84.62% are **identical**.

## 5. Conclusion

<p>The F1 score of Random Forest Classifier #1 is identical for the normalised and the untouched training data, at 94.87%.</p>
<p>Similarly, the F1 score of Random Forest Classifier #2 is identical for the normalised and the untouched training data, at 84.62%.</p>

<p>Therefore, data normalisation/feature scaling is not necessary for Random Forest classifiers, as performance is identical with and without it.</p>
<p>This is due to the algorithm being based on decision trees. Each split compares the values of a single feature against a threshold, and one feature is never compared to another in magnitude. Hence, the relative ranges of the features do not matter: rescaling a feature only moves the threshold without changing how the samples are partitioned.</p>

<p>Overall, the performance of the Random Forest Classifier is comparable with or without feature scaling.</p>
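
The scale-invariance argument can be checked in isolation with a minimal sketch. It uses synthetic data from `make_classification` (a hypothetical stand-in, not the Parkinsons recordings) and two forests with the same hyperparameters and seed, so the only difference between them is whether the features were standardised. The hyperparameters are illustrative and do not match either classifier above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Parkinsons recordings)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Same hyperparameters and seed, so only the feature scale differs
raw_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
scaled_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train_s, y_train)

# Fraction of test samples on which the two forests predict the same label
agreement = np.mean(raw_clf.predict(X_test) == scaled_clf.predict(X_test_s))
print(agreement)
```

The agreement should be at or very near 1.0, since standardisation is a monotonic transformation of each feature. Floating-point rounding near a split threshold is the only expected source of disagreement.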