# Handout #3: Classification of Parkinson's Disease

Content Authors:


*   Chris Malone PhD, Professor of Data Science and Statistics, Winona State University; Email: cmalone@winona.edu
*   Collin Engstrom PhD, Assistant Professor of Computer Science, Winona State University; Email: collin.engstrom@winona.edu

## Libraries and Custom Functions

The following Python libraries will be used throughout this handout.

In [1]:
# Load NumPy, Matplotlib, pandas, and seaborn libraries for data processing and graphing
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
from pandas.plotting import scatter_matrix
import seaborn as sns

# Load TensorFlow and Keras libraries to facilitate fitting of the neural net
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Load scikit-learn utilities to facilitate data splitting, preprocessing, and metrics
#from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.utils import resample

print('Tensorflow Version:' + tf.__version__)

Tensorflow Version:2.17.1


The following are custom functions that will be used in this handout.

In [2]:
# Need a function that plots a confusion matrix for us
def plot_cm(labels, predictions, threshold=0.5):
  cm = confusion_matrix(labels, predictions > threshold)
  plt.figure(figsize=(5,5))
  sns.heatmap(cm, annot=True, fmt="d")
  plt.title('Confusion matrix @{:.2f}'.format(threshold))
  plt.ylabel('Actual label')
  plt.xlabel('Predicted label')
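
To see what `plot_cm` draws, it can help to look at the thresholding step on its own. The sketch below uses five made-up labels and predicted probabilities (hypothetical values, not taken from the Parkinson's data) and prints the raw confusion matrix that the heatmap would display.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities, for illustration only
toy_labels = np.array([0, 0, 1, 1, 1])
toy_predictions = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# A prediction counts as class 1 only when it exceeds the threshold
threshold = 0.5
toy_classes = (toy_predictions > threshold).astype(int)

# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(toy_labels, toy_classes)
tn, fp, fn, tp = cm.ravel()
print(cm)
print('TN =', tn, 'FP =', fp, 'FN =', fn, 'TP =', tp)
```

Raising the threshold moves counts from the predicted-1 column to the predicted-0 column, which trades false positives for false negatives.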

In [3]:
# Plot an ROC curve from the true labels and the predicted probabilities
def plot_roc(name, labels, predictions, **kwargs):
  fp, tp, _ = roc_curve(labels, predictions)

  plt.plot(fp, tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positive rate')
  plt.ylabel('True positive rate')
  plt.xlim([-0.1,1.1])
  plt.ylim([0,1.1])
  plt.grid(True)
  ax = plt.gca()
  ax.set_aspect('equal')
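
`roc_curve` sweeps the classification threshold and returns one false positive rate and one true positive rate per threshold; `plot_roc` draws those pairs. The minimal sketch below shows the arrays involved, using made-up scores (the values are hypothetical). It also calls `roc_auc_score`, which this handout does not import but which lives in the same `sklearn.metrics` module.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only
toy_labels = np.array([0, 0, 1, 1, 1])
toy_scores = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(toy_labels, toy_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print('threshold {:.2f}: FPR = {:.2f}, TPR = {:.2f}'.format(th, f, t))

# Area under the curve: 0.5 is chance level, 1.0 is perfect separation
auc = roc_auc_score(toy_labels, toy_scores)
print('AUC = {:.3f}'.format(auc))
```

Both rates are fractions between 0 and 1, and the curve always runs from (0, 0) to (1, 1).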

## Example: Parkinson's Disease

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

This dataset is composed of a range of biomedical voice measurements from
31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a
particular voice measure, and each row corresponds to one of the 195 voice
recordings from these individuals (identified by the "Name" column). The main aim of the data
is to discriminate healthy people from those with PD, according to the "Status"
column, which is set to 0 for healthy and 1 for PD.

<table>
  <tr>
    <td width='100%'>
      <ul>
        <li><strong>Status</strong>: Labels are 0 (Healthy) and 1 (Parkinson's)</li><br>
        <li><strong>Features</strong>:</li>
        <ul>
          <li>Name - ASCII subject name and recording number</li>
          <li>MDVP:Fo(Hz) - Average vocal fundamental frequency</li>
          <li>MDVP:Fhi(Hz) - Maximum vocal fundamental frequency</li>
          <li>MDVP:Flo(Hz) - Minimum vocal fundamental frequency</li>
          <li>MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency</li>
          <li>MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude</li>
          <li>NHR,HNR - Two measures of the ratio of noise to tonal components in the voice</li>
          <li>RPDE,D2 - Two nonlinear dynamical complexity measures</li>
          <li>DFA - Signal fractal scaling exponent</li>
          <li>Spread1,Spread2,PPE - Three nonlinear measures of fundamental frequency variation</li>
         </ul>
    </ul>
    </td>
</tr>
</table>

[Data - Local Copy](https://raw.githubusercontent.com/christophermalone/mayo_ml_workshop/refs/heads/main/datasets/Parkinsons_Disease.csv)

[Data Descriptions - Local Copy](https://github.com/christophermalone/mayo_ml_workshop/blob/main/datasets/ParkinsonsDisease_FeatureDescriptions.txt)

[Research Article - Local Copy](https://github.com/christophermalone/mayo_ml_workshop/blob/main/datasets/ParkinsonsDisease_Article.pdf)

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1phCTqDvhyjXcElxDe3cRKre9w2vW4Yq7" width='75%' height='75%'></img></p>

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


The following code will read in the Parkinson's Disease dataset.

In [None]:
url = 'https://raw.githubusercontent.com/christophermalone/mayo_ml_workshop/refs/heads/main/datasets/Parkinsons_Disease.csv'
ParkinsonsDisease_DF = pd.read_csv(url)

The first few rows give a quick look at the Parkinson's Disease dataset.

In [None]:
ParkinsonsDisease_DF.head()
#ParkinsonsDisease_DF.shape
#ParkinsonsDisease_DF.dtypes


| | Name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | RPDE | DFA | Spread1 | Spread2 | D2 | PPE | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 | 1 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 | 1 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 | 1 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 | 1 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 | 1 |
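
Each person contributes several recordings, so it can be useful to recover the subject from the Name column. The sketch below assumes every name follows the pattern seen in the preview (`phon_R01_S01_1`), with the last underscore-separated token being the recording number and the one before it the subject. That pattern is an assumption and should be checked against the full dataset. The `S02` names are hypothetical additions for illustration.

```python
import pandas as pd

# Three names from the preview above plus two hypothetical ones
names = pd.Series(
    ['phon_R01_S01_1', 'phon_R01_S01_2', 'phon_R01_S01_3',
     'phon_R01_S02_1', 'phon_R01_S02_2'],
    name='Name')

# Split from the right into prefix, subject, and recording number
parts = names.str.rsplit('_', n=2, expand=True)
parts.columns = ['prefix', 'subject', 'recording']
parts['recording'] = parts['recording'].astype(int)

# Count how many recordings each subject contributed
recordings_per_subject = parts.groupby('subject').size()
print(recordings_per_subject)
```

On the real data, the same idea would apply to `ParkinsonsDisease_DF['Name']` in place of the toy `names` series.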


## Task: Build a Neural Net Model to Predict Status


1.   Provide an initial investigation into the Class variable (Status) [For simplicity you can skip upsampling to balance the labels within the Class variable]
2.   Identify the relevant features to include in $\bf{X}$
3.   Perform any necessary transformations of the features (e.g. OneHot encoding, rescaling of numerical features, etc.).
4.   Use split-sample to divide the data into a *training* set and a *test* set
5.   Fit a neural net model using the *training* set
6.   Obtain predictions for Status using your predictive model for the subjects in the *test* set
7.   Evaluate the performance of your predictive model
8.   Consider possible variations of the tuning parameters to improve predictive performance of your neural net.
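
Steps 1 through 4 can be sketched as follows. This is one possible approach rather than a reference solution. It runs on a small synthetic stand-in (`toy_df`, with made-up values and only three of the voice measures) so that it is self-contained. The 25% test fraction, the choice of `StandardScaler`, and the stratified split are assumptions. The features shown in the preview are all numeric, so no OneHot encoding appears here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ParkinsonsDisease_DF (values are made up)
rng = np.random.default_rng(0)
n = 195
toy_df = pd.DataFrame({
    'Name': ['rec_{}'.format(i) for i in range(n)],
    'MDVP:Fo(Hz)': rng.normal(150, 40, n),
    'HNR': rng.normal(22, 4, n),
    'PPE': rng.normal(0.2, 0.09, n),
    'Status': (rng.random(n) < 0.75).astype(int),
})

# Step 1: look at the balance of the class variable
print(toy_df['Status'].value_counts())

# Step 2: use every column except the identifier and the label as features
X = toy_df.drop(columns=['Name', 'Status'])
y = toy_df['Status']

# Step 4: split first, keeping the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Step 3: learn the scaling from the training set only, then apply it to both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information about the test set out of the preprocessing step.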

*    **Note**: Due to the relatively small sample size, the performance of the neural net varies considerably from one set of parameter settings to the next.
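
Steps 5 through 7 might look like the sketch below, which fits a small Keras network on synthetic, already-scaled data so that it runs on its own. The architecture (one hidden layer of 8 units), the optimizer, the number of epochs, and the batch size are all assumptions chosen for illustration. They are also the tuning parameters that step 8 invites you to vary.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix

# Fix the seeds so repeated runs are comparable
tf.keras.utils.set_random_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-in for scaled features; the label depends on feature 0
def make_labels(X, noise):
    return (X[:, 0] + 0.5 * noise > -0.5).astype('int32')

X_train = rng.normal(size=(146, 5)).astype('float32')
X_test = rng.normal(size=(49, 5)).astype('float32')
y_train = make_labels(X_train, rng.normal(size=len(X_train)))
y_test = make_labels(X_test, rng.normal(size=len(X_test)))

# Step 5: a single hidden layer with a sigmoid output for a binary label
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, batch_size=16,
                    validation_split=0.2, verbose=0)

# Step 6: predicted probabilities of Status = 1 for the test set
test_probs = model.predict(X_test, verbose=0).ravel()

# Step 7: accuracy and confusion matrix at a 0.5 threshold
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
cm = confusion_matrix(y_test, (test_probs > 0.5).astype(int), labels=[0, 1])
print('Test accuracy: {:.3f}'.format(test_acc))
print(cm)
```

The `test_probs` array has the form that `plot_cm` and `plot_roc` expect as `predictions`. Changing the seed, the hidden layer size, or the number of epochs and rerunning is a quick way to see the variation described in the note above.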




---





The End...

