# SC1015 Mini-Project

Group: FCEE

Lee Heng Sheng Brandon, U2322900C \
Alan Lee Leman, \
Zi Hao

## Introduction

Parkinson's disease (PD) is a progressive disorder that affects the nervous system and parts of the body controlled by the nerves. The prevalence of PD has doubled in the past 25 years. Global estimates in 2019 showed over 8.5 million individuals with PD. Although PD is incurable, medications can significantly alleviate the symptoms. Detecting PD in its early stage allows for early intervention and treatment that can help maintain quality of life and slow disease progression. 

PD commonly has a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone speech (reduced pitch range). The MDVP (Multi-Dimensional Voice Program) parameters measure these effects through frequency, cycle-to-cycle differences and amplitude. Descriptions of the exact parameters and measurements can be found below.

## Problem Statement

How can we detect Parkinson's disease in patients through their voice?

For our own convenience:

1. Problem statement should be interesting.
2. Clean data if necessary.
3. EDA accounts for 20% of the data.
4. We must include a new ML technique in our project. Examples include clustering models or anomaly detection.
5. It is important that we consult our TA periodically to ensure we meet their expectations, as they are the ones grading. 
6. We must submit our presentation video, the PPT/PDF slides used for the presentation, and all code on GitHub with references.

| Abbreviation | Feature description |
| --- | --- |
| MDVP:F0 (Hz) | Average vocal fundamental frequency |
| MDVP:Fhi (Hz) | Maximum vocal fundamental frequency |
| MDVP:Flo (Hz) | Minimum vocal fundamental frequency |
| MDVP:Jitter(%) | MDVP jitter in percentage |
| MDVP:Jitter(Abs) | MDVP absolute jitter in ms |
| MDVP:RAP | MDVP relative amplitude perturbation |
| MDVP:PPQ | MDVP five-point period perturbation quotient |
| Jitter:DDP | Average absolute difference of differences between jitter cycles |
| MDVP:Shimmer | MDVP local shimmer |
| MDVP:Shimmer(dB) | MDVP local shimmer in dB |
| Shimmer:APQ3 | Three-point amplitude perturbation quotient |
| Shimmer:APQ5 | Five-point amplitude perturbation quotient |
| MDVP:APQ11 | MDVP 11-point amplitude perturbation quotient |
| Shimmer:DDA | Average absolute differences between the amplitudes of consecutive periods |
| NHR | Noise-to-harmonics ratio |
| HNR | Harmonics-to-noise ratio |
| RPDE | Recurrence period density entropy measure |
| D2 | Correlation dimension |
| DFA | Signal fractal scaling exponent of detrended fluctuation analysis |
| Spread1, Spread2 | Two nonlinear measures of fundamental frequency variation |
| PPE | Pitch period entropy |
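
To make the jitter and shimmer rows more concrete, here is a minimal sketch of how "local" jitter and shimmer are commonly defined: the mean absolute difference between consecutive cycles, divided by the mean value. The function names and the sample periods and amplitudes are hypothetical, and the exact MDVP algorithms may differ from these textbook definitions.

```python
import numpy as np

def local_jitter_percent(periods):
    """Mean absolute difference between consecutive glottal periods,
    relative to the mean period, as a percentage (assumed textbook definition)."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes,
    relative to the mean amplitude (assumed textbook definition)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical cycle lengths (seconds) and peak amplitudes, for illustration only
example_periods = [0.0100, 0.0102, 0.0099, 0.0101]
example_amplitudes = [1.0, 1.1, 0.9, 1.0]

jitter_pct = local_jitter_percent(example_periods)
shimmer = local_shimmer(example_amplitudes)
print(f"jitter: {jitter_pct:.2f} %, shimmer: {shimmer:.3f}")
```

A perfectly steady voice would give zero for both measures; larger values indicate more cycle-to-cycle instability in pitch (jitter) or loudness (shimmer).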

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

```python
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot

# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree

sb.set() # set the default Seaborn style for graphics
```
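
The sklearn imports above suggest a decision-tree workflow. As a rough sketch of how those pieces usually fit together, the example below trains a small tree on synthetic data and prints a confusion matrix. The data, the 75/25 split, and `max_depth=3` are all assumptions for illustration and are not taken from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (hypothetical): one informative feature, one noise feature
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
X = np.column_stack([
    y + rng.normal(0, 0.3, size=200),  # shifts with the class label
    rng.normal(0, 1, size=200),        # pure noise
])

# Hold out 25% of the rows for testing, keeping the class ratio similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a shallow tree and evaluate it on the held-out rows
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
cm = confusion_matrix(y_test, tree.predict(X_test))
print(cm)  # rows are true classes, columns are predicted classes
```

On the real dataset, `X` would hold the voice features and `y` the `status` column.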

---

## Import the Dataset

Dataset on [Parkinson Disease Detection](https://www.kaggle.com/datasets/debasisdotcom/parkinson-disease-detection/data). 

'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)

```python
data = pd.read_csv("parkinsons.data")

data.head()
```

|  | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


```python
data.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1
```

## Data Cleaning
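
The `data.info()` output shown earlier reports 195 non-null entries for every column it lists, so little cleaning may be needed. One way to confirm this is to count missing values and duplicate rows. The sketch below does so on a small hypothetical frame (`demo`) rather than on the real dataset, and the drop-based cleanup is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value and one duplicated row
demo = pd.DataFrame({
    "name": ["rec_1", "rec_2", "rec_2", "rec_3"],
    "HNR": [21.0, 19.1, 19.1, np.nan],
    "status": [1, 1, 1, 0],
})

# Count missing values per column and fully duplicated rows
missing_per_column = demo.isna().sum()
duplicate_rows = int(demo.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", duplicate_rows)

# One simple cleanup: drop exact duplicates, then rows with missing values
cleaned = demo.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```

If both counts come back as zero on the real data, the cleanup step can be skipped.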

## Exploratory Data Analysis

First, we identify candidate predictor variables by checking the correlation between all variables. This is possible because every column in the dataset other than `name`, a string identifier, is numerical.
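
A minimal sketch of this correlation check is shown below. It runs on a synthetic stand-in frame (`demo`) so that it is self-contained; the values are hypothetical, and ranking features by absolute correlation with `status` is an assumption about how the predictors might be shortlisted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (hypothetical values)
rng = np.random.default_rng(0)
status = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "name": [f"rec_{i}" for i in range(200)],
    "status": status,
    "PPE": 0.2 * status + rng.normal(0, 0.05, size=200),  # related to status
    "DFA": rng.normal(0.7, 0.05, size=200),                # unrelated to status
})

# Drop the non-numeric identifier before computing pairwise correlations
corr = demo.drop(columns=["name"]).corr()

# Rank the features by the strength of their correlation with the target
status_corr = corr["status"].drop("status")
ranked = status_corr.reindex(status_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The full `corr` matrix can also be passed to `sb.heatmap` to visualise all pairwise correlations at once.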