# Use of Regression in the Diagnosis of Parkinson's Disease
In this notebook we implement multiple regression to diagnose Parkinson's disease. The data comes from the Oxford Parkinson's Disease Detection Dataset (https://archive.ics.uci.edu/dataset/174/parkinsons).

## Tools
This project uses the following libraries:
- NumPy, a popular library for scientific computing
- Matplotlib, a popular library for plotting data

In [1]:
from utils import *
import copy, math
import numpy as np
import matplotlib.pyplot as plt

## Problem Statement
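The goal is to predict, from a set of biomedical voice measurements, whether a voice recording comes from a person with Parkinson's disease (`status` = 1) or from a healthy person (`status` = 0).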

Begin by looking at the structure of the csv that contains the data:

In [2]:
PARKINSONS_CSV = './data/PARKINSONS.csv'

with open(PARKINSONS_CSV, 'r') as csvfile:
    print(f"Header looks like this:\n\n{csvfile.readline()}")    
    print(f"First data point looks like this:\n\n{csvfile.readline()}")
    print(f"Second data point looks like this:\n\n{csvfile.readline()}")

Header looks like this:

name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE

First data point looks like this:

phon_R01_S01_1,119.99200,157.30200,74.99700,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.42600,0.02182,0.03130,0.02971,0.06545,0.02211,21.03300,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654

Second data point looks like this:

phon_R01_S01_2,122.40000,148.65000,113.81900,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.62600,0.03134,0.04518,0.04368,0.09403,0.01929,19.08500,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674



## Dataset
### Information
- This dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

- The data is in ASCII CSV format. Each row of the CSV file contains one instance, corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

- Reference:

  Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering

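Since there are around six recordings per patient, it can help to group recordings by subject. The sketch below assumes that names follow the pattern seen in the first two rows (`phon_R01_S01_1`, `phon_R01_S01_2`), with the last underscore-separated field being the recording index. The helper `subject_id` and the third sample name are hypothetical and only serve as an illustration.

```python
from collections import Counter


def subject_id(recording_name):
    """Return the recording name without its trailing recording index."""
    # Assumption: names look like "phon_R01_S01_1", where the last
    # underscore-separated field numbers the recording for that subject.
    return recording_name.rsplit("_", 1)[0]


# The first two names come from the CSV preview; the third is made up.
names = ["phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S02_1"]

# Count how many recordings belong to each subject.
recordings_per_subject = Counter(subject_id(n) for n in names)
print(recordings_per_subject)
```

Grouping by subject matters if the data is later split into training and test sets, because recordings from the same person are not independent.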
### Load data
- The `parse_data()` function used below loads the data into the variables `X_train` and `y_train`
  - `X_train` holds the biomedical voice measurements, one row per recording and one column per voice measure
  - `y_train` indicates whether or not the person has Parkinson's disease. In this dataset it is the `status` column
  - Both `X_train` and `y_train` are NumPy arrays.

In [3]:
X_train, y_train = parse_data()

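The implementation of `parse_data()` lives in `utils` and is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch. The name `parse_data_sketch` is hypothetical, and the sketch assumes that every column except `name` and `status` is a numeric feature. It reads from an in-memory sample that uses only three of the real columns.

```python
import csv
import io

import numpy as np


def parse_data_sketch(csvfile):
    """Split an open CSV file into a feature matrix X and a label vector y."""
    reader = csv.reader(csvfile)
    header = next(reader)
    name_idx = header.index("name")      # recording identifier, not a feature
    status_idx = header.index("status")  # target: 0 = healthy, 1 = PD

    features, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        labels.append(int(row[status_idx]))
        features.append(
            [float(v) for i, v in enumerate(row) if i not in (name_idx, status_idx)]
        )
    return np.array(features), np.array(labels)


# Tiny in-memory sample with a subset of the real columns and values.
sample = io.StringIO(
    "name,MDVP:Fo(Hz),status,PPE\n"
    "phon_R01_S01_1,119.99200,1,0.284654\n"
    "phon_R01_S01_2,122.40000,1,0.368674\n"
)
X_demo, y_demo = parse_data_sketch(sample)
print(X_demo.shape, y_demo)  # (2, 2) [1 1]
```

With the real file, the same function could be called as `parse_data_sketch(open(PARKINSONS_CSV, newline=""))`, ideally inside a `with` block.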
#### View the variables and Dimension
The code below prints the shape and type of `X_train`, followed by its first three rows.

In [4]:
# Print the shape, type, and first rows of X_train
print(f"X Shape: {X_train.shape}, X Type: {type(X_train)}")
print("Type of X_train:", type(X_train))
print("First three elements of X_train are:\n", X_train[:3])

X Shape: (195, 22), X Type: <class 'numpy.ndarray'>
Type of X_train: <class 'numpy.ndarray'>
First three elements of X_train are:
 [[ 1.199920e+02  1.573020e+02  7.499700e+01  7.840000e-03  7.000000e-05
   3.700000e-03  5.540000e-03  1.109000e-02  4.374000e-02  4.260000e-01
   2.182000e-02  3.130000e-02  2.971000e-02  6.545000e-02  2.211000e-02
   2.103300e+01  4.147830e-01  8.152850e-01 -4.813031e+00  2.664820e-01
   2.301442e+00  2.846540e-01]
 [ 1.224000e+02  1.486500e+02  1.138190e+02  9.680000e-03  8.000000e-05
   4.650000e-03  6.960000e-03  1.394000e-02  6.134000e-02  6.260000e-01
   3.134000e-02  4.518000e-02  4.368000e-02  9.403000e-02  1.929000e-02
   1.908500e+01  4.583590e-01  8.195210e-01 -4.075192e+00  3.355900e-01
   2.486855e+00  3.686740e-01]
 [ 1.166820e+02  1.311110e+02  1.115550e+02  1.050000e-02  9.000000e-05
   5.440000e-03  7.810000e-03  1.633000e-02  5.233000e-02  4.820000e-01
   2.757000e-02  3.858000e-02  3.590000e-02  8.270000e-02  1.309000e-02
   2.065100e+01

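The rows printed above show features on very different scales: the frequency columns are around 1e+02, while `MDVP:Jitter(Abs)` is around 1e-05. If the regression is later fitted with a gradient-based method, one common option is z-score normalization. The sketch below is an assumption and not a step taken from this notebook. The helper `zscore_normalize` is hypothetical, and the demo matrix uses three columns (`MDVP:Fo(Hz)`, `MDVP:Jitter(%)`, `MDVP:Jitter(Abs)`) from the first three recordings.

```python
import numpy as np


def zscore_normalize(X):
    """Rescale each column of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Guard against constant columns, which would give a division by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma, mu, sigma


# Three columns taken from the first three rows printed above.
X_demo = np.array([
    [119.992, 0.00784, 0.00007],
    [122.400, 0.00968, 0.00008],
    [116.682, 0.01050, 0.00009],
])
X_norm, mu, sigma = zscore_normalize(X_demo)
print(X_norm)
```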
In [5]:
# Print the shape, type, and first elements of y_train
print(f"y Shape: {y_train.shape}, y Type: {type(y_train)}")
print("Type of y_train:", type(y_train))
print("First five elements of y_train are:\n", y_train[:5])

y Shape: (195,), y Type: <class 'numpy.ndarray'>
Type of y_train: <class 'numpy.ndarray'>
First five elements of y_train are:
 [1 1 1 1 1]

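Because 23 of the 31 people in the dataset have PD, the labels are probably imbalanced at the recording level as well. A quick way to check is to count the labels, as in the sketch below. The array `y_demo` is made up for illustration; on the real data, `y_train` would take its place.

```python
import numpy as np

# Made-up labels for illustration: 0 = healthy, 1 = PD.
y_demo = np.array([1, 1, 1, 0, 1, 0])

# counts[0] is the number of healthy labels, counts[1] the number of PD labels.
counts = np.bincount(y_demo, minlength=2)
pd_fraction = counts[1] / counts.sum()

print(f"healthy: {counts[0]}, PD: {counts[1]}, PD fraction: {pd_fraction:.2f}")
```

A strong imbalance means that plain accuracy can look good even for a model that always predicts the majority class.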

## Visualizing Data
Here is the plot for the multiple-variable regression: