

<img src="http://cdn2.hubspot.net/hubfs/650754/Stock_Images_(no_text)/parkinsons.jpg" style="width:700px">


In this notebook we will study a Parkinson's Disease data set.

In [1]:
# Print the data set description; the context manager closes the file afterwards
with open('UCIMLR/Parkinsons/parkinsons.txt', 'r') as f:
    print(f.read())

Title: Parkinsons Disease Data Set

Abstract: Oxford Parkinson's Disease Detection Dataset

-----------------------------------------------------	

Data Set Characteristics: Multivariate
Number of Instances: 197
Area: Life
Attribute Characteristics: Real
Number of Attributes: 23
Date Donated: 2008-06-26
Associated Tasks: Classification
Missing Values? N/A

-----------------------------------------------------	

Source:

The dataset was created by Max Little of the University of Oxford, in 
collaboration with the National Centre for Voice and Speech, Denver, 
Colorado, who recorded the speech signals. The original study published the 
feature extraction methods for general voice disorders.

-----------------------------------------------------

Data Set Information:

This dataset is composed of a range of biomedical voice measurements from 
31 people, 23 with Parkinson's disease (PD). Each column in the table is a 
particular voice measure, and each row corresponds one of 195 voice 
rec

## Loading in the libraries


In [2]:
import numpy as np
import pandas as pa
import sklearn
import matplotlib.pylab as py
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap

%matplotlib inline

## Introduction

## Reading the data

The first thing to do with any data set is to take a look at it, so let's begin our analysis by reading in the data.  This was not as trivial as we had originally thought, since we kept running into errors.  The problem was that:

<b> we needed to make sure the data was in the correct directory</b>

The easiest thing to do is to keep the data in the same directory as the Jupyter notebook.
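One way to make this kind of error easier to diagnose is to check that the file exists before handing the path to pandas. The sketch below is only an illustration: the helper name `read_csv_checked` is hypothetical and not part of pandas, and it assumes the only thing going wrong is a missing or misplaced file.

```python
from pathlib import Path

import pandas as pa


def read_csv_checked(path):
    """Read a CSV with pandas, failing early with a clear message if the file is missing.

    Hypothetical helper for illustration; not part of pandas.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the working directory, which is usually what is wrong
        raise FileNotFoundError(
            f"{csv_path} not found; current working directory is {Path.cwd()}"
        )
    return pa.read_csv(csv_path)
```

Calling `read_csv_checked('UCIMLR/Parkinsons/parkinsons.csv')` would then either return the data frame or report which directory the notebook is actually running from.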

In [3]:
data = pa.read_csv('UCIMLR/Parkinsons/parkinsons.csv')

In [4]:
data.head()

|   | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | MDVP:Shimmer(dB) | ... | Shimmer:DDA | NHR | HNR | RPDE | DFA | spread1 | spread2 | D2 | PPE | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 119.992 | 157.302 | 74.997 | 0.00784 | 0.00007 | 0.00370 | 0.00554 | 0.01109 | 0.04374 | 0.426 | ... | 0.06545 | 0.02211 | 21.033 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 | 1 |
| 1 | 122.400 | 148.650 | 113.819 | 0.00968 | 0.00008 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | 0.626 | ... | 0.09403 | 0.01929 | 19.085 | 0.458359 | 0.819521 | -4.075192 | 0.335590 | 2.486855 | 0.368674 | 1 |
| 2 | 116.682 | 131.111 | 111.555 | 0.01050 | 0.00009 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | 0.482 | ... | 0.08270 | 0.01309 | 20.651 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 | 1 |
| 3 | 116.676 | 137.871 | 111.366 | 0.00997 | 0.00009 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | 0.517 | ... | 0.08771 | 0.01353 | 20.644 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 | 1 |
| 4 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | 0.584 | ... | 0.10470 | 0.01767 | 19.649 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.332180 | 0.410335 | 1 |


The size of our data is:

In [6]:
data.shape

(195, 23)

This means we have 195 measurements in total, and each measurement has 23 values.  For example, our first measurement is:

In [7]:
data.iloc[0,:]

MDVP:Fo(Hz)         119.992000
MDVP:Fhi(Hz)        157.302000
MDVP:Flo(Hz)         74.997000
MDVP:Jitter(%)        0.007840
MDVP:Jitter(Abs)      0.000070
MDVP:RAP              0.003700
MDVP:PPQ              0.005540
Jitter:DDP            0.011090
MDVP:Shimmer          0.043740
MDVP:Shimmer(dB)      0.426000
Shimmer:APQ3          0.021820
Shimmer:APQ5          0.031300
MDVP:APQ              0.029710
Shimmer:DDA           0.065450
NHR                   0.022110
HNR                  21.033000
RPDE                  0.414783
DFA                   0.815285
spread1              -4.813031
spread2               0.266482
D2                    2.301442
PPE                   0.284654
status                1.000000
Name: 0, dtype: float64
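Each record keeps the voice measures and the `status` field side by side in one row. Before modelling, it is common to separate the two. The sketch below is an illustration only: it assumes `status` is the class label, rebuilds two records from the `data.head()` output using just three of the measurement columns, and the helper name `split_features_target` is hypothetical.

```python
import pandas as pa

# Two records copied from the data.head() output, restricted to a few columns
sample = pa.DataFrame(
    {
        "MDVP:Fo(Hz)": [119.992, 122.400],
        "HNR": [21.033, 19.085],
        "PPE": [0.284654, 0.368674],
        "status": [1, 1],
    }
)


def split_features_target(frame, target="status"):
    """Return (features, labels), assuming `target` names the label column."""
    features = frame.drop(columns=[target])
    labels = frame[target].astype(int)
    return features, labels


X, y = split_features_target(sample)
print(X.shape, y.tolist())
```

The same call on the full `data` frame would give a feature table with one column fewer than `data.shape` reports, plus a separate series of labels.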

Observe that we want to predict if a person has Parkinson's or not based on 