## Ensemble Techniques Project

### Steps and tasks:
1. Load the dataset
2. It is always a good practice to eye-ball raw data to get a feel of the data in terms of number of records, structure of the file, number of attributes, types of attributes and a general idea of likely challenges in the dataset. Mention a few comments in this regard (5 points)
3. Using univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, relationships between variables etc. mention your observations (15 points)
4. Split the dataset into training and test set in the ratio of 70:30 (Training:Test) (5 points)
5. Prepare the data for training - Scale the data if necessary, get rid of missing values (if any) etc (5 points)
6. Train at least 3 standard classification algorithms - Logistic Regression, Naive Bayes’, SVM, k-NN etc, and note down their accuracies on the test data (10 points)
7. Train a meta-classifier and note the accuracy on test data (10 points)
8. Train at least one standard Ensemble model - Random forest, Bagging, Boosting etc, and note the accuracy (10 points)
9. Compare all the models (minimum 5) and pick the best one among them (10 points)


In [None]:
# Importing the necessary libraries
import numpy                            as np                        # importing numpy library
import pandas                           as pd                        # importing pandas library
import seaborn                          as sns                       # For Data Visualization 
import matplotlib.pyplot                as plt                       # Necessary module for plotting purpose
import warnings                                                      # importing warning library

# render plots inline in the Jupyter notebook
%matplotlib inline                             
warnings.filterwarnings('ignore')                                    # for ignoring warnings in notebook

import statsmodels.api                  as sm                        # importing statsmodel api
from sklearn import model_selection                                  # For model_selection
from sklearn.model_selection            import train_test_split      # For train-test split

# getting methods for confusion matrix, F1 score, Accuracy Score
from sklearn import metrics                                          
from sklearn.metrics                    import confusion_matrix,f1_score,accuracy_score,classification_report,roc_curve,auc,average_precision_score
from sklearn.linear_model               import LogisticRegression    # For logistic Regression
from sklearn.naive_bayes                import GaussianNB            # For Naive Bayes classifier
from sklearn.neighbors                  import KNeighborsClassifier  # For K-NN Classifier
from sklearn.svm                        import SVC                   # For support vector machine based classifier

## Scaling
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### 1. Load the dataset

In [None]:
pdDataOrg = pd.read_csv("../input/parkinson-disease-detection/Parkinsson disease.csv")        # using pandas read_csv function to load the dataset into pdDataOrg
pdDataOrg.head()                                    # showing the top 5 rows of pdDataOrg

In [None]:
'''
To make the columns of the pdDataOrg df easier to work with, the following changes are applied:
    a. pushing target column i.e 'status' to last column
    b. converting all column names in lower case
    c. replacing spaces in column names with '_'
    d. replacing ':' in column names with '_'
    e. replacing '(' in column names with '_'
    f. replacing ')' in column names with '' i.e blank
    g. replacing '%' in column names with 'in_percent'
'''

pdData = pdDataOrg.copy()                                               # creating a copy of pdDataOrg into pdData

targetCol = 'status'                                                    # defining target column
targetColDf = pdData.pop(targetCol)                                     # popping target column from pdData df
pdData.insert(len(pdData.columns),targetCol, targetColDf)               # inserting target column to last column

# deleting variables that were used for changing column position of target column
del targetCol 
del targetColDf

# converting column names into lower case
pdData.columns = [c.lower() for c in pdData.columns]
# replacing spaces in column names with '_'
pdData.columns = [c.replace(' ', '_') for c in pdData.columns]
# replacing ':' in column names with '_'
pdData.columns = [c.replace(':', '_') for c in pdData.columns]
# replacing '(' in column names with '_'
pdData.columns = [c.replace('(', '_') for c in pdData.columns]
# replacing ')' in column names with '' i.e blank
pdData.columns = [c.replace(')', '') for c in pdData.columns]
# replacing '%' in column names with 'in_percent'
pdData.columns = [c.replace('%', 'in_percent') for c in pdData.columns]

# to check the above printing top 5 rows
pdData.head()
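
The chained replacements in the cell above can also be written as one small helper. The following is only a sketch of that alternative, not the notebook's own code: `clean_column_name` and `RULES` are hypothetical names, and the stand-in frame uses a few of the original column names listed in the attribute information.

```python
import pandas as pd

# Replacement rules in the same order as the cell above
# (hypothetical helper, not part of the original notebook).
RULES = [(' ', '_'), (':', '_'), ('(', '_'), (')', ''), ('%', 'in_percent')]

def clean_column_name(name):
    """Lower-case a column name and apply the replacement rules in order."""
    name = name.lower()
    for old, new in RULES:
        name = name.replace(old, new)
    return name

# Empty stand-in frame that only carries a few of the original column names.
demo = pd.DataFrame(columns=['name', 'MDVP:Fo(Hz)', 'MDVP:Jitter(%)', 'Jitter:DDP', 'status'])
demo = demo.rename(columns=clean_column_name)
print(list(demo.columns))
```

Keeping the rules in one list makes the order of the replacements explicit, which matters because '(' is turned into '_' before ')' is removed.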

### 2. It is always a good practice to eye-ball raw data to get a feel of the data in terms of number of records, structure of the file, number of attributes, types of attributes and a general idea of likely challenges in the dataset. Mention a few comments in this regard (5 points)

**Attribute Information:**
1. **name** - ASCII subject name and recording number
2. **mdvp_fo_hz** - Average vocal fundamental frequency (Actual column name MDVP:Fo(Hz))
3. **mdvp_fhi_hz** - Maximum vocal fundamental frequency (Actual column name MDVP:Fhi(Hz))
4. **mdvp_flo_hz** - Minimum vocal fundamental frequency (Actual column name MDVP:Flo(Hz))
5. **mdvp_jitter_in_percent, mdvp_jitter_abs, mdvp_rap, mdvp_ppq, jitter_ddp** - Several measures of variation in fundamental frequency (Actual column names MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP respectively)
6. **mdvp_shimmer, mdvp_shimmer_db, shimmer_apq3, shimmer_apq5, mdvp_apq, shimmer_dda** - Several measures of variation in amplitude (Actual column names MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA respectively)
7. **nhr, hnr** - Two measures of the ratio of noise to tonal components in the voice (Actual column names NHR, HNR respectively)
8. **rpde, d2** - Two nonlinear dynamical complexity measures (Actual column names RPDE, D2 respectively)
9. **dfa** - Signal fractal scaling exponent (Actual column name DFA)
10. **spread1, spread2, ppe** - Three nonlinear measures of fundamental frequency variation (Actual column names spread1, spread2, PPE respectively)
11. **status** - Health status of the subject: (one) - Parkinson's, (zero) - healthy (**Target variable / attribute**)

In [None]:
print('\033[1mThe Parkinson\'s disease dataset having "{0}" rows and "{1}" columns\033[0m.'.format(pdData.shape[0],pdData.shape[1]))

In [None]:
pdData.info()

**The 'name' attribute is set as the index of the pdData dataframe because it has no significance for identifying whether a patient has Parkinson's disease, i.e. for the 'status' column [Health status of the subject: (one) - Parkinson's, (zero) - healthy].**

In [None]:
# setting name column as index column
pdData.set_index('name',inplace=True)

In [None]:
# 'name' is now the index, so there is one column fewer; printing the number of rows and columns again to confirm
print('\033[1mAfter setting \'name\' column as index of the Dataset,\033[0m now there are \033[1m"{0}"\033[0m Rows and \033[1m"{1}"\033[0m Columns in the given Dataset.'.format(pdData.shape[0],pdData.shape[1]))

In [None]:
# printing top 5 rows once again to check
pd.options.display.max_columns = None
pdData.head()

In [None]:
# printing datatypes of each columns of the dataset

print("\033[1m*"*100)
print("a.\nColumn_Names        Data_Types")
print("*"*30)
print("\033[0m{0}\033[1m".format(pdData.dtypes))
print("*"*30)
print()

# printing No of Columns having different Types of Datatype

print("*"*100)
print("b.\nNumber of Columns with each DataTypes as follows :")
print("*"*50)
print("Data_Types     No_of_Columns\033[0m")
print("*"*30)
print(pdData.dtypes.value_counts())
print("\033[1m*"*30)
print("\033[0m")

# printing Different Column Names of the dataset

print("\033[1m*"*100)
print("c.\nEach Column Names of the dataset")
print("*"*80)
print("\033[0m{0}\033[1m".format(pdData.columns))
print("*"*80)
print("\033[0m")

**After observing the dataset and the column descriptions, we can conclude the following:**
* **The columns have only two datatypes: int64 and float64. (The 'name' column had the object datatype and was set as the index of the dataframe.)**
* **Only the 'status' column is int64; all remaining columns are float64.**
* **All columns except 'status' are numeric columns.**
* **The 'status' column is a nominal categorical column with a binary response.**

In [None]:
# checking missing values in dataset for each attributes / columns 

print("\033[1m*"*100)
print("Column_Name       No_of_Missing_Values")
print("*"*50)
print("\033[0m{0}".format(pdData.isnull().sum()))
print("\033[1m*"*50)
print()

# checking if any duplicate rows available in the dataset

print("*"*100)
print("Showing Duplicate rows if any in the dataset: ")
print("*"*50)
print("\033[0m{0}".format(pdData[pdData.duplicated()]))
print("\033[1m*"*100)
print("\033[0m")

**As shown above, (a.) there are no missing values and (b.) there are no duplicate rows in the given dataset.**
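
For reference, the two checks reduce to a pair of one-liners. The sketch below runs them on a tiny made-up frame (not the Parkinson's dataset) so that both a missing value and a duplicate row actually show up; `demo`, `missing_per_column` and `n_duplicates` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value and one repeated row.
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 3.0],
    'b': [1, 2, 3, 3],
})

# Count of missing values in each column.
missing_per_column = demo.isnull().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(demo.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```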


In [None]:
# Summary statistics (including the five-number summary) of each attribute
pdData.describe().T

In [None]:
# checking skewness of the data
pdData.skew().sort_values(ascending=False)

**From the above we understand the following:**
* The independent variables are measured in different units (e.g. Hz, dB, % and absolute values), so the ranges of the feature values differ widely. The data therefore needs to be scaled.
* Approximately symmetric distribution (skewness comparatively close to 0):
    * MDVP:Fo(Hz)
    * spread1
    * spread2
    * PPE
* Negative skewness (the tail is longer on the left-hand side of the distribution):
    * HNR
    * status (binary target, so its skewness only reflects the class imbalance)
    * RPDE
    * DFA
* Positive skewness (the tail is longer on the right-hand side of the distribution): all other attributes, many of them strongly.
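
The grouping above can also be produced programmatically. The sketch below is an illustration on synthetic data, not on the Parkinson's dataset, and the cut-offs of 0.5 and 1.0 are an assumed rule of thumb rather than something defined in this notebook; `skew_category` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def skew_category(value):
    """Label a skewness value using assumed cut-offs of 0.5 and 1.0."""
    size = abs(value)
    if size < 0.5:
        return 'approximately symmetric'
    direction = 'positive' if value > 0 else 'negative'
    strength = 'moderately' if size < 1.0 else 'highly'
    return f'{strength} {direction}'

# Synthetic stand-in columns with known shapes.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'symmetric': rng.normal(size=2000),
    'right_tailed': rng.exponential(size=2000),
    'left_tailed': -rng.exponential(size=2000),
})

# One row per column: the skewness value and its label.
skew_table = demo.skew().to_frame('skewness')
skew_table['category'] = skew_table['skewness'].apply(skew_category)
print(skew_table.sort_values('skewness', ascending=False))
```

Applied to the real data, the same function could be mapped over `pdData.skew()`; the labels depend entirely on the chosen cut-offs.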

### 3. Using univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, relationships between variables etc. mention your observations (15 points)

#### A. 'mdvp_fo_hz' attribute : (MDVP:Fo(Hz) - Average vocal fundamental frequency )

In [None]:
feature = 'mdvp_fo_hz'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),4))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From the above we can understand the following about the Average vocal fundamental frequency (mdvp_fo_hz) attribute of the dataset:**
* The mean of the attribute is 154.2286 and its skewness is 0.5917, which shows that the distribution is slightly right / positively skewed.
* The data points are most concentrated between 110 and 130 Hz.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that most of the patients with Parkinson's disease have an Average vocal fundamental frequency (mdvp_fo_hz) between 90 and 190 Hz, although some healthy patients also have values between 110-130 Hz and 170-180 Hz.**
* Let's bucket Average vocal fundamental frequency (mdvp_fo_hz) and compare it across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [50,100,150,200,250,300]                                         # defining mdvp_fo_hz bins,
# defining labels of mdvp_fo_hz groups as per bins defined as above
mdvp_fo_hz_group = ['mdvp_fo_hz : 50-100', 'mdvp_fo_hz : 100-150', 'mdvp_fo_hz : 150-200', 'mdvp_fo_hz : 200-250', 'mdvp_fo_hz : 250-300']
pdData_mdvp_fo_hz_bin = pd.cut(pdData.mdvp_fo_hz,bins,labels=mdvp_fo_hz_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_fo_hz_group_col variable
mdvp_fo_hz_group_col = pd.crosstab(pdData_mdvp_fo_hz_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_fo_hz_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different mdvp_fo_hz group
mdvp_fo_hz_group_col.div(mdvp_fo_hz_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_fo_hz group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All the patients in the 50-100 group of Average vocal fundamental frequency (mdvp_fo_hz) have Parkinson's disease.**
    * **The 150-200 group has the second highest share of Parkinson's patients (88.525 %), followed by the 100-150 group (80.435 %).**
    * **In the 200-250 group, 65.625 % of the patients are Healthy.**
    * **All the patients in the 250-300 group are Healthy.**
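
The row-wise percentages read off above come from binning with `pd.cut` and then normalising each crosstab row. As a hedged sketch on a small made-up sample (not the real dataset), the same table can be obtained in one step with the `normalize='index'` option of `pd.crosstab`; the column name `freq_hz` and the values are illustrative only.

```python
import pandas as pd

# Small made-up sample: a frequency in Hz and a 0/1 status.
demo = pd.DataFrame({
    'freq_hz': [80, 95, 120, 140, 160, 210, 230, 260],
    'status':  [1,  1,  1,   0,   1,   0,   1,   0],
})

bins = [50, 100, 150, 200, 250, 300]
labels = ['50-100', '100-150', '150-200', '200-250', '250-300']

# pd.cut assigns each value to a right-closed interval, e.g. (50, 100].
freq_group = pd.cut(demo['freq_hz'], bins=bins, labels=labels)

# normalize='index' divides each row by its row total, giving row-wise shares.
row_pct = pd.crosstab(freq_group, demo['status'], normalize='index') * 100
print(row_pct.round(1))
```

Because every row already sums to 100, a stacked bar chart can be drawn from `row_pct` directly with `row_pct.plot(kind='bar', stacked=True)`.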
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**There are no outliers present in the 'mdvp_fo_hz' feature / attribute, as we can see from the above boxplot.**
* Now we will check whether any outliers are present within each value of the target attribute, i.e. 'status'

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From the above, it is observed that there are no outliers in the mdvp_fo_hz attribute for either value of the 'status' attribute.**

#### B. 'mdvp_fhi_hz' attribute : (MDVP:Fhi(Hz) - Maximum vocal fundamental frequency )

In [None]:
feature = 'mdvp_fhi_hz'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),4))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From the above we can understand the following about the Maximum vocal fundamental frequency (mdvp_fhi_hz) attribute of the dataset:**
* The mean of the attribute is 197.1049 and its skewness is 2.5421, which shows that the distribution is highly right / positively skewed.
* The data points are most concentrated between 100 and 260 Hz.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that most of the patients with Parkinson's disease have a Maximum vocal fundamental frequency (mdvp_fhi_hz) between 100 and 210 Hz.**
* Let's bucket Maximum vocal fundamental frequency (mdvp_fhi_hz) and compare it across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [100,200,300,400,500,600]                                         # defining mdvp_fhi_hz bins,
# defining labels of mdvp_fhi_hz groups as per bins defined as above
mdvp_fhi_hz_group = ['mdvp_fhi_hz : 100-200', 'mdvp_fhi_hz : 200-300', 'mdvp_fhi_hz : 300-400', 'mdvp_fhi_hz : 400-500',
                     'mdvp_fhi_hz : 500-600']
pdData_mdvp_fhi_hz_bin = pd.cut(pdData[feature],bins,labels=mdvp_fhi_hz_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_fhi_hz_group_col variable
mdvp_fhi_hz_group_col = pd.crosstab(pdData_mdvp_fhi_hz_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_fhi_hz_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different mdvp_fhi_hz group
mdvp_fhi_hz_group_col.div(mdvp_fhi_hz_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_fhi_hz group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All the patients in the 400-500 group of Maximum vocal fundamental frequency (mdvp_fhi_hz) have Parkinson's disease.**
    * **The 100-200 group has the second highest share of Parkinson's patients (87.069 %), followed by the 500-600 group (60.000 %).**
    * **In the 200-300 group, 55.224 % of the patients have Parkinson's disease.**
    * **Exactly half of the patients in the 300-400 group are Healthy.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_fhi_hz' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers, i.e. values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),3),round(np.median(pdData[feature]),3),round(IQR,3))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)
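
Since the same fence calculation is repeated for several attributes, it can be wrapped in a reusable function. The sketch below assumes the usual 1.5 × IQR rule used in the cell above; `iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values with one obviously extreme point (not the real dataset).
demo = pd.Series([100, 102, 104, 106, 108, 110, 112, 114, 400])
found = iqr_outliers(demo)
print(found.tolist())
```

With such a helper, each attribute would need only one call, for example `iqr_outliers(pdData[feature])`.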

* Now we will check whether any outliers are present within each value of the target attribute, i.e. 'status'

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_fhi_hz' attribute patients with Parkinson's disease have more outliers than Healthy patients.**

#### C. 'mdvp_flo_hz' attribute : (MDVP:Flo(Hz) - Minimum vocal fundamental frequency )

In [None]:
feature = 'mdvp_flo_hz'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),4))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From the above we can understand the following about the Minimum vocal fundamental frequency (mdvp_flo_hz) attribute of the dataset:**
* The mean of the attribute is 116.3246 and its skewness is 1.2174, which shows that the distribution is highly right / positively skewed.
* The data points are most concentrated between 65 and 120 Hz.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that most of the patients with Parkinson's disease have a Minimum vocal fundamental frequency (mdvp_flo_hz) between 60 and 110 Hz.**
* Let's bucket Minimum vocal fundamental frequency (mdvp_flo_hz) and compare it across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [50,100,150,200,250]                                         # defining mdvp_flo_hz bins,
# defining labels of mdvp_flo_hz groups as per bins defined as above
mdvp_flo_hz_group = ['mdvp_flo_hz : 50-100', 'mdvp_flo_hz : 100-150', 'mdvp_flo_hz : 150-200', 'mdvp_flo_hz : 200-250']
pdData_mdvp_flo_hz_bin = pd.cut(pdData[feature],bins,labels=mdvp_flo_hz_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_flo_hz_group_col variable
mdvp_flo_hz_group_col = pd.crosstab(pdData_mdvp_flo_hz_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_flo_hz_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different mdvp_flo_hz group
mdvp_flo_hz_group_col.div(mdvp_flo_hz_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_flo_hz group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **The 50-100 group of Minimum vocal fundamental frequency (mdvp_flo_hz) has the highest share of Parkinson's patients (83.146 %).**
    * **The 100-150 group has the second highest share of Parkinson's patients (80.000 %), followed by the 150-200 group (70.833 %).**
    * **All the patients in the 200-250 group are Healthy.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_flo_hz' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers, i.e. values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),3),round(np.median(pdData[feature]),3),round(IQR,3))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each value of the target attribute, i.e. 'status'

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From the above, it is observed that within each 'status' group (Healthy or Parkinson's) the 'mdvp_flo_hz' attribute has no outliers, whereas the combined data points do. The reason is that patients with Parkinson's disease tend to have a lower Minimum vocal fundamental frequency while Healthy patients tend to have a higher one, as we can deduce from the above boxplot.**
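
The effect described here, outliers in the pooled data that disappear within each group, can be reproduced on a toy example. The sketch below uses made-up numbers (a large low-valued group and a small high-valued group), not the Parkinson's dataset, and `count_iqr_outliers` is a hypothetical helper based on the same 1.5 × IQR rule.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return int(((series < q1 - k * iqr) | (series > q3 + k * iqr)).sum())

# Made-up data: nine low values with status 1 and three high values with status 0.
demo = pd.DataFrame({
    'value':  [70, 75, 80, 85, 90, 95, 100, 105, 110, 200, 210, 220],
    'status': [1] * 9 + [0] * 3,
})

# Fences computed on all rows together.
pooled = count_iqr_outliers(demo['value'])

# Fences computed separately inside each status group.
per_group = demo.groupby('status')['value'].apply(count_iqr_outliers)

print("pooled outliers:", pooled)
print(per_group)
```

The pooled fences are dominated by the larger group, so part of the smaller, higher-valued group falls outside them even though neither group has extreme values of its own.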

#### D. 'mdvp_jitter_in_percent' attribute : (MDVP:Jitter(%) - One of the measures of variation in fundamental frequency )

In [None]:
feature = 'mdvp_jitter_in_percent'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),4))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From the above we can understand the following about the mdvp_jitter_in_percent (MDVP:Jitter(%)) attribute of the dataset:**
* The mean of the attribute is 0.0062 and its skewness is 3.0849, which shows that the distribution is highly right / positively skewed.
* The data points are most concentrated between 0.001 and 0.007.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_jitter_in_percent (MDVP:Jitter(%)) values greater than 0.005 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_jitter_in_percent (MDVP:Jitter(%)) and compare it across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.001,0.005,0.010,0.015,0.020,0.025,0.030,0.035]                                         # defining mdvp_jitter_in_percent bins,
# defining labels of mdvp_jitter_in_percent groups as per bins defined as above
mdvp_jitter_in_percent_group = ['0.001-0.005', '0.005-0.010', '0.010-0.015', '0.015-0.020', '0.020-0.025', '0.025-0.030',
                                '0.030-0.035']
pdData_mdvp_jitter_in_percent_bin = pd.cut(pdData[feature],bins,labels=mdvp_jitter_in_percent_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_jitter_in_percent_group_col variable
mdvp_jitter_in_percent_group_col = pd.crosstab(pdData_mdvp_jitter_in_percent_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_jitter_in_percent_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different mdvp_jitter_in_percent group
mdvp_jitter_in_percent_group_col.div(mdvp_jitter_in_percent_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_jitter_in_percent group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All the patients in the mdvp_jitter_in_percent (MDVP:Jitter(%)) groups from 0.015 upwards have Parkinson's disease.**
    * **The 0.005-0.010 group has the second highest share of Parkinson's patients (89.189 %), followed by the 0.010-0.015 group (87.500 %).**
    * **In the 0.001-0.005 group, 61.765 % of the patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_jitter_in_percent' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers, i.e. values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),3),round(np.median(pdData[feature]),3),round(IQR,3))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each value of the target attribute, i.e. 'status'

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_jitter_in_percent' attribute patients with Parkinson's disease have more outliers than Healthy patients.**

#### E. 'mdvp_jitter_abs' attribute : (MDVP:Jitter(Abs) - one of the measures of variation in fundamental frequency)

In [None]:
feature = 'mdvp_jitter_abs'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_jitter_abs (MDVP:Jitter(Abs)) attribute of the dataset:**
* The mean of the attribute is 0.000044 and its skewness is 2.6491, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.00001 and 0.00004.
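
To make the skewness figure above more concrete, the sketch below recomputes the statistic by hand on a small made-up sample and compares it with `Series.skew()`. `Series.skew()` is assumed here to return the bias-corrected sample skewness; the final comparison checks that assumption numerically. The sample values and the helper name `sample_skewness` are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Bias-corrected sample skewness computed from sums of deviations."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.sum(d ** 2)                     # sum of squared deviations
    m3 = np.sum(d ** 3)                     # sum of cubed deviations
    return n * np.sqrt(n - 1) / (n - 2) * m3 / m2 ** 1.5

toy = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 10.0])   # made-up, long right tail
manual = sample_skewness(toy)
print(round(manual, 4), round(toy.skew(), 4))       # the two values should agree
print(toy.mean() > toy.median())                    # the tail pulls the mean above the median
```

A positive value means the long tail sits on the right, which matches the shape seen in the distribution plots.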

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_jitter_abs (MDVP:Jitter(Abs)) values greater than 0.00002 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_jitter_abs (MDVP:Jitter(Abs)) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
# defining mdvp_jitter_abs bins; note that values <= 0.00004 fall outside these bins and are left out of the crosstab
bins = [0.00004,0.00010,0.00015,0.00020,0.00026]
# defining labels of mdvp_jitter_abs groups as per bins defined as above
mdvp_jitter_abs_group = ['0.00004-0.00010', '0.00010-0.00015', '0.00015-0.00020', '0.00020-0.00026']
pdData_mdvp_jitter_abs_bin = pd.cut(pdData[feature],bins,labels=mdvp_jitter_abs_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_jitter_abs_group_col variable
mdvp_jitter_abs_group_col = pd.crosstab(pdData_mdvp_jitter_abs_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_jitter_abs_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different mdvp_jitter_abs groups
mdvp_jitter_abs_group_col.div(mdvp_jitter_abs_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_jitter_abs group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_jitter_abs (MDVP:Jitter(Abs)) groups above 0.00010 have Parkinson's disease.**
    * **In the 0.00004-0.00010 group, 96.552% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_jitter_abs' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_jitter_abs' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
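
Because the bucketing step above relies on `pd.cut`, which assigns `NaN` to any value outside the outermost bin edges, and `pd.crosstab` then leaves those rows out, it can be worth confirming how many records a set of bins actually covers before reading the percentages. A minimal sketch on made-up values (the series `toy` is an illustrative assumption, not the notebook's data):

```python
import pandas as pd

toy = pd.Series([0.00001, 0.00003, 0.00004, 0.00005, 0.00012, 0.00030])
bins = [0.00004, 0.00010, 0.00015, 0.00020, 0.00026]

binned = pd.cut(toy, bins)                    # intervals are right-closed: (low, high]
n_uncovered = int(binned.isna().sum())        # values that fall in no interval
print(f"{n_uncovered} of {len(toy)} values fall outside the bins")
print(toy[binned.isna()].tolist())
```

Note that a value equal to the first edge is also excluded unless `include_lowest=True` is passed to `pd.cut`.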

#### F. 'mdvp_rap' attribute : (MDVP:RAP - one of the measures of variation in fundamental frequency)

In [None]:
feature = 'mdvp_rap'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_rap (MDVP:RAP) attribute of the dataset:**
* The mean of the attribute is 0.0033 and its skewness is 3.3607, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.001 and 0.004.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_rap (MDVP:RAP) values greater than 0.002 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_rap (MDVP:RAP) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.000,0.005,0.010,0.015,0.020,0.025]                                         # defining mdvp_rap bins,
# defining labels of mdvp_rap groups as per bins defined as above
mdvp_rap_group = ['0.000-0.005', '0.005-0.010', '0.010-0.015', '0.015-0.020', '0.020-0.025']
pdData_mdvp_rap_bin = pd.cut(pdData[feature],bins,labels=mdvp_rap_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_rap_group_col variable
mdvp_rap_group_col = pd.crosstab(pdData_mdvp_rap_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_rap_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different mdvp_rap groups
mdvp_rap_group_col.div(mdvp_rap_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_rap group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_rap (MDVP:RAP) groups above 0.010 have Parkinson's disease.**
    * **In the 0.005-0.010 group, 94.444% of patients have Parkinson's disease, followed by 72.353% in the 0.000-0.005 group.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_rap' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_rap' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
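
The row-percentage table built above with `apply(lambda r: r/r.sum()*100, axis=1)` can also be obtained through the `normalize='index'` argument of `pd.crosstab`. The sketch below shows both on a small made-up frame; the frame `toy`, its values, and the shortened bin list are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "mdvp_rap": [0.001, 0.002, 0.004, 0.006, 0.007, 0.012],
    "status":   [0,     1,     0,     1,     1,     1],
})
bins = [0.000, 0.005, 0.010, 0.015]
labels = ["0.000-0.005", "0.005-0.010", "0.010-0.015"]
groups = pd.cut(toy["mdvp_rap"], bins, labels=labels)

# row percentages via the lambda used in the notebook
pct_lambda = pd.crosstab(groups, toy["status"]).apply(lambda r: r / r.sum() * 100, axis=1)
# the same table via the built-in row normalisation
pct_builtin = pd.crosstab(groups, toy["status"], normalize="index") * 100
print(pct_builtin.round(3))
```

Since each row of either table already sums to 100, a stacked bar chart can be drawn from it directly.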

#### G. 'mdvp_ppq' attribute : (MDVP:PPQ - one of the measures of variation in fundamental frequency)

In [None]:
feature = 'mdvp_ppq'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_ppq (MDVP:PPQ) attribute of the dataset:**
* The mean of the attribute is 0.0034 and its skewness is 3.0739, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.0010 and 0.0025.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_ppq (MDVP:PPQ) values greater than 0.0025 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_ppq (MDVP:PPQ) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.000,0.005,0.010,0.015,0.020]                                         # defining mdvp_ppq bins,
# defining labels of mdvp_ppq groups as per bins defined as above
mdvp_ppq_group = ['0.000-0.005', '0.005-0.010', '0.010-0.015', '0.015-0.020']
pdData_mdvp_ppq_bin = pd.cut(pdData[feature],bins,labels=mdvp_ppq_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_ppq_group_col variable
mdvp_ppq_group_col = pd.crosstab(pdData_mdvp_ppq_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_ppq_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different mdvp_ppq groups
mdvp_ppq_group_col.div(mdvp_ppq_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_ppq group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_ppq (MDVP:PPQ) groups above 0.010 have Parkinson's disease.**
    * **In the 0.005-0.010 group, 94.737% of patients have Parkinson's disease, followed by 72.353% in the 0.000-0.005 group; surprisingly, the latter is exactly the figure found for the same group of the mdvp_rap (MDVP:RAP) attribute.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_ppq' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_ppq' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
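
An identical percentage for the 0.000-0.005 group of mdvp_rap and mdvp_ppq means the two bins hold the same class proportions; it does not by itself show that they hold the same records. One way to look into this is to cross-tabulate the two binned attributes against each other. A minimal sketch on made-up values (the frame `toy` and its numbers are illustrative assumptions, not results from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "mdvp_rap": [0.002, 0.003, 0.004, 0.006, 0.008, 0.012],
    "mdvp_ppq": [0.002, 0.004, 0.006, 0.004, 0.009, 0.011],
})
bins = [0.000, 0.005, 0.010, 0.015]
labels = ["0.000-0.005", "0.005-0.010", "0.010-0.015"]

rap_bin = pd.cut(toy["mdvp_rap"], bins, labels=labels)
ppq_bin = pd.cut(toy["mdvp_ppq"], bins, labels=labels)

# rows: mdvp_rap group, columns: mdvp_ppq group;
# off-diagonal cells count records that land in different groups
overlap = pd.crosstab(rap_bin, ppq_bin)
same_bin = int((rap_bin == ppq_bin).sum())
print(overlap)
print(f"{same_bin} of {len(toy)} records fall in the same group for both attributes")
```

In this toy frame both attributes place three records in the lowest group, yet only two of those records are shared, which is the situation the check is meant to reveal.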

#### H. 'jitter_ddp' attribute : (Jitter:DDP - one of the measures of variation in fundamental frequency)

In [None]:
feature = 'jitter_ddp'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about jitter_ddp (Jitter:DDP) attribute of the dataset:**
* The mean of the attribute is 0.0099 and its skewness is 3.3621, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.002 and 0.012.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with jitter_ddp (Jitter:DDP) values greater than 0.008 are more likely to have Parkinson's disease.**
* Let's bucket jitter_ddp (Jitter:DDP) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00,0.02,0.04,0.06,0.08]                                         # defining jitter_ddp bins (last edge matches the '0.06-0.08' label)
# defining labels of jitter_ddp groups as per bins defined as above
jitter_ddp_group = ['0.00-0.02', '0.02-0.04', '0.04-0.06', '0.06-0.08']
pdData_jitter_ddp_bin = pd.cut(pdData[feature],bins,labels=jitter_ddp_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to jitter_ddp_group_col variable
jitter_ddp_group_col = pd.crosstab(pdData_jitter_ddp_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(jitter_ddp_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different jitter_ddp groups
jitter_ddp_group_col.div(jitter_ddp_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different jitter_ddp group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the jitter_ddp (Jitter:DDP) groups above 0.02 have Parkinson's disease.**
    * **In the 0.00-0.02 group, 73.480% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'jitter_ddp' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'jitter_ddp' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
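
The comparison of outliers between the two classes above is read off the box plot; it can also be counted. The sketch below applies the same 1.5 × IQR rule within each `status` group. The helper `count_iqr_outliers` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Number of values outside Q1 - k*IQR and Q3 + k*IQR of `series`."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return int(((series < q1 - k * iqr) | (series > q3 + k * iqr)).sum())

toy = pd.DataFrame({
    "status":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "jitter_ddp": [0.004, 0.005, 0.005, 0.006, 0.007,
                   0.006, 0.008, 0.009, 0.010, 0.011, 0.060],
})
# fences are computed separately for each class, as in a grouped box plot
per_class = toy.groupby("status")["jitter_ddp"].apply(count_iqr_outliers)
print(per_class)
```

Because the classes differ in size, dividing each count by the group size may give a fairer comparison than the raw counts.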

#### I. 'mdvp_shimmer' attribute : (MDVP:Shimmer - one of the measures of variation in amplitude)

In [None]:
feature = 'mdvp_shimmer'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_shimmer (MDVP:Shimmer) attribute of the dataset:**
* The mean of the attribute is 0.0297 and its skewness is 1.6665, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.009 and 0.02.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_shimmer (MDVP:Shimmer) values greater than 0.025 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_shimmer (MDVP:Shimmer) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00,0.02,0.04,0.06,0.08,0.10,0.12]                                         # defining mdvp_shimmer bins,
# defining labels of mdvp_shimmer groups as per bins defined as above
mdvp_shimmer_group = ['0.00-0.02', '0.02-0.04', '0.04-0.06', '0.06-0.08', '0.08-0.10', '0.10-0.12']
pdData_mdvp_shimmer_bin = pd.cut(pdData[feature],bins,labels=mdvp_shimmer_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_shimmer_group_col variable
mdvp_shimmer_group_col = pd.crosstab(pdData_mdvp_shimmer_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_shimmer_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different mdvp_shimmer groups
mdvp_shimmer_group_col.div(mdvp_shimmer_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_shimmer group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_shimmer (MDVP:Shimmer) groups above 0.06 have Parkinson's disease.**
    * **In the 0.04-0.06 group, 96.296% of patients have Parkinson's disease, followed by 83.562% in the 0.02-0.04 group.**
    * **In the 0.00-0.02 group, 55.128% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_shimmer' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_shimmer' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
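
In the bucketing cells above the group labels are typed by hand alongside the bin edges, which leaves room for a label and its edge to drift apart. As a possible alternative, the labels can be derived from the edges. A minimal sketch, assuming a hypothetical helper name `labels_from_bins` and two decimal places; the values in `toy` are made up:

```python
import pandas as pd

def labels_from_bins(bins, decimals=2):
    """Build 'low-high' labels from consecutive bin edges."""
    return [f"{lo:.{decimals}f}-{hi:.{decimals}f}" for lo, hi in zip(bins[:-1], bins[1:])]

bins = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12]
labels = labels_from_bins(bins)
print(labels)

toy = pd.Series([0.015, 0.035, 0.11])          # made-up values
toy_groups = pd.cut(toy, bins, labels=labels).tolist()
print(toy_groups)
```

For attributes on a finer scale, such as mdvp_rap, `decimals=3` would give labels in the same style as those used there.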

#### J. 'mdvp_shimmer_db' attribute : (MDVP:Shimmer(dB) - one of the measures of variation in amplitude)

In [None]:
feature = 'mdvp_shimmer_db'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_shimmer_db (MDVP:Shimmer(dB)) attribute of the dataset:**
* The mean of the attribute is 0.2823 and its skewness is 1.9994, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.008 and 0.35.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_shimmer_db (MDVP:Shimmer(dB)) values greater than 0.25 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_shimmer_db (MDVP:Shimmer(dB)) and compare the groups by status, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50]                                         # defining mdvp_shimmer_db bins,
# defining labels of mdvp_shimmer_db groups as per bins defined as above
mdvp_shimmer_db_group = ['0.00-0.25', '0.25-0.50', '0.50-0.75', '0.75-1.00', '1.00-1.25', '1.25-1.50']
pdData_mdvp_shimmer_db_bin = pd.cut(pdData[feature],bins,labels=mdvp_shimmer_db_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_shimmer_db_group_col variable
mdvp_shimmer_db_group_col = pd.crosstab(pdData_mdvp_shimmer_db_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_shimmer_db_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for the different mdvp_shimmer_db groups
mdvp_shimmer_db_group_col.div(mdvp_shimmer_db_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_shimmer_db group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_shimmer_db (MDVP:Shimmer(dB)) groups above 0.50 have Parkinson's disease.**
    * **In the 0.25-0.50 group, 93.220% of patients have Parkinson's disease, followed by 61.404% in the 0.00-0.25 group.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_shimmer_db' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
# flagging outliers with the 1.5 * IQR rule: values below the lower fence (Q1 - 1.5 * IQR)
# or above the upper fence (Q3 + 1.5 * IQR)
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_shimmer_db' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
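
The mean and skewness quoted for each attribute in the sections above can also be collected in one table, which makes the attributes easier to compare side by side. A minimal sketch on a made-up frame (the values are illustrative assumptions; with the real data the same call would be applied to the relevant columns of `pdData`):

```python
import pandas as pd

toy = pd.DataFrame({
    "mdvp_shimmer":    [0.010, 0.015, 0.020, 0.030, 0.090],
    "mdvp_shimmer_db": [0.10, 0.15, 0.20, 0.30, 1.20],
})
# one row per attribute, one column per statistic
summary = toy.agg(["mean", "median", "skew"]).T
# a mean above the median is a quick hint of a right tail
summary["mean_gt_median"] = summary["mean"] > summary["median"]
print(summary)
```

The `mean_gt_median` column is only a rough indicator; the `skew` column remains the figure to quote.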

#### K. 'shimmer_apq3' attribute : (Shimmer:APQ3 - one of the measures of variation in amplitude)

In [None]:
feature = 'shimmer_apq3'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about shimmer_apq3 (Shimmer:APQ3) attribute of the dataset:**
* The mean of the attribute is 0.0157 and its skewness is 1.5806, which shows that the distribution is highly right (positively) skewed.
* Most data points lie between 0.004 and 0.0175.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with shimmer_apq3 (Shimmer:APQ3) values greater than 0.015 are more likely to have Parkinson's disease.**
* Let's bucket shimmer_apq3 (Shimmer:APQ3) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]                                         # defining shimmer_apq3 bins,
# defining labels of shimmer_apq3 groups as per bins defined as above
shimmer_apq3_group = ['0.00-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05', '0.05-0.06']
pdData_shimmer_apq3_bin = pd.cut(pdData[feature],bins,labels=shimmer_apq3_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to shimmer_apq3_group_col variable
shimmer_apq3_group_col = pd.crosstab(pdData_shimmer_apq3_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(shimmer_apq3_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different shimmer_apq3 groups
shimmer_apq3_group_col.div(shimmer_apq3_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different shimmer_apq3 group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the shimmer_apq3 (Shimmer:APQ3) groups above 0.03 have Parkinson's disease.**
    * **In the 0.02-0.03 group, 96.667% of patients have Parkinson's disease, followed by the 0.01-0.02 group with 76.389%.**
    * **In the 0.00-0.01 group, 58.904% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'shimmer_apq3' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'shimmer_apq3' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
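
The outlier check used in this section repeats the same quartile arithmetic for every attribute. As a hedged sketch, that logic could be wrapped in one small function; the name `iqr_outliers`, the parameter `k` and the sample values below are hypothetical and only illustrate the 1.5 * IQR fence rule already used in the cells above.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)          # lower / first quartile
    q3 = series.quantile(0.75)          # upper / third quartile
    iqr = q3 - q1                       # inter-quartile range
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return series[(series < lower_fence) | (series > upper_fence)]

# Synthetic values (not taken from the dataset) with one obviously large entry
values = pd.Series([0.010, 0.012, 0.011, 0.013, 0.012, 0.014, 0.011, 0.060])
found = iqr_outliers(values)

# number of outliers, share of data points in percent, and the outlying values
print(len(found), round(100 * len(found) / len(values), 3), found.tolist())
```

With the notebook's dataframe, a call such as `iqr_outliers(pdData[feature])` would be expected to return the same values as the `outliers` variable computed in the cells above.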

#### L. 'shimmer_apq5' attribute : (Shimmer:APQ5 - One of the measures of variation in amplitude)

In [None]:
feature = 'shimmer_apq5'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about shimmer_apq5 (Shimmer:APQ5) attribute of the dataset:**
* The mean of the attribute is 0.0179 and its skewness is 1.7987, which shows that the distribution is highly right / positively skewed.
* Most data points lie between 0.004 and 0.02.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with shimmer_apq5 (Shimmer:APQ5) values greater than 0.015 are more likely to have Parkinson's disease.**
* Let's bucket shimmer_apq5 (Shimmer:APQ5) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08]                                         # defining shimmer_apq5 bins,
# defining labels of shimmer_apq5 groups as per bins defined as above
shimmer_apq5_group = ['0.00-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05', '0.05-0.06', '0.06-0.08']
pdData_shimmer_apq5_bin = pd.cut(pdData[feature],bins,labels=shimmer_apq5_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to shimmer_apq5_group_col variable
shimmer_apq5_group_col = pd.crosstab(pdData_shimmer_apq5_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(shimmer_apq5_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different shimmer_apq5 groups
shimmer_apq5_group_col.div(shimmer_apq5_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different shimmer_apq5 group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the shimmer_apq5 (Shimmer:APQ5) groups above 0.03 have Parkinson's disease.**
    * **In the 0.02-0.03 group, 96.154% of patients have Parkinson's disease, followed by the shimmer_apq5 (Shimmer:APQ5) 0.01-0.02 group with 72.093%.**
    * **In the 0.00-0.01 group, 58.182% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'shimmer_apq5' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'shimmer_apq5' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
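
The bucketing step used in this section (cut the attribute into bins, then take row-wise percentages per status) can also be written with the `normalize` argument of `pd.crosstab`. The sketch below uses a small made-up dataframe called `toy` in place of `pdData`; the values are invented purely to show the mechanics.

```python
import pandas as pd

# Synthetic stand-in for pdData: one feature column and a 0/1 status column
toy = pd.DataFrame({
    "feature": [0.005, 0.008, 0.012, 0.015, 0.018, 0.025, 0.028, 0.035],
    "status":  [0,     1,     0,     1,     1,     1,     1,     1],
})

bins = [0.00, 0.01, 0.02, 0.03, 0.04]
labels = ["0.00-0.01", "0.01-0.02", "0.02-0.03", "0.03-0.04"]
groups = pd.cut(toy["feature"], bins, labels=labels)   # segment the feature into buckets

# normalize="index" makes each row sum to 1; multiplying by 100 gives percentages
pct = pd.crosstab(groups, toy["status"], normalize="index") * 100
print(pct.round(3))
```

Because every row of `pct` already sums to 100, calling `pct.plot(kind='bar', stacked=True)` directly should give the same stacked proportions as dividing by the row sums first.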

#### M. 'mdvp_apq' attribute : (MDVP:APQ - One of the measures of variation in amplitude)

In [None]:
feature = 'mdvp_apq'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about mdvp_apq (MDVP:APQ) attribute of the dataset:**
* The mean of the attribute is 0.0241 and its skewness is 2.618, which shows that the distribution is highly right / positively skewed.
* Most data points lie between 0.007 and 0.03.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with mdvp_apq (MDVP:APQ) values greater than 0.02 are more likely to have Parkinson's disease.**
* Let's bucket mdvp_apq (MDVP:APQ) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.14]                                         # defining mdvp_apq bins,
# defining labels of mdvp_apq groups as per bins defined as above
mdvp_apq_group = ['0.00-0.02', '0.02-0.04', '0.04-0.06', '0.06-0.08', '0.08-0.10', '0.10-0.14']
pdData_mdvp_apq_bin = pd.cut(pdData[feature],bins,labels=mdvp_apq_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to mdvp_apq_group_col variable
mdvp_apq_group_col = pd.crosstab(pdData_mdvp_apq_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(mdvp_apq_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different mdvp_apq groups
mdvp_apq_group_col.div(mdvp_apq_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different mdvp_apq group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the mdvp_apq (MDVP:APQ) groups above 0.04 have Parkinson's disease.**
    * **In the 0.02-0.04 group, 98.246% of patients have Parkinson's disease, followed by the 0.00-0.02 group with 57.273%.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'mdvp_apq' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'mdvp_apq' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
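
The threshold observation made in this section (values above 0.02 being more often associated with Parkinson's disease) was read off the overlaid distribution plot. A numeric cross-check might look like the sketch below, which compares the share of status 1 on each side of a threshold. The dataframe `toy`, its values and the labels `above` / `at_or_below` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pdData (values are made up)
toy = pd.DataFrame({
    "feature": [0.010, 0.015, 0.018, 0.019, 0.022, 0.030, 0.045, 0.060],
    "status":  [0,     1,     0,     1,     1,     1,     1,     1],
})

threshold = 0.02
# label each row by the side of the threshold it falls on
side = np.where(toy["feature"] > threshold, "above", "at_or_below")

# the mean of a 0/1 column is the share of 1s; multiply by 100 for a percentage
share_pd = toy.groupby(side)["status"].mean() * 100
print(share_pd.round(1))
```

On the real data the same two lines could be run with `pdData[feature]` and `pdData['status']` to put a number on the visual impression.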

#### N. 'shimmer_dda' attribute : (Shimmer:DDA - One of the measures of variation in amplitude)

In [None]:
feature = 'shimmer_dda'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about shimmer_dda (Shimmer:DDA) attribute of the dataset:**
* The mean of the attribute is 0.0470 and its skewness is 1.5806, which shows that the distribution is highly right / positively skewed.
* Most data points lie between 0.013 and 0.06.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with shimmer_dda (Shimmer:DDA) values greater than 0.04 are more likely to have Parkinson's disease.**
* Let's bucket shimmer_dda (Shimmer:DDA) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.010, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150]                                         # defining shimmer_dda bins,
# defining labels of shimmer_dda groups as per bins defined as above
shimmer_dda_group = ['0.010-0.025', '0.025-0.050', '0.050-0.075', '0.075-0.100', '0.100-0.125', '0.125-0.150']
pdData_shimmer_dda_bin = pd.cut(pdData[feature],bins,labels=shimmer_dda_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to shimmer_dda_group_col variable
shimmer_dda_group_col = pd.crosstab(pdData_shimmer_dda_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(shimmer_dda_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different shimmer_dda groups
shimmer_dda_group_col.div(shimmer_dda_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different shimmer_dda group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the shimmer_dda (Shimmer:DDA) groups above 0.075 have Parkinson's disease.**
    * **In the 0.050-0.075 group, 96.970% of patients have Parkinson's disease, followed by the 0.025-0.050 group with 65.000%.**
    * **In the 0.010-0.025 group, 62.745% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'shimmer_dda' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'shimmer_dda' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
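
The skewness reported for shimmer_dda (1.5806) is identical to the value reported for shimmer_apq3, and its mean (0.0470) is roughly three times the shimmer_apq3 mean (0.0157). That pattern is what one would expect if one column were a positive multiple of the other, because skewness does not change under positive scaling. Whether this holds exactly in this dataset is an assumption that still needs checking; the sketch below uses synthetic data only, and the names `apq3_like` and `dda_like` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
apq3_like = pd.Series(rng.gamma(shape=2.0, scale=0.008, size=200))  # synthetic stand-in
dda_like = 3 * apq3_like                                            # assumed relationship

print(round(apq3_like.skew(), 4), round(dda_like.skew(), 4))   # skewness is unchanged by scaling
print(round(dda_like.mean() / apq3_like.mean(), 4))            # ratio of the two means
print(round(apq3_like.corr(dda_like), 4))                      # Pearson correlation of the pair

# On the real data one might inspect the ratio directly
# (assuming both columns exist in pdData):
# (pdData['shimmer_dda'] / pdData['shimmer_apq3']).describe()
```

If the ratio turns out to be constant, the two attributes carry the same information, which may be worth keeping in mind when preparing the features for the models.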

#### O. 'nhr' attribute : (NHR - Measure of the ratio of noise to tonal components in the voice)

In [None]:
feature = 'nhr'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about nhr (NHR) attribute of the dataset:**
* The mean of the attribute is 0.0248 and its skewness is 4.2207, which shows that the distribution is highly right / positively skewed.
* Most data points lie between 0.00 and 0.030.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with nhr (NHR) values greater than 0.02 are more likely to have Parkinson's disease.**
* Let's bucket nhr (NHR) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.32]                                         # defining nhr bins,
# defining labels of nhr groups as per bins defined as above
nhr_group = ['0.00-0.05', '0.05-0.10', '0.10-0.15', '0.15-0.20', '0.20-0.25', '0.25-0.32']
pdData_nhr_bin = pd.cut(pdData[feature],bins,labels=nhr_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to nhr_group_col variable
nhr_group_col = pd.crosstab(pdData_nhr_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(nhr_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different nhr groups
nhr_group_col.div(nhr_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different nhr group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the nhr (NHR) groups above 0.15 have Parkinson's disease.**
    * **In the 0.05-0.10 group, 90.000% of patients have Parkinson's disease, followed by the 0.10-0.15 group with 80.000%.**
    * **In the 0.00-0.05 group, 73.714% of patients have Parkinson's disease.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'nhr' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for 'nhr' attribute patients with Parkinson's disease have more outliers than Healthy patients.**
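
The bucketing cells in this notebook call `pd.cut` with its defaults, which build right-closed intervals such as (0.00, 0.05]. A value exactly equal to the lowest bin edge then falls outside every bin, becomes NaN and silently drops out of the crosstab. Whether any record in this dataset sits exactly on a lowest edge is not known here, so the sketch below uses made-up values to show the effect and the `include_lowest=True` option.

```python
import pandas as pd

values = pd.Series([0.00, 0.03, 0.05, 0.12])   # made-up values; the first sits on the lowest edge
bins = [0.00, 0.05, 0.10, 0.15]
labels = ["0.00-0.05", "0.05-0.10", "0.10-0.15"]

# default: intervals are open on the left and closed on the right, i.e. (a, b]
default_cut = pd.cut(values, bins, labels=labels)

# include_lowest=True also closes the first interval on the left, i.e. [a, b]
inclusive_cut = pd.cut(values, bins, labels=labels, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())   # unassigned values in each case
print(inclusive_cut.tolist())
```

A quick `pdData_nhr_bin.isna().sum()` after the bucketing cell would show whether any rows were left out of a bucket.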

#### P. 'hnr' attribute : (HNR - Measure of the ratio of noise to tonal components in the voice)

In [None]:
feature = 'hnr'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about hnr (HNR) attribute of the dataset:**
* The mean of the attribute is 21.8860 and its skewness is -0.5143, which shows that the distribution is slightly left / negatively skewed.
* Most data points lie between 18 and 27.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with hnr (HNR) values less than 22.5 are more likely to have Parkinson's disease.**
* Let's bucket hnr (HNR) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [8, 10, 15, 20, 25 , 30, 34]                                         # defining hnr bins,
# defining labels of hnr groups as per bins defined as above
hnr_group = ['8-10', '10-15', '15-20', '20-25', '25-30', '30-34']
pdData_hnr_bin = pd.cut(pdData[feature],bins,labels=hnr_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to hnr_group_col variable
hnr_group_col = pd.crosstab(pdData_hnr_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(hnr_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different hnr groups
hnr_group_col.div(hnr_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different hnr group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the hnr (HNR) groups below 15 have Parkinson's disease.**
    * **In the 15-20 group, 86.957% of patients have Parkinson's disease, followed by the 20-25 group with 77.778%.**
    * **In the 25-30 group, 60.417% of patients have Parkinson's disease.**
    * **All patients with hnr (HNR) above 30 are healthy.**
<br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'hnr' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers: values below the lower fence (lower quartile - 1.5 times IQR) or
above the upper fence (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether outliers are present for each value of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for the 'hnr' attribute, patients with Parkinson's disease have outliers only below the lower whisker, whereas healthy patients have outliers both below the lower whisker and above the upper whisker.**
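
The observation above distinguishes outliers on the low side from outliers on the high side for each status. As a hedged sketch, the same split can be counted rather than read off the boxplot, applying the 1.5 * IQR fences within each status group (which is how the per-status boxplot draws its whiskers). The helper `outlier_sides` and the dataframe `toy` are hypothetical, and the values are invented.

```python
import pandas as pd

def outlier_sides(series, k=1.5):
    """Count values below the lower fence and above the upper fence."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    below = int((series < q1 - k * iqr).sum())
    above = int((series > q3 + k * iqr).sum())
    return pd.Series({"below_lower_fence": below, "above_upper_fence": above})

# Synthetic stand-in for pdData (values are made up)
toy = pd.DataFrame({
    "status":  [0] * 8 + [1] * 8,
    "feature": [10.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 40.0,
                5.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0],
})

# one row per status, one column per side
sides = pd.DataFrame(
    {status: outlier_sides(group) for status, group in toy.groupby("status")["feature"]}
).T
print(sides)
```

Replacing `toy` with `pdData` and `"feature"` with the attribute name would give the per-status counts behind the boxplot for any of the attributes in this section.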

#### Q. 'rpde' attribute : (RPDE - Nonlinear dynamical complexity measure) 

In [None]:
feature = 'rpde'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about rpde (RPDE) attribute of the dataset:**
* The mean of the attribute is 0.4985 and its skewness is -0.1434, which shows that the skewness of the attribute is negligible.
* Most data points lie between 0.4 and 0.68.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with rpde (RPDE) values greater than 0.49 are more likely to have Parkinson's disease.**
* Let's bucket rpde (RPDE) and compare the buckets across the two statuses, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]                                         # defining rpde bins,
# defining labels of rpde groups as per bins defined as above
rpde_group = ['0.25-0.35', '0.35-0.45', '0.45-0.55', '0.55-0.65', '0.65-0.75']
pdData_rpde_bin = pd.cut(pdData[feature],bins,labels=rpde_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to rpde_group_col variable
rpde_group_col = pd.crosstab(pdData_rpde_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(rpde_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different rpde groups
rpde_group_col.div(rpde_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different rpde group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **The rpde (RPDE) 0.65-0.75 group has the highest share of Parkinson's patients, at 91.667%.**
    * **In the 0.55-0.65 group, 88.135% of patients have Parkinson's disease, followed by the 0.45-0.55 group with 78.571%.**
    * **In both the 0.25-0.35 and the 0.35-0.45 groups, 58.824% of patients have Parkinson's disease.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**There are no outliers present in the 'rpde' feature / attribute, as we can see from the above boxplot.**
* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that there are no outliers present in the rpde attribute for either 'status' class.**
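
The bucketing step used for rpde above (and repeated for the later attributes) can be sketched as one small helper. This is only an illustration: the function name `status_share_by_bucket` and the toy values are hypothetical, and it relies on `pd.crosstab(..., normalize="index")` to produce row percentages instead of the `apply` with a lambda.

```python
import pandas as pd

def status_share_by_bucket(df, feature, bins, target="status"):
    """Return, per bucket of `feature`, the percentage of rows in each target class."""
    # pd.cut builds right-closed intervals (a, b] from the bin edges
    buckets = pd.cut(df[feature], bins=bins)
    # normalize="index" divides every row by its row total, so each bucket sums to 100
    return pd.crosstab(buckets, df[target], normalize="index") * 100

# Hypothetical toy values, not the Parkinson's dataset
toy = pd.DataFrame({
    "rpde":   [0.30, 0.32, 0.40, 0.50, 0.52, 0.60, 0.70, 0.72],
    "status": [0,    1,    0,    1,    1,    1,    1,    0],
})
share = status_share_by_bucket(toy, "rpde", bins=[0.25, 0.45, 0.65, 0.75])
print(share.round(1))
```

Because every row already sums to 100, the result can be passed straight to `.plot(kind='bar', stacked=True)` without a further division.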

#### R. 'd2' attribute : (D2 - Nonlinear dynamical complexity measure) 

In [None]:
feature = 'd2'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about d2 (D2) attribute of the dataset:**
* The mean value of the attribute is 2.382 with a skewness of 0.4304, which shows that the skewness of the attribute is negligible.
* Most data points lie between 2.0 and 2.75.
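
As a rough illustration of how the skewness figure behaves, the sketch below calls `Series.skew()` (the same pandas method used for the plot titles) on two hypothetical samples. The values are made up for the example; a symmetric sample gives a skewness of zero, while a long right tail gives a clearly positive value.

```python
import pandas as pd

# Hypothetical samples chosen only to show the sign and size of the statistic
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

sym_skew = symmetric.skew()      # sample skewness of a symmetric sample
tail_skew = right_tailed.skew()  # sample skewness of a right-tailed sample
print(round(sym_skew, 4), round(tail_skew, 4))
```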

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with d2 (D2) values greater than 2.4 are more likely to have Parkinson's disease.**
* Let's bucket d2 (D2) and check it against the two status values, i.e. Healthy or Parkinson's:

In [None]:
bins = [1.0, 1.5, 2.0, 2.5, 3.0, 3.7]                                         # defining d2 bins,
# defining labels of d2 groups as per bins defined as above
d2_group = ['1.0-1.5', '1.5-2.0', '2.0-2.5', '2.5-3.0', '3.0-3.7']
pdData_d2_bin = pd.cut(pdData[feature],bins,labels=d2_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to d2_group_col variable
d2_group_col = pd.crosstab(pdData_d2_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(d2_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different d2 group
d2_group_col.div(d2_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different d2 group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the d2 (D2) group above 3.0 have Parkinson's disease.**
    * **The d2 (D2) group 2.5-3.0 has 90.385% Parkinson's patients, followed by the group 2.0-2.5 with 70.588%.**
    * **All patients in the d2 (D2) group below 1.5 are healthy.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'd2' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers with the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
values above (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for the 'd2' attribute, patients with Parkinson's disease have outliers above the upper whisker, whereas healthy patients have outliers below the lower whisker.**
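
The 1.5 × IQR check used in the cell above is repeated for several attributes, so it could be wrapped in a function. A minimal sketch, assuming a helper named `iqr_outliers` (hypothetical, not part of the notebook) and toy values:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)   # lower / first quartile
    q3 = series.quantile(0.75)   # upper / third quartile
    iqr = q3 - q1                # inter-quartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical toy values with one obviously extreme point
toy = pd.Series([2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 9.0])
found = iqr_outliers(toy)
print(found.tolist())   # [9.0]
```

With the notebook's data this would be called as `iqr_outliers(pdData[feature])`.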

#### S. 'dfa' attribute : (DFA - Signal fractal scaling exponent) 

In [None]:
feature = 'dfa'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about dfa (DFA) attribute of the dataset:**
* Mean value of the attribute is 0.7180 with skewness of -0.0332, which shows that the skewness of the attribute is negligible.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with dfa (DFA) values greater than 0.68 are more likely to have Parkinson's disease.**
* Let's bucket dfa (DFA) and check it against the two status values, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.5, 0.6, 0.7, 0.8, 0.9]                                         # defining dfa bins,
# defining labels of dfa groups as per bins defined as above
dfa_group = ['0.5-0.6', '0.6-0.7', '0.7-0.8', '0.8-0.9']
pdData_dfa_bin = pd.cut(pdData[feature],bins,labels=dfa_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to dfa_group_col variable
dfa_group_col = pd.crosstab(pdData_dfa_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(dfa_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different dfa group
dfa_group_col.div(dfa_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different dfa group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the dfa (DFA) groups above 0.8 and below 0.6 have Parkinson's disease.**
    * **The dfa (DFA) group 0.7-0.8 has 81.731% Parkinson's patients, followed by the group 0.6-0.7 with 60.811%.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**There are no outliers present in the 'dfa' feature / attribute, as we can see from the above boxplot.**
* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that there are no outliers present in the dfa attribute for either 'status' class.**
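
The per-status boxplots are read by eye. A numeric cross-check could count the points beyond 1.5 × IQR within each status group, which is the rule seaborn's boxplot uses by default for its whiskers. The sketch below is an assumption-laden illustration: `outlier_count_by_group` is a hypothetical helper and the data frame is toy data.

```python
import pandas as pd

def outlier_count_by_group(df, feature, group="status", k=1.5):
    """Count, per group, the values of `feature` beyond k * IQR from that group's quartiles."""
    def count(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())
    # quartiles are computed separately inside each group, as in a grouped boxplot
    return df.groupby(group)[feature].apply(count)

# Hypothetical toy values: only the status = 1 group contains an extreme point
toy = pd.DataFrame({
    "status": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "dfa":    [0.60, 0.61, 0.62, 0.63, 0.64, 0.70, 0.71, 0.72, 0.73, 0.99],
})
counts = outlier_count_by_group(toy, "dfa")
print(counts)
```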

#### T. 'spread1' attribute : (Nonlinear measures of fundamental frequency variation ) 

In [None]:
feature = 'spread1'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about spread1 attribute of the dataset:**
* Mean value of the attribute is -5.6843 with skewness of 0.4321, which shows that the skewness of the attribute is negligible.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with spread1 values greater than -6.2 are more likely to have Parkinson's disease.**
* Let's bucket spread1 and check it against the two status values, i.e. Healthy or Parkinson's:

In [None]:
bins = [-8,-6,-4,-2]                                         # defining spread1 bins,
# defining labels of spread1 groups as per bins defined as above
spread1_group = ['-8 : -6', '-6 : -4', '-4 : -2']
pdData_spread1_bin = pd.cut(pdData[feature],bins,labels=spread1_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to spread1_group_col variable
spread1_group_col = pd.crosstab(pdData_spread1_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(spread1_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different spread1 group
spread1_group_col.div(spread1_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different spread1 group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the spread1 group above -4.0 have Parkinson's disease.**
    * **The spread1 group -6.0 to -4.0 has 93.070% Parkinson's patients, followed by the group -8.0 to -6.0 with 49.383%.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'spread1' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers with the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
values above (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for the 'spread1' attribute, patients with Parkinson's disease have outliers above the upper whisker, whereas healthy patients have no outliers.**

#### U. 'spread2' attribute : (Nonlinear measures of fundamental frequency variation)

In [None]:
feature = 'spread2'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about spread2 attribute of the dataset:**
* Mean value of the attribute is 0.2265 with skewness of 0.1444, which shows that the skewness of the attribute is negligible.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with spread2 values greater than 0.21 are more likely to have Parkinson's disease.**
* Let's bucket spread2 and check it against the two status values, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.0, 0.1, 0.2, 0.3, 0.4,0.5]                                         # defining spread2 bins,
# defining labels of spread2 groups as per bins defined as above
spread2_group = ['0.0-0.1', '0.1-0.2', '0.2-0.3', '0.3-0.4', '0.4-0.5']
pdData_spread2_bin = pd.cut(pdData[feature],bins,labels=spread2_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to spread2_group_col variable
spread2_group_col = pd.crosstab(pdData_spread2_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(spread2_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different spread2 group
spread2_group_col.div(spread2_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different spread2 group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the spread2 group above 0.3 have Parkinson's disease.**
    * **The spread2 group 0.2-0.3 has 86.585% Parkinson's patients, followed by the group 0.1-0.2 with 53.333%.**
    * **The spread2 group 0.0-0.1 has 35.714% Parkinson's patients.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'spread2' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers with the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
values above (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for the 'spread2' attribute, healthy patients have outliers.**

#### V. 'ppe' attribute : (PPE - Nonlinear measures of fundamental frequency variation ) 

In [None]:
feature = 'ppe'
meanData = 'Mean : ' + str(round(pdData[feature].mean(),6))        # variable to contain mean of the attribute
skewData = 'Skewness : ' + str(round(pdData[feature].skew(),4))    # variable to contain skewness of the attribute
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
fig = sns.distplot(pdData[feature], bins=30, kde=True)             # seaborn distplot to examine distribution of the feature
plt.title("Distribution of feature : "+feature+" having "+meanData+" and "+skewData)   # setting title of the figure
plt.show()

**From above we can understand the following about ppe (PPE) attribute of the dataset:**
* The mean value of the attribute is 0.2066 with a skewness of 0.7975, which shows that the data points of the attribute are slightly right (positively) skewed.
* Most data points lie between 0.1 and 0.27.

In [None]:
plt.figure(figsize=(10,5))                                         # setting figure size with width = 10 and height = 5
# seaborn distplot to examine distribution of the feature of healthy patient
fig = sns.distplot(pdData[pdData['status'] == 0][feature], bins=30, kde=True, label='Healthy')
# seaborn distplot to examine distribution of the feature of Parkinson's patient
fig = sns.distplot(pdData[pdData['status'] == 1][feature], bins=30, kde=True, label='Parkinson\'s')
plt.legend()
plt.title("Distribution of feature : "+feature)                    # setting title of the figure
plt.show()

**From the above we can observe that patients with ppe (PPE) values greater than 0.16 are more likely to have Parkinson's disease.**
* Let's bucket ppe (PPE) and check it against the two status values, i.e. Healthy or Parkinson's:

In [None]:
bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]                                         # defining ppe bins,
# defining labels of ppe groups as per bins defined as above
ppe_group = ['0.0-0.1', '0.1-0.2', '0.2-0.3', '0.3-0.4', '0.4-0.5']
pdData_ppe_bin = pd.cut(pdData[feature],bins,labels=ppe_group)  # segmenting data as per bins defined

# putting into pandas crosstab and applying lambda function to take percentage and assigning to ppe_group_col variable
ppe_group_col = pd.crosstab(pdData_ppe_bin,pdData.status).apply(lambda r: r/r.sum()*100, axis=1)
print(ppe_group_col)                                                    # printing above crosstab

# plotting a stacked bar chart to show PD status for different ppe group
ppe_group_col.div(ppe_group_col.sum(1).astype(float), axis=0).plot(kind='bar',stacked=True)
plt.title("PD status with different ppe group")                     # setting title of the figure

- We can find out the following from the above crosstab:
    * **All patients in the ppe (PPE) group above 0.3 have Parkinson's disease.**
    * **The ppe (PPE) group 0.2-0.3 has 93.750% Parkinson's patients, followed by the group 0.1-0.2 with 67.470%.**
    * **The ppe (PPE) group 0.0-0.1 has very few Parkinson's patients, only 5.556%.**
    <br><br>
- Let's check outliers for the attribute :

In [None]:
ax = sns.boxplot(x=pdData[feature])        # seaborn boxplot to examine outliers of the feature

**In the 'ppe' attribute some outliers are present, let's check for the same :**

In [None]:
Q1 = pdData[feature].quantile(0.25)        # evaluating lower / first quartile
Q3 = pdData[feature].quantile(0.75)        # evaluating upper / third quartile
IQR = Q3 - Q1                              # evaluating Inter Quartile Range i.e IQR
'''
finding outliers with the 1.5 * IQR rule: values below (lower quartile - 1.5 times IQR) or
values above (upper quartile + 1.5 times IQR)
'''
outliers = pdData[((pdData[feature] < (Q1 - 1.5 * IQR)) |(pdData[feature] > (Q3 + 1.5 * IQR)))][feature]

print("*"*125)
# printing mean, median and IQR for the feature
print("\033[1mFeature {0} : Mean = {1}, Median = {2} and Inter-Quartile-Range (IQR) = {3}\033[0m"
      .format(feature,round(np.mean(pdData[feature]),6),round(np.median(pdData[feature]),6),round(IQR,6))
     )
print()
print("*"*125)
# printing No of outliers, percentage of the data points are outliers and the values of the outliers
print("There are \033[1m{0} outliers\033[0m ({1} % of the data points) in \033[1m{2}\033[0m feature and the values are \033[1m{3}\033[0m"
.format(outliers.shape[0],round(((outliers.shape[0]/pdData[feature].shape[0])*100),3),feature,outliers.tolist()))
print("*"*125)

* Now we will check whether any outliers are present within each class of the target attribute, i.e. 'status':

In [None]:
sns.boxplot(x=pdData['status'],y=pdData[feature]) 

**From above, it is observed that for the 'ppe' attribute, patients with Parkinson's disease have more outliers than healthy patients.**

#### W. 'status' attribute : (Health status of the subject (one) - Parkinson's, (zero) - healthy) 

In [None]:
plt.figure(figsize=(10,5))                                 # setting figure size with width = 10 and height = 5
# seaborn countplot to examine distribution of the status (axes-level, so it draws on the figure created above)
ax = sns.countplot(x='status', data=pdData)
plt.title("Distribution of column : 'Status'")      # setting title of the figure
y = []                                                     # creating a null or empty array
for val in range(pdData.status.nunique()):        # looping for number of unique values in the status
    # appending count of each unique values from status to array y
    y.append(pdData.groupby(pdData.status,sort=False)['status'].count()[val])
for i, v in enumerate(y):                                  # looping count of each unique value in the status
    # including count of each unique values in the plot 
    plt.annotate(str(v), xy=(i,float(v)), xytext=(i-0.1, v+3), color='black', fontweight='bold')

* Let's check the percentages and plot a pie chart to show them:

In [None]:
plt.figure(figsize=(5,5))                               # setting figure size with width = 5 and height = 5
# pandas pie chart to examine distribution of the status
pdData.groupby(['status']).status.count().plot(kind='pie',labels=['Healthy : 0','Parkinson\'s : 1'],
                                                               startangle=90, autopct='%1.1f%%')
plt.title("Distribution of column : 'status'")   # setting title of the figure

**From above we can see that out of 195 patients, 48 (24.6%) are healthy and 147 (75.4%) have Parkinson's disease.**
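
The same percentages can be read off without a plot. The sketch below rebuilds a status column from the two counts quoted above (147 and 48) purely for illustration. It also shows why accuracy alone can mislead on this split: a model that always predicts Parkinson's would already match the majority share.

```python
import pandas as pd

# Status column rebuilt from the counts quoted in the text: 147 Parkinson's (1), 48 healthy (0)
status = pd.Series([1] * 147 + [0] * 48, name="status")

# value_counts(normalize=True) returns class proportions; scale to percentages
share = status.value_counts(normalize=True).mul(100).round(1)
print(share)

# accuracy of a trivial classifier that always predicts the majority class
majority_baseline = share.max()
print(majority_baseline)
```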

In [None]:
sns.pairplot(pdData,hue='status',diag_kind='hist')

In [None]:
plt.figure(figsize=(20,7))
# create a mask so we only see the correlation values once
mask = np.zeros_like(pdData.corr())
mask[np.triu_indices_from(mask, 1)] = True
a = sns.heatmap(pdData.corr(),mask=mask, annot=True, fmt='.2f')
rotx = a.set_xticklabels(a.get_xticklabels(), rotation=90)

**We can observe from the above pairplot and heatmap of correlation of different attributes:**
* **mdvp_jitter_in_percent (MDVP:Jitter(%)) has a high correlation with mdvp_jitter_abs (MDVP:Jitter(Abs)), mdvp_rap (MDVP:RAP), mdvp_ppq (MDVP:PPQ), jitter_ddp (Jitter:DDP) and nhr (NHR).**
* **mdvp_jitter_abs (MDVP:Jitter(Abs)) has a high correlation with mdvp_rap (MDVP:RAP), mdvp_ppq (MDVP:PPQ) and jitter_ddp (Jitter:DDP).**
* **mdvp_rap (MDVP:RAP) has a high correlation with mdvp_ppq (MDVP:PPQ), jitter_ddp (Jitter:DDP) and nhr (NHR).**
* **mdvp_ppq (MDVP:PPQ) has a high correlation with jitter_ddp (Jitter:DDP).**
* **jitter_ddp (Jitter:DDP) has a high correlation with nhr (NHR).**
* **mdvp_shimmer (MDVP:Shimmer) has a high correlation with mdvp_shimmer_db (MDVP:Shimmer(dB)), shimmer_apq3 (Shimmer:APQ3), shimmer_apq5 (Shimmer:APQ5), mdvp_apq (MDVP:APQ) and shimmer_dda (Shimmer:DDA).**
* **mdvp_shimmer_db (MDVP:Shimmer(dB)) has a high correlation with shimmer_apq3 (Shimmer:APQ3), shimmer_apq5 (Shimmer:APQ5), mdvp_apq (MDVP:APQ) and shimmer_dda (Shimmer:DDA).**
* **shimmer_apq3 (Shimmer:APQ3) has a high correlation with shimmer_apq5 (Shimmer:APQ5), mdvp_apq (MDVP:APQ) and shimmer_dda (Shimmer:DDA).**
* **shimmer_apq5 (Shimmer:APQ5) has a high correlation with mdvp_apq (MDVP:APQ) and shimmer_dda (Shimmer:DDA).**
* **mdvp_apq (MDVP:APQ) has a high correlation with shimmer_dda (Shimmer:DDA).**
* **spread1 has a high correlation with ppe (PPE).**
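
The pairs listed above were read off the heatmap. A programmatic listing could look like the sketch below. Note the assumptions: the helper name `high_corr_pairs`, the 0.9 cutoff and the toy columns are all hypothetical, since the notebook does not state which threshold it treats as "high".

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the notebook's data this would be called on the numeric columns of `pdData`, with the threshold chosen to match the reading of the heatmap.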

### 4. Split the dataset into training and test set in the ratio of 70:30 (Training:Test) (5 points)

In [None]:
#Split the data into training and test set in the ratio of 70:30 respectively
X = pdData.drop(['status'],axis=1)
y = pdData['status']

# split data into train subset and test subset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

# checking the dimensions of the train & test subset
# printing dimension of train set
print(X_train.shape)
# printing dimension of test set
print(X_test.shape)
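
Since roughly three quarters of the records belong to one class, a stratified split is one option for keeping the class ratio the same in both subsets. The sketch below is an alternative to the cell above, not what it does: it uses synthetic arrays (`X_toy`, `y_toy` are hypothetical) and the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with a 75:25 class ratio, similar in spirit to the status column
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.array([1] * 150 + [0] * 50)

# stratify=y_toy keeps the class proportions (approximately) equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=47, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of class 1 in each subset
```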

### 5. Prepare the data for training - Scale the data if necessary, get rid of missing values (if any) etc (5 points)

* **As we have seen earlier, there are no missing values in the dataset**
* **From the earlier correlation heatmap we found that mdvp_jitter_in_percent (MDVP:Jitter(%)) has a high correlation with mdvp_jitter_abs (MDVP:Jitter(Abs)), mdvp_rap (MDVP:RAP), mdvp_ppq (MDVP:PPQ), jitter_ddp (Jitter:DDP) and nhr (NHR). So, in this case we will drop mdvp_jitter_in_percent (MDVP:Jitter(%)).**

In [None]:
X_train.drop(['mdvp_jitter_in_percent'],axis=1,inplace=True)
X_test.drop(['mdvp_jitter_in_percent'],axis=1,inplace=True)

* **Also from the earlier correlation heatmap we found that mdvp_shimmer (MDVP:Shimmer) has a high correlation with mdvp_shimmer_db (MDVP:Shimmer(dB)), shimmer_apq3 (Shimmer:APQ3), shimmer_apq5 (Shimmer:APQ5), mdvp_apq (MDVP:APQ) and shimmer_dda (Shimmer:DDA). So, in this case we will drop mdvp_shimmer (MDVP:Shimmer).**

In [None]:
X_train.drop(['mdvp_shimmer'],axis=1,inplace=True)
X_test.drop(['mdvp_shimmer'],axis=1,inplace=True)

* **Also we will drop hnr (HNR).**

In [None]:
X_train.drop(['hnr'],axis=1,inplace=True)
X_test.drop(['hnr'],axis=1,inplace=True)

In [None]:
# re checking the dimensions of the train & test subset after dropping several columns from the subsets
# printing dimension of train set
print(X_train.shape)
# printing dimension of test set
print(X_test.shape)

In [None]:
# Let us scale train as well as test data using StandardScaler
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
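# transform the test data with the mean / std learned from the training data;
# fitting the scaler again on the test set would leak test statistics into the evaluation
X_test_scaled = scaler.transform(X_test)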
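
One way to keep the scaling statistics tied to the training data is to put the scaler and the model into a single scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` (the names `X_demo`, `y_demo` and `pipe` are hypothetical) and checks that the scaler inside the pipeline has learned its means from the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=47)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=47)

# fit() scales with training statistics; predict()/score() reuse them on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=47))
pipe.fit(X_tr, y_tr)

scaler_in_pipe = pipe.named_steps["standardscaler"]
print(np.allclose(scaler_in_pipe.mean_, X_tr.mean(axis=0)))   # True: means come from the training rows
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```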

### 6. Train at least 3 standard classification algorithms - Logistic Regression, Naive Bayes’, SVM, k-NN etc, and note down their accuracies on the test data (10 points)

### A. Logistic Regression:

In [None]:
# Train and Fit model
lr = LogisticRegression(random_state=0)
lr.fit(X_train_scaled, y_train)

#predict status for X_test_scaled dataset 
lr_y_pred = lr.predict(X_test_scaled)

# Confusion Matrix for the Logistic Regression Model
print("Confusion Matrix : Logistic Regression")
print(confusion_matrix(y_test,lr_y_pred))

# Classification Report for the Logistic Regression Model
classRep = classification_report(y_test, lr_y_pred, digits=2)
print(classRep)

**From the above Logistic Regression Model, we can find out the following details:**
* **Accuracy of the model:- 86%**
* **Recall of the model:- 91%**
* **Precision of the model:- 91%**
* **F1-Score of the model:- 91%**
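
For reference, the four figures reported for each model can be derived from a confusion matrix. The sketch below uses hypothetical labels (not the notebook's predictions) and treats status = 1 as the positive class, which is the scikit-learn default for binary `precision_score`, `recall_score` and `f1_score`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels (1 = Parkinson's, 0 = healthy)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# for binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)      # share of all predictions that are correct
precision = tp / (tp + fp)                      # share of predicted positives that are right
recall = tp / (tp + fn)                         # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
# the hand-computed values agree with the scikit-learn helpers
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)))
```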

### B. K-nearest neighbors:

**First let's find out the value of neighbors.**

In [None]:
# creating odd list of K for KNN
myList = list(range(3,40,2))

# creating empty list for F1 scores of different values of K
f1ScoreList = []

# compute the F1 score for k = 3, 5, ..., 39
for k in myList:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    # predict the response
    y_pred = knn.predict(X_test_scaled)
    # evaluate F1 Score
    f1Score = f1_score(y_test, y_pred)
    f1ScoreList.append(f1Score)

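# converting each F1 score to an error-style value (1 - F1);
# despite the variable name, this is neither a mean squared error nor the misclassification rate
MSE = [1 - x for x in f1ScoreList]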

# determining best k
bestk = myList[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % bestk)
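
The loop above scores every candidate k on the test set. An alternative that leaves the test set untouched is to pick k by cross-validation on the training data. The following is a sketch under stated assumptions: synthetic data, a 5-fold `StratifiedKFold`, and `GridSearchCV` over the same odd values of k. None of these choices come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=47)

# scaling sits inside the pipeline so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(3, 40, 2))}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=47)

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("Best k by cross-validated F1:", best_k)
```

The fitted `search` object can then be evaluated once on the held-out test set, so that the reported score is not influenced by the choice of k.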

In [None]:
# instantiate learning model (k = 29)
knn = KNeighborsClassifier(n_neighbors = 29, weights = 'uniform', metric='euclidean')

# fitting the model
knn.fit(X_train_scaled, y_train)

# predict the response
knn_y_pred = knn.predict(X_test_scaled)

# Confusion Matrix for the K-nearest neighbors Model
print("Confusion Matrix : K-nearest neighbors")
print(confusion_matrix(y_test,knn_y_pred))

# Classification Report for the K-nearest neighbors Model
classRep = classification_report(y_test, knn_y_pred, digits=2)
print(classRep)

**From the above K-nearest neighbors Model, we can find out the following details:**
* **Accuracy of the model:- 92%**
* **Recall of the model:- 100%**
* **Precision of the model:- 90%**
* **F1-Score of the model:- 95%**

### C. SVM (Support Vector Machine):

In [None]:
svm = SVC(gamma=0.05, C=70,random_state=47)
svm.fit(X_train_scaled , y_train)

# predict the response
svm_y_pred = svm.predict(X_test_scaled)

# Confusion Matrix for the Support Vector Machine Model
print("Confusion Matrix : Support Vector Machine")
print(confusion_matrix(y_test,svm_y_pred))

# Classification Report for the Support Vector Machine Model
classRep = classification_report(y_test, svm_y_pred, digits=2)
print(classRep)

**From the above Support Vector Machine Model, we can find out the following details:**
* **Accuracy of the model:- 95%**
* **Recall of the model:- 100%**
* **Precision of the model:- 94%**
* **F1-Score of the model:- 97%**

#### Determining which standard model performed better

In [None]:
# Using 10-fold cross-validation to check how the above algorithms vary across 10 equal-sized subsets of the training data
models = []
models.append(('Logistic Regression', LogisticRegression(random_state=47)))
models.append(('K-NN', KNeighborsClassifier(n_neighbors = 29, weights = 'uniform', metric='euclidean')))
models.append(('SVM', SVC(gamma=0.05, C=70,random_state=47)))

# evaluate each model
results = []
names = []
scoring = 'f1'
for name, model in models:
    # random_state has no effect without shuffle=True (and raises an error in recent scikit-learn), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("\033[1m{0}\033[0m model has \033[1mmean F1-Score\033[0m of {1} and \033[1mSD F1-Score\033[0m of {2}".format(name, cv_results.mean(), cv_results.std()))

In [None]:
plt.title('Algorithm Comparison')
plt.plot(results[0],label='Logistic')
plt.plot(results[1],label='KNN')
plt.plot(results[2],label='SVM')
plt.legend()

**From the above comparison of the three algorithms (Logistic Regression, K-nearest neighbors and Support Vector Machine), we can conclude that SVM performed slightly better than the others.**


### 7. Train a meta-classifier and note the accuracy on test data (10 points)

* **STACKING:**

In [None]:
# defining the level-0 heterogeneous base models
level0 = list()
level0.append(('lr', LogisticRegression(random_state=47)))
level0.append(('knn', KNeighborsClassifier(n_neighbors = 29, weights = 'uniform', metric='euclidean')))
level0.append(('cart', DecisionTreeClassifier()))
level0.append(('svm', SVC(gamma=0.05, C=70,random_state=47)))
level0.append(('bayes', GaussianNB()))

# define meta learner model
level1 = SVC(gamma=0.05, C=3,random_state=47)

# define the stacking ensemble with 5-fold cross-validation
Stack_model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)

# fit the model and predict the response
Stack_model.fit(X_train_scaled, y_train)
prediction_Stack = Stack_model.predict(X_test_scaled)

# Confusion Matrix for the Stacking Model
print("Confusion Matrix : Stacking")
print(confusion_matrix(y_test,prediction_Stack))

# Classification Report for the Stacking Model
print(classification_report(y_test, prediction_Stack, digits=2))

#### AUC-ROC for stacking

In [None]:
# determining false positive rate, true positive rate and thresholds
# (prediction_Stack holds hard class labels, so the curve has a single operating point)
fpr, tpr, threshold = metrics.roc_curve(y_test, prediction_Stack)
roc_auc_stack = metrics.auc(fpr, tpr)

#plotting ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc_stack)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**From the above Stacked meta classifier Model, we can find out the following details:**
* **Accuracy of the model:- 95%**
* **Recall of the model:- 100%**
* **Precision of the model:- 94%**
* **F1-Score of the model:- 97%**
* **ROC-AUC : 88%**
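
The ROC cell above passes hard class labels to `roc_curve`, which gives a curve with one operating point; in that case the area equals the balanced accuracy. Passing continuous scores usually gives a more informative curve. The sketch below compares the two on synthetic data (an assumption, not the project data), using a plain `SVC` as a stand-in. The same idea should carry over to `Stack_model` through `decision_function` or `predict_proba`, whichever its final estimator exposes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the project dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=47)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=47
)

clf = SVC(gamma=0.05, C=3, random_state=47).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)          # 0/1 predictions
scores = clf.decision_function(X_te)     # signed distance to the decision boundary

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"Balanced accuracy:    {bal_acc:.3f}")
print(f"AUC from scores:      {auc_from_scores:.3f}")
```

The first two printed values match by construction, which is why a label-based ROC-AUC should be read as balanced accuracy rather than as a ranking metric.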


### 8. Train at least one standard Ensemble model - Random forest, Bagging, Boosting etc, and note the accuracy (10 points)
**A. Random Forest**

In [None]:
#creating model of Random Forest
RandomForest = RandomForestClassifier(n_estimators = 100,criterion='entropy',max_features=10,random_state=47)
RandomForest = RandomForest.fit(X_train_scaled, y_train)

# predict the response
RandomForest_pred = RandomForest.predict(X_test_scaled)

# Confusion Matrix for the Random Forest Model
print("Confusion Matrix : Random Forest")
print(confusion_matrix(y_test,RandomForest_pred))

# Classification Report for the Random Forest Model
print(classification_report(y_test, RandomForest_pred, digits=2))

#### AUC-ROC for Random Forest

In [None]:
# determining false positive rate, true positive rate and thresholds
# (RandomForest_pred holds hard class labels, so the curve has a single operating point)
fpr, tpr, threshold = metrics.roc_curve(y_test, RandomForest_pred)
roc_auc_rf = metrics.auc(fpr, tpr)

#plotting ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc_rf)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**From the above Random Forest Model, we can find out the following details:**
* **Accuracy of the model:- 92%**
* **Recall of the model:- 98%**
* **Precision of the model:- 92%**
* **F1-Score of the model:- 95%**
* **ROC-AUC : 84%**


In [None]:
# Let's check the feature importances
feature_imp = pd.Series(RandomForest.feature_importances_,index=X_train.columns).sort_values(ascending=False)
feature_imp

In [None]:
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')

**B. Adaptive Boosting**

In [None]:
#creating model of Adaptive Boosting
AdBs = AdaBoostClassifier( n_estimators= 50)
AdBs  = AdBs.fit(X_train_scaled, y_train)

# predict the response
AdBs_y_pred = AdBs.predict(X_test_scaled)

# Confusion Matrix for the Adaptive Boosting Model
print("Confusion Matrix : Adaptive Boosting")
print(confusion_matrix(y_test,AdBs_y_pred))

# Classification Report for the Adaptive Boosting Model
print(classification_report(y_test, AdBs_y_pred, digits=2))

#### AUC-ROC for AdaBoost

In [None]:
# determining false positive rate, true positive rate and thresholds
# (AdBs_y_pred holds hard class labels, so the curve has a single operating point)
fpr, tpr, threshold = metrics.roc_curve(y_test, AdBs_y_pred)
roc_auc_ada = metrics.auc(fpr, tpr)

#plotting ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc_ada)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**From the above Adaptive Boosting Model, we can find out the following details:**
* **Accuracy of the model:- 90%**
* **Recall of the model:- 96%**
* **Precision of the model:- 92%**
* **F1-Score of the model:- 94%**
* **ROC-AUC : 82%**


### 9. Compare all the models (minimum 5) and pick the best one among them (10 points)

In [None]:
# Using 10-fold cross-validation to check how the various algorithms vary across 10 equal-sized subsets of the training data
models = []
models.append(('Logistic Regression', LogisticRegression(random_state=47)))
models.append(('K-NN', KNeighborsClassifier(n_neighbors = 29, weights = 'uniform', metric='euclidean')))
models.append(('SVM', SVC(gamma=0.05, C=70,random_state=47)))
models.append(('Stacking', StackingClassifier(estimators=level0, final_estimator=level1, cv=5)))
models.append(('Random Forest', RandomForestClassifier(n_estimators = 100,criterion='entropy',max_features=10,random_state=47)))
models.append(('Adaptive Boosting', AdaBoostClassifier( n_estimators= 50)))

# evaluate each model with scoring method accuracy
print("*"*125)
print("Accuracy scoring of the Models")
print("*"*125)
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # random_state has no effect without shuffle=True (and raises an error in recent scikit-learn), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("\033[1m{0}\033[0m model has \033[1mmean Accuracy\033[0m of {1} and \033[1mSD Accuracy\033[0m of {2}"
          .format(name, round(cv_results.mean(),2), round(cv_results.std(),2))) 


print()
print("*"*125)
print("F1 scoring of the Models")
print("*"*125)

# evaluate each model with scoring method f1
results = []
names = []
scoring = 'f1'
for name, model in models:
    # random_state has no effect without shuffle=True (and raises an error in recent scikit-learn), so it is omitted
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("\033[1m{0}\033[0m model has \033[1mmean F1-Score\033[0m of {1} and \033[1mSD F1-Score\033[0m of {2}"
          .format(name, round(cv_results.mean(),2), round(cv_results.std(),2)))    

**From the Accuracy and F1 scores above, we can conclude that the Stacking Model performs better than the other models.**
* **The Stacking Model has a mean Accuracy of 93% with a standard deviation of 5%.**
* **The Stacking Model has a mean F1-Score of 94% with a standard deviation of 5%.**
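
The comparison above runs cross-validation twice per model, once for each metric. As a minimal sketch, `cross_validate` can collect both metrics in a single pass. The data here is a synthetic stand-in, and the names `demo_models` and `summary` are hypothetical; only two models are included to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the scaled training split (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=47)

demo_models = {
    "Logistic Regression": LogisticRegression(random_state=47),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=47),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=47)

summary = {}
for name, model in demo_models.items():
    # One call fits each fold once and scores it with both metrics.
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])
    summary[name] = {
        "accuracy_mean": scores["test_accuracy"].mean(),
        "accuracy_sd": scores["test_accuracy"].std(),
        "f1_mean": scores["test_f1"].mean(),
        "f1_sd": scores["test_f1"].std(),
    }

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Fitting each fold once roughly halves the run time, which matters most for slower models such as the stacking ensemble.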