Dear Participant,

Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.

Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.

Use the provided dataset for your analysis.

#Attribute Information:

#Matrix column entries (attributes):
#name - ASCII subject name and recording number
#MDVP:Fo(Hz) - Average vocal fundamental frequency
#MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
#MDVP:Flo(Hz) - Minimum vocal fundamental frequency
#MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
#MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
#NHR,HNR - Two measures of ratio of noise to tonal components in the voice
#status - Health status of the subject (one) - Parkinson's, (zero) - healthy
#RPDE,D2 - Two nonlinear dynamical complexity measures
#DFA - Signal fractal scaling exponent
#spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

In [None]:
# Import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Q1. Load the dataset 

In [None]:
df=pd.read_csv('parkinsons.csv')
df

### Q2. Use the .describe() method on the dataset and state any insights you may come across.

In [None]:
df.describe()

### Q3. Check for class imbalance. Do people with Parkinson's have greater representation in the dataset?

In [None]:
df['status'].value_counts().plot(kind='bar')
plt.show()
print("From the graph we can see that the data is imbalanced: subjects with Parkinson's have greater representation")
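
The bar chart shows the imbalance visually; the same check can be put in numbers. The sketch below uses a small hypothetical label series in place of `df['status']` (the values are made up for illustration) and reports the count, the share of each class, and the ratio between the largest and smallest class.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['status'] (1 = Parkinson's, 0 = healthy).
status = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="status")

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction of rows per class

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

In the notebook itself, replacing `status` with `df['status']` should give the corresponding figures for the real recordings.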

### Q4. Check for missing values and take necessary measures by dropping observations or imputing them.

In [None]:
df.isnull().sum()
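
The cell above only counts missing values. If any were found, the two measures named in the question could look roughly like the sketch below. The toy frame and its gaps are hypothetical; only the column names are borrowed from the attribute list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; column names borrowed from the attribute list.
toy = pd.DataFrame({
    "HNR": [21.0, np.nan, 19.5, 24.1],
    "NHR": [0.02, 0.01, np.nan, 0.03],
    "status": [1, 1, 0, 0],
})

missing_per_column = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the median of its column.
imputed = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(dropped.shape, imputed.isnull().sum().sum())
```

Dropping is simplest when few rows are affected; median imputation keeps every row, which may matter for a dataset this small.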

### Q5. Plot the distribution of all the features. State any observations you can make based on the distribution plots.

In [None]:
# The diagonal of the pair plot shows each feature's distribution
sns.pairplot(df)

### Q6. Check for outliers in the data. Are there any variables with a high number of outliers?

In [None]:
df.plot(kind='box')
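
With features on very different scales, a single box plot can be hard to read. As a complement, the sketch below counts outliers per column using the same 1.5 × IQR rule that box-plot whiskers use by default. The helper `count_iqr_outliers` and the toy data are hypothetical, introduced only for illustration.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Hypothetical toy data: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(df)` in the notebook should rank the real features by their number of flagged points.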

### Q7. Are there any strong correlations among the independent features?

In [None]:
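# Exclude the non-numeric 'name' column, which recent pandas versions refuse to correlate
df.drop(columns=['name']).corr()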
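
A full correlation matrix is hard to scan by eye. One possible way to answer the question directly is to list only the feature pairs whose absolute correlation exceeds a cutoff. The helper `strong_pairs`, the 0.9 threshold, and the toy data below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: "b" is an exact multiple of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

strong = strong_pairs(toy)
print(strong)
```

For the independent features only, the notebook call would be something like `strong_pairs(df.drop(columns=['name', 'status']))`.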

### Q8. Split the dataset into training and test sets

In [None]:
x=df.drop(['name','status'],axis=1)
y=df.status
from sklearn.model_selection import train_test_split
(train_x, test_x, train_y, test_y) = train_test_split(x,y, train_size=0.8, random_state=1)
print(x.shape)
print(y.shape)
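
Because the classes are imbalanced, a purely random split can shift the class ratio between the training and test sets. A stratified split is one way to avoid that. The sketch below uses hypothetical toy arrays rather than the recordings; `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 75% positive, mimicking an imbalanced target.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 75 + [0] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1, stratify=y
)

# With stratify=y, both parts keep the 75/25 class ratio.
print(y_train.mean(), y_test.mean())
```

In the notebook, adding `stratify=y` to the existing call would apply the same idea to the real data.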

### Q9. Create a default decision tree model using criterion = Entropy 

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Fit an otherwise-default tree that splits on information gain (entropy)
dtc = DecisionTreeClassifier(criterion='entropy')
dtc.fit(train_x, train_y)


In [None]:
y_pred = dtc.predict(test_x)

print('\nAccuracy: {0:.4f}'.format(accuracy_score(test_y, y_pred)))
y_pred
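
On an imbalanced dataset, accuracy alone can hide weak performance on the minority class. The `confusion_matrix` and `classification_report` functions imported earlier give a per-class view. The sketch below uses hypothetical labels and predictions, not the model's actual output.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 = Parkinson's, 0 = healthy).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Share of Parkinson's cases that the predictions caught.
recall_pd = tp / (tp + fn)

print(cm)
print(classification_report(y_true, y_hat, target_names=["healthy", "Parkinson's"]))
```

In the notebook, the equivalent calls would be `confusion_matrix(test_y, y_pred)` and `classification_report(test_y, y_pred)`.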

### Q10. Use the regularization parameters max_depth and min_samples_leaf to recreate the model. What is the impact on the model accuracy? How does regularization help?

In [None]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV 


param_dist = {"max_depth": [4, None], 
              "max_features": randint(1, 20), 
              "min_samples_leaf": randint(1, 15), 
              "criterion": ["gini", "entropy"]} 
  
# Instantiating Decision Tree classifier 
tree = DecisionTreeClassifier() 
  
# Instantiating RandomizedSearchCV object 
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 10) 
  
# Tune on the training split only, so the test split stays unseen
tree_cv.fit(train_x, train_y)
  
# Print the tuned parameters and score 
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_)) 
print("Best score is {}".format(tree_cv.best_score_)) 
print()
print('Regularization does not necessarily improve performance on the training data; it aims to improve generalization, i.e., performance on new, unseen data')
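
To see what the statement above means in practice, one option is to compare training and test accuracy for an unconstrained tree and a constrained one. The sketch below runs on synthetic data from `make_classification`, not on the Parkinson's recordings, and the values `max_depth=4` and `min_samples_leaf=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Parkinson's recordings), with some label noise.
X, y = make_classification(n_samples=400, n_features=22, n_informative=6,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1, stratify=y)

models = {
    "unregularized": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "regularized": DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                          min_samples_leaf=5, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each model.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

The unconstrained tree typically memorizes the training set, so a large gap between its training and test accuracy is a sign of overfitting. Whether the constrained tree scores better on the test set depends on the data.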

### Q11. Implement a Random Forest model. What is the optimal number of trees that gives the best result?

In [None]:
from sklearn.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(criterion='gini',random_state=5)
rnd_clf.fit(train_x, train_y)

In [None]:
y_pred = rnd_clf.predict(test_x)
y_pred

In [None]:
print('\nAccuracy: {0:.4f}'.format(accuracy_score(test_y, y_pred)))

In [None]:
param_dist = {"max_depth": [4, None], 
              "max_features": randint(1, 20), 
              "min_samples_leaf": randint(1, 15), 
              "criterion": [ "entropy"]} 
  
# Instantiating Random Forest classifier
tree = RandomForestClassifier(random_state=5)
  
# Instantiating RandomizedSearchCV object 
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 10) 
  
# Tune on the training split only, so the test split stays unseen
tree_cv.fit(train_x, train_y)
  
# Print the tuned parameters and score 
print("Tuned Random Forest Parameters: {}".format(tree_cv.best_params_)) 
print("Best score is {}".format(tree_cv.best_score_)) 
print()
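
The search above does not vary the number of trees, which is what the question asks about. One possible approach is to sweep `n_estimators` over a few candidate values and compare cross-validated accuracy. The sketch below uses synthetic data and a hypothetical grid of forest sizes; both are assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook, train_x and train_y would take its place.
X, y = make_classification(n_samples=200, n_features=22, n_informative=6,
                           random_state=0)

candidate_sizes = [10, 50, 100]   # hypothetical grid of forest sizes
mean_scores = {}
for n_trees in candidate_sizes:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=5)
    # Mean accuracy over 5 stratified folds for this forest size.
    mean_scores[n_trees] = cross_val_score(forest, X, y, cv=5).mean()

best_n = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best number of trees on this grid:", best_n)
```

Accuracy usually levels off as trees are added, so the smallest forest on the plateau is often a reasonable choice. Alternatively, `n_estimators` could be added to `param_dist` so that `RandomizedSearchCV` tunes it with the other parameters.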