<a href="https://colab.research.google.com/github/Varshith-07/Parkinson-s-disease-detection/blob/main/ML_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement:**

This project aims to develop a machine learning model for early detection of Parkinson's disease using a CSV dataset of voice recordings and clinical data to predict health status, enhancing diagnostic accuracy and patient outcomes.

# **Steps :**
1. Importing the library files
2. Reading the Parkinsons Dataset
3. Preprocessing
4. Split the dataset into training and testing
5. Build the models (SVM, Logistic Regression, KNN, Decision Tree) and evaluate the performance of each model


## **1. Importing the Python libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

The code above imports the data-handling libraries (NumPy, pandas), the utilities for splitting and scaling the data (`train_test_split`, `StandardScaler`), the classifiers (SVM, Logistic Regression, KNN), and the evaluation metric (`accuracy_score`).

## **2. Reading the Parkinsons Dataset**

In [None]:
parkinsons_data = pd.read_csv('/content/parkinsons data.csv')

The dataset is loaded into the variable `parkinsons_data` with the `pd.read_csv` function, since the file at this path is in CSV format.


In [None]:
parkinsons_data.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


To get a first look at the data, `parkinsons_data.head()` prints the top 5 rows. To print the top n rows instead, pass n (an int) as the argument to `head`.
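As a small illustration of the `n` argument, the sketch below uses a made-up two-column frame (`toy` is a hypothetical stand-in, not the Parkinsons data) and compares `head(3)` with the default call.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"a": range(10), "b": range(10, 20)})

first_three = toy.head(3)   # the first 3 rows
default_head = toy.head()   # with no argument, head() returns the first 5 rows

print(first_three)
print(len(default_head))
```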

## **3. Preprocessing**

**Number of columns and rows**

In [None]:
parkinsons_data.shape

(195, 24)

`shape` returns the dimensions of the loaded dataset: 195 rows and 24 columns.

**Dataset Information**

In [None]:
parkinsons_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

The cell above summarizes the dataset: the object's class, the RangeIndex, and the non-null count and dtype of every column.

**Counting Null values in dataset**

In [None]:
parkinsons_data.isnull().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

The code above counts the null entries in each column. Every column reports 0, so there are no missing values to handle.
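To see what the same call reports when values are missing, here is a minimal sketch on a hypothetical frame (`toy`, with one gap per column). The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

null_counts = toy.isnull().sum()       # per-column count of missing entries
total_nulls = int(null_counts.sum())   # total across the whole frame

print(null_counts)
print(total_nulls)
```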

In [None]:
parkinsons_data.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


The `parkinsons_data.describe()` function provides statistical summaries of the dataset, including count, mean, standard deviation, min, max, and quartile values.
NOTE: By default it summarizes numerical columns only, which is why the `name` column is absent from the output.
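The numeric-only behavior is a default rather than a hard limit. The sketch below, on a hypothetical two-column frame, contrasts the default call with `describe(include="all")`, which also reports non-numeric columns.

```python
import pandas as pd

# Hypothetical frame: one text column, one numeric column.
toy = pd.DataFrame({"name": ["s1", "s2", "s3"], "value": [1.0, 2.0, 3.0]})

numeric_summary = toy.describe()              # numeric columns only (default)
full_summary = toy.describe(include="all")    # adds count/unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```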

In [None]:
parkinsons_data['status'].value_counts()

status
1    147
0     48
Name: count, dtype: int64

1  --> Parkinson's Positive

0 --> Healthy
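The counts above show that the classes are imbalanced. As a rough sketch, the snippet below rebuilds a label series with the same counts (147 positive, 48 healthy) and computes the class proportions. The `majority_baseline` name is introduced here for illustration: it is the accuracy a model would get on the full dataset by always predicting class 1.

```python
import pandas as pd

# Rebuild a label series with the class counts reported above.
status = pd.Series([1] * 147 + [0] * 48, name="status")

proportions = status.value_counts(normalize=True)  # share of each class
majority_baseline = proportions.max()               # accuracy of always predicting the majority class

print(proportions.round(3))
print(round(majority_baseline, 3))
```

Any model accuracy is worth reading against this baseline of roughly 0.754.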


**Setting Dependent and Independent Variables**

In [None]:
X = parkinsons_data.drop(columns=['name', 'status'])
Y = parkinsons_data['status']

*  The dataset columns are split into independent and dependent variables.
*  `X` (also known as the features) holds every column except `name` and the label column `status`.
*  `Y` (also known as the target) holds only the label column `status`.

In [None]:
print(X)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0        119.992       157.302        74.997         0.00784   
1        122.400       148.650       113.819         0.00968   
2        116.682       131.111       111.555         0.01050   
3        116.676       137.871       111.366         0.00997   
4        116.014       141.781       110.655         0.01284   
..           ...           ...           ...             ...   
190      174.188       230.978        94.261         0.00459   
191      209.516       253.017        89.488         0.00564   
192      174.688       240.005        74.287         0.01360   
193      198.764       396.961        74.904         0.00740   
194      214.289       260.277        77.973         0.00567   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
0             0.00007   0.00370   0.00554     0.01109       0.04374   
1             0.00008   0.00465   0.00696     0.01394       0.06134   
2             0.00

In [None]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64


## **4. Split the dataset into training and testing**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.35, random_state=2)

The cell above splits the dataset into training and testing sets. `X_train` and `Y_train` are the features and labels for training, while `X_test` and `Y_test` are for testing.
*  The test set size is 35% (`test_size=0.35`)
*  The train set size is 65%

The random state is fixed at 2 for reproducibility.
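Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class ratio roughly equal in both subsets. The sketch below is not what the notebook does: it adds the `stratify` argument and runs on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins with the same row count, feature count, and class counts as the dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 rows, 22 features, 147 positive / 48 healthy labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(195, 22))
y_demo = np.array([1] * 147 + [0] * 48)

# Same test_size and random_state as the cell above, plus stratification on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=2, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # share of class 1 in each subset
```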

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(195, 22) (126, 22) (69, 22)


*  X.shape gives the dimensions of the full feature matrix: 195 rows and 22 features.
*  X_train.shape gives the dimensions of the training set: 126 rows.
*  X_test.shape gives the dimensions of the testing set: 69 rows.

## **Data Standardization**

In [None]:
scaler = StandardScaler()

This line creates a `StandardScaler` instance and assigns it to the variable `scaler`.

In [None]:
scaler.fit(X_train)

In [None]:
X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)

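The `StandardScaler` is fitted on the training data `X_train` only. The same fitted scaler then transforms both `X_train` and `X_test`, so no statistics from the test set leak into preprocessing.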
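A quick way to check what standardization does is to look at the column means and standard deviations afterwards. This is a minimal sketch on synthetic arrays (`train` and `test` are hypothetical, drawn from a normal distribution), not on the Parkinsons features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training and a test matrix.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = demo_scaler.transform(test)        # reuse the training mean/std on test

print(train_scaled.mean(axis=0).round(6))  # close to 0 in every column
print(train_scaled.std(axis=0).round(6))   # close to 1 in every column
print(test_scaled.mean(axis=0).round(3))   # near 0, but not exactly
```

The test columns end up only approximately centered, because they are scaled with the training statistics rather than their own.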

In [None]:
print(X_train)

[[-1.08589582 -0.73943943 -0.23108844 ... -0.03540838  0.08325331
   0.2751653 ]
 [-0.50267171 -0.1017608  -0.84031557 ... -1.72426263 -1.41433871
   0.02001728]
 [-0.24388406 -0.43614289  0.55640062 ... -1.71608272 -0.15422338
  -0.27075828]
 ...
 [-0.96430881 -0.69825472 -0.15333711 ...  1.20718641 -0.48985645
  -0.25293397]
 [-0.39989261  0.10786438 -0.79363281 ... -0.23345912 -0.48859884
   0.23624906]
 [ 1.01586263  0.10964781 -0.6193429  ... -0.78673602  1.14221092
  -0.09799632]]


## **5.  Building Models**

**Support Vector Machine Model**

In [None]:
model = svm.SVC()
model.fit(X_train, Y_train)

X_pred_train = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_pred_train)

print('Accuracy score of training data : ', training_data_accuracy)

X_pred_test = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_pred_test)

print('Accuracy score of test data : ', test_data_accuracy)

Accuracy score of training data :  0.9206349206349206
Accuracy score of test data :  0.8840579710144928


The code trains a Support Vector Machine (SVM) on the training data (`X_train`, `Y_train`). It then predicts the training and test sets, calculating and printing the accuracy for both. This evaluates the model's performance on both training and unseen data.

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(Y_test,X_pred_test))
print(classification_report(Y_test,X_pred_test))

[[ 9  7]
 [ 1 52]]
              precision    recall  f1-score   support

           0       0.90      0.56      0.69        16
           1       0.88      0.98      0.93        53

    accuracy                           0.88        69
   macro avg       0.89      0.77      0.81        69
weighted avg       0.89      0.88      0.87        69




The SVM model trained on the Parkinson's data achieved 92.06% training accuracy and 88.41% test accuracy with a 35% test size. The small gap suggests good generalization, although recall for the healthy class (0) is only 0.56: 7 of the 16 healthy test samples were classified as positive.
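The figures in the classification report can be recomputed by hand from the SVM confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, and treats class 1 as the positive class.

```python
import numpy as np

# SVM confusion matrix from the output above (rows: true 0/1, columns: predicted 0/1).
cm = np.array([[9, 7],
               [1, 52]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 61 / 69
precision_1 = tp / (tp + fp)      # 52 / 59
recall_1 = tp / (tp + fn)         # 52 / 53
precision_0 = tn / (tn + fn)      # 9 / 10
recall_0 = tn / (tn + fp)         # 9 / 16

print(round(accuracy, 4))
print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
```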

**Logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression

reg=LogisticRegression()
reg.fit(X_train,Y_train)

X_pred_train=reg.predict(X_train)
X_pred_test=reg.predict(X_test)

print('Accuracy score of training data : ',accuracy_score(Y_train,X_pred_train))
print('Accuracy score of test data : ', accuracy_score(Y_test, X_pred_test))

Accuracy score of training data :  0.873015873015873
Accuracy score of test data :  0.7681159420289855


The code trains a Logistic Regression model on the training data (`X_train`, `Y_train`). It then predicts the training and test sets, calculating and printing the accuracy for both. This evaluates the model's performance on both training and unseen data.

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(Y_test,X_pred_test))
print(classification_report(Y_test,X_pred_test))

[[10  6]
 [10 43]]
              precision    recall  f1-score   support

           0       0.50      0.62      0.56        16
           1       0.88      0.81      0.84        53

    accuracy                           0.77        69
   macro avg       0.69      0.72      0.70        69
weighted avg       0.79      0.77      0.78        69




The Logistic Regression model achieved 87.30% accuracy on training data and 76.81% on test data for the Parkinson's dataset with a 35% test size. The gap of roughly 10 points indicates weaker generalization, and 10 of the 53 positive test samples were missed.

**KNN**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)

X_pred_train=knn.predict(X_train)
X_pred_test=knn.predict(X_test)

print('Accuracy score of training data : ',accuracy_score(Y_train,X_pred_train))
print('Accuracy score of test data : ', accuracy_score(Y_test, X_pred_test))

Accuracy score of training data :  0.9682539682539683
Accuracy score of test data :  0.9130434782608695


The code trains a K-Nearest Neighbors (KNN) classifier with `n_neighbors=3` on the training data (`X_train`, `Y_train`). It then predicts the training and test sets, calculating and printing the accuracy for both. This evaluates the model's performance on both training and unseen data.

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(Y_test,X_pred_test))
print(classification_report(Y_test,X_pred_test))

[[15  1]
 [ 5 48]]
              precision    recall  f1-score   support

           0       0.75      0.94      0.83        16
           1       0.98      0.91      0.94        53

    accuracy                           0.91        69
   macro avg       0.86      0.92      0.89        69
weighted avg       0.93      0.91      0.92        69




The KNN model for the Parkinson's data achieved 96.83% accuracy on training data and 91.30% on test data with a 35% test size. Only 1 of the 16 healthy test samples and 5 of the 53 positive test samples were misclassified.


**Decision Trees**

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(X_train,Y_train)

X_pred_train=dt.predict(X_train)
X_pred_test=dt.predict(X_test)

print('Accuracy score of training data : ',accuracy_score(Y_train,X_pred_train))
print('Accuracy score of test data : ', accuracy_score(Y_test, X_pred_test))

Accuracy score of training data :  1.0
Accuracy score of test data :  0.7681159420289855


The code trains a DecisionTreeClassifier on the training data (`X_train`, `Y_train`). It then predicts the training and test sets, calculating and printing the accuracy for both. This evaluates the model's performance on both training and unseen data.

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(Y_test,X_pred_test))
print(classification_report(Y_test,X_pred_test))

[[12  4]
 [12 41]]
              precision    recall  f1-score   support

           0       0.50      0.75      0.60        16
           1       0.91      0.77      0.84        53

    accuracy                           0.77        69
   macro avg       0.71      0.76      0.72        69
weighted avg       0.82      0.77      0.78        69




The Decision Tree model on the Parkinson's data reached 100% training accuracy but only 76.81% test accuracy with a 35% test size. A perfect training score paired with a much lower test score is a sign of overfitting.
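An unrestricted decision tree can keep splitting until it fits the training set perfectly, which is consistent with the 1.0 training accuracy printed above. One common way to rein this in is to cap the tree depth. The sketch below is an illustration only: it uses synthetic data from `make_classification` rather than the Parkinsons features, `max_depth=3` is an arbitrary value chosen for the example, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows and features as the dataset.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=2
)

# random_state makes the tree reproducible between runs.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for label, tree in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(label, tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```

Fixing `random_state` on the classifier also keeps the reported scores stable from run to run.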

## **6. Conclusion**

For the Parkinson's dataset, four models were evaluated:

1. **SVM**: Achieved 92.06% training accuracy and 88.41% test accuracy. Precision and recall for class 0 were 0.90 and 0.56, while for class 1 they were 0.88 and 0.98, respectively, indicating strong overall performance, though the low class 0 recall means many healthy samples were flagged as positive.

2. **Logistic Regression**: Showed 87.30% training accuracy and 76.81% test accuracy. Precision for class 0 was 0.50 and recall 0.62, while for class 1 they were 0.88 and 0.81, indicating reasonable performance but some overfitting.

3. **KNN**: Achieved 96.83% training accuracy and 91.30% test accuracy. Precision and recall for class 0 were 0.75 and 0.94, while for class 1 they were 0.98 and 0.91, showing excellent performance and generalization.

4. **Decision Tree**: Achieved 100% training accuracy and 76.81% test accuracy. Precision for class 0 was 0.50 and recall 0.75, while for class 1 they were 0.91 and 0.77, indicating clear overfitting.
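The comparison above rests on a single train/test split of 195 rows, so the ranking could shift with a different split. A possible follow-up is k-fold cross-validation with the scaler inside a pipeline, so that scaling is refit on each training fold. The sketch below shows the idea under stated assumptions: the data is synthetic (`make_classification` with a roughly 25/75 class ratio), `cv=5` is an arbitrary choice, and the printed scores are not results for the Parkinsons dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the real X and Y to run on the dataset.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=5,
    weights=[0.25, 0.75], random_state=0,
)

models = {
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, clf in models.items():
    # The scaler is part of the pipeline, so each fold is scaled with its own training statistics.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```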