**The dataset contains various features related to voice measurements. Here's an overview of the first few rows:**

- Each row appears to represent a voice recording, with multiple acoustic features measured for each recording.
- The features include measurements such as `MDVP:Fo(Hz)`, `MDVP:Fhi(Hz)`, `MDVP:Flo(Hz)`, jitter, shimmer, and other voice parameters.
- There's a column named `status`, which might indicate a particular condition or classification.
- The `name` column seems to be an identifier for each recording.

In [None]:
import pandas as pd

# Load the data
file_path = '/content/parkinsons.data'
parkinsons_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
parkinsons_data.head()

**For a comprehensive overview of the dataset, we can perform the following analyses:**

1. **Descriptive Statistics**: This will give us basic statistics for each feature, like mean, median, standard deviation, etc.
2. **Data Quality Check**: We'll look for missing values or anomalies in the dataset.
3. **Feature Distribution**: Understanding the distribution of various features, possibly using histograms or box plots.
4. **Correlation Analysis**: Analyzing how different features are correlated with each other, particularly with the 'status' variable if it's a target variable.
5. **Status Breakdown**: If 'status' is a key variable (e.g., indicating a condition or classification), we can look at its distribution.

We'll start with Descriptive Statistics and the Data Quality Check, then look at Feature Distribution, and finish with Correlation Analysis and the Status Breakdown.

### Descriptive Statistics

**Summary of the descriptive statistics for the dataset:**

- **Count**: The dataset has 195 entries.
- **Mean and Standard Deviation**: These values describe the average and variability of each feature. For instance, the mean fundamental frequency (`MDVP:Fo(Hz)`) is about 154 Hz with a standard deviation of about 41 Hz, indicating considerable variability in pitch among the samples.
- **Min and Max**: These values show the range of each feature. For example, the minimum and maximum values of `MDVP:Fhi(Hz)` are 102.145 Hz and 592.030 Hz, respectively.
- **Quartiles (25%, 50%, 75%)**: These values give an idea about the distribution of the data. The 50% value (median) is particularly useful for understanding the central tendency of the data.

### Data Quality Check

- There are no missing values in any of the columns, which is excellent for analysis purposes.

Given that the data is clean and well-structured, we can proceed to the next steps: Feature Distribution, Correlation Analysis, and Status Breakdown. Let's start by visualizing the distribution of some key features, including the target variable 'status'.

**The distribution plots for selected features from the Parkinson's dataset are as follows:**

1. **MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz)**: These features, representing various frequency measures of the voice, show somewhat skewed distributions. For example, `MDVP:Fo(Hz)` and `MDVP:Flo(Hz)` are right-skewed, indicating that lower frequencies are more common.

2. **MDVP:Jitter(%)**: This measure of frequency variation is also right-skewed, suggesting that higher jitter percentages are less common.

3. **MDVP:Shimmer**: Similar to Jitter, the shimmer values are right-skewed, indicating that most voice samples have lower shimmer.

4. **HNR (Harmonics-to-Noise Ratio)**: This feature shows a more varied distribution, slightly skewed to the left.

5. **Status**: It appears that the majority of the samples in the dataset are labeled with 'status' 1. The 'status' variable seems to be binary, possibly indicating the presence or absence of Parkinson's disease.
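
To put a number on "right-skewed", one option is the sample skewness, where a positive value indicates a longer right tail. The sketch below is only an illustration: `sample_skewness` is a hypothetical helper and the two lists are made-up values, not data from `parkinsons.data`. pandas offers a bias-corrected variant of the same statistic as `Series.skew()`.

```python
from statistics import fmean

def sample_skewness(values):
    """Population (Fisher-Pearson) skewness: third central moment / std**3."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up illustrative values (NOT taken from parkinsons.data)
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(sample_skewness(right_skewed))  # positive: long right tail
print(sample_skewness(symmetric))     # zero for a symmetric sample
```

Applied to the notebook's data, the same check would be run per column, for example on `parkinsons_data['MDVP:Jitter(%)']`.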

Next, we'll perform a correlation analysis to understand how these features relate to each other and particularly to the 'status' variable. This can provide insights into which features might be important for predicting the 'status'.

**The heatmap displays the correlation matrix for the dataset, providing insights into how different features are related to each other. Here are some key observations:**

1. **Correlation with Status**: Several features show a notable correlation with the 'status' variable. For instance, `spread1`, `PPE` (Pitch Period Entropy), and `MDVP:Shimmer` have relatively strong positive correlations with 'status'. This suggests they could be important factors in distinguishing between the two statuses.

2. **Highly Correlated Features**: There are groups of features that are highly correlated with each other. For example, `MDVP:Shimmer`, `Shimmer:DDA`, `MDVP:Shimmer(dB)`, `Shimmer:APQ3`, and `Shimmer:APQ5` are all strongly correlated. This is expected as they are different measures of voice amplitude variation (shimmer).

3. **Negative Correlations**: Some features show a negative correlation with 'status', such as `HNR` (Harmonics-to-Noise Ratio). A higher HNR typically indicates a healthier voice signal, which aligns with it being negatively correlated with Parkinson's disease status.

These correlations are useful for understanding relationships within the data and can guide further analysis, such as feature selection for predictive modeling.
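
Each heatmap cell is a Pearson correlation coefficient. When one of the two variables is binary, as 'status' appears to be, this is the point-biserial correlation. The sketch below computes it by hand, assuming made-up values: `hnr_like` and `ppe_like` are hypothetical stand-ins chosen to mimic the signs described above, not real measurements.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only (not from the dataset)
status = [0, 0, 0, 1, 1, 1, 1, 1]
hnr_like = [26.0, 25.1, 24.3, 21.0, 19.5, 22.4, 18.2, 20.1]  # lower when status == 1
ppe_like = [0.10, 0.12, 0.09, 0.25, 0.30, 0.21, 0.33, 0.27]  # higher when status == 1

print(round(pearson(hnr_like, status), 2))  # negative correlation
print(round(pearson(ppe_like, status), 2))  # positive correlation
```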

In [None]:
# Descriptive Statistics
descriptive_stats = parkinsons_data.describe()

# Data Quality Check
missing_values = parkinsons_data.isnull().sum()

descriptive_stats

In [None]:
missing_values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style of the plots
sns.set()

# Selecting a subset of columns for distribution plots (for simplicity)
columns_to_plot = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)', 'MDVP:Shimmer', 'HNR', 'status']

# Plotting distributions for the selected features
plt.figure(figsize=(15, 10))
for i, column in enumerate(columns_to_plot, 1):
    plt.subplot(3, 3, i)
    if column == 'status':
        sns.countplot(data=parkinsons_data, x=column)
    else:
        sns.histplot(parkinsons_data[column], kde=True)
    plt.title(column)

plt.tight_layout()
plt.show()


In [None]:
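# Correlation analysis (drop the non-numeric 'name' identifier first,
# since corr() raises an error on string columns in recent pandas versions)
correlation_matrix = parkinsons_data.drop(columns=['name']).corr()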

# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()



With the exploratory analysis done, we can go further with the following analyses:

1. **Predictive Modeling**: Developing a model to predict the 'status' (likely indicating the presence or absence of Parkinson's disease) based on the other features.
2. **Principal Component Analysis (PCA)**: Reducing the dimensionality of the data to understand the most important features.
3. **Outlier Detection**: Identifying any outliers in the dataset that might affect analyses and models.
4. **Advanced Feature Analysis**: Examining the relationships and importance of specific features in more detail.

Let's begin with Predictive Modeling, where we'll build a model to predict the 'status' based on other features in the dataset. We'll use a simple classification model for this purpose.

First, we'll prepare the data by splitting it into features (X) and the target variable (y), and then divide it into training and testing sets. After that, we'll train a classifier and evaluate its performance. Let's start with the data preparation and model training.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Preparing the data
X = parkinsons_data.drop(['name', 'status'], axis=1)  # Features
y = parkinsons_data['status']  # Target variable

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Creating and training a RandomForest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predicting on the test set
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

accuracy


In [None]:
print(classification_rep)

### Predictive Modeling Results

The RandomForest Classifier was used to predict the 'status' variable. Here are the results:

- **Accuracy**: The model achieved an accuracy of approximately 93.2%. This indicates a high level of performance in classifying the status correctly.
- **Classification Report**:
    - **Precision**: The share of predictions for a class that are correct. For status 0 (likely representing absence of Parkinson's), the precision is 92%, and for status 1 (likely presence of Parkinson's), it's 93%.
    - **Recall**: The share of a class's actual instances that the model finds. For status 0, the recall is 80%, and for status 1, it's 98%.
    - **F1-score**: The harmonic mean of precision and recall. The F1-scores are 86% for status 0 and 96% for status 1.

The high values in these metrics suggest that the model is quite effective at distinguishing between the two statuses. The better performance on status 1 might be due to a larger number of samples for this class in the dataset, as seen in the earlier distribution plot.
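
Since the classes are imbalanced, one possible refinement is a stratified split, which keeps the class ratio the same in the training and test sets. The split used in this notebook is not stratified, so what follows is a suggestion rather than the method used here. The sketch shows the idea in plain Python with a hypothetical `stratified_split` helper and a made-up label vector. With scikit-learn, passing `stratify=y` to `train_test_split` achieves the same in one call.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label vector: 30 samples of class 0 and 90 of class 1
labels = [0] * 30 + [1] * 90
train_idx, test_idx = stratified_split(labels)

# Share of class 1 in the test set matches the share in the full data (0.75)
test_share_of_ones = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share_of_ones)
```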


The remaining analyses are:

1. **Principal Component Analysis (PCA)**: This will help us understand the underlying structure of the data and reduce the number of features while retaining most of the variance in the data.
2. **Outlier Detection**: We'll identify outliers in the dataset, which can impact the performance of predictive models.
3. **Advanced Feature Analysis**: This involves examining specific features in more detail, exploring their relationships and impact on the target variable.

Let's start with Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and identify the most significant components.

### Principal Component Analysis (PCA) Results

- **Number of Components**: The PCA reduced the dataset to 8 principal components while retaining approximately 95.77% of the variance.
- **Explained Variance**: This high level of explained variance suggests that these 8 components capture most of the information present in the original dataset.

By reducing the dimensionality to 8 components from the original number of features, we can simplify the dataset while still retaining most of the information. This can be particularly useful for visualization, further analysis, or more efficient modeling.
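
To illustrate how a variance threshold such as 95% translates into a number of components, the sketch below computes the cumulative explained variance ratio directly with NumPy. It assumes synthetic data (`X_demo` is a made-up stand-in for the feature matrix), so the resulting count will differ from the 8 components reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 200 rows, 6 correlated columns
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize each column (zero mean, unit variance), as StandardScaler does
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Squared singular values are proportional to the variance along each component
singular_values = np.linalg.svd(X_std, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95, cumulative.round(3))
```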

Next, let's move to Outlier Detection. We'll use a statistical method to identify outliers in the dataset. Outliers can significantly affect the results of data analysis and statistical modeling. Let's proceed with that.

### Outlier Detection Results

- **Number of Outliers Detected**: 14 unique data points have been identified as outliers based on the Z-score method. These data points have values in one or more features that lie far from that feature's mean (here, an absolute Z-score greater than 3).
- **Outlier Details**: The outliers are spread across various features. For example, the data point at row index 73 is an outlier in the feature at column index 1 of `X` (`MDVP:Fhi(Hz)`, assuming the column order listed in the overview).

Outliers can arise due to various reasons such as measurement errors, data entry errors, or genuine variance in the data. Depending on the context, you might choose to investigate these outliers further, or exclude them from certain analyses to avoid skewing the results.
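
The index pairs returned by `np.where` are easier to read once column indices are mapped back to feature names. The following is a minimal sketch on synthetic data. `zscore_outliers`, `feature_a`, and `feature_b` are hypothetical names, and the outlier at row 7 is planted on purpose.

```python
import numpy as np

def zscore_outliers(values, feature_names, threshold=3.0):
    """Return {row_index: [feature names whose |z| exceeds threshold]}."""
    values = np.asarray(values, dtype=float)
    z = np.abs((values - values.mean(axis=0)) / values.std(axis=0))
    rows, cols = np.where(z > threshold)
    flagged = {}
    for row, col in zip(rows, cols):
        flagged.setdefault(int(row), []).append(feature_names[col])
    return flagged

# Synthetic example: 50 ordinary rows plus one extreme value in the second column
rng = np.random.default_rng(1)
data = rng.normal(loc=[150.0, 200.0], scale=[10.0, 20.0], size=(50, 2))
data[7, 1] = 600.0  # planted outlier

print(zscore_outliers(data, ["feature_a", "feature_b"]))
```

In the notebook, the equivalent call would pass `X_scaled` and `list(X.columns)`.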

Finally, we can move on to Advanced Feature Analysis. This step involves a deeper look into specific features to understand their individual impact and relationships with the target variable ('status'). This can include analyses like feature importance from the model, bivariate analyses, or looking into interaction effects between features.


In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Standardizing the features before applying PCA
X_scaled = StandardScaler().fit_transform(X)

# Applying PCA
pca = PCA(n_components=0.95)  # retain 95% of the variance
X_pca = pca.fit_transform(X_scaled)

# Number of components chosen and the explained variance ratio
num_components = pca.n_components_
explained_variance = np.sum(pca.explained_variance_ratio_)

num_components, explained_variance


In [None]:
from scipy import stats

# Detecting outliers using Z-score
# Note: data points with an absolute Z-score greater than 3 are commonly treated as outliers
z_scores = np.abs(stats.zscore(X_scaled))
outliers = np.where(z_scores > 3)

# Number of outliers detected
num_outliers = len(set(outliers[0]))
num_outliers


In [None]:
outliers

# Logistic Regression


The logistic regression analysis consists of the following steps:

1. **Perform Logistic Regression**: We will fit a logistic regression model to the dataset.
2. **Model Evaluation**: Assess the performance of the model.
3. **Visualization**:
    - **Coefficient Plot**: To visualize the impact of each feature.
    - **Confusion Matrix**: For a visual representation of the model's performance.
    - **ROC Curve**: To assess the model's ability to discriminate between the two classes.

### Logistic Regression Analysis Results

1. **Model Accuracy**: The logistic regression model achieved an accuracy of approximately 84.75% on the test set.
2. **Confusion Matrix**:
   - True Negative (TN): 9
   - False Positive (FP): 6
   - False Negative (FN): 3
   - True Positive (TP): 41
3. **ROC AUC Score**: The area under the ROC curve is approximately 0.898, indicating a good discriminatory ability of the model between the two classes.

### Visualizations

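1. **Coefficient Plot**: Shows the sign and size of each feature's coefficient. Because the model is fit on unscaled features, coefficient magnitudes depend on each feature's units and are not directly comparable across features.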
2. **Confusion Matrix**: A visual representation of the model's performance.
3. **ROC Curve**: To visualize the model's discriminative ability.

#### 1. Coefficient Plot

- **Positive Coefficients**: Increase the log-odds of the target variable being 1 (indicating a higher likelihood of Parkinson's presence, if that's what 'status' 1 represents).
- **Negative Coefficients**: Decrease the log-odds of the target variable being 1.
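
A small worked example may help connect log-odds to probabilities. The intercept and coefficient below are hypothetical values chosen for illustration, not coefficients from the fitted model.

```python
from math import exp

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + exp(-log_odds))

# Hypothetical intercept and coefficient, chosen only for illustration
intercept = -0.5
coefficient = 0.8  # positive: raises the log-odds of status 1

for x in (0.0, 1.0, 2.0):
    log_odds = intercept + coefficient * x
    print(x, round(log_odds, 2), round(sigmoid(log_odds), 3))

# A one-unit increase in x multiplies the odds by exp(coefficient)
odds_ratio = exp(coefficient)
print(round(odds_ratio, 3))
```

With a negative coefficient the same loop would show the probability falling as `x` grows.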

#### 2. Confusion Matrix

- **True Positives (TP)**: 41 (correctly predicted status 1)
- **True Negatives (TN)**: 9 (correctly predicted status 0)
- **False Positives (FP)**: 6 (incorrectly predicted as status 1)
- **False Negatives (FN)**: 3 (incorrectly predicted as status 0)
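
The four counts above are enough to recompute accuracy and the per-class metrics by hand. The sketch below does so with a hypothetical helper `prf`, which takes the counts for whichever class is treated as positive.

```python
# Counts reported for the logistic regression model in this section
tn, fp, fn, tp = 9, 6, 3, 41

def prf(true_hits, false_alarms, misses):
    """Precision, recall and F1 for one class from its confusion-matrix counts."""
    precision = true_hits / (true_hits + false_alarms)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Status 1 as the positive class
p1, r1, f1_1 = prf(tp, fp, fn)
# Status 0 as the positive class: the roles of the off-diagonal cells swap
p0, r0, f1_0 = prf(tn, fn, fp)

print(f"accuracy={accuracy:.4f}")
print(f"status 1: precision={p1:.3f} recall={r1:.3f} f1={f1_1:.3f}")
print(f"status 0: precision={p0:.3f} recall={r0:.3f} f1={f1_0:.3f}")
```

The recomputed accuracy, 50 correct out of 59, matches the reported 84.75%.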


#### 3. ROC Curve

The Receiver Operating Characteristic (ROC) Curve and the Area Under the Curve (AUC) provide insights into the model's classification ability:

- The blue line represents the ROC curve for the logistic regression model, with an AUC of approximately 0.90.
- The gray dashed line represents a purely random classifier (AUC = 0.50).
- The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

An AUC close to 1 indicates a high ability of the model to differentiate between the two classes (status 0 and status 1). In this case, the AUC of 0.90 suggests that the model has a good discriminative ability.
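
One way to read the AUC is as the probability that a randomly chosen status 1 sample receives a higher score than a randomly chosen status 0 sample. The sketch below computes it that way on made-up labels and scores. `roc_auc_from_scores` is a hypothetical helper written for illustration, not the implementation behind scikit-learn's `roc_auc_score`.

```python
def roc_auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 1, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20, 0.60]

# 11 of the 12 positive/negative pairs are ordered correctly
print(roc_auc_from_scores(y_true, scores))
```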

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
import numpy as np

# Creating and training a Logistic Regression model
log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_log_reg = log_reg_model.predict(X_test)
y_pred_proba_log_reg = log_reg_model.predict_proba(X_test)[:, 1]

# Model Evaluation
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
conf_matrix = confusion_matrix(y_test, y_pred_log_reg)
roc_auc = roc_auc_score(y_test, y_pred_proba_log_reg)

# Coefficients for visualization
coefficients = log_reg_model.coef_[0]

accuracy_log_reg


In [None]:
conf_matrix

In [None]:
roc_auc

In [None]:
coefficients

In [None]:
# Coefficient Plot
feature_names = X.columns
coefficients = log_reg_model.coef_[0]

plt.figure(figsize=(10, 6))
coeff_plot = sns.barplot(x=coefficients, y=feature_names)
plt.title('Logistic Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.show()


In [None]:
# Confusion Matrix Plot
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.show()


In [None]:
# ROC Curve Plot
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_log_reg)

plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()


# Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

# Creating and training a Gaussian Naive Bayes model
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_nb = naive_bayes_model.predict(X_test)
y_pred_proba_nb = naive_bayes_model.predict_proba(X_test)[:, 1]

# Model Evaluation
accuracy_nb = accuracy_score(y_test, y_pred_nb)
conf_matrix_nb = confusion_matrix(y_test, y_pred_nb)
roc_auc_nb = roc_auc_score(y_test, y_pred_proba_nb)

accuracy_nb, conf_matrix_nb, roc_auc_nb


In [None]:
# Confusion Matrix Plot for Naive Bayes
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix_nb, annot=True, fmt="d", cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Naive Bayes Confusion Matrix')
plt.show()


In [None]:
# ROC Curve Plot for Naive Bayes
fpr_nb, tpr_nb, thresholds_nb = roc_curve(y_test, y_pred_proba_nb)

plt.figure(figsize=(6, 6))
plt.plot(fpr_nb, tpr_nb, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_nb)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Naive Bayes ROC Curve')
plt.legend(loc="lower right")
plt.show()


The Naive Bayes analysis consists of the following steps:

1. **Fit a Naive Bayes Model**: Apply the Naive Bayes algorithm to the dataset.
2. **Evaluate the Model**: Assess the model's performance.
3. **Visualizations**:
   - **Confusion Matrix**: To visualize the performance of the model.
   - **ROC Curve**: Assessing the model's discriminative ability.


### Naive Bayes Model Analysis Results

1. **Model Accuracy**: The Naive Bayes model achieved an accuracy of approximately 74.58% on the test set.
2. **Confusion Matrix**:
   - True Negative (TN): 12
   - False Positive (FP): 3
   - False Negative (FN): 12
   - True Positive (TP): 32
3. **ROC AUC Score**: The area under the ROC curve is approximately 0.853, indicating a good ability of the model to differentiate between the two classes.

### Visualizations


1. **Confusion Matrix**: To understand the distribution of true and false predictions.
2. **ROC Curve**: To assess the model's ability to discriminate between the classes.

#### 1. Confusion Matrix Visualization
- **True Positives (TP)**: 32 (correctly predicted status 1)
- **True Negatives (TN)**: 12 (correctly predicted status 0)
- **False Positives (FP)**: 3 (incorrectly predicted as status 1)
- **False Negatives (FN)**: 12 (incorrectly predicted as status 0)

This matrix provides a clear picture of the model's performance in terms of correct and incorrect predictions.

#### 2. ROC Curve Visualization

- The blue line represents the ROC curve for the Naive Bayes model, with an AUC of approximately 0.853.
- The gray dashed line represents a random classifier (AUC = 0.50).
- The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds.

An AUC of 0.853 suggests that the Naive Bayes model has a good ability to distinguish between the two classes (status 0 and status 1). While not as high as the logistic regression model's performance, it still indicates a respectable level of discriminative ability.


# K-Nearest Neighbors


In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Creating and training a KNN model (with k=5)
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_knn = knn_model.predict(X_test)
y_pred_proba_knn = knn_model.predict_proba(X_test)[:, 1]

# Model Evaluation
accuracy_knn = accuracy_score(y_test, y_pred_knn)
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)
roc_auc_knn = roc_auc_score(y_test, y_pred_proba_knn)

accuracy_knn, conf_matrix_knn, roc_auc_knn


In [None]:
# Confusion Matrix Plot for KNN
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix_knn, annot=True, fmt="d", cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('KNN Confusion Matrix')
plt.show()


In [None]:
# ROC Curve Plot for KNN
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test, y_pred_proba_knn)

plt.figure(figsize=(6, 6))
plt.plot(fpr_knn, tpr_knn, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_knn)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('KNN ROC Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Exploring the impact of different 'k' values on KNN model accuracy
k_values = range(1, 21)
accuracy_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred_k = knn.predict(X_test)
    accuracy_k = accuracy_score(y_test, y_pred_k)
    accuracy_scores.append(accuracy_k)

# Plotting accuracy vs. k values
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracy_scores, marker='o', linestyle='-', color='blue')
plt.title('KNN Accuracy for Different k Values')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.xticks(k_values)
plt.show()


The KNN analysis consists of the following steps:

1. **Fit a KNN Model**: Apply KNN to the dataset.
2. **Evaluate the Model**: Assess the model's performance.
3. **Visualizations**:
   - **Confusion Matrix**: For visualizing the performance of the model.
   - **ROC Curve**: To assess the discriminative ability of the model.
   - **K-value Analysis**: Visualizing the impact of different 'k' values on model accuracy.

### K-Nearest Neighbors (KNN) Model Analysis Results

1. **Model Accuracy**: The KNN model (with k=5) achieved an accuracy of approximately 83.05% on the test set.
2. **Confusion Matrix**:
   - True Negative (TN): 8
   - False Positive (FP): 7
   - False Negative (FN): 3
   - True Positive (TP): 41
3. **ROC AUC Score**: The area under the ROC curve is approximately 0.769, indicating a reasonable ability of the model to differentiate between the two classes.

### Visualizations


1. **Confusion Matrix**: To visualize the model's classification accuracy.
2. **ROC Curve**: Assessing the discriminative ability of the model.
3. **K-value Analysis**: Exploring how different 'k' values impact the model accuracy.

#### 1. Confusion Matrix Visualization

- **True Positives (TP)**: 41 (correctly predicted status 1)
- **True Negatives (TN)**: 8 (correctly predicted status 0)
- **False Positives (FP)**: 7 (incorrectly predicted as status 1)
- **False Negatives (FN)**: 3 (incorrectly predicted as status 0)

#### 2. ROC Curve Visualization

- The blue line represents the ROC curve for the KNN model, with an AUC of approximately 0.769.
- The gray dashed line represents a random classifier (AUC = 0.50).
- The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds.

An AUC of 0.769 suggests a reasonable ability of the KNN model to differentiate between the two classes, though it's lower compared to the logistic regression and Naive Bayes models we previously examined.

#### 3. K-value Analysis

Finally, we analyze how different 'k' values (number of neighbors) affect the accuracy of the KNN model by plotting test-set accuracy for k = 1 to 20:

- The accuracy fluctuates noticeably as 'k' changes, which shows that the model is sensitive to the choice of 'k'.
- The plot can be used to pick the 'k' with the highest accuracy. Because these accuracies are measured on the test set, a 'k' chosen this way gives an optimistic estimate; cross-validation on the training data is a safer way to select it.

Choosing the right 'k' value is crucial for KNN models, as it balances the bias-variance tradeoff. A very small 'k' can make the model sensitive to noise, while a very large 'k' might oversimplify the model.
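
A common way to choose 'k' without touching the test set is cross-validation. The sketch below assumes synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`. It also scales the features inside a pipeline, which the notebook's KNN cell does not do, because KNN distances are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X_train and y_train
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    weights=[0.25, 0.75], random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_scores = {}
for k in range(1, 21):
    # The scaler is refit on the training part of every fold
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

The selected `best_k` would then be evaluated once on the held-out test set.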


# Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Creating and training a Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_dt = decision_tree_model.predict(X_test)

# Model Evaluation
accuracy_dt = accuracy_score(y_test, y_pred_dt)
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)

accuracy_dt, conf_matrix_dt


In [None]:
from sklearn.tree import plot_tree

# Visualizing the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(decision_tree_model, filled=True, feature_names=X.columns, class_names=['0', '1'], fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()


In [None]:
# Confusion Matrix Plot for Decision Tree
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix_dt, annot=True, fmt="d", cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Decision Tree Confusion Matrix')
plt.show()


In [None]:
# Feature Importance from Decision Tree
feature_importances_dt = decision_tree_model.feature_importances_

# Plotting feature importances
plt.figure(figsize=(12, 8))
importance_plot = sns.barplot(x=feature_importances_dt, y=X.columns)
plt.title('Feature Importances in Decision Tree')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()



The Decision Tree analysis consists of the following steps:

1. **Fitting a Decision Tree Model**: We will train a Decision Tree classifier on the dataset.
2. **Evaluating the Model**: Assessing the performance of the model.
3. **Visualizations**:
   - **Tree Visualization**: To visually inspect the structure of the trained decision tree.
   - **Confusion Matrix**: For a clear representation of the model's prediction accuracy.
   - **Feature Importance**: To understand which features are most influential in making predictions.


### Decision Tree Model Analysis Results

1. **Model Accuracy**: The Decision Tree model achieved an accuracy of approximately 86.44% on the test set.
2. **Confusion Matrix**:
   - True Negative (TN): 12
   - False Positive (FP): 3
   - False Negative (FN): 5
   - True Positive (TP): 39

### Visualizations


1. **Tree Visualization**: To inspect the structure of the decision tree.
2. **Confusion Matrix**: For a clear representation of the model's prediction accuracy.
3. **Feature Importance Plot**: To understand which features are most influential.

#### 1. Tree Visualization

The visualization above shows the structure of the trained Decision Tree. Each internal node represents a decision based on the value of a certain feature, leading to a classification at the leaf nodes. Node colors indicate the majority class, and the shade reflects how strongly that class dominates the node.

This tree structure is useful for understanding the decision-making process of the model and how different features contribute to these decisions.

#### 2. Confusion Matrix Visualization

- **True Positives (TP)**: 39 (correctly predicted status 1)
- **True Negatives (TN)**: 12 (correctly predicted status 0)
- **False Positives (FP)**: 3 (incorrectly predicted as status 1)
- **False Negatives (FN)**: 5 (incorrectly predicted as status 0)

This matrix helps in understanding the balance between correctly and incorrectly classified instances by the Decision Tree model.

#### 3. Feature Importance Plot

The bar plot above illustrates the feature importances as identified by the Decision Tree model. Each bar represents a feature's importance in making predictions, with longer bars indicating more important features.

From this visualization, you can discern which features the Decision Tree model found most useful in classifying the data. This insight can be valuable for feature selection in future modeling or for understanding the underlying patterns in the data.

# Support Vector Machines

In [None]:
from sklearn.svm import SVC

# Creating and training an SVM model with RBF kernel
svm_model = SVC(kernel='rbf', probability=True, random_state=42)
svm_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_svm = svm_model.predict(X_test)
y_pred_proba_svm = svm_model.predict_proba(X_test)[:, 1]

# Model Evaluation
accuracy_svm = accuracy_score(y_test, y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
roc_auc_svm = roc_auc_score(y_test, y_pred_proba_svm)

accuracy_svm, conf_matrix_svm, roc_auc_svm


In [None]:
# ROC Curve Plot for SVM
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_test, y_pred_proba_svm)

plt.figure(figsize=(6, 6))
plt.plot(fpr_svm, tpr_svm, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_svm)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('SVM ROC Curve')
plt.legend(loc="lower right")
plt.show()



The SVM analysis consists of the following steps:

1. **Fitting an SVM Model**: We will train an SVM classifier on the dataset.
2. **Evaluating the Model**: Assessing the performance of the model.
3. **Visualizations**:
   - **Confusion Matrix**: For a clear representation of the model's prediction accuracy.
   - **ROC Curve**: To assess the discriminative ability of the model.
   - **Decision Boundary (if feasible)**: To visualize how the SVM separates the classes (note that this is more straightforward in lower-dimensional spaces).


### Support Vector Machine (SVM) Model Analysis Results

1. **Model Accuracy**: The SVM model achieved an accuracy of approximately 81.36% on the test set.
2. **Confusion Matrix**:
   - True Negative (TN): 4
   - False Positive (FP): 11
   - False Negative (FN): 0
   - True Positive (TP): 44
3. **ROC AUC Score**: The area under the ROC curve is approximately 0.730, indicating a moderate ability of the model to differentiate between the two classes.

### Visualizations


1. **Confusion Matrix**: To visualize the model's classification accuracy.
2. **ROC Curve**: To assess the discriminative ability of the model.

#### 1. Confusion Matrix Visualization

- **True Positives (TP)**: 44 (correctly predicted status 1)
- **True Negatives (TN)**: 4 (correctly predicted status 0)
- **False Positives (FP)**: 11 (incorrectly predicted as status 1)
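- **False Negatives (FN)**: 0 (no status 1 samples predicted as status 0)

The model predicts status 1 for 55 of the 59 test samples, so only 4 of the 15 status 0 samples are classified correctly. Its accuracy therefore largely reflects the class imbalance in the test set.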
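
Accuracy alone hides how differently the models treat the two classes. As a rough comparison, the sketch below recomputes sensitivity, specificity, and balanced accuracy from the confusion-matrix counts reported in each model's section. The dictionary name `confusion_counts` is introduced here for illustration.

```python
# (TN, FP, FN, TP) as reported in the model sections of this notebook
confusion_counts = {
    "Logistic Regression": (9, 6, 3, 41),
    "Naive Bayes": (12, 3, 12, 32),
    "KNN": (8, 7, 3, 41),
    "Decision Tree": (12, 3, 5, 39),
    "SVM": (4, 11, 0, 44),
}

summary = {}
for model_name, (tn, fp, fn, tp) in confusion_counts.items():
    sensitivity = tp / (tp + fn)  # recall for status 1
    specificity = tn / (tn + fp)  # recall for status 0
    balanced_accuracy = (sensitivity + specificity) / 2
    summary[model_name] = (sensitivity, specificity, balanced_accuracy)
    print(f"{model_name:20s} sens={sensitivity:.3f} spec={specificity:.3f} "
          f"balanced={balanced_accuracy:.3f}")
```

On these counts the SVM has perfect sensitivity but the lowest specificity, while the Decision Tree has the highest balanced accuracy.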


#### 2. ROC Curve Visualization

The ROC Curve for the SVM model is presented above:

- The blue line represents the ROC curve for the SVM model, with an AUC of approximately 0.730.
- The gray dashed line represents a random classifier (AUC = 0.50).
- The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds.

An AUC of 0.730 suggests a moderate ability of the SVM model to differentiate between the two classes. While this is a bit lower compared to some of the other models we examined, it still indicates a reasonable level of discriminative ability.

#### Decision Boundary Visualization

Visualizing the decision boundary of an SVM in a high-dimensional space (like our dataset) is complex. Typically, decision boundary visualizations are more insightful for datasets with two or three features. However, we can perform a dimensionality reduction (e.g., using PCA) to project the data onto a lower-dimensional space and visualize the decision boundary there. This will be a simplified representation and may not capture all the nuances of the high-dimensional decision process.
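
The projection approach described above could look roughly like the sketch below. It rests on several assumptions: the data is synthetic (`make_classification` stands in for `X` and `y`), the SVM is refit on the two principal components and is therefore a different model from the one evaluated earlier, and `plot_boundary` is a hypothetical helper that only draws when called.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook this would be X and y
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=42
)

# Project the standardized features onto the first two principal components
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = projector.fit_transform(X_demo)

# Fit a separate SVM in the 2-D space (not the full-feature model from above)
svm_2d = SVC(kernel="rbf", random_state=42).fit(X_2d, y_demo)

# Evaluate the classifier on a grid covering the projected data
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid_pred = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

def plot_boundary():
    """Draw the predicted regions and the projected samples."""
    import matplotlib.pyplot as plt
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, grid_pred, alpha=0.3, cmap="coolwarm")
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_demo, cmap="coolwarm", edgecolor="k", s=25)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("SVM decision regions in PCA space (synthetic data)")
    plt.show()
```

Calling `plot_boundary()` in a notebook cell renders the figure. The boundary shown is only as faithful as the two components are to the original feature space.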

In [None]:
import numpy as np

# Performance metrics for all models
model_performance = {
    'Model': ['Logistic Regression', 'Naive Bayes', 'KNN', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_log_reg, accuracy_nb, accuracy_knn, accuracy_dt, accuracy_svm],
    # No ROC AUC was computed for the Decision Tree, so mark it as missing.
    # Using NaN (not the string 'N/A') keeps the column numeric for plotting.
    'ROC AUC': [roc_auc, roc_auc_nb, roc_auc_knn, np.nan, roc_auc_svm]
}

# Converting to DataFrame for easier plotting
model_performance_df = pd.DataFrame(model_performance)

# Plotting model performance
plt.figure(figsize=(12, 6))

# Plotting Accuracy
plt.subplot(1, 2, 1)
sns.barplot(x='Accuracy', y='Model', data=model_performance_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.xlabel('Accuracy')
plt.ylabel('Model')

# Plotting ROC AUC (rows without a ROC AUC value are dropped)
plt.subplot(1, 2, 2)
sns.barplot(x='ROC AUC', y='Model', data=model_performance_df.dropna(subset=['ROC AUC']), palette='magma')
plt.title('Model ROC AUC Comparison')
plt.xlabel('ROC AUC')
plt.ylabel('')

plt.tight_layout()
plt.show()


In [None]:
# Selecting a subset of features for individual analysis
features_for_individual_analysis = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'HNR', 'spread1']

plt.figure(figsize=(15, 10))

# Creating boxplots for each selected feature against the 'status'
for i, feature in enumerate(features_for_individual_analysis, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(x='status', y=feature, data=parkinsons_data)
    plt.title(f'Boxplot of {feature} by Status')

plt.tight_layout()
plt.show()
