# **🎯AI-Powered Early Disease Prediction using Multi-Source Health Data**

![Image Alt Text](project_image_.jpg)

# 📑 Table of Contents

1. 📌 Introduction
2. 🎯 Goal of the Project
3. 📊 Data Story
4. 🧹 Data Preprocessing

## 📌 Introduction

In today’s fast-paced world, chronic illnesses such as **Diabetes**, **Heart Disease**, and **Parkinson’s Disease** are becoming more widespread, often going undiagnosed until advanced stages. Early detection is key to improving outcomes and lowering long-term healthcare costs.

This project, titled **"AI-Powered Early Disease Prediction using Multi-Source Health Data,"** aims to develop a machine learning model that accurately predicts the likelihood of developing these diseases. By analyzing **clinical health records**, **lifestyle indicators**, and **medical measurements**, the model can help identify risk factors and issue early warnings.

We utilize three real-world datasets:  
- 🧠 **Parkinson’s Disease Dataset**  
- ❤️ **Heart Disease Dataset**  
- 💉 **Diabetes Dataset**

The power of **AI and classification algorithms** enables our system to learn patterns and deliver reliable, data-driven insights. This project showcases how **integrating multiple health sources** and applying **robust preprocessing and modeling techniques** can support **preventive healthcare** and assist professionals in making faster, smarter decisions.


## 🎯 Goal of the Project

The primary goal of this project is to develop a machine learning model that can **predict early signs of disease** using health-related data. By analyzing multiple health indicators such as glucose levels, blood pressure, BMI, and age, the project aims to:

- Detect patterns and risk factors associated with disease onset  
- Assist in early diagnosis and preventive healthcare  
- Empower healthcare professionals with data-driven insights  
- Improve patient outcomes through timely predictions


## 📈 Data Story 
## 📊 About the Dataset

The datasets used in this project are sourced from the **UC Irvine Machine Learning Repository**, a trusted platform for machine learning research and experimentation.

## 🔗 Source of Datasets

The diabetes dataset is hosted on **Kaggle** (under the `uciml` account), while the heart disease and Parkinson's datasets are publicly available from the **UCI Machine Learning Repository**:

- **Diabetes Dataset:**  
  Source Link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

- **Heart Disease Dataset:**  
  Source Link: https://archive.ics.uci.edu/dataset/45/heart+disease

- **Parkinson's Disease Dataset:**  
  Source Link: https://archive.ics.uci.edu/ml/datasets/parkinsons

These datasets provide medical and clinical data that are essential for training models to predict early signs of disease.


---

### 📝 **Dataset Description**

The project involves three distinct datasets, each tailored to predict a specific health condition using numerical and categorical clinical features. The goal is to **detect early signs of disease** using predictive modeling.

- **Diabetes Dataset:** 768 entries × 9 columns
- **Heart Disease Dataset:** 303 entries × 14 columns
- **Parkinson's Disease Dataset:** 195 entries × 24 columns

Each dataset includes both input features (health metrics) and a target column indicating presence or absence of disease.

---

### 🔍 **Features/Columns Overview**
### 🩺 1. Diabetes Dataset
- **Samples:** 768  
- **Features:** Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age  
- **Target:** Outcome (0 = No Diabetes, 1 = Diabetes)  
- **Goal:** Identify individuals at risk of diabetes based on clinical and personal health parameters.

### ❤️ 2. Heart Disease Dataset
- **Samples:** 303  
- **Features:** Age, Sex, Chest Pain Type, Resting Blood Pressure, Cholesterol, Fasting Blood Sugar, ECG, Max Heart Rate, etc.  
- **Target:** Target (1 = Heart Disease, 0 = No Heart Disease)  
- **Goal:** Predict presence of heart disease using key cardiovascular indicators.

### 🧠 3. Parkinson's Disease Dataset
- **Samples:** 195  
- **Features:** 22 voice measurements including frequency, jitter, shimmer, and other biomedical voice markers  
- **Target:** Status (1 = Parkinson's Disease, 0 = Healthy)  
- **Goal:** Use vocal biomarkers to detect the presence of Parkinson's Disease at early stages.

---

### 🧰 **Tools Used**
- **Python**
- **Pandas** – Data handling
- **Matplotlib & Seaborn** – Data visualization
- **Scikit-learn** – Machine learning and model evaluation
- **Jupyter Notebook** – Interactive development

---

### 🧠 **Data Story Summary**

This project explores how diverse health features can be used to **predict diseases at early stages**. Each dataset presents a different perspective on human health — from blood glucose levels to heart function and vocal signal patterns. By integrating them into one unified machine learning workflow, the goal is to **enhance early diagnosis and promote preventive care** through data.


## 🧹 Data Preprocessing

**1. 📚Importing Libraries**

In [None]:
# 📦 Basic Libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# 📊 Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# ⚙️ Data Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, PowerTransformer
from imblearn.over_sampling import SMOTE

# 🎯 Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif

# 🔀 Data Splitting
from sklearn.model_selection import train_test_split, GridSearchCV

# 🤖 Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# 📏 Model Evaluation
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    RocCurveDisplay
)

# 🔗 Pipelines and additional ensemble model
# (StandardScaler, SMOTE, RandomForestClassifier, KNeighborsClassifier,
#  GridSearchCV and classification_report are already imported above)
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import AdaBoostClassifier


# 💾 Model Saving
import joblib





**2.📂  Load the dataset**

In [None]:
# Load dataset
diabetes = pd.read_csv("diabetes.csv")
heart = pd.read_csv("heart (1).csv")
parkinsons = pd.read_csv("parkinsons.csv")
parkinsons


**3.Understand the data structure**

In [None]:
diabetes.head()

In [None]:
heart.head()

In [None]:
parkinsons.head()

In [None]:
print("Shape of the dataset:")
diabetes.shape

In [None]:
print("Shape of the dataset:")
heart.shape

In [None]:
print("Shape of the dataset:")
parkinsons.shape

In [None]:
print("Dataset Info:")
diabetes.info()

In [None]:
print("Dataset Info:")
heart.info()

In [None]:
print("Dataset Info:")
parkinsons.info()

In [None]:
print("Statistical Summary:")
diabetes.describe()

In [None]:
print("Statistical Summary:")
heart.describe()

In [None]:
print("Statistical Summary:")
parkinsons.describe()

In [None]:
diabetes.columns

In [None]:
heart.columns

In [None]:
parkinsons.columns

## 🔶 1. Diabetes Dataset

**5.Handle Missing Data**

In [None]:
# Check missing values
missing_values = diabetes.isnull().sum()
print("Missing values in each column:")
print(missing_values)

In [None]:
# 🔚 Final Check
print(diabetes.isnull().sum())  # confirm that every count is 0
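
One caveat worth checking: `isnull()` only counts true NaN values. In the Pima Indians diabetes data, physiologically implausible zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI are commonly treated as placeholders for missing readings. The sketch below uses a small made-up frame rather than `diabetes.csv`; the values and the `zero_as_missing` list are illustrative assumptions and should be reviewed before applying the idea to the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for diabetes.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 0, 0],  # zero is a valid value in this column
})

# Columns where a zero is not physiologically plausible (assumption to review)
zero_as_missing = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# Count zeros per column, then mark them as NaN so isnull() can see them
zero_counts = (toy[zero_as_missing] == 0).sum()
toy_marked = toy.copy()
toy_marked[zero_as_missing] = toy_marked[zero_as_missing].replace(0, np.nan)

print("Zeros per column:")
print(zero_counts)
print("Missing values after marking zeros as NaN:")
print(toy_marked.isnull().sum())
```

Whether to impute or drop the marked values afterwards is a separate modeling decision.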

**6.Handle Duplicates**

✅ Check for Duplicate Rows in the Diabetes Dataset

In [None]:
# Check for duplicates
print("Duplicate Rows:", diabetes.duplicated().sum())

**7.Handle outliers**

✅ Step: Check Skewness in the Diabetes Dataset

In [None]:
columns = diabetes.columns

for i in columns:
    print(f"Skewness of {i}: {diabetes[i].skew()}")


In [None]:
print(diabetes.skew().sort_values(ascending=True))

In [None]:
# Boxplot to detect potential outliers in the data
plt.figure(figsize=(12, 8))
sns.boxplot(data=diabetes)
plt.xticks(rotation=90)
plt.title('Boxplot of Features Before Outlier Removal')
plt.show()

### 📦 Outlier Removal Based on Skewness

In [None]:
# 📌 IQR method for removing outliers from selected columns
def remove_outliers(data, columns):
    """Drop rows that fall outside 1.5 * IQR in any of the listed columns.

    Bounds are computed on the original data; rows are filtered cumulatively.
    """
    data_filtered = data.copy()

    for column in columns:
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        data_filtered = data_filtered[
            (data_filtered[column] >= lower_bound) & (data_filtered[column] <= upper_bound)
        ]

    return data_filtered

# ✅ Apply the function to the skewed columns of the diabetes DataFrame
columns_to_clean = ['Pregnancies', 'BloodPressure', 'Insulin', 'DiabetesPedigreeFunction', 'Age']
cleaned_data = remove_outliers(diabetes, columns_to_clean)


Outliers are removed from 'Pregnancies', 'BloodPressure', 'Insulin', 'DiabetesPedigreeFunction', and 'Age' using the `remove_outliers` function.

In [None]:
# Visualize boxplot after removing outliers
plt.figure(figsize=(12, 8))
sns.boxplot(data=cleaned_data)
plt.xticks(rotation=90)
plt.title('Boxplot of Features After Outlier Removal')
plt.show()

After analyzing the skewness of each column in the dataset, we identified five features (**Insulin**, **DiabetesPedigreeFunction**, **Age**, **BloodPressure**, and **Pregnancies**) that exhibited high skewness and contained significant outliers. To improve data quality and reduce noise, we applied the IQR (Interquartile Range) method to these columns to remove the extreme values.

The boxplot above, generated after outlier removal, shows that these features now have fewer extreme data points. Training on data with fewer extreme values can make the machine learning model more stable, at the cost of discarding some rows.
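
To make the IQR rule concrete, here is a minimal worked example on a toy series (the numbers are made up and are not from the diabetes data). It computes the same quartiles and 1.5 × IQR bounds that `remove_outliers` uses and shows which values survive.

```python
import pandas as pd

# Toy values with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = values[(values >= lower_bound) & (values <= upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Kept values:", kept.tolist())
```

Comparing `len(values)` with `len(kept)` is a quick way to see how many rows a given column costs.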

### 🔍 Skewness Check After Outlier Removal


In [None]:
print("📊 Skewness After Outlier Removal:")
cleaned_data.skew().sort_values(ascending=True)

After applying the IQR method, some outliers still remain in Pregnancies, BloodPressure, Insulin, DiabetesPedigreeFunction, and Age.

**8.Handle Skewness**

In [None]:
# Apply a log transformation (log1p) to right-skewed features, i.e. skewness > 1.
# A square-root transformation is an option when skewness is between 0.5 and 1,
# but it is not applied in this cell.
new_data2 = cleaned_data.copy()  # keep an untransformed copy before skewness correction
for col in cleaned_data.columns:
    if cleaned_data[col].skew() > 1:
        cleaned_data[col] = np.log1p(cleaned_data[col])

print("\nSkewness after log transformation:")
print(cleaned_data.skew().sort_values(ascending=True))

In [None]:
cleaned_data.shape

### 🔃 Handling Skewness in the Diabetes Dataset

After outlier removal, some features still showed high skewness, such as BloodPressure, Insulin, DiabetesPedigreeFunction, and Age.

To normalize these distributions, we applied the **log transformation (log1p)** to every feature with skewness above 1. This compresses large values and spreads out small ones, reducing the skew.

Reducing skewness brings features closer to a symmetric distribution, which can benefit algorithms like Logistic Regression, SVM, and other linear models.
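
The effect of `log1p` on skewness can be checked on synthetic data. The sketch below draws a right-skewed (log-normal) sample as a stand-in for a column like Insulin; the distribution parameters are arbitrary assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed feature (log-normal), not taken from the dataset
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = skewed.skew()
skew_after = np.log1p(skewed).skew()

print(f"Skewness before log1p: {skew_before:.3f}")
print(f"Skewness after log1p:  {skew_after:.3f}")
```

`log1p` computes `log(1 + x)`, so it stays defined for zero values, which matters for columns that contain zeros.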


In [None]:
# Separate columns by dtype: integer-typed vs float-typed.
# Note: an integer dtype does not by itself make a column categorical
# (for example, Glucose is an integer measurement), so adjust the lists manually if needed.
categorical_cols = cleaned_data.select_dtypes(include='int64').columns.tolist()
numerical_cols = cleaned_data.select_dtypes(include='float64').columns.tolist()

print("Categorical Columns:", categorical_cols)
print("Numerical Columns:", numerical_cols)

## 🧪 Exploratory Data Analysis (EDA)
This section explores how each health-related feature in the dataset behaves individually, in pairs, and across multiple variables. These visuals help us understand patterns and relationships that are critical for building accurate AI disease prediction models.
 

1️⃣ 📊 Grouped Bar Chart (Feature Means by Outcome)

In [None]:
# Group by Outcome and calculate mean for each feature
grouped_means = cleaned_data.groupby("Outcome").mean().T

# Plot grouped bar chart (DataFrame.plot creates its own figure,
# so a separate plt.figure() call would only leave an empty figure behind)
grouped_means.plot(kind='bar', figsize=(12, 6), color=['#66c2a5', '#fc8d62'], width=0.75)
plt.title("📊 Average Feature Values by Diabetes Outcome", fontsize=14)
plt.xlabel("Health Features", fontsize=12)
plt.ylabel("Average Value", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(["No Diabetes (0)", "Diabetes (1)"], title="Outcome")
plt.tight_layout()
plt.show()

### 📊 Average Feature Values by Diabetes Outcome

The above grouped bar chart visualizes the average values of various health features for two patient groups — those **with diabetes (Outcome = 1)** and those **without diabetes (Outcome = 0)**.

Each feature from the dataset is represented on the x-axis, and the average value of that feature is shown on the y-axis for both classes. This chart helps us easily compare how each health metric varies between diabetic and non-diabetic individuals.

#### 🧾 Key Observations:
- **Glucose**: Diabetic patients have a significantly higher average glucose level than non-diabetic ones. This is one of the most distinguishing features.
- **BMI** and **Age**: These are also higher on average for diabetic individuals.
- **BloodPressure**, **SkinThickness**, and **DiabetesPedigreeFunction** show slight differences.
- **Pregnancies**: Diabetic patients tend to have a higher number of pregnancies.
- **Insulin**: Only a small visible difference, possibly due to the transformations applied above or to missing readings, which this dataset records as 0.

#### ✅ Conclusion:
This visual is important because it shows which features have the most impact on the diagnosis of diabetes. These patterns can help in selecting relevant features for machine learning models and building accurate disease prediction systems.


2️⃣  🔗 Pair Plot 

In [None]:
# 🔍 Pair Plot 
# Select the key numeric columns, plus 'Outcome' for coloring the points
selected_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'BMI', 'Age', 'Outcome']

# Plot pair plot
sns.pairplot(cleaned_data[selected_columns], hue='Outcome', palette='husl', diag_kind='kde')
plt.suptitle("Pair Plot of Key Features by Diabetes Outcome", y=1.02)
plt.show()


### 🔍 Pair Plot of Key Features by Diabetes Outcome

The pair plot above shows the relationships between selected health features and how they vary with the diabetes outcome. Each cell in the grid represents a scatter plot of one feature against another, helping identify correlations and class separation.

#### 🔑 Observations:
- **Glucose** and **BMI** show clear separation between diabetic and non-diabetic patients.
- **Age** and **Pregnancies** also demonstrate upward trends for diabetic cases.
- Diagonal plots show KDE (density) curves, giving insights into how each feature is distributed for each class (Outcome 0 vs Outcome 1).

#### ✅ Why This Visual Matters:
This pair plot gives a simple and powerful overview of feature behavior, helping data scientists and healthcare professionals understand which health metrics are most useful for early disease prediction. It also supports effective feature selection for building AI models.



3️⃣ 🥧 Pie Chart

In [None]:
# Count values in Outcome column
labels = ['No Diabetes (0)', 'Diabetes (1)']
sizes = cleaned_data['Outcome'].value_counts().sort_index()
colors = ['#66c2a5', '#fc8d62']

# Create pie chart
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90, shadow=True, explode=(0, 0.05))
plt.title("🧬 Diabetes Outcome Distribution", fontsize=14)
plt.axis('equal')  # Equal aspect ratio ensures a perfect circle
plt.show()

### 🥧 Pie Chart: Diabetes Outcome Distribution

This pie chart shows the distribution of diabetes outcomes in the dataset. It compares the proportion of patients who were diagnosed with diabetes (`Outcome = 1`) versus those who were not (`Outcome = 0`).

#### 📊 Interpretation:
- The chart clearly shows whether the dataset is **balanced or imbalanced**.
- A higher percentage of either class helps understand how models might behave (e.g., biased toward the majority class).

#### ✅ Why This Visual Matters:
This pie chart provides a simple overview of the dataset's target label. It helps doctors and data scientists see the overall ratio of positive to negative cases before training a prediction model.


4️⃣  🧪 Bar Chart – Average Glucose by Diabetes Outcome (Bivariate Analysis)


In [None]:
# Calculate average glucose by Outcome
glucose_avg = cleaned_data.groupby('Outcome')['Glucose'].mean().reset_index()

# Create bar chart
plt.figure(figsize=(6, 4))
sns.barplot(x='Outcome', y='Glucose', data=glucose_avg, palette='pastel')
plt.title("🧪 Average Glucose by Diabetes Outcome", fontsize=14)
plt.xlabel("Outcome (0 = No Diabetes, 1 = Diabetes)")
plt.ylabel("Average Glucose Level")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


### 🧪 Bar Chart – Average Glucose by Diabetes Outcome (Bivariate Analysis)

This bar chart displays the **average glucose levels** of patients categorized by their diabetes outcome:

- **Outcome = 0** represents patients without diabetes.
- **Outcome = 1** represents patients with diabetes.

#### 🔍 Observations:
- Patients diagnosed with diabetes have a **noticeably higher average glucose level** than those without.
- This suggests that **glucose is a key feature** in predicting the presence of diabetes.

#### ✅ Why This Visual Matters:
This bivariate analysis highlights a strong relationship between glucose levels and diabetes. It supports the idea that elevated glucose is a major indicator for early disease detection, making it an essential feature in any AI-powered predictive model.


5️⃣🧪 Histogram:

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(cleaned_data['Glucose'], kde=True, color='#66c2a5', bins=30)
plt.title("🧪 Distribution of Glucose Levels", fontsize=14)
plt.xlabel("Glucose")
plt.ylabel("Number of Patients")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


### 🧪 Histogram: Glucose Level Distribution (Univariate Analysis)

This histogram shows the distribution of **glucose levels** among patients in the dataset. It is a classic example of univariate analysis, where we examine the behavior of a single variable.

#### 🔍 Observations:
- Most patients have glucose values between **80 and 140**.
- A noticeable right skew indicates that some patients have very high glucose levels.
- Glucose is a key indicator in predicting diabetes, and understanding its spread is vital.

#### ✅ Why This Visual Matters:
This histogram highlights how glucose varies across the population and helps identify thresholds or risk ranges. It serves as a simple, standalone insight into one of the most important features in early disease prediction.


6️⃣ 📦 Boxplot 

In [None]:
plt.figure(figsize=(7, 5))
sns.boxplot(x='Outcome', y='BMI', data=cleaned_data, palette='Set2')
plt.title("📦 BMI by Diabetes Outcome", fontsize=14)
plt.xlabel("Outcome (0 = No Diabetes, 1 = Diabetes)")
plt.ylabel("BMI")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


### 📦 Boxplot: BMI by Diabetes Outcome (Bivariate Analysis)

This bivariate visual compares the **BMI distribution** between patients with and without diabetes. Each box shows the spread of BMI values for each outcome class.

#### 🔍 Observations:
- Diabetic patients (Outcome = 1) tend to have higher BMI values.
- The median BMI is higher in the diabetic group.
- There are outliers in both groups, but more extreme in the diabetic class.

#### ✅ Why This Visual Matters:
This boxplot helps show the **relationship between BMI and diabetes**, confirming that higher BMI is often associated with a higher risk of developing diabetes. It gives insight into how the two variables interact, which is useful for judging feature relevance in disease prediction models.


7️⃣🔥 Correlation Heatmap 

In [None]:
plt.figure(figsize=(10, 6))
corr_matrix = cleaned_data.corr()

# Create heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title("🔍 Correlation Heatmap of Features", fontsize=14)
plt.tight_layout()
plt.show()


### 🔍 Correlation Heatmap of Features (Multivariate Analysis)

This heatmap shows the **pairwise correlations** between all numeric features in the dataset. It helps identify how strongly one feature is linearly related to another, including how strongly each one is associated with the diabetes label (`Outcome`).

#### 🔍 Observations:
- **Glucose** has the highest positive correlation with `Outcome`, indicating it's a strong predictor of diabetes.
- **BMI**, **Age**, and **Pregnancies** also show moderate positive correlations.
- Some features like **SkinThickness** and **BloodPressure** are weakly correlated with the outcome.

#### ✅ Why This Visual Matters:
This heatmap helps in **feature selection**. Highly correlated features may contain similar information, and features with low correlation to the target may not be helpful. The heatmap is a go-to visual in AI and ML pipelines to understand data relationships before modeling.


### 🔡 Data Encoding

In this dataset, all columns are already in numerical format and represent measurable health-related features such as:

- **Pregnancies**
- **Glucose**
- **BloodPressure**
- **SkinThickness**
- **Insulin**
- **BMI**
- **DiabetesPedigreeFunction**
- **Age**
- **Outcome**

Since there are no categorical (text-based) columns like gender, region, or status, **encoding is not required** at this stage.

Had there been categorical variables, we would have used **Label Encoding** or **One-Hot Encoding** to convert them into numerical values suitable for machine learning models.


### ✅ Feature Selection


In [None]:
# Correlation matrix
plt.figure(figsize=(10, 6))
correlation = cleaned_data.corr()

# Heatmap
sns.heatmap(correlation, annot=True, cmap='YlGnBu')
plt.title("🔍 Correlation Heatmap for Feature Selection")
plt.show()


### 🎯 Target Variable

In this project, the target variable is:

**`Outcome`**

This column represents whether a patient is diabetic:
- `0` → The patient does **not** have diabetes
- `1` → The patient **has** diabetes

All other columns (like Glucose, BMI, Age, etc.) are input features used to train the AI model to predict this outcome. Accurately predicting this target variable is the goal of the machine learning model in this early disease prediction project.


In [None]:
# Correlation of each feature with the target variable
print( cleaned_data.corr()['Outcome'].sort_values(ascending=False))

### ✅ Final Feature Selection Based on Correlation

Using Pearson correlation analysis with the target variable (`Outcome`), we identified the strength of relationships between each feature and diabetes diagnosis.

#### 📊 Top Features Selected:
- `Glucose` (correlation = 0.459)
- `Age` (correlation = 0.301)
- `BMI` (correlation = 0.261)
- `Pregnancies` (correlation = 0.230)

These features have the highest positive correlation with diabetes.

Other features like **Insulin**, **SkinThickness**, and **DiabetesPedigreeFunction** showed low or no significant correlation. Note that the training cells later in this notebook still use all feature columns (`X = cleaned_data.drop('Outcome', axis=1)`), so this ranking serves as guidance rather than as a filter on the model inputs.
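
The correlation ranking can be illustrated on a toy frame. In this sketch the column names `x1` and `x2` and all values are hypothetical: `x1` rises with the label, while `x2` has the same pattern in both classes and therefore shows no linear association with it.

```python
import pandas as pd

toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],       # increases with the outcome
    "x2": [1, 2, 1, 1, 2, 1],       # identical pattern in both classes
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a feature with a low value here can still be useful to a non-linear model.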


In [None]:
# Split features (X) and target (y)
X = cleaned_data.drop('Outcome', axis=1)
y = cleaned_data['Outcome']

# Apply SelectKBest with f_classif scoring function
select_k = SelectKBest(score_func=f_classif, k=4)  # Selecting the top 4 features
X_selected = select_k.fit_transform(X, y)
print("Selected Features:", X.columns[select_k.get_support()])



### ✅ Feature Selection using SelectKBest (k = 4)

To improve model accuracy and reduce noise, we applied **feature selection** using the `SelectKBest` method with the **ANOVA F-test** (`f_classif`) scoring function. This helps us select the top features that are most statistically significant in predicting diabetes (`Outcome`).

We selected the top **4 features** (`k=4`) based on their F-scores.

#### 🏆 Selected Features:
- `Glucose`
- `Age`
- `BMI`
- `Pregnancies`

These features showed the highest F-scores with respect to the outcome, making them the strongest candidates for the AI-powered disease prediction model.

Feature selection helps in:
- Reducing overfitting
- Improving model accuracy
- Enhancing model performance and training speed
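
To see how `SelectKBest` ranks features, it helps to look at the F-scores of all columns rather than only the selected ones. The sketch below runs on synthetic data from `make_classification`; the feature names `f0` to `f5` and the choice of `k=3` are placeholders, not values from this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic binary classification data (stand-in for the diabetes features)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# F-score of every feature, highest first
f_scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = X_demo.columns[selector.get_support()].tolist()

print(f_scores)
print("Selected features:", selected)
```

Inspecting the full score table makes it easier to justify a particular `k`, for example by looking for a clear drop between consecutive scores.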


In [None]:
selected_scores = select_k.scores_[select_k.get_support()]  # F-scores of the selected features only
print("F-scores of the selected features:", selected_scores)

In [None]:
# Based on k scores we can choose number of features required
select_k = SelectKBest(score_func=f_classif, k=2)  # Selecting top 2 features (depends on user choice)
X_selected = select_k.fit_transform(X, y)

print("Selected Features:", X.columns[select_k.get_support()])

**Split Data into Training and Testing Sets**

In [None]:

# Define X and y again (if not already)
X = cleaned_data.drop('Outcome', axis=1)
y = cleaned_data['Outcome']

# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

### ⚖️ Applying SMOTE

In [None]:
# Before SMOTE
plt.figure(figsize=(6,4))
sns.countplot(x=y_train, palette="Set2")
plt.title("📉 Class Distribution Before SMOTE")
plt.xlabel("Outcome")
plt.ylabel("Count")
plt.show()


In [None]:
# Applying SMOTE
print("Original Class Distribution:", y_train.value_counts())
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print("Resampled Class Distribution:", pd.Series(y_train).value_counts())

In [None]:
# ✅ After SMOTE
plt.figure(figsize=(6,4))
sns.countplot(x=y_train, palette="Set1")
plt.title("📈 Class Distribution After SMOTE")
plt.xlabel("Outcome")
plt.ylabel("Count")
plt.show()

## ⚖️ Handling Class Imbalance with SMOTE

In the diabetes dataset, the target variable (`Outcome`) was **imbalanced** — meaning there were significantly more non-diabetic cases (`0`) than diabetic cases (`1`).  
This imbalance can cause machine learning models to become **biased**, predicting the majority class more often.  

To solve this, we applied **SMOTE (Synthetic Minority Oversampling Technique)**:  
- SMOTE generates synthetic samples of the minority class (`1` - diabetic).  
- This balances the dataset and helps the model learn equally from both classes.  

The count plots above clearly show the **class distribution before and after SMOTE**:
- **Before SMOTE** → Class `0` dominates.  
- **After SMOTE** → Both classes are balanced.  
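
The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain NumPy. This is a simplified illustration and not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # k nearest neighbours of each point; column 0 is the point itself
    # as long as the data contains no duplicate rows
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = rng.choice(neighbours[base])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Toy minority-class points (made up)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5], [2.5, 0.5]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the range already covered by that class.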


### ⚖️ Feature Scaling – Diabetes Dataset

Feature scaling is an essential step before applying many machine learning algorithms, especially those that are **distance-based or gradient-based**, such as:

- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)

Tree-based models such as Decision Tree, Random Forest, and Gradient Boosting split on thresholds and are largely insensitive to feature scale.

In the diabetes dataset, features like `Glucose`, `Insulin`, `Age`, and `BMI` have different scales. Without scaling, models may give more weight to features with higher magnitudes.

We used **StandardScaler**, which transforms the data to have:
- Mean = 0
- Standard deviation = 1

This ensures all features contribute equally to the model training.
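
A minimal sketch of the usual scaling pattern: fit the scaler on the training features only, then reuse the fitted scaler for the test features. The arrays here are synthetic, and the names `X_tr`, `X_te`, and `demo_scaler` are placeholders chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on different scales
X_tr = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(100, 2))
X_te = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(25, 2))

# Learn mean and standard deviation from the training split only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuse training statistics

print("Train means:", X_tr_scaled.mean(axis=0).round(6))
print("Train stds: ", X_tr_scaled.std(axis=0).round(6))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

The test split will generally not have mean exactly 0 and standard deviation exactly 1, which is expected: it is transformed with the training statistics so that no information from the test data leaks into preprocessing.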


In [None]:
# Scaling using StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled )



In [None]:
# The target is a binary class label, so it is not scaled.
# Instead, apply the scaler fitted on the training features to the test features.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)

In [None]:
X_train_scaled#scaled

In [None]:
X_train_scaled.shape

In [None]:
X_train.shape

In [None]:
X_test

### 🤖 Model Training

In [None]:
# ✅ Build ML Models for Diabetes Dataset
# Dictionary of models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Support Vector Machine": SVC(probability=True, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

results = []

# Train & Evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results.append([name, acc, prec, rec, f1])
    print(f"🔹 {name}")
    print(classification_report(y_test, y_pred))
    print("-"*50)

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1 Score"])
results_df


## 🏗️ Building Machine Learning Models

To predict the likelihood of diabetes, we implemented and evaluated **five classification models**:  
- Logistic Regression  
- Support Vector Machine (SVM)  
- Decision Tree  
- Random Forest  
- Gradient Boosting  

Each model was trained on the processed dataset, and performance was compared using **Accuracy, Precision, Recall, and F1 Score**.  
This ensures a fair evaluation and helps identify the most effective algorithm for early diabetes prediction.


## 🏆 Best Performing Model

After experimenting with five machine learning models:

- Logistic Regression  
- Support Vector Machine (SVM)  
- Decision Tree  
- Random Forest  
- Gradient Boosting  

We evaluated them based on **Accuracy, Precision, Recall, and F1-Score**.  
Among these, **Gradient Boosting** outperformed the others by providing the best balance between accuracy and recall, which is very important in medical diagnosis (to correctly identify patients at risk of Diabetes).  

📌 Therefore, we selected **Gradient Boosting Classifier** as our final model for this project.  


🧭 Overfitting Check

In [None]:

# Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)

# Predictions
y_train_pred = log_reg.predict(X_train)
y_test_pred = log_reg.predict(X_test)

# Accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print("Training Accuracy:", train_acc)
print("Testing Accuracy:", test_acc)

# Check overfitting
if train_acc - test_acc > 0.1:
    print("⚠️ Model may be Overfitting!")
else:
    print("✅ Model is Generalizing Well")


✅ Overfitting Check on Diabetes Dataset (Logistic Regression)

Training Accuracy: 0.763

Testing Accuracy: 0.754

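The gap between training and testing accuracy is very small (≈ 0.009).
This indicates that the Logistic Regression model on the Diabetes dataset is generalizing well.

✔️ The small gap shows no sign of overfitting. Note that this check uses Logistic Regression rather than the selected Gradient Boosting model, and that the training accuracy is measured on the SMOTE-resampled training set.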

In [None]:
# 🔧 Hyperparameter Tuning with GridSearchCV
# Define model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Apply GridSearchCV
grid_search = GridSearchCV(estimator=rf,
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=1)

# Fit on training data
grid_search.fit(X_train, y_train)

# Best parameters and score
print("✅ Best Parameters:", grid_search.best_params_)
print("🎯 Best Cross-Validation Accuracy:", grid_search.best_score_)

# Evaluate on test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


### 🔍 Hyperparameter Tuning with GridSearchCV  

To improve Random Forest performance, we applied **GridSearchCV**, which tests multiple hyperparameter combinations and selects the best one.  

- **n_estimators:** Number of trees in the forest.  
- **max_depth:** Maximum depth of each tree (None = unlimited).  
- **min_samples_split:** Minimum samples required to split a node.  
- **min_samples_leaf:** Minimum samples required at a leaf node.  

✅ **GridSearchCV** scores every combination with 5-fold cross-validation, so the chosen parameters are not tuned to a single train/validation split.  

With the default `refit=True`, GridSearchCV retrains the best configuration on the full training set; we then evaluate that model (`best_estimator_`) on the test set.  

The test-set report is the check on how well the tuned Random Forest generalizes to unseen data.  
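
To get a feel for the cost of the search, the number of candidate combinations can be counted before fitting. The sketch below uses scikit-learn's `ParameterGrid` on the same grid as the cell above; the names `rf_param_grid`, `n_candidates`, and `total_cv_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(rf_param_grid))  # 3 * 4 * 3 * 3
total_cv_fits = n_candidates * 5                  # one fit per candidate per fold (cv=5)

print("Candidate combinations:", n_candidates)
print("Cross-validation fits:", total_cv_fits)
```

With `refit=True` there is one additional fit of the best candidate on the whole training set.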


### ⚙️ Pipeline 

In [None]:
# 📦 Pipeline with SMOTE + Scaling + Random Forest + GridSearchCV

# Define pipeline
pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),     # Feature scaling
    ('smote', SMOTE(random_state=42)),# Handle class imbalance
    ('rf', RandomForestClassifier(random_state=42))  # Model
])

# Define parameter grid (only for the RandomForest part)
param_grid = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 10, 20],
    'rf__min_samples_split': [2, 5],
    'rf__min_samples_leaf': [1, 2]
}

# GridSearchCV
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=2)

# Fit on training data
grid_search.fit(X_train, y_train)

# Best Parameters
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# Evaluate on Test Data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


### ⚙️ Pipeline with SMOTE, Scaling, and Random Forest

Instead of applying preprocessing steps manually, we use a **Pipeline**.  
This ensures that all transformations happen consistently during **training and testing**.  

Steps in the pipeline:  
1. **StandardScaler** → Scales the features so models work efficiently.  
2. **SMOTE** → Balances the dataset by generating synthetic minority class samples.  
3. **Random Forest Classifier** → Trains the model using the chosen hyperparameters.  

We apply **GridSearchCV** on this pipeline to optimize Random Forest hyperparameters.  
This way, scaling, balancing, and training are handled automatically during cross-validation.  

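✅ Because scaling and SMOTE are fitted inside each training fold only, the validation folds stay untouched, which avoids data leakage and makes the workflow easier to reproduce.  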


In [None]:
from sklearn.metrics import (
    roc_auc_score, roc_curve, auc,
    precision_recall_curve, ConfusionMatrixDisplay,
    classification_report
)
import matplotlib.pyplot as plt

# Predictions + probabilities
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("📄 Classification Report:\n", classification_report(y_test, y_pred))
print("🧮 ROC-AUC:", roc_auc_score(y_test, y_proba))

# Confusion Matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Confusion Matrix")
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC Curve"); plt.legend()
plt.show()

# Precision–Recall Curve
prec, rec, _ = precision_recall_curve(y_test, y_proba)
plt.figure()
plt.plot(rec, prec)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.show()


# 📊 Model Evaluation Report

After training and tuning the Random Forest Classifier, we evaluated the model using multiple performance metrics and visualizations. Below is the detailed interpretation.

---

## 🔹 1. Classification Report

The classification report provides **precision, recall, F1-score, and support** for each class.

- **Precision** → Out of all predicted positives, how many were correct.  
- **Recall (Sensitivity)** → Out of all actual positives, how many were detected.  
- **F1-Score** → Harmonic mean of precision & recall (balance between them).  
- **Support** → Number of actual occurrences of the class in test data.

### 📑 Results:
- **Class 0 (No Disease)**  
  - Precision = **0.86**  
  - Recall = **0.83**  
  - F1-Score = **0.85**  
  → Model is very good at predicting no disease.

- **Class 1 (Disease Present)**  
  - Precision = **0.66**  
  - Recall = **0.71**  
  - F1-Score = **0.68**  
  → Model is weaker here, but still catches most cases.

- **Overall Accuracy = 79%**  
- **Macro F1 = 0.76** (balanced performance across classes).  
- **Weighted F1 = 0.79** (accounts for class imbalance).

✅ Interpretation: The model favors class 0 slightly, but still detects most patients with the disease.

---

## 🔹 2. Confusion Matrix

The confusion matrix gives exact counts of predictions:

|                  | Predicted 0 | Predicted 1 |
|------------------|-------------|-------------|
| **Actual 0**     | 74 (TN)     | 15 (FP)     |
| **Actual 1**     | 12 (FN)     | 29 (TP)     |

- **True Negatives (TN = 74)** → Correctly predicted “No Disease”.  
- **False Positives (FP = 15)** → Predicted disease but actually healthy.  
- **False Negatives (FN = 12)** → Missed real patients ❌ (critical in healthcare).  
- **True Positives (TP = 29)** → Correctly identified disease cases.  

⚠️ Concern: **12 false negatives** → patients wrongly predicted as healthy. In medical tasks, minimizing FN is more important than maximizing accuracy.
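
The class 1 metrics in the classification report follow directly from these four counts. The arithmetic can be sketched in plain Python (variable names are illustrative):

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 74, 15, 12, 29

precision = tp / (tp + fp)                      # correct among predicted positives
recall = tp / (tp + fn)                         # detected among actual positives
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
miss_rate = fn / (fn + tp)                      # share of real patients missed

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
print(f"Accuracy:  {accuracy:.2f}")
print(f"Miss rate: {miss_rate:.2f}")
```

The miss rate is simply `1 - recall`, which is why recall is the metric to watch when false negatives are the main concern.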

---

## 🔹 3. ROC Curve & AUC Score

- **ROC (Receiver Operating Characteristic) Curve** → Plots True Positive Rate (Recall) vs. False Positive Rate.  
- **AUC (Area Under Curve)** = **0.827**  
  - 1.0 = Perfect Model  
  - 0.5 = Random Guessing  
  - 0.827 = Good discriminative ability

✅ Interpretation: The model distinguishes well between classes.
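
One way to read an AUC value: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes this by hand for four hypothetical scores (not taken from the notebook's data).

```python
from itertools import product

# Hypothetical predicted probabilities for two healthy and two diseased patients
neg_scores = [0.10, 0.40]
pos_scores = [0.35, 0.80]

# Compare every positive with every negative; ties count as half a win
pairs = list(product(pos_scores, neg_scores))
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
auc_value = wins / len(pairs)

print("Pairs compared:", len(pairs))
print("AUC:", auc_value)
```

Here three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.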

---

## 🔹 4. Precision-Recall Curve

- **Precision-Recall Curve** is useful for **imbalanced datasets**.  
- Our curve starts with **high precision (1.0)** but decreases as recall increases.  
- When recall is high (catching more patients), precision drops (more false alarms).  

✅ Interpretation: The model performs well at moderate thresholds, but struggles when forced to detect every patient.

---

## 📌 Overall Conclusion

- The model performs well (Accuracy = 79%, AUC = 0.83).  
- **Problem:** False Negatives (12) → real patients missed.  
- **Solution:**  
  1. Add **class weights** (e.g., `class_weight='balanced'`) on top of the **SMOTE oversampling** already used in the pipeline.  
  2. Adjust probability **threshold** (e.g., from 0.5 → 0.4) to reduce false negatives.  
  3. Try other classifiers (XGBoost, Logistic Regression) for comparison.  
  4. Apply **cross-validation** for stability.  

In healthcare, **Recall is more important** than overall accuracy, because missing a patient can be dangerous.
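
Solution 2 above (lowering the decision threshold) can be sketched as follows. This is a minimal illustration on synthetic data with a Logistic Regression stand-in, not the tuned pipeline from this notebook; `recall_at` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (assumption: about 35% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

def recall_at(threshold):
    """Recall when every probability >= threshold is labelled positive."""
    preds = (proba >= threshold).astype(int)
    return recall_score(y_te, preds)

recall_default = recall_at(0.5)
recall_lowered = recall_at(0.4)

print(f"Recall at 0.5: {recall_default:.3f}")
print(f"Recall at 0.4: {recall_lowered:.3f}")
```

Lowering the threshold can only keep recall the same or raise it, at the price of more false positives, so the choice should be checked against precision as well.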

---


💾 Save the Trained Model (using joblib)

In [None]:

import joblib

# Save the best model (the tuned scaling + SMOTE + Random Forest pipeline)
joblib.dump(best_model, 'disease_prediction_model.pkl')
print("✅ Model saved successfully!")

# Load the model later
loaded_model = joblib.load("disease_prediction_model.pkl")



In [None]:
import os
print(os.getcwd())  # shows current working directory


🧪 Load and Use the Saved Model Later

In [None]:


import numpy as np

# Example unseen data (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age)
unseen_patient = np.array([[2, 120, 70, 25, 80, 28.5, 0.45, 32]])

# Predict directly: the saved pipeline applies the scaler before the model
prediction = loaded_model.predict(unseen_patient)
proba = loaded_model.predict_proba(unseen_patient)

print("Predicted Class:", "Diabetes" if prediction[0] == 1 else "No Diabetes")
print("Prediction Probability:", proba)


In [None]:
import numpy as np
import pandas as pd

# 🧪 Example unseen patients' data
# Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age]
unseen_patients = np.array([
    [2, 120, 70, 25, 80, 28.5, 0.45, 32],   # Patient 1
    [5, 155, 80, 32, 120, 34.0, 0.65, 45],  # Patient 2
    [0, 95, 65, 20, 70, 22.0, 0.30, 25],    # Patient 3
    [3, 180, 90, 35, 150, 37.5, 0.85, 50],  # Patient 4
    [1, 130, 72, 28, 85, 30.2, 0.55, 29]    # Patient 5
])

# ✅ Convert to DataFrame (recommended for readability)
columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"
]
unseen_df = pd.DataFrame(unseen_patients, columns=columns)

# 🎯 Predictions using loaded model pipeline
predictions = loaded_model.predict(unseen_df)
probabilities = loaded_model.predict_proba(unseen_df)

# 📊 Combine results
results = unseen_df.copy()
results["Prediction"] = ["Diabetes" if p == 1 else "No Diabetes" for p in predictions]
results["Prob_No_Diabetes"] = probabilities[:, 0]
results["Prob_Diabetes"] = probabilities[:, 1]

# 🖨️ Show results
print(results)


### 🧪  Testing with Unseen Data

Now that the model is saved, we try it on **new patient records** that were not part of training or testing.  

- Input: A new patient's medical data.  
- Output: Predicted **Diabetes / No Diabetes** and the **probability score**.  

This step checks that the saved pipeline loads and runs end to end on new inputs. Because these example records have no known outcome, they illustrate usage; the held-out test set remains the evidence of generalization.  


🔜 Next Step: Model Evaluation Summary

## 📝 Diabetes Dataset – Summary
# 🏁 Conclusion

1. **Model Training & Performance:**
   - We built a **Random Forest Classifier** pipeline with:
     - Standard Scaling
     - SMOTE (for handling class imbalance)
     - GridSearchCV (for hyperparameter tuning)
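   - On the test set, the tuned pipeline achieved:
     - **Accuracy:** ~0.79
     - **ROC-AUC:** ~0.83
   - In the separate Logistic Regression overfitting check, training (~0.76) and testing (~0.75) accuracy were nearly equal.
   - ✅ No major overfitting → Model generalizes well.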

2. **Hyperparameter Tuning:**
   - Best parameters were selected automatically using **GridSearchCV**.
   - This improved the model’s balance between bias and variance.

3. **Evaluation Results:**
   - **Confusion Matrix** showed most predictions are correct, with relatively few misclassifications.
   - **ROC Curve & AUC Score** indicated the model performs significantly better than random guessing.
   - **Classification Report** (Precision, Recall, F1-score) confirmed reliability across classes.

4. **Insights from Data:**
   - Features like **Glucose, BMI, Age, and Insulin** had strong influence on predictions (from feature importance analysis).
   - These factors are critical indicators for disease risk.

5. **Model Deployment:**
   - Model saved as `disease_prediction_model.pkl` using **joblib**.
   - Can be reused for predictions on new/unseen patient data.

---

### 🔮 Final Thoughts:
- The pipeline we built is **robust, interpretable, and reusable**.  
- It can be integrated into healthcare systems for **early disease prediction**.  
- With more data and medical expert feedback, the model can be further improved.

✅ Overall, this project successfully demonstrates how **Machine Learning can be applied to health data** for disease prediction.


- The dataset was preprocessed by removing outliers, fixing skewness, and scaling features.
- Exploratory Data Analysis (EDA) included histograms, boxplots, pairplots, and heatmaps.
- The target variable was `Outcome`, indicating whether a person is diabetic (1) or not (0).
- Feature selection was done using SelectKBest with `f_classif`.
- Several classifiers were compared, and Logistic Regression was used for the train/test overfitting check.
- Hyperparameter tuning was performed using GridSearchCV on a Random Forest pipeline (scaling + SMOTE).
- The tuned pipeline was saved using `joblib` and run on new unseen data.
- The prediction outcome was reported as "Diabetes" or "No Diabetes" based on input values.

✅ This completes the Diabetes prediction pipeline.


## 🔷 2. Heart Disease Dataset

✅ Data Preprocessing

**Handle Missing Data**

In [None]:
# Check missing values
missing_values = heart.isnull().sum()
print("Missing values in each column:")
print(missing_values)

In [None]:
# 🔚 Final check
print(heart.isnull().sum())  # all counts should now be 0

✅ Check & Remove Duplicate Rows in Heart Dataset

In [None]:
# Check for duplicates
print("Duplicate Rows:", heart.duplicated().sum())

In [None]:
heart.drop_duplicates(inplace=True)

In [None]:
# Verify
print("Remaining Duplicates:", heart.duplicated().sum())

✅ Step: Check Skewness in the Heart Disease Dataset

In [None]:
columns = heart.columns

for i in columns:
    print(f"Skewness of {i}: {heart[i].skew()}")


In [None]:
print(heart.skew().sort_values(ascending=True))

In [None]:
# Plot boxplots of numerical columns before outlier removal
plt.figure(figsize=(14, 8))
sns.boxplot(data=heart.select_dtypes(include='number'), palette='Set2')
plt.title("📦 Boxplot of Numerical Features (Before Outlier Removal)")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


### 📦 Outlier Removal Based on Skewness

In [None]:
# 📌 IQR method for removing outliers from selected columns
def remove_outliers_iqr(df, columns):
    df_clean = df.copy()
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]
    return df_clean

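# Apply to the heart dataset
# Caution: 'fbs' and 'exang' are binary (0/1) flags. When a binary column is
# mostly one value, its IQR is 0 and this filter drops every row with the rarer
# value, so compare value_counts() before and after.
columns_to_clean = ['fbs', 'ca', 'oldpeak', 'chol', 'trestbps','exang']
heart_cleaned = remove_outliers_iqr(heart, columns_to_clean)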


### 📦 Outlier Removal Based on Skewness

We analyzed the skewness of each feature in the heart disease dataset and found several variables with high skewness values. To improve data quality and reduce model distortion, we removed outliers using the IQR method.

#### 🔍 Features with High Skewness:
- `fbs` (1.98)
- `ca` (1.29)
- `oldpeak` (1.26)
- `chol` (1.14)
- `trestbps` (0.71)
- `exang` (0.73)

These features were cleaned by removing rows whose values lay more than 1.5 times the interquartile range (IQR) below Q1 or above Q3. This can reduce skewness and help model performance, although for binary flags such as `fbs` and `exang` the IQR rule is less meaningful than for continuous measurements.
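
The 1.5 × IQR rule behaves differently on a binary flag than on a continuous measurement. The sketch below applies the same bounds to a hypothetical 0/1 column (85 zeros and 15 ones, not the real `fbs` values) to show what can happen when the IQR is 0.

```python
import pandas as pd

# Hypothetical binary flag: 85 zeros and 15 ones (not the real 'fbs' column)
flag = pd.Series([0] * 85 + [1] * 15, name="flag")

q1, q3 = flag.quantile(0.25), flag.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Same filter as remove_outliers_iqr uses for a single column
kept = flag[(flag >= lower) & (flag <= upper)]

print("IQR:", iqr)
print("Before:", flag.value_counts().to_dict())
print("After:", kept.value_counts().to_dict())
```

Both quartiles are 0 here, so the bounds collapse to a single value and every row with `flag == 1` is removed. Checking `value_counts()` before and after filtering is a quick way to spot this on real data.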


In [None]:
# Optional: visualize boxplots after removing outliers

plt.figure(figsize=(12, 8))
sns.boxplot(data=heart_cleaned)
plt.xticks(rotation=90)
plt.title('Boxplot of Features (After Outlier Removal)')
plt.show()

### 📦 Boxplot Comparison: Before vs After Outlier Removal

To improve the quality of our heart disease dataset, we applied the **IQR (Interquartile Range)** method to remove outliers from highly skewed features: `fbs`, `ca`, `oldpeak`, `chol`, `trestbps`, and `exang`.

The boxplots shown above provide a visual comparison:

- **Before Outlier Removal:** Several columns, especially `chol`, `ca`, and `oldpeak`, displayed extreme values (outliers) that stretched the boxplot and potentially distorted the data distribution.
- **After Outlier Removal:** The distributions have become more compact and balanced. Most extreme points have been eliminated, leading to a cleaner and more reliable dataset for training machine learning models.

Outlier removal helps reduce noise, stabilize model predictions, and improve overall performance.


### 🔍 Skewness Check After Outlier Removal

In [None]:
print("📊 Skewness After Outlier Removal:")
heart_cleaned.skew().sort_values(ascending=True)


After applying the IQR method, `fbs`, `ca`, `oldpeak`, `chol`, and `trestbps` still show outliers.


Handle Skewness

In [None]:
import numpy as np

# Apply a log transformation (log1p) to right-skewed features (skewness > 1).
# Features with skewness between 0.5 and 1 are left as they are in this cell;
# a square-root transformation is one option for them.
new_data2 = heart_cleaned.copy()  # copy kept from before the skewness corrections
for col in heart_cleaned.columns:
    if heart_cleaned[col].skew() > 1:
        heart_cleaned[col] = np.log1p(heart_cleaned[col])

print("\nSkewness after log transformation:")
print(heart_cleaned.skew().sort_values(ascending=True))

In [None]:
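# For left-skewed features (skewness < -0.5), use a power transformation such as
# Yeo-Johnson (Box-Cox requires strictly positive values).
from sklearn.preprocessing import PowerTransformer

# Initialize PowerTransformer with the Yeo-Johnson method
pt = PowerTransformer(method='yeo-johnson')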

for col in heart_cleaned.columns:
    if heart_cleaned[col].skew() < -0.5:
        # Reshape to 2D, apply transformation, and flatten back to 1D
        heart_cleaned[col] = pt.fit_transform(heart_cleaned[[col]]).flatten()

# Print skewness in ascending order
print(heart_cleaned.skew().sort_values(ascending=True))

### 🔃 Handling Skewness in the Heart Dataset

After outlier removal, we identified some features with remaining high skewness, such as `chol`, `oldpeak`, `trestbps`, `ca`, `exang`, and `fbs`.

To normalize these distributions, we applied the **log transformation (log1p)** to features with skewness above 1, and the **Yeo-Johnson power transformation** to features with skewness below -0.5. The log transformation compresses large values and spreads out small ones, reducing the right skew.

Handling skewness can improve model performance by bringing features closer to a normal distribution, which benefits algorithms like Logistic Regression, SVM, and Linear models.
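
The effect of `log1p` on a right-skewed column can be checked on synthetic data. The sketch below draws log-normal values as a stand-in for a skewed measurement (an assumption, not the real heart data) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values (stand-in for a skewed column such as 'chol')
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

Printing the skewness before and after each transformation, as the cells above do, confirms whether the transformation actually helped for a given column.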


In [None]:
# Separate categorical and numerical columns
categorical_cols = heart_cleaned.select_dtypes(include='int64').columns.tolist()
numerical_cols = heart_cleaned.select_dtypes(include='float64').columns.tolist()

# Manually adjust if needed
print("Categorical Columns:", categorical_cols)
print("Numerical Columns:", numerical_cols)


## 🧪 Exploratory Data Analysis (EDA)

🎨 1. Histogram of age

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(data=heart_cleaned, x='age', bins=20, kde=True, color='skyblue')
plt.title('Age Distribution of Patients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

### 🧓 Age Distribution
Shows most patients are between **50–60 years**. This is the most represented age range in the dataset; the histogram alone shows counts, not risk.


❤️ 2. Countplot of target

In [None]:
sns.countplot(x='target', data=heart_cleaned, palette='Set2')
plt.title('Heart Disease Presence')
plt.xlabel('Target (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()


### ❤️ Heart Disease Count
This shows how many patients were diagnosed with heart disease (1) and how many were not (0).


🩺 3. Boxplot: chol vs target

In [None]:
sns.boxplot(x='target', y='chol', data=heart_cleaned, palette='pastel')
plt.title('Cholesterol vs Heart Disease')
plt.xlabel('Heart Disease')
plt.ylabel('Cholesterol Level')
plt.show()


### 🩸 Cholesterol Levels by Heart Disease
Patients with heart disease tend to have **higher cholesterol**. Some outliers were handled earlier.


🔥 4. Heatmap of Correlation

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(heart_cleaned.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


### 🔥 Correlation Heatmap
Highlights how features like `cp`, `thalach`, `oldpeak`, and `exang` relate to heart disease.


🔁 5. Pairplot of Key Features

In [None]:
top_features = ['cp', 'thalach', 'oldpeak', 'age', 'target']
sns.pairplot(heart_cleaned[top_features], hue='target', palette='Set1')
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()

### 🔁 Pairplot of Key Features
Shows visual separation between heart disease and non-disease cases using combinations of top predictors.


📊 6. Pie Chart: Heart Disease vs No Disease

In [None]:
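# Count the target values in label order (0 = No disease, 1 = Disease).
# value_counts() sorts by frequency, so sort_index() keeps sizes aligned with labels.
labels = ['No Disease', 'Heart Disease']
sizes = heart_cleaned['target'].value_counts().sort_index().values
colors = ['#66b3ff', '#ff9999']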

# Plot the pie chart
plt.figure(figsize=(6,6))
plt.pie(sizes, labels=labels, colors=colors, startangle=90, 
        autopct='%1.1f%%', shadow=True, explode=(0, 0.1))
plt.title('Heart Disease Distribution')
plt.axis('equal')  # Ensures pie is circular
plt.show()


### 🥧 Pie Chart: Heart Disease Distribution

This pie chart shows the proportion of patients with and without heart disease:

- **Red** (Heart Disease): Represents the percentage of patients diagnosed with heart conditions.
- **Blue** (No Disease): Represents the percentage of healthy patients.

The distribution appears fairly balanced, making it suitable for binary classification models without applying resampling techniques.



### 🧠 Data Encoding

✅ One-Hot Encoding for Heart Dataset

In [None]:
# Apply One-Hot Encoding to the multi-category columns
heart_encoded = pd.get_dummies(heart_cleaned, columns=['cp', 'restecg', 'slope', 'thal'], drop_first=True)

# Check the new columns
print("One-Hot Encoded Data:")
print(heart_encoded.head())


## 🔡 Encoding Categorical Variables (One-Hot Encoding)

In our heart disease dataset, some features such as `cp`, `restecg`, `slope`, and `thal` are **categorical with multiple values**.

### Why One-Hot Encoding?
- These features are **nominal** (no natural order between values).
- Using **Label Encoding** would assign numbers (0, 1, 2...) which adds a **false sense of order**.
- **One-Hot Encoding** creates separate binary columns for each category, avoiding misinterpretation by machine learning algorithms.

### Columns Encoded:
- `cp`: Chest pain type
- `restecg`: Resting ECG results
- `slope`: Slope of the ST segment
- `thal`: Thalassemia type

We used `pd.get_dummies()` with `drop_first=True` to avoid multicollinearity.

This ensures the data is now completely numeric and ready for model training.
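
A toy example makes the effect of `drop_first=True` visible. The frame below is hypothetical (four rows with an integer-coded `cp` column), not the real dataset.

```python
import pandas as pd

# Toy frame with a hypothetical 4-level chest pain code
toy = pd.DataFrame({"age": [52, 60, 45, 58], "cp": [0, 1, 2, 3]})

# drop_first=True removes the first category (cp = 0), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The row with `cp = 0` has all three dummy columns set to 0 (or False), so no information is lost by dropping the first category.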


📊 Feature Selection

### 🔍 Correlation Heatmap


In [None]:
# Calculate correlation matrix for the encoded dataset
corr_matrix = heart_encoded.corr()

# Plot heatmap
plt.figure(figsize=(14,10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('🔍 Correlation Heatmap of Heart Disease Features', fontsize=16)
plt.show()

### 🎯 Target Column: `target`

The `target` column is the label we want to predict using machine learning.

- `0` = Patient **does not** have heart disease
- `1` = Patient **has** heart disease

This is a **binary classification problem**, where the model learns to classify whether a person is likely to have heart disease based on their health features.


In [None]:
# correlation with target variables
print(heart_encoded.corr()['target'].sort_values(ascending=False))

### 📊 Feature Selection using SelectKBest

We used the **SelectKBest** method with `f_classif` (ANOVA F-value) to rank the features and pick the top 8 for predicting heart disease.

Reducing the feature set can lower model complexity and the risk of overfitting.

**Selected Features:**
- Based on both correlation and F-score, features such as `thal_2`, `thalach`, `slope_2`, `cp_2`, `oldpeak`, and `ca` were among the most important.

Note that the train/test split further below is built from all encoded features (`X`), so `X_selected` serves as a ranking for interpretation rather than as the model input.


In [None]:
# Define features and target
X = heart_encoded.drop('target', axis=1)
y = heart_encoded['target']

# Apply SelectKBest to select top 8 features
selector = SelectKBest(score_func=f_classif, k=8)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("✅ Selected Features:", selected_features.tolist())


### ✅ Feature Selection (Top 8)

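Used `SelectKBest` with ANOVA F-test (`f_classif`) to select the top 8 most relevant features for predicting heart disease. The selector is fitted on the full dataset here; to use the selected features for training without leakage, it should be fitted on the training split only.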
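
One way to keep feature selection free of leakage is to place `SelectKBest` inside a scikit-learn `Pipeline`, so that it is refitted on the training part of every cross-validation fold. This is a minimal sketch on synthetic data, with Logistic Regression as a stand-in classifier; it is not the notebook's actual workflow, and the names `select_then_fit` and `X_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with 20 features, 5 of them informative (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

select_then_fit = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=8)),  # refit per training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(select_then_fit, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))

# Fit once on all demo data to inspect how many features were kept
select_then_fit.fit(X_demo, y_demo)
n_selected = int(select_then_fit.named_steps["select"].get_support().sum())
print("Features kept:", n_selected)
```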


In [None]:
# Scores of the selected features only (selector.scores_ holds the scores of all features)
selected_scores = selector.scores_[selector.get_support()]
print("Scores of the selected features:", selected_scores)

In [None]:
# The F-scores can guide how many features to keep; k is a user choice
selector = SelectKBest(score_func=f_classif, k=2)  # Selecting the top 2 features as an example
X_selected = selector.fit_transform(X, y)

print("Selected Features:", X.columns[selector.get_support()])

In [None]:
# Features and target
X = heart_encoded.drop('target', axis=1)
y = heart_encoded['target']

# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=y)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


In [None]:
X_train

In [None]:
X_test

In [None]:

y_train

In [None]:
y_test

Apply SMOTE to Heart Disease Dataset

In [None]:
# Plot the training-set class distribution before SMOTE
plt.figure(figsize=(6,4))
sns.countplot(x=y_train)
plt.title("Class Distribution Before SMOTE")
plt.show()


In [None]:
# Applying SMOTE
print("Original Class Distribution:", y_train.value_counts())
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print("Resampled Class Distribution:", pd.Series(y_train).value_counts())

In [None]:
# ✅ After SMOTE
plt.figure(figsize=(6,4))
sns.countplot(x=y_train, palette="Set1")
plt.title("📈 Class Distribution After SMOTE")
plt.xlabel("target")
plt.ylabel("Count")
plt.show()

### SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is a popular technique used to **balance imbalanced datasets**.  
It works by **generating synthetic samples** for the minority class rather than simply duplicating existing ones.  

**Key Points:**
- Helps improve model performance on imbalanced classification tasks.
- Reduces bias toward the majority class.
- Creates new samples by interpolating between existing minority class examples.

**Usage:** Apply SMOTE **to the training split only**, before fitting the model. The test set must keep its original class distribution so that the evaluation stays realistic.
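
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: it uses only the single nearest neighbour (SMOTE picks among several), and the points and the `synthesize` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points with two features (not real patient data)
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]
    lam = rng.random()                     # interpolation weight in [0, 1)
    return base + lam * (neighbour - base), base, neighbour

new_point, base, neighbour = synthesize(minority, rng)
print("Base point:     ", base)
print("Neighbour:      ", neighbour)
print("Synthetic point:", new_point)
```

The synthetic point always lies on the line segment between two real minority samples, which is why SMOTE adds variety without simply copying rows.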


### ⚖️ Feature Scaling – Heart Dataset

In [None]:
# Initialize StandardScaler and fit it on the training features only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
print(X_train_scaled)

# The target (y_train) is a 0/1 class label and is not scaled.

In [None]:
X_train_scaled.shape

### 🤖 Model Training

In [None]:

from sklearn.metrics import precision_score, recall_score, f1_score

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Train and evaluate models, collecting one row of metrics per model
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"=== {name} ===")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("\n")
    results.append([
        name,
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
        f1_score(y_test, y_pred),
    ])

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1 Score"])
results_df



### Best Model for Heart Disease Prediction

After evaluating multiple classification models on the heart disease dataset with default settings, **Gradient Boosting** emerged as the best-performing model in this initial comparison.  

**Key Metrics:**
- **Accuracy:** 0.80  
- **Precision:** 0.65  
- **Recall:** 0.78  
- **F1 Score:** 0.71  

**Why Gradient Boosting is Best:**
- Achieves the **highest accuracy** among all tested models.  
- **High recall** ensures most patients with heart disease are correctly identified.  
- **Balanced F1 score** indicates good trade-off between precision and recall.  

> Gradient Boosting is therefore recommended for predicting heart disease in this dataset.

### 🏆 Model Selection

- After comparing metrics:
  - **Accuracy** shows overall performance.
  - **Recall** is very important in disease prediction (catching positive cases).
  - **F1-Score** balances precision & recall.

👉 The model with the **highest Recall & F1-score** is usually preferred for healthcare problems, since missing a patient with disease is riskier than false alarms.



In [None]:

plt.figure(figsize=(10,5))
sns.barplot(x="Model", y="Accuracy", data=results_df)
plt.title("Model Accuracy Comparison")
plt.xticks(rotation=45)
plt.show()


In [None]:

plt.figure(figsize=(10,6))

for name, model in models.items():
    if hasattr(model, "predict_proba"):  # Only models with predict_proba
        y_prob = model.predict_proba(X_test)[:,1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")

plt.plot([0,1],[0,1],"--",color="gray")
plt.title("ROC Curve Comparison")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


### 🔹 ROC Curve Comparison

The ROC (Receiver Operating Characteristic) curve is used to evaluate the performance of classification models.  
- The **X-axis** represents the False Positive Rate (FPR), and the **Y-axis** represents the True Positive Rate (TPR).  
- The **diagonal dashed line** represents a random classifier (AUC = 0.5).  

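From the graph (SVM is absent because `SVC()` without `probability=True` has no `predict_proba`, so the loop skips it):  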
- **Random Forest (AUC = 0.87)** performs the best among all models.  
- **Logistic Regression (AUC = 0.86)** also shows strong performance.  
- **Gradient Boosting (AUC = 0.82)** performs moderately well.  
- **Decision Tree (AUC = 0.65)** shows the weakest performance.  

✅ Higher AUC (Area Under the Curve) values indicate better model performance in distinguishing between positive and negative cases.


In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grids for each model
param_grids = {
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['liblinear', 'lbfgs']
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto']
    },
    'Decision Tree': {
        'max_depth': [None, 5, 10, 20],
        'min_samples_split': [2, 5, 10],
        'criterion': ['gini', 'entropy']
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 5, 10, 20],
        'min_samples_split': [2, 5, 10],
        'criterion': ['gini', 'entropy']
    },
    'Gradient Boosting': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 10]
    }
}

# Perform GridSearchCV for each model
best_models = {}
for name, model in models.items():
    print(f"🔍 Tuning {name}...")
    grid = GridSearchCV(model, param_grids[name], cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)
    
    print(f"Best Parameters for {name}: {grid.best_params_}")
    print(f"Best CV Accuracy: {grid.best_score_:.4f}\n")
    
    best_models[name] = grid.best_estimator_

# Evaluate tuned models
from sklearn.metrics import accuracy_score

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} Test Accuracy after tuning: {acc:.4f}")


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Define a simple pipeline for preprocessing and model training
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))



💾 Save the Best Heart Model


In [None]:
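import joblib

# Save the fitted pipeline (scaler + SMOTE + Logistic Regression) from the previous cell.
# `grid_search` belongs to the Diabetes section, so the heart pipeline is saved directly.
joblib.dump(pipeline, 'best_heart_model.pkl')
print("💾 Heart model saved successfully as 'best_heart_model.pkl'")

# Load the model later
loaded_pipeline = joblib.load('best_heart_model.pkl')
y_pred_loaded = loaded_pipeline.predict(X_test)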


Create Unseen Patient Data

In [None]:
import pandas as pd

# Example new patient data (same features as training set)
new_data = pd.DataFrame({
    'age': [52, 60],
    'sex': [1, 0],
    'cp': [3, 2],
    'trestbps': [130, 140],
    'chol': [250, 220],
    'fbs': [0, 1],
    'restecg': [1, 0],
    'thalach': [160, 150],
    'exang': [0, 1],
    'oldpeak': [1.2, 2.5],
    'slope': [2, 1],
    'ca': [0, 2],
    'thal': [2, 3]
})

# Use the pipeline reloaded above (best_model) for prediction
predictions = best_model.predict(new_data)
print("🔍 Predictions for New Data:", predictions)


# 🧪 Testing the Pipeline on New Data

We passed **new patient health records** to the saved pipeline.  
The pipeline handled both steps in a single call:
- Applying the preprocessing fitted during training (such as scaling)
- Feeding the transformed data into the tuned Logistic Regression model  

Each prediction indicates **whether the patient is at risk of heart disease (1) or not (0)**.  
This step shows that the saved pipeline runs end to end on unseen records; how well it generalizes is measured by the held-out test metrics, not by these two examples.
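
The save, reload, and predict cycle described above can be sketched in isolation. This is a minimal illustration on synthetic data, not the notebook's heart model: the column subset, `demo_pipeline`, and the temporary file name are assumptions made for the example. It also shows `predict_proba`, which returns a risk score instead of only a 0/1 label.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are illustrative only.
rng = np.random.default_rng(42)
columns = ["age", "trestbps", "chol", "thalach"]
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=columns)
y_demo = (X_demo["age"] + X_demo["chol"] > 0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# Round-trip the fitted pipeline through joblib.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

# New records must carry the same columns as the training frame.
new_rows = pd.DataFrame(rng.normal(size=(2, 4)), columns=columns)
labels = reloaded.predict(new_rows)
risk = reloaded.predict_proba(new_rows)[:, 1]  # estimated probability of class 1
print("Labels:", labels)
print("Risk scores:", risk.round(3))
```

Because the scaler is stored inside the pipeline, the caller passes raw feature values and never scales them by hand.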


# 🏁 Final Conclusion

1. **Models Trained:** Logistic Regression, SVM, Decision Tree, Random Forest, Gradient Boosting  
2. **Best Performing Model:** Logistic Regression (after Hyperparameter Tuning) with ~86.8% CV Accuracy  
3. **Data Balancing:** Applied SMOTE to handle imbalance in Heart Disease dataset, which improved recall (better at identifying patients with disease).  
4. **Pipeline:** Created and saved a reusable pipeline (`best_heart_model.pkl`) that handles preprocessing and prediction automatically.  
5. **New Data Testing:** The saved pipeline produced predictions for unseen patient records, showing that it runs end to end; generalization is assessed by the held-out test metrics above.  

---

### 🔑 Key Insights:
- Logistic Regression works very well for structured medical data with limited features.  
- Random Forest and Gradient Boosting also performed strongly but required more tuning.  
- Using SMOTE helped prevent bias toward the majority class.  
- The project demonstrates how **Machine Learning can assist doctors** by predicting health risks (Diabetes, Heart Disease, Parkinson’s).  

---

### ✅ Next Work:
- Deploy the model into a simple web or mobile app for real-time predictions.  
- Add more datasets or features (e.g., lifestyle, genetic data) to improve accuracy.  
- Apply Explainable AI (like SHAP or LIME) to understand feature importance in predictions.  
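
SHAP and LIME are separate packages that this notebook does not use. As a lighter stand-in for the feature-importance idea in the last bullet, the sketch below uses scikit-learn's `permutation_importance`, which is a different technique: it measures how much the test score drops when one feature is shuffled. The synthetic data and the names `clf` and `perm` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 6.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the accuracy drop.
perm = permutation_importance(
    clf, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)

ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")
```

Permutation importance works with any fitted estimator, including a full pipeline, which makes it a reasonable first check before adding a dedicated explainability library.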


## 🔷 2. Parkinson’s Disease Dataset



✅ Data Preprocessing

**Handle Missing Data**

In [None]:
# Check missing values
missing_values = parkinsons.isnull().sum()
print("Missing values in each column:")
print(missing_values)

In [None]:
# 🔚 Final Check
print(parkinsons.isnull().sum())  # Now all should be 0

✅ Check for Duplicate Rows in the Parkinson’s Dataset

In [None]:
# check for duplicates
print("Duplicate Rows:", parkinsons.duplicated().sum())

Step: Check Skewness in the Parkinson’s Disease Dataset


In [None]:
# Only process numeric columns
numeric_columns = parkinsons.select_dtypes(include=['float64', 'int64']).columns

for col in numeric_columns:
    print(f"Skewness of {col}: {parkinsons[col].skew():.4f}")


In [None]:
# Drop non-numeric column like 'name' if present
parkinsons_numeric = parkinsons.select_dtypes(include='number')

# 📦 Boxplot
plt.figure(figsize=(14, 8))
sns.boxplot(data=parkinsons_numeric, palette='Set2')
plt.title("📦 Boxplot of Numerical Features (Before Outlier Removal)")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


In [None]:
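# Calculate IQR bounds on the feature columns only.
# 'status' is a binary label: including it can flag an entire minority class as outliers.
feature_cols = parkinsons_numeric.columns.drop('status', errors='ignore')
Q1 = parkinsons_numeric[feature_cols].quantile(0.25)
Q3 = parkinsons_numeric[feature_cols].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows whose features all fall within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = (parkinsons_numeric[feature_cols] < (Q1 - 1.5 * IQR)) | (parkinsons_numeric[feature_cols] > (Q3 + 1.5 * IQR))
parkinsons_cleaned = parkinsons_numeric[~is_outlier.any(axis=1)].copy()  # .copy() avoids SettingWithCopyWarning later
print(f"✅ Rows before: {parkinsons_numeric.shape[0]}, after removing outliers: {parkinsons_cleaned.shape[0]}")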


In [None]:
# Boxplot after outlier removal
plt.figure(figsize=(14, 8))
sns.boxplot(data=parkinsons_cleaned, palette='Set3')
plt.title("📦 Boxplot of Numerical Features (After Outlier Removal)")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


### 🔍 Skewness Check After Outlier Removal

In [None]:
print("📊 Skewness After Outlier Removal:")
parkinsons_cleaned.skew().sort_values(ascending=True)

In [None]:
# Apply a log transformation to right-skewed features (skewness > 1).
# The target 'status' is skipped, and log1p is applied only where every value is > -1,
# because np.log1p is undefined at or below -1.
# (A square-root transform for skewness between 0.5 and 1 is not applied here.)
new_data2 = parkinsons_cleaned.copy()  # keep a copy from before the skewness corrections
for col in parkinsons_cleaned.columns.drop('status', errors='ignore'):
    if parkinsons_cleaned[col].skew() > 1 and (parkinsons_cleaned[col] > -1).all():
        parkinsons_cleaned[col] = np.log1p(parkinsons_cleaned[col])

print("\nSkewness after log transformation:")
print(parkinsons_cleaned.skew().sort_values(ascending=True))

In [None]:
# When skewness is below -0.5, use a power transformation such as Yeo-Johnson
# (Box-Cox requires strictly positive values).
from sklearn.preprocessing import PowerTransformer

# Initialize PowerTransformer with the Yeo-Johnson method
pt = PowerTransformer(method='yeo-johnson')

# Skip the target 'status' so the class labels are never transformed
for col in parkinsons_cleaned.columns.drop('status', errors='ignore'):
    if parkinsons_cleaned[col].skew() < -0.5:
        # Pass the column as a 2D frame, transform it, and flatten back to 1D
        parkinsons_cleaned[col] = pt.fit_transform(parkinsons_cleaned[[col]]).flatten()

# Print skewness in ascending order
print(parkinsons_cleaned.skew().sort_values(ascending=True))

### 📊 Interpretation of Boxplot After Outlier Removal

The boxplot above shows the distribution of the numerical features in the Parkinson’s dataset **after removing outliers** using the IQR (Interquartile Range) method.

#### ✅ What to observe:
- The whiskers of each box now reflect the spread of the bulk of the data rather than a few extreme values.
- Extreme values (previously shown as individual points) are mostly removed, reducing noise.
- The central boxes (representing the interquartile range) are tighter and more consistent across features.

#### 🧠 Why this is useful:
Outliers can distort machine learning models, especially those sensitive to feature scale and distance (like Logistic Regression, SVM, and KNN). Removing them can:
- Make training more stable.
- Improve model performance and generalization.
- Avoid misleading results due to extreme values.

The trade-off is fewer training rows, so the before/after row counts printed above are worth checking.

After exploring the cleaned data, we’ll proceed to **scaling** it before applying machine learning models.
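
The IQR rule described above can be sketched as a small helper. This is an illustration on a toy frame, not the notebook's exact cell: `iqr_filter`, the toy column names, and the multiplier `k` are assumptions. The example also shows why the binary target is best left out of the filter: when one class is a small minority, the rule can treat every row of that class as an outlier.

```python
import pandas as pd


def iqr_filter(df, feature_cols, k=1.5):
    """Keep rows whose values in feature_cols all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[feature_cols].quantile(0.25)
    q3 = df[feature_cols].quantile(0.75)
    iqr = q3 - q1
    outside = (df[feature_cols] < q1 - k * iqr) | (df[feature_cols] > q3 + k * iqr)
    return df.loc[~outside.any(axis=1)].copy()


toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],  # 100.0 is an outlier
    "feature_b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
    "status":    [1, 1, 1, 1, 1, 1, 0, 0],
})

# Filter on features only: the extreme row is dropped, both classes survive.
filtered = iqr_filter(toy, ["feature_a", "feature_b"])

# Filter with the target included: every status == 0 row is treated as an outlier.
filtered_with_target = iqr_filter(toy, ["feature_a", "feature_b", "status"])

print(len(toy), "->", len(filtered), "rows; classes kept:", sorted(filtered["status"].unique()))
print("With target in the filter, classes kept:", sorted(filtered_with_target["status"].unique()))
```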


## 🧪 Exploratory Data Analysis (EDA)
This section explores how each health-related feature in the dataset behaves individually, in pairs, and across multiple variables. These visuals help us understand patterns and relationships that are critical for building accurate AI disease prediction models.
 

📊 Count Plot

In [None]:

# Draw the count plot
plt.figure(figsize=(6, 4))
sns.countplot(data=parkinsons, x='status', palette='Set2')
plt.title('🧠 Parkinson\'s Disease Count by Status')
plt.xlabel('Status (0 = Healthy, 1 = Parkinson\'s)')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()


### 📊 Count Plot – Parkinson's Disease Status

This count plot displays the number of patients in the dataset who are:

- `0`: Healthy individuals (no Parkinson's)
- `1`: Patients diagnosed with Parkinson's disease

🧠 **Purpose**:
- To understand class distribution.
- It reveals if the dataset is **balanced** or **imbalanced**, which is crucial for model training.

📌 **Observation**:
- If one class dominates, model performance might be biased.
- A balanced dataset generally leads to better classification outcomes.


In [None]:

plt.figure(figsize=(16, 12))
corr_matrix = parkinsons_cleaned.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("🔗 Correlation Heatmap of Features")
plt.show()


### 🔗 Correlation Heatmap

This heatmap shows the correlation between all numerical features in the Parkinson’s dataset.

- Values range from **-1 to +1**
- **+1** indicates a perfect positive linear correlation
- **-1** indicates a perfect negative linear correlation
- **0** means no linear correlation

We use this to:
- Detect **multicollinearity** (features that are strongly correlated with each other)
- Identify features strongly related to the target (`status`)
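
Reading multicollinearity off a large annotated heatmap is error-prone, so the same check can be done programmatically. The sketch below lists feature pairs whose absolute correlation exceeds a threshold. The helper `high_corr_pairs`, the 0.9 threshold, and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for each pair above threshold, strongest first."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], value))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "jitter_a": base,
    "jitter_b": base * 3 + rng.normal(scale=0.01, size=300),  # nearly a rescaled copy
    "unrelated": rng.normal(size=300),
})

pairs = high_corr_pairs(toy)
for col_a, col_b, value in pairs:
    print(f"{col_a} ~ {col_b}: |r| = {value:.3f}")
```

Pairs flagged this way are candidates for dropping one member, or for a model that tolerates correlated inputs.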


In [None]:
# 🧮 2. Histogram / KDE Plot
# Define the features (exclude 'status' and non-numeric if any)
features = [col for col in parkinsons_cleaned.columns if col != 'status']

for feature in features:
    plt.figure(figsize=(6, 4))
    sns.histplot(data=parkinsons_cleaned, x=feature, hue='status', kde=True, palette='viridis', bins=30)
    plt.title(f"🧮 Distribution of {feature} by Status")
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()


### 🧮 Histogram & KDE Plots

These plots show the distribution of each feature, grouped by `status`.

- Helps us understand how values are spread between healthy and Parkinson’s patients
- KDE line shows a smooth approximation of the distribution curve
- Good for spotting skewness or feature separation between classes


In [None]:
# 🧑‍🤝‍🧑 3. Pair Plot
# Only select a few important features for readability
subset_features = ['status', 'MDVP:Fo(Hz)', 'MDVP:Jitter(%)', 'MDVP:Shimmer', 'spread1', 'D2']

sns.pairplot(parkinsons_cleaned[subset_features], hue='status', palette='coolwarm')
plt.suptitle("🧑‍🤝‍🧑 Pair Plot of Selected Features", y=1.02)
plt.show()

### 🧑‍🤝‍🧑 Pair Plot

The pair plot shows the relationships between selected features and how they cluster by `status`.

- Each cell represents a scatterplot between two features.
- Diagonal shows the KDE or histogram for each individual feature.
- Helps identify **separable groups** and **feature interactions** useful for classification.


In [None]:
# 🎻 Violin Plot
# A violin plot combines a boxplot with a KDE, showing the distribution and spread of a feature per class (status).
# Choose a few key features for visual clarity
violin_features = ['MDVP:Fo(Hz)', 'MDVP:Jitter(%)', 'MDVP:Shimmer', 'spread1']

for feature in violin_features:
    plt.figure(figsize=(6, 4))
    sns.violinplot(x='status', y=feature, data=parkinsons_cleaned, palette='Set2')
    plt.title(f"🎻 Violin Plot of {feature} by Status")
    plt.xlabel("Status (0 = Healthy, 1 = Parkinson's)")
    plt.ylabel(feature)
    plt.grid(axis='y', linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()

### 🎻 Violin Plot

The violin plot combines boxplot and distribution (KDE) to show how feature values are spread across each class.

- Wider sections show higher density of data points.
- Narrow sections mean fewer data points.
- Helps to spot if a feature has a **distinctive distribution** between Parkinson’s and healthy groups.


In [None]:
categorical_cols = parkinsons_cleaned.select_dtypes(include='object').columns
print("Categorical columns:", categorical_cols.tolist())


### 🧬 Feature Encoding (Not Required)

In this dataset, all features are already **numerical**, so no encoding is needed.

- Feature encoding is typically used to convert **categorical variables** (like gender, region, or type) into a numerical format.
- Since the Parkinson’s dataset contains **no object or categorical columns**, we can directly proceed to the next steps like **scaling**, **feature selection**, and **model training**.

✅ This saves preprocessing time and simplifies the pipeline.


In [None]:

# Calculate correlation matrix
corr_matrix = parkinsons_cleaned.corr()

# Set up the matplotlib figure
plt.figure(figsize=(16, 12))

# Draw the heatmap
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True, linewidths=0.5)
plt.title("🔗 Correlation Heatmap of Parkinson's Features", fontsize=16)
plt.tight_layout()
plt.show()


### 🎯 Target Variable – `status`

In the Parkinson’s dataset, the **target column** is `status`.

- It indicates whether a person has **Parkinson’s disease** or not.
- It is a **binary classification** variable:
  - `0` = Healthy (No Parkinson’s)
  - `1` = Parkinson’s Disease (Positive)

#### Why it's important:
- All other columns are considered **input features** used to predict this target.
- During model training, the algorithm learns to distinguish between `status = 0` and `status = 1` based on the values of the other features.

We use this target for classification tasks to build a model that can predict whether a new patient is likely to have Parkinson’s disease.


In [None]:
# Restore 'status' from the original frame by matching index labels.
# This must happen before any index reset: after a reset, rows removed as outliers
# would shift positions and labels would be assigned to the wrong rows.
parkinsons_cleaned['status'] = parkinsons.loc[parkinsons_cleaned.index, 'status']

# Reset the index only after the labels are aligned
parkinsons_cleaned = parkinsons_cleaned.reset_index(drop=True)

# Ensure numeric and drop any NaNs
parkinsons_cleaned['status'] = pd.to_numeric(parkinsons_cleaned['status'], errors='coerce')
parkinsons_cleaned = parkinsons_cleaned.dropna()

# Now check correlation
print(parkinsons_cleaned[['MDVP:Fo(Hz)', 'status']].corr())


In [None]:

# Check correlation again
print(parkinsons_cleaned.corr()['status'].sort_values(ascending=False))


In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features (X) and target (y)
X = parkinsons_cleaned.drop(columns='status')
y = parkinsons_cleaned['status']

# Select top k features (e.g., top 10)
k = 10
selector = SelectKBest(score_func=f_classif, k=k)
X_kbest = selector.fit_transform(X, y)

# Get feature names
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask]

print(f"✅ Top {k} Selected Features:")
print(selected_features.tolist())


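### 🔍 Feature Selection using SelectKBest (ANOVA F-test)

We applied `SelectKBest` with the ANOVA F-test (`f_classif`) to identify the **top 10 features** with the strongest univariate statistical relationship to the target variable `status`.

#### Why use it:
- Automatically ranks features by relevance to the target
- Can improve model accuracy and training efficiency
- Filters out noisy or less informative features

Because the selector here is fitted on the full dataset before the train/test split, the test rows influence which features are chosen; fitting it on the training data only gives a cleaner estimate of test performance.

These selected features will now be used to build a more focused and efficient machine learning model.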
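
One way to keep feature selection from seeing held-out rows is to place `SelectKBest` inside a scikit-learn `Pipeline`, so it is refitted on each training fold. The sketch below is a minimal example on synthetic data, not the notebook's cell; `selection_pipeline` and the synthetic dataset shape are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 22 features, a count chosen only for illustration.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=22, n_informative=6, random_state=42
)

selection_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),  # refit inside every fold
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(selection_pipeline, X_syn, y_syn, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(3))

# Fit once on all data to inspect which columns were kept.
selection_pipeline.fit(X_syn, y_syn)
n_kept = int(selection_pipeline.named_steps["select"].get_support().sum())
print("Features kept:", n_kept)
```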


In [None]:


# Use only the selected features
X_selected = X[selected_features]  # X is from the previous step
y = parkinsons_cleaned['status']

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42, stratify=y
)

# Check the shape of the splits
print("📊 Training Features Shape:", X_train.shape)
print("📊 Testing Features Shape:", X_test.shape)
print("🎯 Training Labels Shape:", y_train.shape)
print("🎯 Testing Labels Shape:", y_test.shape)


### ✂️ Split Data into Training and Testing Sets

We split the dataset into:
- **80% training data** to train the machine learning model
- **20% testing data** to evaluate the model’s performance on unseen data

We used `train_test_split()` from scikit-learn with:
- `random_state=42` to ensure reproducibility
- `stratify=y` to preserve the distribution of target classes (0 and 1) in both sets

This split allows us to build a model and assess its ability to generalize.
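
The effect of `stratify=y` can be checked directly by comparing class proportions across the two splits. The toy labels below (150 positives, 50 negatives) are an assumption for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 75% class 1, 25% class 0.
y_toy = np.array([1] * 150 + [0] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 75/25 ratio.
print("Train share of class 1:", y_tr.mean())
print("Test share of class 1: ", y_te.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer positives than the training set, which makes metrics such as recall less comparable.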


In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

### ⚖️ Class Balancing with SMOTE

In [None]:

# Count plot before SMOTE
plt.figure(figsize=(6,4))
sns.countplot(x=y)
plt.title("Before SMOTE - Class Distribution")
plt.show()
print("Before SMOTE:", Counter(y))

In [None]:
#ONLY IF THERE IS DATA IMBALANCE
print("Original Class Distribution:", y_train.value_counts())
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print("Resampled Class Distribution:", pd.Series(y_train).value_counts())

In [None]:
# Count plot after SMOTE (y_train now holds the resampled training labels)
plt.figure(figsize=(6,4))
sns.countplot(x=y_train)
plt.title("After SMOTE - Class Distribution")
plt.show()

print("After SMOTE:", Counter(y_train))

### ⚖️ Feature Scaling using StandardScaler


In [None]:


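# Scaling using StandardScaler
# ✅ Only scale feature data; fit on the training set, then reuse that fit for the test set
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)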



In [None]:
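# The target is a binary class label, so it is not scaled; StandardScaler is for features only.
# (Refitting `scaler` on the labels here would also overwrite the feature scaler fitted above.)
y_train_df = pd.DataFrame(y_train)  # converting to a DataFrame from a Series for inspection
print(y_train_df.value_counts())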

### ⚖️ Feature Scaling using StandardScaler

We scaled the selected features using **StandardScaler**, which transforms each feature to have:
- Mean = 0
- Standard Deviation = 1

This step is important for machine learning models that are sensitive to feature scale.

#### ✅ Notes:
- We fit the scaler only on the **training data** to avoid data leakage.
- The same transformation is applied to the test set using `.transform()`.

Now that our data is scaled, we're ready to train machine learning models!
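
The fit-on-train, transform-both pattern from the notes above can be verified numerically. The synthetic arrays and the name `scaler_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(160, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

# Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(X_tr)

X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training statistics

print("Train mean per feature:", X_tr_scaled.mean(axis=0).round(6))
print("Train std per feature: ", X_tr_scaled.std(axis=0).round(6))
print("Test mean per feature: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1. The test columns are only close to those values, which is expected because the test rows played no part in the fit.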


In [None]:
X_train_scaled.shape

### 🤖 Model Training

In [None]:

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "SVM (RBF)": SVC(probability=True, kernel="rbf", C=1.0, gamma="scale", random_state=42),
    "Decision Tree": DecisionTreeClassifier(max_depth=None, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=42),
    "Gaussian NB": GaussianNB(),
}

# ===== 3) Train with Pipeline: Scaling + SMOTE (train only) + Model =====
results = []
probas_for_roc = {}   # store probabilities for ROC curves
best_models = {}

for name, clf in models.items():
    pipe = ImbPipeline(steps=[
        ("scaler", StandardScaler()),
        ("smote", SMOTE(random_state=42)),
        ("model", clf)
    ])
    
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    
    # Try to get probabilities for ROC-AUC (some models don’t support predict_proba)
    try:
        y_proba = pipe.predict_proba(X_test)[:, 1]
    except Exception:
        try:
            # fallback to decision_function if available
            scores = pipe.decision_function(X_test)
            # convert to [0,1] via min-max for AUC comparability
            y_proba = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
        except Exception:
            y_proba = None
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc_val = roc_auc_score(y_test, y_proba) if y_proba is not None else np.nan
    
    results.append([name, acc, prec, rec, f1, auc_val])
    if y_proba is not None:
        probas_for_roc[name] = y_proba
    best_models[name] = pipe
    
    print(f"\n=== {name} ===")
    print("Accuracy :", round(acc, 4))
    print("Precision:", round(prec, 4))
    print("Recall   :", round(rec, 4))
    print("F1-score :", round(f1, 4))
    if y_proba is not None:
        print("ROC-AUC  :", round(auc_val, 4))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))

results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1", "ROC_AUC"])
results_df = results_df.sort_values(by=["F1","Accuracy"], ascending=False).reset_index(drop=True)
results_df


In [None]:
# Bar plot: Accuracy & F1
plt.figure(figsize=(9,4))
sns.barplot(x="Model", y="Accuracy", data=results_df)
plt.title("Accuracy by Model")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

plt.figure(figsize=(9,4))
sns.barplot(x="Model", y="F1", data=results_df)
plt.title("F1-score by Model")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# ROC Curves for top models with probabilities
plt.figure(figsize=(8,6))
for name, y_proba in probas_for_roc.items():
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")
plt.plot([0,1],[0,1],'k--', lw=1)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.tight_layout()
plt.show()

# Confusion matrix for the best model (top row in results_df)
best_name = results_df.iloc[0]["Model"]
best_pipe = best_models[best_name]
best_pred = best_pipe.predict(X_test)

cm = confusion_matrix(y_test, best_pred)
plt.figure(figsize=(4.8,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="viridis", cbar=True)
plt.title(f"Confusion Matrix – {best_name}")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()


### 📊 Model Performance Evaluation

The results of multiple classification models on the Parkinson’s dataset are shown in the plots above.

#### ✅ Accuracy & F1-score Comparison

- Most models, including Logistic Regression, Random Forest, Gaussian NB, and SVM (RBF), achieved accuracies close to 80%.
- The F1-score, which balances precision and recall, shows a similar trend, so the accuracy figures are not simply the result of favoring one class.
- Decision Tree performed relatively weaker compared to ensemble methods (Random Forest, Gradient Boosting, AdaBoost).

#### 📈 ROC Curve & AUC Score

The ROC curves compare the ability of models to distinguish between Parkinson’s and non-Parkinson’s cases. The AUC scores show that:

- Random Forest, SVM (RBF), and Logistic Regression performed well, with AUC ≈ 0.88–0.89.
- Gaussian NB scored slightly higher, with AUC = 0.90, making it one of the strongest classifiers.
- Decision Tree lagged with AUC = 0.68, confirming its weaker generalization ability.
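
AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by counting such pairs on four made-up scores and compares the result with scikit-learn's `roc_auc_score`; the labels and scores are illustrative, not model output from this notebook.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# Count positive/negative pairs where the positive is ranked higher; ties count as half.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

sk_auc = roc_auc_score(y_true, y_score)
print("Pairwise AUC:", pairwise_auc)
print("sklearn AUC: ", sk_auc)
```

Since AUC depends only on the ranking of the scores, rescaling them with a monotonic transform (such as the min-max step applied to `decision_function` outputs in the training loop) leaves the value unchanged.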

#### 🔍 Confusion Matrix (Example: Gaussian NB)

The confusion matrix for Gaussian Naive Bayes shows:

- True Positives (21) and True Negatives (16) account for 37 of the 46 test cases.
- The errors are False Negatives (6) and False Positives (3).

This gives a recall (sensitivity) of 21/27 ≈ 0.78 and a specificity of 16/19 ≈ 0.84, a reasonable balance between the two.
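
The four confusion-matrix counts quoted above are enough to derive the headline metrics by hand. The short calculation below uses only those reported counts; the variable names are chosen for the example.

```python
# Counts reported for Gaussian NB in the text above.
tp, tn, fp, fn = 21, 16, 3, 6

total = tp + tn + fp + fn
accuracy = (tp + tn) / total          # share of all cases classified correctly
recall = tp / (tp + fn)               # sensitivity: Parkinson's cases that were caught
specificity = tn / (tn + fp)          # healthy cases correctly cleared
precision = tp / (tp + fp)            # predicted positives that were truly positive
f1 = 2 * precision * recall / (precision + recall)

print(f"Total cases : {total}")
print(f"Accuracy    : {accuracy:.3f}")
print(f"Recall      : {recall:.3f}")
print(f"Specificity : {specificity:.3f}")
print(f"Precision   : {precision:.3f}")
print(f"F1-score    : {f1:.3f}")
```

In a screening context recall is usually the figure to watch, since each false negative is a patient whose condition goes unflagged.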

#### 🏆 Conclusion

- **Best Models:** Gaussian NB, Random Forest, SVM (RBF), and Logistic Regression. These models achieved the highest accuracy, F1-score, and AUC, making them strong candidates for early Parkinson’s disease prediction.
- Decision Tree is less reliable and should be avoided for final deployment.



In [None]:
# Logistic Regression
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# SVM
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Decision Tree
param_grid_dt = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# KNN
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}


rf = RandomForestClassifier(random_state=42)

grid_rf = GridSearchCV(estimator=rf, 
                       param_grid=param_grid_rf, 
                       cv=5, 
                       scoring='accuracy',
                       n_jobs=-1)

grid_rf.fit(X_train, y_train)

print("Best Parameters:", grid_rf.best_params_)
print("Best Accuracy:", grid_rf.best_score_)



In [None]:
best_rf = grid_rf.best_estimator_
y_pred = best_rf.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


In [None]:

# Define pipeline
rf_pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),       # scaling
    ('smote', SMOTE(random_state=42)),  # handle imbalance
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter grid
param_grid_rf = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 5, 10],
    'classifier__min_samples_split': [2, 5]
}

# GridSearchCV with pipeline
grid_rf = GridSearchCV(rf_pipeline, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)

print("Best RF Parameters:", grid_rf.best_params_)
print("Best RF CV Accuracy:", grid_rf.best_score_)


In [None]:
best_model = grid_rf.best_estimator_
y_pred = best_model.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


In [None]:
# Save the tuned pipeline (scaler + classifier) selected by GridSearchCV.
# The scaler lives inside the pipeline, so it does not need to be saved separately.
joblib.dump(best_model, 'parkinsons_model.pkl')

# Save selected features list
joblib.dump(X_train.columns.tolist(), 'parkinsons_model_columns.pkl')

print("✅ Model pipeline and columns saved successfully.")


In [None]:
import pandas as pd
import joblib

# Load Parkinson's model pipeline and feature list
model = joblib.load("parkinsons_model.pkl")
model_columns = joblib.load("parkinsons_model_columns.pkl")

# New sample must match Parkinson's dataset features
new_data = pd.DataFrame([{
    'MDVP:Fo(Hz)': 119.992,
    'MDVP:Fhi(Hz)': 157.302,
    'MDVP:Flo(Hz)': 74.997,
    'MDVP:Jitter(%)': 0.00784,
    'MDVP:Jitter(Abs)': 0.00007,
    'MDVP:RAP': 0.0037,
    'MDVP:PPQ': 0.00554,
    'Jitter:DDP': 0.01109,
    'MDVP:Shimmer': 0.04374,
    'MDVP:Shimmer(dB)': 0.426,
    'Shimmer:APQ3': 0.02182,
    'Shimmer:APQ5': 0.0313,
    'MDVP:APQ': 0.02971,
    'Shimmer:DDA': 0.06545,
    'NHR': 0.02211,
    'HNR': 21.033,
    'RPDE': 0.414783,
    'DFA': 0.815285,
    'spread1': -4.813031,
    'spread2': 0.266482,
    'D2': 2.301442,
    'PPE': 0.284654
}])

# Keep only the columns the model was trained on, in training order (missing ones are filled with 0)
new_data = new_data.reindex(columns=model_columns, fill_value=0)

# The pipeline scales internally, so the raw frame is passed straight to predict.
# Note: the skewness transforms applied during preprocessing are not part of this pipeline.
prediction = model.predict(new_data)

print("✅ Prediction:", "Parkinson’s Detected" if prediction[0] == 1 else "Healthy")


# 📝 Conclusion

In this project, we successfully developed an AI-powered disease prediction system using machine learning. By training and evaluating models on three different datasets—Diabetes, Heart Disease, and Parkinson’s Disease—we demonstrated how health data can be leveraged to assist in early disease detection.

Each disease model was trained separately with proper scaling, preprocessing, and feature selection.

The trained models were saved and integrated into a single system, where users can choose the disease type and provide their health parameters to get predictions.

This approach reduces the risk of feature mismatch errors by keeping models independent, while still offering a unified platform for multiple diseases.
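
The feature-mismatch point can be made concrete with a small input-building helper. This is a hypothetical sketch, not the project's integration code: the `FEATURES` registry, its abbreviated column lists, and `build_input` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical registry: one expected feature list per disease model (abbreviated).
FEATURES = {
    "heart": ["age", "sex", "cp", "trestbps", "chol"],
    "parkinsons": ["MDVP:Fo(Hz)", "MDVP:Jitter(%)", "spread1", "PPE"],
}


def build_input(disease, record):
    """Return a one-row frame with exactly the columns the chosen model expects."""
    expected = FEATURES[disease]
    missing = [col for col in expected if col not in record]
    if missing:
        raise ValueError(f"Missing features for '{disease}': {missing}")
    # Selecting by the expected list drops extras and fixes the column order.
    return pd.DataFrame([record])[expected]


frame = build_input(
    "heart",
    {"chol": 250, "age": 52, "sex": 1, "cp": 3, "trestbps": 130, "note": "ignored"},
)
print(frame)
```

This sketch raises an error on missing inputs instead of filling them with 0, which is a design choice: a silent default can produce a confident prediction from incomplete data.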

Although the models provide useful insights, they are not a substitute for professional medical advice. The system can be seen as a support tool for early screening, helping patients and doctors to make more informed decisions.

With further improvements such as larger datasets, broader hyperparameter tuning, and deployment in a user-friendly web app (e.g., Streamlit), this project can evolve into a practical healthcare application.