In [67]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [90]:
df=pd.read_csv('patient_healthcare_data.csv')
df.head()

,patient_id,age,gender,region,bmi,blood_pressure,cholesterol,glucose,disease_risk
0,P1000,51.0,Female,North,23.989727,118.315079,168.614627,87.744226,1
1,P1001,14.0,Male,South,20.008867,116.685456,193.192907,92.245969,0
2,P1002,71.0,Male,East,31.529645,129.212501,,105.717308,0
3,P1003,60.0,Male,West,19.279434,131.362616,235.11979,106.689136,0
4,P1004,20.0,Male,North,23.239822,112.042483,110.470906,113.170885,0


# Part A

**Checking missing values**

In [69]:
df.isnull().sum()

patient_id         0
age               16
gender            16
region            16
bmi               15
blood_pressure    16
cholesterol       15
glucose           16
disease_risk       0
dtype: int64

The seven feature columns each have 15 or 16 missing values.
The `patient_id` and `disease_risk` columns have no missing values.

**Summary report**

In [92]:
summary_report = pd.DataFrame({
    "Missing_Count": df.isnull().sum(),
    "Missing_Percentage (%)": df.isnull().mean() * 100
})

summary_report

,Missing_Count,Missing_Percentage (%)
patient_id,0,0.0
age,16,8.0
gender,16,8.0
region,16,8.0
bmi,15,7.5
blood_pressure,16,8.0
cholesterol,15,7.5
glucose,16,8.0
disease_risk,0,0.0


The dataset contains missing values across several columns. The percentage of missing values was calculated for each column.

- Age, BMI, Blood Pressure, Cholesterol, and Glucose contain missing values, which may have occurred due to incomplete medical records or measurement issues.
- Gender and Region also show missing entries, indicating incomplete demographic information.
- Patient ID and Disease Risk have no missing values, ensuring data integrity for unique identification and target prediction.

**Imputing using median**

In [71]:
from sklearn.impute import SimpleImputer

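median_impute = SimpleImputer(strategy='median')
# fit_transform returns a 2-D array; ravel() flattens it to one column, as in the cells below
df['bmi_median_imputed'] = median_impute.fit_transform(df[['bmi']]).ravel()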

- Median imputation replaces missing BMI values with the median of the observed BMI values.
  Unlike the mean, the median is barely affected by outliers.

Because the dataset contains extreme BMI values (3, 60, and 85), the median gives a
more reliable fill value and leaves the center of the distribution where it was.
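To illustrate why the median is the safer fill value, here is a minimal sketch on made-up BMI-like numbers (they are not taken from the dataset). It compares how far a single extreme value moves the mean versus the median.

```python
from statistics import mean, median

# Hypothetical BMI-like values, chosen only for illustration.
typical = [22.0, 24.0, 25.0, 26.0, 28.0]
with_outlier = typical + [85.0]  # one extreme reading added

# How far does each statistic move once the outlier is present?
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print("Shift in mean:  ", mean_shift)
print("Shift in median:", median_shift)
```

On these toy values the mean moves by 10 units while the median moves by only 0.5, which is why the median is the more stable value to fill gaps with when outliers are present.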


**Imputing using mode**

In [72]:
mode_impute=SimpleImputer(strategy='most_frequent')
df['region_mode_imputed']=mode_impute.fit_transform(df[['region']]).ravel()

This approach replaces missing entries with the mode (most frequent value) of the column.
Most-frequent imputation suits nominal features such as region,
where numerical averaging is not meaningful and category consistency is
important.
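As a rough sketch of what mode imputation does, the snippet below fills gaps in a small made-up list of regions using only the standard library. The list and variable names are hypothetical. It is not how `SimpleImputer` is implemented, and the two may not agree when categories are tied.

```python
from collections import Counter

# Hypothetical region values; None marks a missing entry.
regions = ["South", "North", None, "South", "East", None, "West", "South"]

# Find the most frequent category among the observed entries.
observed = [r for r in regions if r is not None]
most_common_region, count = Counter(observed).most_common(1)[0]

# Replace every missing entry with that category.
filled = [r if r is not None else most_common_region for r in regions]

print("Mode:", most_common_region, "seen", count, "times")
print("After fill:", filled)
```

Every gap becomes the majority category, so the share of that category grows after imputation. This is worth keeping in mind when many values are missing.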


**Imputing gender using mode**

In [73]:
gender_impute = SimpleImputer(strategy='most_frequent')
df['gender_imputed'] = gender_impute.fit_transform(df[['gender']]).ravel()

Missing values in the gender column were handled using the most frequent
imputation technique. This method replaces missing entries with the category
that appears most often in the dataset.

**KNN imputer**

In [74]:
from sklearn.impute import KNNImputer
num_cols = ["age", "bmi", "blood_pressure", "cholesterol", "glucose"]

knn_imputer = KNNImputer(
    n_neighbors=5,
    weights="uniform"
)

df_knn_imputed = df.copy()
df_knn_imputed[num_cols] = knn_imputer.fit_transform(df[num_cols])
df_knn_imputed[num_cols].isnull().sum()

age               0
bmi               0
blood_pressure    0
cholesterol       0
glucose           0
dtype: int64

For each missing value, the algorithm finds the k nearest observations (here k = 5)
based on similarity in the other features and imputes the unweighted average
of those neighbors' values.
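The idea can be sketched from scratch on a toy array. This is a simplified illustration, not scikit-learn's `KNNImputer`: the helper `knn_impute` and the toy numbers are assumptions, and the distance here is a plain root-mean-square difference over the features both rows have observed.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donor rows must have feature j observed.
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]

        def dist(r):
            # Compare rows only on features observed in both.
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                return np.inf
            return np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))

        nearest = sorted(donors, key=dist)[:k]
        out[i, j] = np.mean([X[r, j] for r in nearest])
    return out

# Hypothetical (bmi, blood_pressure) rows; the last row lacks blood pressure.
toy = np.array([
    [20.0, 110.0],
    [22.0, 114.0],
    [60.0, 150.0],
    [21.0, np.nan],
])
imputed = knn_impute(toy, k=2)
print(imputed)
```

The row with BMI 21 is closest to the rows with BMI 20 and 22, so its missing blood pressure is filled with the average of 110 and 114. The distant row (BMI 60) does not contribute.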

**MICE algorithm**

In [75]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

num_cols = ["age", "bmi", "blood_pressure", "cholesterol", "glucose"]
mice_imputer = IterativeImputer(
    max_iter=10,
    random_state=42
)
df_mice_imputed = df.copy()
df_mice_imputed[num_cols] = mice_imputer.fit_transform(df[num_cols])

df_mice_imputed[num_cols].isnull().sum()


age               0
bmi               0
blood_pressure    0
cholesterol       0
glucose           0
dtype: int64

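The MICE algorithm was applied with `IterativeImputer` to handle missing
values across all five numerical features together. Each feature with missing
values is modeled as a function of the other features, and the imputer cycles
through the features for up to `max_iter=10` rounds.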
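A stripped-down sketch of the chained-regression idea is shown below. It is an assumption-laden simplification. scikit-learn's `IterativeImputer` defaults to a Bayesian ridge regressor, whereas this sketch swaps in ordinary least squares via `np.linalg.lstsq`. The helper `iterative_impute` and the toy data are hypothetical.

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """Simplified chained-regression imputation using ordinary least squares."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means as a rough first guess.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            # Design matrix: intercept plus all other features.
            A = np.column_stack([np.ones(len(others)), others])
            obs = ~missing[:, j]
            # Fit on rows where column j was observed.
            coef, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            # Predict the rows where column j was missing.
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

# Toy data where the second column equals 2 * first + 1; one value is missing.
toy = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, np.nan]])
result = iterative_impute(toy)
print(result)
```

Because the toy relationship is exactly linear, the missing entry is recovered as roughly 9. On real data the fit is noisy, and the imputed values depend on the chosen estimator and the number of rounds.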

# Part B
**Z score method**

In [76]:
from scipy.stats import zscore

cols = ["cholesterol", "glucose"]
z_scores = np.abs(zscore(df[cols], nan_policy="omit"))
outlier = (z_scores > 3).any(axis=1)
outlier_patients = df[outlier]
df_cleaned = df[~outlier]
outlier_rows = df.loc[outlier, ["patient_id", "cholesterol", "glucose"]]

outlier_rows

,patient_id,cholesterol,glucose
30,P1030,207.793038,500.0
43,P1043,242.643825,350.0
65,P1065,50.0,98.684995
136,P1136,400.0,127.987109
139,P1139,600.0,87.061264


Five rows have |z| > 3 in cholesterol or glucose.
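For reference, the z-score rule can be reproduced by hand with NumPy. The sketch below uses made-up glucose-like values and the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The toy sample has more than ten values on purpose: with the population standard deviation, no z-score can exceed sqrt(n - 1) in absolute value.

```python
import numpy as np

# Hypothetical glucose-like readings with one extreme value at the end.
values = np.array([95, 100, 105, 110, 90] * 4 + [500], dtype=float)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (values - values.mean()) / values.std(ddof=0)

# Flag observations more than 3 standard deviations from the mean.
flagged = np.abs(z) > 3

print("Flagged values:", values[flagged])
print("Largest |z|:", np.abs(z).max())
```

Only the extreme reading is flagged. The outlier itself inflates both the mean and the standard deviation, so moderate outliers can hide behind a larger one when this rule is used.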

**IQR method**

In [77]:
bmi = df["bmi"].dropna()

Q1 = bmi.quantile(0.25)
Q3 = bmi.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
bmi_outliers = df[(df["bmi"] < lower_bound) | (df["bmi"] > upper_bound)]
df_cleaned = df[(df["bmi"] >= lower_bound) & (df["bmi"] <= upper_bound)]

print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("\nOutliers:")
bmi_outliers

Q1: 22.60249990818491
Q3: 27.74420583999376
IQR: 5.1417059318088505
Lower Bound: 14.889941010471635
Upper Bound: 35.45676473770703

Outliers:


,patient_id,age,gender,region,bmi,blood_pressure,cholesterol,glucose,disease_risk,bmi_median_imputed,region_mode_imputed,gender_imputed
61,P1061,,Male,West,85.0,92.926768,184.4805,69.536262,0,85.0,West,Male
71,P1071,53.0,,West,60.0,120.71098,204.177924,89.958916,1,60.0,West,Female
154,P1154,35.0,Male,South,3.0,127.215138,,93.593054,0,3.0,South,Male


The lower bound (about 14.89) lies well above the extremely low BMI of 3,
and the upper bound (about 35.46) lies far below the extreme BMIs of 60 and 85.

The bounds at 1.5 times the IQR therefore flag all three unrealistic BMI values.
Removing them leaves a dataset that is more stable and better suited to statistical modeling.
Note that the filter used for `df_cleaned` also drops rows with a missing BMI, because comparisons with NaN evaluate to False.
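The fence arithmetic can be checked directly from the quartiles printed above. The snippet below is a small sketch: the quartile values are copied from the cell output, and the list of candidate BMI values (including the ordinary value 25.0) is chosen only for illustration.

```python
# Quartiles copied from the printed output of the IQR cell.
q1 = 22.60249990818491
q3 = 27.74420583999376

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr  # values below this are flagged
upper_bound = q3 + 1.5 * iqr  # values above this are flagged

# Illustrative candidates: the three extreme BMIs plus one ordinary value.
candidates = [3.0, 60.0, 85.0, 25.0]
flagged = [bmi for bmi in candidates if bmi < lower_bound or bmi > upper_bound]

print("IQR:", iqr)
print("Bounds:", lower_bound, upper_bound)
print("Flagged:", flagged)
```

The three extreme values fall outside the fences, while the ordinary value of 25.0 stays inside.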

**Percentile Method**

In [78]:
cols = ["bmi", "cholesterol", "glucose"]

df_capped = df.copy()
for col in cols:

    lower_limit = df[col].quantile(0.01)
    upper_limit = df[col].quantile(0.99)
    
    df_capped[col] = np.where(
        df[col] < lower_limit, lower_limit,
        np.where(df[col] > upper_limit, upper_limit, df[col])
    )
    print(f"{col} lower limit:", lower_limit)
    print(f"{col} upper limit:", upper_limit)

bmi lower limit: 16.694101280307738
bmi upper limit: 37.95773785561129
cholesterol lower limit: 124.81072384104374
cholesterol upper limit: 289.93602803390183
glucose lower limit: 50.078558267817066
glucose upper limit: 184.44707826949963


**Winsorization**

In [79]:
cols = ["bmi", "cholesterol", "glucose"]

df_winsorized = df.copy()
for col in cols:
    lower_limit = df[col].quantile(0.01)
    upper_limit = df[col].quantile(0.99)
    
    df_winsorized[col] = np.clip(df[col], lower_limit, upper_limit)    
    print(col)
    print("Lower cap:", lower_limit)
    print("Upper cap:", upper_limit)
    print()

bmi
Lower cap: 16.694101280307738
Upper cap: 37.95773785561129

cholesterol
Lower cap: 124.81072384104374
Upper cap: 289.93602803390183

glucose
Lower cap: 50.078558267817066
Upper cap: 184.44707826949963



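After Winsorization, values below the 1st percentile and above the 99th percentile were capped at those limits. `np.clip` performs the same capping as the nested `np.where` in the percentile method, so both cells produce identical summaries.
- The total number of rows in the dataset remained unchanged.
- The minimum and maximum of the affected columns moved closer to the center of the distribution (BMI now ranges from 16.69 to 37.96 instead of 3 to 85).
- The means decreased slightly, while the standard deviations dropped substantially (BMI: 6.50 to 3.99).
- The distributions became more stable and less skewed.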
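The difference between capping (Winsorization) and trimming (dropping rows) can be sketched on a small made-up sample. The values are hypothetical, and the 10th and 90th percentiles are used here only because the toy sample is tiny. The notebook itself caps at the 1st and 99th percentiles.

```python
import numpy as np

# Hypothetical BMI-like values with one low and one high extreme.
values = np.array([18, 21, 22, 23, 24, 25, 25, 26, 27, 28, 30, 85], dtype=float)

# Percentile limits (10/90 chosen for this tiny sample, an assumption).
lower, upper = np.percentile(values, [10, 90])

# Winsorization: pull extremes in to the limits, keeping every row.
capped = np.clip(values, lower, upper)

# Trimming: drop rows outside the limits instead.
trimmed = values[(values >= lower) & (values <= upper)]

print("Rows before/capped/trimmed:", len(values), len(capped), len(trimmed))
print("Std before:", values.std(), "after capping:", capped.std())
```

Capping keeps all rows and shrinks the spread, whereas trimming loses observations. This is why the row count stays the same after Winsorization.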

### **Dataset shape and summary before and after**
**BMI median imputation before and after**

In [80]:
print(f'Before Impute: \n{df["bmi"].describe()}, \n\nAfter Impute: \n{df["bmi_median_imputed"].describe()}')

Before Impute: 
count    185.000000
mean      25.662279
std        6.500836
min        3.000000
25%       22.602500
50%       25.141054
75%       27.744206
max       85.000000
Name: bmi, dtype: float64, 

After Impute: 
count    200.000000
mean      25.623187
std        6.252545
min        3.000000
25%       22.731923
50%       25.141054
75%       27.598259
max       85.000000
Name: bmi_median_imputed, dtype: float64


**Region mode imputation before and after**

In [81]:
print(f'Before Impute: \n{df["region"].describe()}, \n\nAfter Impute: \n{df["region_mode_imputed"].describe()}')

Before Impute: 
count       184
unique        4
top       South
freq         49
Name: region, dtype: object, 

After Impute: 
count       200
unique        4
top       South
freq         65
Name: region_mode_imputed, dtype: object


**Winsorization before and after**

In [82]:
print("Shape Before Treatment:", df.shape)
summary_before = df[["bmi", "cholesterol", "glucose"]].describe()

print("Shape After Treatment:", df_winsorized.shape)
summary_after = df_winsorized[["bmi", "cholesterol", "glucose"]].describe()
print(f'Before Impute: \n{summary_before}, \n\nAfter Impute: \n{summary_after}')

Shape Before Treatment: (200, 12)
Shape After Treatment: (200, 12)
Before Impute: 
              bmi  cholesterol     glucose
count  185.000000   185.000000  184.000000
mean    25.662279   193.242546  106.802985
std      6.500836    45.537866   39.648145
min      3.000000    50.000000   40.000000
25%     22.602500   171.190988   90.148137
50%     25.141054   192.883623  104.275331
75%     27.744206   210.244585  116.510450
max     85.000000   600.000000  500.000000, 

After Impute: 
              bmi  cholesterol     glucose
count  185.000000   185.000000  184.000000
mean    25.363897   191.453479  104.281366
std      3.987244    30.704004   21.260244
min     16.694101   124.810724   50.078558
25%     22.602500   171.190988   90.148137
50%     25.141054   192.883623  104.275331
75%     27.744206   210.244585  116.510450
max     37.957738   289.936028  184.447078


**Gender imputation before and after**

In [83]:
print(f'Before Impute: \n{df["gender"].describe()}, \n\nAfter Impute: \n{df["gender_imputed"].describe()}')

Before Impute: 
count        184
unique         2
top       Female
freq          98
Name: gender, dtype: object, 

After Impute: 
count        200
unique         2
top       Female
freq         114
Name: gender_imputed, dtype: object


**KNN imputation before and after**

In [84]:
print(f'Before Impute: \n{df[["age", "bmi", "blood_pressure", "cholesterol", "glucose"]].describe()}, \n\nAfter Impute: \n{df_knn_imputed.describe()}')

Before Impute: 
              age         bmi  blood_pressure  cholesterol     glucose
count  184.000000  185.000000      184.000000   185.000000  184.000000
mean    46.695652   25.662279      120.655841   193.242546  106.802985
std     30.496345    6.500836       23.345356    45.537866   39.648145
min      0.000000    3.000000       40.000000    50.000000   40.000000
25%     23.000000   22.602500      108.628832   171.190988   90.148137
50%     47.000000   25.141054      118.214138   192.883623  104.275331
75%     66.000000   27.744206      129.638036   210.244585  116.510450
max    200.000000   85.000000      300.000000   600.000000  500.000000, 

After Impute: 
              age         bmi  blood_pressure  cholesterol     glucose  \
count  200.000000  200.000000      200.000000   200.000000  200.000000   
mean    47.269000   25.563433      120.533035   192.517425  106.442755   
std     29.569362    6.311200       22.494534    44.083431   38.406475   
min      0.000000    3.000000  

**MICE imputed before and after**

In [85]:
print(f'Before Impute: \n{df[["age", "bmi", "blood_pressure", "cholesterol", "glucose"]].describe()}, \n\nAfter Impute: \n{df_mice_imputed[num_cols].describe()}')

Before Impute: 
              age         bmi  blood_pressure  cholesterol     glucose
count  184.000000  185.000000      184.000000   185.000000  184.000000
mean    46.695652   25.662279      120.655841   193.242546  106.802985
std     30.496345    6.500836       23.345356    45.537866   39.648145
min      0.000000    3.000000       40.000000    50.000000   40.000000
25%     23.000000   22.602500      108.628832   171.190988   90.148137
50%     47.000000   25.141054      118.214138   192.883623  104.275331
75%     66.000000   27.744206      129.638036   210.244585  116.510450
max    200.000000   85.000000      300.000000   600.000000  500.000000, 

After Impute: 
              age         bmi  blood_pressure  cholesterol     glucose
count  200.000000  200.000000      200.000000   200.000000  200.000000
mean    46.695148   25.664574      120.647813   193.226670  106.802449
std     29.244678    6.251082       22.387306    43.790366   38.020872
min      0.000000    3.000000       40.0000

**IQR in bmi before and after**

In [86]:
print(f'Before Impute: \n{df["bmi"].describe()}, \n\nAfter Impute: \n{df_cleaned["bmi"].describe()}')

Before Impute: 
count    185.000000
mean      25.662279
std        6.500836
min        3.000000
25%       22.602500
50%       25.141054
75%       27.744206
max       85.000000
Name: bmi, dtype: float64, 

After Impute: 
count    182.000000
mean      25.272097
std        3.742152
min       16.504417
25%       22.610304
50%       25.119547
75%       27.715531
max       33.759212
Name: bmi, dtype: float64


**Percentile method before and after**

In [87]:
print(f'Before Impute: \n{df[cols].describe()}, \n\nAfter Impute: \n{df_capped[cols].describe()}')

Before Impute: 
              bmi  cholesterol     glucose
count  185.000000   185.000000  184.000000
mean    25.662279   193.242546  106.802985
std      6.500836    45.537866   39.648145
min      3.000000    50.000000   40.000000
25%     22.602500   171.190988   90.148137
50%     25.141054   192.883623  104.275331
75%     27.744206   210.244585  116.510450
max     85.000000   600.000000  500.000000, 

After Impute: 
              bmi  cholesterol     glucose
count  185.000000   185.000000  184.000000
mean    25.363897   191.453479  104.281366
std      3.987244    30.704004   21.260244
min     16.694101   124.810724   50.078558
25%     22.602500   171.190988   90.148137
50%     25.141054   192.883623  104.275331
75%     27.744206   210.244585  116.510450
max     37.957738   289.936028  184.447078


# Part C
**New data without missing values and outliers**

In [88]:
import pandas as pd
from sklearn.impute import SimpleImputer

age_imputer = SimpleImputer(strategy="median")
df["age"] = age_imputer.fit_transform(df[["age"]])

bmi_imputer = SimpleImputer(strategy="median")
df["bmi"] = bmi_imputer.fit_transform(df[["bmi"]])

bp_imputer = SimpleImputer(strategy="mean")
df["blood_pressure"] = bp_imputer.fit_transform(df[["blood_pressure"]])

chol_imputer = SimpleImputer(strategy="median")
df["cholesterol"] = chol_imputer.fit_transform(df[["cholesterol"]])

glucose_imputer = SimpleImputer(strategy="mean")
df["glucose"] = glucose_imputer.fit_transform(df[["glucose"]])

Q1 = df["bmi"].quantile(0.25)
Q3 = df["bmi"].quantile(0.75)
IQR = Q3 - Q1
lower_bmi = Q1 - 1.5 * IQR
upper_bmi = Q3 + 1.5 * IQR
df["bmi"] = np.clip(df["bmi"], lower_bmi, upper_bmi)

for col in ["cholesterol", "glucose"]:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = np.clip(df[col], lower, upper)

mean_bp = df["blood_pressure"].mean()
std_bp = df["blood_pressure"].std()
z_scores_bp = (df["blood_pressure"] - mean_bp) / std_bp
df["blood_pressure"] = np.where(np.abs(z_scores_bp) > 3,mean_bp,df["blood_pressure"])

print('New data without missing values:\n\n',df)
print('\n Total missing values after imputation: \n')
df[["age","bmi","blood_pressure","cholesterol","glucose"]].isnull().sum()

New data without missing values:

     patient_id   age  gender region        bmi  blood_pressure  cholesterol  \
0        P1000  51.0  Female  North  23.989727      118.315079   168.614627   
1        P1001  14.0    Male  South  20.008867      116.685456   193.192907   
2        P1002  71.0    Male   East  31.529645      129.212501   192.883623   
3        P1003  60.0    Male   West  19.279434      131.362616   235.119790   
4        P1004  20.0    Male  North  23.239822      112.042483   127.371406   
..         ...   ...     ...    ...        ...             ...          ...   
195      P1195  61.0     NaN  South  25.141054      117.834594   208.987955   
196      P1196  57.0  Female   West  32.858901      111.395070   196.087691   
197      P1197  51.0  Female  North  25.141054      111.797116   144.527677   
198      P1198  11.0    Male   East  22.201098      119.508701   192.883623   
199      P1199  38.0  Female   West  25.855920      120.655841   243.876330   

        glucose 

age               0
bmi               0
blood_pressure    0
cholesterol       0
glucose           0
dtype: int64

## Data Cleaning Summary Report

### 1️⃣ Most Effective Imputation Strategy

- **Median imputation** was most effective for skewed features like *BMI* and *Cholesterol*.
- It is robust to extreme values and preserved the true central tendency.
- **Mean imputation** worked well for relatively symmetric features like *Blood Pressure* and *Glucose*.



---

### 2️⃣ Best Outlier Handling Method

- **Winsorization** preserved data quality best.
- It capped extreme values without removing records.
- Dataset size remained unchanged.
- Statistical measures became more stable.



---

### 3️⃣ Improvement in Dataset Usability

After cleaning:

- ✔ No missing values remain in the numerical columns (missing `gender` and `region` entries are filled only in `gender_imputed` and `region_mode_imputed`)  
- ✔ Extreme distortions reduced  
- ✔ Mean and standard deviation stabilized  
- ✔ Feature distributions became more realistic  
- ✔ Dataset is ready for machine learning modeling  

🚀 Overall, data cleaning improved the reliability, stability, and model readiness of the dataset.