# TASK 1: ETL PIPELINE

## Project: Preprocessing Pipeline for Diabetes Risk Prediction

### Dataset: Pima Indians Diabetes

### Output: Cleaned data ready for machine learning

In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [14]:
try:
    # Load dataset
    df = pd.read_csv("diabetes.csv")
    print("✅ Dataset loaded successfully!")
    print("Original shape:", df.shape)
    print("Sample rows:\n", df.head())

    # Replace 0s with np.nan in relevant columns
    columns_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
    df[columns_with_zeros] = df[columns_with_zeros].replace(0, np.nan)
    print("\n🔄 Replaced 0s with NaN in:", columns_with_zeros)
    print("🧪 Missing values before imputation:\n", df[columns_with_zeros].isnull().sum())

    # Impute missing values using median
    imputer = SimpleImputer(strategy='median')
    df[columns_with_zeros] = imputer.fit_transform(df[columns_with_zeros])
    print("✅ Missing values imputed.")

    # Scale features
    scaler = StandardScaler()
    features = df.drop('Outcome', axis=1)
    scaled_features = scaler.fit_transform(features)

    df_scaled = pd.DataFrame(scaled_features, columns=features.columns)
    df_scaled['Outcome'] = df['Outcome']
    print("📊 Scaling complete. Preview:\n", df_scaled.head())

    # Save cleaned dataset
    df_scaled.to_csv("cleaned_diabetes_data.csv", index=False)
    print("✅ Cleaned dataset saved as 'cleaned_diabetes_data.csv'.")

except FileNotFoundError:
    print("❌ File not found: Please make sure 'diabetes.csv' is in the working directory.")
except Exception as e:
    print("❌ Unexpected error:", str(e))

✅ Dataset loaded successfully!
Original shape: (768, 9)
Sample rows:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

🔄 Replaced 0s with NaN in: ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
🧪 Missing values before imputation:
 Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11

### Key Preprocessing Steps:


#### 1. Data Loading:
Read the raw CSV file (diabetes.csv, 768 rows and 9 columns) with pandas.

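#### 2. Invalid Values Handling:
Replaced 0s with NaN in Glucose, BloodPressure, SkinThickness, Insulin, and BMI. A zero is not a plausible measurement for any of these columns, so it is treated as a missing value. This flagged 5, 35, 227, 374, and 11 entries respectively.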
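The zero-to-NaN step can be illustrated on a small toy frame. This is a minimal sketch: the values are made up, and `invalid_zero_cols` is a hypothetical name chosen for the illustration. It shows why a column such as Pregnancies is left alone, since 0 is a valid count there.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values mimicking three of the dataset's columns.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 183],
    "Insulin": [0, 94, 0],
})

# Only columns where 0 is an implausible measurement are treated as missing.
# Pregnancies is excluded because 0 is a legitimate value.
invalid_zero_cols = ["Glucose", "Insulin"]
toy[invalid_zero_cols] = toy[invalid_zero_cols].replace(0, np.nan)

# Count the missing entries per column after the replacement.
missing_counts = toy.isnull().sum()
print(missing_counts)
```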

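#### 3. Missing Value Imputation:

Used SimpleImputer with the median strategy to fill the missing values in those five columns.

The median is robust to outliers and skewed distributions, both common in clinical measurements (e.g., extreme insulin or glucose readings), whereas the mean would be pulled toward the extremes.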
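To make the mean-versus-median point concrete, here is a small sketch on made-up values with one extreme reading and one missing entry. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up insulin-like values: four typical readings, one outlier, one missing.
values = np.array([[80.0], [90.0], [100.0], [110.0], [800.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

mean_fill = mean_filled[-1, 0]      # pulled upward by the 800.0 outlier
median_fill = median_filled[-1, 0]  # stays near the typical readings

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

With these values the mean-based fill is 236.0 while the median-based fill is 100.0, so the imputed entry stays in the range of the typical readings only with the median.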



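#### 4. Feature Scaling:

Applied StandardScaler to standardize all feature columns (zero mean, unit variance). The Outcome label is left unscaled.

Scaling matters most for models that are sensitive to feature magnitude, such as gradient-trained models (logistic regression, neural networks) and distance-based models (k-NN). Tree-based models are largely unaffected.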
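A quick way to confirm what standardization does is to check the column statistics after scaling. The sketch below uses the Age and DiabetesPedigreeFunction values from the first four sample rows shown earlier. Note that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and DiabetesPedigreeFunction from the first four sample rows.
X = np.array([
    [50.0, 0.627],
    [31.0, 0.351],
    [32.0, 0.672],
    [21.0, 0.167],
])

scaled = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # ddof=0, matching StandardScaler

print("means:", col_means)
print("stds:", col_stds)
```

After scaling, Age (tens of years) and DiabetesPedigreeFunction (fractions) sit on the same scale, so neither dominates purely because of its units.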



#### 5. Cleaned Output:

Saved the scaled features together with the original Outcome column as cleaned_diabetes_data.csv (without the index) for further use.
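Before handing the saved file to a model, a short sanity check on the reloaded data can catch problems early. The sketch below is an assumption-laden illustration: `check_cleaned` is a hypothetical helper, not part of the notebook, and it runs on a toy frame written to a temporary directory rather than on the real output file.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def check_cleaned(frame: pd.DataFrame, target: str = "Outcome") -> bool:
    """Return True if the frame has no missing values and a 0/1 target."""
    no_missing = not frame.isnull().any().any()
    binary_target = set(frame[target].unique()) <= {0, 1}
    return no_missing and binary_target


# Toy stand-in for the cleaned dataset (made-up scaled values).
toy = pd.DataFrame({"Glucose": [0.5, -1.2, 0.7], "Outcome": [1, 0, 1]})

# Round-trip through CSV the same way the pipeline saves its output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

ok = check_cleaned(reloaded)
print("cleaned data looks valid:", ok)
```

The same helper could be pointed at the real file by reading cleaned_diabetes_data.csv with `pd.read_csv` and passing the result in.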



### ✅ Outcome:

A fully cleaned and scaled medical dataset, ready for model training.

