### Step 1: Load the Dataset
#### Dataset Loading
- Used the load_diabetes() function from scikit-learn to load the dataset into memory.

#### DataFrame Creation
- Converted the data into a pandas DataFrame for easier manipulation and analysis.

#### Target Column Addition
- Added the target column to the DataFrame. It holds a quantitative measure of disease progression one year after baseline.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
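from sklearn.datasets import load_diabetes

# Load the dataset as a Bunch with .data, .target, and .feature_names
diabetes = load_diabetes()
# Build a DataFrame from the feature matrix, keeping the feature names as column labels
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
# Append the disease progression measure as the last column
data['target'] = diabetes.target
data.head()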


|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


### Step 2: Standardize the Features
- Standardization rescales each feature to a mean of 0 and a standard deviation of 1. This matters for scale-sensitive methods such as regularized linear models, k-nearest neighbors, and gradient-based optimizers.

#### Feature Scaling
- Used StandardScaler from scikit-learn to standardize the feature columns.
- Putting every feature on the same scale makes coefficients and distances comparable across features, which often (though not always) improves model performance.
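
To make the transformation concrete, here is a minimal sketch of the z-score computation on a toy column, using only the standard library. The helper name `standardize` and the sample values are illustrative assumptions, not part of this notebook. The population standard deviation (ddof=0) is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population std."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), matching StandardScaler
    return [(v - mu) / sigma for v in values]

toy_column = [18.5, 22.0, 27.5, 31.0, 36.0]  # hypothetical raw values
z = standardize(toy_column)

# After standardization the column has mean ~0 and population std ~1
print([round(v, 3) for v in z])
print(abs(fmean(z)) < 1e-9, abs(pstdev(z) - 1.0) < 1e-9)
```

Each output value says how many standard deviations the original value lies from the column mean, which is why columns with different units become directly comparable.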

#### Transformation
- Applied the scaler to all feature columns (excluding the target column) using the fit_transform method.
- Created a new DataFrame to store the scaled features, while preserving the column names for readability.

#### Target Retention
- Retained the original target values by adding the target column back to the scaled DataFrame.

In [7]:
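from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the feature columns only; the last column is the target and stays unscaled
scaled_features = scaler.fit_transform(data.iloc[:, :-1])
# fit_transform returns a NumPy array, so restore the column names
scaled_data = pd.DataFrame(scaled_features, columns=diabetes.feature_names)
# Add the original target values back
scaled_data['target'] = data['target']
scaled_data.head()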


|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 0 | 0.8005 | 1.065488 | 1.297088 | 0.459841 | -0.929746 | -0.732065 | -0.912451 | -0.054499 | 0.418531 | -0.370989 | 151.0 |
| 1 | -0.039567 | -0.938537 | -1.08218 | -0.553505 | -0.177624 | -0.402886 | 1.564414 | -0.830301 | -1.436589 | -1.938479 | 75.0 |
| 2 | 1.793307 | 1.065488 | 0.934533 | -0.119214 | -0.958674 | -0.718897 | -0.680245 | -0.054499 | 0.060156 | -0.545154 | 141.0 |
| 3 | -1.872441 | -0.938537 | -0.243771 | -0.77065 | 0.256292 | 0.525397 | -0.757647 | 0.721302 | 0.476983 | -0.196823 | 206.0 |
| 4 | 0.113172 | -0.938537 | -0.764944 | 0.459841 | 0.082726 | 0.32789 | 0.171178 | -0.054499 | -0.672502 | -0.980568 | 135.0 |
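
The scaled values can be checked by hand. According to the scikit-learn documentation (not stated in this notebook), the features returned by load_diabetes() are already mean-centered and scaled so that each column's sum of squares is 1. Assuming that, each column has a population standard deviation of 1/sqrt(n) for n = 442 samples, so standardizing amounts to multiplying every value by sqrt(442). A quick sketch using the first row's age and sex values from the output of Step 1:

```python
import math

n_samples = 442  # dataset size per the scikit-learn docs (assumption, not shown in this notebook)
factor = math.sqrt(n_samples)

raw_age, raw_sex = 0.038076, 0.05068  # row 0 before scaling

# Multiplying by sqrt(n) should reproduce row 0 of the scaled output
print(round(raw_age * factor, 4))  # approximately 0.8005
print(round(raw_sex * factor, 4))  # approximately 1.0655
```

The results agree with the scaled age and sex values in row 0, up to rounding of the displayed inputs.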


### Step 3: Save the Processed Dataset
- After standardizing the features, the next step is to save the processed dataset for future use. This ensures that subsequent notebooks (e.g., EDA and modeling) can access the cleaned and transformed data.

#### Directory Check
- Verified the existence of the ../data/processed/ directory.
- Created the directory if it did not already exist to ensure a proper data pipeline structure.
  
#### Save Processed Data
- Saved the standardized dataset as a CSV file named diabetes_scaled.csv in the processed directory.
- This file will serve as the input for the EDA and modeling notebooks.
  
#### Confirmation
- Printed the path of the saved file to confirm the operation was successful.

In [8]:
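import os

processed_dir = "../data/processed/"
# Create the directory (and any missing parents) if it does not exist yet
os.makedirs(processed_dir, exist_ok=True)
processed_path = os.path.join(processed_dir, "diabetes_scaled.csv")
# index=False omits the row index, so the file holds only the feature and target columns
scaled_data.to_csv(processed_path, index=False)
print(f"Processed data saved to {processed_path}")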


Processed data saved to ../data/processed/diabetes_scaled.csv
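
Before downstream notebooks rely on the file, it can help to read it back and confirm that the header and rows survived the round trip. The following is a standard-library sketch of the same create-directory, save, and verify pattern. It writes to a temporary directory (an assumption made so the sketch runs anywhere) and uses a few values copied from the scaled output of Step 2 rather than the full dataset.

```python
import csv
import tempfile
from pathlib import Path

# A temporary directory stands in for the project root (assumption for this sketch)
root = Path(tempfile.mkdtemp())
processed_dir = root / "data" / "processed"

# parents=True creates intermediate folders; exist_ok=True makes reruns safe
processed_dir.mkdir(parents=True, exist_ok=True)

processed_path = processed_dir / "diabetes_scaled.csv"
fieldnames = ["age", "bmi", "target"]  # subset of columns, for brevity
rows = [
    {"age": 0.8005, "bmi": 1.297088, "target": 151.0},
    {"age": -0.039567, "bmi": -1.08218, "target": 75.0},
]
with processed_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and compare header and row count with what was written
with processed_path.open(newline="") as f:
    loaded = list(csv.DictReader(f))

print(processed_path.exists(), len(loaded), list(loaded[0].keys()))
```

In the notebook itself, the equivalent check would be to load the saved CSV with pandas and compare its shape and column names against scaled_data.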
