## Automating Data Cleaning in Python

### Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using `Pipeline` from `sklearn.pipeline`.
    - Use `StandardScaler` to scale features.

In [1]:
# Write your code from here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load a sample dataset with Pandas
data = {'feature1': [10, 20, 30, 40, 50],
        'feature2': [100, 120, 110, 130, 105],
        'feature3': [0.5, 0.1, 0.9, 0.3, 0.7]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("-" * 30)

# 2. Define a pipeline using Pipeline from sklearn.pipeline.
# 3. Use StandardScaler to scale features.
# The pipeline has a single named step, 'scaler', that wraps StandardScaler.
scaling_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Fit the pipeline to the data and transform it.
# Note: for demonstration, we fit and transform the same data.
# In a real scenario, fit on the training data only, then use the fitted
# pipeline to transform both the training and the test data.
scaled_data = scaling_pipeline.fit_transform(df)

# By default, the output of the pipeline is a NumPy array.
# Convert it back to a DataFrame, keeping the original column names and index.
scaled_df = pd.DataFrame(scaled_data, columns=df.columns, index=df.index)

print("DataFrame after StandardScaler:")
print(scaled_df)
print("-" * 30)

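# You can also inspect the fitted scaler within the pipeline.
# mean_ is the per-feature mean; scale_ is the per-feature population
# standard deviation (ddof=0), not the sample standard deviation.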
print("Scaler mean values (fitted):")
print(scaling_pipeline.named_steps['scaler'].mean_)
print("Scaler scale (std dev) values (fitted):")
print(scaling_pipeline.named_steps['scaler'].scale_)

Original DataFrame:
   feature1  feature2  feature3
0        10       100       0.5
1        20       120       0.1
2        30       110       0.9
3        40       130       0.3
4        50       105       0.7
------------------------------
DataFrame after StandardScaler:
   feature1  feature2  feature3
0 -1.414214 -1.207020  0.000000
1 -0.707107  0.649934 -1.414214
2  0.000000 -0.278543  1.414214
3  0.707107  1.578410 -0.707107
4  1.414214 -0.742781  0.707107
------------------------------
Scaler mean values (fitted):
[ 30.  113.    0.5]
Scaler scale (std dev) values (fitted):
[14.14213562 10.77032961  0.28284271]
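
The note in the cell above about fitting on training data only can be sketched as follows. This is a minimal illustration, not part of the original exercise: the split into `train_df` and `test_df` (first four rows versus the last row) is a hypothetical choice made only to keep the example small. It also reproduces the scaling by hand, assuming the standard z-score formula with the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data as in the cell above.
df = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [100, 120, 110, 130, 105],
    'feature3': [0.5, 0.1, 0.9, 0.3, 0.7],
})

# Hypothetical split for illustration: first four rows train, last row test.
train_df = df.iloc[:4]
test_df = df.iloc[4:]

pipeline = Pipeline([('scaler', StandardScaler())])

# fit() learns mean_ and scale_ from the training rows only.
pipeline.fit(train_df)

# transform() reuses those training statistics on the unseen row.
test_scaled = pd.DataFrame(
    pipeline.transform(test_df),
    columns=test_df.columns,
    index=test_df.index,
)

# Reproduce the result by hand: z = (x - mean) / std, using the
# population standard deviation (ddof=0) of the training rows.
manual = (test_df - train_df.mean()) / train_df.std(ddof=0)

print(test_scaled)
print("Matches manual z-score:", np.allclose(test_scaled, manual))
```

Because the test row is scaled with the training mean and standard deviation, no information from the test data leaks into the fitted pipeline.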


### Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses `SimpleImputer` to fill missing values.

In [2]:
# Write your code from here
import pandas as pd
import numpy as np # For np.nan
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# 1. Load a dataset with missing values
data_with_missing = {'featureA': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                     'featureB': [10.0, np.nan, np.nan, 13.0, 10.0, 11.0],
                     'featureC': [100.0, 110.0, 120.0, 130.0, 140.0, np.nan]}
df_missing = pd.DataFrame(data_with_missing)

print("Original DataFrame with missing values:")
print(df_missing)
print("-" * 30)

# 2. Define a pipeline that uses SimpleImputer to fill missing values.
# The 'mean' strategy replaces each NaN with the mean of its column,
# computed from the non-missing values seen during fit.
# Other strategies include 'median', 'most_frequent', and 'constant'.
imputation_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Fit the pipeline to the data and transform it.
imputed_data = imputation_pipeline.fit_transform(df_missing)

# By default, the output of the pipeline is a NumPy array.
# Convert it back to a DataFrame, keeping the original column names and index.
imputed_df = pd.DataFrame(imputed_data, columns=df_missing.columns, index=df_missing.index)

print("DataFrame after SimpleImputer (mean strategy):")
print(imputed_df)
print("-" * 30)

# You can also inspect the fitted imputer within the pipeline
print("Values used for imputation (fitted means):")
print(imputation_pipeline.named_steps['imputer'].statistics_)

Original DataFrame with missing values:
   featureA  featureB  featureC
0       1.0      10.0     100.0
1       2.0       NaN     110.0
2       NaN       NaN     120.0
3       4.0      13.0     130.0
4       5.0      10.0     140.0
5       6.0      11.0       NaN
------------------------------
DataFrame after SimpleImputer (mean strategy):
   featureA  featureB  featureC
0       1.0      10.0     100.0
1       2.0      11.0     110.0
2       3.6      11.0     120.0
3       4.0      13.0     130.0
4       5.0      10.0     140.0
5       6.0      11.0     120.0
------------------------------
Values used for imputation (fitted means):
[  3.6  11.  120. ]
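
The two tasks can also be chained, so that imputation and scaling run as one pipeline. The sketch below is an illustrative combination rather than part of the original exercise: the name `cleaning_pipeline` and the step order (impute first, then scale) are assumptions made for the example. Imputing first matters here because the scaler then receives complete columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data with missing values as in the cell above.
df_missing = pd.DataFrame({
    'featureA': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'featureB': [10.0, np.nan, np.nan, 13.0, 10.0, 11.0],
    'featureC': [100.0, 110.0, 120.0, 130.0, 140.0, np.nan],
})

# Steps run in order: fill NaNs with column means, then standardize.
cleaning_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

cleaned_df = pd.DataFrame(
    cleaning_pipeline.fit_transform(df_missing),
    columns=df_missing.columns,
    index=df_missing.index,
)

print(cleaned_df)
print("Remaining missing values:", int(cleaned_df.isna().sum().sum()))
```

Each fitted step remains accessible through `named_steps`, so the imputation means and the scaler statistics can still be inspected separately after fitting.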
