## Automating Data Cleaning in Python

Task: Basic Pipeline with Scaling
1. Objective: Create a pipeline that scales numerical features in a dataset.
2. Steps:
    - Load a sample dataset with Pandas.
    - Define a pipeline using Pipeline from sklearn.pipeline.
    - Use StandardScaler to scale features.

In [2]:
# Write your code from here
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load a sample dataset (Iris dataset)
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
target = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Define a pipeline that scales features using StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler())  # Apply StandardScaler to the dataset
])

# Fit the pipeline to the training data and transform the data
X_train_scaled = pipeline.fit_transform(X_train)

# Transform the test data with the scaler fitted on the training data (no refitting)
X_test_scaled = pipeline.transform(X_test)

# Output: Scaled train and test datasets
print("Scaled Training Data:\n", X_train_scaled[:5])  # Print first 5 rows of scaled training data
print("\nScaled Test Data:\n", X_test_scaled[:5])  # Print first 5 rows of scaled test data



Scaled Training Data:
 [[-1.47393679  1.20365799 -1.56253475 -1.31260282]
 [-0.13307079  2.99237573 -1.27600637 -1.04563275]
 [ 1.08589829  0.08570939  0.38585821  0.28921757]
 [-1.23014297  0.75647855 -1.2187007  -1.31260282]
 [-1.7177306   0.30929911 -1.39061772 -1.31260282]]

Scaled Test Data:
 [[ 0.35451684 -0.58505976  0.55777524  0.02224751]
 [-0.13307079  1.65083742 -1.16139502 -1.17911778]
 [ 2.30486738 -1.0322392   1.8185001   1.49058286]
 [ 0.23261993 -0.36147005  0.44316389  0.4227026 ]
 [ 1.2077952  -0.58505976  0.61508092  0.28921757]]
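
The scaled values above can be sanity-checked. The following is a minimal sketch, assuming the same Iris split as in the cell above: it reads the per-column mean and standard deviation that StandardScaler learned during fitting (its `mean_` and `scale_` attributes) and reproduces the transformation by hand. The variable names `manual`, `train_means`, and `train_stds` are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same split and pipeline as in the cell above
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler())])
X_train_scaled = pipeline.fit_transform(X_train)
X_test_scaled = pipeline.transform(X_test)

# The fitted scaler stores the statistics it learned from the training data
scaler = pipeline.named_steps['scaler']

# Reproduce the scaling by hand: z = (x - mean) / std
manual = (X_train.to_numpy() - scaler.mean_) / scaler.scale_
print("Manual scaling matches pipeline:", np.allclose(manual, X_train_scaled))

# Training columns end up with mean ~0 and standard deviation ~1
# (NumPy's default ddof=0 matches the population std used by StandardScaler)
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Train column means:", train_means.round(6))
print("Train column stds: ", train_stds.round(6))

# Test columns are scaled with the training statistics, so their means
# are generally close to, but not exactly, zero
print("Test column means: ", X_test_scaled.mean(axis=0).round(3))
```

Fitting only on the training split and reusing those statistics for the test split keeps information from the test data out of the preprocessing step.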


Task: Pipeline with Imputation
1. Objective: Automate data cleaning by handling missing values.
2. Steps:
    - Load a dataset with missing values.
    - Define a pipeline that uses SimpleImputer to fill missing values.

In [3]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Create a sample dataset with missing values
data = {
    'Age': [25, 30, 35, None, 40, None, 45],
    'Salary': [50000, 60000, 70000, 80000, None, 90000, 100000],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank']
}
df = pd.DataFrame(data)

# Define which columns to process (numerical columns in this case)
numeric_columns = ['Age', 'Salary']

# Define a pipeline that imputes missing values in the numerical columns
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))  # Impute missing values with the mean of the column
])

# Use a ColumnTransformer to apply the pipeline only to the numeric columns
column_transformer = ColumnTransformer(
    transformers=[
        ('num', pipeline, numeric_columns)
    ],
    remainder='passthrough'  # Leave non-numeric columns unchanged
)

# Fit the column transformer and transform the dataset
df_imputed = column_transformer.fit_transform(df)

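# Convert the result back to a DataFrame. ColumnTransformer outputs the
# transformed (numeric) columns first, followed by the passthrough columns.
passthrough_columns = [col for col in df.columns if col not in numeric_columns]
df_imputed_df = pd.DataFrame(df_imputed, columns=numeric_columns + passthrough_columns)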

# Output the original and imputed data
print("Original Data with Missing Values:")
print(df)
print("\nImputed Data:")
print(df_imputed_df)

Original Data with Missing Values:
    Age    Salary     Name
0  25.0   50000.0     John
1  30.0   60000.0    Alice
2  35.0   70000.0      Bob
3   NaN   80000.0  Charlie
4  40.0       NaN    David
5   NaN   90000.0      Eve
6  45.0  100000.0    Frank

Imputed Data:
    Age    Salary     Name
0  25.0   50000.0     John
1  30.0   60000.0    Alice
2  35.0   70000.0      Bob
3  35.0   80000.0  Charlie
4  40.0   75000.0    David
5  35.0   90000.0      Eve
6  45.0  100000.0    Frank
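
Two details of the result above are worth checking. The sketch below, which assumes the same sample data as the previous cell, reads the fill values the imputer learned (its `statistics_` attribute) and restores numeric dtypes afterwards. Because the passthrough `Name` column holds strings, the combined array returned by the ColumnTransformer has object dtype, so the numeric columns need an explicit cast. The names `fill_values` and `other_columns` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Same sample data as in the cell above
df = pd.DataFrame({
    'Age': [25, 30, 35, None, 40, None, 45],
    'Salary': [50000, 60000, 70000, 80000, None, 90000, 100000],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
})
numeric_columns = ['Age', 'Salary']

column_transformer = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean'))]), numeric_columns)
    ],
    remainder='passthrough',
)
result = column_transformer.fit_transform(df)

# Look up the fitted imputer and the column means it uses as fill values
imputer = column_transformer.named_transformers_['num'].named_steps['imputer']
fill_values = dict(zip(numeric_columns, imputer.statistics_))
print("Fill values:", fill_values)

# Rebuild the DataFrame, then cast the numeric columns back to float
other_columns = [col for col in df.columns if col not in numeric_columns]
df_imputed = pd.DataFrame(result, columns=numeric_columns + other_columns, index=df.index)
df_imputed = df_imputed.astype({col: float for col in numeric_columns})

print(df_imputed.dtypes)
print("Remaining missing values:", int(df_imputed.isna().sum().sum()))
```

Casting back to float matters for any later numeric step, since arithmetic and aggregation on object-typed columns are slower and can behave unexpectedly.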
