# Task 1: Data Preprocessing Pipeline using Pandas and Scikit-learn


## Objective
Build an ETL-style preprocessing pipeline for a tabular dataset with pandas and the following scikit-learn tools:
- SimpleImputer
- OneHotEncoder
- StandardScaler
- ColumnTransformer
- Pipeline

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml

### Load the Dataset
We're using the Titanic dataset from OpenML for demonstration.


In [10]:
# Step 1: Load dataset

# data = pd.read_csv("filename.csv")
data = fetch_openml(name='titanic', version=1, as_frame=True)['frame']

In [11]:
# Step 2: Separate features and target

X = data.drop(columns=['survived'])
y = data['survived']

### Understand the Feature Types
Numeric and categorical columns need different preprocessing (mean imputation and scaling versus most-frequent imputation and one-hot encoding), so we split the column names by dtype.


In [12]:
# Step 3: Identify column types

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
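To make the dtype-based selection concrete, here is a minimal sketch on a small hand-made frame. The `toy` DataFrame and its column names are assumptions for illustration only; they are not the Titanic data, and the dtypes are set explicitly so the result does not depend on pandas defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with explicit dtypes (illustration only).
toy = pd.DataFrame({
    "age": pd.Series([22.0, np.nan, 35.0], dtype="float64"),
    "sibsp": pd.Series([1, 0, 0], dtype="int64"),
    "sex": pd.Categorical(["male", "female", "female"]),
    "cabin": pd.Series(["C85", None, "E46"], dtype=object),
})

# Same selection logic as in Step 3, applied to the toy frame.
toy_numeric = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
toy_categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()

print(toy_numeric)
print(toy_categorical)
```

Columns come back in their original order. If a dataset stores numbers in other widths such as `int32`, `include="number"` selects every numeric dtype instead of only the two listed here.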

### Define Numeric and Categorical Pipelines


In [13]:
# Step 4: Define transformers

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
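As a rough illustration of what each sub-pipeline does, the sketch below rebuilds the two transformers and runs them on tiny single-column inputs. The names `numeric_demo`, `categorical_demo`, `ages`, and `ports` are hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Hypothetical inputs, each with one missing value.
ages = pd.DataFrame({"age": [20.0, np.nan, 40.0]})
ports = pd.DataFrame({"embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object)})

# Missing age is filled with the column mean (30.0), then the column is standardized.
scaled = numeric_demo.fit_transform(ages)

# Missing port is filled with the most frequent value ("S"), then one-hot encoded.
# OneHotEncoder returns a sparse matrix by default, so convert it for inspection.
encoded = categorical_demo.fit_transform(ports).toarray()

print(scaled.ravel())
print(encoded)
```

The imputed age lands exactly on the mean, so it scales to 0. The encoded matrix has one column per observed category (`C`, `S`), and the imputed row is encoded as `S`.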

In [14]:
# Step 5: Combine transformers

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [15]:
# Step 6: Create a complete pipeline

etl_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

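### Apply Preprocessing and Split the Data

In [16]:
# Step 7: Train-test split
# Split before fitting so that test rows do not influence the imputer or scaler statistics.

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
# Step 8: Fit the pipeline on the training data only, then transform both sets

X_train = etl_pipeline.fit_transform(X_train_raw)
X_test = etl_pipeline.transform(X_test_raw)

print("ETL pipeline completed successfully.")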

ETL pipeline completed successfully.


## Conclusion
We built a reusable ETL pipeline with scikit-learn to:
- Handle missing values
- Scale numeric features
- Encode categorical features
- Prepare clean training and testing sets
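
One way to check the reusability claim is to fit a pipeline once and then call `transform` on rows it has never seen. The sketch below does this on invented data; `reuse_pipeline`, `train_df`, and `new_df` are hypothetical names, and the new row deliberately contains a missing number and an unseen category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data (illustration only).
train_df = pd.DataFrame({
    "fare": pd.Series([7.25, 71.28, 8.05, 53.10], dtype="float64"),
    "embarked": pd.Series(["S", "C", "S", "S"], dtype=object),
})

reuse_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["fare"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["embarked"]),
    ])),
])
reuse_pipeline.fit(train_df)

# New row: missing fare and a port ("Q") that was not present during fitting.
new_df = pd.DataFrame({
    "fare": pd.Series([np.nan], dtype="float64"),
    "embarked": pd.Series(["Q"], dtype=object),
})
new_row = reuse_pipeline.transform(new_df)
if hasattr(new_row, "toarray"):
    new_row = new_row.toarray()

print(new_row)
```

The missing fare is replaced by the training mean and therefore scales to 0, and the unseen category is encoded as all zeros because of `handle_unknown='ignore'`. No statistics are re-estimated at transform time, which is what makes the fitted pipeline safe to reuse on held-out or future data.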
