# Exercise 1

* Load **sample_dataset.csv** and select only the features: mean radius, area error, mean perimeter
* Apply the following transformations using ColumnTransformer and Pipeline:
    * Numerical features:
        * Impute missing values with the mean
        * Apply the Yeo-Johnson power transformation
    * Categorical features:
        * Impute missing values with the most frequent value
        * One-hot encode with dense output

In [None]:
import pandas as pd

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder


In [None]:
df = pd.read_csv("../sample_dataset.csv").loc[:,['mean radius', 'area error', 'mean perimeter']]

In [None]:
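# Numerical branch: fill gaps with the column mean, then apply Yeo-Johnson
numerical_pipeline = Pipeline([
    ('cleaner', SimpleImputer(strategy='mean')),
    ('power', PowerTransformer(method='yeo-johnson'))
])

# Categorical branch: fill gaps with the most frequent value, then one-hot encode
categorical_pipeline = Pipeline([
    ('cleaner', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False))  # dense array instead of sparse matrix
])

# Route each column to a branch according to its dtype
transformer = ColumnTransformer([
    ('numerical', numerical_pipeline, make_column_selector(dtype_exclude="object")),
    ('categorical', categorical_pipeline, make_column_selector(dtype_include="object"))
])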

In [None]:
transformer.fit_transform(df)

# Exercise 2

* Modify the transformations from the previous exercise with set_params, using these settings:
    * Numerical features: change the imputation strategy to the median
    * Categorical features: change the imputation strategy to the constant value 'N'
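
As a hint for the naming scheme, nested parameters are addressed by joining the transformer name, the step name, and the parameter name with double underscores. A minimal sketch on a throwaway one-branch transformer (the `demo` object is hypothetical and exists only to list the names):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical one-branch transformer, used only to show the naming scheme.
demo = ColumnTransformer([
    ("numerical", Pipeline([("cleaner", SimpleImputer(strategy="mean"))]), [0]),
])

# Nested parameters follow the pattern <transformer>__<step>__<parameter>.
cleaner_params = sorted(
    name for name in demo.get_params() if name.startswith("numerical__cleaner__")
)
print(cleaner_params)

# set_params accepts exactly those names as keyword arguments.
demo.set_params(numerical__cleaner__strategy="median")
print(demo.get_params()["numerical__cleaner__strategy"])
```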

In [None]:
transformer.set_params(
    numerical__cleaner__strategy='median',
    categorical__cleaner__strategy='constant',
    categorical__cleaner__fill_value='N',
)

In [None]:
transformer.fit_transform(df)
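
Calling set_params only changes the configuration; the new strategy takes effect at the next fit. One way to check this, sketched on made-up data (the `toy` frame and `demo` transformer are hypothetical), is to compare the imputer's fitted `statistics_` before and after the change:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical column with an outlier, so mean and median differ clearly.
toy = pd.DataFrame({"size": [1.0, 2.0, 100.0, np.nan]})

demo = ColumnTransformer([
    ("numerical", Pipeline([("cleaner", SimpleImputer(strategy="mean"))]), ["size"]),
])

# Fit with the mean strategy and read the learned fill value.
demo.fit(toy)
mean_fill = demo.named_transformers_["numerical"].named_steps["cleaner"].statistics_[0]

# Switch to the median, refit, and read the fill value again.
demo.set_params(numerical__cleaner__strategy="median")
demo.fit(toy)
median_fill = demo.named_transformers_["numerical"].named_steps["cleaner"].statistics_[0]

print(mean_fill, median_fill)
```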