# Exercise 1

* Load **sample_dataset.csv**
* Consider only the first 3 columns and remove all records with missing values
* Calculate the mean value of "mean radius"
* Use a ColumnTransformer to apply the following transformations:
    * Binarize "mean radius" using a threshold equal to the mean value
    * Binning of "mean texture" with 10 uniform bins and one-hot encoded dense output
    * Binning of "mean perimeter" with 5 quantile bins and ordinal encoding output

In [1]:
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

In [2]:
df = pd.read_csv("../sample_dataset.csv").iloc[:,0:3].dropna()

In [3]:
df

Unnamed: 0,mean radius,mean texture,mean perimeter
1,20.57,17.77,132.90
2,19.69,21.25,130.00
3,11.42,20.38,77.58
5,12.45,15.70,82.57
7,13.71,20.83,90.20
...,...,...,...
562,15.22,30.62,103.40
563,20.92,25.09,143.00
564,21.56,22.39,142.00
566,16.60,28.08,108.30


In [4]:
m = df['mean radius'].mean()

In [5]:
m

np.float64(14.107932467532468)

In [6]:
transformer = ColumnTransformer([
    ('mean_radius_transformer', Binarizer(threshold=m), ['mean radius']),
    ('mean_texture_transformer', KBinsDiscretizer(strategy='uniform', n_bins=10, encode='onehot-dense'), ['mean texture']),
    ('mean_perimeter_transformer', KBinsDiscretizer(strategy='quantile', n_bins=5, encode='ordinal'), ['mean perimeter'])
])
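
The two binning strategies used here place their bin edges differently: `uniform` splits the value range into equal-width intervals, while `quantile` aims for roughly the same number of samples per bin. A minimal sketch of the difference, assuming a small made-up skewed array (`skewed` is hypothetical and not taken from sample_dataset.csv):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed data: seven small values and one large outlier
skewed = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])

uniform_binner = KBinsDiscretizer(strategy="uniform", n_bins=2, encode="ordinal")
quantile_binner = KBinsDiscretizer(strategy="quantile", n_bins=2, encode="ordinal")

uniform_codes = uniform_binner.fit_transform(skewed).astype(int).ravel()
quantile_codes = quantile_binner.fit_transform(skewed).astype(int).ravel()

# Number of samples that fall into each bin
uniform_counts = np.bincount(uniform_codes, minlength=2)
quantile_counts = np.bincount(quantile_codes, minlength=2)

print("uniform edges:", uniform_binner.bin_edges_[0], "counts:", uniform_counts)
print("quantile edges:", quantile_binner.bin_edges_[0], "counts:", quantile_counts)
```

With equal-width bins the outlier ends up alone in the upper bin, whereas the quantile bins split the samples evenly. Depending on the installed scikit-learn version, the quantile strategy may emit a `FutureWarning` about changing defaults.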

In [7]:
transformer.fit_transform(df)

array([[1., 0., 0., ..., 0., 0., 4.],
       [1., 0., 0., ..., 0., 0., 4.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 4.],
       [1., 0., 0., ..., 0., 0., 3.],
       [1., 0., 0., ..., 0., 0., 4.]], shape=(385, 12))
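
The 12 output columns are the concatenation of the three transformers' outputs: 1 binarized column, 10 one-hot columns, and 1 ordinal column. The same layout can be checked on a smaller scale. This is only a sketch on hypothetical toy data (the frame `toy` and its columns `a`, `b`, `c` are made up), with 2 bins instead of 10 and 5 so the result is easy to read:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical toy data, not taken from sample_dataset.csv
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

toy_transformer = ColumnTransformer([
    # 1 output column: 1 where a > mean(a), else 0
    ("a_bin", Binarizer(threshold=toy["a"].mean()), ["a"]),
    # n_bins output columns: one indicator column per bin
    ("b_bins", KBinsDiscretizer(strategy="uniform", n_bins=2, encode="onehot-dense"), ["b"]),
    # 1 output column: the integer index of the bin
    ("c_bins", KBinsDiscretizer(strategy="uniform", n_bins=2, encode="ordinal"), ["c"]),
])

toy_out = toy_transformer.fit_transform(toy)
print(toy_out.shape)  # 1 + 2 + 1 columns
print(toy_out)
```

Note that `Binarizer` maps values strictly greater than the threshold to 1, so a value exactly equal to the mean would be mapped to 0.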

# Exercise 2

* Load **sample_dataset.csv**
* Consider only the first 3 columns and remove all records with missing values
* Use a ColumnTransformer to apply the following transformations:
    * Yeo-Johnson power transform of "mean radius"
    * Box-Cox transformation of "mean texture"
    * Calculate the logarithm of "mean perimeter"

In [8]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, FunctionTransformer

In [9]:
df = pd.read_csv("../sample_dataset.csv").iloc[:,0:3].dropna()

In [10]:
transformer = ColumnTransformer([
    # Yeo-Johnson is the default method; it is spelled out here for clarity
    ('mean_radius_transformer', PowerTransformer(method='yeo-johnson'), ['mean radius']),
    ('mean_texture_transformer', PowerTransformer(method='box-cox'), ['mean texture']),
    ('mean_perimeter_transformer', FunctionTransformer(np.log), ['mean perimeter'])
])

In [11]:
transformer.fit_transform(df)

array([[ 1.63387527, -0.25875218,  4.88959697],
       [ 1.48019619,  0.54975194,  4.86753445],
       [-0.76778428,  0.36164666,  4.35130966],
       ...,
       [ 1.79501356,  0.78415652,  4.95582706],
       [ 0.84387904,  1.79038802,  4.68490515],
       [ 1.63893431,  1.98216691,  4.94235645]], shape=(385, 3))
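
One practical difference between the transformations above is their input domain: Box-Cox (like `np.log`) is only defined for strictly positive values, while Yeo-Johnson also accepts zeros and negatives. The columns used in this exercise are all positive, so this does not come up here. A minimal sketch of the behaviour, assuming two small made-up arrays (`X_pos` and `X_mixed` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical inputs, not taken from sample_dataset.csv
X_pos = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_mixed = np.array([[-1.0], [0.0], [1.0], [2.0], [3.0]])

# Yeo-Johnson fits without complaint on data containing zero and negatives
yj = PowerTransformer(method="yeo-johnson").fit(X_mixed)

# Box-Cox rejects the same data with a ValueError
try:
    PowerTransformer(method="box-cox").fit(X_mixed)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

# On strictly positive data Box-Cox works; by default the output is
# also standardized to zero mean and unit variance (standardize=True)
bc = PowerTransformer(method="box-cox").fit(X_pos)
X_bc = bc.transform(X_pos)

print("Box-Cox failed on mixed-sign data:", box_cox_failed)
print("fitted lambda:", bc.lambdas_)
print("mean after transform:", X_bc.mean())
```

The default standardization is why the first two output columns above are centred around zero, while the plain logarithm in the third column is not.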