Import necessary libraries

In [None]:
from IPython.display import display, Math, Latex

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

## Feature Transformations

### **1.Polynomial Features**

* Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

* For example, if an input sample is two-dimensional and of the form $[a, b]$, the degree-2 polynomial features are $[1, a, b, a^2, ab, b^2]$.

* `sklearn.preprocessing.PolynomialFeatures` enables us to perform polynomial transformation of desired degree.
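
The expansion described above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's implementation: the helper name `polynomial_terms` is hypothetical, and the term order (constant, linear terms, then degree-2 terms) is an assumption chosen to mirror the feature names that `PolynomialFeatures` reports.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_terms(sample, degree=2):
    """Return all monomials of the inputs with total degree <= `degree`."""
    terms = []
    for d in range(degree + 1):
        # Each combination picks d input indices (with repetition) to multiply;
        # the empty combination (d = 0) yields the constant term 1.
        for combo in combinations_with_replacement(range(len(sample)), d):
            terms.append(prod(sample[i] for i in combo))
    return terms


# For [a, b] = [2, 3] the terms are 1, a, b, a^2, ab, b^2.
print(polynomial_terms([2, 3]))  # [1, 2, 3, 4, 6, 9]
```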

Let's demonstrate it with the wine quality dataset:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')

wine_data_copy = wine_data.copy()
wine_data = wine_data.drop(['quality'] ,axis=1)

print('Number of features before transformation = ', wine_data.shape)

Number of features before transformation =  (1599, 11)


In [None]:
poly = PolynomialFeatures(degree=2)
wine_data_poly = poly.fit_transform(wine_data)
print('Number of features after transformation = ', wine_data_poly.shape)

Number of features after transformation =  (1599, 78)
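
The jump from 11 to 78 columns follows from a counting argument: the number of monomials of total degree at most $d$ in $n$ variables is $\binom{n+d}{d}$. A quick check using only the standard library (the variable names are illustrative):

```python
from math import comb

n_features, degree = 11, 2

# Monomials of total degree <= 2 in 11 variables:
# 1 bias + 11 linear + 11 squares + 55 pairwise products.
n_output_features = comb(n_features + degree, degree)
print(n_output_features)  # 78
```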


Note that after transformation, we have 78 features. Let's list out these features:

In [None]:
poly.get_feature_names_out()

array(['1', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'fixed acidity^2', 'fixed acidity volatile acidity',
       'fixed acidity citric acid', 'fixed acidity residual sugar',
       'fixed acidity chlorides', 'fixed acidity free sulfur dioxide',
       'fixed acidity total sulfur dioxide', 'fixed acidity density',
       'fixed acidity pH', 'fixed acidity sulphates',
       'fixed acidity alcohol', 'volatile acidity^2',
       'volatile acidity citric acid', 'volatile acidity residual sugar',
       'volatile acidity chlorides',
       'volatile acidity free sulfur dioxide',
       'volatile acidity total sulfur dioxide',
       'volatile acidity density', 'volatile acidity pH',
       'volatile acidity sulphates', 'volatile acidity alcohol',
       'citric acid^2', 'citric acid residual sugar',
       'citric acid chlorides', 'citric aci

Observe that:
* Some features have a `^2` suffix: these are the squares of input features. For example, `sulphates^2` is the square of the `sulphates` feature.

* Some feature names combine two original feature names: these are pairwise products. For example, `total sulfur dioxide pH` is the product of `total sulfur dioxide` and `pH`.

### **2.Discretization**

**Discretization** (otherwise known as **quantization or binning**) provides a way to partition continuous features into discrete values.


* Certain datasets with continuous features may benefit from discretization, because it transforms a dataset of continuous attributes into one with only nominal attributes.

* One-hot encoded discretized features can make a model more expressive while maintaining interpretability.

* For instance, pre-processing with a discretizer can introduce non-linearity into linear models.

`KBinsDiscretizer` discretizes features into `k` bins.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
wine_data = wine_data_copy.copy()

# discretize the 'chlorides' column with KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=10, encode='onehot')

X = np.array(wine_data['chlorides']).reshape(-1, 1)
X_binned = kbd.fit_transform(X)

In [None]:
X_binned

<1599x10 sparse matrix of type '<class 'numpy.float64'>'
	with 1599 stored elements in Compressed Sparse Row format>

In [None]:
X_binned.toarray()[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
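
To see which bin each value falls into, it can help to return bin indices instead of one-hot columns. The sketch below uses a toy column (illustrative data, not from the wine dataset) with `encode='ordinal'` and `strategy='uniform'`, so the bins are equal-width and easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column with six evenly spaced values (illustrative data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Three equal-width bins, returned as integer bin indices.
kbd_toy = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bin_ids = kbd_toy.fit_transform(X_toy).ravel()

print(kbd_toy.bin_edges_[0])  # edges of the three bins
print(bin_ids)                # bin index assigned to each value
```

With `strategy='quantile'`, the edges would instead be placed so that each bin receives roughly the same number of samples.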

### **3.Handling Categorical Features**

We need to convert categorical features into numeric features. Common approaches include:
1. Ordinal encoding

2. One hot encoding

3. Label encoding

4. MultiLabel Binarizer

5. Using dummy variables

[Iris dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) has the following features:

1. sepal length (in cm)

2. sepal width (in cm)

3. petal length (in cm)

4. petal width (in cm)

class : Iris Setosa, Iris Versicolour, Iris Virginica

In [None]:
cols = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']

iris_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None, names=cols)

iris_data.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,label
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


**1. Ordinal Encoding**

* Categorical features are those whose values are categories or groups, such as education level or state.

* These are non-numerical features and need to be converted into an appropriate form before feeding them to an ML model for training.

* An intuitive way of handling them is to assign each category a numerical value.

* As an example, take state as a feature with Punjab, Rajasthan, and Haryana as the possible values. We might consider assigning numbers to these values as follows:

    Old feature | New feature
    ------------|-------------
    Punjab      |     1
    Rajasthan   |     2
    Haryana     |     3



However, this approach imposes an artificial ordering and magnitude on the states: it suggests that Haryana is three times Punjab and Rajasthan is twice Punjab. No such relationships exist in the data, so the encoding feeds misleading information to the ML model. Ordinal encoding is appropriate only when the categories have a natural order, such as education level.

Let's demonstrate this concept with `Iris` dataset.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal = OrdinalEncoder()

iris_labels = np.array(iris_data['label'])

iris_labels_transformed = ordinal.fit_transform(iris_labels.reshape(-1, 1))
print(np.unique(iris_labels_transformed))

print()
print('First 5 labels in ordinal encoded form are : \n',
      iris_labels_transformed[:5])


[0. 1. 2.]

First 5 labels in ordinal encoded form are : 
 [[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
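
When a feature does have a natural order, the order can be passed to `OrdinalEncoder` through its `categories` parameter; otherwise the categories are sorted alphabetically. The sketch below uses a hypothetical education-level feature that is not part of the Iris data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature: the category order carries meaning.
education = np.array([['primary'], ['graduate'], ['secondary'], ['primary']])

# Pass the order explicitly so that primary < secondary < graduate.
encoder = OrdinalEncoder(categories=[['primary', 'secondary', 'graduate']])
codes = encoder.fit_transform(education)

print(codes.ravel())  # [0. 2. 1. 0.]
```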


**2. One-hot Encoding**

* This approach creates an additional feature for each label present in the categorical feature (i.e. one per distinct state here) and sets each new feature to 1 or 0 depending on the categorical feature's value. That is,


Old feature   |   New feature_1 (Punjab)  | New feature_2 (Rajasthan) | New feature_3 (Haryana)
--------------|---------------------------|---------------------------|------------------------
Punjab        |          1                |           0               |         0
Rajasthan     |          0                |           1               |         0
Haryana       |          0                |           0               |         1


* It can be implemented with the `OneHotEncoder` class from the `sklearn.preprocessing` module.


The `label` in the iris dataset is a categorical attribute.

In [None]:
iris_data.label.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

There are three class labels. Let's convert them to one hot vectors.

In [None]:
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder()

print('Shape of y before encoding : ', iris_data.label.shape)


# OneHotEncoder expects a 2D array of shape (n_samples, n_features),
# so reshape the 1D label array to (-1, 1) before encoding.
iris_labels = one_hot_encoder.fit_transform(iris_data.label.values.reshape(-1, 1))

# The reshaped input is a 150 x 1 array. The encoded output is a 150 x 3
# sparse matrix of type numpy.float64 with 150 stored elements
# in Compressed Sparse Row format.

print('Shape of y after encoding : ', iris_labels.shape)

# since output is sparse use toarray() to expand it.
print()
print('First 5 labels in one-hot vector form are : \n',iris_labels.toarray()[:5])



Shape of y before encoding :  (150,)
Shape of y after encoding :  (150, 3)

First 5 labels in one-hot vector form are : 
 [[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
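
A practical question is what happens when a category appears at prediction time that was not seen during fitting. The sketch below reuses the state example (illustrative data) and the `handle_unknown='ignore'` option, which encodes an unseen category as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array([['Punjab'], ['Rajasthan'], ['Haryana']])

# handle_unknown='ignore' maps unseen categories to an all-zero row.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(states)

print(encoder.categories_[0])  # output columns follow the sorted categories

new_states = np.array([['Haryana'], ['Goa']])  # 'Goa' was not seen during fit
encoded = encoder.transform(new_states).toarray()
print(encoded)
```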


**3. Label Encoding**

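Another option is `LabelEncoder`, which converts labels into integer codes from $0$ to $k-1$, where $k$ is the number of classes. It is intended for the target variable `y` rather than for input features.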

In [None]:
from sklearn.preprocessing import LabelEncoder

iris_labels = np.array(iris_data['label'])

label = LabelEncoder()
label_integer = label.fit_transform(iris_labels)

print('Labels in integer form are : \n', label_integer)

Labels in integer form are : 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
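
The mapping between integer codes and class names is stored on the fitted encoder. A minimal sketch, using a short hand-made label array rather than the full dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(encoder.classes_)                  # sorted unique labels; position = integer code
print(codes)                             # [2 0 1 0]
print(encoder.inverse_transform(codes))  # back to the original strings
```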


**4. MultiLabel Binarizer**

* Encodes a collection of label sets as a binary indicator matrix with one column per unique label ($k$ columns for $k$ distinct labels).

* As the name suggests, it is used when each sample can carry multiple labels: every unique label becomes a column, and an entry is 1 if that label is present for the sample and 0 otherwise.

Movie genres are a good example.

In [None]:
movie_genres = [
    {'action', 'comedy'},
    {'comedy'},
    {'action', 'thriller'},
    {'science-fiction', 'action', 'thriller'}
]

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

array([[1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 1]])
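
To read the indicator matrix, it helps to know which column corresponds to which genre. A short sketch using the fitted binarizer's `classes_` attribute and `inverse_transform`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

movie_genres = [
    {'action', 'comedy'},
    {'comedy'},
    {'action', 'thriller'},
    {'science-fiction', 'action', 'thriller'},
]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(movie_genres)

print(mlb.classes_)  # column order of the indicator matrix

# Recover the label sets (as tuples) from the indicator matrix.
recovered = mlb.inverse_transform(indicator)
print(recovered)
```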

**5. Using Dummy variables**

Use `get_dummies` to create a one-hot encoding for each unique categorical value in the `label` column.

In [None]:
iris_data_onehot = pd.get_dummies(
    iris_data, columns=['label'], prefix=['one_hot'])

iris_data_onehot.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,one_hot_Iris-setosa,one_hot_Iris-versicolor,one_hot_Iris-virginica
0,5.1,3.5,1.4,0.2,1,0,0
1,4.9,3.0,1.4,0.2,1,0,0
2,4.7,3.2,1.3,0.2,1,0,0
3,4.6,3.1,1.5,0.2,1,0,0
4,5.0,3.6,1.4,0.2,1,0,0
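
The $k$ dummy columns always sum to 1, so one of them is redundant. A small sketch on a hypothetical toy frame (reusing the state example) shows the `drop_first=True` option, which keeps $k-1$ columns; `dtype=int` is passed so the columns hold 0/1 regardless of the pandas default.

```python
import pandas as pd

# Hypothetical toy frame reusing the state example.
df = pd.DataFrame({'state': ['Punjab', 'Rajasthan', 'Haryana']})

# One column per category.
full = pd.get_dummies(df, columns=['state'], dtype=int)

# Drop the first category; it is implied when all remaining columns are 0.
reduced = pd.get_dummies(df, columns=['state'], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```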


### **4.Custom Transformers**

Enables conversion of an existing Python function into a transformer to assist in data cleaning or processing.

Useful when:
1. The dataset consists of *heterogeneous data types* (e.g. raster images and text captions)

2. The dataset is stored in a `pandas.DataFrame` and different columns require *different processing pipelines.*

3. We need stateless transformations such as taking the log of frequencies, custom scaling, etc.

We can implement a transformer from an arbitrary function with `FunctionTransformer`.

In [None]:
from sklearn.preprocessing import FunctionTransformer

For example, let us build a transformer that applies a log transformation to features.

For this demonstration, we will use the [wine quality dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) from the UCI machine learning repository.

It has the following attributes:


1. fixed acidity

2. volatile acidity

3. citric acid

4. residual sugar

5. chlorides

6. free sulfur dioxide

7. total sulfur dioxide

8. density

9. pH

10. sulphates

11. alcohol

12. quality (output: score between 0 and 10)

In [None]:
wine_data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')

wine_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fixed acidity,1599.0,8.319637,1.741096,4.6,7.1,7.9,9.2,15.9
volatile acidity,1599.0,0.527821,0.17906,0.12,0.39,0.52,0.64,1.58
citric acid,1599.0,0.270976,0.194801,0.0,0.09,0.26,0.42,1.0
residual sugar,1599.0,2.538806,1.409928,0.9,1.9,2.2,2.6,15.5
chlorides,1599.0,0.087467,0.047065,0.012,0.07,0.079,0.09,0.611
free sulfur dioxide,1599.0,15.874922,10.460157,1.0,7.0,14.0,21.0,72.0
total sulfur dioxide,1599.0,46.467792,32.895324,6.0,22.0,38.0,62.0,289.0
density,1599.0,0.996747,0.001887,0.99007,0.9956,0.99675,0.997835,1.00369
pH,1599.0,3.311113,0.154386,2.74,3.21,3.31,3.4,4.01
sulphates,1599.0,0.658149,0.169507,0.33,0.55,0.62,0.73,2.0


Let's use `np.log1p`, which returns the natural logarithm of (1 + the feature value).


In [None]:
transformer = FunctionTransformer(np.log1p, validate=True)

wine_data_transformed = transformer.transform(np.array(wine_data))
pd.DataFrame(wine_data_transformed, columns=wine_data.columns).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fixed acidity,1599.0,2.215842,0.1781,1.722767,2.091864,2.186051,2.322388,2.827314
volatile acidity,1599.0,0.417173,0.114926,0.113329,0.329304,0.41871,0.494696,0.947789
citric acid,1599.0,0.228147,0.152423,0.0,0.086178,0.231112,0.350657,0.693147
residual sugar,1599.0,1.218131,0.269969,0.641854,1.064711,1.163151,1.280934,2.80336
chlorides,1599.0,0.083038,0.038991,0.011929,0.067659,0.076035,0.086178,0.476855
free sulfur dioxide,1599.0,2.639013,0.62379,0.693147,2.079442,2.70805,3.091042,4.290459
total sulfur dioxide,1599.0,3.63475,0.682575,1.94591,3.135494,3.663562,4.143135,5.669881
density,1599.0,0.691519,0.000945,0.68817,0.690945,0.691521,0.692064,0.69499
pH,1599.0,1.460557,0.03576,1.319086,1.437463,1.460938,1.481605,1.611436
sulphates,1599.0,0.501073,0.093731,0.285179,0.438255,0.482426,0.548121,1.098612


Simple examples:

In [None]:
transformer = FunctionTransformer(np.log1p)

X = np.array([[0, 9], [7, 8]])
transformer.transform(X)

array([[0.        , 2.30258509],
       [2.07944154, 2.19722458]])

In [None]:
transformer = FunctionTransformer(np.exp2)

X = np.array([[1,3], [2,4]])
transformer.transform(X)


array([[ 2.,  8.],
       [ 4., 16.]])
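
`FunctionTransformer` also accepts an `inverse_func`, which lets the transformation be undone later. A minimal sketch pairing `np.log1p` with its inverse `np.expm1`:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.expm1 undoes np.log1p, so the pair forms an invertible transformer.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0.0, 9.0], [7.0, 8.0]])
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

print(np.allclose(X, X_back))  # True
```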

### **5.Composite Transformers**

* `ColumnTransformer` applies a set of transformers to columns of an array or `pandas.DataFrame` and concatenates the transformed outputs into a single matrix.


**5.A. Apply Transformation to diverse features**

* `ColumnTransformer` is useful for transforming heterogeneous data by applying different transformers to separate subsets of features.

* It combines different feature selection mechanisms and transformations into a single transformer object.

* It is configured with a list of tuples.

* Each tuple gives, in order, a reference name, the transformer to apply, and the column(s) to which it is applied.

In [None]:
X = [
    [20.0,'male'],
    [11.2,'female'],
    [15.6,'female'],
    [13.0,'male'],
    [18.6, 'male'],
    [16.4,'female']
]

X = np.array(X)
print(X)

[['20.0' 'male']
 ['11.2' 'female']
 ['15.6' 'female']
 ['13.0' 'male']
 ['18.6' 'male']
 ['16.4' 'female']]


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

col_trans = ColumnTransformer([
    ('scaler', MaxAbsScaler(), [0]),
    ('pass', 'passthrough', [0]),
    ('encoder', OneHotEncoder(), [1])
])

col_trans.fit_transform(X)

array([['1.0', '20.0', '0.0', '1.0'],
       ['0.5599999999999999', '11.2', '1.0', '0.0'],
       ['0.78', '15.6', '1.0', '0.0'],
       ['0.65', '13.0', '0.0', '1.0'],
       ['0.93', '18.6', '0.0', '1.0'],
       ['0.82', '16.4', '1.0', '0.0']], dtype='<U32')
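
Because the mixed-type NumPy array above stores every value as a string, the output is a string array as well (`dtype='<U32'`), and recent scikit-learn releases may reject string input for numeric scalers. A sketch of the same idea with a `pandas.DataFrame`, which keeps a numeric dtype for the first column, is shown below; the column names `value` and `sex` are assumptions, since the original array has no column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Column names 'value' and 'sex' are illustrative; the original array has none.
df = pd.DataFrame({
    'value': [20.0, 11.2, 15.6, 13.0, 18.6, 16.4],
    'sex': ['male', 'female', 'female', 'male', 'male', 'female'],
})

# Columns are selected by name instead of by position.
col_trans_df = ColumnTransformer([
    ('scaler', MaxAbsScaler(), ['value']),
    ('encoder', OneHotEncoder(), ['sex']),
])

out = col_trans_df.fit_transform(df)
print(out.dtype)  # numeric output instead of strings
print(out[:2])    # scaled value, then one-hot columns for female and male
```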

**5.B. TransformedTargetRegressor**

Transforms the target variable `y` before fitting a regression model.

* The predicted values are mapped back to the original space via an inverse transform.

* It takes a **regressor** and a **transformer** as arguments; the transformer is applied to the target variable before the regressor is fitted.

In [None]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)

# select a subset of data
X, y = X[:2000, :], y[:2000]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# transformer to scale the data
transformer = MinMaxScaler()

# first regressor - based on the original labels.
regressor = LinearRegression()

# second regressor - based on transformed labels.
ttr = TransformedTargetRegressor(regressor=regressor, transformer=transformer)

regressor.fit(X_train, y_train)
print('R2 score of raw_label regression: {0:.4f}'.format(
    regressor.score(X_test, y_test)))


ttr.fit(X_train, y_train)
print('R2 score of transformed label regression: {0:.4f}'.format(
    ttr.score(X_test, y_test)))

R2 score of raw_label regression: 0.5853
R2 score of transformed label regression: 0.5853
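
The two scores match because min-max scaling is an affine map of `y`: ordinary least squares predictions are rescaled in the same way, and the inverse transform restores them exactly. A non-linear target transformation can change the fit. The sketch below illustrates this on synthetic data (an assumption made for illustration, not the housing data), using the `func` and `inverse_func` arguments of `TransformedTargetRegressor`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative): y grows exponentially in x, so log(y) is linear in x.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 2, size=(200, 1))
y_syn = np.exp(1 + 2 * X_syn[:, 0])

# Linear regression on the raw target.
plain = LinearRegression().fit(X_syn, y_syn)

# Linear regression on log(y); predictions are mapped back with exp.
log_ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X_syn, y_syn)

print('R2, raw target: {0:.4f}'.format(plain.score(X_syn, y_syn)))
print('R2, log target: {0:.4f}'.format(log_ttr.score(X_syn, y_syn)))
```

Here the log-transformed target is exactly linear in the feature, so the second model fits almost perfectly while the first cannot.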
