# Data Preprocessing Techniques
* Data cleaning
    * Data Imputation
    * Feature Scaling
* Feature transformations
    * Polynomial Features
    * Discretization
    * Handling categorical variables
    * Custom Transformers
    * Composite Transformers
        * Apply composite feature to diverse features
        * TransformedTargetRegressor
    * Feature Selection
        * Filter based methods
        * Wrapper based Methods
    * Feature extraction
        * PCA

These transformations are applied in a specific order and the order can be specified via Pipeline. 

## Importing basic libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")

## 1. Feature Extraction

### DictVectorizer
`DictVectorizer` converts a list of dictionary objects into a feature matrix.

In [None]:
data = [{'age' : 4, 'height' : 96.0},
        {'age' : 1, 'height' : 73.9},
        {'age' : 3, 'height' : 88.9},    
        {'age' : 2, 'height' : 81.6}]

In [None]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data)
data_transformed, data_transformed.shape

## 2. Data Imputation

* Many ML algorithms need a complete feature matrix, with no missing entries.
* Data imputation identifies missing values in each feature and replaces them using a strategy such as
    * Mean/median/mode of the feature.
    * A user-specified constant value.

The sklearn library provides the `sklearn.impute.SimpleImputer` class for this purpose.

`add_indicator` is a boolean parameter; when set to `True`, the output also contains missing-value indicator columns.
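
Before moving to real data, here is a minimal sketch of both ideas on a made-up array (`toy` is purely illustrative): mean imputation, and median imputation with `add_indicator=True`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [5.0, 30.0]])

# Replace each NaN with the mean of its column
mean_imputer = SimpleImputer(strategy='mean')
toy_mean = mean_imputer.fit_transform(toy)

# Median imputation plus one 0/1 indicator column per column that had NaNs
flagged_imputer = SimpleImputer(strategy='median', add_indicator=True)
toy_flagged = flagged_imputer.fit_transform(toy)

print(toy_mean)
print(toy_flagged)
```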

In [None]:
#Let's get some real world data!!

cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg' , 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
heart_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', header=None, names=cols)

### **STEP 1:** Check if dataset contains missing values.

In [None]:
heart_data.info() # for numerical values. For str, check unique values etc

In [None]:
(heart_data.isnull().sum()) #Checking for null values

In [None]:
print("Unique values in ca: ", heart_data.ca.unique())
print("Unique values in thal: ", heart_data.thal.unique())

`?` marks the missing values in these columns. Let's count them.

In [None]:
print("# Missing values in ca:", heart_data.loc[heart_data.ca == '?', 'ca'].count())
print("# Missing values in thal:", heart_data.loc[heart_data.thal == '?', 'thal'].count())

### **STEP 2:** Replace '?' with `nan`

In [None]:
heart_data.replace('?', np.nan, inplace=True)

### **STEP 3:** Fill the missing values with `sklearn` imputation utilities.

Here we use `SimpleImputer` with `mean` strategy.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(heart_data)
heart_data_imputed = imputer.transform(heart_data)
print(heart_data_imputed.shape)

`add_indicator=True` adds one indicator column for each column that contains missing values.

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean', add_indicator=True)
imputer = imputer.fit(heart_data)
heart_data_imputed_with_indicator = imputer.transform(heart_data)
print(heart_data_imputed_with_indicator.shape)

## 3. Feature Scaling
Feature scaling transforms feature values so that all features are on the same scale.
* It enables faster convergence in iterative optimization.
* Algorithms that rely on Euclidean distances computed over the features also suffer without it.
* Tree-based ML algorithms are not affected by feature scaling, so scaling is not required for them.

Feature scaling can be performed with the following methods:
* Standardization
* Normalisation
* MaxAbsScaler

Data Set:  [Abalone dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data)

In [None]:
cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
abalone_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header = None, names=cols)

### **STEP 1:** Examine the dataset
Feature scaling is performed only on numerical attributes. Let's check which attributes in the dataset are numerical.

In [None]:
abalone_data.info()

### **STEP 1a:**  Convert non-numerical attributes to numerical ones

In [None]:
abalone_data.Sex.unique()

In [None]:
# Assign numerical values to sex.
abalone_data = abalone_data.replace({"Sex": {"M":1, "F":2, "I":3}})
abalone_data.info()

### **STEP 2:** Separate features from labels


In [None]:
y = abalone_data.pop('Rings')
print("The Dataframe object after deleting label")
abalone_data.info()

### **STEP 3:** Examine feature scales

#### Statistical method

In [None]:
abalone_data.describe().T

#### Visualisation of feature distributions

* Histograms
* Kernel Density Estimation(KDE) Plot
* Box
* Violin

##### Separate histograms

In [None]:
fig, axs = plt.subplots(nrows=4, ncols=2)
fig.suptitle("Individual histograms")
fig.set_size_inches(18.5, 10.5)

# Draw one histogram per feature on the axes created above,
# instead of stacking new subplots on top of them
for ax, column in zip(axs.flat, abalone_data.columns):
    ax.hist(abalone_data[column])
    ax.set_title(column)
    ax.set_xlabel('Range')
    ax.set_ylabel('Frequency')

fig.tight_layout()

##### Histograms together

In [None]:
#Histograms collated
fig = plt.figure()
fig.set_size_inches(10, 6)

for feature in abalone_data.columns:
    plt.hist(abalone_data[feature], alpha=0.5, label=feature)

plt.legend()
plt.title('Distribution of features across samples')
plt.xlabel('Range')
plt.ylabel('Frequency')

##### KDE Plot

In [None]:
ax = abalone_data.plot.kde()

##### Box Plot

In [None]:
box = abalone_data.plot.box(vert=False)

##### Violin Plot

In [None]:
sns.set(style = 'whitegrid')
sns.violinplot(data=abalone_data, orient="h", scale="width")

### **STEP 4:** Scaling

* Normalisation
    * `MaxAbsScaler` transforms each feature into the range [-1, 1]
        * x_new = x_old / MaxAbsoluteValue
        * MaxAbsoluteValue = max(|x_max|, |x_min|)
    * `MinMaxScaler` transforms each feature into the range [0, 1]
        * x_new = (x_old - x_min)/(x_max - x_min)

* Standardisation
    * `StandardScaler` rescales each feature to zero mean and unit variance
        * x_new = (x_old - mu)/sigma
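
The three formulas above can be checked by hand against the sklearn scalers. This is a small sketch on a made-up vector `v`; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

v = np.array([4.0, 2.0, 5.0, -2.0, -100.0]).reshape(-1, 1)

# Manual versions of the formulas listed above
max_abs_manual = v / np.abs(v).max()
min_max_manual = (v - v.min()) / (v.max() - v.min())
standard_manual = (v - v.mean()) / v.std()  # population std, ddof=0

# The sklearn scalers should give the same numbers
print(np.allclose(max_abs_manual, MaxAbsScaler().fit_transform(v)))
print(np.allclose(min_max_manual, MinMaxScaler().fit_transform(v)))
print(np.allclose(standard_manual, StandardScaler().fit_transform(v)))
```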

In [None]:
x = np.array([4, 2, 5, -2, -100]).reshape(-1, 1)
print(x)

In [None]:
from sklearn.preprocessing import MaxAbsScaler

mas = MaxAbsScaler()
x_new = mas.fit_transform(x)
print(x_new)

In [None]:
from sklearn.preprocessing import MinMaxScaler
X = abalone_data
mm = MinMaxScaler()
X_normalised = mm.fit_transform(X)
X_normalised[:5]

In [None]:
X_normalised.mean(axis=0)

In [None]:
X_normalised.std(axis=0)

#### Histogram of transformed features

In [None]:
fig = plt.figure()
fig.set_size_inches(10, 6)
cols = abalone_data.columns
df = pd.DataFrame(X_normalised, columns=cols)

for feature in df.columns:
    plt.hist(df[feature], alpha=0.5, label=feature)

plt.legend()
plt.title('Distribution of features across samples')
plt.xlabel('Range')
plt.ylabel('Frequency')

#### Box Plot of transformed features

In [None]:
box = df.plot.box(vert=False)

##### Violin Plot of transformed features

In [None]:
sns.set(style = 'whitegrid')
sns.violinplot(data= df, orient="h", scale="width")

##### KDE Plot of transformed features

In [None]:
ax = df.plot.kde()

### Standardisation

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_standardised = ss.fit_transform(X)
X_standardised.mean(axis=0), X_standardised.std(axis=0)

#### Histogram of standardised features

In [None]:
fig = plt.figure()
fig.set_size_inches(10, 6)
cols = abalone_data.columns
df = pd.DataFrame(X_standardised, columns=cols)

for feature in df.columns:
    plt.hist(df[feature], alpha=0.5, label=feature)

plt.legend()
plt.title('Distribution of features across samples')
plt.xlabel('Range')
plt.ylabel('Frequency')

#### Box Plot of standardised features

In [None]:
box = df.plot.box(vert=False)

##### Violin Plot of standardised features

In [None]:
sns.set(style = 'whitegrid')
sns.violinplot(data= df, orient="h", scale="width")

##### KDE Plot of standardised features

In [None]:
ax = df.plot.kde()

## 4. `add_dummy_feature`
Augments dataset with a column vector of ones

In [None]:
x = np.array(
        [[7,1],
        [1, 8],
        [2, 0],
        [9, 6]])

from sklearn.preprocessing import add_dummy_feature

x_new = add_dummy_feature(x)
print(x_new)
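
One common reason to add a column of ones is to fold the bias term of a linear model into the weight vector. The sketch below assumes a hypothetical weight vector `weights = [bias, w1, w2]`; the numbers are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

features = np.array([[7, 1],
                     [1, 8]])

# The dummy column of ones is prepended as the first column
augmented = add_dummy_feature(features)

# Hypothetical weights: the first entry plays the role of the bias
weights = np.array([0.5, 2.0, -1.0])
predictions = augmented @ weights
print(augmented)
print(predictions)
```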

## 5. Custom transformers

Enables conversion of an existing Python function into a transformer to assist in data cleaning or preprocessing

Useful when:
1. The dataset consists of heterogeneous data types
2. Different columns require different transformations
3. We need stateless transformations such as taking the log of frequencies, custom scaling etc.


Dataset: [Wine Quality dataset from UCI ML repository](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv)

In [None]:
wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";") 
wine_data.describe().T

Let's use `np.log1p`, which returns the natural logarithm of (1 + the feature value).

In [None]:
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p, validate=True)
wine_data_transformed = transformer.fit_transform(np.array(wine_data))
pd.DataFrame(wine_data_transformed, columns=wine_data.columns).describe().T
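
`FunctionTransformer` also accepts an `inverse_func`, which makes the transformation reversible. A minimal sketch on made-up values, pairing `np.log1p` with its inverse `np.expm1`:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

values = np.array([[0.0, 9.0],
                   [99.0, 999.0]])

# log1p going forward, expm1 coming back
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
logged = log_tf.fit_transform(values)
restored = log_tf.inverse_transform(logged)

print(logged)
print(restored)
```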

## 6. Polynomial Features

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree

* `sklearn.preprocessing.PolynomialFeatures`

In [None]:
from sklearn.preprocessing import PolynomialFeatures
wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";") 
wine_data_copy = wine_data.copy()
wine_data = wine_data.drop(['quality'], axis=1)
print('Shape of feature matrix before transformation = ', wine_data.shape)

# Let's apply a polynomial transform of degree 2 to wine_data
poly = PolynomialFeatures(degree=2)
poly_wine_data = poly.fit_transform(wine_data)
print("Shape of feature matrix after transformation = ", poly_wine_data.shape)

In [None]:
poly.get_feature_names_out()

## 7. Discretization

Discretization/quantization/binning provides a way to partition continuous features into discrete values.

* Certain datasets with continuous features may benefit from discretisation, because it transforms continuous attributes to nominal attributes.
* One-hot encoded discretised features can make a model more expressive, while maintaining interpretability.
* For instance, pre-processing with a discretizer can introduce non-linearity to linear models.

In [None]:
#KBinsDiscretizer discretizes features into k bins
from sklearn.preprocessing import KBinsDiscretizer

wine_data = wine_data_copy.copy()

# Transform the dataset with KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=10, encode="onehot")
X = np.array(wine_data['chlorides']).reshape(-1, 1)
X_binned = enc.fit_transform(X)
X_binned

In [None]:
X_binned.toarray()[:5]
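
The one-hot output above hides which bin each value falls into. The sketch below uses made-up values with `encode='ordinal'` and `strategy='uniform'` (equal-width bins, chosen here only to keep the edges easy to read) and prints the bin edges alongside the bin codes.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

values = np.arange(10, dtype=float).reshape(-1, 1)  # 0.0, 1.0, ..., 9.0

# Three equal-width bins, returned as integer bin codes
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
codes = binner.fit_transform(values)

print(binner.bin_edges_[0])  # edges of the bins for the single feature
print(codes.ravel())
```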

## 8. Handling Categorical Features

The following methods can be used to convert the categorical features into numeric features

1. Ordinal encoding
2. One-hot encoding
3. Label encoding
4. Using dummy variable


### Ordinal encoding
Assigns a unique numerical value to each unique category of a non-numerical feature. However, it introduces a numerical ordering that may or may not exist between the categories.
* Implemented using `OrdinalEncoder` class from `sklearn.preprocessing` module

### One-hot encoding
This approach consists of creating an additional feature for each label present in the categorical feature and putting 1 or 0 for these new features depending on the categorical feature's value.
* Implemented using `OneHotEncoder` class from `sklearn.preprocessing` module

Dataset: [Iris](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
cols = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
iris_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None, names=cols)
iris_data.head()

The `label` column is a categorical attribute.

In [None]:
iris_data.label.unique()

Let's convert them to One-hot vectors

In [None]:
onehotencoder = OneHotEncoder(categories='auto')
print(" Shape of y before encoding", iris_data.label.shape)

# OneHotEncoder expects a 2D input, so passing a 1D array raises an error.
# Hence reshape the labels to (-1, 1) to give them two dimensions.
iris_labels = onehotencoder.fit_transform(iris_data.label.values.reshape(-1, 1))

print('Shape of y after encoding = ',iris_labels.shape)

print("First five labels:")
print(iris_labels.toarray()[:5])

Let us observe the difference between one hot encoding and ordinal encoding

In [None]:
enc = OrdinalEncoder()
iris_labels = np.array(iris_data['label'])

iris_labels_transformed = enc.fit_transform(iris_labels.reshape(-1, 1))
print("Unique labels: ", np.unique(iris_labels_transformed))

print("\nFirst 5 labels:")
print(iris_labels_transformed[:5])
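
When the categories do have a natural order, `OrdinalEncoder` can be told that order through its `categories` parameter instead of relying on the default sorted order. A minimal sketch with made-up size labels:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# Explicit order: small < medium < large
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = size_encoder.fit_transform(sizes)
print(size_codes.ravel())

# Without `categories`, the order would be alphabetical: large, medium, small
default_codes = OrdinalEncoder().fit_transform(sizes)
print(default_codes.ravel())
```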

### LabelEncoder

Another option is to use `LabelEncoder`, which encodes target labels as integer codes between $0$ and $k-1$, where $k$ is the number of classes.

In [None]:
from sklearn.preprocessing import LabelEncoder

iris_labels = np.array(iris_data['label'])

enc = LabelEncoder()
label_integer = enc.fit_transform(iris_labels)
label_integer

### MultiLabelBinarizer

Converts a collection of label sets into a binary indicator matrix with one column per class; an entry is 1 when the sample carries that label.

In [None]:
movie_genres = [{'action', 'comedy'},
                {'comedy'},
                {'action', 'thriller'},
                {'science-fiction', 'action', 'thriller'} ]
                

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

### Using dummy variables

In [None]:
# use get_dummies to create a one-hot encoding for each unique categorical value in the 'label' column
# convert the categorical label variable to one-hot encoding:
iris_data_onehot = pd.get_dummies(iris_data, columns=['label'], prefix=['one_hot'])
iris_data_onehot

## 9. Composite Transformers

### ColumnTransformer

It applies a set of transformers to columns of an array or `pandas.DataFrame` and concatenates the transformed outputs from the different transformers into a single matrix.
* It is useful for transforming heterogeneous data by applying different transformers to separate subsets of features.
* It combines different feature selection mechanisms and transformations into a single transformer object.

In [None]:
x = [
    [20.0, 'male',],
    [11.2, 'female',],
    [15.6, 'female',],
    [13.0, 'male',],
    [18.6, 'male',],
    [16.4, 'female',],
]

x = np.array(x)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

ct = ColumnTransformer([('scaler', MaxAbsScaler(), [0]),
                        ('pass', 'passthrough', [0]),
                        ('encoder', OneHotEncoder(), [1])])
ct.fit_transform(x)
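
With a `pandas.DataFrame`, columns can be selected by name rather than by position. The sketch below uses a made-up frame `people`; `sparse_threshold=0` is set here only to force a dense output that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

people = pd.DataFrame({'height': [20.0, 11.2, 15.6],
                       'sex': ['male', 'female', 'female']})

# Scale the numeric column and one-hot encode the categorical one
ct_demo = ColumnTransformer([('scaler', MaxAbsScaler(), ['height']),
                             ('encoder', OneHotEncoder(), ['sex'])],
                            sparse_threshold=0)
people_transformed = ct_demo.fit_transform(people)
print(people_transformed)
```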

### TransformedTargetRegressor

Transforms the target variable y before fitting a regression model.
* The predicted values are mapped back to the original space via an inverse transform.
* It takes the regressor and the transformer to be applied to the target as arguments.

In [None]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
X, y = X[:2000,:], y[:2000] #select a subset of data

transformer = MaxAbsScaler()

regressor = LinearRegression()

regr = TransformedTargetRegressor(regressor=regressor, transformer=transformer)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train)

print("R2 score of transformed label regression: {0:.2f}".format(regr.score(X_test, y_test)))

raw_target_regr = LinearRegression().fit(X_train, y_train)
print("R2 score of raw label regression: {0:.2f}".format(raw_target_regr.score(X_test, y_test)))
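
A linear rescaling of the target, as with `MaxAbsScaler`, is not expected to change the fit of an ordinary least squares model, so the two scores above should be essentially identical. The benefit shows up with non-linear target transformations. The sketch below uses synthetic data, constructed so that `log1p(y)` is exactly linear in the feature, and passes `func`/`inverse_func` instead of a transformer object.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: log1p(y) = 2 * x + 1, so y grows exponentially in x
X_demo = np.linspace(0, 1, 50).reshape(-1, 1)
y_demo = np.expm1(2.0 * X_demo.ravel() + 1.0)

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log1p, inverse_func=np.expm1)
ttr.fit(X_demo, y_demo)

# Predictions are mapped back to the original target space automatically
pred = ttr.predict(X_demo)
print(ttr.regressor_.coef_, ttr.regressor_.intercept_)

# A plain linear fit on the raw target cannot capture the curvature
raw_score = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
print(ttr.score(X_demo, y_demo), raw_score)
```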

## 10. Feature Selection

The `sklearn.feature_selection` module implements the feature selection methods used below.

### Filter based methods

#### VarianceThreshold

This transformer removes low-variance features, keeping only those whose variance exceeds the given threshold.


In [None]:
data = [{'age': 4, 'height': 96.0},
        {'age': 1, 'height': 73.9},
        {'age': 3, 'height': 88.9},
        {'age': 2, 'height': 81.6}]

dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data)
np.var(data_transformed, axis = 0)

In [None]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=9)
data_new = vt.fit_transform(data_transformed)
data_new

#### SelectKBest
It selects $k$ highest scoring features based on a function and removes the rest of the features.

Dataset: California Housing (loaded with `fetch_california_housing`)

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_regression

#Download data
X_cal, y_cal = fetch_california_housing(return_X_y=True)

#Select a subset of data

X, y = X_cal[:2000,:], y_cal[:2000]

print(f'Shape of feature matrix before feature selection:{X.shape}')

Let's select the 3 most important features. Because this is a regression problem, the scoring function should be a regression one such as `mutual_info_regression` or `f_regression`.

In [None]:
skb = SelectKBest(mutual_info_regression, k = 3)
X_new = skb.fit_transform(X, y)

print(f'Shape of feature matrix after feature selection:{X_new.shape}')

In [None]:
skb.get_feature_names_out()
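
To make the selection easier to verify, the sketch below builds synthetic data in which only features 0 and 3 drive the target, then uses `f_regression` as the scoring function. The data and the coefficient values are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
# Only columns 0 and 3 influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=500)

selector_demo = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = np.flatnonzero(selector_demo.get_support())

print(selector_demo.scores_)  # one F-statistic per feature
print(selected)               # indices of the k best features
```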

#### SelectPercentile

This is very similar to `SelectKBest`; the only difference is that it selects the top `percentile` percent of all features and drops the rest. It uses a scoring function, like `SelectKBest`.

In [None]:
from sklearn.feature_selection import SelectPercentile
sp = SelectPercentile(mutual_info_regression, percentile = 30)
X_new = sp.fit_transform(X, y)
print(f'Shape of feature matrix after feature selection:{X_new.shape}')

In [None]:
sp.get_feature_names_out()

#### GenericUnivariateSelect

It applies univariate feature selection with a strategy that is passed to the API via the `mode` parameter. `mode` can take the following values: `percentile`, `k_best`, `fpr` (false positive rate), `fdr` (false discovery rate), `fwe` (family-wise error rate). To obtain the same result as `SelectKBest` with `k = 3`, use `mode = 'k_best'` with `param = 3`:

In [None]:
from sklearn.feature_selection import GenericUnivariateSelect
gus = GenericUnivariateSelect(mutual_info_regression, mode = 'k_best', param = 3)
X_new = gus.fit_transform(X, y)
print(f'Shape of feature matrix before feature selection:{X.shape}')
print(f'Shape of feature matrix after feature selection:{X_new.shape}')

### Wrapper based Methods

#### RFE (Recursive Feature Elimination)
**STEP 1** : Fit a model.

**STEP 2** : Rank the features, then remove one or more of the lowest-ranked features, depending on the `step` parameter.

**STEP 3** : Repeat until the desired number of features is reached.

In [None]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=3, step=1)
selector = selector.fit(X, y)

#support_ attribute is a boolean array
#marking which features are selected
print(selector.support_)

# rank of each feature
# selected features have rank 1;
# eliminated features get rank 2 and above (the earlier the elimination, the higher the rank).
print(f'Rank of each feature is: {selector.ranking_}')


In [None]:
X_new = selector.transform(X)
print(f'Shape of feature matrix after feature selection:{X_new.shape}')
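
The three RFE steps can also be written out by hand. The sketch below is a simplified illustration on synthetic data, not sklearn's implementation: it repeatedly fits a `LinearRegression`, drops the feature with the smallest absolute coefficient, and then compares the result with `RFE`.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 2.0, 0.0])  # made-up coefficients
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Manual elimination: one feature per round (step=1)
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 3:
    model = LinearRegression().fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(model.coef_)))  # least important feature
    del remaining[weakest]

manual_mask = np.zeros(X_demo.shape[1], dtype=bool)
manual_mask[remaining] = True

rfe = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X_demo, y_demo)
print(manual_mask)
print(rfe.support_)
```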

#### SelectFromModel

Selects features whose importance, as obtained from the trained estimator, is above a certain threshold; `max_features` caps how many are kept.

* The feature importance is obtained via `coef_`, `feature_importances_` or an `importance_getter` callable from the trained estimator.
* The feature importance threshold can be specified either numerically or through a string argument based on built-in heuristics such as `mean`, `median` and float multiples of these like `0.1*mean`.

In [None]:
from sklearn.feature_selection import SelectFromModel

estimator = LinearRegression()
estimator.fit(X, y)

print(f'Coefficients of features: {estimator.coef_}')

# Rank features by the absolute value of their coefficients
top_indices = np.argsort(np.abs(estimator.coef_))[-3:]
print(f'Indices of top {3} features: {top_indices}')

model = SelectFromModel(estimator, max_features=3, prefit=True)
X_new = model.transform(X)
print(f'Shape of feature matrix after feature selection:{X_new.shape}')
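
The effect of the `threshold` argument is easier to see on a noise-free synthetic problem where the coefficients are known. In this sketch the made-up coefficients are 5, 0.1 and -3, so the mean absolute importance is 2.7. Setting `threshold=-np.inf` disables the threshold so that `max_features` alone decides.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] - 3.0 * X_demo[:, 2]

# Keep features with |coef| >= mean(|coef|)
mean_selector = SelectFromModel(LinearRegression(), threshold='mean').fit(X_demo, y_demo)

# A stricter, scaled threshold
strict_selector = SelectFromModel(LinearRegression(), threshold='1.5*mean').fit(X_demo, y_demo)

# No threshold: simply keep the 2 most important features
top2_selector = SelectFromModel(LinearRegression(), threshold=-np.inf,
                                max_features=2).fit(X_demo, y_demo)

print(mean_selector.get_support())
print(strict_selector.get_support())
print(top2_selector.get_support())
```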


#### SequentialFeatureSelector

It performs feature selection by selecting or deselecting features one by one in a greedy manner

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

In [None]:
%%time

estimator = LinearRegression()

sfs = SequentialFeatureSelector(estimator, n_features_to_select=3)
sfs.fit_transform(X, y)
print(sfs.get_support())

The features corresponding to `True` in the output of `sfs.get_support()` are selected.

In [None]:
%%time

estimator = LinearRegression()
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='backward')
sfs.fit_transform(X, y)
print(sfs.get_support())

In [None]:
%%time

estimator = LinearRegression()
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='forward')
sfs.fit_transform(X, y)
print(sfs.get_support())

## 11. PCA

In [None]:
from sklearn.decomposition import PCA
pca  = PCA(n_components=2)
pca.fit(X)
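
After fitting, `explained_variance_ratio_` reports how much of the variance each component captures. A minimal sketch on synthetic data whose third column is the sum of the first two, so two components should capture essentially all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Third feature is redundant: it is the sum of the first two
X_demo = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])

pca_demo = PCA(n_components=2)
X_reduced = pca_demo.fit_transform(X_demo)
explained = pca_demo.explained_variance_ratio_

print(X_reduced.shape)
print(explained, explained.sum())
```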

## 12. Chaining Transformers

The preprocessing transformations are applied one after another on the input feature matrix. It is important to apply exactly the same transformations, in the same order, to the training, evaluation and test sets.

The `sklearn.pipeline` module provides utilities to build a composite estimator as a chain of transformers and estimators.

### Pipeline

Sequentially applies a list of transformers and a final estimator.
* Intermediate steps of the pipeline must be 'transformers', i.e. they must implement `fit` and `transform` methods.
* The final estimator only needs to implement `fit`.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

#### Creating a pipeline
A pipeline can be created with `Pipeline()`. It takes a list of `('estimatorName', estimator(..))` tuples. The pipeline object exposes the interface of the last step.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

estimators = [
    ('simpleImputer', SimpleImputer()),
    ('standardscaler', StandardScaler())
]

pipe = Pipeline(steps=estimators)
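
A quick sketch of this pipeline in use on a made-up array with one missing value. `named_steps` gives access to the fitted steps by name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [3.0, 6.0]])

pipe_demo = Pipeline(steps=[('simpleImputer', SimpleImputer()),
                            ('standardscaler', StandardScaler())])

# Imputation runs first, then scaling is applied to the imputed data
X_out = pipe_demo.fit_transform(X_demo)

# Column means learned by the imputer step
imputer_means = pipe_demo.named_steps['simpleImputer'].statistics_
print(imputer_means)
print(X_out)
```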

The same pipeline can also be created via the `make_pipeline()` helper function, which doesn't take names for the steps and instead assigns them generic names based on their class names (lowercased).

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(SimpleImputer(), 
                     StandardScaler())

#### Accessing individual steps in a pipeline

In [None]:
from sklearn.decomposition import PCA

estimators = [
            ('simpleImputer', SimpleImputer()),
            ('pca', PCA()),
            ('regressor', LinearRegression())
]

pipe = Pipeline(steps=estimators)

In [None]:
# Let's print number of steps in this pipeline
print(len(pipe.steps))

In [None]:
#Let's look at each step
print(pipe.steps)

In [None]:
# Setting a parameter of a step with the <step name>__<parameter name> convention

pipe.set_params(pca__n_components = 2)

### GridSearch with pipeline

Grid search over a pipeline uses the `<step name>__<parameter name>` naming convention for nested parameters.

In [None]:
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

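# Pipeline whose step names ('imputer', 'clf') match the keys of the grid below;
# `clf_pipe` is a name introduced here for this example
clf_pipe = Pipeline(steps=[('imputer', SimpleImputer()),
                           ('clf', SVC())])

param_grid = dict(imputer = ['passthrough',
                             SimpleImputer(),
                             KNNImputer()],
                    clf = [SVC(), LogisticRegression()],
                    clf__C = [0.1, 10, 100])

grid_search = GridSearchCV(clf_pipe, param_grid=param_grid)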

* `C` is the inverse of regularization strength: the lower its value, the stronger the regularisation.
* In the example above, `clf__C` provides a set of values for grid search.
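
For a version that can be run end to end, here is a minimal sketch on the Iris data bundled with sklearn. The pipeline, its step names (`scaler`, `clf`) and the grid are chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('clf', SVC())])

# 'clf__C' addresses the parameter C of the step named 'clf'
demo_grid = {'clf__C': [0.1, 10, 100]}

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5)
search.fit(X_iris, y_iris)

print(search.best_params_)
print(round(search.best_score_, 3))
```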

### Caching Transformers

Transforming data can be a computationally expensive step.

* For grid search, transformers need not be refit for every parameter configuration. They can be fitted once, and the transformed data can be reused.

This can be achieved by setting the `memory` parameter of a `Pipeline` object.

In [None]:
import tempfile
# `memory` expects a directory path (str) or a joblib.Memory object
tempDirPath = tempfile.mkdtemp()

In [None]:
estimators = [ 
                ('simpleImputer', SimpleImputer()),
                ('pca', PCA(2)),
                ('regressor', LinearRegression())
]

pipe = Pipeline(steps=estimators, memory=tempDirPath)
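
A sketch of the full life cycle of a cached pipeline on synthetic data: create a temporary cache directory, fit, predict, and remove the directory when it is no longer needed. The data and variable names are illustrative.

```python
import os
import shutil
import tempfile

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0])

cache_dir = tempfile.mkdtemp()  # path of a fresh temporary directory
cached_pipe = Pipeline(steps=[('pca', PCA(n_components=2)),
                              ('regressor', LinearRegression())],
                       memory=cache_dir)

# Fitted transformers are cached on disk under cache_dir
cached_pipe.fit(X_demo, y_demo)
predictions = cached_pipe.predict(X_demo)
print(predictions.shape)

# Clean up the cache once it is no longer needed
shutil.rmtree(cache_dir)
```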

### FeatureUnion

Concatenates the results of multiple transformer objects.

* Applies a list of transformer objects in parallel, and their outputs are concatenated side by side into a larger matrix.

`FeatureUnion` and `Pipeline` can be used to create complex transformers.
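
A minimal sketch of a `FeatureUnion` on synthetic data: one PCA component and the three standardised original features are stacked side by side, giving four output columns. The transformer names are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))

union = FeatureUnion(transformer_list=[('pca', PCA(n_components=1)),
                                       ('scaled', StandardScaler())])
X_union = union.fit_transform(X_demo)

# 1 PCA column followed by 3 standardised columns
print(X_union.shape)
```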

## 13. Visualising Pipelines

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([('selector', ColumnTransformer([('select_first_4',
                                                            'passthrough',
                                                            slice(0,4))])),
                            ('imputer', SimpleImputer(strategy='median')),
                            ('std_scaler', StandardScaler()),
                            ])

# OneHotEncoder is used for the categorical column: LabelBinarizer only accepts
# a target vector y, so it cannot be fitted inside a ColumnTransformer.
cat_pipeline = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [4]),
                                    ])

full_pipeline = FeatureUnion(transformer_list=
                            [("num_pipeline", num_pipeline),
                                ("cat_pipeline", cat_pipeline),
                            ]) 


In [None]:
from sklearn import set_config
set_config(display='diagram')

#displays HTML representation in a jupyter context
full_pipeline

## 14. Handling imbalanced data

There are two main approaches to handle imbalanced data:
* Undersampling
* Oversampling

Dataset: [Wine Quality dataset from UCI ML repository](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv)

In [None]:
wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")

wine_data['quality'].hist(bins=50)
plt.xlabel('Quality')
plt.ylabel('Number of samples')
plt.show()

In [None]:
wine_data.shape

### Undersampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
# class count
# value_counts() sorts by frequency, so sort by class label before unpacking
class_count_3, class_count_4, class_count_5, class_count_6, class_count_7, class_count_8 = wine_data['quality'].value_counts().sort_index()

# separate the classes

class_3 = wine_data[wine_data['quality'] == 3]
class_4 = wine_data[wine_data['quality'] == 4]
class_5 = wine_data[wine_data['quality'] == 5]
class_6 = wine_data[wine_data['quality'] == 6]
class_7 = wine_data[wine_data['quality'] == 7]
class_8 = wine_data[wine_data['quality'] == 8]

# Print the shape of the class
print('class 3:', class_3.shape)
print('class 4:', class_4.shape)
print('class 5:', class_5.shape)
print('class 6:', class_6.shape)
print('class 7:', class_7.shape)
print('class 8:', class_8.shape)

In [None]:
from collections import Counter

X = wine_data.drop(['quality'], axis=1)
y = wine_data['quality']

undersample = RandomUnderSampler(random_state = 0)
X_rus, y_rus = undersample.fit_resample(X, y)

print('Original dataset shape: ', Counter(y))
print('Resampled dataset shape: ', Counter(y_rus))
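
Conceptually, random undersampling keeps a random subset of each class, sized to match the smallest class. The pandas sketch below illustrates that idea on a made-up frame; it is a simplified illustration, not the `imblearn` implementation.

```python
import pandas as pd

# Toy imbalanced data: 7 samples of 'a', 3 of 'b', 2 of 'c'
toy = pd.DataFrame({'feature': range(12),
                    'label': ['a'] * 7 + ['b'] * 3 + ['c'] * 2})

# Size of the smallest class
smallest = toy['label'].value_counts().min()

# Draw that many rows at random from every class
balanced = toy.groupby('label', group_keys=False).sample(n=smallest, random_state=0)

print(toy['label'].value_counts().to_dict())
print(balanced['label'].value_counts().to_dict())
```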

### Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X, y)

print('Original dataset shape: ', Counter(y))
print('Resample data shape: ', Counter(y_ros))

In [None]:
print(X_ros.shape[0] - X.shape[0], 'new samples added by RandomOverSampler (random duplicates of existing points)')

### Oversampling using SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a popular technique for oversampling.

In [None]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)
counter = Counter(y_sm)
counter
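
The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a minority point and one of its nearest minority neighbours. The sketch below illustrates a single such interpolation on made-up points; it is a simplified illustration under those assumptions, not the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 2))  # made-up minority-class points

# 4 neighbours requested: the point itself plus its 3 nearest neighbours
nn = NearestNeighbors(n_neighbors=4).fit(minority)
_, neighbour_idx = nn.kneighbors(minority)

i = 0                                  # index of the point to oversample
j = rng.choice(neighbour_idx[i, 1:])   # one of its neighbours, chosen at random
lam = rng.uniform()                    # interpolation factor in [0, 1)

# Synthetic sample on the segment between point i and neighbour j
synthetic = minority[i] + lam * (minority[j] - minority[i])
print(minority[i], minority[j], synthetic)
```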

**Types of SMOTE**

* Borderline SMOTE
* Borderline-SMOTE SVM
* Adaptive Synthetic Sampling(ADASYN)