#**WEEK-2 NOTES**


##**PART 1 Feature Extraction**

In [None]:
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

  * **DictVectorizer**

    Converts lists of mappings of feature name and feature value into a matrix

  * **FeatureHasher**

    Uses the feature hashing technique.

    Instead of building a hash table of the features (as the vectorizers do), it applies a HASH FUNCTION to the features to determine their column index in sample matrices directly.

    This results in increased speed and reduced memory usage, at the expense of inspectability.

    The hasher does not remember what the input features looked like and has no `inverse_transform` method.

    The output of this transformer is a `scipy.sparse` matrix.

  * **Feature Extraction of Non-Numerical Values**

    from Images

    `sklearn.feature_extraction.image.*`

    from Text

    `sklearn.feature_extraction.text.*`


##**PART 2 Data Cleaning**

### **1. Handling Missing Values**

  * Handling Missing Values

    `sklearn.impute` API provides functionality to fill missing values in a dataset

    `MissingIndicator` class provides indicators for missing values

   1.  SimpleImputer
      * Fills missing values with one of the following strategies:
    `'mean'`, `'median'`, `'most_frequent'` and `'constant'`

    2.  KNNImputer
      * Uses the k-nearest neighbours approach to fill missing values in a dataset.
    The missing value of an attribute in a specific example is filled with the mean value of the same attribute of the `n_neighbors` closest neighbors.
      * The nearest neighbors are decided based on Euclidean distance (by default a NaN-aware variant, `nan_euclidean`).

In [None]:
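import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Original feature matrix; a small illustrative example with missing entries (np.nan)
X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])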

In [None]:
#Simple Imputer

si = SimpleImputer(strategy = 'mean')
si.fit_transform(X) # transformed feature matrix

In [None]:
#KNN Imputer

knni = KNNImputer(n_neighbors = 2, weights = 'uniform')
knni.fit_transform(X) # transformed feature matrix
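To make the two imputation strategies and `MissingIndicator` concrete, here is a minimal sketch on a hypothetical 4x3 matrix (`X_demo` is an assumption, not data from these notes). The filled values can be checked by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator

# Hypothetical feature matrix with two missing entries
X_demo = np.array([[1, 2, np.nan],
                   [3, 4, 3],
                   [np.nan, 6, 5],
                   [8, 8, 7]])

# Mean strategy: each NaN is replaced by the mean of its column
X_mean = SimpleImputer(strategy="mean").fit_transform(X_demo)

# KNN strategy: each NaN is replaced by the mean of that feature
# over the 2 nearest rows that have the feature present
X_knn = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_demo)

# Boolean mask of missing entries, only for columns that contain NaN
mask = MissingIndicator().fit_transform(X_demo)

print(X_mean)
print(X_knn)
print(mask)
```

For the first row, the mean strategy fills the third column with (3 + 5 + 7) / 3 = 5, whereas the KNN strategy uses only the two closest rows and fills it with (3 + 5) / 2 = 4.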

###**2. Numerical Transformers**

#### i. StandardScaler



  * Transforms the original feature vector x into a new feature vector x' using the following formula:

  $x' = \frac{x - \mu}{\sigma}$

  where μ and σ are the mean and standard deviation of that feature.



In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
#activity question

import numpy as np

a = np.array([0,3,6])
x = a.reshape(-1,1)

In [None]:
# Standard Scaler

ss = StandardScaler()
ss.fit_transform(x) #fit transform learns the parameters mu and sigma from the original feature matrix
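As a quick check of the formula, the scaler's output can be compared with a manual computation. This is a minimal sketch; the names `x_demo`, `ss_demo` and `x_manual` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_demo = np.array([0, 3, 6], dtype=float).reshape(-1, 1)

ss_demo = StandardScaler()
x_scaled = ss_demo.fit_transform(x_demo)

# Manual computation of (x - mu) / sigma
mu = x_demo.mean(axis=0)
sigma = x_demo.std(axis=0)  # population standard deviation (ddof=0)
x_manual = (x_demo - mu) / sigma

print(ss_demo.mean_, ss_demo.scale_)  # learned mu and sigma
print(x_scaled.ravel())
print(np.allclose(x_scaled, x_manual))
```

Here μ = 3 and σ = √6 ≈ 2.449, so the scaled values are approximately -1.2247, 0 and 1.2247.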

####ii. MinMaxScaler


  * It transforms the original feature vector x into a new feature vector x' so that all values fall within the range [0, 1], using the following formula:

  $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

  where $x_{\max}$ and $x_{\min}$ are the largest and smallest values of that feature in the original feature vector x.

  * The largest value is transformed to 1 and the smallest value is transformed to 0.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
#MinMaxScaler

mms = MinMaxScaler()
mms.fit_transform(x)

#### iii. MaxAbsScaler




  * It transforms the original feature vector x into a new feature vector x' so that all values fall within the range [-1, 1]:

  $x' = \frac{x}{\text{MaxAbsoluteValue}}$

  where `MaxAbsoluteValue = max( x.max, |x.min| )`

In [None]:
from sklearn.preprocessing import MaxAbsScaler

In [None]:
#MaxAbsScaler

mas = MaxAbsScaler()
mas.fit_transform(x)
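On non-negative data such as `[0, 3, 6]` the two scalers give the same result, so the difference is easier to see on a feature with negative values. The sketch below uses a hypothetical vector `x_signed`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Hypothetical feature containing a negative value
x_signed = np.array([-4.0, 0.0, 2.0]).reshape(-1, 1)

# MinMaxScaler: (x - min) / (max - min), output range [0, 1]
x_minmax = MinMaxScaler().fit_transform(x_signed)

# MaxAbsScaler: x / max(|x|), output range [-1, 1], zero stays zero
x_maxabs = MaxAbsScaler().fit_transform(x_signed)

print(x_minmax.ravel())
print(x_maxabs.ravel())
```

`MaxAbsScaler` only divides by a constant and does not shift the data, so zero entries stay zero.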

####iv. FunctionTransformer

  * Constructs transformed features by applying a USER DEFINED function

In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
#FunctionTransformer
import numpy as np

ft = FunctionTransformer(np.log2)
ft.fit_transform(x) # x contains 0, so log2 returns -inf for that entry (with a RuntimeWarning)
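When the data contains zeros, one possible alternative is `np.log1p`, paired with its inverse `np.expm1`. This is a sketch of that variant and not the function used in the cell above; `x_demo` and `ft_demo` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

x_demo = np.array([0.0, 3.0, 6.0]).reshape(-1, 1)

# log1p(x) = log(1 + x) is finite at x = 0; expm1 undoes it
ft_demo = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

x_log = ft_demo.fit_transform(x_demo)
x_back = ft_demo.inverse_transform(x_log)

print(x_log.ravel())
print(x_back.ravel())
```

Supplying `inverse_func` makes `inverse_transform` usable, which a transformer built from `func` alone cannot provide in a meaningful way.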

####v. Polynomial Transformation

  * Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
#Polynomial Transformation

pf = PolynomialFeatures(degree = 2)
pf.fit_transform(x)

#### vi. KBinsDiscretizer

  * Divides a continuous variable into bins
  * One hot encoding or Ordinal encoding is further applied to the bin labels

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
#KBinsDiscretizer

kbd = KBinsDiscretizer(n_bins = 5, strategy = 'uniform', encode = 'ordinal')
kbd.fit_transform(x)
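The two encodings of the bin labels can be compared on a small hypothetical feature. With the `'uniform'` strategy the bin edges are equally spaced between the minimum and the maximum of the feature.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous feature
x_cont = np.array([0.5, 3.0, 5.0, 7.0, 9.5]).reshape(-1, 1)

# Ordinal encoding: each value is replaced by its bin index
kbd_ord = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="ordinal")
bins_ord = kbd_ord.fit_transform(x_cont)

# One-hot encoding (dense): one binary column per bin
kbd_oh = KBinsDiscretizer(n_bins=3, strategy="uniform", encode="onehot-dense")
bins_oh = kbd_oh.fit_transform(x_cont)

print(kbd_ord.bin_edges_[0])  # edges at 0.5, 3.5, 6.5, 9.5
print(bins_ord.ravel())
print(bins_oh)
```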

###**3. Categorical transformers**
for categorical feature encoding and for label encoding

#### i. OneHotEncoder

  * Encodes a categorical feature or label as a one-hot numeric array
  * Creates one binary column for each of the K unique values
  * In each row, exactly one column has 1 in it and the rest have 0

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
#OneHotEncoder

ohe = OneHotEncoder()
ohe.fit_transform(x)

Example :


In [None]:
import numpy as np

l = np.array([1,2,3,1])
a = l.reshape(-1,1)

ohe = OneHotEncoder()
new = ohe.fit_transform(a)

print(new)

  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 0)	1.0


#### ii. LabelEncoder

  * Encodes target labels with value between 0 and K-1, where K is number of distinct values

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
#LabelEncoder

le = LabelEncoder()
le.fit_transform(x.ravel()) # LabelEncoder expects a 1-D array, so flatten the (n, 1) column vector

Example :

In [None]:
l = np.array([1,2,6,1,8,6])

le = LabelEncoder()
transformed_vector = le.fit_transform(l)
print(transformed_vector)

[0 1 2 0 3 2]


#### iii. OrdinalEncoder

* Encodes categorical features with values between 0 and K-1 (where K is the number of distinct values)
* **OrdinalEncoder** can operate on multi-dimensional data, while **LabelEncoder** can only transform 1-D data

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
c = np.array([[1, 'male'], [2, 'female'], [6, 'female'], [1, 'male'], [8, 'male'], [6, 'female']])

oe = OrdinalEncoder()
transformed_c = oe.fit_transform(c)

print(transformed_c)

[[0. 1.]
 [1. 0.]
 [2. 0.]
 [0. 1.]
 [3. 1.]
 [2. 0.]]


#### iv. LabelBinarizer

* Several regression and binary classification algorithms can be extended to a multi-class setup in **one-vs-all** fashion.
* This involves training a single regressor or classifier per class.
* For this, we need to convert multi-class labels to binary labels, and **LabelBinarizer** performs this task.
* If the estimator supports multi-class data, LabelBinarizer is not needed.



In [None]:
from sklearn.preprocessing import LabelBinarizer

In [None]:
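y = ['cat', 'dog', 'bird', 'cat'] # illustrative multi-class labels (not from the original notes)
lb = LabelBinarizer()
lb.fit_transform(y) # one binary column per class, in sorted class order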

####v. MultiLabelBinarizer

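* Encodes a collection of label sets as a binary indicator matrix with one column per class (K columns, where K is the number of classes); an entry is 1 if the sample carries that label and 0 otherwise.
* In this example K=4, since there are only 4 genres of movies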

`movie_genres = [{'action', 'comedy'}, {'comedy'}, {'action', 'thriller'}, {'science-fiction', 'action', 'thriller'}]`

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
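movie_genres = [{'action', 'comedy'}, {'comedy'}, {'action', 'thriller'}, {'science-fiction', 'action', 'thriller'}]

mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)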
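The shape of the result for the movie genres above can be sketched as follows. The variable names `mlb_demo`, `genre_matrix` and `recovered` are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

movie_genres = [{'action', 'comedy'}, {'comedy'}, {'action', 'thriller'},
                {'science-fiction', 'action', 'thriller'}]

mlb_demo = MultiLabelBinarizer()
genre_matrix = mlb_demo.fit_transform(movie_genres)

# Columns follow the sorted class names
print(list(mlb_demo.classes_))
print(genre_matrix)

# The indicator matrix can be mapped back to tuples of labels
recovered = mlb_demo.inverse_transform(genre_matrix)
print(recovered)
```

Each row can contain several 1s, which is what distinguishes this transformer from `LabelBinarizer`.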

####vi. Add Dummy Feature

`add_dummy_feature` augments dataset with a column vector, each value in the column vector is 1

In [None]:
from sklearn.preprocessing import add_dummy_feature

In [None]:
add_dummy_feature(X)
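A minimal sketch of the effect on a hypothetical 2x2 matrix: the column of ones is prepended as the first column, which is useful for models that need an explicit bias term.

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

X_small = np.array([[0, 1],
                    [1, 0]])

# The dummy column (all ones by default) becomes column 0
X_aug = add_dummy_feature(X_small)
print(X_aug)
```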


##**PART 3 Feature Selection**

* Sometimes in a real world dataset, all features do not contribute well enough towards fitting a model.
* The features that do not contribute significantly can be removed. This leads to a *decrease in the size of the dataset* and hence in the *computation cost* of fitting a model.

`sklearn.feature_selection` provides many APIs to accomplish this task.

The following are the classes present under the feature selection API :

FILTER-BASED
  * VarianceThreshold
  * SelectKBest
  * SelectPercentile
  * GenericUnivariateSelect

WRAPPER-BASED
  * RFE
  * RFECV
  * SelectFromModel
  * SequentialFeatureSelector

####i. Filter-Based Feature Selection

1. Variance Threshold

    * Removes all features with variance below a user-specified threshold from the input feature matrix.

    * By default, removes features that have the same value in all samples, i.e. zero variance.

2. Univariate Feature Selection

    It selects features based on univariate statistical tests. There are 3 APIs for univariate feature selection:

    * **SelectKBest** - removes all but the k highest scoring features
    * **SelectPercentile** - removes all but a user-specified highest scoring percentage of features
    * **GenericUnivariateSelect** - performs univariate feature selection with a configurable strategy, which can be found via hyper-parameter search

  sklearn provides one more class of univariate feature selection methods that work on common univariate statistical tests for each feature:

    * `SelectFpr` selects features based on a false positive rate test
    * `SelectFdr` selects features based on an estimated false discovery rate
    * `SelectFwe` selects features based on family-wise error rate
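The default behaviour of `VarianceThreshold` described above can be sketched on a small hypothetical matrix in which the first and last columns are constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 have the same value in every row (zero variance)
X_var = np.array([[0, 2, 0, 3],
                  [0, 1, 4, 3],
                  [0, 1, 1, 3]])

vt = VarianceThreshold()  # default threshold = 0.0
X_reduced = vt.fit_transform(X_var)

print(vt.get_support())  # boolean mask of the kept columns
print(X_reduced)
```

No target `y` is needed, since the filter looks only at the features themselves.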

#####**Univariate Scoring Function**



  * Each API needs a scoring function to score each feature.
  * Three classes of scoring functions are proposed:
      * Mutual Information (MI)
      * Chi-square
      * F-statistics
  * MI and F-statistics can be used in both `classification` and `regression` problems.
      * `mutual_info_regression`
      * `mutual_info_classif`
      * `f_regression`
      * `f_classif`
  * Chi-square can be used only in classification problems
      * `chi2`


**NOTE** : *Do not use regression feature scoring function with a classification problem. It will lead to useless results.*

1. Mutual Information (MI)

  * measures dependency between two variables
  * it returns a non-negative value.
      * `MI = 0` for `independent` variables
      * `Higher MI` indicates `higher dependency`

2. Chi-square

  * Measures dependence between two variables
  * Computes chi-square stats between `non-negative feature` (boolean or frequencies) and `class label`.
  * Higher chi-square values indicate that the features and labels are likely to be correlated. *(such features that are correlated with the labels are highly useful for classification problems)*

In [None]:
from sklearn.feature_selection import SelectKBest, SelectPercentile, GenericUnivariateSelect, chi2

In [None]:
#SelectKBest

skb = SelectKBest(chi2, k=20) #selects 20 best features based on chi-square scoring function
X_new = skb.fit_transform(X,y)


In [None]:
#SelectPercentile

sp = SelectPercentile(chi2, percentile = 20) #selects top 20 percentile best features based on chi-square scoring function
X_new = sp.fit_transform(X,y)

In [None]:
#GenericUnivariateSelect :
    # selects set of features based on a feature selection mode and a scoring function
    # the mode could be one of: 'percentile' (default), 'k_best', 'fpr', 'fdr', 'fwe'
    # the param argument takes value corresponding to the mode

transformer = GenericUnivariateSelect(chi2, mode = 'k_best', param = 20) #selects 20 best features based on chi-square scoring function
X_new = transformer.fit_transform(X, y)
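The cells above assume that `X` and `y` are already defined. A runnable sketch on the Iris dataset (chosen here only as an assumption because its four features are non-negative, as `chi2` requires) could look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import (SelectKBest, SelectPercentile,
                                       GenericUnivariateSelect, chi2)

X_iris, y_iris = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keep the 2 features with the highest chi-square score
skb_demo = SelectKBest(chi2, k=2)
X_kbest = skb_demo.fit_transform(X_iris, y_iris)

# Keep the top 50 percent of the features
sp_demo = SelectPercentile(chi2, percentile=50)
X_pct = sp_demo.fit_transform(X_iris, y_iris)

# Same selection as SelectKBest, expressed through mode and param
gus_demo = GenericUnivariateSelect(chi2, mode="k_best", param=2)
X_gus = gus_demo.fit_transform(X_iris, y_iris)

print(skb_demo.scores_)                    # one chi-square score per feature
print(skb_demo.get_support(indices=True))  # indices of the kept features
print(X_kbest.shape, X_pct.shape, X_gus.shape)
```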

#### ii. Wrapper-based feature selection

Unlike filter based methods, wrapper based methods use `estimator class` rather than a `scoring function`

1. Recursive Feature Elimination (RFE)

  * Uses an estimator to recursively remove features.
  * Initially fits an estimator on all features.
  * Obtains feature importance from the estimator and removes the least important feature.
  * Repeats the process by removing features one by one, until the desired number of features is obtained.

  **NOTE :**

  Use RFECV if we do not want to specify the desired number of features in RFE.

  It performs RFE in a cross-validation loop to find the optimal number of features.
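The RFE procedure described above can be sketched as follows. The Iris dataset and the `LogisticRegression` estimator are assumptions made for illustration; any estimator exposing `coef_` or `feature_importances_` would do.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

# Remove one feature per iteration (step=1) until 2 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=2, step=1)
X_rfe = rfe.fit_transform(X_iris, y_iris)

print(rfe.support_)  # True for the selected features
print(rfe.ranking_)  # 1 = selected, larger = eliminated earlier
print(X_rfe.shape)
```

With four features and two to keep, the selected features get rank 1 and the two eliminated features get ranks 2 and 3.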

2. SelectFromModel

  * It selects the desired number of important features (as specified with `max_features` parameter) above **certain threshold of feature importance** as obtained **from the trained estimator**
  * The feature importance is obtained via `coef_` , `feature_importances_` or an `importance_getter` callable from the trained estimator.
  * The feature importance threshold can be specified either numerically or through a string argument based on built-in heuristics such as `mean`, `median` and float multiples of these like `0.1*mean`

In [None]:
from sklearn.feature_selection import SelectFromModel

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC(C=0.01, penalty='l1', dual=False) # L1 penalty drives some coefficients to exactly zero
clf = clf.fit(X,y)
clf.coef_

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)

  * Here we use a linear support vector classifier to get coefficients of features for `SelectFromModel` transformer.
  * It ends up selecting features with non-zero weights or coefficients.

3. Sequential Feature Selection

  * Performs feature selection by selecting or deselecting features one by one in a greedy manner.
  * Uses one of the two approaches:
    * Forward Selection:

      * It starts with zero features and adds features one by one until the desired number of features is obtained.

      * Starting with zero features, it finds the one feature that obtains the best CV score for an estimator when trained on that feature alone.

      * It repeats the process by adding a new feature to the set of selected features.

    * Backward Selection:

      * It starts with all features and removes features one by one until the desired number of features is obtained.

      * Starting with all features, it removes the least important feature at each step, following the same idea as forward selection.

      * It stops when the desired number of features is reached.

  * The `direction` parameter controls whether forward or backward SFS is used.
  * In general, forward and backward selection do not yield equivalent results.
  * Select the direction that is efficient for the required number of selected features:
    * When we want to select 7 out of 10 features,
      * Forward selection would need to perform 7 iterations.
      * Backward selection would only need to perform 3
    * Backward selection seems to be a reasonable choice here.

  * SFS does not require the underlying model to expose a `coef_` or `feature_importances_` attributes unlike in `RFE` and `SelectFromModel`.
  * SFS may be slower than `RFE` and `SelectFromModel` as it needs to evaluate more models compared to the other two approaches.

    For example, in backward selection, the iteration going from m features to
    m-1 features using k-fold cross-validation requires fitting m × k models, while:

      * `RFE` would require only a single fit
      * `SelectFromModel` performs a single fit and requires no iterations
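A minimal sketch of forward and backward SFS, assuming the Iris dataset and a `KNeighborsClassifier` (an estimator with no `coef_` or `feature_importances_`, which SFS does not need):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from no features and add the best one at each step
sfs_fwd = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction="forward")
sfs_fwd.fit(X_iris, y_iris)

# Backward: start from all features and drop one at each step
sfs_bwd = SequentialFeatureSelector(knn, n_features_to_select=2,
                                    direction="backward")
sfs_bwd.fit(X_iris, y_iris)

print(sfs_fwd.get_support())
print(sfs_bwd.get_support())
print(sfs_fwd.transform(X_iris).shape)
```

The two masks need not be identical, which matches the remark that forward and backward selection do not yield equivalent results in general.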


####iii. Heterogeneous Feature Transformations

* Generally, training data contains diverse features such as numeric and categorical.
* Different feature types are processed with different transformers
* Need a way to combine different feature transformers seamlessly

#####a. Composite Transformer

`sklearn.compose` has useful classes and methods to apply transformation on subset of features and combine them:

  * Column Transformer

    * It applies a set of transformers to columns of an array or `pandas.DataFrame`, concatenates the transformed outputs from different transformers into a single matrix.
    * It is useful for transforming heterogeneous data by applying different transformers to separate subsets of features.
    * It combines different feature selection mechanisms and transformation into a single transformer object.
    * `ColumnTransformer()` takes a list of tuples.
    * Each tuple has the format: `(estimatorName, estimator(..), columnIndices)`

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# ColumnTransformer Example
  # In this example, let's apply MaxAbsScaler on the numeric column and OneHotEncoder on the categorical column

from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

X = [[20.0, 'male'],
     [11.2, 'female'],
     [15.6, 'female'],
     [13.0, 'male'],
     [18.6, 'male'],
     [16.4, 'female']]

listOfTrans = [('ageScaler', MaxAbsScaler(), [0]),('genderEncoder', OneHotEncoder(dtype = 'int'), [1])]

column_trans = ColumnTransformer(listOfTrans, remainder='drop', verbose_feature_names_out=False)

print(column_trans.fit_transform(X))

[[1.   0.   1.  ]
 [0.56 1.   0.  ]
 [0.78 1.   0.  ]
 [0.65 0.   1.  ]
 [0.93 0.   1.  ]
 [0.82 1.   0.  ]]


#####b. Transforming Target for Regression

`TransformedTargetRegressor`

  * Transforms the target variable `y` before fitting a regression model.
  * The predicted values are mapped back to the original space via an inverse transform.
  * `TransformedTargetRegressor` takes `regressor` and `transformer` to be applied to the target variable as arguments.


In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

tt = TransformedTargetRegressor(regressor = LinearRegression(), func = np.log, inverse_func = np.exp)

X = np.arange(4).reshape(-1,1)
y = np.exp(2 * X).ravel()
tt.fit(X, y)
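To see the inverse mapping at work, the fitted model can be asked for a prediction. The sketch below repeats the setup with illustrative names (`tt_demo`, `X_t`, `y_t`) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

X_t = np.arange(4).reshape(-1, 1)
y_t = np.exp(2 * X_t).ravel()  # y = exp(2x), so log(y) = 2x is linear in x

tt_demo = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log, inverse_func=np.exp)
tt_demo.fit(X_t, y_t)

# The inner regressor is fitted on log(y), so its slope is about 2
print(tt_demo.regressor_.coef_)

# predict() applies inverse_func, so the result is back in the original space
y_pred = tt_demo.predict(np.array([[4]]))
print(y_pred, np.exp(8))
```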

##**PART 4 Dimensionality Reduction by PCA**

Another way to reduce the number of features is through `unsupervised dimensionality reduction` techniques.

`sklearn.decomposition` module has a number of APIs for this task.


####PCA

* PCA is a linear dimensionality reduction technique.
* It uses singular value decomposition(SVD) to project the feature matrix or data to a lower dimensional space.
* The first principal component (PC) is in the direction of `maximum variance` in the data
  * It captures the bulk of the variance in the data
* The subsequent PCs are orthogonal to the first PC and capture progressively less variance in the data
* We can select first k PCs such that we are able to capture the desired variance

`sklearn.decomposition.PCA` API is used for performing PCA based dimensionality reduction

In [None]:
from sklearn.decomposition import PCA
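A minimal sketch of both ways of choosing the number of components, assuming the Iris dataset for illustration: an integer fixes k directly, while a float between 0 and 1 asks for the smallest k that captures at least that fraction of the variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Fixed number of components
pca_2 = PCA(n_components=2)
X_2d = pca_2.fit_transform(X_iris)
print(pca_2.explained_variance_ratio_)  # variance captured by each PC

# Smallest number of components capturing at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_iris)
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```

The ratios come out in decreasing order, reflecting that each subsequent PC captures less variance than the previous one.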

## **PART 5 Chaining Transformers**

* The preprocessing transformations are applied one after another on the input feature matrix.
* It is important to apply exactly same transformation on training, evaluation and test set in the same order
* Failing to do so would lead to incorrect predictions from model due to distribution shift and hence incorrect performance evaluation

* The `sklearn.pipeline` module provides utilities to build a composite estimator, as a chain of transformers and estimators.
* There are 2 classes: *i. Pipeline* and *ii. FeatureUnion*

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline

####1. Pipeline
`sklearn.pipeline.Pipeline`




  * Constructs a chain of multiple transformers to execute a fixed sequence of steps in data preprocessing and modelling
  * Sequentially apply a list of transformers and estimators
  * Intermediate steps of the pipeline must be "transformers" that is, they must implement fit and transform methods.
  * The final estimator only needs to implement fit
  * The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.


**Two ways to create a pipeline object**
    
  * `Pipeline()`

      * It takes a list of `('estimatorName', estimator(..))` tuples.
      * The pipeline object exposes interface of the last step.

    

In [None]:
#Pipeline()

estimators = [('simpleImputer', SimpleImputer()), ('standardScaler', StandardScaler())]

pipe = Pipeline(steps = estimators)

* `make_pipeline()`

    * It takes a number of estimator objects only.

In [None]:
#make_pipeline()

pipe = make_pipeline(SimpleImputer(), StandardScaler())

**Accessing individual steps in Pipeline**



In [None]:
estimators = [ ('simpleImputer', SimpleImputer()), ('pca' , PCA()), ('regressor' , LinearRegression())]

pipe = Pipeline(steps = estimators)

The second estimator can be accessed in the following 4 ways:
  * `pipe.named_steps.pca`
  * `pipe.steps[1]` (returns the `('pca', PCA())` tuple; the estimator itself is `pipe.steps[1][1]`)
  * `pipe[1]`
  * `pipe['pca']`

**Accessing parameters of each step in Pipeline**

Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameterName>` syntax

(Note there are two underscores between `<estimator>` and `<parameterName>`)

In [None]:
estimators = [ ('simpleImputer', SimpleImputer()), ('pca' , PCA()), ('regressor' , LinearRegression())]

pipe = Pipeline(steps = estimators)

pipe.set_params(pca__n_components = 2)

# here n_components of PCA() step is set after the pipeline is created
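The access patterns and the `<estimator>__<parameterName>` syntax described above can be checked with a short sketch (`pipe_demo` and `pca_step` are illustrative names):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

pipe_demo = Pipeline(steps=[("simpleImputer", SimpleImputer()),
                            ("pca", PCA()),
                            ("regressor", LinearRegression())])

pca_step = pipe_demo.named_steps.pca

# These all refer to the same PCA object
print(pipe_demo["pca"] is pca_step)
print(pipe_demo[1] is pca_step)
print(pipe_demo.steps[1][1] is pca_step)

# pipe.steps[1] itself is the (name, estimator) tuple
print(pipe_demo.steps[1][0])

# Nested parameter: step name, two underscores, parameter name
pipe_demo.set_params(pca__n_components=2)
print(pca_step.n_components)
```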

**Performing grid search with pipeline**

By using naming convention of nested parameters, grid search can be implemented

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

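# Step names must match the keys used in param_grid ('imputer' and 'clf')
pipe = Pipeline(steps = [('imputer', SimpleImputer()), ('clf', SVC())])

param_grid = dict(imputer = ['passthrough', SimpleImputer(), KNNImputer()], clf = [SVC(), LogisticRegression()], clf__C = [0.1,10,100])

grid_search = GridSearchCV(pipe, param_grid = param_grid)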

* By `passthrough` we mean that we do not want to perform any imputation

* `C` is the inverse of the regularization strength: the lower its value, the stronger the regularization.

* In the example above `clf__C` provides a set of values for grid search
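An end-to-end sketch of such a search, assuming the Iris dataset and a smaller grid than the one above to keep the run short. The step names `imputer` and `clf` are what the grid keys refer to.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

pipe_gs = Pipeline(steps=[("imputer", SimpleImputer()), ("clf", SVC())])

# Whole steps can be swapped ('imputer', 'clf') and nested
# parameters tuned ('clf__C') in the same grid
param_grid_demo = {
    "imputer": ["passthrough", SimpleImputer()],
    "clf": [SVC(), LogisticRegression(max_iter=1000)],
    "clf__C": [0.1, 10],
}

gs = GridSearchCV(pipe_gs, param_grid=param_grid_demo, cv=3)
gs.fit(X_iris, y_iris)

print(gs.best_params_)
print(gs.best_score_)
```

The grid has 2 x 2 x 2 = 8 candidate configurations, each evaluated with 3-fold cross-validation.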

**Caching transformers**

* Transforming data is a computationally expensive step
* For grid search, transformers need not be applied for every parameter configuration. They can be applied only once, and the transformed data can be reused.
* This can be achieved by setting `memory` parameter of a pipeline object
* `memory` can take either location of a directory in string format or `joblib.Memory` object.


In [None]:
estimators = [ ('simpleImputer', SimpleImputer()), ('pca' , PCA()), ('regressor' , LinearRegression())]

pipe = Pipeline(steps = estimators, memory = '/path/to/cache/dir')

**Advantages of Pipeline**

* Combines multiple steps of end to end ML into single object such as `missing value imputation`, `feature scaling` and `encoding`, `model training` and `cross validation`.

* Enables joint grid search over parameters of all the estimators in the pipeline.

* Makes configuring and tuning end to end ML quick and easy.

* Offers convenience, as a developer has to call `fit()` and `predict()` methods only on a `Pipeline` object (assuming last step in the pipeline is an estimator)

* Reduces code duplication : With a `Pipeline` object, one doesn't have to repeat code for preprocessing and transforming the test data.

####2. FeatureUnion
`sklearn.pipeline.FeatureUnion`

* Combines output from several transformer objects by creating a new transformer from them.

* Concatenates `results` of multiple transformer objects.

* Applies a list of transformer objects in parallel, and their outputs are concatenated side-by-side into a larger matrix

* `FeatureUnion` and `Pipeline` can be used to create complex transformers

In [None]:
from sklearn.pipeline import FeatureUnion

**Combining Transformers and Pipelines**

* `FeatureUnion` accepts a list of tuples
* Each tuple is of format: `('estimatorName', estimator(...))`

In [None]:
num_pipeline = Pipeline(
    [('selector', ColumnTransformer([('select_first_4', 'passthrough', slice(0,4))])),
     ('imputer', SimpleImputer(strategy = 'median')),
     ('std_scaler', StandardScaler())])

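# OneHotEncoder is used for the categorical column: LabelBinarizer only accepts a target vector y and fails inside ColumnTransformer
cat_pipeline = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [4])])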

full_pipeline = FeatureUnion(transformer_list = [("num_pipeline", num_pipeline), ("cat_pipeline", cat_pipeline)])
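The side-by-side concatenation performed by `FeatureUnion` can be seen in a smaller sketch. The Iris dataset and the two transformers below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_iris, y_iris = load_iris(return_X_y=True)

# Two transformers applied to the same input, in parallel
union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),        # 2 columns
    ("univ_select", SelectKBest(k=1)),   # 1 column
])

X_union = union.fit_transform(X_iris, y_iris)
print(X_union.shape)  # 2 + 1 = 3 columns
```

The number of output columns is the sum of the columns produced by the individual transformers.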

**Visualizing Composite Transformers**

In [None]:
from sklearn import set_config
set_config(display = 'diagram') #displays HTML representation in a jupyter context
full_pipeline