# Transformer class

## Table of Contents
1. Transformer
1. Build custom transformer

## 1. Transformer

A transformer is similar to a conversion function: it takes data in one form as input and returns data in another form as output. A transformer can be fitted on a training dataset, and the parameters learned there can then be used to convert test data.

Transformers have two key functions (methods):

- `fit()`: This takes a training set of data as input and sets internal parameters (attributes).

- `transform()`: This performs the transformation itself. This can take either the training dataset, or a new dataset of the same format.
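
The division of labour between the two methods can be sketched with a toy class. `MeanCenterer` is a hypothetical name used only for illustration, not a scikit-learn class: `fit()` learns column means from training rows, and `transform()` subtracts those stored means from any rows of the same width.

```python
class MeanCenterer:
    """Toy transformer (hypothetical, for illustration only)."""

    def fit(self, rows):
        # Learn one mean per column from the training rows.
        n_rows = len(rows)
        self.means_ = [sum(col) / n_rows for col in zip(*rows)]
        return self

    def transform(self, rows):
        # Apply the stored means; nothing is re-learned here.
        return [[value - mean for value, mean in zip(row, self.means_)]
                for row in rows]


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

centerer = MeanCenterer().fit(train)
print(centerer.means_)           # [2.0, 20.0]
print(centerer.transform(test))  # [[0.0, 20.0]]
```

The test row is centered with the means of the training rows, which is the behaviour the rest of this section relies on.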

Import the libraries `pandas`, `numpy`, and `sklearn`. Also import an existing transformer, `Imputer`, which is used to fill in missing values.

In [7]:
import pandas  as pd
import numpy   as np
import sklearn as sk
from sklearn.preprocessing import Imputer

Create an instance `imp` of the `Imputer` class. The keyword argument `missing_values='NaN'` specifies that missing values are represented by `NaN`; `strategy='mean'` together with `axis=0` specifies that each missing value is replaced by the mean of the column that contains it.

In [9]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)                       

Create a sample data set `A`.

In [11]:
A = [[0, 2, 0], 
     [3, 0, 0], 
     [7, 1, 6],
     [2, 1, 2]]

Fit this imputer to the sample data `A` using the `fit` method. The mean of each column will be stored in the object `imp`.

In [13]:
imp.fit(A) 

The `.statistics_` attribute stores an array with one value per feature: the mean of that feature's column.

In [15]:
imp.statistics_

Create a similar dataset `B` with missing values.

In [17]:
B = [[np.nan, 2, 5],
     [6, np.nan, 4],
     [7, 6, np.nan]]

Transform the dataset using the `transform` method with the fitted imputer, and return a completed version of `B`.

In [19]:
imp.transform(B)

Compare the values that replaced the `NaN` entries in the output above with the column-wise arithmetic means of `A`.

In [21]:
np.mean(A, axis=0)
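
`sklearn.preprocessing.Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions, a roughly equivalent sketch uses `SimpleImputer` from `sklearn.impute`, which always works column-wise and therefore has no `axis` argument. The data repeats `A` and `B` from above; the variable name `imp_new` is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

A = [[0, 2, 0],
     [3, 0, 0],
     [7, 1, 6],
     [2, 1, 2]]
B = [[np.nan, 2, 5],
     [6, np.nan, 4],
     [7, 6, np.nan]]

# missing_values is the actual NaN value here, not the string 'NaN'.
imp_new = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_new.fit(A)

print(imp_new.statistics_)   # column means of A: [3. 1. 2.]
print(imp_new.transform(B))  # each NaN replaced by the mean of its column in A
```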

This section showed how to apply an existing transformer from the scikit-learn library to convert data into the form we need. A transformer class is useful because it learns its parameters from a dataset with the `fit()` method, rather than taking fixed values, and then performs the transformation with the `transform()` method.

## 2. Build custom transformer

To make your transformer work seamlessly with scikit-learn functionalities (such as pipelines), it is useful to create a class and implement three methods:
1. `__init__`
1. `fit()` (returning `self`)
1. `transform()`.

The `fit_transform()` method is added automatically when the `TransformerMixin` is used as a base class.
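
As a minimal sketch of that last point, the hypothetical `Doubler` class below defines only `fit()` and `transform()`, yet `fit_transform()` is available because it is inherited from `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Doubler(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer: multiplies every value by 2."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        return np.asarray(X) * 2


doubler = Doubler()
X = np.array([[1, 2], [3, 4]])

# fit_transform() is not defined in Doubler; it comes from TransformerMixin.
print(hasattr(doubler, "fit_transform"))  # True
print(doubler.fit_transform(X))
```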

### Example 1: `DataFrameSelector` class

#### 1. Build a `DataFrameSelector` transformer class

Import `TransformerMixin` and `BaseEstimator`. The latter provides the `get_params()` and `set_params()` methods, which are useful for automatic hyperparameter tuning.

In [28]:
from sklearn.base import BaseEstimator, TransformerMixin

Build a small transformer class that selects the desired attributes, drops the rest, and converts the resulting DataFrame to a NumPy array.

In [30]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        # Column names to keep
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        # Select the columns and return them as a NumPy array
        return X[self.attribute_names].values

This transformer has one parameter, `attribute_names`.
- The `fit` method returns `self` (this is standard behavior for `.fit` methods in scikit-learn).
- The `transform` method takes a DataFrame and returns the values of the selected columns as a NumPy array.
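
Because the class inherits from `BaseEstimator` and stores its constructor argument under the same name, `get_params()` and `set_params()` work without extra code. The sketch below repeats the class so that it runs on its own; the column names and the tiny DataFrame are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Made-up data, not the diamonds dataset.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

selector = DataFrameSelector(["a", "b"])
print(selector.get_params())   # {'attribute_names': ['a', 'b']}

# set_params() changes the parameter in place and returns the estimator.
selector.set_params(attribute_names=["c"])
print(selector.transform(df))  # column "c" only, as a 2-D array
```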

#### 2. Apply the `DataFrameSelector` to the diamonds dataset

Load the diamonds dataset as a Pandas DataFrame. Check the first five observations with column headers.

In [34]:
dataset = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv')
dataset.head()

Define the features to choose and store the list of feature names in the `variable_list` object.

In [36]:
variable_list=["carat", "depth", "table", "price"]

Create an instance of the `DataFrameSelector` class with the attribute names listed in `variable_list`, and store it in the `selector` object.

In [38]:
selector=DataFrameSelector(variable_list)

Use the `selector` instance to transform the diamonds dataset.

In [40]:
selector.fit_transform(dataset)

The output shows that the values of the four numeric attributes have been converted into a NumPy array.
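
Since `DataFrameSelector` follows the scikit-learn transformer interface, it can also be used as a pipeline step. The following is a sketch on a small made-up DataFrame rather than the diamonds file, whose path is environment-specific; the step names `select` and `scale` are arbitrary.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Made-up rows that only mimic a few diamonds columns.
toy = pd.DataFrame({
    "carat": [0.2, 0.4, 0.6],
    "cut": ["Ideal", "Good", "Premium"],
    "price": [300, 500, 700],
})

pipe = Pipeline([
    ("select", DataFrameSelector(["carat", "price"])),  # keep numeric columns
    ("scale", MinMaxScaler()),                          # then scale them
])

scaled = pipe.fit_transform(toy)
print(scaled.shape)  # (3, 2)
print(scaled)
```

The non-numeric `cut` column never reaches the scaler because the selector drops it first.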

### Example 2: `Scaler_MinMax` class

#### 1. Introduction to min-max scaling

The `MinMaxScaler` in scikit-learn scales each feature to a given range (by default between 0 and 1), using the following formula for each feature:


- \\(\dfrac{x_i-\min(x)}{\max(x)-\min(x)}\\)
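
As a worked example of the formula, the sketch below scales one feature in plain Python. The helper name `min_max_scale` is made up for illustration, and it assumes the feature is not constant (otherwise the denominator is zero).

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x_i - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero, i.e. the feature is not constant
    return [(v - lo) / span for v in values]


feature = [0, 2, 4, 6, 8]
print(min_max_scale(feature))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```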

Import the built-in transformer `MinMaxScaler` from the scikit-learn library.

In [46]:
from sklearn.preprocessing import MinMaxScaler

Create an instance of `MinMaxScaler` and store it in an object `minmax1`.

In [48]:
minmax1 = MinMaxScaler()

Create a sample array `a` with two features whose values range from 0 to 8 and from 1 to 9, respectively.

In [50]:
a = np.arange(10).reshape((5,2))
a

Fit the transformer object to the sample data `a` and return a transformed version of `a` with two features ranging from 0 to 1.

In [52]:
minmax1.fit_transform(a)
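
The target range is controlled by the `feature_range` argument of `MinMaxScaler`, which defaults to `(0, 1)`. A short sketch with a different range, reusing the same sample array (the name `minmax_sym` is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

a = np.arange(10).reshape((5, 2))

# Scale each column to [-1, 1] instead of the default [0, 1].
minmax_sym = MinMaxScaler(feature_range=(-1, 1))
scaled = minmax_sym.fit_transform(a)

print(scaled.min(axis=0))  # [-1. -1.]
print(scaled.max(axis=0))  # [1. 1.]
```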

#### 2. Build a `Scaler_MinMax` class

Build a transformer class `Scaler_MinMax` that performs a transformation similar to that of `MinMaxScaler`.

In [55]:
class Scaler_MinMax(BaseEstimator, TransformerMixin):
    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, X, y=None):
        # Learn the per-feature minimum, maximum, and range, ignoring NaNs
        data_min = np.nanmin(X, axis=0)
        data_max = np.nanmax(X, axis=0)
        data_range = data_max - data_min
        # Store the learned values as attributes
        self.data_min_ = data_min
        self.data_max_ = data_max
        self.data_range_ = data_range
        return self

    def transform(self, X):
        # Scale with the statistics learned in fit
        X_scaled = (X - self.data_min_) / self.data_range_
        return X_scaled

The transformer has one parameter, `axis`, which defaults to 0 (column-wise). Note that `fit` as written always computes along `axis=0`, so the stored `axis` value has no effect.

In the `fit` method, the class will:
- learn the minimum of an array by computing `np.nanmin(X, axis=0)` 
- learn the maximum of an array by computing `np.nanmax(X, axis=0)`
- get the range between the minimum and maximum by computing `data_max - data_min`

Each of these results is then stored as an object attribute, after which the `fit` method returns `self`.

In the `transform` method, the class takes an array of data and uses the stored attributes to calculate:
- a scaled version of the data, `X_scaled`, by subtracting the minimum value and dividing by the range

The `transform` method then returns `X_scaled`. For the data used in `fit`, the scaled values range from 0 to 1; new data can fall outside that range if its values lie outside the fitted minimum and maximum.
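
One possible way to make an `axis` parameter take effect, shown only as a sketch, is to pass it to the reductions and keep the reduced dimension (`keepdims=True`) so that broadcasting in `transform` works for either axis. The class name `AxisMinMax` is hypothetical, and the scikit-learn base classes are omitted to keep the example NumPy-only.

```python
import numpy as np


class AxisMinMax:
    """Hypothetical min-max scaler that honours its `axis` parameter."""

    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # keepdims=True keeps shapes like (1, n) or (n, 1) for broadcasting.
        self.data_min_ = np.nanmin(X, axis=self.axis, keepdims=True)
        self.data_max_ = np.nanmax(X, axis=self.axis, keepdims=True)
        self.data_range_ = self.data_max_ - self.data_min_
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.data_min_) / self.data_range_


X = np.array([[0.0, 10.0],
              [5.0, 20.0]])

print(AxisMinMax(axis=0).fit(X).transform(X))  # each column scaled to [0, 1]
print(AxisMinMax(axis=1).fit(X).transform(X))  # each row scaled to [0, 1]
```

With `axis=1` the statistics are per row, so `transform` is only meaningful for data with the same rows that were fitted; the column-wise default is the case that carries over to new data.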

#### 3. Apply the `Scaler_MinMax` class to sample data

Create an instance of `Scaler_MinMax` and store it in an object `minmax2`. Fit the transformer to the sample data `a` and transform it to get a scaled version of `a` with two features ranging from 0 to 1.

In [59]:
minmax2 = Scaler_MinMax()
minmax2.fit_transform(a)

After fitting the transformer class on data `a`, the object `minmax2` has calculated and stored the three attributes:
- `.data_min_` which is the minimum value of each feature
- `.data_max_` which is the maximum value of each feature
- `.data_range_` which is the range (`data_max_ - data_min_`) of each feature

In [61]:
minmax2.data_min_, minmax2.data_max_, minmax2.data_range_
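
The built-in `MinMaxScaler` exposes fitted attributes with the same names, so the two can be compared directly. The sketch below repeats a condensed `Scaler_MinMax` definition so that it runs on its own; the names `custom` and `builtin` are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler


class Scaler_MinMax(BaseEstimator, TransformerMixin):
    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, X, y=None):
        self.data_min_ = np.nanmin(X, axis=0)
        self.data_max_ = np.nanmax(X, axis=0)
        self.data_range_ = self.data_max_ - self.data_min_
        return self

    def transform(self, X):
        return (X - self.data_min_) / self.data_range_


a = np.arange(10).reshape((5, 2))
custom = Scaler_MinMax().fit(a)
builtin = MinMaxScaler().fit(a)

for name in ("data_min_", "data_max_", "data_range_"):
    # Both objects should report the same per-feature statistics.
    print(name, getattr(custom, name), getattr(builtin, name))
```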

Create another sample array `b` with two features whose values range from 0 to 4 and from 1 to 5, respectively.

In [63]:
b=np.arange(6).reshape(3,2)
b

Use the fitted transformer object `minmax2` to perform transformation on `b` and get a scaled version of `b` with two features ranging from 0 to 0.5.

In [65]:
minmax2.transform(b)
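
The 0 to 0.5 range follows from the fact that `transform` reuses the minimum and range learned from `a`, not those of `b`. The arithmetic can be reproduced with NumPy alone (the names `data_min`, `data_range`, and `b_scaled` are local to this sketch):

```python
import numpy as np

a = np.arange(10).reshape((5, 2))
b = np.arange(6).reshape((3, 2))

# Statistics learned from `a` during fit.
data_min = a.min(axis=0)               # [0 1]
data_range = a.max(axis=0) - data_min  # [8 8]

# Applying them to `b`: its largest values (4 and 5) both map to 4/8 = 0.5.
b_scaled = (b - data_min) / data_range
print(b_scaled)
```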

__Exercise__: Compare the output of `minmax2.fit_transform(b)` with the output above.

In [67]:
minmax2.fit_transform(b)