# Transformer Class

## Reference
- https://scikit-learn.org/stable/data_transforms.html
- https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html

## Table of Contents
1. Introduction
1. Create a transformer __object__
1. Create a transformer __class__

## Setup

In [19]:
%%sh 
git clone https://github.com/datalab-datasets/file-samples.git

Cloning into 'file-samples'...


In [26]:
ls /content/file-samples/*

/content/file-samples/diamonds.csv
/content/file-samples/dict_of_lists.json
/content/file-samples/each_line.json
/content/file-samples/enron.json
/content/file-samples/imports-85.csv
/content/file-samples/imports-85.names
/content/file-samples/imports-85.url
/content/file-samples/iris.csv
/content/file-samples/list_of_dicts.json
/content/file-samples/one_dictionary.json
/content/file-samples/one_list.json
/content/file-samples/one_list_with_metadata.json
/content/file-samples/pima-indians-diabetes.csv
/content/file-samples/README.md
/content/file-samples/simple_dict.json
/content/file-samples/simple_list.json
/content/file-samples/stocks.json
/content/file-samples/world_bank.json
/content/file-samples/zips.json


## 1. Introduction

A _transformer_ is an object that converts (transforms) input data into output data. 

Often both the input and output data are dataframes, matrices or numpy arrays, but this is not required.  

Often internal attributes of the transformer object are set using information from one dataframe, but then these attributes are used to convert (transform) other dataframes.

Transformers have two key functions (methods):

- `fit()`: This sets internal parameters (attributes) based on the input data.

- `transform()`: This performs the transformation itself.
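
The split between `fit()` and `transform()` can be sketched without scikit-learn. The class below is a hypothetical illustration (the name `ColumnCenterer` and its `mean_` attribute are assumptions, not part of any library): `fit()` learns column means from one array, and `transform()` applies those stored means to a different array.

```python
import numpy as np

class ColumnCenterer:
    """Toy transformer: subtracts the column means learned during fit()."""

    def fit(self, X):
        # Learn one mean per column and store it on the object.
        self.mean_ = np.mean(X, axis=0)
        return self

    def transform(self, X):
        # Reuse the stored means; nothing is re-learned here.
        return np.asarray(X) - self.mean_

train = np.array([[1.0, 10.0], [3.0, 30.0]])
new = np.array([[2.0, 20.0], [4.0, 40.0]])

centerer = ColumnCenterer().fit(train)   # column means of `train`: 2 and 20
centered_new = centerer.transform(new)   # rows become (0, 0) and (2, 20)
print(centered_new)
```

Note that `new` is shifted by the means of `train`, not by its own means.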

Import the `pandas` and `numpy` libraries. In addition, import the `Imputer` class, a transformer that fills in (imputes) missing values.

In [0]:
import pandas  as pd
import numpy   as np
import sklearn as sk
from sklearn.preprocessing import Imputer

Display the version numbers of the numpy, pandas and scikit-learn packages:

In [2]:
print('numpy  :',np.__version__)
print('pandas :',pd.__version__)
print('sklearn:',sk.__version__)

numpy  : 1.16.4
pandas : 0.24.2
sklearn: 0.21.2


Note that these version numbers may not match the versions documented in the references provided above.

## 2. Create a transformer object

The code cell below creates a transformer object `imp`, which is an instance of the `Imputer` class, by calling the constructor (init method) of that class. 

Several arguments (of the init method) are used to configure the `imp` object. The keyword arguments:
- `missing_values='NaN'` specifies that missing values are represented by `NaN`
- `strategy='mean'` specifies that the mean will be used to complete missing values
- `axis=0` specifies that the mean is taken for each column

In [3]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)                       



Create a sample data set `A`.

In [0]:
A = [[np.nan, 2, 0], 
     [3, 0, 0], 
     [7, 1, 6],
     [2, 1, 2]]

Fit the imputer object `imp` to the sample data `A` using the `fit` method.

In [5]:
imp.fit(A) 

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Recall that the `fit` method returns the object itself.

The `.statistics_` attribute stores an array of values, which in this case contains the mean of each column.

In [6]:
imp.statistics_

array([4., 1., 2.])

Create a second dataset `B`, this time with one missing value in every column. For convenience, the missing values are on the diagonal.

In [0]:
B = [[np.nan, 2, 5],
     [6, np.nan, 4],
     [7, 6, np.nan]]

Transform the new dataset `B` using the `transform` method of `imp`, the fitted imputer, and return a completed version of `B`.

In [8]:
imp.transform(B)

array([[4., 2., 5.],
       [6., 1., 4.],
       [7., 6., 2.]])

Notice in the output above that the missing values (along the diagonal) in `B` have been replaced with the values from `imp.statistics_` (which is an array of the mean of each column of `A`).

In [9]:
imp.statistics_

array([4., 1., 2.])

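NumPy's `mean` function also computes the mean of each column of `A`, but it returns `nan` for any column that contains a missing value. The `nanmean` function ignores missing values and reproduces `imp.statistics_`.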

In [10]:
np.mean(A, axis=0)

array([nan,  1.,  2.])

In [11]:
np.nanmean(A, axis=0)

array([4., 1., 2.])

This section applied an existing scikit-learn transformer to convert data into the required form. A transformer is useful because it learns its parameters from a dataset with the `fit()` method, rather than taking fixed values, and then performs the transformation with the `transform()` method.
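
The `Imputer` class was removed from `sklearn.preprocessing` in scikit-learn 0.22. A minimal sketch of the same steps with `SimpleImputer` from `sklearn.impute` (available from 0.20 onward) follows. The name `imp2` is illustrative only. `SimpleImputer` has no `axis` argument and always works column by column, and the missing-value marker is passed as `np.nan`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

A = [[np.nan, 2, 0],
     [3, 0, 0],
     [7, 1, 6],
     [2, 1, 2]]

B = [[np.nan, 2, 5],
     [6, np.nan, 4],
     [7, 6, np.nan]]

# Column-wise mean imputation; np.nan marks the missing entries.
imp2 = SimpleImputer(missing_values=np.nan, strategy='mean')
imp2.fit(A)                      # learns the column means of A: 4, 1, 2
B_completed = imp2.transform(B)  # fills the diagonal of B with those means

print(imp2.statistics_)
print(B_completed)
```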

## 3. Create a transformer class

Every transformer class should 
- define an init method, named `__init__`
- define two methods, `fit` and `transform`
- inherit the `BaseEstimator` and `TransformerMixin` classes (supplied by scikit-learn) 

The `fit` method should return `self` and the `transform` method should return the transformed output.

The `fit_transform()` method is inherited from the `TransformerMixin` class. Calling `fit_transform` is equivalent to calling the `fit` method and then the `transform` method with the same input. For example, `imp.fit_transform(A)` is equivalent to `imp.fit(A).transform(A)`.
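
The equivalence can be checked on a small class that defines only `fit` and `transform`. This is a sketch under assumptions: `MeanSubtractor` and its `mean_` attribute are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanSubtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer; it does not define fit_transform itself."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

X = np.array([[1.0, 10.0], [3.0, 30.0]])

one_step = MeanSubtractor().fit_transform(X)    # supplied by TransformerMixin
two_step = MeanSubtractor().fit(X).transform(X) # explicit chaining

print(np.array_equal(one_step, two_step))
```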

The `get_params()` and `set_params()` methods are inherited from the `BaseEstimator` class and are useful for automatic hyperparameter tuning.
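
A short sketch of these two methods, using a hypothetical `Clipper` transformer (the class and its `lower` and `upper` parameters are assumptions made for illustration). `get_params()` reads the parameter names from the `__init__` signature, so each argument should be stored on an attribute of the same name.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Clipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that clips values to [lower, upper]."""

    def __init__(self, lower=0.0, upper=1.0):
        # Store each argument under its own name so get_params() can find it.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.clip(X, self.lower, self.upper)

clipper = Clipper(lower=-1.0)
params = clipper.get_params()   # dictionary with keys 'lower' and 'upper'
clipper.set_params(upper=5.0)   # updates the attribute and returns the object

print(params)
print(clipper.upper)
```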

### 3.1 Example 1: create `DataFrameSelector` class

The `DataFrameSelector` will transform a dataset by returning only a specified collection of columns from that dataset:
- The `__init__` method records the names of the columns to return
- The `fit` method does nothing, except return `self` which is required of all `fit` methods
- The `transform` method returns the specified columns of the dataset (input to the `transform` method)

Recall that transformers must inherit `TransformerMixin` and `BaseEstimator`. Import these classes:

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

Define the `DataFrameSelector` transformer class to select the desired attributes from the input dataset.

In [0]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
  def __init__(self, attribute_names):
    self.attribute_names = attribute_names
  def fit(self, X, y=None):
    return self
  def transform(self, X):
    return X[self.attribute_names]

In this class:
- The init method has one parameter, `attribute_names`. 
- The `fit` method returns `self` (this is standard behavior for `.fit` methods in scikit-learn).
- The `transform` method takes a dataset (a DataFrame) and returns a DataFrame that contains only the columns listed in `attribute_names`.
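
As written, `transform` returns `X[self.attribute_names]`, which is a pandas DataFrame. If a NumPy array is wanted instead, one option is a variant that returns `.values`. The sketch below is an assumption for illustration: `ArraySelector` is a hypothetical name and the `toy` DataFrame is made-up data, not the diamonds dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ArraySelector(BaseEstimator, TransformerMixin):
    """Hypothetical variant of DataFrameSelector that returns a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # .values drops the index and the column labels.
        return X[self.attribute_names].values

toy = pd.DataFrame({"carat": [0.23, 0.21],
                    "cut": ["Ideal", "Premium"],
                    "price": [326, 326]})

selected = ArraySelector(["carat", "price"]).fit_transform(toy)
print(type(selected).__name__)
print(selected.shape)
```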

The remainder of this section applies this transformer to the diamonds dataset.

Load the diamonds dataset as a Pandas DataFrame. Check the first five observations with column headers.

In [28]:
diamonds_pdf = pd.read_csv('/content/file-samples/diamonds.csv')
diamonds_pdf.head()

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Define the features to choose and store the list of feature names in the `variable_list` object.

In [0]:
variable_list=["carat", "depth", "table", "price"]

Create an instance of the `DataFrameSelector` class with the attribute names given in `variable_list`, and store the instance in the `selector` object.

In [0]:
selector=DataFrameSelector(variable_list)

Use the `selector` instance of the transformer class to transform the diamonds dataset.

In [31]:
selector.fit_transform(diamonds_pdf)

Unnamed: 0,carat,depth,table,price
0,0.23,61.5,55.0,326
1,0.21,59.8,61.0,326
2,0.23,56.9,65.0,327
3,0.29,62.4,58.0,334
4,0.31,63.3,58.0,335
5,0.24,62.8,57.0,336
6,0.24,62.3,57.0,336
7,0.26,61.9,55.0,337
8,0.22,65.1,61.0,337
9,0.23,59.4,61.0,338


The output of the `fit_transform` method (of `selector`) contains only the columns named in `variable_list`.

### 3.2 Example 2: `Scaler_MinMax` class

#### 3.2.1. Introduction to min-max scaling

The `MinMaxScaler` in scikit-learn transforms features by scaling each feature to a given range (by default between 0 and 1). For the default range, it applies the formula below to each feature:

$$\dfrac{x_i-\min(x)}{\max(x)-\min(x)}$$

where \\(x\\) refers to a column of data and \\(x_i\\) refers to the \\(i\\)-th value in that column.
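
As a quick arithmetic check of the formula, the sketch below applies it by hand to a single made-up column (the values in `x` are illustrative only).

```python
import numpy as np

x = np.array([2.0, 4.0, 10.0])      # one column of data

x_min = x.min()                     # 2
x_max = x.max()                     # 10
x_scaled = (x - x_min) / (x_max - x_min)

# (2-2)/8 = 0, (4-2)/8 = 0.25, (10-2)/8 = 1
print(x_scaled)
```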

Import the built-in transformer `MinMaxScaler` from the scikit-learn library.

In [0]:
from sklearn.preprocessing import MinMaxScaler

Create an instance of `MinMaxScaler` and store it in an object `minmax1`.

In [0]:
minmax1 = MinMaxScaler()

Create a sample array `a` with two features, the first ranging from 0 to 8 and the second from 1 to 9.

In [34]:
a = np.arange(10).reshape((5,2))
a

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

Fit the transformer object to the sample data `a` and return a transformed version of `a` with two features ranging from 0 to 1.

In [35]:
minmax1.fit_transform(a)

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [0.75, 0.75],
       [1.  , 1.  ]])

The following section defines a transformer class that performs the same function as the `MinMaxScaler` class.

#### 3.2.2. Build a `Scaler_MinMax` class

The transformer class `Scaler_MinMax` defined below performs the same transformation as `MinMaxScaler`.

In [0]:
class Scaler_MinMax(BaseEstimator, TransformerMixin):
  def __init__(self, axis=0):
    self.axis = axis
  def fit(self, X, y=None):
    data_min = np.nanmin(X, axis=self.axis)
    data_max = np.nanmax(X, axis=self.axis)
    data_range = data_max - data_min
    self.data_min_ = data_min
    self.data_max_ = data_max
    self.data_range_ = data_range
    return self
  def transform(self, X):
    X_scaled = (X - self.data_min_) / self.data_range_
    return X_scaled

In this class the transformer has one parameter, `axis`, with a default of `0`, which indicates that minimum and maximum values should be computed for each __column__. See lines `5` and `6` of the class definition, which use the `axis` attribute.

In the `fit` method, the class will:
- learn the minimum of an array by computing `np.nanmin(X, axis=self.axis)` 
- learn the maximum of an array by computing `np.nanmax(X, axis=self.axis)`
- compute the range between the minimum and maximum by computing `data_max - data_min`

Each of these calculations is then stored as an attribute of the object. Finally, the `fit` method returns `self`.

The `transform` method takes the input data `X` and uses the stored attributes to calculate:
- a scaled version `X_scaled` (of the input data `X`) by subtracting the minimum value and dividing by the range

The `transform` method then returns this scaled version of data `X_scaled`.
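
One case this class does not guard against is a constant column, where the range is 0 and the division produces `nan`. The sketch below shows one possible guard. It is an assumption, not part of the class above: `SafeScalerMinMax` is a hypothetical name, and replacing a zero range with 1 (so that constant columns map to 0) is a design choice made here for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SafeScalerMinMax(BaseEstimator, TransformerMixin):
    """Hypothetical min-max scaler that tolerates constant columns."""

    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, X, y=None):
        self.data_min_ = np.nanmin(X, axis=self.axis)
        self.data_max_ = np.nanmax(X, axis=self.axis)
        data_range = self.data_max_ - self.data_min_
        # Assumption: use 1 in place of a zero range to avoid dividing by zero.
        self.data_range_ = np.where(data_range == 0, 1, data_range)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.data_min_) / self.data_range_

c = np.array([[5.0, 1.0],
              [5.0, 3.0]])          # first column is constant

c_scaled = SafeScalerMinMax().fit_transform(c)
print(c_scaled)                     # constant column becomes 0, second column 0 and 1
```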

The same sample dataset `a` is used to demonstrate the `Scaler_MinMax` class:
1. Create an instance of `Scaler_MinMax` and store it in an object `minmax2`. 
1. Fit and transform the dataset `a` using this instance.

In [37]:
minmax2 = Scaler_MinMax()
minmax2.fit_transform(a)

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [0.75, 0.75],
       [1.  , 1.  ]])

After fitting the transformer class on the dataset `a`, the object `minmax2` has calculated and stored the three attributes:
- `data_min_` which is the minimum value of each feature
- `data_max_` which is the maximum value of each feature
- `data_range_` which is the range (`data_max_ - data_min_`) value of each feature

These attributes are displayed below.

In [38]:
minmax2.data_min_, minmax2.data_max_, minmax2.data_range_

(array([0, 1]), array([8, 9]), array([8, 8]))

Create another sample array `b` with two features, the first ranging from 0 to 4 and the second from 1 to 5.

In [39]:
b=np.arange(6).reshape(3,2)
b

array([[0, 1],
       [2, 3],
       [4, 5]])

Use the fitted transformer object `minmax2` to transform `b`. The scaled version of `b` has two features ranging from 0 to 0.5 rather than 0 to 1, because the minimum and range learned from `a` are reused.

In [40]:
minmax2.transform(b)

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ]])
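
The arithmetic behind this result can be reproduced by hand. The sketch below hard-codes the minimum and range that were learned from `a` (the variable names are illustrative) and contrasts them with statistics recomputed from `b` itself.

```python
import numpy as np

b = np.arange(6).reshape(3, 2)

# Statistics learned from `a` during fit: minimum (0, 1) and range (8, 8).
a_min = np.array([0, 1])
a_range = np.array([8, 8])
b_with_a_stats = (b - a_min) / a_range        # tops out at 4/8 = 0.5

# Statistics recomputed from `b`: minimum (0, 1) and range (4, 4).
b_min = b.min(axis=0)
b_range = b.max(axis=0) - b_min
b_with_own_stats = (b - b_min) / b_range      # spans the full 0 to 1 range

print(b_with_a_stats)
print(b_with_own_stats)
```

Reusing the statistics from `a` keeps `a` and `b` on the same scale, which is the reason `fit` and `transform` are separate steps.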

__The End__