# Discussion 08

### Due Saturday December 3rd, 11:59:59PM


# Scikit-Learn: Transformers, Estimators, and Pipelines

In [1]:
# import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# from discussion import *

In [3]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Transformers and Estimators

Scikit-learn includes two *base* modeling classes: Transformers and Estimators. Both are classes that are meant to be *fit* on (training) data and then later used to transform (or predict with) unseen data.

### Transformers

* Transformers take input data and transform it into output data via the `transform` method.
* Sometimes transformers need prior information (parameters) about the data before transforming it.
    - In this case, the transformer is *fit* using the `fit` method on training data to estimate the parameters.
    - Once fit, the transformer may then be applied to `test` data (or unseen, new data).
* Fit parameters are accessed via an instance variable that ends in an *underscore*.
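
As a rough illustration of the fit/transform pattern described above, here is a minimal sketch using `StandardScaler` on made-up numbers (the arrays `X_train` and `X_new` are assumptions for illustration, not part of the `cars` data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and "unseen" data with a single feature.
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)            # estimates the mean and scale from training data only

# Fit parameters are stored in attributes ending in an underscore.
print(scaler.mean_)            # mean of the training column
print(scaler.scale_)           # standard deviation of the training column

# The unseen data is transformed with the *training* statistics.
X_new_scaled = scaler.transform(X_new)
```

Note that `transform` never recomputes the mean from `X_new`; it reuses whatever was learned in `fit`.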

**Question 1** Using `Binarizer`, transform the `city-mpg` and `highway-mpg` columns to 0 if the mpg is less than or equal to 25 and 1 if it's greater than 25.

In [4]:
cars = pd.read_csv('data/cars.csv')

In [5]:
from sklearn.preprocessing import Binarizer

**Question 2** Using `FunctionTransformer`, transform the `city-mpg` and `highway-mpg` columns to a log-scale.

In [6]:
from sklearn.preprocessing import FunctionTransformer
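
Both `Binarizer` and `FunctionTransformer` are stateless: nothing is learned from the data during `fit`. The sketch below shows the general shape of their use on a made-up array `X` (an assumption for illustration, not the `cars` columns):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, FunctionTransformer

# Made-up values standing in for an mpg-like column.
X = np.array([[20.0], [25.0], [30.0]])

# Binarizer maps values <= threshold to 0 and values > threshold to 1.
binarized = Binarizer(threshold=25.0).fit_transform(X)

# FunctionTransformer wraps an ordinary function as a transformer.
logged = FunctionTransformer(np.log).fit_transform(X)
```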

Most transformers you will use must be *fit to training data* before they can be used. An example of this is one-hot encoding: before applying one-hot encoding to a column, you must determine the distinct values in the column (as the number of distinct values determines the number of columns in the output).

**Question 3** *(Fit transformers properly handle unseen values)*

1. One-hot encode the `body-style` column using `OneHotEncoder`. What is the dimension of the output? Which column corresponds to which value of `body-style`? (If you can't remember the attribute name, look up the documentation!)
1. Fit a `OneHotEncoder` on the *first 5 rows* of the `body-style` column. Use this 'training data' to one-hot-encode the `body-style` in the rest of the dataset. Why does it throw an exception? Look at the documentation -- what is the relevant parameter to avoid this? What are the implications of setting this parameter? What is the dimension of the output?

In [7]:
from sklearn.preprocessing import OneHotEncoder

As you observed, the categories of `OneHotEncoder` are learned from training data and saved as an attribute of the transformer. These categories are then used to transform new, unseen data which is then used by other pieces of the ML pipeline.
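
To make the learned-categories idea concrete, here is a small sketch on made-up data (the arrays `X_train` and `X_new` are assumptions for illustration). It uses `handle_unknown='ignore'`, the same option used in the cells further down:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column; 'wagon' never appears in the training rows.
X_train = np.array([['sedan'], ['hatchback'], ['sedan']])
X_new = np.array([['wagon'], ['sedan']])

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)

# Categories learned during fit (sorted), one array per input column.
print(ohe.categories_)

# Unseen categories are encoded as an all-zero row instead of raising.
encoded = ohe.transform(X_new).toarray()
```

The output always has one column per *training* category, regardless of which values show up in the new data.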

Below is an illustration of why *fitting* a transformer is so important:

*The dangers of `pd.get_dummies`*: Pandas offers its own one-hot encoder called `get_dummies`. This function is stateless; every time it's called, it determines the categories of the input data and one-hot encodes that data using those categories. However, as you saw in the above question, unseen data rarely contains exactly the categories of the training data, so the encoding of new data need not match the encoding the model was trained on!

To illustrate this:
1. We will create a one-hot encoding using scikit-learn that we will pass into a linear regression model.
1. We will create a *stateless* one-hot encoder that we will pass into a linear regression model.

Both of these models will be trained on the first 5 rows of the dataset; the rest will be used as 'unseen' data.

In [8]:
from sklearn.linear_model import LinearRegression

In [9]:
X_train, y_train = cars[['body-style']].head(5), cars['price'].head(5)
X_test, y_test = cars[['body-style']].tail(-5), cars['price'].tail(-5)

In [10]:
# Using sklearn transformers. What are the categories of the OneHotEncoder?
ohe = OneHotEncoder(handle_unknown='ignore')
lr = LinearRegression()

ohe.fit(X_train)
features = ohe.transform(X_train)
lr.fit(features, y_train)

LinearRegression()

In [11]:
# predict on the new data using the fit model:
lr.predict(ohe.transform(X_test))[:10]

array([15700. , 15700. , 15732.5, 15700. , 15700. , 15700. , 15700. ,
       15700. , 15700. , 15700. ])

In [12]:
# using a stateless one-hot encoder.
ohe = FunctionTransformer(pd.get_dummies, validate=False)
lr = LinearRegression()

ohe.fit(X_train)
features = ohe.transform(X_train)
lr.fit(features, y_train)

LinearRegression()

In [13]:
# debug this!
# lr.predict(ohe.transform(X_test))[:10]

**Note:** Even worse, there are cases where such an ML pipeline doesn't even throw an exception -- the incorrect columns get silently passed on. 

**The predictions below are wrong! Can you tell that they're wrong?** Debugging statistical output is notoriously hard -- this is why statistical analysis of the output data is always an important step to check your work!

In [14]:
X_train, y_train = cars[['body-style']].head(5), cars['price'].head(5)
X_test, y_test = cars[['body-style']].iloc[5:20], cars['price'].iloc[5:20]

In [15]:
ohe = FunctionTransformer(pd.get_dummies, validate=False)
lr = LinearRegression()

ohe.fit(X_train)
features = ohe.transform(X_train)
lr.fit(features, y_train)

LinearRegression()

In [16]:
# THIS IS WRONG! debug this! 
lr.predict(ohe.transform(X_test))[:10]

array([16500., 16500., 15700., 16500., 16500., 16500., 16500., 16500.,
       16500., 16500.])

# Building custom Transformers

Building your own transformer class is easy! There are convenient base classes you can inherit from: `TransformerMixin` and `BaseEstimator`. To subclass these classes, you need to:
1. implement the `fit` method, and
2. implement the `transform` method.

Once you have done that, you can use your custom transformer as part of a `Pipeline` that can leverage all the nice features of scikit-learn (like feature-selection libraries and cross-validation).
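
A minimal sketch of this recipe follows. The class `MeanCenterer` and its attribute `means_` are hypothetical names invented for illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: subtracts the training column means."""

    def fit(self, X, y=None):
        # Learn the per-column means from the training data only.
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X, y=None):
        # Apply the stored training means to whatever data is passed in.
        return np.asarray(X, dtype=float) - self.means_

# Made-up data for illustration.
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 50.0]])

centerer = MeanCenterer().fit(X_train)
centered_new = centerer.transform(X_new)                 # uses the means of X_train
centered_train = MeanCenterer().fit_transform(X_train)   # provided by TransformerMixin
```

Returning `self` from `fit` is what allows chaining such as `MeanCenterer().fit(X_train).transform(X_new)`.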

To get acquainted with the structure of a transformer, it's useful to look at `sklearn` source code. First, you will walk through the source code of the `Binarizer` transformer ([source code](https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/preprocessing/data.py#L1789)).

The source code is included below. Note that there is a lot of boilerplate code, but the *transform* method is the relevant method. (Why does the fit method do nothing?)

```
class Binarizer(BaseEstimator, TransformerMixin):
    """Binarize data (set feature values to 0 or 1) according to a threshold
    Values greater than the threshold map to 1, while values less than
    or equal to the threshold map to 0. With the default threshold of 0,
    only positive values map to 1.
    Binarization is a common operation on text count data where the
    analyst can decide to only consider the presence or absence of a
    feature rather than a quantified number of occurrences for instance.
    It can also be used as a pre-processing step for estimators that
    consider boolean random variables (e.g. modelled using the Bernoulli
    distribution in a Bayesian setting).
    Read more in the :ref:`User Guide <preprocessing_binarization>`.
    Parameters
    ----------
    threshold : float, optional (0.0 by default)
        Feature values below or equal to this are replaced by 0, above it by 1.
        Threshold may not be less than 0 for operations on sparse matrices.
    copy : boolean, optional, default True
        set to False to perform inplace binarization and avoid a copy (if
        the input is already a numpy array or a scipy.sparse CSR matrix).
    Examples
    --------
    >>> from sklearn.preprocessing import Binarizer
    >>> X = [[ 1., -1.,  2.],
    ...      [ 2.,  0.,  0.],
    ...      [ 0.,  1., -1.]]
    >>> transformer = Binarizer().fit(X)  # fit does nothing.
    >>> transformer
    Binarizer(copy=True, threshold=0.0)
    >>> transformer.transform(X)
    array([[1., 0., 1.],
           [1., 0., 0.],
           [0., 1., 0.]])
    Notes
    -----
    If the input is a sparse matrix, only the non-zero values are subject
    to update by the Binarizer class.
    This estimator is stateless (besides constructor parameters), the
    fit method does nothing but is useful when used in a pipeline.
    See also
    --------
    binarize: Equivalent function without the estimator API.
    """

    def __init__(self, threshold=0.0, copy=True):
        self.threshold = threshold
        self.copy = copy

    def fit(self, X, y=None):
        """Do nothing and return the estimator unchanged
        This method is just there to implement the usual API and hence
        work in pipelines.
        Parameters
        ----------
        X : array-like
        """
        check_array(X, accept_sparse='csr')
        return self

    def transform(self, X, copy=None):
        """Binarize each element of X
        Parameters
        ----------
        X : {array-like, sparse matrix}, shape [n_samples, n_features]
            The data to binarize, element by element.
            scipy.sparse matrices should be in CSR format to avoid an
            un-necessary copy.
        copy : bool
            Copy the input X or not.
        """
        copy = copy if copy is not None else self.copy
        return binarize(X, threshold=self.threshold, copy=copy)

    def _more_tags(self):
        return {'stateless': True}
```

The relevant (portion of the) function `binarize` from the transform method is here:

```
def binarize(X, threshold=0.0, copy=True):
    ...
    cond = X > threshold
    not_cond = np.logical_not(cond)
    X[cond] = 1
    X[not_cond] = 0
    return X
```

**Question 4** 

As a warm-up, create a transformer that drops the `i`th column of an input dataset.

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):

    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        """
        Does nothing! Stateless!
        """
        return self

    def transform(self, X, y=None):
        """
        Drops the ith column of X, where i=index
        """
        
        return ...

In [18]:
cd = ColumnDropper(index=3)
cd.transform(cars.iloc[:,:5])

Ellipsis

**Question 5**

Columns that don't have much variation are not very useful for prediction. An extreme case is when a column has only a single value. Create a "feature selection" transformer that drops any columns whose standard deviation is less than an input threshold.

* What needs to be calculated during the fitting process? When is the standard deviation calculated?

In [19]:
class LowStdColumnDropper(BaseEstimator, TransformerMixin):

    def __init__(self, thresh=0):
        '''
        Drops columns whose standard deviation is less than thresh.
        '''
        self.thresh = thresh

    def fit(self, X, y=None):
        """
        ...
        """
        # self.columns_ = ...
        # return self
        
        # BEGIN SOLUTION
        self.columns_ = (X.std(axis=0) >= self.thresh).values
        
        return self
        # END SOLUTION

    def transform(self, X, y=None):
        """
        
        ...
        """
        # BEGIN SOLUTION
        if isinstance(X, np.ndarray):
            out = X[:, self.columns_]
        else:
            out = X.loc[:, self.columns_]
        
        return out
        # END SOLUTION

In [20]:
lvd = LowStdColumnDropper(thresh=10.0)
lvd.fit(cars.select_dtypes('number'))
lvd.transform(cars.select_dtypes('number'))

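|     | length | curb-weight | engine-size | horsepower | peak-rpm | price   |
|-----|--------|-------------|-------------|------------|----------|---------|
| 0   | 168.8  | 2548        | 130         | 111        | 5000     | 13495.0 |
| 1   | 168.8  | 2548        | 130         | 111        | 5000     | 16500.0 |
| 2   | 171.2  | 2823        | 152         | 154        | 5000     | 16500.0 |
| 3   | 176.6  | 2337        | 109         | 102        | 5500     | 13950.0 |
| 4   | 176.6  | 2824        | 136         | 115        | 5500     | 17450.0 |
| ... | ...    | ...         | ...         | ...        | ...      | ...     |
| 188 | 188.7  | 2952        | 141         | 114        | 5400     | 16845.0 |
| 189 | 188.7  | 3049        | 141         | 160        | 5300     | 19045.0 |
| 190 | 188.7  | 3012        | 173         | 134        | 5500     | 21485.0 |
| 191 | 188.7  | 3217        | 145         | 106        | 4800     | 22470.0 |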


In [21]:
# Do not modify this cell, it is needed for the doctests to work
cars = pd.read_csv('data/cars.csv').select_dtypes('number')
out = LowStdColumnDropper(thresh=0.25)
out1 = LowStdColumnDropper(thresh=1.0)
out2 = LowStdColumnDropper(thresh=10.0)

In [22]:
import doctest
doctest.run_docstring_examples(LowStdColumnDropper.transform, {'LowStdColumnDropper.transform': LowStdColumnDropper.transform,
                                                              'LowStdColumnDropper': LowStdColumnDropper,
                                                              'pd': pd})

In [23]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'doctest 1'
""" # END TEST CONFIG
out.fit_transform(cars).shape[0] == cars.shape[0]

True

In [24]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'doctest 2'
""" # END TEST CONFIG
out.fit_transform(cars).shape[1] <= cars.shape[1]

True

In [25]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'correct number of columns dropped for thresh=0.25'
""" # END TEST CONFIG
out.fit_transform(cars).shape[1] == 14

True

In [26]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'correct number of columns dropped for thresh=1.0'
""" # END TEST CONFIG
out1.fit_transform(cars).shape[1] == 12

True

In [27]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'correct number of columns dropped for thresh=10.0'
""" # END TEST CONFIG
out2.fit_transform(cars).shape[1] == 6

True

# Scikit-Learn Pipelines

Pipelines are ways of chaining transformers and estimators together. 
* A **Pipeline** object is a sequence of transformers that may end with an estimator.
* When you call `.fit(X, y)` on a pipeline, the pipeline calls `fit_transform` on each successive transformer before the final step, passing the transformed data to the next step in the sequence. The final step is then fit on the fully transformed data.
* A pipeline that consists of a sequence of transformers is itself a transformer.
* A pipeline that consists of a sequence of transformers, followed by an estimator, is itself an estimator.
    - These observations allow you to put pipelines inside of other pipelines!
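
The points above can be sketched on made-up data as follows. The step names `'scale'` and `'lin_reg'` and the arrays `X` and `y` are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Made-up data with an exact linear relationship y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

pl = Pipeline([
    ('scale', StandardScaler()),      # transformer step
    ('lin_reg', LinearRegression()),  # final estimator step
])

pl.fit(X, y)   # fit_transform on 'scale', then fit on 'lin_reg'
preds = pl.predict(np.array([[4.0]]))

# Fitted steps can be inspected by name.
scaler_mean = pl.named_steps['scale'].mean_
```

Each step is a `(name, object)` tuple; the names are what `named_steps` uses as keys.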

**Question 6** Create a *single* pipeline that consists of a `OneHotEncoder`, followed by a `LowStdColumnDropper`, followed by a `LinearRegression` model. That is, one-hot encode the categorical features, drop the low-variance columns, and use those features to fit a linear regression model. Your threshold for "small standard deviation" should be `0.1`. 

*Remark:* Be sure to pass `sparse=False` into `OneHotEncoder` (in scikit-learn 1.2 and later this parameter is named `sparse_output`), or else make your `LowStdColumnDropper` handle sparse matrices.

How many columns did your pipeline drop? (Use the `named_steps` attribute!)



In [28]:
from sklearn.pipeline import Pipeline

### Putting columns together using `ColumnTransformer`

You can run many different transformers in parallel on different subsets of columns using `ColumnTransformer` (see the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer)).

One common pattern for using `ColumnTransformer` is to create a transformer pipeline for each kind of data in your dataset and use `ColumnTransformer` to put the transformed features together into a single dataset.
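
A small sketch of this pattern on a made-up frame follows. The column names, the transformer names `'quant'` and `'nomin'`, and the choice of `StandardScaler` are assumptions for illustration. Setting `sparse_threshold=0` simply forces a dense array so the result is easy to inspect:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one quantitative and one nominal column.
df = pd.DataFrame({
    'weight': [1000.0, 2000.0, 3000.0],
    'style': ['sedan', 'wagon', 'sedan'],
})

ct = ColumnTransformer(
    [
        ('quant', StandardScaler(), ['weight']),
        ('nomin', OneHotEncoder(handle_unknown='ignore'), ['style']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# One scaled column followed by one column per learned category.
features = ct.fit_transform(df)

# Fitted sub-transformers are available by name after fitting.
learned_categories = ct.named_transformers_['nomin'].categories_
```

Any of the transformer slots could hold a full `Pipeline` instead of a single transformer.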

**Question 7**

Create a general pipeline that transforms each kind of data with an appropriate generic transformer (e.g. a one-hot encoder or an ordinal encoder) by filling in the `...` below:

In [29]:
from sklearn.compose import ColumnTransformer

In [30]:
quantitative_cols = ...
ordinal_cols = ...
nominal_cols = ...

In [31]:
# quantitative_pipeline = Pipeline([...])
# ordinal_pipeline = Pipeline([...])
# nominal_pipeline = Pipeline([...])

In [32]:
# feature_eng_pipeline = ColumnTransformer([
#     ('quant', quantitative_pipeline, quantitative_cols),
#     ('ordin', ordinal_pipeline, ordinal_cols),
#     ('nomin', nominal_pipeline, nominal_cols)
# ])

In [33]:
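# Pipeline steps must be (name, object) tuples; the step names here are placeholders.
# pl = Pipeline([('feature_eng', feature_eng_pipeline), ('lin_reg', LinearRegression())])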

**Question 8** 

Fit the pipeline and predict with it. Check the outputs at each step using `named_steps`.

# Finished? Turn in Question 5 for Grading

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file and not here by running the doctests: `python -m doctest discussion.py`.