In [1]:
import seaborn as sns
sns.set()

In [2]:
import expectexception  #  Used to specify if an exception *should* occur (for notebook testing purposes)

<!-- requirement: images/PML_feature_union.svg -->
<!-- requirement: small_data/DC_properties.csv -->

# Scikit-learn Workflow

While scikit-learn provides a number of powerful, built-in transformers and predictors, we sometimes need custom functionality, especially for preprocessing and data wrangling. Real-world data rarely comes to us in a clean, ready-to-use format.

In this notebook we will demonstrate how to take data in a format that is unsuitable for scikit-learn and transform it into the 2D matrix that scikit-learn predictors expect. We'll be working with a data set that consists of house prices in the Washington, D.C. area, together with various features of those houses. This data is stored as a CSV file.

> **Note:** We obtained this data from [this website](https://www.kaggle.com/christophercorrea/dc-residential-properties) under a [Creative Commons CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), and have modified it somewhat before making it available here.

The next block of code reads the CSV file and loads it into a pandas DataFrame object. One of the columns contains a date/time string, so we tell pandas to parse it as such when loading the CSV file.

In [18]:
import pandas as pd
import numpy as np

data = pd.read_csv('small_data/DC_properties.csv', parse_dates=['SALEDATE'], low_memory=False,)

Examining the first element in the data set.

In [4]:
data.head(1)

Unnamed: 0,PRICE,ROOMS,BATHRM,HF_BATHRM,BEDRM,KITCHENS,FIREPLACES,LANDAREA,EYB,SALEDATE,SALE_NUM,HEAT,AC,QUALIFIED,SOURCE,ZIPCODE,LATITUDE,LONGITUDE,ASSESSMENT_NBHD,QUADRANT
0,1095000.0,8,4,0,4,2.0,5,1680,1972,2003-11-25,1,Warm Cool,Y,Q,Residential,20009,38.91468,-77.040832,Old City 2,NW


Notice that for every observation in the data set we have several features of the house, some categorical and some numeric.

In [5]:
y = data['PRICE'].values
X = data.loc[:, 'ROOMS':]

In [6]:
X.head(3)

Unnamed: 0,ROOMS,BATHRM,HF_BATHRM,BEDRM,KITCHENS,FIREPLACES,LANDAREA,EYB,SALEDATE,SALE_NUM,HEAT,AC,QUALIFIED,SOURCE,ZIPCODE,LATITUDE,LONGITUDE,ASSESSMENT_NBHD,QUADRANT
0,8,4,0,4,2.0,5,1680,1972,2003-11-25,1,Warm Cool,Y,Q,Residential,20009,38.91468,-77.040832,Old City 2,NW
1,9,3,1,5,2.0,4,1680,1984,2016-06-21,3,Hot Water Rad,Y,Q,Residential,20009,38.914684,-77.040678,Old City 2,NW
2,8,3,1,5,2.0,3,1680,1984,2006-07-12,1,Hot Water Rad,Y,Q,Residential,20009,38.914683,-77.040629,Old City 2,NW


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98216 entries, 0 to 98215
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   ROOMS            98216 non-null  int64         
 1   BATHRM           98216 non-null  int64         
 2   HF_BATHRM        98216 non-null  int64         
 3   BEDRM            98216 non-null  int64         
 4   KITCHENS         97979 non-null  float64       
 5   FIREPLACES       98216 non-null  int64         
 6   LANDAREA         98216 non-null  int64         
 7   EYB              98216 non-null  int64         
 8   SALEDATE         98216 non-null  datetime64[ns]
 9   SALE_NUM         98216 non-null  int64         
 10  HEAT             98216 non-null  object        
 11  AC               98216 non-null  object        
 12  QUALIFIED        98216 non-null  object        
 13  SOURCE           98216 non-null  object        
 14  ZIPCODE          98216 non-null  int64

## Remark

I am going to be discussing various scikit-learn classes in this notebook. It is highly recommended to spend some time reading the [documentation](https://scikit-learn.org/stable/index.html) on new classes as you encounter them for the first time.  The scikit-learn documentation is very thorough, informative, and has good examples included in it.  

## Categorical features

Let us focus first on some of the categorical features from the data set. These include columns such as `HEAT`, `AC` and `SOURCE`.  `HEAT` specifies the type of heating equipment that is installed in the house.

In [8]:
sorted(X['HEAT'].unique())

['Air Exchng',
 'Air-Oil',
 'Elec Base Brd',
 'Electric Rad',
 'Evp Cool',
 'Forced Air',
 'Gravity Furnac',
 'Hot Water Rad',
 'Ht Pump',
 'Ind Unit',
 'No Data',
 'Wall Furnace',
 'Warm Cool',
 'Water Base Brd']

In [9]:
from sklearn.preprocessing import OrdinalEncoder

le = OrdinalEncoder()
le.fit_transform(X[['HEAT', 'AC']])[:10]

array([[12.,  2.],
       [ 7.,  2.],
       [ 7.,  2.],
       [ 7.,  2.],
       [ 7.,  2.],
       [12.,  2.],
       [12.,  2.],
       [12.,  2.],
       [ 7.,  2.],
       [ 7.,  2.]])

In [10]:
from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder()
ohe.fit_transform(X[['HEAT', 'AC', 'SOURCE']])

<98216x19 sparse matrix of type '<class 'numpy.float64'>'
	with 294648 stored elements in Compressed Sparse Row format>

In [11]:
#  Avoid calling .toarray() on a full-size sparse matrix: with many one-hot columns the dense result can exhaust memory and crash your kernel!

ohe.fit_transform(X[['HEAT', 'AC', 'SOURCE']])[0:3, :].toarray()

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1.]])

We shall see very soon how to use the `ColumnTransformer` to apply several transformers to different columns and put the results together into a feature matrix. For now, let's use a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that one-hot encodes the categorical features and feeds them to a regressor.  We have added further categorical features below to the ones mentioned above.  

Note that even though `ZIPCODE` is stored as a numeric value, we encode it as a categorical feature: there is no reason that an increase in the value of `ZIPCODE` should correspond to an increase or decrease in the house price.  
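
To make the contrast between the two encoders used above concrete, here is a minimal sketch on a made-up three-row frame (the column name `heat` and its values are illustrative and not taken from the D.C. data). `OrdinalEncoder` maps each category to an integer, which a linear model would read as an ordering, while `OneHotEncoder` gives each category its own indicator column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({'heat': ['Forced Air', 'Hot Water Rad', 'Forced Air']})

# Categories are sorted and then numbered: 'Forced Air' -> 0, 'Hot Water Rad' -> 1.
ordinal = OrdinalEncoder().fit_transform(toy)

# One indicator column per category; .toarray() is harmless on a matrix this small.
one_hot = OneHotEncoder().fit_transform(toy).toarray()

print(ordinal.ravel())   # [0. 1. 0.]
print(one_hot)           # rows: [1, 0], [0, 1], [1, 0]
```

With the ordinal encoding, a linear model must assign a price effect proportional to the integer code; with the one-hot encoding, each category gets its own independent coefficient.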

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

categorical_columns = ['HEAT', 'AC', 'SOURCE', 'QUALIFIED', 'ZIPCODE', 'ASSESSMENT_NBHD']

est = Pipeline([
    ('categorical', OneHotEncoder()),
    ('regressor', Ridge())
])

est.fit(X[categorical_columns], y)

Pipeline(steps=[('categorical', OneHotEncoder()), ('regressor', Ridge())])

We can get predictions in the usual fashion by feeding in appropriate data. Since we trained on only the categorical columns, we feed in only the categorical columns when we make predictions.  

In [13]:
est.predict(X[categorical_columns])[:10]

array([  -4496.41871641,  -29127.27267785,  -29127.27267785,
        -29127.27267785,  -29127.27267785,   -4496.41871641,
         -4496.41871641,   -4496.41871641,  -29127.27267785,
       1796722.49619058])

We see that our model isn't very good: it predicts negative prices for some houses. There can be many reasons for this. Two likely ones are that we are using only the categorical features of the data, and that we are fitting a linear model when the features may have inherently non-linear relationships to the price.  

In [14]:
print(f'R^2 score using selected columns and transformers: {est.score(X[categorical_columns], y)}')

R^2 score using selected columns and transformers: 0.15178909180404088


We have some "signal" from the categorical variables alone.  Adding numeric features should give us a better score.  

There are many other encoding methods, but one-hot encoding is the most common and is usually a fine choice. If you're interested in alternatives, you may want to look at the [category encoder package](http://contrib.scikit-learn.org/category_encoders/).

## Combining features using `ColumnTransformer`

We have categorical features above, and we know how to one-hot encode them. How can we combine them with numeric features such as `LANDAREA`, `ROOMS`, and `BEDRM`?  

The [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) lets us combine the outputs of transformers applied to different columns into a single feature matrix.  `ColumnTransformer` works similarly to `Pipeline`, but we give tuples of three elements: the first is a "name" for the step, the second is a transformer, and the third is a list of columns to which the transformer should apply.  If the input matrix is a DataFrame, we can use the names of the columns.  If the input is a numpy array (or also a DataFrame), we can specify columns by numeric indices.  If we want to just "pass through" the values of the columns, i.e. get the values without transforming them, we can use the string `'passthrough'` in place of a transformer.  

Here's how we can use `ColumnTransformer` to combine the one-hot encoded categorical features above with the (unaltered) values of several numeric features we have selected.  Note that the output is always a numpy array or sparse array regardless of the format of the input feature matrix.  Also, the order of the columns in the output is governed by the order of the transformers and specified columns.  

In [15]:
from sklearn.compose import ColumnTransformer

numeric_columns = ['ROOMS', 'BATHRM', 'HF_BATHRM', 'BEDRM', 'EYB', 'FIREPLACES', 
                   'LANDAREA', 'SALE_NUM', 'LATITUDE', 'LONGITUDE']

features = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numeric', 'passthrough', numeric_columns)
])

est = Pipeline([
    ('features', features),
    ('regressor', Ridge())
])

est.fit(X, y)

print(f'R^2 score using selected columns and transformers: {est.score(X, y)}')

R^2 score using selected columns and transformers: 0.14272254717554334
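
To see the column ordering and the `'passthrough'` behaviour in isolation, here is a small sketch on a made-up frame (the column names `rooms`, `ac`, and `unused` are hypothetical). It sets `sparse_threshold=0` so that the result is a dense array that is easy to print; that choice is ours for readability and is not used in the pipeline above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'rooms': [3, 5, 4],
    'ac': ['Y', 'N', 'Y'],
    'unused': [0.1, 0.2, 0.3],   # not listed in any transformer, so it is dropped
})

ct = ColumnTransformer(
    [
        ('categorical', OneHotEncoder(), ['ac']),
        ('numeric', 'passthrough', ['rooms']),
    ],
    sparse_threshold=0,   # always return a dense array (for display only)
)

out = ct.fit_transform(toy)

# Output columns follow the transformer order:
# indicator for 'N', indicator for 'Y', then the untouched 'rooms' values.
print(out)
```

Even though `rooms` comes first in the DataFrame, it ends up last in the output because the `'numeric'` step is listed after the `'categorical'` step.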


**Remark:**  Even though we have used only a subset of the columns in building the new feature matrix, because of the way the `ColumnTransformer` is implemented, we will receive an error if we feed in a feature matrix that has fewer columns than the training data, even if all the columns used by the transformers are present.  

In [16]:
used_columns = categorical_columns + numeric_columns
print(used_columns)

['HEAT', 'AC', 'SOURCE', 'QUALIFIED', 'ZIPCODE', 'ASSESSMENT_NBHD', 'ROOMS', 'BATHRM', 'HF_BATHRM', 'BEDRM', 'EYB', 'FIREPLACES', 'LANDAREA', 'SALE_NUM', 'LATITUDE', 'LONGITUDE']


In [17]:
%%expect_exception ValueError

est.predict(X[used_columns])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-c368f5d05b46> in <module>
----> 1 est.predict(X[used_columns])

~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs

## Building custom scikit-learn classes

We can build custom classes as extensions of existing scikit-learn classes. As we have already seen, there are two main types of classes in scikit-learn: predictors and transformers. 

Predictors, such as `LinearRegression` or `RandomForestClassifier`, tend to represent the final step in a machine learning model, such as performing regression or classification.

Transformers, such as `StandardScaler`, are steps for preprocessing and transforming data, and in a pipeline they precede predictors.

### Transformers

All transformers in scikit-learn support `fit` and `transform` methods and implement the following interface:
``` python
class Transformer(BaseEstimator, TransformerMixin):
  def __init__(self, ...):
    # initialization code
    
  def fit(self, X, y=None):
    # fit the transformer
    return self
  
  def transform(self, X):
    # apply the transformation
    return ...
```

Note that for transformers, `fit` is often empty and only `transform` actually does something. In general, the `fit` method contains the code for the transformer to learn parameters from the training data that can then be used during the data transformation process. Note, all transformers need to return `self` in the `fit` method to be compatible with the scikit-learn infrastructure.

Conforming to the convention outlined here has a clear benefit: many tools (e.g. pipelines, cross-validation, grid search) rely on this interface, so you can use your new transformers with the existing scikit-learn infrastructure. 
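
As a concrete instance of the template, here is a minimal sketch of a transformer whose `fit` actually learns something. The class `ColumnClipper` and its parameters are hypothetical names invented for this illustration; it learns per-column quantiles during `fit` and clips to them during `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned during fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore, following scikit-learn convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)

X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # made-up values
clipped = ColumnClipper(lower=0.0, upper=0.75).fit_transform(X_toy)
print(clipped.ravel())   # [1. 2. 3. 4. 4.]
```

Because the class inherits from `TransformerMixin`, `fit_transform` comes for free, and because it inherits from `BaseEstimator`, the constructor arguments are exposed through `get_params`, which is what grid search relies on.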

### Predictors

All predictors (e.g. `LinearRegression`, `DecisionTreeRegressor`, etc ...) support `fit` and `predict` methods.  You can build your own predictor, for example, for regression by using the following template. 
``` python                                                                                                                                        
class Predictor(BaseEstimator, RegressorMixin):
  def __init__(self, ...):
    # initialization code
  
  def fit(self, X, y):
    # fit the model ...
    return self
    
  def predict(self, X):
    # make predictions 
    return ...

  def score(self, X, y):
    # custom score implementation
    # this is optional, if not defined default is R^2
    return ...
```
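
A minimal sketch of the predictor template in action follows. `MeanRegressor` is a hypothetical baseline written for this illustration: it ignores the features and always predicts the mean of the training targets. It does not define `score`, so it inherits the default R^2 score from `RegressorMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical baseline: always predict the mean of the training targets."""

    def fit(self, X, y):
        # The only thing "learned" is the average target value.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One prediction per row of X, all equal to the stored mean.
        return np.full(len(X), self.mean_)

X_toy = np.arange(6).reshape(-1, 1)                  # made-up features
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # made-up targets

baseline = MeanRegressor().fit(X_toy, y_toy)
predictions = baseline.predict(X_toy[:2])
r2 = baseline.score(X_toy, y_toy)

print(predictions)   # [3.5 3.5]
print(r2)            # 0.0 on the training data
```

An R^2 of zero is exactly what "predicting the mean" scores on the data it was fitted to, which makes a baseline like this a handy reference point for the scores reported elsewhere in this notebook.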

## Building a custom transformer using LATITUDE and LONGITUDE

We included `LATITUDE` and `LONGITUDE` in our numeric features above.  While they are certainly continuous features, so worthy to include there, we can also explore other uses of these two values.  For example, is the distance to the White House relevant to the sales price of a house?  Does it affect the price of the house if there is a nearby airport?  Georgetown is a rather affluent part of Washington, D.C., is home to Georgetown University, and the port of Georgetown predated the establishment of Washington, D.C. by 40 years or so.  Columbia Heights is known for its diversity, major retailers, and is home to Howard University.  And so forth.  

Let's build a custom transformer that will take in some latitude/longitude pairs and compute distances ("as the crow flies") between the `LATITUDE` and `LONGITUDE` data and the given pair(s), allowing us to engineer distances from any point that we want to do so.  

To define a custom transformer, we follow the outline above by defining the `__init__`, `fit` and `transform` methods.  
The `__init__` method just needs to store the data supplied in the constructor, while the `fit` method doesn't need to "learn" anything and so only needs to `return self`.  

The `transform` method will do the actual job of computing the distances.  We assume that the input to the constructor is a list of lists/tuples, and the feature matrix the transformer receives is in the form of a DataFrame.

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin

class DistanceTransformer(BaseEstimator, TransformerMixin):
    """
    Computes distances from each row of a feature matrix to a set of fixed locations.
        locations is a list of tuples/lists of latitude/longitude pairs
    Assumes the feature matrix X is a DataFrame with columns 'LATITUDE' and 'LONGITUDE'
    Returns a numpy array with one column per location, holding the Euclidean distance
    (in degrees) between that location and each lat/long pair in X
    """
    def __init__(self, locations):
        self.locations = locations
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        lat_long = X[['LATITUDE', 'LONGITUDE']].values
        result = []
        for place in self.locations:
            result.append(np.sqrt(((lat_long-place)**2).sum(axis=1)).reshape(-1,1))
            
        return np.hstack(result)

Let's pick a few places to use in our `DistanceTransformer` and try it out.

In [20]:
COLUMBIA_HEIGHTS = (38.922741, -77.046356)
GEORGETOWN_UNIVERSITY = (38.912180, -77.076080)
DULLES_AIRPORT = (38.944444, -77.455833)
WHITE_HOUSE = (38.897957, -77.036560)

locations = [COLUMBIA_HEIGHTS, GEORGETOWN_UNIVERSITY, DULLES_AIRPORT, WHITE_HOUSE]

dist = DistanceTransformer(locations)

dist.fit_transform(X[['LATITUDE','LONGITUDE']][:5])

array([[0.00977192, 0.03533652, 0.41606692, 0.01726025],
       [0.00985719, 0.03549053, 0.41622045, 0.01722594],
       [0.00988622, 0.03553923, 0.41626928, 0.01721338],
       [0.01071615, 0.03642903, 0.41720663, 0.01667518],
       [0.00962061, 0.03609586, 0.4167171 , 0.01781187]])
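
Note that the values above are Euclidean distances measured in degrees, not physical distances: at the latitude of Washington, D.C., a degree of longitude is shorter than a degree of latitude. If distances in kilometres are wanted, one alternative (not what the transformer above does) is the haversine great-circle formula. The sketch below is a standalone illustration; the function name `haversine_km` and the Earth radius of 6371 km are our assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

WHITE_HOUSE = (38.897957, -77.036560)
DULLES_AIRPORT = (38.944444, -77.455833)

distance = haversine_km(*WHITE_HOUSE, *DULLES_AIRPORT)
print(round(float(distance), 1))   # distance in km
```

Because the function uses only NumPy operations, it also accepts whole arrays of latitudes and longitudes, so it could be dropped into the loop inside `transform` if kilometre distances were preferred.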

And then we can add it to our feature matrix and try our regression one more time.  

In [21]:
features = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numeric', 'passthrough', numeric_columns),
    ('distances', DistanceTransformer(locations), ['LATITUDE', 'LONGITUDE'])
])

est = Pipeline([
    ('features', features),
    ('regressor', Ridge())
])

est.fit(X,y)

print(f'R^2 score using selected columns and transformers: {est.score(X, y)}')

R^2 score using selected columns and transformers: 0.14577596541515392


The extra features are only giving us a slightly better performance in this case.  

## Using `FunctionTransformer` to build stateless transformers

If we want to use the `SALEDATE` in a linear model, we will also need to translate this into a numeric value as this field is a date/time.  To do this transformation, we can make use of scikit-learn's [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html).  As the documentation states, `FunctionTransformer`  is designed to "construct a transformer from an arbitrary callable".  This is useful for a "stateless" transformation such as taking the logarithm of a column, performing a custom scaling method, etc.  
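
As a quick illustration of the "logarithm of a column" case, here is a minimal sketch that wraps `np.log1p` in a `FunctionTransformer`. The input values are made up, and supplying `inverse_func` is optional; we include it only to show that the transformation can be undone.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x); expm1 is its inverse.
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

areas = np.array([[0.0], [99.0], [9999.0]])   # made-up values
logged = log_transformer.fit_transform(areas)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel())     # log(1), log(100), log(10000)
print(restored.ravel())   # back to the original values (up to rounding)
```

Nothing is learned in `fit` here, which is what "stateless" means: the same function is applied to whatever data arrives.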

Each element in the `SALEDATE` column is a pandas [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp) object, which is the pandas equivalent of Python's [`datetime`](https://docs.python.org/3/library/datetime.html) object.

In [22]:
X.loc[0, 'SALEDATE']

Timestamp('2003-11-25 00:00:00')

We will want to convert these timestamps to the so-called "Unix epoch time", or the number of seconds since midnight Jan 1, 1970.  We could use any "zero reference" point in time, but this will serve our purposes for use in a linear regressor. There are several approaches to convert a pandas `Timestamp` to Unix epoch time. We'll use the method detailed in the official pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#from-timestamps-to-epoch).

In [23]:
(X['SALEDATE'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

0        1069718400
1        1466467200
2        1152662400
3        1267142400
4        1317254400
            ...    
98211    1257984000
98212    1428019200
98213    1380844800
98214    1222732800
98215    1428969600
Name: SALEDATE, Length: 98216, dtype: int64
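
As a sanity check on this conversion, the sketch below applies the same expression to two hand-typed dates (the first two sale dates shown earlier) and then converts the resulting integers back with `pd.to_datetime(..., unit='s')`. It is a standalone illustration and does not touch the property data.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2003-11-25', '2016-06-21']))

# Same expression as above: whole seconds elapsed since the Unix epoch.
epoch = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

# Round trip: interpret the integers as seconds since the epoch again.
recovered = pd.to_datetime(epoch, unit='s')

print(epoch.tolist())   # [1069718400, 1466467200]
print(recovered.tolist())
```

The integers match the first two rows of the output above, and the round trip returns the original dates.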

To use the `FunctionTransformer`, we need to define a function that takes a collection of `datetime` objects and returns an "array-like" structure of corresponding epoch times.  We will take advantage of the fact that we know the input will be a pandas Series of `datetime` objects, as we will use the result inside of a `ColumnTransformer` to add this to our feature matrix.

In [24]:
def to_epoch(series_of_times):
    """
    Assumes the input is a pandas Series of datetime objects.
    Returns a numpy array of Unix epoch times, measured as seconds since midnight Jan 1, 1970.
    """
    return ((series_of_times - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')).values.reshape(-1, 1)

A quick check that it works the way we need it to work...

In [25]:
to_epoch(X['SALEDATE'][0:5])  # seconds since midnight Jan 1, 1970

array([[1069718400],
       [1466467200],
       [1152662400],
       [1267142400],
       [1317254400]])

And making it into a transformer we can use in a pipeline, with a small test first.  

In [26]:
from sklearn.preprocessing import FunctionTransformer

date_transformer = FunctionTransformer(to_epoch)
date_transformer.transform(X['SALEDATE'][0:5])

array([[1069718400],
       [1466467200],
       [1152662400],
       [1267142400],
       [1317254400]])

Then we can add this newly transformed column to our feature matrix and re-fit the regression.  

In [27]:
features = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numeric', 'passthrough', numeric_columns),
    ('distances', DistanceTransformer(locations), ['LATITUDE', 'LONGITUDE']),
    ('dates', date_transformer, 'SALEDATE')
])

est = Pipeline([
    ('features', features),
    ('regressor', Ridge())
])

est.fit(X,y)

print(f'R^2 score using selected columns and transformers: {est.score(X, y)}')

R^2 score using selected columns and transformers: 8.626013346180184e-05


Our score actually decreased here, a lot!  Why?  

The transformed date/time values are on a large scale, and the distances computed by the `DistanceTransformer` are on a much smaller scale.  All of the one-hot encoded values are either 0 or 1 (by definition).  So the features in the new feature matrix live on very different scales to one another which leads to some numerical instabilities in the analytic solution to the linear regression.  

So let's try scaling the values in our feature matrix.  As mentioned in the last notebook, we would often use `StandardScaler` here, but this won't work (with the default parameters) in this case as our transformed feature matrix is a sparse array.  This is because subtracting the mean of each column (a usual step in this transformation) will turn a sparse matrix into a dense matrix, usually giving a very large feature matrix that won't fit into the memory of our computer.  

Let's use the [`MaxAbsScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) as a final transformer on all the data in the final feature matrix, right before the regressor.  This transformer will transform the data such that the maximal (absolute) value of each feature is 1 and *can* be applied to sparse matrices.  

In [28]:
from sklearn.preprocessing import MaxAbsScaler

est = Pipeline([
    ('features', features),
    ('scaling', MaxAbsScaler()),
    ('regressor', Ridge())
])

est.fit(X,y)

print(f'R^2 score using selected columns and transformers: {est.score(X, y)}')

R^2 score using selected columns and transformers: 0.15662689565093602
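
The difference between the two scalers on sparse input can be seen on a tiny made-up matrix. In this sketch, `MaxAbsScaler` keeps the result sparse because it only divides each column by its largest absolute value, whereas `StandardScaler` with its default `with_mean=True` refuses sparse input with a `ValueError`.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Made-up sparse matrix with columns on very different scales.
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 1.0e9],
    [1.0, 0.0],
    [0.0, 2.0e9],
]))

scaled = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(scaled))   # True: zeros stay zero, so sparsity is preserved
print(scaled.toarray())          # each column now has maximum absolute value 1

# Centering would destroy sparsity, so the default StandardScaler refuses.
try:
    StandardScaler().fit_transform(X_sparse)
except ValueError as err:
    print('StandardScaler raised:', err)
```

`StandardScaler(with_mean=False)` is another option that works on sparse matrices; it scales by the standard deviation without centering.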


## Feature unions

As we have seen, we often need to preprocess different features with different transformers. We saw how the `ColumnTransformer` gives us one way to combine different transformers applied to different columns of an input feature matrix.  Sometimes it is also useful to have another way of combining feature matrices into a single matrix, and this is what the [`FeatureUnion`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) class helps us to accomplish. 

![feature union](images/PML_feature_union.svg)

We could have built a number of independent transformer pipelines and pasted together the results using `FeatureUnion`.  That wasn't necessary in this case, but can sometimes be useful.  
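
Since this notebook does not otherwise use `FeatureUnion`, here is a minimal sketch of what it does, on made-up data. Unlike `ColumnTransformer`, every transformer in the union receives the *whole* input, and the outputs are placed side by side. The step names and the choice of transformers are ours, picked only to keep the numbers easy to follow.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 40.0]])   # made-up values

union = FeatureUnion([
    ('scaled', MaxAbsScaler()),                     # both columns, scaled to max 1
    ('squared', FunctionTransformer(np.square)),    # both columns, squared
])

combined = union.fit_transform(X_toy)
print(combined.shape)   # (3, 4): two scaled columns followed by two squared columns
print(combined)
```

Each step could itself be a `Pipeline`, which is how several independent preprocessing branches can be merged into one feature matrix.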

## Imputation

When data is missing, it's often preferable to impute or artificially assign values to empty fields rather than disregarding incomplete observations entirely. This is especially important when we expect the model we are training to be applied in situations with incomplete information.

Scikit-learn offers the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) transformer, which replaces missing values (instances of `np.nan`) with a summary value of the appropriate feature, such as the mean, median, or most frequent value (chosen using the `strategy` argument). More sophisticated imputation is better performed in NumPy or pandas (or by writing your own custom transformer to do the job), but `SimpleImputer` has the advantage of convenience. Being a transformer means that it is easy to reuse, behaves consistently, and can be incorporated into pipelines.

For the D.C. property data, we have used all of the columns except for `KITCHENS` and `QUADRANT`.  Both of these columns have missing values, so we can't use them directly in a predictor since the scikit-learn predictors can't use missing values.  We can demonstrate how to use the `SimpleImputer` to fill in missing values for `KITCHENS`.  

First, let's look at the values that are present in the data.

In [29]:
data['KITCHENS'].value_counts(dropna=False)

1.0     87784
2.0      7656
4.0      1859
3.0       627
NaN       237
0.0        43
5.0         5
6.0         4
44.0        1
Name: KITCHENS, dtype: int64

We will add to the `ColumnTransformer` that is generating the features, by including an instance of `SimpleImputer` to fill in missing data for `KITCHENS`.  

While not 100% accurate, it is reasonable to assume that each property has at least one kitchen in it, as only a tiny fraction (43/98216, less than 0.05%) of those listed above do not have one.  `SimpleImputer` lets us specify a "strategy" to use for imputation.  Here we will use the `most_frequent` strategy, meaning that missing values will be filled in with 1 (the most frequent occurrence).  

In [30]:
from sklearn.impute import SimpleImputer

features = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numeric', 'passthrough', numeric_columns),
    ('distances', DistanceTransformer(locations), ['LATITUDE', 'LONGITUDE']),
    ('dates', date_transformer, 'SALEDATE'),
    ('fill_kitchens', SimpleImputer(strategy='most_frequent'), ['KITCHENS'])
])

est = Pipeline([
    ('features', features),
    ('scaling', MaxAbsScaler()),
    ('regressor', Ridge())
])

est.fit(X,y)

print(f'R^2 score using selected columns and transformers: {est.score(X, y)}')

R^2 score using selected columns and transformers: 0.15696113614325713
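
To see what the `most_frequent` strategy does in isolation, here is a small sketch on a made-up column of kitchen counts. The fitted imputer stores the fill value it learned in its `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with two missing entries.
kitchens = np.array([[1.0], [2.0], [np.nan], [1.0], [np.nan]])

imputer = SimpleImputer(strategy='most_frequent')
filled = imputer.fit_transform(kitchens)

print(filled.ravel())        # [1. 2. 1. 1. 1.]
print(imputer.statistics_)   # [1.]
```

The fill value is learned in `fit`, so the same value is reused when the fitted imputer later transforms new data.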


**Note:**  Astute observers might notice a seeming discrepancy in the `features` transformer above.  For the `date_transformer` we specified a scalar value for the column (`SALEDATE`), whereas for the `SimpleImputer` we gave a list with one value in it, namely `['KITCHENS']`.  This is because *we wrote* `to_epoch` (and hence the `date_transformer`) in such a fashion that the input is expected to be a 1-dimensional array (a pandas Series).  

On the other hand, `SimpleImputer` is expecting 2-dimensional input.  Specifying the input as a Python list of one item is analogous to passing the (2-dimensional) DataFrame `X[['KITCHENS']]` (consisting of only a single column) to the imputer.  
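
The scalar-versus-list distinction can be checked directly. The sketch below uses a made-up two-column frame and a hypothetical helper, `recorder`, that notes the dimensionality of whatever the `ColumnTransformer` hands to it. Under the behaviour described above, the column given as a plain string should arrive as 1-dimensional and the column given in a list as 2-dimensional.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})   # made-up data
seen_ndim = {}

def recorder(label):
    """Build a function that records the dimensionality of its input."""
    def func(values):
        seen_ndim[label] = values.ndim
        # Always hand back a 2-D array, as ColumnTransformer requires.
        return np.asarray(values).reshape(len(values), -1)
    return func

ct = ColumnTransformer([
    ('scalar', FunctionTransformer(recorder('scalar')), 'a'),      # column as a string
    ('in_list', FunctionTransformer(recorder('in_list')), ['b']),  # column in a list
])

result = ct.fit_transform(toy)
print(seen_ndim)
print(result.shape)
```

This is why `to_epoch` needs the final `.reshape(-1, 1)`: it receives a 1-dimensional Series but must return a 2-dimensional array so that it can be stacked next to the other features.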

## A fancy display...

Scikit-learn has a useful way of displaying a `Pipeline` (or, say, a `ColumnTransformer`).  To see this display in a Jupyter notebook, we can use the following code.

In [31]:
from sklearn import set_config
set_config(display='diagram')

est

To go back to the "normal" display, we can use `set_config(display='text')`.  

Or we can get the HTML representation written to a file in this fashion (the HTML representation is what you are seeing above, rendered inside of the Jupyter notebook).

In [32]:
from sklearn.utils import estimator_html_repr

with open('estimator.html', 'w') as f:
    f.write(estimator_html_repr(est))

- real-world data rarely comes in a ready-to-use format so preprocessing and data wrangling is an important part of an ML model which sometimes requires building custom functionality
- scikit-learn predictor and transformer classes follow a particular template
- conforming to the scikit-learn convention when building custom estimators has the benefit of being able to use custom classes together with many other scikit-learn tools (e.g. pipelines, cross-validation, grid search)
- a scikit-learn pipeline allows for multiple transformers to be applied in succession
- a scikit-learn feature union applies a list of transformer objects in parallel to the input data, then concatenates the results