In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# set defaults
plt.style.use('seaborn-white')   # seaborn custom plot style (named 'seaborn-v0_8-white' in matplotlib >= 3.6)
plt.rc('figure', dpi=100, figsize=(7, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

### Part 1

# Sklearn transformers

## The modeling pipeline



<img src="imgs/image_0.png" width="100%">

### The steps of the modeling pipeline

1. Create features to best reflect the meaning behind data
2. Create model appropriate to capture relationships between features
    - e.g. linear, non-linear
3. Select a loss function and fit the model (determine $\hat{\theta}$).
4. Evaluate model (e.g. using RMSE)

After these steps, use the model for prediction and/or inference.
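
The four steps can be sketched end to end on synthetic data. This is a minimal illustration, assuming a made-up quadratic process and plain NumPy least squares; the data, the feature choice, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic raw data (hypothetical): the outcome depends quadratically on x.
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 1, size=200)

# 1. Create features: an intercept column, x, and x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# 2-3. Linear model with squared loss: theta_hat minimizes ||X @ theta - y||^2.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# 4. Evaluate the fit with RMSE.
rmse = np.sqrt(np.mean((X @ theta_hat - y) ** 2))
print(theta_hat, rmse)
```

The rest of the lecture replaces each hand-written step with a Scikit-Learn object.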

### Software development and the modeling pipeline 

* Each step may contain complicated transformations and logic
* The pipeline above represents a single attempt at a model
    - May have thousands of feature/model/parameter combinations to choose from!
    - Remember the Data Science Life Cycle!
* ML pipelines: [the high interest credit card of technical debt](https://ai.google/research/pubs/pub43146)

### Features and Models using `Scikit Learn`

<div class="image-txt-container">
    
* Scikit-Learn implements many common steps in the feature/model creation pipeline.
* Interfaces with `numpy` arrays and Pandas dataframes (somewhat)
    - Some work is required to keep track of column names in Scikit-Learn
    
    
<img src="imgs/sklearn.png" width="50%"/>

</div>

### Scikit-Learn feature transformers

<div class="image-txt-container">

<img src="imgs/feature_part.png" width="50%">

<img src="imgs/image_1.png" width="50%">

</div>

### Scikit-Learn (linear) models

<div class="image-txt-container">

<img src="imgs/model_part.png" width="50%">


<img src="imgs/image_2.png" width="50%">

</div>

## Note...

- `sklearn` documentation is (usually) very, *very* good
- there's a lot to learn, but lots of examples

## Scikit-Learn Transformer Classes

* Transformers process data and output features (transformed data).
    - Input data should be a (multi-column) Numpy Array (`sklearn` coerces a dataframe using `.values`).
    - Output data is also a Numpy Array.

|Property|Example|Description|
|---|---|---|
|Initialize with parameters| `binar = Binarizer(threshold=thresh)` | 'set x=0 if x <= thresh, else 1'|
|Transform data in a dataset | `feat = binar.transform(data)` | Binarize all columns in `data`|
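
To make the thresholding rule concrete, here is a small sketch that compares `Binarizer` with the same rule written directly in NumPy. The four input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[10.0], [20.0], [20.5], [35.0]])   # toy values (assumed)

binar = Binarizer(threshold=20)
feat = binar.transform(data)    # Binarizer is stateless, so no fit is needed

# Same rule written directly in NumPy: 1 where x > threshold, else 0.
by_hand = (data > 20).astype(float)
print(feat.ravel())
```

Note that a value exactly equal to the threshold maps to 0.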

In [None]:
# initialize the transformer and use it in the dataset
from sklearn.preprocessing import Binarizer

tips = sns.load_dataset('tips')
tips.head()

In [None]:
bi = Binarizer(threshold = 20)                     # initialize with the parameter
binarized = bi.transform(tips[['total_bill']])     # call transform on the data
binarized[:5]

In [None]:
# check that it works: plot the binarized value against total_bill
(
    pd.concat([tips.total_bill, pd.DataFrame(binarized, columns=['binarized'])], axis=1)
    .sort_values('total_bill')
    .plot(x='total_bill', y='binarized')
);

## Some transformer classes require fitting

* Transformation logic often requires some knowledge of the dataset before transforming.
    - z-score: z = (x-μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation.
* These transformers must be *fit* to the data before use.
* Typical usage: fit transformer on a sample; use that fit transformer to transform future data.


|Property|Example|Description|
|---|---|---|
|Initialize with parameters| `stdscaler = StandardScaler()` | z-scale the data (no parameters) |
|Fit the transformer| `stdscaler.fit(data)` | compute the mean and std-dev of `data`|
|Transform data in a dataset | `feat = stdscaler.transform(newdata)` | z-scale `newdata` with mean/stdev of `data`|
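
The "fit on a sample, transform future data" pattern can be checked by hand. The sketch below uses toy arrays (assumed values) and shows that the mean and standard deviation come from the fitting data, not from the data being transformed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0]])   # sample used for fitting (toy values)
newdata = np.array([[2.5], [10.0]])             # "future" data

stdscaler = StandardScaler().fit(data)          # learns the mean and std-dev of `data`
feat = stdscaler.transform(newdata)

# By hand: population standard deviation (ddof=0), taken from `data`, not `newdata`.
mu, sigma = data.mean(axis=0), data.std(axis=0)
by_hand = (newdata - mu) / sigma
print(feat.ravel(), by_hand.ravel())
```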

In [None]:
from sklearn.preprocessing import StandardScaler

tips = sns.load_dataset('tips')
quantcols = ['total_bill', 'size', 'tip']
tips[quantcols].head()

In [None]:
stdscaler = StandardScaler()

In [None]:
# This doesn't work! The scaler has not been fit yet (raises NotFittedError).
stdscaler.transform(tips[quantcols])

In [None]:
#stdscaler.mean_

In [None]:
stdscaler.fit(tips[quantcols])

In [None]:
stdscaler.mean_

In [None]:
stdscaler.var_

In [None]:
# z-scaled data
stdscaler.transform(tips[quantcols])

## Example: One-Hot Encoding

- `sklearn` provides a preprocessor to compute One-Hot encoding for categoricals
- last time, we wrote code to do this by hand :(
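
As a reminder of what the encoder computes, this sketch builds a one-hot encoding by hand and compares it with `OneHotEncoder`. The single `day` column is a toy example, not the tips data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'day': ['Sun', 'Sat', 'Sun', 'Fri']})   # toy column (assumed)

# By hand: one 0/1 column per category, with categories in sorted order.
cats = sorted(df['day'].unique())
by_hand = np.column_stack([(df['day'] == c).to_numpy(dtype=float) for c in cats])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(df).toarray()   # output is sparse by default, so densify
print(ohe.categories_[0], encoded, sep='\n')
```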

In [None]:
# just categoricals
from sklearn.preprocessing import OneHotEncoder

tips_cat = ['sex', 'smoker', 'day', 'time']
regdata = tips[tips_cat]
regdata.head()

In [None]:
ohe = OneHotEncoder()  # create
ohe.fit(regdata)       # fit to data
ohe.categories_        # inspect the categories learned for each column

In [None]:
features = ohe.transform(regdata)  # returns a sparse matrix; call .toarray() for a dense array
features[:10]

In [None]:
features.toarray()

In [None]:
ohe.get_feature_names_out()

In [None]:
ohe.inverse_transform(features[:10])

### Part 2

# Sklearn models

## Scikit-Learn Model Classes

`Sklearn` model classes (estimators) behave like transformers, but use outcomes (target variables, dependent variables that you train your model on) to fit and evaluate.

|Property|Example|Description|
|---|---|---|
|Initialize model parameters| `lr = LinearRegression()` | Create (empty) linear regression model|
|Fit the model to the data | `lr.fit(data, outcomes)` | Determines regression coefficients|
|Use model for prediction |`lr.predict(newdata)`| Use regression line to make predictions|
|Evaluate the model| `lr.score(data, outcomes)` | Calculate the $R^2$ of the LR model|
|Access model attributes| `lr.coef_` | Access the regression coefficients|

*Note:* Once `fit`, estimators are just transformers (`predict` <-> `transform`)
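
A short sketch of the estimator interface on synthetic data (assumed coefficients and noise level), including a by-hand check of what `score` reports for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))                       # synthetic features (assumed)
outcomes = 3.0 + data @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=100)

lr = LinearRegression().fit(data, outcomes)
preds = lr.predict(data)

# R^2 = 1 - SS_residual / SS_total, which is what lr.score reports.
ss_res = np.sum((outcomes - preds) ** 2)
ss_tot = np.sum((outcomes - outcomes.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot
print(lr.coef_, lr.intercept_, r2_by_hand)
```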

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(tips[['total_bill', 'size']], tips.tip)

In [None]:
# we can predict with it
lr.predict(tips[['total_bill', 'size']])[:10]

In [None]:
# regression coefficients, why 2 slopes?
lr.coef_

In [None]:
lr.intercept_

In [None]:
# RMSE
preds = lr.predict(tips[['total_bill', 'size']])
np.sqrt(np.mean((preds - tips.tip.values)**2))

In [None]:
# R^2
lr.score(tips[['total_bill', 'size']], tips.tip)

### Building models with transformers and estimators


<div class="image-txt-container">
    
1. Define your transformations and models.
1. Transform input data to features.
1. Use (transformed) features to fit model.
1. Predict outcomes from features using fit model.

<img src="imgs/image_0.png" width="50%"/>

</div>

In [None]:
features.toarray()[:10]

In [None]:
# multicollinearity
# which variables should be dropped?
ohe.get_feature_names_out()

In [None]:
lr = LinearRegression()
lr.fit(features, tips.tip)

In [None]:
# plane of best fit in 10 dimensions: one slope per one-hot column
lr.coef_

In [None]:
preds = lr.predict(features)

In [None]:
preds[:10]

In [None]:
# RMSE: terrible model! (why?)
np.sqrt(np.mean((preds - tips.tip.values)**2))

In [None]:
# add predictions to original data
tips.assign(preds=preds)

## Redundant Features

- For any categorical feature, can always leave out one of the categories (inferred from the rest).
- E.g., "Yes" and "No". Just have binary feature: "Yes".
- Done with `drop='first'` in `OneHotEncoder`.
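
The redundancy can be seen directly on a toy dataset (assumed values, not the tips data): the full one-hot columns of a variable always sum to 1, which duplicates the intercept, and dropping one category per variable leaves the predictions unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed), not the tips dataset.
X = pd.DataFrame({'smoker': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
                  'time': ['Lunch', 'Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner']})
y = np.array([1.0, 3.0, 2.5, 1.5, 2.0, 3.5])

full = OneHotEncoder().fit_transform(X).toarray()                 # 4 columns
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()  # 2 columns

# The two 'smoker' columns sum to 1 in every row, duplicating the intercept.
print(full[:, :2].sum(axis=1))

preds_full = LinearRegression().fit(full, y).predict(full)
preds_reduced = LinearRegression().fit(reduced, y).predict(reduced)
print(np.allclose(preds_full, preds_reduced))
```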

In [None]:
ohe_better = OneHotEncoder(drop='first')
features = ohe_better.fit_transform(tips[tips_cat])

In [None]:
features.toarray()

In [None]:
ohe_better.get_feature_names_out()

In [None]:
lr = LinearRegression()
lr.fit(features, tips.tip)

In [None]:
preds = lr.predict(features)

In [None]:
# doesn't affect RMSE, but makes the individual weights on features more stable
np.sqrt(np.mean((preds - tips.tip.values)**2))

### Part 3

# Pipelines

## The "entire model"

- The "entire" model is the combination of all preprocessing + estimation done to the raw data.
- `sklearn` allows you to combine your preprocessing + estimation into a single "pipeline" object.

## Putting it together: Scikit-Learn Pipelines

<div class="image-txt-container">

* Put together transformers and models using `sklearn.pipeline.Pipeline`.
* Create a pipeline from a list of `(name, step)` tuples: `pl = Pipeline([('feat', feat_trans), ('mdl', mdl)])`
* Fit *all* the transformer(s)/model(s) in the pipeline using `pl.fit(data, target)`
* Predict from *raw* input data through the pipeline using `pl.predict`.
* Note: a fit pipeline is also a transformer!

<img src="imgs/image_0.png" width="50%">

</div>
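
A sketch of what `pl.fit` and `pl.predict` do under the hood, using a toy one-column dataset (assumed values): the pipeline should give the same predictions as running the transformer and the model step by step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'day': ['Sun', 'Sat', 'Sun', 'Fri', 'Sat', 'Fri']})  # toy data
target = np.array([3.0, 2.0, 3.5, 1.0, 2.5, 1.5])

pl = Pipeline([('one-hot', OneHotEncoder()), ('lin-reg', LinearRegression())])
pl.fit(data, target)

# The same thing done step by step.
ohe = OneHotEncoder()
features = ohe.fit_transform(data)              # fit and apply the transformer
lr = LinearRegression().fit(features, target)   # fit the model on the features
manual_preds = lr.predict(ohe.transform(data))  # transform, then predict

print(pl.predict(data), manual_preds)
```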



In [None]:
# pass a list of feature trans. and models, in sequence.
# does all fitting and transforming

from sklearn.pipeline import Pipeline

In [None]:
# a pipeline is a list of steps; each step is a transformer or an estimator
# each step is a tuple: (name of the step, transformer/estimator object)
pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression())
])

In [None]:
regdata.head()

In [None]:
pl.fit(regdata, tips.tip)

In [None]:
# inspect the 'steps' of the pipeline using .named_steps
# gives a dictionary
# key: name you gave
# values: fit pipeline objects
pl.named_steps['one-hot'].transform(regdata).toarray()

In [None]:
pl.named_steps['one-hot'].categories_

In [None]:
pl.predict(regdata)[:10]

In [None]:
# R^2 -- still terrible! (1 is good, 0 is bad)
pl.score(regdata, tips.tip)

### (Realistic) Sklearn Pipelines
<div class="image-txt-container">
    
* `ColumnTransformer` was added in scikit-learn 0.20 (2018).
* Transforms using multiple transformers, each on different columns.
* `ColumnTransformer` performs the transformations and concatenates the output (axis=1).

<img src="imgs/image_3.png">

</div>
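
The "transform, then concatenate" behavior can be verified on a toy dataframe (assumed values). The sketch passes `sparse_threshold=0` so that the combined output is always a dense array, which makes the comparison simple.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0],
                   'time': ['Lunch', 'Dinner', 'Dinner', 'Lunch']})   # toy rows

ct = ColumnTransformer([
    ('num', StandardScaler(), ['total_bill']),
    ('cat', OneHotEncoder(), ['time']),
], sparse_threshold=0)                  # always return a dense array
combined = ct.fit_transform(df)

# Equivalent by hand: transform each block, then stack the columns side by side.
num_part = StandardScaler().fit_transform(df[['total_bill']])
cat_part = OneHotEncoder().fit_transform(df[['time']]).toarray()
by_hand = np.hstack([num_part, cat_part])
print(combined.shape)
```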

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
import sklearn.preprocessing as pp

In [None]:
# 1. split data up into quant. and cat. features
# 2. z-scale for quant features
# 3. One-hot encode for cat. features
# 4. Two pipelines
# 5. Use column transformer to put everything back
# 6. Apply the model => predictions

In [None]:
tips.drop(['tip', 'total_bill', 'size'], axis=1).head()

In [None]:
# Numeric columns and associated transformers
num_feat = ['total_bill', 'size']
num_transformer = Pipeline(steps=[
    ('scaler', pp.StandardScaler())   # z-scale
])

# Categorical columns and associated transformers
cat_feat = ['sex', 'smoker', 'day', 'time']
cat_transformer = Pipeline(steps=[
    ('onehot', pp.OneHotEncoder())     # one-hot encode the categorical columns
])

# preprocessing pipeline (put them together)
preproc = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_feat),
        ('cat', cat_transformer, cat_feat)
    ])

pl = Pipeline(steps=[('preprocessor', preproc), ('regressor', LinearRegression())])

In [None]:
pl.fit(tips.drop('tip', axis=1), tips.tip)

In [None]:
preds = pl.predict(tips.drop('tip', axis=1))
preds[:10]

In [None]:
np.sqrt(np.mean((preds - tips.tip)**2))

In [None]:
pl.score(tips.drop('tip', axis=1), tips.tip)

In [None]:
pl.named_steps['preprocessor'].transform(tips.drop('tip', axis=1)).shape

### Part 4

# Evaluating the fit model



<img src="imgs/image_4.png" width="100%">

## Evaluating the quality of a model

* Given a regressor fit on a dataset, calculate e.g. the root-mean-square error.
* If the error is low, do you think it's a good model?
    - It fits the given *data* well, but is it a good model? (Is the sample representative?)
    - Will it give good predictions on similar, unknown, data?

## Fundamental Concepts of the quality of a 'fit model'

* **Bias**: the expected deviation between the predicted value and true value
* **Variance**: 
    - **Observation Variance**: the variability of the random noise in the process we are trying to model. 
    - **Estimated Model Variance**: the variability in the predicted value across different datasets. (Does the model generalize?)
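
These definitions can be estimated by simulation when the data generating process is known. The sketch below assumes a hypothetical quadratic process with unit-variance noise and fits a straight line to many independently drawn datasets, then looks at the predictions at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return x ** 2            # assumed data generating process (hypothetical)

x0 = 2.0                     # point at which predictions are compared
preds = []
for _ in range(500):
    # Draw a fresh dataset from the same process each time.
    x = rng.uniform(-3, 3, size=50)
    y = true_f(x) + rng.normal(0, 1, size=50)   # observation noise, variance 1
    slope, intercept = np.polyfit(x, y, deg=1)  # a (too simple) linear model
    preds.append(slope * x0 + intercept)

preds = np.array(preds)
bias = preds.mean() - true_f(x0)     # expected prediction minus true value
model_variance = preds.var()         # spread of predictions across datasets
print(bias, model_variance)
```

In practice only one dataset is available, which is why the sections below turn to bootstrapping and train-test splits.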

### Model Quality: Bias and Variance

<div class="image-txt-container">
    
* The red bulls-eye: the true behavior of DGP
* Each dart: a specific function that models/predicts the DGP
* The model parameters $\theta$ select these functions.
* Credit: Scott Fortmann-Roe
    
<img src="imgs/image_5.png" width="100%">

</div>


## Evaluating the quality of a linear model

Given a dataset on which to fit the regression coefficients:
1. Calculate the RMSE to test for bias.
2. To test for variance, bootstrap the regression coefficients:
    - Resample the data with replacement.
    - For each resample, fit the linear predictor.
    - For each input value, calculate a confidence interval from the distribution of predictions.
    - Large "prediction intervals" imply the model is susceptible to noise (e.g. outliers)
    
Still, this relies on a "representative sample" for generalization to new data!
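
A minimal sketch of the bootstrap procedure, using synthetic data in place of the tips dataset (the slope, intercept, and noise level are assumptions) and a percentile interval for the predictions at two inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 50, size=(n, 1))                 # synthetic "total_bill" (assumed)
y = 1.0 + 0.1 * X[:, 0] + rng.normal(0, 1, size=n)  # synthetic "tip"

x_grid = np.array([[10.0], [30.0]])                 # inputs at which to predict
boot_preds = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)                # resample rows with replacement
    lr = LinearRegression().fit(X[idx], y[idx])
    boot_preds.append(lr.predict(x_grid))

# 95% percentile interval of the prediction at each input.
lower, upper = np.percentile(boot_preds, [2.5, 97.5], axis=0)
print(lower, upper)
```

The shaded band drawn by `sns.lmplot` is a similar bootstrap interval around the regression line.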

In [None]:
sns.lmplot(data=tips, x='total_bill', y='tip');

## Evaluating the quality of a (general) model

* Given a fit (non-linear) model, there are three possibilities for quality:
    - The model doesn't fit the given data well (high bias; underfit).
    - The model reflects the process of interest (good fit; robust).
    - The model just fits the given data, noise and all (high variance; overfit).

* How can we ascertain the quality on similar, out-of-sample data?

## Evaluating the quality of a (general) model

* Given a quadratic process, a linear model has high bias.
* "Connecting-the-dots" will fail to generalize (high variance).
* Balance model complexity with complexity of DGP.

![overfit](imgs/under-over-fit.png)
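
The picture can be reproduced numerically. This sketch assumes a hypothetical quadratic process with noise and fits polynomials of degree 1, 2, and 10 to a small training sample, then compares error on the training sample with error on fresh data. A high-degree fit may trigger a conditioning warning from NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Assumed quadratic process with noise (hypothetical).
    x = rng.uniform(-3, 3, size=n)
    return x, x ** 2 + rng.normal(0, 1, size=n)

x_train, y_train = sample(15)
x_test, y_test = sample(200)

def rmse(coefs, x, y):
    return np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for deg in [1, 2, 10]:
    coefs = np.polyfit(x_train, y_train, deg)
    results[deg] = (rmse(coefs, x_train, y_train), rmse(coefs, x_test, y_test))
    print(deg, results[deg])
```

Training error can only go down as the degree grows; error on fresh data is what reveals whether the extra complexity helped.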

### Example: predicting survival on the Titanic with Decision Trees

<div class="image-txt-container">

* Did a given passenger survive the Titanic disaster?
* The (simple) tree below has mediocre accuracy

<img src="imgs/image_6.png" width="50%">

</div>

### Reducing Bias with more complicated models

* Improve performance by "growing" the decision tree model.
* Decrease the number of passengers required in leaf nodes.
* Effect: "Learn" individual passengers?
* How do you know your model generalizes?

<img src="imgs/Titanic_Decision_Tree.png" width="100%">
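
A sketch of the effect of shrinking the leaf size, using a synthetic classification dataset as a stand-in for the Titanic data (the dataset and its parameters are assumptions, not the real passengers).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (assumed; not the real passengers).
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for leaf in [50, 10, 1]:
    # Smaller leaves allow a deeper, more complicated tree.
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_train, y_train)
    scores[leaf] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(leaf, scores[leaf])
```

With one sample per leaf the tree can memorize the training rows, so training accuracy alone says little about generalization.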


## Train-Test Split

To assess your model for overfitting to the data, randomly split the data into a "training set" and a "test set".

<div class="image-txt-container">

* The training set is used to fit the model (train the predictor).
* The test set is used to test the goodness-of-fit of the fit model.
* This is *similar* to bootstrap-estimating a regression model.

<img src="imgs/train-test.png">

</div>



## The machine learning training pipeline:

<img src="imgs/train-test.png" width="50%">

Scikit-Learn has functions that help us do this.

### Using Scikit-Learn for train-test split

* Splitting a dataset using `sklearn.model_selection.train_test_split` 
* Given features `X` and a target array `y`,
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```
randomly splits the features and target into training and test sets.
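
A small sketch on toy arrays (assumed values) showing the resulting sizes and that each feature row stays paired with its target. The `random_state` argument is added here only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)     # toy features: row i is [2i, 2i+1]
y = np.arange(20)                    # toy target: y[i] = i

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)   # 15 training rows, 5 test rows

# Rows stay paired with their targets after shuffling.
print(np.array_equal(X_train[:, 0] // 2, y_train))
```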

In [None]:
from sklearn.model_selection import train_test_split

X = tips.drop('tip', axis=1)
y = tips.tip
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
print(
    len(X_train)/len(X),
    len(X_test)/len(X)
)

### Example Prediction Pipeline

* Train a simple linear regression model on the tips data
* Split the data into a training and test set:
    - fit the model on the training set
    - compute the error on the test set

In [None]:
X = tips.drop(['tip', 'sex', 'smoker', 'day', 'time' ], axis=1)
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


pl = Pipeline([
   ('lin-reg', LinearRegression())
])

pl.fit(X_train, y_train)

# performance on training data
pred_train = pl.predict(X_train)
rmse_train = np.sqrt(np.mean((pred_train - y_train)**2))

# performance on test data -- what we really care about
pred_test = pl.predict(X_test)
rmse_test = np.sqrt(np.mean((pred_test - y_test)**2))

print(f"train RMSE: {rmse_train}")
print(f"test RMSE: {rmse_test}")

### Conclusion: evaluating model fit

* Complex models are required to model complex phenomena.
* How can you tell a complex model isn't over-fitting to the data?
    - Answer: split into a training set and a test set.
    - If test performance is much worse than training performance, you've overfit.