# Pipelines and Grid Search
*Author: Douglas Strodtman (SaMo)*

Here we'll use the Boston data with `VarianceThreshold`, `SelectKBest`, `StandardScaler`, and `Lasso`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [None]:
boston = pd.read_csv('data/boston_data.csv')
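
The cells further down refer to `X_train`, `X_test`, `y_train`, and `y_test`, which this notebook never creates. Below is a minimal sketch of the split. It assumes the target sits in a column named `target` (a hypothetical name; the real CSV may use something else) and uses a small synthetic frame, `boston_demo`, in place of the file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the frame loaded above; real column names may differ.
rng = np.random.default_rng(42)
boston_demo = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=['feat_a', 'feat_b', 'feat_c', 'target'],
)

# Separate the features from the (assumed) target column.
X = boston_demo.drop(columns='target')
y = boston_demo['target']

# Default split holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)
```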

## Pipeline Syntax

We set up a pipeline by passing a list of tuples in the format
```
('string_name', ClassObject())
```
Note that we can instantiate our steps beforehand and pass the named object instead (each of the tools we're using is a class in sklearn).
```
lasso = Lasso()
('lasso', lasso)
```

We can include as many steps as we'd like. Look at the following example:

In [None]:
# create a pipeline with the following steps using the provided names (and arguments where provided)
#     var_thresh: VarianceThreshold(.05)
#     ss: StandardScaler()
#     kbest: SelectKBest(f_regression, k=5)
#     lasso: Lasso()
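
One way the pipeline described in the comments above might be written is sketched here. The variable name `pipe` is an assumption; the step names and arguments come from the comments.

```python
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a ('string_name', estimator_instance) tuple; steps run in order.
pipe = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=0.05)),
    ('ss', StandardScaler()),
    ('kbest', SelectKBest(score_func=f_regression, k=5)),
    ('lasso', Lasso()),
])

# The step names are the keys we will reuse later in the param grid.
print(list(pipe.named_steps))
```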

To use this, we just `fit` on our training data.

In [None]:
# fit with our training data

Then we can `score` on our training data

In [None]:
# score training data

and on our test data

In [None]:
# score test data

and, in this case, conclude that our model is overfit: the training score is noticeably higher than the test score.
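
The fit-then-score steps can be sketched end to end as follows. The data here is a synthetic stand-in generated with `make_regression` (an assumption, chosen so the example runs without the CSV), so the scores it prints say nothing about the Boston results.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (not the real Boston data).
X, y = make_regression(n_samples=300, n_features=13, n_informative=8,
                       noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=0.05)),
    ('ss', StandardScaler()),
    ('kbest', SelectKBest(score_func=f_regression, k=5)),
    ('lasso', Lasso()),
])

# fit() runs fit_transform on every transformer in turn, then fits the Lasso.
pipe.fit(X_train, y_train)

# score() pushes the data through the fitted transformers and returns R^2.
train_r2 = pipe.score(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f'train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}')
```

A training score well above the test score is the usual sign of overfitting.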

## GridSearch Syntax

`GridSearchCV` accepts a `Pipeline` object as its estimator, along with a param grid.

The param grid uses the `string_name`s from your pipeline followed by a dunder `__` and the argument name for that particular step. You then provide an iterable to search over (generally a list or a range-style object).

In [None]:
# set up a param grid with the following:
#     var_thresh: threshold: [0, .05, .1, .25]
#     kbest: k: [3, 5, 7, 9]
#     lasso: alpha: np.logspace(-3, 3, 7)
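
A possible param grid matching the comments above is sketched here. The names `pipe`, `params`, and `unknown` are assumptions; the check at the end is an optional sanity test that every key matches a parameter the pipeline exposes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=0.05)),
    ('ss', StandardScaler()),
    ('kbest', SelectKBest(score_func=f_regression, k=5)),
    ('lasso', Lasso()),
])

# Keys follow the pattern '<step name>__<argument name>'.
params = {
    'var_thresh__threshold': [0, .05, .1, .25],
    'kbest__k': [3, 5, 7, 9],
    'lasso__alpha': np.logspace(-3, 3, 7),  # 0.001, 0.01, ..., 1000
}

# get_params() lists every name the pipeline accepts, nested ones included.
valid_names = set(pipe.get_params())
unknown = [name for name in params if name not in valid_names]
print('unrecognized keys:', unknown)
```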

You can also specify the number of folds using `cv`. The default is 5 in scikit-learn 0.22 and later (earlier versions defaulted to 3).

In [None]:
# instantiate our gridsearch with our pipe and params

We use this the same way as any other model, calling `fit` and `score` as usual. Once fit, `score` and `predict` use the hyperparameters that gave us the best cross-validated results.

In [None]:
# fit on the training data

In [None]:
# score the training data

In [None]:
# score the test data
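
The grid search steps might look like the sketch below. It again uses synthetic `make_regression` data as a stand-in (an assumption), so the scores and best parameters it prints are not the Boston results. `cv=5` is passed explicitly so the behavior does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (not the real Boston data).
X, y = make_regression(n_samples=300, n_features=13, n_informative=8,
                       noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('var_thresh', VarianceThreshold()),
    ('ss', StandardScaler()),
    ('kbest', SelectKBest(score_func=f_regression)),
    ('lasso', Lasso()),
])
params = {
    'var_thresh__threshold': [0, .05, .1, .25],
    'kbest__k': [3, 5, 7, 9],
    'lasso__alpha': np.logspace(-3, 3, 7),
}

# 4 * 4 * 7 = 112 candidate settings, each evaluated with 5-fold CV.
gs = GridSearchCV(pipe, params, cv=5)
gs.fit(X_train, y_train)

# With the default refit=True, score() uses the best pipeline refit on all of X_train.
print('train R^2:', gs.score(X_train, y_train))
print('test R^2:', gs.score(X_test, y_test))
print('best params:', gs.best_params_)
```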

So what are our best parameters?

In [None]:
# look at `.best_params_`

Note that we'll use our `best_estimator_` to access the `Pipeline` that was fit with our `best_params_`.

Within the `best_estimator_` there is a dictionary-like attribute called `named_steps`. We can use our `string_name`s to access the steps in our `Pipeline`. This is where we'll go to inspect the transformation applied and the parameters learned at each step.

`VarianceThreshold` with our best threshold of 0.05 removes one of our columns.

In [None]:
# get_support for our variance threshold step

`SelectKBest` with `k` of 9 removes several more of our columns. We can use this boolean mask to align our coefficients with our original features (as our modeling step will only see these 9 features).

In [None]:
# get_support for our k best step

Here our `Lasso` uses an alpha of .001. We can look at the `coef_` to see how our features are weighted.

In [None]:
# look at the lasso coefficients

In [None]:
# look at the lasso y intercept
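
The inspection steps above can be sketched on any fitted `Pipeline`. This example fits a plain pipeline on synthetic data as a stand-in for `gs.best_estimator_` (an assumption made to keep it short), and shrinks one column on purpose so that `VarianceThreshold` has something to drop.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (not the real Boston data).
X, y = make_regression(n_samples=300, n_features=13, n_informative=8,
                       noise=15.0, random_state=42)
X[:, 0] *= 0.01  # give column 0 a near-zero variance so the threshold removes it

pipe = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=0.05)),
    ('ss', StandardScaler()),
    ('kbest', SelectKBest(score_func=f_regression, k=5)),
    ('lasso', Lasso(alpha=0.001)),
]).fit(X, y)

# Mask over the 13 original columns: True means the column was kept.
vt_mask = pipe.named_steps['var_thresh'].get_support()

# Mask over the columns that survived var_thresh, so it is shorter than vt_mask.
kb_mask = pipe.named_steps['kbest'].get_support()

lasso = pipe.named_steps['lasso']
print(vt_mask.shape, int(vt_mask.sum()))
print(kb_mask.shape, int(kb_mask.sum()))
print(lasso.coef_.shape, lasso.intercept_)
```

Because each mask is defined relative to the output of the previous step, the masks have to be applied in pipeline order when mapping coefficients back to column names.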

The following code demonstrates using these methods to align our original columns with our final betas.

In [None]:
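# assumes `boston` holds only the feature columns; if it also contains the
# target, start from the feature frame's columns instead
columns = boston.columns
# apply each mask in pipeline order: every step only sees the previous step's output
columns = columns[gs.best_estimator_.named_steps['var_thresh'].get_support()]
columns = columns[gs.best_estimator_.named_steps['kbest'].get_support()]

pd.DataFrame(gs.best_estimator_.named_steps['lasso'].coef_,
             index=columns,
             columns=['weight'])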

This is a simple scatter plot to compare our true values to our predictions to visualize our errors.

In [None]:
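plt.scatter(y_test, gs.predict(X_test), label='predictions')
plt.xlabel('true')
plt.ylabel('predicted')
# red diagonal marks where predicted == true (a perfect prediction)
plt.plot([0, 50], [0, 50], color='r', label='perfect prediction')
plt.legend()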