
In [2]:
import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=666)

# Let's create some fake data, just for practice
n = 2170
p = 3

X = rs.random(size=(n, p))

betas = np.arange(1, p+2)

y = betas[0] + (X @ betas[1:]) + rs.normal(size=n)

df = pd.DataFrame({f"X{i+1}": X[:, i] for i in range(p)}).assign(y=y)

model = " + ".join([f"{betas[i]} X{i}" for i in range(1, p+1)])
print(f"The model is y ~ {betas[0]} + {model}\n")

df

The model is y ~ 1 + 2 X1 + 3 X2 + 4 X3



|      | X1       | X2       | X3       | y        |
|------|----------|----------|----------|----------|
| 0    | 0.700437 | 0.844187 | 0.676514 | 7.081488 |
| 1    | 0.727858 | 0.951458 | 0.012703 | 4.831854 |
| 2    | 0.413588 | 0.048813 | 0.099929 | 2.249246 |
| 3    | 0.508066 | 0.200248 | 0.744154 | 6.280237 |
| 4    | 0.192892 | 0.700845 | 0.293228 | 3.886229 |
| ...  | ...      | ...      | ...      | ...      |
| 2165 | 0.822996 | 0.476527 | 0.565155 | 8.199809 |
| 2166 | 0.329048 | 0.287782 | 0.712354 | 5.987331 |
| 2167 | 0.671275 | 0.361078 | 0.949851 | 7.408811 |
| 2168 | 0.452967 | 0.041320 | 0.336625 | 3.361783 |


# Cross validation

As a broader ML library, `sklearn` provides many useful abstractions for common industry practices, including cross-validation.

The idea is to build a `Pipeline` that chains all the processing steps and is later run end to end with `fit()` (see the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)).


Let's try it by finding an optimal value of the penalty weight for regularized linear regression.

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [4]:
# Note that, for regularized linear regression, we generally want to standardize
# the covariates. However, the mean/standard deviation used must only be
# estimated from the training set!
# sklearn ColumnTransformer and StandardScaler provide this functionality

ct = ColumnTransformer([("std", StandardScaler(), ["X1", "X2", "X3"])])
ct

In [5]:
# For example, we can "fit" the column transformer with one subset of the data,
# and use that to "transform" a different subset
ct.fit(df.iloc[:500])
ct.transform(df.iloc[500:])

array([[-0.63712859,  0.89193542,  1.72863033],
       [-0.78130111, -0.0742917 , -0.89029186],
       [-0.10390469,  1.14800935, -0.66367349],
       ...,
       [ 0.64824778, -0.52380543,  1.59665505],
       [-0.11024564, -1.61494902, -0.51662817],
       [ 1.39601679, -0.04862662, -0.88198919]])
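
As a quick check of the point about training-only statistics, the sketch below confirms that the fitted scaler stores the mean of the rows passed to `fit` and nothing else. The `demo` frame and the names `ct_demo`, `train`, and `test` are stand-ins introduced here for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with the same column names as `df`
rs = np.random.RandomState(seed=0)
demo = pd.DataFrame(rs.random(size=(1000, 3)), columns=["X1", "X2", "X3"])
train, test = demo.iloc[:500], demo.iloc[500:]

ct_demo = ColumnTransformer([("std", StandardScaler(), ["X1", "X2", "X3"])])
ct_demo.fit(train)

# The fitted scaler is stored under the name we gave it ("std")
scaler = ct_demo.named_transformers_["std"]

# Its mean was estimated from the training rows only
train_mean = train[["X1", "X2", "X3"]].mean().to_numpy()
print(np.allclose(scaler.mean_, train_mean))

# Training rows are centered at zero after the transform; the held-out rows
# are shifted by the *training* mean, so their average is only close to zero
print(ct_demo.transform(train).mean(axis=0))
print(ct_demo.transform(test).mean(axis=0))
```

This is the behavior cross-validation relies on: within each fold, the scaler is refit on that fold's training portion before the held-out portion is transformed.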

In [6]:
# Next, we can build our pipeline for fitting a regularized linear regression
# model with an ElasticNet estimator
pipe = Pipeline(
  [
    ("transform", ct),
    ("reg", ElasticNet())
  ]
)
pipe

In [8]:
ElasticNet

In [9]:
# We can fit _one_ ElasticNet model
pipe.fit(X=df[["X1", "X2", "X3"]], y=df.y)

## Cross validation

`GridSearchCV` helps us fit an entire `Pipeline` repeatedly while varying its parameters. Note how we "named" the components of our pipeline:

```python
pipe = Pipeline(
  [
    ("transform", ct),
    ("reg", ElasticNet())
  ]
)
```

We set the parameters of each "step" in the pipeline by giving values for keys of the form `"{name}__{param}"`. For example, to test our `Pipeline` with the `l1_ratio` parameter of `ElasticNet` (which we've named `"reg"`) set to 0.2, 0.5, and 0.8, we would set

```python
param_grid={
  "reg__l1_ratio": [0.2, 0.5, 0.8],
}
```
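
To see which `"{name}__{param}"` keys are available, one option is to inspect `get_params()` on the pipeline; the same keys also work with `set_params()`. The sketch below rebuilds the pipeline under the hypothetical names `ct_demo` and `pipe_demo` so that it runs on its own.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ct_demo = ColumnTransformer([("std", StandardScaler(), ["X1", "X2", "X3"])])
pipe_demo = Pipeline([("transform", ct_demo), ("reg", ElasticNet())])

# Every key starting with "reg__" is a parameter of the ElasticNet step
reg_params = sorted(k for k in pipe_demo.get_params() if k.startswith("reg__"))
print(reg_params)

# Nesting goes deeper with more double underscores: step, transformer, parameter
print("transform__std__with_mean" in pipe_demo.get_params())

# The same keys can be used to set values directly, without a grid search
pipe_demo.set_params(reg__l1_ratio=0.2, reg__alpha=0.05)
print(pipe_demo.named_steps["reg"])
```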

Let's try a grid search for 2 values of `l1_ratio` and 3 values of `alpha`.

`np.linspace` and `np.logspace` are popular ways to create a list of "`n`" parameter values within some range, evenly spaced on a linear and a logarithmic scale, respectively.
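
As a small illustration, the snippet below prints the values these two helpers produce for the ranges used in this notebook, and counts the resulting parameter combinations with `ParameterGrid`. The variable names `l1_ratios`, `alphas`, and `grid` are introduced here only for the example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Two evenly spaced values between 0 and 1 (both endpoints included)
l1_ratios = np.linspace(0, 1, num=2)
print(l1_ratios)

# Three values evenly spaced in the exponent: 10**-1, 10**-1.5, 10**-2
alphas = np.logspace(-1, -2, num=3)
print(alphas)

# A grid search tries every combination of the listed values
grid = ParameterGrid({"reg__l1_ratio": l1_ratios, "reg__alpha": alphas})
print(len(grid))
```

With 2 values of `l1_ratio` and 3 values of `alpha`, the grid has 6 candidate settings, and each one is fit once per cross-validation fold.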

In [10]:
# We can further wrap our pipeline with a GridSearchCV, which specifies how to
# search the parameters of ElasticNet
cv_pipe = GridSearchCV(
    estimator=pipe,
    param_grid={
        "reg__l1_ratio": np.linspace(0, 1, num=2),
        "reg__alpha": np.logspace(-1, -2, num=3),
    },
)
cv_pipe

We can then `fit` the overall `GridSearchCV` just as we would a `Pipeline`.

In [11]:
cv_pipe.fit(X=df[["X1", "X2", "X3"]], y=df.y)



## CV results

Note that, by default, `GridSearchCV` splits the data into five folds. The fitted `cv_pipe` object stores all the results in its `cv_results_` attribute.
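
The default split can be overridden through the `cv` argument, and the metric through `scoring`. The following sketch, on hypothetical stand-in data (`X_demo`, `y_demo`) and a simplified pipeline (`pipe_demo`), passes an explicit shuffled `KFold` and scores by negative mean squared error; the particular choices are illustrative assumptions rather than what the notebook uses.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data following the same model as the notebook
rs = np.random.RandomState(seed=0)
X_demo = rs.random(size=(500, 3))
y_demo = 1 + X_demo @ np.array([2, 3, 4]) + rs.normal(size=500)

pipe_demo = Pipeline([("std", StandardScaler()), ("reg", ElasticNet())])

search = GridSearchCV(
    estimator=pipe_demo,
    param_grid={"reg__alpha": [0.1, 0.01]},
    # Explicit splitter: five folds, rows shuffled with a fixed seed
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    # Scores follow a "higher is better" convention, hence the negated MSE
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.n_splits_)    # number of folds actually used
print(search.best_score_)  # mean negated MSE of the best candidate
```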

In [12]:
cv_pipe.cv_results_

{'mean_fit_time': array([0.0101716 , 0.00442595, 0.00992966, 0.00435696, 0.01020017,
        0.00455256]),
 'std_fit_time': array([0.0009363 , 0.00016959, 0.00036815, 0.00018826, 0.00106634,
        0.0004495 ]),
 'mean_score_time': array([0.00251837, 0.00229487, 0.0025218 , 0.00226088, 0.00237765,
        0.00227427]),
 'std_score_time': array([8.93660006e-05, 4.89543888e-05, 8.36121958e-05, 2.70482279e-05,
        9.33822726e-05, 7.34909559e-05]),
 'param_reg__alpha': masked_array(data=[0.1, 0.1, 0.03162277660168379, 0.03162277660168379,
                    0.01, 0.01],
              mask=[False, False, False, False, False, False],
        fill_value=1e+20),
 'param_reg__l1_ratio': masked_array(data=[0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
              mask=[False, False, False, False, False, False],
        fill_value=1e+20),
 'params': [{'reg__alpha': 0.1, 'reg__l1_ratio': 0.0},
  {'reg__alpha': 0.1, 'reg__l1_ratio': 1.0},
  {'reg__alpha': 0.03162277660168379, 'reg__l1_ratio': 0.0},
  {'re
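
The raw dictionary is hard to read, so it is common to load `cv_results_` into a `pandas` DataFrame and keep only a few columns. The sketch below does this on hypothetical stand-in data with a simplified pipeline; `X_demo`, `y_demo`, `pipe_demo`, `search`, and the chosen grid values are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data following the same model as the notebook
rs = np.random.RandomState(seed=0)
X_demo = rs.random(size=(500, 3))
y_demo = 1 + X_demo @ np.array([2, 3, 4]) + rs.normal(size=500)

pipe_demo = Pipeline([("std", StandardScaler()), ("reg", ElasticNet())])
search = GridSearchCV(
    estimator=pipe_demo,
    param_grid={"reg__l1_ratio": [0.5, 1.0], "reg__alpha": [0.1, 0.01]},
)
search.fit(X_demo, y_demo)

# One row per parameter combination
results = pd.DataFrame(search.cv_results_)

cols = [
    "param_reg__alpha",
    "param_reg__l1_ratio",
    "mean_test_score",   # average score over the held-out folds
    "std_test_score",    # spread of the score across folds
    "rank_test_score",   # 1 = best mean_test_score
]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

For a regressor with no `scoring` argument, the scores reported here come from the estimator's own `score` method, which is R² for `ElasticNet`.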

If we just want to use the model that achieved the best out-of-sample score, we can use the `best_estimator_` attribute.

In [13]:
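# The ColumnTransformer selects X1, X2, X3 by name and drops every other
# column (including y), so passing the whole DataFrame is safe here
cv_pipe.best_estimator_.predict(df)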

array([7.64693731, 5.32663822, 2.41104296, ..., 7.27437063, 3.42433179,
       5.18781937])
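
Alongside `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_`, and its own `predict` delegates to the best estimator. The sketch below shows these on hypothetical stand-in data with a simplified pipeline (`X_demo`, `y_demo`, `pipe_demo`, and `search` are names introduced for this example).

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data following the same model as the notebook
rs = np.random.RandomState(seed=0)
X_demo = rs.random(size=(500, 3))
y_demo = 1 + X_demo @ np.array([2, 3, 4]) + rs.normal(size=500)

pipe_demo = Pipeline([("std", StandardScaler()), ("reg", ElasticNet())])
search = GridSearchCV(
    estimator=pipe_demo,
    param_grid={"reg__l1_ratio": [0.5, 1.0], "reg__alpha": [0.1, 0.01]},
)
search.fit(X_demo, y_demo)

# The winning parameter combination and its mean cross-validated score
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, the best candidate is refit on all the data,
# and calling predict on the search object uses that refit model
same = np.allclose(search.predict(X_demo), search.best_estimator_.predict(X_demo))
print(same)
```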

In [15]:
fitted_pipeline = cv_pipe.best_estimator_

In [16]:
fitted_pipeline

In [17]:
dir(fitted_pipeline)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__sklearn_is_fitted__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_build_request_for_signature',
 '_can_fit_transform',
 '_can_inverse_transform',
 '_can_transform',
 '_check_feature_names',
 '_check_method_params',
 '_check_n_features',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_estimator_type',
 '_final_estimator',
 '_fit',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_params',
 '_get_tags',
 '_iter',
 '_log_message',
 '_more_tags

In [19]:
fitted_pipeline.get_params()

{'memory': None,
 'steps': [('transform',
   ColumnTransformer(transformers=[('std', StandardScaler(), ['X1', 'X2', 'X3'])])),
  ('reg', ElasticNet(alpha=0.01, l1_ratio=0.0))],
 'verbose': False,
 'transform': ColumnTransformer(transformers=[('std', StandardScaler(), ['X1', 'X2', 'X3'])]),
 'reg': ElasticNet(alpha=0.01, l1_ratio=0.0),
 'transform__force_int_remainder_cols': True,
 'transform__n_jobs': None,
 'transform__remainder': 'drop',
 'transform__sparse_threshold': 0.3,
 'transform__transformer_weights': None,
 'transform__transformers': [('std', StandardScaler(), ['X1', 'X2', 'X3'])],
 'transform__verbose': False,
 'transform__verbose_feature_names_out': True,
 'transform__std': StandardScaler(),
 'transform__std__copy': True,
 'transform__std__with_mean': True,
 'transform__std__with_std': True,
 'reg__alpha': 0.01,
 'reg__copy_X': True,
 'reg__fit_intercept': True,
 'reg__l1_ratio': 0.0,
 'reg__max_iter': 1000,
 'reg__positive': False,
 'reg__precompute': False,
 'reg__random_
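
Rather than scanning `dir()` or `get_params()`, the fitted pieces of a pipeline can be reached by name through `named_steps` and `named_transformers_`. The sketch below is a self-contained illustration on hypothetical stand-in data (`demo`, `y_demo`, `pipe_demo`); it reads the fitted coefficients and converts them from the standardized scale back to the original units of the covariates.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data following the same model as the notebook
rs = np.random.RandomState(seed=0)
demo = pd.DataFrame(rs.random(size=(500, 3)), columns=["X1", "X2", "X3"])
y_demo = 1 + demo.to_numpy() @ np.array([2, 3, 4]) + rs.normal(size=500)

pipe_demo = Pipeline(
    [
        ("transform", ColumnTransformer([("std", StandardScaler(), ["X1", "X2", "X3"])])),
        ("reg", ElasticNet(alpha=0.01, l1_ratio=0.5)),
    ]
)
pipe_demo.fit(demo, y_demo)

# Reach into the fitted steps by the names given at construction time
scaler = pipe_demo.named_steps["transform"].named_transformers_["std"]
reg = pipe_demo.named_steps["reg"]

# coef_ is on the standardized scale; undo the scaling to get raw-unit slopes
raw_coef = reg.coef_ / scaler.scale_
raw_intercept = reg.intercept_ - np.sum(raw_coef * scaler.mean_)
print(raw_coef, raw_intercept)

# Sanity check: the raw-unit formula reproduces the pipeline's predictions
manual = raw_intercept + demo.to_numpy() @ raw_coef
print(np.allclose(manual, pipe_demo.predict(demo)))
```

Because of the penalty, the recovered slopes are shrunk estimates and need not match the simulated coefficients exactly.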