# Intro to Data Science @ SzISz Part VII.
## Grand Finale:<br> Model Selection, Pipeline building, and Custom Sklearn Nodes 

### Table of contents
- <a href="#What-is-Model-Selection?">Model Selection Theory</a>
- <a href="#Cross-Validation">Cross Validation</a>
    - <a href="#Grid-Search-Cross-Validation">Grid Search Cross Validation</a>
    - <a href="#Randomized-Search-Cross-Validation">Randomized Search Cross Validation</a>
- <a href="#Building-Pipelines">Building Pipelines</a>
- <a href="#Building-a-Custom-Sklearn-Node">Building a Custom Scikit-learn Node</a>
    

## What is Model Selection?
_"Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered."_ from: <a href="https://en.wikipedia.org/wiki/Model_selection">Wiki</a>  
In this context we also include the process of finding the optimal hyperparameters.

## Why is it important?
To find the optimal solution to a given problem, one must train several models with similar predictive/exploratory power and select the simplest one. This process includes selecting models and finding optimal hyperparameters, which is time-consuming and tedious work when done by hand. We use automated solutions to overcome this problem, save time, and yield better results.

## Tools
- Grid Search
- Randomized search
- etc.

In [None]:
%matplotlib inline
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_digits

# Call the seeding function; assigning to it would overwrite np.random.seed instead of seeding.
np.random.seed(42)

In [None]:
digits = load_digits()
X, y = digits.data, digits.target

## <a href="http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation">Cross Validation</a>

To select an optimal model, one must first be able to measure the accuracy of a model or pipeline.  

The first step is to choose a valid metric for the model. In sklearn, the default score is accuracy for classification and $R^{2}$ for regression, although several other metrics can be selected from <a href="http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules">this</a> list.

_"Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting."_ [1]. To overcome this problem, one must split the data into a __training__ and a __test__ dataset, train the model on the training dataset, and then measure its accuracy on the test dataset.
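The split described above can be sketched as follows. This is only an illustration: the k-nearest-neighbours classifier, the 25% test size, and the stratified split are assumptions chosen for the example, not something the text prescribes.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hold out 25% of the samples; stratify=y keeps the class proportions
# similar in the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative model choice (an assumption of this sketch).
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Accuracy on the data the model has seen vs. on held-out data.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

Only the test accuracy says something about how the model behaves on unseen samples.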

However, different splits can produce different outcomes, so this process must be repeated several times to get a good estimate of the examined model's accuracy. This process is called __Cross Validation__, and there are <a href="http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators">different strategies</a> for making these splits.
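As a rough sketch of repeated splitting, `cross_val_score` can evaluate the same model under two splitting strategies. The classifier and the number of folds are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)  # illustrative model choice

# Two splitting strategies: plain K-fold and class-stratified K-fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold; for a classifier the default score is accuracy.
kfold_scores = cross_val_score(clf, X, y, cv=kfold)
stratified_scores = cross_val_score(clf, X, y, cv=stratified)

print("KFold:           mean=%.3f std=%.3f" % (kfold_scores.mean(), kfold_scores.std()))
print("StratifiedKFold: mean=%.3f std=%.3f" % (stratified_scores.mean(), stratified_scores.std()))
```

A different metric can be requested through the `scoring` parameter, for example `scoring="f1_macro"`.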

Even a simple model can yield different solutions on the same data depending on its hyperparameters, so multiple models must be trained to find the ideal hyperparameter settings. Cross validation gives a good estimate of a trained model's accuracy, but additional methods are required to search for the ideal hyperparameters.

[1] <a href="http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance">Scikit-learn 0.17.1 documentation</a>


### Grid Search Cross Validation
Grid search generates a parameter grid from lists of candidate settings and measures the input model's accuracy for every combination using cross validation.
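A minimal sketch of the idea, assuming a k-nearest-neighbours classifier and a small hand-picked grid (both are illustrative choices). It uses the `sklearn.model_selection` module, where `GridSearchCV` lives in current scikit-learn releases.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 3 values x 2 values = 6 candidate settings, each evaluated with 3-fold CV.
param_grid = {
    "n_neighbors": [1, 3, 5],
    "weights": ["uniform", "distance"],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=3)
grid.fit(X, y)

n_candidates = len(grid.cv_results_["params"])
print("candidates evaluated:", n_candidates)
print("best parameters:", grid.best_params_)
print("best mean CV accuracy:", grid.best_score_)
```

After fitting, `grid.best_estimator_` holds a model refitted on the whole dataset with the best setting found.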

In [None]:
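# sklearn.grid_search was removed in scikit-learn 0.20; the class now lives in model_selection.
from sklearn.model_selection import GridSearchCV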

---
## Intermission:
## Building Pipelines
A quick reminder of how to build sklearn pipelines and how to access their parameters.
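As a hedged illustration of parameter access, the sketch below builds a two-step pipeline and reads and updates nested parameters. The PCA and logistic regression steps and their names (`pca`, `logistic`) are assumptions made for this example.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Each step is a (name, estimator) pair; the names are chosen freely.
pipe = Pipeline(steps=[
    ("pca", PCA(n_components=20)),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Nested parameters are addressed as <step name>__<parameter name>.
params = pipe.get_params()
print(params["pca__n_components"])
print(params["logistic__C"])

# set_params uses the same naming scheme and returns the pipeline itself.
pipe.set_params(pca__n_components=30, logistic__C=0.1)
print(pipe.named_steps["pca"].n_components)
```

The same `<step>__<parameter>` names are what a parameter grid for a pipeline uses as keys.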

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion

In [None]:
# http://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
# http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_digits.html#example-exercises-plot-cv-digits-py
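from sklearn.dummy import DummyClassifier

# Placeholder step: a baseline classifier stands in until a real model is plugged in.
steps = [
    ('dummy', DummyClassifier(strategy='most_frequent'))
]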
pipe = Pipeline(steps=steps)

In [None]:
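# fit_predict only exists when the last step implements it (e.g. a clusterer), so fit, then predict.
pipe.fit(X, y).predict(X)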

In [None]:
pipe.steps

In [None]:
pipe.get_params()

In [None]:
pipe.set_params()

__end of intermission__

---

In [None]:
# http://scikit-learn.org/stable/modules/grid_search.html
# http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#grid-search-and-cross-validated-estimators
param_grid = [
    {},
]

In [None]:
grid = GridSearchCV(estimator=pipe, param_grid=param_grid,)

### Randomized Search Cross Validation
Randomized search randomly generates a fixed number of hyperparameter setups. It samples the parameters from the provided parameter ranges and then evaluates each setup with cross validation.
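A minimal sketch of this, assuming a k-nearest-neighbours classifier, a `scipy.stats.randint` distribution for the number of neighbours, and 5 sampled setups (all illustrative choices):

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# A value can be a list (sampled uniformly) or any object with an rvs() method.
param_dist = {
    "n_neighbors": randint(1, 15),        # integers from 1 to 14
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist,
    n_iter=5,          # number of sampled setups
    cv=3,
    random_state=42,   # makes the sampling reproducible
)
search.fit(X, y)

n_sampled = len(search.cv_results_["params"])
print("setups evaluated:", n_sampled)
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

Unlike grid search, the cost is fixed by `n_iter` no matter how large the parameter ranges are.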

In [None]:
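# sklearn.grid_search was removed in scikit-learn 0.20; the class now lives in model_selection.
from sklearn.model_selection import RandomizedSearchCV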

In [None]:
# http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
param_dist = {
    # 'step__parameter': list of values or a scipy.stats distribution
}

In [None]:
search = RandomizedSearchCV(estimator=pipe, param_distributions=param_dist, random_state=42)

## Building a Custom Sklearn Node

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, ClusterMixin

In [None]:
class MyEstimator(TransformerMixin, BaseEstimator):  # mixins go left of BaseEstimator
    pass

In [None]:
class MyClassifier(ClassifierMixin, BaseEstimator):  # mixins go left of BaseEstimator
    pass

In [None]:
class MyClustering(ClusterMixin, BaseEstimator):  # mixins go left of BaseEstimator
    pass
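As one possible way to fill in such a skeleton, here is a sketch of a custom transformer. The class `ColumnRangeScaler` and its `eps` parameter are hypothetical names invented for this example. The mixin is listed before `BaseEstimator`, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class ColumnRangeScaler(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: rescale each column by its training min and range."""

    def __init__(self, eps=1e-12):
        # __init__ only stores hyperparameters, so get_params/set_params work.
        self.eps = eps

    def fit(self, X, y=None):
        X = check_array(X)
        # Learned attributes end with an underscore by sklearn convention.
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        check_is_fitted(self, ["min_", "range_"])
        X = check_array(X)
        # eps guards against division by zero for constant columns.
        return (X - self.min_) / (self.range_ + self.eps)


X_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = ColumnRangeScaler()
# fit_transform is provided by TransformerMixin.
X_scaled = scaler.fit_transform(X_demo)
print(X_scaled)
print(scaler.get_params())
```

Because the hyperparameter is exposed through `get_params`, a transformer written this way can be used as a pipeline step and tuned as `<step name>__eps` in a parameter grid.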