## Model Selection
Deciding which model to select involves three topics:
* Overfitting vs underfitting and bias vs variance
* Cross-validation techniques
* Hyper-parameter tuning


### 1. Overfitting vs Underfitting & Bias vs Variance

Overfitting: producing a model that performs well on the data you train it on but generalizes poorly to new data.

Underfitting: producing a model that doesn’t perform well even on the training data.

![Overfitting vs Underfitting](images/dsf2_1101.png)
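
As a rough illustration of the two definitions, the sketch below fits polynomials of degree 0, 1 and 9 to ten noisy points and compares the error on the training points with the error on fresh points. The linear ground truth (`true_f`), the noise level and the sample sizes are assumptions chosen for illustration; they are not the data behind the figure.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Assumed ground truth for this sketch: a straight line
    return 2 * x + 1

# Ten noisy training points and a larger held-out sample
x_train = np.linspace(-1, 1, 10)
y_train = true_f(x_train) + rng.normal(0, 0.5, x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.5, x_test.size)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}  # degree -> (train MSE, test MSE)
for degree in (0, 1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```

The degree 0 model should show a high error on both sets (underfitting), while the degree 9 model passes through all ten training points and typically does much worse on the held-out points (overfitting).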

### Bias-Variance Tradeoff

Bias and variance are both measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population).

For example, the degree 0 model in “Overfitting and Underfitting” will make a lot of mistakes for pretty much any training set (drawn from the same population), which means that it has a high bias. However, any two randomly chosen training sets should give pretty similar models (since any two randomly chosen training sets should have pretty similar average values). So we say that it has a low variance. High bias and low variance typically correspond to underfitting.

On the other hand, the degree 9 model fit the training set perfectly. It has very low bias but very high variance (since any two training sets would likely give rise to very different models). This corresponds to overfitting.
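
The "retrain many times" idea can be simulated directly. The sketch below is a minimal experiment under assumed conditions (a linear ground truth `true_f`, Gaussian noise, 10 training points, 200 retrainings); it estimates the squared bias and the variance of the predictions for polynomial degrees 0, 1 and 9.

```python
import numpy as np

rng = np.random.default_rng(0)

x_train = np.linspace(-1, 1, 10)     # fixed inputs, fresh noise on every retraining
x_eval = np.linspace(-0.9, 0.9, 50)  # points where predictions are compared

def true_f(x):
    # Assumed ground truth for this sketch
    return 2 * x + 1

def bias_variance(degree, n_trials=200, noise=0.5):
    # Retrain on n_trials different training sets and collect the predictions
    preds = np.empty((n_trials, x_eval.size))
    for t in range(n_trials):
        y_train = true_f(x_train) + rng.normal(0, noise, x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_eval)
    # Squared bias: how far the average model is from the truth
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f(x_eval)) ** 2))
    # Variance: how much individual models scatter around the average model
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (0, 1, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.3f}, variance = {v:.3f}")
```

Under these assumptions the degree 0 model should come out with the largest squared bias and a small variance, and the degree 9 model with a small bias and by far the largest variance.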

Thinking about model problems this way can help you figure out what to do when your model doesn’t work so well.

If your model has high bias (which means it performs poorly even on your training data), one thing to try is adding more features. Going from the degree 0 model in “Overfitting and Underfitting” to the degree 1 model was a big improvement.

If your model has high variance, you can similarly remove features. But another solution is to obtain more data (if you can).

In Figure 11-2, we fit a degree 9 polynomial to different size samples. The model fit based on 10 data points is all over the place, as we saw before. If we instead train on 100 data points, there’s much less overfitting. And the model trained from 1,000 data points looks very similar to the degree 1 model. Holding model complexity constant, the more data you have, the harder it is to overfit. On the other hand, more data won’t help with bias. If your model doesn’t use enough features to capture regularities in the data, throwing more data at it won’t help.
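
The effect of sample size can be checked with a small experiment. This sketch assumes a linear ground truth (`true_f`) with Gaussian noise, which is an illustrative choice and not the data used for the figure; it fits a degree 9 polynomial to 10, 100 and 1,000 points and measures how far each fit is from the true line.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground truth for this sketch
    return 2 * x + 1

def sample(n, noise=0.5):
    # Draw n noisy observations of the assumed ground truth
    x = rng.uniform(-1, 1, n)
    return x, true_f(x) + rng.normal(0, noise, n)

x_grid = np.linspace(-1, 1, 200)
errors = {}  # sample size -> mean squared distance from the true line
for n in (10, 100, 1000):
    x, y = sample(n)
    coeffs = np.polyfit(x, y, 9)  # model complexity held constant at degree 9
    errors[n] = float(np.mean((np.polyval(coeffs, x_grid) - true_f(x_grid)) ** 2))
    print(f"n = {n:4d}: distance from the true line = {errors[n]:.4f}")
```

With 1,000 points the degree 9 fit should be close to the straight line, whereas the fit on 10 points is usually far from it.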



### 2. Cross Validation

Cross-validation is a method of evaluating generalization performance that is more stable and thorough than a single split into a training set and a test set.

![Cross Validation](images/grid_search_cross_validation.png)

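This method is also known as k-fold cross-validation: the data is split into k folds, and each fold serves once as the validation set while the model is trained on the remaining k - 1 folds.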
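
To make the procedure concrete, here is a minimal sketch that runs the k-fold loop by hand and compares the result with `cross_val_score`. The variable names (`X_iris`, `model`, `manual_scores`) and the choice of a shuffled `StratifiedKFold` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in cv.split(X_iris, y_iris):
    fold_model = clone(model)                        # fresh, unfitted copy per fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_model.score(X_iris[test_idx], y_iris[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop internally
library_scores = cross_val_score(model, X_iris, y_iris, cv=cv)
print(manual_scores)
print(library_scores)
```

Both arrays should contain the same five accuracies, one per fold.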

In [115]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

((150, 4), (150,))

In [116]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train,y_train)
clf.score(X_test, y_test)

(90, 4) (90,)
(60, 4) (60,)


0.9166666666666666

In [107]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf,X_train,y_train,cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(),scores.std()*2))

[0.66666667 0.61111111 0.88888889 0.72222222 0.88888889]
Accuracy: 0.76 (+/- 0.23)


In [111]:
from sklearn.ensemble import RandomForestClassifier
rf_clf= RandomForestClassifier()
scores = cross_val_score(rf_clf, X_train, y_train,cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[1.         1.         1.         0.88888889 0.94444444]
Accuracy: 0.97 (+/- 0.09)


In [113]:
from sklearn.model_selection import cross_validate
import pandas as pd
res = cross_validate(rf_clf,X_train,y_train,cv=5,return_train_score=True)
print(pd.DataFrame(res))

   fit_time  score_time  test_score  train_score
0  0.233753    0.019873    1.000000          1.0
1  0.225783    0.018792    1.000000          1.0
2  0.201671    0.010619    1.000000          1.0
3  0.167428    0.016453    0.888889          1.0
4  0.176990    0.018620    0.944444          1.0


Cross-validation does not build a model; it merely gives you an idea of how the model might perform for different combinations of training and test data.
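
A quick way to see this: `cross_val_score` fits clones of the estimator, so the object you pass in is still unfitted afterwards. The sketch below uses illustrative names (`X_iris`, `model`, `cv_scores`) that are not part of the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv_scores = cross_val_score(model, X_iris, y_iris, cv=5)
fitted_after_cv = hasattr(model, "coef_")    # the original estimator was never fitted

model.fit(X_iris, y_iris)                    # an explicit fit is needed to get a usable model
fitted_after_fit = hasattr(model, "coef_")

print(fitted_after_cv, fitted_after_fit)     # False True
```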

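#### Stratified K-Fold

Stratified k-fold keeps the class proportions of the full dataset in every fold. In scikit-learn this is `StratifiedKFold`, which `cross_val_score` and `cross_validate` already use by default when `cv` is an integer and the estimator is a classifier. The cell below instead uses a plain `KFold` with shuffling, which mixes the sorted iris labels but does not guarantee equal class proportions per fold.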

In [77]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

In [78]:
res = cross_validate(clf,X,y,cv=kfold,return_train_score=True)
print(pd.DataFrame(res))

   fit_time  score_time  test_score  train_score
0  0.037583    0.000446    1.000000     0.966667
1  0.022896    0.000294    0.833333     0.975000
2  0.023446    0.000260    1.000000     0.966667
3  0.022938    0.000250    1.000000     0.975000
4  0.018184    0.000233    0.933333     0.983333
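
A small sketch of why stratification matters on iris, whose labels are sorted by class: it counts the classes in each test fold for an unshuffled `KFold` and for `StratifiedKFold`. The helper `class_counts_per_test_fold` and the placeholder feature matrix are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

_, y_iris = load_iris(return_X_y=True)
placeholder_X = np.zeros((len(y_iris), 1))  # the splitters only need the number of samples

def class_counts_per_test_fold(splitter):
    # Number of samples of class 0, 1 and 2 in each test fold
    return [np.bincount(y_iris[test_idx], minlength=3).tolist()
            for _, test_idx in splitter.split(placeholder_X, y_iris)]

plain_counts = class_counts_per_test_fold(KFold(n_splits=5))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print(plain_counts)       # first fold is [30, 0, 0]: a single class
print(stratified_counts)  # every fold is [10, 10, 10]
```

Without shuffling or stratification, some test folds contain only one class, so the scores say little about generalization.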


#### Group K-Fold

GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the testing and training sets. For example, if the data is obtained from different subjects with several samples per subject, and the model is flexible enough to learn highly person-specific features, it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting.

In [118]:
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]


#### TimeSeriesSplit

TimeSeriesSplit is a variation of k-fold which returns the first k folds as the training set and the (k+1)th fold as the test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.

This class can be used to cross-validate time series data samples that are observed at fixed time intervals.


In [80]:
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

TimeSeriesSplit(max_train_size=None, n_splits=3)
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


### 3. Hyper-parameter Tuning

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. 

For example: `n_estimators` for `RandomForestClassifier`, or `C` for `LogisticRegression`.
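
The distinction between hyper-parameters and learned parameters can be seen on a single estimator. In this sketch (variable names are illustrative), `C` is set in the constructor before any data is seen, while `coef_` only exists after `fit`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

model = LogisticRegression(C=10, max_iter=1000)
hyper_c = model.get_params()["C"]   # hyper-parameter: chosen by the user, available before fit

model.fit(X_iris, y_iris)
coef_shape = model.coef_.shape      # learned parameters: one row per class, one column per feature

print(hyper_c, coef_shape)          # 10 (3, 4)
```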

* GridSearchCV : exhaustively considers all parameter combinations.
* RandomizedSearchCV : can sample a given number of candidates from a parameter space with a specified distribution.
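
The difference between the two strategies shows up in the candidates they generate. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on an assumed search space (the values of `C` and the `uniform(1, 99)` distribution are illustrative choices).

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values
grid_candidates = list(ParameterGrid({"C": [1, 10, 20, 100],
                                      "penalty": ["l2", "l1"]}))

# Randomized search: a fixed number of draws; C is sampled uniformly from [1, 100]
sampled_candidates = list(ParameterSampler({"C": uniform(1, 99),
                                            "penalty": ["l2", "l1"]},
                                           n_iter=5, random_state=0))

print(len(grid_candidates))     # 8 = 4 values of C x 2 penalties
print(len(sampled_candidates))  # 5, regardless of the size of the space
for candidate in sampled_candidates:
    print(candidate)
```

The grid grows multiplicatively with every added parameter, whereas the cost of the randomized search is fixed by `n_iter`.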

#### GridSearchCV

In [120]:
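from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values: 4 values of C x 2 penalties = 8 combinations
parameters = {'C': [1, 10, 20, 100], 'penalty': ['l2', 'l1']}

# The saga solver supports both the l1 and the l2 penalty
lr = LogisticRegression(solver='saga', tol=1e-2, max_iter=200, random_state=0)

# Reuses kfold, X_train and y_train from the cross-validation cells above
clf = GridSearchCV(lr, parameters, cv=kfold)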
clf.fit(X_train,y_train)

print("Test score : {}".format(clf.score(X_test,y_test)))
print("Best parameters: {}".format(clf.best_params_))
print("Best cross-validation score: {:.2f}".format(clf.best_score_))
print("Best estimator:\n{}".format(clf.best_estimator_))

Test score : 0.9666666666666667
Best parameters: {'C': 10, 'penalty': 'l2'}
Best cross-validation score: 0.98
Best estimator:
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='saga', tol=0.01, verbose=0,
                   warm_start=False)


#### RandomizedSearchCV

In [121]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)

# Candidates are drawn from these discrete values
distributions = dict(C=range(1, 100, 10),
                     penalty=['l2', 'l1'])

# n_iter defaults to 10 sampled candidates
clf = RandomizedSearchCV(logistic, distributions, random_state=0)

clf.fit(X_train, y_train)

print("Test score : {}".format(clf.score(X_test,y_test)))
print("Best parameters: {}".format(clf.best_params_))
print("Best cross-validation score: {:.2f}".format(clf.best_score_))
print("Best estimator:\n{}".format(clf.best_estimator_))

Test score : 0.9666666666666667
Best parameters: {'penalty': 'l2', 'C': 91}
Best cross-validation score: 0.98
Best estimator:
LogisticRegression(C=91, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='saga', tol=0.01, verbose=0,
                   warm_start=False)


In [106]:
import numpy as np

from time import time
import scipy.stats as stats
from scipy.stats import loguniform  # SciPy >= 1.4 provides loguniform; the sklearn.utils.fixes backport is not needed

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# get some data
X, y = load_digits(return_X_y=True)

# build a classifier
clf = SGDClassifier(loss='hinge', penalty='elasticnet',
                    fit_intercept=True)


# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {'average': [True, False],
              'l1_ratio': stats.uniform(0, 1),
              'alpha': loguniform(1e-4, 1e0)}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters
param_grid = {'average': [True, False],
              'l1_ratio': np.linspace(0, 1, num=10),
              'alpha': np.power(10, np.arange(-4, 1, dtype=float))}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)

RandomizedSearchCV took 14.69 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.927 (std: 0.035)
Parameters: {'alpha': 0.00022537743949353187, 'average': True, 'l1_ratio': 0.794240001901195}

Model with rank: 2
Mean validation score: 0.923 (std: 0.029)
Parameters: {'alpha': 0.00024682587297395186, 'average': True, 'l1_ratio': 0.9939262905181496}

Model with rank: 3
Mean validation score: 0.921 (std: 0.024)
Parameters: {'alpha': 0.07520565521065481, 'average': False, 'l1_ratio': 0.18504965056340195}

GridSearchCV took 97.29 seconds for 100 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.931 (std: 0.027)
Parameters: {'alpha': 1.0, 'average': False, 'l1_ratio': 0.0}

Model with rank: 2
Mean validation score: 0.928 (std: 0.028)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.0}

Model with rank: 3
Mean validation score: 0.928 (std: 0.031)
Parameters: {'alpha': 0.001, 'average': True, 'l1_ratio': 0.11111111111111

#### Putting it all together:
![All together](images/all_together.png)
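
The overall workflow can be sketched in a few lines: hold out a test set, tune hyper-parameters with cross-validation on the training set only, then evaluate the refitted best model once on the test set. The estimator, the grid of `C` values and the variable names are assumptions for illustration, and the steps may not match the figure exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# 1. Hold out a test set that is never used during tuning
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.4, random_state=0, stratify=y_iris)

# 2. Tune hyper-parameters with cross-validation on the training data only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10, 100]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_tr, y_tr)  # refit=True (the default) retrains the best model on all of X_tr

# 3. Final evaluation, done once, on the held-out test set
test_score = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print("Best cross-validation score: {:.2f}".format(search.best_score_))
print("Test score: {:.2f}".format(test_score))
```

Keeping the test set out of the search is what makes the final score an honest estimate; the cross-validation score is already biased upward by the selection of the best candidate.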

End-to-end use case, plots to use..
Kaggle competition