# ITMAL Exercise

REVISIONS| |
---------| |
2018-0301| CEF, initial.
2018-0305| CEF, updated.
2018-0306| CEF, updated and spell checked.
2018-0306| CEF, major overhaul of functions.
2018-0306| CEF, fixed problem with MNIST load and Keras.
2018-0307| CEF, modified report functions and changed Qc+d.
2018-0311| CEF, updated Qd.
2018-0312| CEF, added grid and random search figs and added bullets to Qd.
2018-0313| CEF, fixed SVC and gamma issue, and changed dataload to be in fetchmode (non-keras).
2019-1015| CEF, updated for ITMAL E19
2019-1019| CEF, minor text update.
2019-1023| CEF, changed demo model in Qd) from MLPClassifier to SVC.

## Hyperparameters and Gridsearch 

When instantiating a Scikit-learn model in Python, most or all constructor parameters have _default_ values. These values are not part of the internal model and are hence called ___hyperparameters___, in contrast to _normal_ model parameters, say the weights, $\mathbf w$, for an `SGD` model.

An example could be the Python constructor for the support-vector classifier `sklearn.svm.SVC`, with, say, the `kernel` hyperparameter having the default value `'rbf'`. If you had to choose, what would you set it to other than `'rbf'`?

```python
class sklearn.svm.SVC(
    C=1.0,
    kernel='rbf',
    degree=3,
    gamma='auto_deprecated',
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight=None,
    verbose=False,
    max_iter=-1,
    decision_function_shape='ovr',
    random_state=None
  )
```  

The default values might be a sensible general starting point, but for your data, you might want to optimize the hyperparameters to yield a better result.

To be able to set `kernel` to a sensible value you need to go into the documentation for the `SVC` and understand what the kernel parameter represents and what values it can be set to, and you need to understand the consequences of setting `kernel` to something different than the default...and the story repeats for every other hyperparameter!

An alternative is just to __brute-force__ a search of interesting hyperparameters, choosing the 'best' parameters according to a fit-predict and some performance metric, say 'f1'.

Now, you just pick out some hyperparameters, that you figure are important, set them to a suitable range, say

```python
    'kernel':('linear', 'rbf'), 
    'C':[1, 10]
```
and fire up a full (grid) search on this hyperparameter set, which will try out all combinations of `kernel` and `C` for the model and then print the hyperparameter set with the highest score...

<img src="https://itundervisning.ase.au.dk/E19_itmal/L08/Figs/gridsearch.png" style="width:350px">
<small><em>
    <center> Conceptual graphical view of grid search for two distinct hyperparameters. </center> 
    <center> Notice that you would normally search hyperparameters like `alpha` with an exponential range, say [0.01, 0.1, 1, 10] or similar.</center>
</em></small>
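
To make the expansion concrete, the following minimal sketch enumerates the combinations for the two-parameter example using only the standard library. It illustrates the idea and is not Scikit-learn's own implementation; the variable names are chosen for this example. The keys are sorted so that the order matches the indices printed in the grid-search report further down.

```python
from itertools import product

# Same search space as in the example above.
tuning_parameters = {
    'kernel': ('linear', 'rbf'),
    'C': [1, 10],
}

# Sort the keys, then take the Cartesian product of the value lists.
keys = sorted(tuning_parameters)
combinations = [
    dict(zip(keys, values))
    for values in product(*(tuning_parameters[k] for k in keys))
]

# Each entry is one candidate hyperparameter set for the model.
for i, params in enumerate(combinations):
    print(i, params)
```

Two hyperparameters with two values each give 2 x 2 = 4 candidates; every additional hyperparameter multiplies that count by its number of values.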

The demo code below sets up some of our well-known 'hello-world' data and runs a _grid search_ on a particular model, here a _support-vector classifier_ (SVC).

Other models and datasets ('mnist', 'iris', 'moon') can also be examined.

### Qa Explain GridSearchCV

There are two code cells below: 1) function setup, 2) the actual grid-search.

Review the code cells and write a __short__ summary. Mainly focus on __cell 2__, but dig into cell 1 if you find it interesting (notice the use of local-function, a nifty feature in python).
  
In detail, examine the lines:  
  
```python
grid_tuned = GridSearchCV(model, tuning_parameters, ..
grid_tuned.fit(X_train, y_train)
..
FullReport(grid_tuned , X_test, y_test, time_gridsearch)
```
and write a short description of how `GridSearchCV` works: explain how the search parameter set is created and how the overall search mechanism functions (without going into too much detail).

What role does the parameter `scoring='f1_micro'` play in the `GridSearchCV`, and what does `n_jobs=-1` mean? 

NOTICE: you need the dataloader module from `libitmal`, clone 
```
> git clone https://cfrigaard@bitbucket.org/cfrigaard/itmal

```
or pull the GIT repository to get the latest version, and put `libitmal` into the python path.

In [2]:
# TODO: Qa, code review..cell 1) function setup

from time import time
import numpy as np

from sklearn import svm
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report, f1_score
from sklearn import datasets

from libitmal import dataloaders as itmaldataloaders

currmode="N/A" # GLOBAL var!

def SearchReport(model): 
    
    def GetBestModelCTOR(model, best_params):
        def GetParams(best_params):
            r=""          
            for key in sorted(best_params):
                value = best_params[key]
                t = "'" if isinstance(value, str) else ""
                if len(r)>0:
                    r += ','
                r += f'{key}={t}{value}{t}'  
            return r            
        try:
            p = GetParams(best_params)
            return type(model).__name__ + '(' + p + ')' 
        except:
            return "N/A(1)"
        
    print("\nBest model set found on train set:")
    print()
    print(f"\tbest parameters={model.best_params_}")
    print(f"\tbest '{model.scoring}' score={model.best_score_}")
    print(f"\tbest index={model.best_index_}")
    print()
    print(f"Best estimator CTOR:")
    print(f"\t{model.best_estimator_}")
    print()
    try:
        print(f"Grid scores ('{model.scoring}') on development set:")
        means = model.cv_results_['mean_test_score']
        stds  = model.cv_results_['std_test_score']
        i=0
        for mean, std, params in zip(means, stds, model.cv_results_['params']):
            print("\t[%2d]: %0.3f (+/-%0.03f) for %r" % (i, mean, std * 2, params))
            i += 1
    except:
        print("WARNING: the random search does not provide means/stds")
    
    global currmode                
    assert "f1_micro"==str(model.scoring), f"come on, we need to fix the scoring to be able to compare model-fits! Your scoring={str(model.scoring)}...remember to add scoring='f1_micro' to the search"   
    return f"best: dat={currmode}, score={model.best_score_:0.5f}, model={GetBestModelCTOR(model.estimator,model.best_params_)}", model.best_estimator_ 

def ClassificationReport(model, X_test, y_test, target_names=None):
    assert X_test.shape[0]==y_test.shape[0]
    print("\nDetailed classification report:")
    print("\tThe model is trained on the full development set.")
    print("\tThe scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, model.predict(X_test)                 
    # target_names must be passed by keyword; the third positional argument is 'labels'
    print(classification_report(y_true, y_pred, target_names=target_names))
    print()
    
def FullReport(model, X_test, y_test, t):
    print(f"SEARCH TIME: {t:0.2f} sec")
    beststr, bestmodel = SearchReport(model)
    #ClassificationReport(model, X_test, y_test)    
    print(f"CTOR for best model: {bestmodel}\n")
    print(f"{beststr}\n")
    return beststr, bestmodel
    
def LoadAndSetupData(mode, test_size=0.3):
    assert test_size>=0.0 and test_size<=1.0
    
    def ShapeToString(Z):
        n = Z.ndim
        s = "("
        for i in range(n):
            s += f"{Z.shape[i]:5d}"
            if i+1!=n:
                s += ";"
        return s+")"

    global currmode
    currmode=mode
    print(f"DATA: {currmode}..")
    
    if mode=='moon':
        X, y = itmaldataloaders.MOON_GetDataSet(n_samples=5000, noise=0.2)
        itmaldataloaders.MOON_Plot(X, y)
    elif mode=='mnist':
        X, y = itmaldataloaders.MNIST_GetDataSet(fetchmode=False)
        if X.ndim==3:
            X=np.reshape(X, (X.shape[0], -1))
    elif mode=='iris':
        X, y = itmaldataloaders.IRIS_GetDataSet()
    else:
        raise ValueError(f"could not load data for that particular mode='{mode}'")
        
    print(f'  org. data:  X.shape      ={ShapeToString(X)}, y.shape      ={ShapeToString(y)}')

    assert X.ndim==2
    assert X.shape[0]==y.shape[0]
    assert y.ndim==1 or (y.ndim==2 and y.shape[1]==0)    
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, shuffle=True
    )
    
    print(f'  train data: X_train.shape={ShapeToString(X_train)}, y_train.shape={ShapeToString(y_train)}')
    print(f'  test data:  X_test.shape ={ShapeToString(X_test)}, y_test.shape ={ShapeToString(y_test)}')
    print()
    
    return X_train, X_test, y_train, y_test

print('OK')

Using TensorFlow backend.


OK


In [6]:
# TODO: Qa, code review..cell 2) the actual grid-search

# Setup data
X_train, X_test, y_train, y_test = LoadAndSetupData('iris') # 'iris', 'moon', or 'mnist'

# Setup search parameters
model = svm.SVC(gamma=0.001) # NOTE: gamma='scale' is not available in older Scikit-learn versions,
                             #       so a fixed gamma=0.001 is used here

tuning_parameters = {
    'kernel':('linear', 'rbf'), 
    'C':[1, 10]
}

CV=5
VERBOSE=0

# Run GridSearchCV for the model
start = time()
grid_tuned = GridSearchCV(model, tuning_parameters, cv=CV, scoring='f1_micro', verbose=VERBOSE, n_jobs=-1, iid=True)
grid_tuned.fit(X_train, y_train)
t = time()-start

# Report result
b0, m0= FullReport(grid_tuned , X_test, y_test, t)
print('OK')

DATA: iris..
  org. data:  X.shape      =(  150;    4), y.shape      =(  150)
  train data: X_train.shape=(  105;    4), y_train.shape=(  105)
  test data:  X_test.shape =(   45;    4), y_test.shape =(   45)

SEARCH TIME: 0.04 sec

Best model set found on train set:

	best parameters={'C': 1, 'kernel': 'linear'}
	best 'f1_micro' score=0.9714285714285714
	best index=0

Best estimator CTOR:
	SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Grid scores ('f1_micro') on development set:
	[ 0]: 0.971 (+/-0.048) for {'C': 1, 'kernel': 'linear'}
	[ 1]: 0.695 (+/-0.031) for {'C': 1, 'kernel': 'rbf'}
	[ 2]: 0.952 (+/-0.084) for {'C': 10, 'kernel': 'linear'}
	[ 3]: 0.914 (+/-0.111) for {'C': 10, 'kernel': 'rbf'}

Detailed classification report:
	The model is trained on the full development set.
	The scores are comput

### Qa Answers)
__Codeblock 1):__
Codeblock 1 consists of a set of helper functions, mainly for loading and splitting the data and for formatting the printed results. An important point is how the helpers use the attributes of the fitted `GridSearchCV` object (`best_params_`, `best_score_`, `best_index_`, `best_estimator_` and `cv_results_`) to print the best result, that is, the set of hyperparameter values producing the best score for the chosen scoring method. The full list of attributes can be found in the Scikit-learn documentation; the key observation is that the search stores the best result automatically once it has run, and the helper functions show how easy it is to access.

__Codeblock 2):__
Codeblock 2 is where the magic happens, as it is here that the grid search is actually performed. The search is performed using the `GridSearchCV` object, which takes several parameters. To describe it, let's break down the most important ones:

- __"model" :__ The estimator to be evaluated by the grid search. The estimator must provide a `score` method; if it does not, the `scoring` parameter must be set. Intuitively this makes sense, as the evaluation requires a measure of goodness in order to compare results.

- __"tuning_parameters" :__ Must be a dictionary with parameter names (strings) as keys and lists of candidate values as values. The dictionary is expanded into every combination of the listed values, and the estimator is evaluated for each combination: a grid is made.

- __"cv" :__ Determines the cross-validation splitting strategy. It can be an integer (the number of folds), a cross-validation splitter object such as `KFold`, or an iterable of train/test splits. Here `CV=5` gives 5-fold cross-validation, which Scikit-learn stratifies by class for classifiers.

- __"scoring" :__ The scoring function to be used. It can either be a (user-defined) callable or the name of a predefined scorer, like `'f1_micro'`.

- __"iid" :__ Short for independent and identically distributed. If `True`, the mean score is weighted by the number of samples in each test fold, under the assumption that the data is identically distributed across the folds. The parameter was deprecated and later removed in newer Scikit-learn releases.

- __"n_jobs" :__ Defines the number of jobs (processes) to run in parallel; `-1` means use all available processors.

With the above in place, one can call the object's `fit` method and thereby perform the grid search on the specified data. The results are printed using the object's attributes and the helpers from codeblock 1.
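
The overall mechanism (every combination is scored on every fold, and the combination with the best mean score wins) can be sketched in plain Python. This is a toy illustration under stated assumptions, not Scikit-learn's implementation: the 1-D threshold classifier, the `shift` hyperparameter, and all function names are made up for the example, the folds are contiguous rather than stratified, and plain accuracy stands in for the scoring function.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, non-shuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if not start <= i < stop]
        yield train_idx, test_idx
        start = stop

def fit_threshold(xs, ys):
    """'Train' a toy classifier: the midpoint between the two class means."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, xs, ys):
    """Fraction of samples where (x >= threshold) agrees with the label."""
    hits = sum((x >= threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def toy_grid_search(xs, ys, param_grid, k):
    keys = sorted(param_grid)
    results = []
    for values in product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(xs), k):
            # Fit on the training folds only ...
            threshold = fit_threshold([xs[i] for i in train_idx],
                                      [ys[i] for i in train_idx])
            threshold += params['shift']  # hypothetical hyperparameter
            # ... and score on the held-out fold.
            fold_scores.append(accuracy(threshold,
                                        [xs[i] for i in test_idx],
                                        [ys[i] for i in test_idx]))
        results.append((sum(fold_scores) / k, params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results

xs = [float(i) for i in range(10)]
ys = [0] * 5 + [1] * 5
best_score, best_params, results = toy_grid_search(
    xs, ys, {'shift': [-3.0, 0.0, 3.0]}, k=5)
print(best_params, best_score)
```

With the default `refit=True`, `GridSearchCV` additionally retrains the winning combination on the whole training set, which is what `best_estimator_` refers to.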


### Qb Hyperparameter Grid Search using an SGD classifier

Now, replace the `svm.SVC` model with an `SGDClassifier` and a suitable set of hyperparameters for that model.

You need at least four or five different hyperparameters from the `SGD` in the search-space before it begins to take considerable compute time doing the full grid search.

In [13]:
# Setup data
X_train, X_test, y_train, y_test = LoadAndSetupData('iris') # 'iris', 'moon', or 'mnist'

# Setup search parameters
model = SGDClassifier(eta0=0.1) # NOTE: eta0 is the initial learning rate; it must be > 0
                                #       for the 'constant' and 'invscaling' schedules

tuning_parameters = {
    # NOTE: 'hinge' is listed twice, so every combination with it is evaluated twice
    'loss':('hinge', 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron',
            'squared_loss', 'huber', 'epsilon_insensitive','squared_epsilon_insensitive'), 
    
    'learning_rate': ('constant', 'optimal', 'invscaling'),
    'l1_ratio':[0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1],
    'alpha':[0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 0.0011, 0.0012, 0.0013, 0.0014, 0.0015],
    'penalty':('none', 'l2', 'l1', 'elasticnet')
}

CV=5
VERBOSE=0

# Run GridSearchCV for the model
start = time()
grid_tuned = GridSearchCV(model, tuning_parameters, cv=CV, scoring='f1_micro', verbose=VERBOSE, n_jobs=-1, iid=True)
grid_tuned.fit(X_train, y_train)
t = time()-start

# Report result
b0, m0= FullReport(grid_tuned , X_test, y_test, t)
print('OK')

DATA: iris..
  org. data:  X.shape      =(  150;    4), y.shape      =(  150)
  train data: X_train.shape=(  105;    4), y_train.shape=(  105)
  test data:  X_test.shape =(   45;    4), y_test.shape =(   45)

SEARCH TIME: 193.23 sec

Best model set found on train set:

	best parameters={'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'l2'}
	best 'f1_micro' score=0.9904761904761905
	best index=1901

Best estimator CTOR:
	SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.1, fit_intercept=True,
              l1_ratio=0.75, learning_rate='invscaling', loss='perceptron',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Grid scores ('f1_micro') on development set:
	[ 0]: 0.781 (+/-0.153) for {'alpha': 0.0001, '

	[1898]: 0.800 (+/-0.164) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'l1'}
	[1899]: 0.857 (+/-0.322) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'elasticnet'}
	[1900]: 0.971 (+/-0.077) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'none'}
	[1901]: 0.990 (+/-0.041) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'l2'}
	[1902]: 0.952 (+/-0.059) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'l1'}
	[1903]: 0.829 (+/-0.155) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'elasticnet'}
	[1904]: 0.695 (+/-0.250) for {'alpha': 0.0001, 'l1_ratio': 0.75, 'learning_rate': 'invscaling', 'loss': 'squared_loss', 'penalty': 'none'}
	[1905]: 0.733 (+/-0

	[3397]: 0.390 (+/-0.258) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'l2'}
	[3398]: 0.324 (+/-0.392) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'l1'}
	[3399]: 0.390 (+/-0.231) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'elasticnet'}
	[3400]: 0.829 (+/-0.218) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'none'}
	[3401]: 0.790 (+/-0.236) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'l2'}
	[3402]: 0.952 (+/-0.059) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'l1'}
	[3403]: 0.914 (+/-0.196) for {'alpha': 0.0002, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[3404]: 0.800 (+

	[4899]: 0.733 (+/-0.302) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'elasticnet'}
	[4900]: 0.914 (+/-0.065) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'none'}
	[4901]: 0.952 (+/-0.059) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'l2'}
	[4902]: 0.962 (+/-0.076) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'l1'}
	[4903]: 0.924 (+/-0.109) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'elasticnet'}
	[4904]: 0.695 (+/-0.199) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'squared_loss', 'penalty': 'none'}
	[4905]: 0.657 (+/-0.154) for {'alpha': 0.0002, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'squared_loss', 'penalty': 'l2'}
	[4906]: 0.419 (+/-0.

	[6778]: 0.886 (+/-0.204) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'squared_hinge', 'penalty': 'l1'}
	[6779]: 0.943 (+/-0.038) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'squared_hinge', 'penalty': 'elasticnet'}
	[6780]: 0.914 (+/-0.067) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'perceptron', 'penalty': 'none'}
	[6781]: 0.829 (+/-0.180) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'perceptron', 'penalty': 'l2'}
	[6782]: 0.876 (+/-0.221) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'perceptron', 'penalty': 'l1'}
	[6783]: 0.838 (+/-0.272) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'perceptron', 'penalty': 'elasticnet'}
	[6784]: 0.333 (+/-0.070) for {'alpha': 0.0003, 'l1_ratio': 0.7, 'learning_rate': 'optimal', 'loss': 'squared_loss', 'penalty': 'none'}
	[6785]: 0.314 (+/-0.101) for {'alpha': 0.0003, 

	[8658]: 0.648 (+/-0.400) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'squared_hinge', 'penalty': 'l1'}
	[8659]: 0.705 (+/-0.394) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'squared_hinge', 'penalty': 'elasticnet'}
	[8660]: 0.800 (+/-0.263) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'perceptron', 'penalty': 'none'}
	[8661]: 0.657 (+/-0.401) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'perceptron', 'penalty': 'l2'}
	[8662]: 0.714 (+/-0.270) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'perceptron', 'penalty': 'l1'}
	[8663]: 0.619 (+/-0.210) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'perceptron', 'penalty': 'elasticnet'}
	[8664]: 0.343 (+/-0.062) for {'alpha': 0.0004, 'l1_ratio': 0.45, 'learning_rate': 'constant', 'loss': 'squared_loss', 'penalty': 'none'}
	[8665]: 0.352 (+/-0.201) for {'al

	[10323]: 0.810 (+/-0.165) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[10324]: 0.819 (+/-0.251) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'hinge', 'penalty': 'none'}
	[10325]: 0.819 (+/-0.185) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'hinge', 'penalty': 'l2'}
	[10326]: 0.810 (+/-0.207) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'hinge', 'penalty': 'l1'}
	[10327]: 0.781 (+/-0.212) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[10328]: 0.838 (+/-0.130) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'log', 'penalty': 'none'}
	[10329]: 0.810 (+/-0.197) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'log', 'penalty': 'l2'}
	[10330]: 0.743 (+/-0.129) for {'alpha': 0.0005, 'l1_ratio': 0.1, 'learning_rate'

	[12050]: 0.876 (+/-0.220) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'l1'}
	[12051]: 0.943 (+/-0.110) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'elasticnet'}
	[12052]: 0.867 (+/-0.179) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'none'}
	[12053]: 0.771 (+/-0.331) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'l2'}
	[12054]: 0.924 (+/-0.041) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'l1'}
	[12055]: 0.924 (+/-0.072) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'elasticnet'}
	[12056]: 0.943 (+/-0.109) for {'alpha': 0.0005, 'l1_ratio': 0.8, 'learning_rate': 'optimal', 'loss': 'squared_hinge', 'penalty': 'none'}
	[12057]: 0.743 (+/-0.286) for {'alpha': 0.0

	[13894]: 0.971 (+/-0.077) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'modified_huber', 'penalty': 'l1'}
	[13895]: 0.943 (+/-0.090) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'modified_huber', 'penalty': 'elasticnet'}
	[13896]: 0.867 (+/-0.189) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'none'}
	[13897]: 0.943 (+/-0.089) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'l2'}
	[13898]: 0.867 (+/-0.239) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'l1'}
	[13899]: 0.838 (+/-0.130) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'squared_hinge', 'penalty': 'elasticnet'}
	[13900]: 0.895 (+/-0.090) for {'alpha': 0.0006, 'l1_ratio': 0.5, 'learning_rate': 'invscaling', 'loss': 'perceptron', 'penalty': 'none'}
	[13901]

	[15393]: 0.410 (+/-0.243) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'epsilon_insensitive', 'penalty': 'l2'}
	[15394]: 0.486 (+/-0.318) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'epsilon_insensitive', 'penalty': 'l1'}
	[15395]: 0.381 (+/-0.152) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'epsilon_insensitive', 'penalty': 'elasticnet'}
	[15396]: 0.400 (+/-0.271) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'none'}
	[15397]: 0.333 (+/-0.410) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'l2'}
	[15398]: 0.276 (+/-0.288) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'squared_epsilon_insensitive', 'penalty': 'l1'}
	[15399]: 0.352 (+/-0.130) for {'alpha': 0.0007, 'l1_ratio': 0.1, 'learning_rate': 'constant', 'loss': 'squa

	[16971]: 0.867 (+/-0.239) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'elasticnet'}
	[16972]: 0.914 (+/-0.143) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'none'}
	[16973]: 0.838 (+/-0.193) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'l2'}
	[16974]: 0.933 (+/-0.171) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'l1'}
	[16975]: 0.876 (+/-0.249) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'modified_huber', 'penalty': 'elasticnet'}
	[16976]: 0.924 (+/-0.160) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'squared_hinge', 'penalty': 'none'}
	[16977]: 0.867 (+/-0.193) for {'alpha': 0.0007, 'l1_ratio': 0.75, 'learning_rate': 'optimal', 'loss': 'squared_hinge', 'penalty': 'l2'}
	[16978]: 0.943 (+/-0.090) 

	[18551]: 0.676 (+/-0.289) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'huber', 'penalty': 'elasticnet'}
	[18552]: 0.362 (+/-0.084) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'epsilon_insensitive', 'penalty': 'none'}
	[18553]: 0.400 (+/-0.295) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'epsilon_insensitive', 'penalty': 'l2'}
	[18554]: 0.476 (+/-0.335) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'epsilon_insensitive', 'penalty': 'l1'}
	[18555]: 0.362 (+/-0.443) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'epsilon_insensitive', 'penalty': 'elasticnet'}
	[18556]: 0.248 (+/-0.497) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'squared_epsilon_insensitive', 'penalty': 'none'}
	[18557]: 0.381 (+/-0.244) for {'alpha': 0.0008, 'l1_ratio': 0.35, 'learning_rate': 'optimal', 'loss': 'squared_epsilon_insensit





	[32324]: 0.952 (+/-0.084) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'none'}
	[32325]: 0.876 (+/-0.090) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'l2'}
	[32326]: 0.914 (+/-0.138) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'l1'}
	[32327]: 0.962 (+/-0.038) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[32328]: 0.895 (+/-0.228) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'none'}
	[32329]: 0.933 (+/-0.072) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'l2'}
	[32330]: 0.962 (+/-0.068) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optimal', 'loss': 'log', 'penalty': 'l1'}
	[32331]: 0.867 (+/-0.201) for {'alpha': 0.0013, 'l1_ratio': 0.85, 'learning_rate': 'optim

	[33588]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'huber', 'penalty': 'none'}
	[33589]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'huber', 'penalty': 'l2'}
	[33590]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'huber', 'penalty': 'l1'}
	[33591]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'huber', 'penalty': 'elasticnet'}
	[33592]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'epsilon_insensitive', 'penalty': 'none'}
	[33593]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'epsilon_insensitive', 'penalty': 'l2'}
	[33594]: 0.695 (+/-0.031) for {'alpha': 0.0014, 'l1_ratio': 0.3, 'learning_rate': 'invscaling', 'loss': 'epsilon_insensitive', 'penalty': 'l1'}
	[33595]: 0.695 (+/-0.031) fo

	[35118]: 0.390 (+/-0.334) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'optimal', 'loss': 'squared_epsilon_insensitive', 'penalty': 'l1'}
	[35119]: 0.362 (+/-0.075) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'optimal', 'loss': 'squared_epsilon_insensitive', 'penalty': 'elasticnet'}
	[35120]: 0.790 (+/-0.143) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'none'}
	[35121]: 0.848 (+/-0.140) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'l2'}
	[35122]: 0.800 (+/-0.151) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'l1'}
	[35123]: 0.800 (+/-0.154) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[35124]: 0.848 (+/-0.087) for {'alpha': 0.0014, 'l1_ratio': 0.95, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'none'}
	[35125]: 0.857 (+

	[36922]: 0.819 (+/-0.166) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'l1'}
	[36923]: 0.895 (+/-0.077) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[36924]: 0.800 (+/-0.158) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'none'}
	[36925]: 0.905 (+/-0.101) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'l2'}
	[36926]: 0.886 (+/-0.139) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'l1'}
	[36927]: 0.886 (+/-0.053) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'hinge', 'penalty': 'elasticnet'}
	[36928]: 0.810 (+/-0.170) for {'alpha': 0.0015, 'l1_ratio': 0.65, 'learning_rate': 'invscaling', 'loss': 'log', 'penalty': 'none'}
	[36929]: 0.829 (+/-0.092) for {'alpha': 0.0015, 'l1_ratio

### Qb Answers)
From the above, it is seen that the hyperparameters 'loss', 'learning_rate', 'l1_ratio', 'alpha' and 'penalty' are searched for optimal values. Two things are interesting in the result: the speed at which the grid was tested (helped by `n_jobs=-1`, which uses all available processors), and the fact that several hyperparameters ended up with optimal values that differ from their defaults when evaluated with the 'f1_micro' scoring method. The example shows the value of the helper functions and the grid-search mechanism when tuning hyperparameters for a chosen model.
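
As a reminder of what the 'f1_micro' score measures, the sketch below computes a micro-averaged F1 by pooling true positives, false positives and false negatives over all classes. The function name and the small label lists are illustrative only and not taken from the search above. For single-label multiclass data such as iris, every misclassification counts as exactly one false positive and one false negative, so the micro-averaged F1 coincides with plain accuracy.

```python
def f1_micro(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = f1_micro(y_true, y_pred)
print(score, accuracy)
```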

__Note:__ The grid search above has been intentionally bloated in order to compare its results with the randomized search below. The finer the grid, the more likely it is to contain a near-optimal combination, and the randomized search can then be compared against it both in speed and in how close its best score comes.
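
How bloated the grid is can be estimated with a little arithmetic. The sketch below counts the candidate values per hyperparameter as they are written in the cell above (including the duplicated 'hinge' entry) and multiplies them; the dictionary of counts is typed in by hand for this illustration.

```python
# Number of candidate values per hyperparameter in the Qb search space.
n_values = {
    'loss': 10,           # 'hinge' appears twice in the tuple
    'learning_rate': 3,
    'l1_ratio': 21,
    'alpha': 15,
    'penalty': 4,
}
cv_folds = 5

# The grid size is the product of the per-parameter counts.
n_combinations = 1
for count in n_values.values():
    n_combinations *= count

# Every combination is fitted once per cross-validation fold.
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)
```

This gives 37800 combinations, which is consistent with the indices seen in the printed grid scores, and 189000 model fits for the 5-fold search.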

### Qc Hyperparameter Random Search using an SGD classifier

Now, add code to run a `RandomizedSearchCV` instead.

<img src="https://itundervisning.ase.au.dk/E19_itmal/L08/Figs/randomsearch.png" style="width:350px">
<small><em>
    <center> Conceptual graphical view of randomized search for two distinct hyperparameters. </center> 
</em></small>

Use the same parameters for the random search, but add and investigate the new `n_iter` parameter:

```python
random_tuned = RandomizedSearchCV(
    model, 
    tuning_parameters, 
    random_state=42, 
    n_iter=20, 
    cv=CV, 
    scoring='f1_micro', 
    verbose=VERBOSE, 
    n_jobs=-1, 
    iid=True)
```
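
The `n_iter` parameter sets how many parameter combinations are drawn from the search space. A minimal sketch using Scikit-learn's `ParameterSampler` (the class `RandomizedSearchCV` uses to draw its candidates) can illustrate this; the small `demo_parameters` dictionary is a made-up example and not the one from the exercise.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical, reduced search space (3 * 3 * 2 = 18 combinations in total)
demo_parameters = {
    'alpha': [0.0001, 0.001, 0.01],
    'l1_ratio': [0.0, 0.5, 1.0],
    'learning_rate': ['constant', 'optimal'],
}

# Draw n_iter=5 candidates; the same random_state gives the same draws
samples = list(ParameterSampler(demo_parameters, n_iter=5, random_state=42))

for i, params in enumerate(samples):
    print(i, params)
```

When every parameter is given as a list, as here, the candidates are sampled without replacement, so no combination is tried twice.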

Comparing the time (in seconds) to complete `GridSearchCV` versus `RandomizedSearchCV` does not necessarily make sense if your grid search completes in a few seconds (as for the tiny-data iris). You need a search that runs for minutes, hours, or days.

But you could compare the best-tuned parameter set and the best score for the two methods. Is the best model from the random search close to the one from the grid search?
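
One way such a comparison could look is sketched below. It assumes a small, made-up search space and loads iris directly from Scikit-learn instead of using the course helpers `LoadAndSetupData()` and `FullReport()`; the name `grid_tuned` is chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Reduced, hypothetical search space (3 * 4 * 3 = 36 combinations)
params = {
    'loss': ['hinge', 'modified_huber', 'perceptron'],
    'alpha': [0.0001, 0.0005, 0.001, 0.0015],
    'penalty': ['l2', 'l1', 'elasticnet'],
}
common = dict(cv=5, scoring='f1_micro')

# Exhaustive search over all 36 candidates
grid_tuned = GridSearchCV(SGDClassifier(random_state=42), params, **common)
# Random search over 10 of the same candidates
random_tuned = RandomizedSearchCV(SGDClassifier(random_state=42), params,
                                  n_iter=10, random_state=42, **common)
grid_tuned.fit(X_train, y_train)
random_tuned.fit(X_train, y_train)

for name, search in (('grid', grid_tuned), ('random', random_tuned)):
    print(f"{name:>6}: score={search.best_score_:.3f}, params={search.best_params_}")
```

Because the model seed and the CV splits are fixed, the random search here scores a subset of the candidates the grid search scores, so its best score cannot exceed the grid's.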

In [20]:
# TODO: Qc..
# Setup data
X_train, X_test, y_train, y_test = LoadAndSetupData('iris') # 'iris', 'moon', or 'mnist'

# Setup search parameters
model = SGDClassifier(eta0=0.1) # NOTE: eta0 must be > 0 for the 'constant' and 'invscaling' 
                             #       learning rates (the default eta0=0.0 raises an error)

tuning_parameters = {
    'loss':('hinge', 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron',
            'squared_loss', 'huber', 'epsilon_insensitive','squared_epsilon_insensitive'), 
    
    'learning_rate': ('constant', 'optimal', 'invscaling'),
    'l1_ratio':[0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1],
    'alpha':[0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 0.0011, 0.0012, 0.0013, 0.0014, 0.0015],
    'penalty':('none', 'l2', 'l1', 'elasticnet')
}

CV=5
VERBOSE=0

# Run RandomizedSearchCV for the model
start = time()
random_tuned = RandomizedSearchCV(
    model, 
    tuning_parameters, 
    random_state=42, 
    n_iter=10000, 
    cv=CV, 
    scoring='f1_micro', 
    verbose=VERBOSE, 
    n_jobs=-1, 
    iid=True)
random_tuned.fit(X_train, y_train)
t = time()-start

# Report result
b0, m0= FullReport(random_tuned , X_test, y_test, t)
print('OK')

DATA: iris..
  org. data:  X.shape      =(  150;    4), y.shape      =(  150)
  train data: X_train.shape=(  105;    4), y_train.shape=(  105)
  test data:  X_test.shape =(   45;    4), y_test.shape =(   45)

SEARCH TIME: 44.87 sec

Best model set found on train set:

	best parameters={'penalty': 'l1', 'loss': 'hinge', 'learning_rate': 'optimal', 'l1_ratio': 0.05, 'alpha': 0.0013}
	best 'f1_micro' score=0.9904761904761905
	best index=2097

Best estimator CTOR:
	SGDClassifier(alpha=0.0013, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.1, fit_intercept=True,
              l1_ratio=0.05, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l1',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Grid scores ('f1_micro') on development set:
	[ 0]: 0.810 (+/-0.170) for {'penalty': 'l2', 'loss': 'hinge', '

	[1544]: 0.924 (+/-0.152) for {'penalty': 'l1', 'loss': 'hinge', 'learning_rate': 'constant', 'l1_ratio': 0.35, 'alpha': 0.0005}
	[1545]: 0.952 (+/-0.084) for {'penalty': 'elasticnet', 'loss': 'hinge', 'learning_rate': 'optimal', 'l1_ratio': 0.3, 'alpha': 0.0006}
	[1546]: 0.695 (+/-0.031) for {'penalty': 'l1', 'loss': 'modified_huber', 'learning_rate': 'constant', 'l1_ratio': 0.65, 'alpha': 0.001}
	[1547]: 0.638 (+/-0.237) for {'penalty': 'none', 'loss': 'hinge', 'learning_rate': 'constant', 'l1_ratio': 0.7, 'alpha': 0.0006}
	[1548]: 0.943 (+/-0.102) for {'penalty': 'l1', 'loss': 'squared_hinge', 'learning_rate': 'optimal', 'l1_ratio': 0.5, 'alpha': 0.0009}
	[1549]: 0.286 (+/-0.194) for {'penalty': 'l2', 'loss': 'squared_epsilon_insensitive', 'learning_rate': 'optimal', 'l1_ratio': 0, 'alpha': 0.0013}
	[1550]: 0.876 (+/-0.158) for {'penalty': 'l1', 'loss': 'hinge', 'learning_rate': 'optimal', 'l1_ratio': 0.4, 'alpha': 0.0003}
	[1551]: 0.771 (+/-0.226) for {'penalty': 'none', 'loss': 'l

	[3986]: 0.724 (+/-0.281) for {'penalty': 'l2', 'loss': 'huber', 'learning_rate': 'optimal', 'l1_ratio': 0.55, 'alpha': 0.0013}
	[3987]: 0.276 (+/-0.147) for {'penalty': 'l2', 'loss': 'squared_loss', 'learning_rate': 'optimal', 'l1_ratio': 0.4, 'alpha': 0.0011}
	[3988]: 0.762 (+/-0.156) for {'penalty': 'l1', 'loss': 'hinge', 'learning_rate': 'constant', 'l1_ratio': 0.25, 'alpha': 0.0007}
	[3989]: 0.314 (+/-0.031) for {'penalty': 'l1', 'loss': 'squared_epsilon_insensitive', 'learning_rate': 'optimal', 'l1_ratio': 0.85, 'alpha': 0.0013}
	[3990]: 0.819 (+/-0.256) for {'penalty': 'elasticnet', 'loss': 'squared_hinge', 'learning_rate': 'invscaling', 'l1_ratio': 0.3, 'alpha': 0.0003}
	[3991]: 0.933 (+/-0.094) for {'penalty': 'none', 'loss': 'modified_huber', 'learning_rate': 'invscaling', 'l1_ratio': 0.4, 'alpha': 0.0004}
	[3992]: 0.867 (+/-0.069) for {'penalty': 'elasticnet', 'loss': 'hinge', 'learning_rate': 'invscaling', 'l1_ratio': 0.5, 'alpha': 0.0011}
	[3993]: 0.924 (+/-0.092) for {'pe

	[5393]: 0.743 (+/-0.194) for {'penalty': 'elasticnet', 'loss': 'modified_huber', 'learning_rate': 'constant', 'l1_ratio': 0.15, 'alpha': 0.0009}
	[5394]: 0.838 (+/-0.178) for {'penalty': 'l1', 'loss': 'log', 'learning_rate': 'constant', 'l1_ratio': 0.5, 'alpha': 0.0015}
	[5395]: 0.924 (+/-0.151) for {'penalty': 'elasticnet', 'loss': 'log', 'learning_rate': 'constant', 'l1_ratio': 0.85, 'alpha': 0.0008}
	[5396]: 0.952 (+/-0.058) for {'penalty': 'l1', 'loss': 'log', 'learning_rate': 'optimal', 'l1_ratio': 0.9, 'alpha': 0.0007}
	[5397]: 0.295 (+/-0.185) for {'penalty': 'l2', 'loss': 'squared_epsilon_insensitive', 'learning_rate': 'constant', 'l1_ratio': 0.25, 'alpha': 0.001}
	[5398]: 0.343 (+/-0.072) for {'penalty': 'l2', 'loss': 'squared_loss', 'learning_rate': 'constant', 'l1_ratio': 0.9, 'alpha': 0.0002}
	[5399]: 0.752 (+/-0.186) for {'penalty': 'none', 'loss': 'perceptron', 'learning_rate': 'constant', 'l1_ratio': 0.85, 'alpha': 0.0004}
	[5400]: 0.819 (+/-0.217) for {'penalty': 'none

	[7718]: 0.495 (+/-0.125) for {'penalty': 'l2', 'loss': 'squared_loss', 'learning_rate': 'invscaling', 'l1_ratio': 0.6, 'alpha': 0.0006}
	[7719]: 0.695 (+/-0.031) for {'penalty': 'l1', 'loss': 'epsilon_insensitive', 'learning_rate': 'invscaling', 'l1_ratio': 0.9, 'alpha': 0.0006}
	[7720]: 0.733 (+/-0.307) for {'penalty': 'elasticnet', 'loss': 'hinge', 'learning_rate': 'constant', 'l1_ratio': 0.95, 'alpha': 0.001}
	[7721]: 0.943 (+/-0.109) for {'penalty': 'l1', 'loss': 'modified_huber', 'learning_rate': 'invscaling', 'l1_ratio': 0.55, 'alpha': 0.0013}
	[7722]: 0.857 (+/-0.057) for {'penalty': 'l2', 'loss': 'log', 'learning_rate': 'invscaling', 'l1_ratio': 0.45, 'alpha': 0.0011}
	[7723]: 0.971 (+/-0.077) for {'penalty': 'elasticnet', 'loss': 'squared_hinge', 'learning_rate': 'optimal', 'l1_ratio': 1, 'alpha': 0.0015}
	[7724]: 0.410 (+/-0.282) for {'penalty': 'elasticnet', 'loss': 'epsilon_insensitive', 'learning_rate': 'constant', 'l1_ratio': 0.4, 'alpha': 0.0011}
	[7725]: 0.352 (+/-0.04

	[9068]: 0.829 (+/-0.288) for {'penalty': 'none', 'loss': 'squared_hinge', 'learning_rate': 'constant', 'l1_ratio': 0.25, 'alpha': 0.0001}
	[9069]: 0.905 (+/-0.136) for {'penalty': 'elasticnet', 'loss': 'squared_hinge', 'learning_rate': 'invscaling', 'l1_ratio': 0.05, 'alpha': 0.0013}
	[9070]: 0.752 (+/-0.312) for {'penalty': 'none', 'loss': 'modified_huber', 'learning_rate': 'optimal', 'l1_ratio': 0.8, 'alpha': 0.0001}
	[9071]: 0.838 (+/-0.056) for {'penalty': 'elasticnet', 'loss': 'hinge', 'learning_rate': 'invscaling', 'l1_ratio': 0.6, 'alpha': 0.0001}
	[9072]: 0.667 (+/-0.184) for {'penalty': 'none', 'loss': 'squared_hinge', 'learning_rate': 'constant', 'l1_ratio': 0.4, 'alpha': 0.0005}
	[9073]: 0.333 (+/-0.049) for {'penalty': 'l2', 'loss': 'squared_epsilon_insensitive', 'learning_rate': 'optimal', 'l1_ratio': 0.05, 'alpha': 0.0004}
	[9074]: 0.800 (+/-0.158) for {'penalty': 'elasticnet', 'loss': 'hinge', 'learning_rate': 'invscaling', 'l1_ratio': 0.5, 'alpha': 0.0008}
	[9075]: 0.9

### Qc Answers)

Firstly, after a few runs with different numbers of iterations (`n_iter`, written N below), some of the results were:

__N = 100__

SEARCH TIME: 0.49 sec

Best model set found on train set:

	best parameters={'penalty': 'l1', 'loss': 'perceptron', 'learning_rate': 'invscaling', 'l1_ratio': 0.5, 'alpha': 0.0008}
	best 'f1_micro' score=0.9619047619047619
	best index=18

__N = 1000__

SEARCH TIME: 4.16 sec

Best model set found on train set:

	best parameters={'penalty': 'none', 'loss': 'modified_huber', 'learning_rate': 'optimal', 'l1_ratio': 0, 'alpha': 0.0013}
	best 'f1_micro' score=0.9904761904761905
	best index=996
    
__N = 10000__

SEARCH TIME: 44.87 sec

Best model set found on train set:

	best parameters={'penalty': 'l1', 'loss': 'hinge', 'learning_rate': 'optimal', 'l1_ratio': 0.05, 'alpha': 0.0013}
	best 'f1_micro' score=0.9904761904761905
	best index=2097

First and foremost, the difference in search time between randomized search and grid search, relative to the best score each of them finds, is extreme. Randomized search yields a lot of "usable" results fast, while grid search looks for _the_ optimal solution among the hyperparameter values chosen for the search. Now, is one better than the other? The answer seems to be that it depends:

- Randomized search provides "good" results fast, but how would one know _how_ good they are?
- Grid search yields _the_ best result with regard to the chosen parameter values, so one is sure to get the best model within the grid. But how can one know whether the hyperparameter values chosen for the grid are the right ones?
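
One rough way to approach the "how good" question for the random search is to ask how likely it is that the sampled candidates include at least one of the top-k configurations of the full grid. The sketch below assumes a grid of 37800 combinations (the Qc value lists multiplied together) and sampling without replacement, which is how Scikit-learn samples when every parameter is given as a list. The function name `p_hit_top_k` and the choice `k=10` are made up for illustration.

```python
def p_hit_top_k(n_grid, n_iter, k):
    """Probability that n_iter draws without replacement from n_grid
    candidates contain at least one of the k best candidates."""
    p_miss = 1.0
    for i in range(n_iter):
        # chance that draw number i also misses all k top candidates
        p_miss *= (n_grid - k - i) / (n_grid - i)
    return 1.0 - p_miss

N_GRID = 37800  # assumed: 10 * 3 * 21 * 15 * 4 combinations
for n_iter in (100, 1000, 10000):
    print(n_iter, round(p_hit_top_k(N_GRID, n_iter, k=10), 3))
```

This only says something about hitting the top of the grid as ranked by the cross-validation score; on a tiny dataset like iris that ranking is itself noisy.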

With the above in mind, the choice of search method depends on the specific use case of the model. One could imagine a use case with specific demands on the score, where a combination of both could be used: randomized search to find "good" values, and grid search afterwards to examine the neighbourhood of those values further.
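
A minimal sketch of such a two-stage search is given below. The value ranges, the choice to refine only `alpha`, and the names `stage1` and `stage2` are assumptions made for illustration; for brevity the search is fitted on the full iris data rather than on a train split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = SGDClassifier(random_state=42)

# Stage 1: coarse random search over a wide, hypothetical range
coarse = {
    'alpha': list(np.logspace(-5, -1, 20)),
    'penalty': ['l2', 'l1', 'elasticnet'],
}
stage1 = RandomizedSearchCV(model, coarse, n_iter=15, cv=5,
                            scoring='f1_micro', random_state=42)
stage1.fit(X, y)
best_alpha = stage1.best_params_['alpha']

# Stage 2: fine grid around the best alpha, keeping the best penalty fixed
fine = {
    'alpha': list(np.linspace(best_alpha / 2, best_alpha * 2, 7)),
    'penalty': [stage1.best_params_['penalty']],
}
stage2 = GridSearchCV(model, fine, cv=5, scoring='f1_micro')
stage2.fit(X, y)

print("stage 1:", stage1.best_score_, stage1.best_params_)
print("stage 2:", stage2.best_score_, stage2.best_params_)
```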

### Qd MNIST Search Quest II

Finally, we create yet another search-quest competition: who can find the best model+hyperparameters for the MNIST dataset?

You change to the MNIST data by calling `LoadAndSetupData('mnist')`, and this is a completely different ball game than the _tiny-data_ iris: it's much larger (but still far from _big-data_)!

* You might opt for an exhaustive grid search, or a faster but less optimal random search...your choice. 

* You are free to pick any classifier in Scikit-learn, even algorithms we have not discussed yet---except Neural Networks! Keep the score function at `f1_micro`, otherwise we will be comparing apples and oranges. 

* And, you may also want to scale your input data for some models to perform better (neural networks in particular).

* DO NOT USE any Neural Network models, including Keras or Tensorflow models...not yet, and there are too many examples on the net to cut-and-paste from!

Check your result by printing the first _return_ value from `FullReport()` 
```python 
b1, m1 = FullReport(random_tuned , X_test, y_test, time_randomsearch)
print(b1)
```
that will display a result like
```
best: dat=iris, score=0.97143, model=SVC(C=1, kernel='linear')
```
Now, check if your score (for MNIST) is better than the currently best score on Blackboard: "L07: Optimization and searching" | "Search Quest for MNIST"

> https://blackboard.au.dk/webapps/blackboard/content/listContentEditable.jsp?content_id=_2302642_1&course_id=_131051_1&mode=reset

and paste your best model into the message box, like
```
best(Grp99): dat=mnist, score=0.47090, model=SVC(C=1, kernel='linear')
```
Remember to provide an ITMAL group name manually, so we can identify a winner: the first prize is yet another cake! 

For the hand-in, report your progress in scoring when choosing different models and hyperparameters to search, and how you might need to preprocess your data...
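
As a hedged illustration of preprocessing inside a search, the sketch below puts a scaler and a classifier in a Scikit-learn `Pipeline`, so the scaler is re-fitted on each cross-validation training fold. It is not the course setup: the small 8x8 `load_digits` set is used as a fast stand-in for MNIST, and the parameter values are picked arbitrarily.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small 8x8 digits set as a fast stand-in for MNIST (assumption for the sketch)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', SVC()),
])

# Pipeline hyperparameters are addressed as <step name>__<parameter>
tuning_parameters = {
    'clf__C': [0.1, 1, 10, 100],
    'clf__gamma': [0.001, 0.01, 0.1, 'scale'],
    'clf__kernel': ['rbf', 'linear'],
}

random_tuned = RandomizedSearchCV(pipe, tuning_parameters, n_iter=8, cv=3,
                                  scoring='f1_micro', random_state=42)
random_tuned.fit(X_train, y_train)

print(random_tuned.best_params_)
print("test score:", random_tuned.score(X_test, y_test))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into the training folds during the search.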

In [3]:
from sklearn.neighbors import KNeighborsClassifier

# TODO: Qd..(in code and text..)
# Setup data
X_train, X_test, y_train, y_test = LoadAndSetupData('mnist') # 'iris', 'moon', or 'mnist'

# Setup search parameters
model = KNeighborsClassifier(n_jobs=-1)

tuning_parameters = {
    'n_neighbors':[3,4,5],
    'weights':('uniform', 'distance'),
    'algorithm':('ball_tree', 'kd_tree', 'brute'),
    'p':[2,3,4],    
}

CV=5
VERBOSE=0

# Run Randomized Search - RandomizedSearchCV for the model
start = time()
random_tuned = RandomizedSearchCV(
    model, 
    tuning_parameters, 
    random_state=42, 
    n_iter=8, 
    cv=CV, 
    scoring='f1_micro', 
    verbose=10, 
    n_jobs=-1, 
    iid=True)
random_tuned.fit(X_train, y_train)
t = time()-start

# Report result
b0, m0= FullReport(random_tuned , X_test, y_test, t)
print(b0)


DATA: mnist..
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
  org. data:  X.shape      =(70000;  784), y.shape      =(70000)
  train data: X_train.shape=(49000;  784), y_train.shape=(49000)
  test data:  X_test.shape =(21000;  784), y_test.shape =(21000)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed: 11.4min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed: 21.4min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 143.7min
[Parallel(n_jobs=-1)]: Done  30 out of  40 | elapsed: 274.3min remaining: 91.4min
[Parallel(n_jobs=-1)]: Done  35 out of  40 | elapsed: 289.6min remaining: 41.4min
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 341.9min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 341.9min finished


SEARCH TIME: 20543.59 sec

Best model set found on train set:

	best parameters={'weights': 'distance', 'p': 4, 'n_neighbors': 3, 'algorithm': 'ball_tree'}
	best 'f1_micro' score=0.9736938775510204
	best index=5

Best estimator CTOR:
	KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=3, p=4,
                     weights='distance')

Grid scores ('f1_micro') on development set:
	[ 0]: 0.970 (+/-0.002) for {'weights': 'distance', 'p': 2, 'n_neighbors': 3, 'algorithm': 'kd_tree'}
	[ 1]: 0.969 (+/-0.004) for {'weights': 'distance', 'p': 2, 'n_neighbors': 5, 'algorithm': 'brute'}
	[ 2]: 0.968 (+/-0.004) for {'weights': 'uniform', 'p': 2, 'n_neighbors': 5, 'algorithm': 'brute'}
	[ 3]: 0.968 (+/-0.004) for {'weights': 'uniform', 'p': 2, 'n_neighbors': 5, 'algorithm': 'ball_tree'}
	[ 4]: 0.971 (+/-0.004) for {'weights': 'uniform', 'p': 3, 'n_neighbors': 4, 'algorithm': 'brute'}
	[ 5]: 0.974 (+/-0.003) f