# Hyperparameter optimization
## Iterative approach with GridSearchCV and RandomForestClassifier

## <font color=blue> Demo Supercomputing 2019 </font>

### PyCOMPSs + Dislib + Fault tolerance

In this example, we run several **GridSearchCV** searches with **RandomForestClassifier** in order to evaluate a set of **hyperparameters**.

The main idea is to perform a grid search over a small set of hyperparameters (with low granularity), obtain its accuracy, and repeat this process over sets of higher granularity. The process is parallelized with **PyCOMPSs**: each task is one **GridSearchCV** with **RandomForestClassifier** for a given set of hyperparameters, and it returns the best hyperparameter configuration together with its accuracy.

Because the algorithm moves from low to high granularity, the number of tasks grows at each level, and each task can compare its accuracy with the one obtained at the previous level. This makes it possible to decide whether it is worth continuing to analyse hyperparameter sets of higher granularity. The decision is implemented with the COMPSs **fault tolerance** mechanism, whose behaviour the user defines with the **@on_failure** decorator placed on top of **@task**.
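The effect of cancelling successors can be illustrated without the COMPSs runtime. The following is a minimal plain-Python model, not the COMPSs implementation: the names `simulate`, `tree` and `scores` are hypothetical, and the scores are invented purely for illustration. A node fails when its score is lower than its parent's score, and everything below a failed or cancelled node is cancelled.

```python
def simulate(tree, scores, root, threshold):
    """Return a status per node: 'ok', 'failed' or 'cancelled'."""
    status = {}

    def visit(node, previous_score, parent_ok):
        if not parent_ok:
            # The parent did not finish, so this node never runs
            status[node] = 'cancelled'
            ok = False
        elif scores[node] < previous_score:
            # Same check as in the notebook: worse than the previous score
            status[node] = 'failed'
            ok = False
        else:
            status[node] = 'ok'
            ok = True
        for child in tree.get(node, []):
            visit(child, scores[node], ok)

    visit(root, threshold, True)
    return status


# Toy tree and made-up scores (illustrative values, not real results)
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G']}
scores = {'A': 0.93, 'B': 0.95, 'C': 0.91,
          'D': 0.95, 'E': 0.94, 'F': 0.96, 'G': 0.97}

status = simulate(tree, scores, root='A', threshold=0.9)
for node in sorted(status):
    print(node, status[node])
```

In this toy run, `C` scores lower than `A`, so `C` fails and its children `F` and `G` are never evaluated, which is the saving the approach aims for.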

First of all, start the COMPSs runtime. Graph generation is activated (`graph=True`) so that the final DAG can be inspected; tracing is disabled in this call (`trace=False`) and can be switched on to also obtain a trace. This post-mortem information shows how the *fault tolerance* mechanism has been applied.

In [None]:
import os
import pycompss.interactive as ipycompss
if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, trace=False, monitor=1000)

Import the **PyCOMPSs** task and on_failure decorators and the synchronization function.

In [None]:
from pycompss.api.task import task
from pycompss.api.on_failure import on_failure
from pycompss.api.api import compss_wait_on

Import the Scikit-learn datasets module.

Import the **Dislib** functions required to perform *GridSearchCV* with *RandomForestClassifier*.

In [None]:
from sklearn import datasets
import dislib as ds
import numpy as np
from dislib.classification import RandomForestClassifier
from dislib.model_selection import GridSearchCV

Define the estimators that are going to be applied per level.

In [None]:
from graphviz import Digraph
dot = Digraph()
dot.node('A', '(1, 32, 64)')
dot.node('B', '(2, 16, 31)')
dot.node('C', '(33, 48, 63)')
dot.node('D', '(3, 9, 15)')
dot.node('E', '(17, 23, 30)')
dot.node('F', '(34, 40, 47)')
dot.node('G', '(49, 55, 62)')
dot.node('H', '(4, 5, 6, 7, 8)')
dot.node('I', '(10, 11, 12, 13, 14)')
dot.node('J', '(18, 19, 20, 21, 22)')
dot.node('K', '(24, 25, 26, 27, 28, 29)')
dot.node('L', '(35, 36, 37, 38, 39)')
dot.node('M', '(41, 42, 43, 44, 45, 46)')
dot.node('N', '(50, 51, 52, 53, 54)')
dot.node('O', '(56, 57, 58, 59, 60, 61)')
dot.edges(['AB', 'AC', 'BD', 'BE', 'CF', 'CG', 'DH', 'DI', 'EJ', 'EK', 'FL', 'FM', 'GN', 'GO'])
dot

In [None]:
estimators = {'level_1': [(1, 32, 64)],
              'level_2': [(2, 16, 31), (33, 48, 63)],
              'level_3': [[(3, 9, 15), (17, 23, 30)], 
                          [(34, 40, 47), (49, 55, 62)]],
              'level_4': [[[(4, 5, 6, 7, 8), (10, 11, 12, 13, 14)], 
                           [(18, 19, 20, 21, 22), (24, 25, 26, 27, 28, 29)]],
                          [[(35, 36, 37, 38, 39), (41, 42, 43, 44, 45, 46)], 
                           [(50, 51, 52, 53, 54), (56, 57, 58, 59, 60, 61)]]]}
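The values in `estimators` appear to follow a bisection rule: each set probes the lower bound, the midpoint and the upper bound of a range, the two ranges strictly between those probes go to the next level, and the last level lists every remaining value. The sketch below reproduces the same values under that assumption. The helpers `split` and `build_levels` are hypothetical names, and the result is flat per level, whereas the notebook nests `level_3` and `level_4` by parent.

```python
def split(lo, hi):
    """Return the coarse probe (lo, mid, hi) and the two inner sub-ranges."""
    mid = (lo + hi) // 2
    return (lo, mid, hi), (lo + 1, mid - 1), (mid + 1, hi - 1)


def build_levels(lo, hi, depth):
    """Build `depth` coarse levels followed by one exhaustive level."""
    levels = []
    ranges = [(lo, hi)]
    for _ in range(depth):
        probes, next_ranges = [], []
        for r_lo, r_hi in ranges:
            probe, left, right = split(r_lo, r_hi)
            probes.append(probe)
            next_ranges.extend([left, right])
        levels.append(probes)
        ranges = next_ranges
    # Last level: every value that has not been probed yet
    levels.append([tuple(range(a, b + 1)) for a, b in ranges])
    return levels


levels = build_levels(1, 64, depth=3)
for number, level in enumerate(levels, start=1):
    print('level_%d: %s' % (number, level))
```

Taken together, the four levels cover every value from 1 to 64 exactly once, so the deepest level is only needed for branches that keep improving.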

Define the task that will perform the **GridSearchCV** with **RandomForestClassifier** for the given hyperparameters.
It also receives the previous score: if the new best score is worse than the previous one, the task raises an exception; otherwise it returns the best parameters and score.

In [None]:
@on_failure(management='CANCEL_SUCCESSORS')
@task(returns=2)
def evaluate(x_dsarray, y_dsarray, estimators, previous_score):
    parameters = {'n_estimators': estimators,
                  'max_depth': range(2, 4)}
    rf = RandomForestClassifier()
    searcher = GridSearchCV(rf, parameters, cv=5)
    np.random.seed(5)
    searcher.fit(x_dsarray, y_dsarray)
    print(str(searcher.best_params_) + " " + str(searcher.best_score_))
    if searcher.best_score_ < previous_score:
        raise Exception("The score achieved is worse than the previous")
    return searcher.best_params_, searcher.best_score_
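To get a feeling for the work inside a single `evaluate` task, the parameter grid can be enumerated by hand. This is a minimal sketch that only uses the standard library: it takes the level-1 set `(1, 32, 64)`, the `max_depth` range and the `cv=5` value from the cells above, and counts the candidates and cross-validation fits, not counting any final refit the searcher may perform.

```python
from itertools import product

n_estimators = (1, 32, 64)   # level-1 set from the notebook
max_depth = range(2, 4)      # same range used inside evaluate
cv = 5                       # number of folds passed to GridSearchCV

# Every combination of the two hyperparameters is one candidate
candidates = [{'n_estimators': n, 'max_depth': d}
              for n, d in product(n_estimators, max_depth)]
total_fits = len(candidates) * cv

for candidate in candidates:
    print(candidate)
print('candidates:', len(candidates), 'cross-validation fits:', total_fits)
```

Each extra value in a hyperparameter set multiplies the number of fits, which is why cancelling unpromising branches early pays off.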

Finally, the main code iterates over the estimators (hyperparameters) level by level, requesting one *evaluate* task per hyperparameter set. The main dependency among these tasks is the previous score (**partialX**, where *X* is the level).

In [None]:
x, y = datasets.load_iris(return_X_y=True)
# Split the 150 samples into row blocks of 30 samples each
x_dsarray = ds.array(x, (30, 4))
y_dsarray = ds.array(y[:, np.newaxis], (30, 1))
results = []
for l1 in estimators['level_1']:
    # The first level is compared against a fixed score of 0.9
    params1, partial1 = evaluate(x_dsarray, y_dsarray, l1, 0.9)
    results.append(('L1', params1, partial1))
    for j, l2 in enumerate(estimators['level_2']):
        # Each deeper level is compared against the score of its parent
        params2, partial2 = evaluate(x_dsarray, y_dsarray, l2, partial1)
        results.append(('L2', params2, partial2))
        for k, l3 in enumerate(estimators['level_3'][j]):
            params3, partial3 = evaluate(x_dsarray, y_dsarray, l3, partial2)
            results.append(('L3', params3, partial3))
            for l4 in estimators['level_4'][j][k]:
                params4, partial4 = evaluate(x_dsarray, y_dsarray, l4, partial3)
                results.append(('L4', params4, partial4))
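The nested loops above submit the tasks depth-first, so the position of each task in `results` is the pre-order position of its node in the graph drawn earlier. The sketch below derives that mapping from the graph edges; `children`, `preorder` and `node_to_index` are hypothetical names introduced only for this illustration.

```python
# Children of each node, copied from the edges of the graph drawn earlier
children = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
            'D': ['H', 'I'], 'E': ['J', 'K'],
            'F': ['L', 'M'], 'G': ['N', 'O']}


def preorder(node):
    """Yield a node first, then each of its subtrees from left to right."""
    yield node
    for child in children.get(node, []):
        yield from preorder(child)


# Position of every node in the submission order
node_to_index = {node: i for i, node in enumerate(preorder('A'))}
print(node_to_index)
```

This explains why, for instance, node `C` is found at index 8 of `results` rather than at index 2.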

Check the results:

In [None]:
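ok = 0
cancelled = 0
for i, result in enumerate(results):
    level = result[0]
    # Synchronize both outputs of the task; a task that failed or was
    # cancelled is expected to yield empty values here
    params = compss_wait_on(result[1])
    score = compss_wait_on(result[2])
    if params and score:
        results[i] = (level, params, score)
        ok += 1
    else:
        # Mark tasks that did not produce a result
        results[i] = None
        cancelled += 1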
for result in results:
    print(result)

print("OK: " + str(ok))
print("Cancelled: " + str(cancelled))

Compare the results achieved with the graph, and represent in <font color=red> red </font> the tasks that have been cancelled. The tasks that have finished successfully and achieved a good score are shown in <font color=green> green </font> with the accuracy.

In [None]:
dot = Digraph()
cancelled_color = 'red'
ok_color = 'green'
field = 2  # index of the score within each (level, params, score) result
# Position in `results` of the task behind each graph node
node_to_result = {'A': 0, 'B': 1, 'C': 8, 'D': 2, 'E': 5, 'F': 9, 'G': 12,
                  'H': 3, 'I': 4, 'J': 6, 'K': 7, 'L': 10, 'M': 11,
                  'N': 13, 'O': 14}
for node, index in node_to_result.items():
    result = results[index]
    # Show the score for finished tasks and an "X" for the rest
    dot.node(node, str(result[field]) if result else "X", style='filled',
             fillcolor=ok_color if result else cancelled_color)
dot.edges(['AB', 'AC', 'BD', 'BE', 'CF', 'CG', 'DH', 'DI', 'EJ', 'EK', 'FL', 'FM', 'GN', 'GO'])
dot
# dot.render('graph2', view=False) # To save into file

In [None]:
ipycompss.stop()