# Using Arangopipe to Store Hyper-parameter Optimization Data

This notebook demonstrates how to use Arangopipe to store hyperparameter optimization data, using the California housing dataset. We run a hyperopt experiment that searches over three regression models (lasso, random forest, and k-nearest neighbors) and evaluates 500 candidate configurations to find the best one. Both the parameterization of the experiment and its results are stored in Arangopipe using jsonpickle, which encodes the Python objects as JSON. The JSON representation can be decoded back into a Python object at any time. For example, if a later evaluation needs to examine the parameter space used in an earlier experiment, the stored configuration can be retrieved and inspected.


In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from arangopipe.arangopipe_api import ArangoPipe
fp = "cal_housing.csv"
df = pd.read_csv(fp)
ap = ArangoPipe()


## Look Up the Registered Dataset

In [2]:
ds_reg = ap.lookup_dataset("cal_housing_dataset")

## Register Experiment Featureset

In [3]:
import numpy as np
# Log-transform the target variable.
df["medianHouseValue"] = np.log(df["medianHouseValue"])
# Record each column's dtype as a string so the featureset holds only plain strings.
featureset = {col: str(dtype) for col, dtype in df.dtypes.items()}
featureset["name"] = "log_transformed_median_house_value"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])
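
To make the shape of the featureset document concrete, here is a minimal sketch on a toy DataFrame. The column names `medianIncome` and `housingMedianAge`, the sample values, and the name `toy_featureset` are assumptions for illustration only; the sketch does not call Arangopipe.

```python
import json

import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "medianIncome": [8.3, 7.2],
    "housingMedianAge": [41.0, 21.0],
    "medianHouseValue": [452600.0, 358500.0],
})

# Map each column name to the string form of its dtype.
featureset = {col: str(dtype) for col, dtype in toy.dtypes.items()}
featureset["name"] = "toy_featureset"

# Because every value is a plain string, the dict serializes to JSON directly.
featureset_json = json.dumps(featureset)
```

Converting the dtypes to strings matters here because pandas dtype objects are not JSON-serializable on their own.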

## Run the Experiment

In [4]:
from sklearn.model_selection import train_test_split
preds = df.columns.tolist()
preds.remove("medianHouseValue")
X = df[preds]
Y = df["medianHouseValue"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [5]:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
hyper_param_options = [
    {
        'type': 'lasso',
        'alpha': hp.uniform('alpha', 0.0, 1)
    },
    {
        'type': 'randomforest',
        'max_depth': hp.choice('max_depth', range(1,5)),
        'max_features': hp.choice('max_features', range(1,8)),
        'n_estimators': hp.choice('n_estimators', range(1,50))
    },
    {
        'type': 'knn',
        'n_neighbors': hp.choice('knn_n_neighbors', range(1,20))
    }
]
space = hp.choice('regressor_type', hyper_param_options)

In [6]:
from sklearn.metrics import mean_squared_error
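def hyperopt_train_test(params):
    """Fit the regressor described by params and return its test-set MSE."""
    # This removes 'type' from params, so callers should pass a copy.
    regressor_type = params.pop('type')
    if regressor_type == 'lasso':
        reg = linear_model.Lasso(**params)
    elif regressor_type == 'randomforest':
        reg = RandomForestRegressor(**params)
    elif regressor_type == 'knn':
        reg = neighbors.KNeighborsRegressor(**params)
    else:
        # Returning 0 here would look like a perfect loss to hyperopt.
        raise ValueError('unknown regressor type: ' + str(regressor_type))
    reg.fit(X_train, y_train)
    ytest_pred = reg.predict(X_test)
    return mean_squared_error(y_test, ytest_pred)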

In [7]:
count = 0
best_loss = float('inf')
def f(params):
    """Hyperopt objective: return the test-set MSE as the loss to minimize."""
    global best_loss, count
    count += 1
    mse = hyperopt_train_test(params.copy())
    if mse < best_loss:
        print('new best:', mse, 'using', params['type'])
        best_loss = mse
    if count % 100 == 0:
        print('iters:', count, ', loss:', mse, 'using', params)
    return {'loss': mse, 'status': STATUS_OK}

trials = Trials()
# fmin returns the best assignment found; hp.choice entries are reported as indices.
best = fmin(f, space, algo=tpe.suggest, max_evals=500, trials=trials)
print('best:', best)

new best: 0.33060970265276035 using knn
new best: 0.1785698052526255 using randomforest
new best: 0.12935838517553624 using randomforest
new best: 0.12636883777172153

## Convert Hyperopt Space to JSON

In [8]:
import jsonpickle
frozen_space = jsonpickle.encode(space)

## Register the Hyperopt Model

In [9]:
model_info = {"name": "calhousing_hyperparam_exp",
              "type": "hyperopt experiment"}
model_reg = ap.register_model(model_info)

## Generate ID for Experiment Storage

In [12]:
import uuid
ruuid = str(uuid.uuid4().int)

## Store Results in Arangopipe

In [13]:
model_params = {"hyper-param-space": frozen_space, "run_id": ruuid}

In [14]:
import datetime
model_perf = {"best": jsonpickle.encode(best), "run_id": ruuid, "timestamp": str(datetime.datetime.now())}

In [15]:
run_info = {"dataset": ds_reg["_key"],
            "featureset": fs_reg["_key"],
            "run_id": ruuid,
            "model": model_reg["_key"],
            "model-params": model_params,
            "model-perf": model_perf,
            "pipeline": "cal_housing_hyper_parameter_optimization",
            "project": "House Price Estimation"}
ap.log_run(run_info)
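
The run record above ties its three parts together through a shared `run_id`. The sketch below gathers that assembly into one function using only the standard library. The helper name `build_run_record` and the placeholder keys and strings passed to it are hypothetical, and the sketch stops short of calling `ap.log_run`.

```python
import datetime
import json
import uuid


def build_run_record(dataset_key, featureset_key, model_key,
                     frozen_space, frozen_best):
    """Assemble a run record whose nested parts share one run_id."""
    run_id = str(uuid.uuid4().int)
    return {
        "dataset": dataset_key,
        "featureset": featureset_key,
        "run_id": run_id,
        "model": model_key,
        "model-params": {"hyper-param-space": frozen_space,
                         "run_id": run_id},
        "model-perf": {"best": frozen_best,
                       "run_id": run_id,
                       "timestamp": str(datetime.datetime.now())},
        "pipeline": "cal_housing_hyper_parameter_optimization",
        "project": "House Price Estimation",
    }


# Placeholder keys and pre-encoded strings, for illustration only.
record = build_run_record("ds_key", "fs_key", "model_key",
                          '{"placeholder": "space"}', '{"alpha": 0.5}')

# Every value is a string or a dict of strings, so the record is JSON-serializable.
record_json = json.dumps(record)
```

Generating the identifier inside the helper keeps the top-level, parameter, and performance entries from drifting apart when a cell is re-run out of order.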