<font color='red'>THIS NOTEBOOK IS FROM THE ARANGOML MULTI-MODEL COLLABORATION ARTICLE. PLEASE REFER TO THAT ARTICLE FOR FURTHER CONTEXT [HERE](https://www.arangodb.com/2021/01/arangoml-series-multi-model-collaboration/).</font>

<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/example_output/ML_Collaboration_Hyperopt_Integration_output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of Arangopipe

In [1]:
%%capture
!pip install python-arango
!pip install arangopipe==0.0.70.0.0
!pip install pandas PyYAML==5.1.1 sklearn2 hyperopt uuid datetime jsonpickle

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import mean_squared_error

import jsonpickle
import uuid
import datetime

This notebook provides an overview of **Arangopipe**, a component of ArangoDB for managing metadata from machine learning pipelines. Arangopipe has two APIs:
1. **arangopipe_api**
2. **arangopipe_admin_api**

**arangopipe_api** is used for machine learning metadata management. **arangopipe_admin_api** is used to provision users into **Arangopipe**. This notebook illustrates both APIs with a machine learning model that predicts house prices. The data is available in the UCI machine learning repository.

### Connect to Arangopipe
In a real environment you would reconnect to the same database and update the existing project so that your colleagues could reference your work later.

If you have been following along with the previous notebooks [here]() and [here](), you can see this continuity with a couple of small changes.
1. Uncomment and update these `conn_params` variable properties with the credentials generated in the first notebook:
 * `DB_NAME`
 * `DB_USER_NAME`
 * `DB_PASSWORD`
2. Change the ArangoPipeAdmin `reuse_connection` parameter to `True`
3. Comment out registering a new project and uncomment the project lookup.
4. Comment out registering a new dataset and uncomment the dataset lookup.
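
The reconnection steps above can be sketched as a small helper. This is only a sketch and not part of Arangopipe: the function name `reuse_conn_params` and the environment variable names `ARANGOML_DB_NAME`, `ARANGOML_DB_USER` and `ARANGOML_DB_PASSWORD` are assumptions made for illustration. It takes the `msc` object and a base `conn_params` dictionary like the ones created in the next cell and adds the three saved credentials, so they do not have to be typed into the notebook.

```python
import os


def reuse_conn_params(msc, base_params, env=None):
    """Return a copy of base_params with saved credentials added.

    msc is expected to expose DB_NAME, DB_USER_NAME and DB_PASSWORD,
    as ManagedServiceConnParam does in this notebook. The environment
    variable names below are hypothetical.
    """
    env = os.environ if env is None else env
    wanted = {
        msc.DB_NAME: "ARANGOML_DB_NAME",
        msc.DB_USER_NAME: "ARANGOML_DB_USER",
        msc.DB_PASSWORD: "ARANGOML_DB_PASSWORD",
    }
    missing = [var for var in wanted.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    params = dict(base_params)  # leave the caller's dictionary untouched
    for key, var in wanted.items():
        params[key] = env[var]
    return params
```

With the three variables exported, `conn_params = reuse_conn_params(msc, conn_params)` would stand in for the commented-out credential lines, and `ArangoPipeAdmin` would then be created with `reuse_connection = True`.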

In [19]:
from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = { msc.DB_SERVICE_HOST : "arangoml.arangodb.cloud", \
                        msc.DB_SERVICE_END_POINT : "createDB",\
                        msc.DB_SERVICE_NAME : "createDB",\
                        # msc.DB_NAME: 'YOUR DATABASE NAME',\
                        # msc.DB_USER_NAME:'YOUR USERNAME',\
                        # msc.DB_PASSWORD: 'YOUR PASSWORD',\
                        msc.DB_SERVICE_PORT : 8529,\
                        msc.DB_CONN_PROTOCOL : 'https',\
                        msc.DB_REPLICATION_FACTOR: 3}
mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config) # Change reuse_connection to True
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)

# Prints the temporary login credentials
# These credentials are only valid for a short time
mdb_config.get_cfg()

{'arangodb': {'DB_end_point': 'createDB',
  'DB_service_host': 'arangoml.arangodb.cloud',
  'DB_service_name': 'createDB',
  'DB_service_port': 8529,
  'arangodb_replication_factor': 3,
  'conn_protocol': 'https',
  'dbName': 'MLcb8t7o8ksjkklhujwwlav8',
  'password': 'MLp2rlotji9hl6jk41otqsv',
  'username': 'MLrdigyhsof680b6q0lpk5vtv'},
 'mlgraph': {'graphname': 'enterprise_ml_graph'}}
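
Because these credentials are generated fresh for each temporary database, it can be convenient to save them for the follow-up notebooks. The following is a minimal sketch that assumes a configuration dictionary shaped like the output above; the helper names and the file name in the usage note are hypothetical.

```python
import json


def extract_credentials(cfg):
    """Pick the three reconnect values out of the config dictionary."""
    db = cfg["arangodb"]
    return {
        "dbName": db["dbName"],
        "username": db["username"],
        "password": db["password"],
    }


def save_credentials(cfg, path):
    """Write the credentials to a JSON file at a caller-chosen path."""
    with open(path, "w") as handle:
        json.dump(extract_credentials(cfg), handle, indent=2)
```

A call such as `save_credentials(mdb_config.get_cfg(), "arangopipe_credentials.json")` would store the values needed for `DB_NAME`, `DB_USER_NAME` and `DB_PASSWORD`. Since the file contains a password, it should be kept out of version control.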

## Lookup Project

Normally you would not need to register a new project each time. It is only necessary here because the tutorial notebooks typically generate a new temporary database.

If you have been following along, you could instead uncomment the project lookup and comment out or delete the two project registration lines.

In [4]:
# project = ap.lookup_entity("Housing_Price_Estimation_Project", "project")

proj_info = {"name": "Housing_Price_Estimation_Project"}
project = admin.register_project(proj_info)
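
Instead of commenting lines in and out, the lookup and the registration can be combined. The helper below is a hypothetical sketch, not an Arangopipe function. It assumes that a failed lookup returns `None`, which is how lookups are described later in this notebook.

```python
def lookup_or_register(lookup, register):
    """Return lookup() unless it is None, otherwise return register().

    Both arguments are zero-argument callables, so the registration
    only runs when the lookup finds nothing.
    """
    found = lookup()
    if found is not None:
        return found
    return register()


# Possible use with the objects from this notebook (not executed here):
# project = lookup_or_register(
#     lambda: ap.lookup_entity("Housing_Price_Estimation_Project", "project"),
#     lambda: admin.register_project(proj_info),
# )
```

Passing callables rather than results keeps the registration lazy, which matters when registering an existing name would raise an error.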

### Associate Model with Project
This pipeline determines the best regression model to use for the project. We will conduct this experiment with Hyperopt. First, however, we link the model developed in this pipeline with the project.

In [5]:
model_info = {"name": "hyper-param-optimization",  "type": "hyper-opt-experiment"}
model_reg = ap.register_model(model_info, project = "Housing_Price_Estimation_Project")

## Pipeline Development
This notebook illustrates the process of storing pipeline metadata while executing a machine learning pipeline. The objective of this experiment is to determine the best model for the dataset using the **Hyperopt** library. After conducting the experiments, the result is tagged and stored in **Arangopipe**.

### Read Data

In [6]:
data_url = "https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv"
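# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) skips malformed rows instead
df = pd.read_csv(data_url, on_bad_lines="skip")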

fp = "cal_housing.csv"

### Register the Dataset

Here we register the same dataset that we registered in the first notebook. This is only necessary because we expect that a new temporary database was generated.

If you have been following along, you can uncomment the dataset lookup and comment out the dataset registration lines. There is a unique constraint on the dataset name, so attempting to add it again should result in an error if you are already using the credentials from the first notebook.

In [7]:
# Lookup the dataset registered with the initial notebook. 
# dataset = ap.lookup_dataset("california-housing-dataset")

# Register dataset, comment out if following along.
ds_info = {"name" : "california-housing-dataset",\
            "description": "This dataset lists median house prices in California. Various house features are provided",\
           "source": "UCI ML Repository" }
dataset = ap.register_dataset(ds_info)

### Register the Featureset Generated from the Dataset
A log transformation is applied to the median house value. The featureset generated from the dataset is then registered with **Arangopipe**. Note that the featureset is linked to the dataset using the dataset registration obtained in the previous step.

In [8]:
df["medianHouseValue"] = df["medianHouseValue"].apply(lambda x: np.log(x))
featureset = df.dtypes.to_dict()
featureset = {k:str(featureset[k]) for k in featureset}
featureset["name"] = "log_transformed_median_house_value"
fs_reg = ap.register_featureset(featureset, dataset["_key"])
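
Because the target is now stored on a log scale, any model trained on it predicts in log units, and a prediction has to be exponentiated to get back to a price. The following sketch shows the round trip with the standard library only; the house values are made up for illustration.

```python
import math

# Hypothetical median house values in dollars (illustrative only).
values = [100000.0, 250000.0, 500000.0]

# Forward transform, the same operation np.log performs element-wise.
log_values = [math.log(v) for v in values]

# A model trained on log_values predicts in log units; exponentiate
# to return to dollars.
recovered = [math.exp(v) for v in log_values]

# The transform compresses the range: a 5x spread in dollars becomes
# a difference of log(5) in log units.
spread = log_values[-1] - log_values[0]
```

The errors reported later in this notebook are therefore errors in log units, not in dollars.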

### Run the Experiment

In [9]:
preds = df.columns.tolist()
preds.remove("medianHouseValue")
X = df[preds]
Y = df["medianHouseValue"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

**Set up the Hyperopt Experiment**

Define the Hyperopt search space. Here it represents the candidate models and their associated parameterizations.

In [10]:
hyper_param_options = [{
        'type': 'lasso',
        'alpha': hp.uniform('alpha', 0.0, 1)
    },

    {
        'type': 'randomforest',
        'max_depth': hp.choice('max_depth', range(1,5)),
        'max_features': hp.choice('max_features', range(1,8)),
        'n_estimators': hp.choice('n_estimators', range(1,50))
    },
    {
        'type': 'knn',
        'n_neighbors': hp.choice('knn_n_neighbors', range(1,20))
    }
]
space = hp.choice('regressor_type', hyper_param_options)

def hyperopt_train_test(params):
    regressor_type = params['type']
    del params['type']
    if regressor_type == 'lasso':
        reg = linear_model.Lasso(**params)
    elif regressor_type == 'randomforest':
        reg = RandomForestRegressor(**params)
    elif regressor_type == 'knn':
        reg = neighbors.KNeighborsRegressor(**params)
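    else:
        # Returning 0 would look like a perfect loss to hyperopt, so fail loudly instead
        raise ValueError(f"Unknown regressor type: {regressor_type}")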
    reg.fit(X_train, y_train)
    ytest_pred = reg.predict(X_test)
    return mean_squared_error(y_test, ytest_pred)
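
Note that `hyperopt_train_test` returns the mean squared error (MSE), not its square root. The sketch below spells out the default computation in plain Python and shows how the root mean squared error (RMSE) follows from it. The function name is hypothetical and the numbers are made up.

```python
import math


def mean_squared_error_py(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up log-scale targets and predictions, for illustration only.
y_true = [12.0, 12.5, 13.0]
y_pred = [12.1, 12.3, 13.2]

mse = mean_squared_error_py(y_true, y_pred)  # (0.01 + 0.04 + 0.04) / 3
rmse = math.sqrt(mse)
```

Read this way, the best loss of about 0.114 reported further down is an MSE, and its square root is about 0.34 in log units.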

**Run the Hyperopt Experiment**

In [11]:
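count = 0
best_loss = float("inf")  # lowest loss seen so far
def f(params):
    global best_loss, count
    count += 1
    # hyperopt_train_test returns the mean squared error on the test set
    mse = hyperopt_train_test(params.copy())
    if mse < best_loss:
        print('new best:', mse, 'using', params['type'])
        best_loss = mse
    if count % 250 == 0:
        # 'acc' is the loss (MSE) of the current evaluation, not an accuracy
        print('iters:', count, ', acc:', mse, 'using', params)
    return {'loss': mse, 'status': STATUS_OK}

trials = Trials()
# fmin returns the best parameter assignment; hp.choice entries are reported as indices
best = fmin(f, space, algo=tpe.suggest, max_evals=500, trials=trials)
print('best:', best)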

new best: 0.25182933552576053 using lasso
new best: 0.12690774538944666 using lasso
new best: 0.12188265157246057 using randomforest
new best: 0.11468024299236657 using lasso
new best: 0.11456515860773858 using lasso
new best: 0.11379396083303803 using lasso
new best: 0.11373415186805391 using lasso
new best: 0.11372762789004627 using lasso
iters: 250 , acc: 0.1569496666062016 using {'alpha': 0.11396564749321282, 'type': 'lasso'}
new best: 0.11372643688773879 using lasso
iters: 500 , acc: 0.1851953579599621 using {'max_depth': 4, 'max_features': 1, 'n_estimators': 22, 'type': 'randomforest'}
100%|██████████| 500/500 [01:11<00:00,  7.01it/s, best loss: 0.11372643688773879]
best: {'alpha': 2.4177502328996713e-05, 'regressor_type': 0}
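
The objective above tracks its progress in module-level globals. As an alternative, the same bookkeeping can live in a closure. This is a sketch rather than the notebook's method: `make_objective` is a hypothetical helper, and the default status string `"ok"` is assumed to match `hyperopt.STATUS_OK`.

```python
def make_objective(train_test, report_every=250, status_ok="ok"):
    """Wrap train_test(params) -> loss in a hyperopt-style objective.

    Returns (objective, state); state holds the evaluation count and
    the best loss seen so far, instead of module-level globals.
    """
    state = {"count": 0, "best_loss": float("inf"), "best_params": None}

    def objective(params):
        state["count"] += 1
        # Pass a copy so train_test may delete keys such as 'type'.
        loss = train_test(dict(params))
        if loss < state["best_loss"]:
            state["best_loss"] = loss
            state["best_params"] = dict(params)
        if state["count"] % report_every == 0:
            print("iters:", state["count"], "loss:", loss)
        return {"loss": loss, "status": status_ok}

    return objective, state
```

In this notebook it could be used as `objective, state = make_objective(hyperopt_train_test)` followed by `fmin(objective, space, algo=tpe.suggest, max_evals=500, trials=Trials())`. Afterwards, `state["best_params"]` holds the sampled parameters of the best evaluation, including the `type` key.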


## Convert Hyperopt Space to JSON

In [12]:
ruuid = str(uuid.uuid4().int)
frozen_space = jsonpickle.encode(space)
model_params = {"name": "Housing_Price_Regression_Model_Params",\
                "hyperopt-space": frozen_space, "run_id": ruuid}


## Store Results in Arangopipe
Note that we tag the run so that we can look it up by the tag if we need to retrieve it from storage.

In [13]:
model_perf = {"best": jsonpickle.encode(best), "run_id": ruuid, "timestamp": str(datetime.datetime.now())}
run_info = {"dataset" : dataset["_key"],\
                    "featureset": fs_reg["_key"],\
                    "run_id": ruuid,\
                    "model": model_reg["_key"],\
                    "model-params": model_params,\
                    "model-perf": model_perf,\
                    "tag": "Housing-Price-Hyperopt-Experiment",\
                    "project": "Housing_Price_Estimation_Project"}
ap.log_run(run_info)
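
Since a run ties together several registrations, a quick check before logging can catch a missing piece early. The key list below is copied from the `run_info` dictionary above. Whether Arangopipe itself requires every one of these keys is an assumption, and `missing_run_keys` is a hypothetical helper.

```python
REQUIRED_RUN_KEYS = (
    "dataset", "featureset", "run_id", "model",
    "model-params", "model-perf", "tag", "project",
)


def missing_run_keys(run_info, required=REQUIRED_RUN_KEYS):
    """Return the required keys that are absent or empty in run_info."""
    return [key for key in required if not run_info.get(key)]
```

For example, `missing_run_keys(run_info)` should return an empty list for the dictionary built in this cell.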


## What was the best model from the previous run?
The tag (Housing-Price-Hyperopt-Experiment) that we applied while logging the previous experiment can be used to retrieve the results associated with the previous run. For example, we may be interested in the best model and its parameters from the experiment we just conducted.

In [14]:
mp = ap.lookup_modelperf("Housing-Price-Hyperopt-Experiment")

### Note about lookups:
Check the return value of the lookup to see if you got a reference to what you were looking for. If what you are looking for was not found, you will get a "None" for the return value.

In [15]:
mp = ap.lookup_modelperf("A non existent experiment in the database")
mp is None

True

In [16]:
mp = ap.lookup_modelperf("Housing-Price-Hyperopt-Experiment")

In [17]:
mp["best"]

'{"alpha": 2.4177502328996713e-05, "regressor_type": 0}'
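
The stored result holds indices rather than names: `regressor_type` is the position of the chosen model in `hyper_param_options`, and any parameter defined with `hp.choice` is likewise reported as an index into its list of options. The sketch below decodes the stored string by hand, assuming it is plain JSON as in the output above; `decode_best` and the two lookup tables are hypothetical helpers built from the search space defined earlier.

```python
import json

# The string stored under mp["best"] in the output above.
stored_best = '{"alpha": 2.4177502328996713e-05, "regressor_type": 0}'

# Option order copied from hyper_param_options in this notebook.
REGRESSOR_TYPES = ["lasso", "randomforest", "knn"]

# hp.choice parameters are reported as indices into these ranges.
CHOICE_VALUES = {
    "max_depth": list(range(1, 5)),
    "max_features": list(range(1, 8)),
    "n_estimators": list(range(1, 50)),
    "knn_n_neighbors": list(range(1, 20)),
}


def decode_best(best_json):
    """Turn the stored hyperopt result into a readable description."""
    best = json.loads(best_json)
    decoded = {"type": REGRESSOR_TYPES[best.pop("regressor_type")]}
    for label, value in best.items():
        if label in CHOICE_VALUES:
            decoded[label] = CHOICE_VALUES[label][value]
        else:
            decoded[label] = value  # e.g. alpha, sampled with hp.uniform
    return decoded


best_model = decode_best(stored_best)
```

Here `best_model` identifies a lasso regressor with the stored `alpha`. When the original `space` object is still available, hyperopt's `space_eval(space, best)` performs this decoding in general.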

## Advanced Modeling Option

If you need to extend or customize the Arangopipe schema, the API provides that capability: you can add vertex types and edge types. In the context of this hyperparameter experiment notebook, suppose we want to save metadata about the notebooks used for a project in a new graph vertex type and link the project to those notebooks. The following code segment illustrates how this can be done.

In [18]:
notebook_info = {"version": "v1", "author": "John Doe", "name": "hyperopt_integration.ipynb"}
if not admin.has_vertex('notebook'):
    admin.add_vertex_to_arangopipe('notebook')
nb_info = ap.insert_into_vertex_type('notebook', notebook_info)
if not admin.has_edge('project_notebook'):
    admin.add_edge_definition_to_arangopipe('project_notebook', 'project', 'notebook')
ap.insert_into_edge_type('project_notebook', project, nb_info)


{'_id': 'project_notebook/545953669-541953545',
 '_key': '545953669-541953545',
 '_rev': '_bvZniMe--_'}
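
The check-then-create pattern in the cell above can be wrapped in a helper so that repeated runs stay idempotent. This is a hypothetical sketch: it only assumes that `admin` offers the four `ArangoPipeAdmin` methods used above, and the helper name is not part of Arangopipe.

```python
def ensure_vertex_and_edge(admin, vertex, edge, from_vertex, to_vertex):
    """Create the vertex and edge types only if they are missing.

    Returns the names of the types that were created, which is an
    empty list when the schema was already in place.
    """
    created = []
    if not admin.has_vertex(vertex):
        admin.add_vertex_to_arangopipe(vertex)
        created.append(vertex)
    if not admin.has_edge(edge):
        admin.add_edge_definition_to_arangopipe(edge, from_vertex, to_vertex)
        created.append(edge)
    return created
```

For this notebook the call would be `ensure_vertex_and_edge(admin, 'notebook', 'project_notebook', 'project', 'notebook')`, followed by the two insert calls shown above.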