# Overview of Arangopipe

This notebook provides an overview of **Arangopipe**, a component of ArangoDB for managing metadata from machine learning pipelines. Arangopipe has two APIs:
1. **arangopipe_api**
2. **arangopipe_admin_api**

**arangopipe_api** is the API used for machine learning metadata management. **arangopipe_admin_api** is the API used to provision users into **Arangopipe**. This notebook illustrates both APIs with a machine learning model that predicts house prices. The data is available in the UCI machine learning repository.

### Create a Project
Use the administrative API to register a project with Arangopipe.

In [None]:
from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = {
    msc.DB_SERVICE_HOST: "localhost",
    msc.DB_SERVICE_END_POINT: "apmdb",
    msc.DB_SERVICE_NAME: "createDB",
    msc.DB_SERVICE_PORT: 8529,
    msc.DB_CONN_PROTOCOL: "http",
}

mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)
proj_info = {"name": "Housing_Price_Estimation_Project"}
proj_reg = admin.register_project(proj_info)

### Associate Model with Project
This pipeline determines the best regression model to use for the project. We will conduct the experiment with Hyperopt. First, however, we link the model developed in this pipeline to the project.

In [None]:
model_info = {"name": "hyper-param-optimization",  "type": "hyper-opt-experiment"}
model_reg = ap.register_model(model_info, project = "Housing_Price_Estimation_Project")

## Pipeline Development
This notebook illustrates the process of storing pipeline metadata while executing a machine learning pipeline. The objective of this experiment is to determine the best model for the dataset using the **Hyperopt** library. After the experiments are run, the result is tagged and stored in **Arangopipe**.

### Read Data

In [None]:
import pandas as pd
fp = "cal_housing.csv"
df = pd.read_csv(fp)

### Register the Dataset

In [None]:
ds_info = {
    "name": "california-housing-dataset",
    "description": "This dataset lists median house prices in California. Various house features are provided.",
    "source": "UCI ML Repository",
}
ds_reg = ap.register_dataset(ds_info)

### Register the Featureset Generated from the Dataset
A log transformation is required for the median house value. The featureset generated from the dataset is registered with **Arangopipe**. Note that the featureset is linked to the dataset through the dataset registration obtained in the previous step.

In [None]:
import numpy as np
# Log-transform the target column in place
df["medianHouseValue"] = np.log(df["medianHouseValue"])
# Describe the featureset as a mapping from column name to dtype name
featureset = {k: str(v) for k, v in df.dtypes.to_dict().items()}
featureset["name"] = "log_transformed_median_house_value"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])
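
Because the target is now stored on a log scale, any prediction from a model trained on it is also a log value and has to be exponentiated to get back to the original units. The following is a minimal sketch of that round trip using only the standard library. The sample prices are made up for illustration and are not taken from the dataset.

```python
import math

# Hypothetical median house values (illustrative only, not from the dataset).
prices = [85000.0, 150000.0, 500001.0]

# Forward transform, equivalent to applying np.log element-wise.
log_prices = [math.log(p) for p in prices]

# Inverse transform: exponentiate to return to the original units.
recovered = [math.exp(v) for v in log_prices]

for original, logged, back in zip(prices, log_prices, recovered):
    print(f"{original:>10.1f} -> {logged:.4f} -> {back:>10.1f}")
```

The same inverse applies to model output: a prediction `p` on the log scale corresponds to `math.exp(p)` in the original units.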

### Run the Experiment

In [None]:
from sklearn.model_selection import train_test_split
preds = df.columns.tolist()
preds.remove("medianHouseValue")
X = df[preds]
Y = df["medianHouseValue"]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

**Set up the Hyperopt Experiment**

Define the Hyperopt search space. Here it covers the candidate models and their associated hyperparameters.

In [None]:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import mean_squared_error

hyper_param_options = [
    {
        'type': 'lasso',
        'alpha': hp.uniform('alpha', 0.0, 1)
    },
    {
        'type': 'randomforest',
        'max_depth': hp.choice('max_depth', range(1,5)),
        'max_features': hp.choice('max_features', range(1,8)),
        'n_estimators': hp.choice('n_estimators', range(1,50))
    },
    {
        'type': 'knn',
        'n_neighbors': hp.choice('knn_n_neighbors', range(1,20))
    }
]
space = hp.choice('regressor_type', hyper_param_options)

def hyperopt_train_test(params):
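    # 'type' selects the model; the remaining entries are its keyword arguments
    regressor_type = params.pop('type')
    if regressor_type == 'lasso':
        reg = linear_model.Lasso(**params)
    elif regressor_type == 'randomforest':
        reg = RandomForestRegressor(**params)
    elif regressor_type == 'knn':
        reg = neighbors.KNeighborsRegressor(**params)
    else:
        # Returning 0 here would look like a perfect loss to the optimizer
        raise ValueError(f"Unknown regressor type: {regressor_type}")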
    reg.fit(X_train, y_train)
    ytest_pred = reg.predict(X_test)
    return mean_squared_error(y_test, ytest_pred)
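
`mean_squared_error` returns the mean of the squared residuals, so the loss handed to Hyperopt is an MSE on the log-transformed target. Minimizing the MSE or its square root (the RMSE) selects the same model, because the square root is monotonic. The sketch below works through the arithmetic by hand; the values are made up for illustration.

```python
import math

# Made-up log-scale targets and predictions, for illustration only.
y_true = [11.0, 12.0, 13.0]
y_pred = [11.5, 12.0, 12.0]

# Mean of the squared residuals.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# The RMSE is in the same units as the (log-scale) target.
rmse = math.sqrt(mse)

print("MSE :", mse)
print("RMSE:", rmse)
```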

**Run the Hyperopt Experiment**

In [None]:
count = 0
best_loss = float('inf')  # lowest loss seen so far, used only for progress messages

def f(params):
    global best_loss, count
    count += 1
    # hyperopt_train_test returns the MSE on the log-transformed target
    mse = hyperopt_train_test(params.copy())
    if mse < best_loss:
        print('new best:', mse, 'using', params['type'])
        best_loss = mse
    if count % 250 == 0:
        print('iters:', count, ', mse:', mse, 'using', params)
    return {'loss': mse, 'status': STATUS_OK}

trials = Trials()
# fmin returns the best parameter assignment found
best = fmin(f, space, algo=tpe.suggest, max_evals=500, trials=trials)
print('best:', best)
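
The dictionary returned by `fmin` reports each `hp.choice` parameter as an index into its list of options, not as the chosen value. Hyperopt's `space_eval(space, best)` converts such a result back into concrete parameter values. The plain-Python sketch below only illustrates the index semantics. The `example_best` dictionary is a made-up result, and `decode_choice` is a hypothetical helper that is not part of Hyperopt.

```python
# Options listed in the same order as the hp.choice definitions in the search space.
regressor_types = ["lasso", "randomforest", "knn"]
max_depth_options = list(range(1, 5))  # [1, 2, 3, 4]

# Made-up fmin-style result: hp.choice entries are positions, not values.
example_best = {"regressor_type": 1, "max_depth": 2}

def decode_choice(options, index):
    """Hypothetical helper: map an hp.choice index back to its option."""
    return options[index]

decoded = {
    "regressor_type": decode_choice(regressor_types, example_best["regressor_type"]),
    "max_depth": decode_choice(max_depth_options, example_best["max_depth"]),
}
print(decoded)  # {'regressor_type': 'randomforest', 'max_depth': 3}
```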

## Convert Hyperopt Space to JSON

In [None]:
import jsonpickle
import uuid
ruuid = str(uuid.uuid4().int)
frozen_space = jsonpickle.encode(space)
model_params = {
    "name": "Housing_Price_Regression_Model_Params",
    "hyperopt-space": frozen_space,
    "run_id": ruuid,
}


## Store Results in Arangopipe
Note that we tag the run so that it can be looked up by that tag when we need to retrieve it from storage.

In [None]:

import datetime

model_perf = {
    "best": jsonpickle.encode(best),
    "run_id": ruuid,
    "timestamp": str(datetime.datetime.now()),
}
run_info = {
    "dataset": ds_reg["_key"],
    "featureset": fs_reg["_key"],
    "run_id": ruuid,
    "model": model_reg["_key"],
    "model-params": model_params,
    "model-perf": model_perf,
    "tag": "Housing-Price-Hyperopt-Experiment",
    "project": "Housing_Price_Estimation_Project",  # same name as passed to register_project
}
ap.log_run(run_info)
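
Since `log_run` receives everything in one dictionary, it can help to check the dictionary before sending it. The sketch below shows one way to do that. `REQUIRED_RUN_KEYS` and `missing_run_keys` are hypothetical names introduced here, the key list is simply the set of keys used in this notebook rather than a documented Arangopipe requirement, and the values are placeholders.

```python
# Keys used in this notebook's run_info (an assumption, not taken from the Arangopipe docs).
REQUIRED_RUN_KEYS = (
    "dataset", "featureset", "run_id", "model",
    "model-params", "model-perf", "tag", "project",
)

def missing_run_keys(run_info):
    """Return the expected keys that are absent from run_info."""
    return [key for key in REQUIRED_RUN_KEYS if key not in run_info]

# Placeholder values standing in for the registration results.
example_run_info = {
    "dataset": "ds-key",
    "featureset": "fs-key",
    "run_id": "12345",
    "model": "model-key",
    "model-params": {"run_id": "12345"},
    "model-perf": {"run_id": "12345"},
    "tag": "Housing-Price-Hyperopt-Experiment",
}

print(missing_run_keys(example_run_info))  # ['project']
```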


## What was the best model from the previous run?
The tag (Housing-Price-Hyperopt-Experiment) that we applied while logging the previous experiment can be used to retrieve the results associated with the previous run. For example, we may be interested in the best model and its parameters from the experiment we just conducted.

In [None]:
mp = ap.lookup_modelperf("Housing-Price-Hyperopt-Experiment")

### Note about lookups:
Check the return value of a lookup before using it. If the item you are looking for was not found, the lookup returns `None`.

In [None]:
mp = ap.lookup_modelperf("A non existent experiment in the database")
mp is None
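
In a script, the same check is usually written as a guard before the result is used. The sketch below uses a hypothetical stand-in, `fake_lookup`, backed by a dictionary so that it runs without a database. With Arangopipe, the call would be `ap.lookup_modelperf(tag)`.

```python
# Hypothetical in-memory stand-in for the metadata store.
_stored_perf = {
    "Housing-Price-Hyperopt-Experiment": {"best": "encoded-result", "run_id": "12345"},
}

def fake_lookup(tag):
    """Stand-in for a lookup: returns the record, or None when the tag is unknown."""
    return _stored_perf.get(tag)

def describe(tag):
    record = fake_lookup(tag)
    if record is None:  # identity comparison is the idiomatic test for None
        return f"no model performance stored under tag {tag!r}"
    return f"run {record['run_id']} found for tag {tag!r}"

print(describe("Housing-Price-Hyperopt-Experiment"))
print(describe("A non existent experiment in the database"))
```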

In [None]:
mp = ap.lookup_modelperf("Housing-Price-Hyperopt-Experiment")

In [None]:
mp["best"]

## Advanced Modeling Option

If you need to extend or customize the Arangopipe schema, the API lets you add vertex types and edge types. As an illustration in the context of this hyperparameter experiment, suppose we want to save metadata about the notebooks used for a project in a new vertex type and link the project to those notebooks. The following code segment shows how this can be done.

In [None]:
notebook_info = {"version": "v1", "author": "John Doe", "name": "hyperopt_integration.ipynb"}
if not admin.has_vertex('notebook'):
    admin.add_vertex_to_arangopipe('notebook')
nb_info = ap.insert_into_vertex_type('notebook', notebook_info)
if not admin.has_edge('project_notebook'):
    admin.add_edge_definition_to_arangopipe('project_notebook', 'project', 'notebook')
ap.insert_into_edge_type('project_notebook', proj_reg, nb_info)
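
For orientation, ArangoDB stores an edge as a document whose `_from` and `_to` fields hold the IDs of the two vertices it connects, and a document ID has the form `collection/key`. The sketch below mimics that shape with plain dictionaries. The keys `1001` and `2001` are made up, and `make_edge` is a hypothetical helper, not an Arangopipe function.

```python
# Made-up vertex documents; real keys are assigned by the database.
project_doc = {"_id": "project/1001", "name": "Housing_Price_Estimation_Project"}
notebook_doc = {
    "_id": "notebook/2001",
    "name": "hyperopt_integration.ipynb",
    "version": "v1",
    "author": "John Doe",
}

def make_edge(from_doc, to_doc):
    """Hypothetical helper: build the edge document linking two vertices."""
    return {"_from": from_doc["_id"], "_to": to_doc["_id"]}

edge = make_edge(project_doc, notebook_doc)
print(edge)  # {'_from': 'project/1001', '_to': 'notebook/2001'}
```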
