<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/ML_Collaboration_Model_Building.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color='red'>NOTE: This notebook is a stripped-down version of one from our ArangoML Series, with additional notes for those following along with the ArangoML [Multi-Model Collaboration post](https://www.arangodb.com/2021/01/arangoml-series-multi-model-collaboration/).</font>

The full post that details the basic workflow of arangopipe and introduces the concept of model building can be found [here](https://www.arangodb.com/2020/10/arangoml-part-2-basic-arangopipe-workflow/).



# ArangoML Multi-model Collaboration

This notebook continues a project shared by three colleagues. With the data analysis, candidate modeling, and hyper-parameter selection complete, it is time to build the model.

### Connect to Arangopipe
In a real environment you would reconnect to the same database and update the existing project so that your colleagues can reference your work later.

If you have been following along with the previous notebooks, you can see this continuity by making a few small changes:
1. Uncomment and update these `conn_params` variable properties with the credentials generated in the first notebook:
 * `DB_NAME`
 * `DB_USER_NAME`
 * `DB_PASSWORD`
2. Change the ArangoPipeAdmin `reuse_connection` parameter to `True`
3. Comment out registering a new project and uncomment the project lookup.
4. Comment out registering a new dataset and uncomment the dataset lookup.

For reference, here are the first few rows of our data, showing the variables it includes.

In [None]:
import pandas as pd
data_url = "https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv"
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` (pandas >= 1.3) replaces it.
df = pd.read_csv(data_url, on_bad_lines="skip")

df.head()  # shows the first 5 rows of data with headers

## Installation Prerequisites

In [None]:
%%capture
!pip install python-arango
!pip install arangopipe==0.0.70.0.0
!pip install pandas PyYAML==5.1.1 sklearn2
!pip install jsonpickle

In [None]:

from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = {
    msc.DB_SERVICE_HOST: "arangoml.arangodb.cloud",
    msc.DB_SERVICE_END_POINT: "createDB",
    msc.DB_SERVICE_NAME: "createDB",
    msc.DB_SERVICE_PORT: 8529,
    msc.DB_CONN_PROTOCOL: "https",
}
        
mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)
mdb_config.get_cfg()

# If you receive an error creating the temporary database, please run this code block again.

## Lookup Project

Normally you would not need to register a new project each time; it is only necessary here because the tutorial notebooks typically generate a new temporary database.

If you have been following along, you can instead uncomment the project lookup and comment out or delete the two project registration lines.

In [None]:
# project = ap.lookup_entity("Housing_Price_Estimation_Project", "project")

proj_info = {"name": "Housing_Price_Estimation_Project"}
proj_reg = admin.register_project(proj_info)

### Try it out!
Once the previous block has successfully executed you can navigate to https://arangoml.arangodb.cloud:8529 and sign in with the generated credentials to explore the temporary database.

## Model Building


This section shows how to capture metadata with Arangopipe as part of the model building activity. Model selection is an important activity for data scientists: they consider many candidate models for a task and then choose the best-performing one. For an example that captures metadata from a hyper-parameter tuning experiment, see the [hyperopt](https://github.com/arangoml/arangopipe/blob/master/arangopipe/tests/hyperopt/hyperopt_integration.ipynb) guide. This notebook uses a simpler setting. We assume that model selection has been completed and that a LASSO regression model is the best candidate for the task. Having made this decision, we capture information about the model and its parameters and store it in Arangopipe. The details of these tasks are shown below. Before building the model, we capture information about the dataset and the features used to build it.

### Register Dataset

Here we register the dataset that we imported in the intro section. This dataset is available from the arangopipe repo and was originally made available from the UCI ML Repository. The dataset contains data for housing in California, including:
 - The house configuration & location
 - The median house values and ages
 - The general population & number of households
 - The median income for the area

### For those following along
Here we register the same dataset that was registered in the first notebook. This is only necessary because we expect that a new temporary database was generated.

If you have been following along, you can uncomment the dataset lookup and comment out the dataset registration lines. There is a unique constraint on the dataset name, so attempting to add it again should result in an error if you are already using the credentials from the first notebook.


In [None]:
# Lookup the dataset registered with the initial notebook. 
# dataset = ap.lookup_dataset("california-housing-dataset")

# Register dataset, comment out if following along.
ds_info = {
    "name": "california-housing-dataset",
    "description": "This dataset lists median house prices in California. Various house features are provided",
    "source": "UCI ML Repository",
}
dataset = ap.register_dataset(ds_info)
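
Because of the unique constraint on the dataset name, a "look up first, register only if missing" helper can save some commenting and uncommenting. The sketch below is illustrative only: `get_or_register_dataset` and `FakePipe` are hypothetical names, `FakePipe` is an in-memory stand-in so the example runs without a database, and it assumes that `lookup_dataset` returns `None` for an unknown name, which you should verify against your Arangopipe version.

```python
def get_or_register_dataset(ap, ds_info):
    """Return the existing dataset record, registering it only if the lookup finds nothing.

    Assumption: `ap.lookup_dataset` returns None for an unknown name.
    """
    existing = ap.lookup_dataset(ds_info["name"])
    if existing is not None:
        return existing
    return ap.register_dataset(ds_info)


class FakePipe:
    """Hypothetical in-memory stand-in for ArangoPipe, used only for this demo."""

    def __init__(self):
        self._datasets = {}

    def lookup_dataset(self, name):
        return self._datasets.get(name)

    def register_dataset(self, ds_info):
        # Mimic the unique constraint on the dataset name.
        if ds_info["name"] in self._datasets:
            raise ValueError("dataset name must be unique")
        doc = dict(ds_info, _key=str(len(self._datasets) + 1))
        self._datasets[ds_info["name"]] = doc
        return doc


fake_ap = FakePipe()
info = {"name": "california-housing-dataset", "source": "UCI ML Repository"}
first = get_or_register_dataset(fake_ap, info)   # registers the dataset
second = get_or_register_dataset(fake_ap, info)  # finds the existing record
print(first["_key"] == second["_key"])
```

With a real connection you would pass the `ap` object created earlier instead of `fake_ap`.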

### Register Featureset

Register the features used to develop the model.


*   Note that the response variable has been log transformed
*   Note that when the featureset is registered, it is linked to the dataset



In [None]:
import numpy as np
df["medianHouseValue"] = np.log(df["medianHouseValue"])  # log transform the response variable
featureset = df.dtypes.to_dict()
featureset = {k:str(featureset[k]) for k in featureset}
featureset["name"] = "log_transformed_median_house_value"
fs_reg = ap.register_featureset(featureset, dataset["_key"]) # note that the dataset and featureset are linked here.
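
Since the response variable is log transformed, anything the model predicts is on the log scale and has to be exponentiated to get back to a house value. A minimal sketch of that round trip, using a made-up value rather than a row from the dataset:

```python
import math

median_value = 250000.0              # hypothetical house value, not taken from the data
log_value = math.log(median_value)   # the scale the model is trained on
recovered = math.exp(log_value)      # back-transform to the original scale

print(round(log_value, 3))
print(round(recovered, 2))
```

The same idea applies to arrays: `np.exp` undoes `np.log` element by element.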

### Develop a Model

As discussed in the introduction it is important to have a training set and a test set to be able to evaluate our model with 'new' data.
Here we use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split) functionality of sklearn to split the data.

Note that we also set `Y` to `medianHouseValue`; `Y` is our target.

In [None]:
from sklearn.model_selection import train_test_split
preds = df.columns.to_list()
preds.remove('medianHouseValue')
X = df[preds].values
Y = df['medianHouseValue'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
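
To make the split less of a black box, here is a rough pure-Python sketch of the idea: shuffle the row indices with a fixed seed, then cut off a fraction for testing. `simple_train_test_split` is a hypothetical helper written for illustration. It is not scikit-learn's implementation and will not reproduce the same rows as `train_test_split`.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # fixed seed makes the split repeatable
    n_test = math.ceil(len(rows) * test_size)  # rounding up is a choice made in this sketch
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))
train_rows, test_rows = simple_train_test_split(rows, test_size=0.25)
print(len(train_rows), len(test_rows))  # 15 5
```

Fixing the seed, as `random_state=42` does above, means colleagues who rerun the notebook evaluate against the same held-out rows.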

### Developing the model
Here we have taken some of the guesswork out of model training and decided to go with Lasso regression.
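
For reference, scikit-learn documents the objective minimized by `linear_model.Lasso` as follows, where $n$ is the number of training samples, $X$ the feature matrix, $y$ the target, $w$ the coefficient vector, and $\alpha$ the `alpha` argument (the intercept term is left out here for brevity):

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

The first term measures how well the model fits the training data, and the second penalizes large coefficients. A larger $\alpha$ drives more coefficients to exactly zero, while a small value such as the `alpha=0.001` used in this notebook applies only a light penalty.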

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
clf = linear_model.Lasso(alpha=0.001)
clf.fit(X_train, y_train)
train_pred = clf.predict(X_train)
test_pred = clf.predict(X_test)
# mean_squared_error expects (y_true, y_pred); the value is the same either way.
train_mse = mean_squared_error(y_train, train_pred)
test_mse = mean_squared_error(y_test, test_pred)

print(train_mse)
print(test_mse)
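
The two numbers printed above are mean squared errors: the average of the squared differences between the targets and the predictions. A small pure-Python sketch of that calculation, with made-up numbers (`mse` is a hypothetical helper, not the scikit-learn function):

```python
def mse(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Two perfect predictions and one that is off by 2: (0 + 0 + 4) / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Keep in mind that the target here is log transformed, so the errors reported in this notebook are on the log scale.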

To get some insight into what model parameters actually are, here are the basic parameters used in this experiment.

They won't mean much if you are not familiar with them, but they offer a starting point if you would like to look further into what exactly model parameters are.

In [None]:
print(clf.get_params())

### Register the Model
* Note that the project and the model are linked.
* The notebook associated with the model can be retrieved from GitHub and stored as part of the metadata associated with the model.


In [None]:
import io
import requests
url = ('https://raw.githubusercontent.com/arangoml/arangopipe/master/examples/Arangopipe_Feature_Examples.ipynb')
nbjson = requests.get(url).text

The model information can contain the name you would like to assign to the model, the task, and the notebook contents.

Once you have created the model info object, you register it with the project.

In [None]:

model_info = {"name": "Lasso Model for Housing Dataset",  "task": "Regression", 'notebook': nbjson}
model_reg = ap.register_model(model_info, project = "Housing_Price_Estimation_Project")


## Log Model Building Activity

In this section we look at the procedure for capturing a consolidated version of this model building activity. The execution of this notebook, or any ML activity, is captured by the 'Run' entity in the Arangopipe schema (see [schema](https://github.com/arangoml/arangopipe)). To record the execution, we need to create a unique identifier for it in ArangoDB. 

After generating a unique identifier, we capture the model parameters and model performance and then record the details of this experiment in Arangopipe. Each of these steps is shown below.

Note that capturing the 'Run', or execution, of this cell records information that links:


1.   The dataset used in this execution (dataset)
2.   The featureset used in this execution (fs_reg)
3.   The model parameters used in this execution (model_params)
4.   The model performance that was observed in this execution (model_perf)



In [None]:
import uuid
import datetime
import jsonpickle

ruuid = str(uuid.uuid4().int)
model_perf = {'training_mse': train_mse, 'test_mse': test_mse, 'run_id': ruuid, "timestamp": str(datetime.datetime.now())}

mp = clf.get_params()
mp = jsonpickle.encode(mp)
model_params = {'run_id': ruuid, 'model_params': mp}

run_info = {
    "dataset": dataset["_key"],
    "featureset": fs_reg["_key"],
    "run_id": ruuid,
    "model": model_reg["_key"],
    "model-params": model_params,
    "model-perf": model_perf,
    "tag": "Housing_Price_Estimation_Project",
    "project": "Housing_Price_Estimation_Project",
}
ap.log_run(run_info)
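
To see what the run identifier and the serialized parameters look like without a database connection, here is a standalone sketch. It uses the standard library's `json` as a stand-in for `jsonpickle`, and the parameter and error values are placeholders, not results from this notebook.

```python
import datetime
import json
import uuid

# A random UUID rendered as a decimal string, as in the cell above.
run_id = str(uuid.uuid4().int)

# Placeholder values for illustration only.
params = {"alpha": 0.001, "fit_intercept": True}
train_mse, test_mse = 0.25, 0.5

# Both records carry the same run_id, which is what ties them to one run.
model_params = {"run_id": run_id, "model_params": json.dumps(params)}
model_perf = {
    "training_mse": train_mse,
    "test_mse": test_mse,
    "run_id": run_id,
    "timestamp": str(datetime.datetime.now()),
}

print(run_id.isdigit())
print(json.loads(model_params["model_params"]) == params)
```

Serializing the parameters to a string keeps the stored record simple, and the shared `run_id` lets you match parameters to the performance they produced.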


### Optional: Save the connection information to Google Drive
Once you have a database created and a project filled with data, you can save your connection configuration to a file so that you can easily reconnect to the instance used in this session.

This file could be shared with your colleagues so that they connect to the same database with the same credentials. For our scenario we assume each colleague has their own credentials and already knows the database information.

Feel free to uncomment to see how to export this file to your personal Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')
# fp = '/content/drive/My Drive/saved_arangopipe_config.yaml'
# mdb_config.export_cfg(fp)

## Using Arangopipe with Common Tools in a Machine Learning Stack

This notebook provides the details of working with Arangopipe to capture metadata from a machine learning project activity. If you would like to see how Arangopipe can be used with some common tools in a machine learning stack:


1. See [TFX](https://github.com/arangoml/arangopipe/tree/master/arangopipe/tests/TFX) for details of using Arangopipe with TFX.
2. See [PyTorch](https://github.com/arangoml/arangopipe/tree/master/arangopipe/tests/pytorch) for details of using Arangopipe with PyTorch.
3. See [Hyperopt](https://github.com/arangoml/arangopipe/tree/master/arangopipe/tests/hyperopt) for details of using Arangopipe with Hyperopt.
4. See [MLFlow](https://github.com/arangoml/arangopipe/tree/master/arangopipe/tests/mlflow) for details of using Arangopipe with MLFlow.

