 
# Getting Started with Arangopipe

## Overview
This notebook gives a brief overview of the **Arangopipe** API through examples that users can adapt to their specific needs. The examples cover integration with common components of a typical machine learning stack, and test data generation utilities are also provided. The examples discussed in this notebook are in the _examples_ directory of the **Arangopipe** docker container. The _examples_ directory contains sub-directories such as _Tensorflow_ and _MLFlow_, each of which illustrates integrating **Arangopipe** with a particular machine learning stack component.

## Arangopipe API
The Python script _arangopipe_test_cases.py_ provides examples of the **Arangopipe** API. The examples cover the following features:
1. Provision a Project (`test_provision_project()`)
2. Register a dataset (`test_dataset_registration()`)
3. Lookup a dataset (`test_dataset_lookup()`)
4. Register a featureset (`test_featureset_registration()`)
5. Lookup a featureset (`test_featureset_lookup()`)
6. Register a model (`test_model_registration()`)
7. Lookup a model (`test_model_lookup()`)
8. Log the data for machine learning experiment execution (`test_log_run()`)
9. Track a model deployed to production in **Arangopipe** (`test_provision_deployment()`)
10. Log the serving performance of a deployed model (`test_log_servingperf()`)

For each API, the script shows the parameters that must be specified and an example of its use.
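The register, lookup, and log steps listed above can be pictured with a small in-memory stand-in. This is a minimal sketch of the workflow only: `InMemoryCatalog`, its methods, and all record names and values are hypothetical and are not the **Arangopipe** API. Refer to _arangopipe_test_cases.py_ for the actual calls and parameters.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCatalog:
    """Hypothetical stand-in for a metadata store; NOT the Arangopipe API."""

    records: dict = field(default_factory=dict)

    def register(self, kind: str, info: dict) -> dict:
        # Store a copy keyed by (kind, name) so later lookups are by name.
        key = (kind, info["name"])
        self.records[key] = dict(info, kind=kind)
        return self.records[key]

    def lookup(self, kind: str, name: str):
        # Return the stored record, or None if nothing was registered.
        return self.records.get((kind, name))


catalog = InMemoryCatalog()

# Register the artifacts first (names are illustrative placeholders).
dataset = catalog.register("dataset", {"name": "wine_quality", "source": "UCI"})
featureset = catalog.register(
    "featureset", {"name": "wine_no_transformations", "dataset": dataset["name"]}
)
model = catalog.register("model", {"name": "elastic_net_wine", "type": "regression"})

# A run links the registered artifacts with parameters and performance.
# The parameter and metric values below are made-up placeholders.
run = catalog.register(
    "run",
    {
        "name": "run_001",
        "dataset": dataset["name"],
        "featureset": featureset["name"],
        "model": model["name"],
        "params": {"alpha": 0.5},
        "performance": {"rmse": 0.79},
    },
)
```

Registering the dataset, featureset, and model before the run mirrors the order of the test cases: a run refers to artifacts by name, so they need to exist before it is logged.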



## Tensorflow/TFX

The notebook _tfx_metadata_integration.ipynb_ illustrates how **Arangopipe** can be used with _Tensorflow Extended (TFX)_. It uses _TFX_ to generate summary statistics for a dataset. The generated _TFX_ artifact is an example of machine learning metadata that can be stored in **Arangopipe**. The example uses the _california housing_ dataset that is available in the **UCI** machine learning repository.
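To make the idea of summary statistics as metadata concrete, here is a minimal sketch that uses only the Python standard library rather than _TFX_. The `summarize` helper, the column names, and the sample rows are assumptions made for illustration; the notebook itself relies on _TFX_ to produce the statistics artifact.

```python
import json
import statistics


def summarize(rows):
    """Compute per-column summary statistics for a list of numeric dict rows."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Tiny made-up rows; the column names are illustrative only.
rows = [
    {"median_income": 2.0, "house_age": 10},
    {"median_income": 4.0, "house_age": 30},
    {"median_income": 6.0, "house_age": 20},
]

stats = summarize(rows)

# Serialize the summary so it can be attached to a dataset record as metadata.
stats_json = json.dumps(stats, sort_keys=True)
```

The serialized summary is a plain JSON document, which is the kind of artifact a metadata store can keep alongside the dataset it describes.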
 


## Hyperopt
Hyperparameter optimization experiments are among the most common tasks performed by **data scientists**, and capturing experimental results, observations and hypotheses is critical. The notebook _hyperopt-integration.ipynb_ provides an example of how **Arangopipe** can be used with _hyperopt_ to store metadata from hyperparameter optimization experiments. It shows how artifacts from these experiments, for example the parameter space specification, can be captured in **Arangopipe**. The example uses the _california housing_ dataset that is available in the **UCI** machine learning repository.
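The sketch below shows what capturing the parameter space together with trial results might look like. It substitutes a plain random search from the standard library for _hyperopt_, and the search space, the toy objective, and the `random_search` helper are all hypothetical. It is meant only to show the shape of the metadata, not the notebook's implementation.

```python
import random

# Hypothetical search space: parameter name -> (low, high) for uniform sampling.
space = {"alpha": (0.0, 1.0), "l1_ratio": (0.0, 1.0)}


def objective(params):
    # Toy loss with a known minimum at alpha=0.3, l1_ratio=0.7.
    return (params["alpha"] - 0.3) ** 2 + (params["l1_ratio"] - 0.7) ** 2


def random_search(space, objective, n_trials, seed=0):
    """Sample the space uniformly and record every trial, not just the best."""
    rng = random.Random(seed)
    trials = []
    for i in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        trials.append({"trial": i, "params": params, "loss": objective(params)})
    best = min(trials, key=lambda t: t["loss"])
    # Keep the space specification with the results so the experiment
    # can be reproduced or compared later.
    return {"space": space, "trials": trials, "best": best}


experiment = random_search(space, objective, n_trials=50)
```

Storing the space specification next to the trials is the design point: results are hard to interpret later without knowing which region of the parameter space was searched.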


## MLFlow

**Arangopipe** can be used in conjunction with other tools and APIs that capture ML metadata, for example **MLFlow**. The data captured can be stored in **ArangoDB** using **Arangopipe**. The example uses the _wine quality_ dataset that is available in the **UCI** machine learning repository. The examples are located in the _mlflow_ directory and illustrate the complete range of the metadata capture API in **Arangopipe**. The scripts need to be executed in the following order:
1. *example_secenario_1.py* registers datasets and featuresets prior to logging machine learning experiment results.
2. *example_secenario_2.py* uses the lookup API to retrieve metadata from **Arangopipe**.
3. *example_secenario_3.py* tags a model from a machine learning experiment for deployment.

A model that is tagged for deployment can be tracked in **Arangopipe** using the administrative API. See the implementation of the `test_provision_deployment()` method in *arangopipe_test_cases.py* for an illustration.



## Scikit-learn
Many machine learning tasks do not need neural network models. Consequently, the machine learning stack in such cases may not use tools like _Tensorflow_ or _Pytorch_; a library like _scikit-learn_ may be adequate. The examples in the _mlflow_ and _hyperopt_ directories show how **Arangopipe** can be used with tools like _scikit-learn_.



## Covariate Shift Detection
- Covariate shift detection is not yet exposed through the API. Adding it is expected to be a small effort, perhaps half a day to a day at most.
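Since this feature is not yet part of the API, the following is only a sketch of one common way to check for covariate shift: comparing the distribution of each feature in a reference batch against a newer batch with the two-sample Kolmogorov-Smirnov statistic. The choice of technique, the function names, and the `threshold` value are assumptions for illustration and do not describe how **Arangopipe** implements or will implement this.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap between two step-function CDFs occurs at an observed point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def shift_report(reference, current, threshold=0.2):
    """Flag each feature whose KS statistic exceeds a hypothetical threshold."""
    report = {}
    for feature in reference:
        stat = ks_statistic(reference[feature], current[feature])
        report[feature] = {"ks": stat, "shifted": stat > threshold}
    return report


# Made-up data: feature x1 is shifted by +5 in the current batch, x2 is unchanged.
reference = {"x1": [i / 10 for i in range(100)], "x2": [i % 7 for i in range(100)]}
current = {"x1": [i / 10 + 5 for i in range(100)], "x2": [i % 7 for i in range(100)]}

report = shift_report(reference, current)
```

A per-feature report like this could be logged as metadata for each incoming batch, so that a shift is visible next to the model performance recorded for the same period.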


## Test Data Generation Utilities
A utility that simulates receiving a continuous stream of dataset batches is provided with **Arangopipe**. This can be used for test data generation. The Python script *generate_model_data.py* is found in the *test_data_generator* directory. This utility simulates the generation of model metadata on a monthly basis over a two-year period. The script generates data for each month and records the corresponding metadata in **Arangopipe**. The data can be generated by invoking `generate_runs()` from the script.
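The monthly simulation described above can be sketched as follows. This is not the code in *generate_model_data.py*: the `simulate_monthly_runs` function, its parameters, the start year, and the fields of each record are hypothetical, and the batch contents are random numbers. The sketch only shows one way to produce a metadata record per month over 24 months.

```python
import random
from datetime import date


def simulate_monthly_runs(start_year=2019, n_months=24, batch_size=100, seed=0):
    """Generate one synthetic batch per month and return a metadata record for each."""
    rng = random.Random(seed)
    runs = []
    for offset in range(n_months):
        year = start_year + offset // 12
        month = offset % 12 + 1
        # Synthetic batch: standard normal draws stand in for real data.
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        runs.append(
            {
                "period": date(year, month, 1).isoformat(),
                "n_rows": len(batch),
                "batch_mean": sum(batch) / len(batch),
            }
        )
    return runs


runs = simulate_monthly_runs()
```

Each record could then be logged to a metadata store in place of the in-memory list used here.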
 