<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ML_Collab_Article/Arangopipe_Generate_TF_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color='red'>THIS NOTEBOOK IS FROM THE ARANGOML MULTI-MODEL COLLABORATION ARTICLE. PLEASE REFER TO THAT ARTICLE FOR FURTHER CONTEXT [HERE](https://www.arangodb.com/2021/01/arangoml-series-multi-model-collaboration/).</font>

## Generating a Data Visualization with TFX Data Validation

Install the prerequisite libraries.

In [None]:
%%capture
!pip install python-arango
!pip install arangopipe==0.0.70.0.0
!pip install pandas PyYAML==5.1.1 sklearn2
!pip install jsonpickle
!pip install tensorflow==2.2.0
# Select TensorFlow 2.x when running in Colab
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

print('Installing TensorFlow Data Validation')
!pip install -q tensorflow_data_validation[visualization]

In [None]:
# The runtime must be restarted for tensorflow_data_validation because of how Colab installs packages.
# After the restart, resume running the code blocks below.
# To resume once the runtime has exited, click into the next cell and press CTRL+F10.
exit()

## Retrieve the Dataset

In [None]:
import pandas as pd
import os
import tensorflow as tf
import tensorflow_data_validation as tfdv
print('TFDV version: {}'.format(tfdv.version.__version__))

In [None]:
data_url = "https://raw.githubusercontent.com/arangoml/arangopipe/arangopipe_examples/examples/data/cal_housing.csv"
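# Skip malformed rows instead of raising an error.
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer pandas versions use on_bad_lines="skip" instead.
df = pd.read_csv(data_url, error_bad_lines=False)

df.head()  # Show the first 5 rows of data with headers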

In [None]:
# Save a local copy of the dataset so TFDV can read it from disk
DATA_DIR = "./"
fp = "cal_housing.csv"
TRAIN_DATA = os.path.join(DATA_DIR, fp)
df.to_csv(TRAIN_DATA, index=False)

In [None]:
pwd

## Generate the TFX Visualization

In [None]:
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA, delimiter=',')

In [None]:
tfdv.visualize_statistics(train_stats)

In [None]:
schema = tfdv.infer_schema(train_stats)

## Connect to Arangopipe

In [None]:
from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam

mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = {
    msc.DB_SERVICE_HOST: "arangoml.arangodb.cloud",
    msc.DB_SERVICE_END_POINT: "createDB",
    msc.DB_SERVICE_NAME: "createDB",
    msc.DB_SERVICE_PORT: 8529,
    msc.DB_CONN_PROTOCOL: "https",
}
mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)

# Prints the temporary login credentials
# These credentials are only valid for a short time
mdb_config.get_cfg()

## Register the Project
This creates a project that the other experiment details can be associated with, which makes all relevant experiment information easy to find.

In [None]:
# Register the project to associate all of our experiment data with
proj_info = {"name": "Housing_Price_Estimation_Project"}
proj_reg = admin.register_project(proj_info)

## Save the Visualization in Arangopipe

In [None]:
from google.protobuf import json_format
enc_stats = json_format.MessageToJson(train_stats)
enc_schema = json_format.MessageToJson(schema)
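
The encoded statistics and schema are plain JSON strings, so they can be inspected with the standard library once they have been retrieved. The sketch below lists the feature names found in such a string. The field names (`datasets`, `features`, `name`, `path.step`) are assumptions based on the TFDV statistics proto and may differ between versions; `feature_names` and the sample document are hypothetical and are not real TFDV output.

```python
import json


def feature_names(encoded_stats: str) -> list:
    """Return the feature names found in a JSON-encoded statistics document."""
    doc = json.loads(encoded_stats)
    names = []
    for dataset in doc.get("datasets", []):
        for feature in dataset.get("features", []):
            if "name" in feature:
                # Assumed older layout: the feature is identified by a plain name
                names.append(feature["name"])
            else:
                # Assumed newer layout: the feature is identified by a path of steps
                names.append(".".join(feature.get("path", {}).get("step", [])))
    return names


# Hand-written sample that mimics the assumed layout (hypothetical feature names)
sample = json.dumps({
    "datasets": [{
        "numExamples": "3",
        "features": [
            {"path": {"step": ["feature_a"]}, "type": "FLOAT"},
            {"name": "feature_b", "type": "FLOAT"},
        ],
    }]
})

names = feature_names(sample)
print(names)
```

In the notebook, `enc_stats` would take the place of `sample`, provided the assumed layout matches the installed TFDV version.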

In [None]:

# Protobuf modules for the statistics and schema messages (not used further in this notebook)
from tensorflow_metadata.proto.v0 import statistics_pb2
from tensorflow_metadata.proto.v0 import schema_pb2

In [None]:
# Register the dataset together with its encoded statistics and schema
ds_info = {
    "name": "california-housing-dataset",
    "description": "data about housing in California",
    "encoded_stats": enc_stats,
    "encoded_schema": enc_schema,
    "source": "UCI ML Repository",
}
ds_reg = ap.register_dataset(ds_info)

# Describe the featureset as a mapping of column name to dtype string
featureset = df.dtypes.to_dict()
featureset = {k: str(featureset[k]) for k in featureset}
featureset["name"] = "cal_housing_dataset_uc_demo_fs"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])

# The following messages indicate that a lookup was performed for these resources.
# If a supplied resource isn't found, the logger reports this and the resource is then added.
# When recording a featureset, the appropriate links (edges) are also created.
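
The featureset registered above is a flat dictionary: one entry per column holding its dtype as a string, plus a `name` entry. A minimal sketch of that shape in plain Python follows; `build_featureset` and the column names are hypothetical and only illustrate the structure.

```python
def build_featureset(dtypes, name):
    """Map each column to its dtype string and attach the featureset name."""
    featureset = {column: str(dtype) for column, dtype in dtypes.items()}
    # The "name" key is written last, so a column that is itself called
    # "name" would be overwritten by the featureset name.
    featureset["name"] = name
    return featureset


# Hypothetical columns standing in for df.dtypes.to_dict()
example_dtypes = {"feature_a": "float64", "feature_b": "int64"}
fs = build_featureset(example_dtypes, "example_fs")
print(fs)
```

Because the column entries and the `name` entry share one dictionary, a dataset with a column called `name` would need that column renamed before it is registered this way.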

## Explore the data
If you ran the notebook yourself and would like to explore the data added so far, you can access the ArangoDB WebUI directly with the temporary credentials generated when connecting to Arangopipe.

The following code block prints these again for you.

In [None]:
mdb_config.get_cfg()
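
The WebUI address can be pieced together from the connection parameters used earlier. This is a minimal sketch, assuming the WebUI is served at the same protocol, host, and port that were passed to `create_connection_config`; `webui_url` is a hypothetical helper and not part of Arangopipe.

```python
def webui_url(protocol: str, host: str, port: int) -> str:
    """Build the base URL of the ArangoDB server from its connection parts."""
    return f"{protocol}://{host}:{port}"


# Values copied from the connection parameters defined earlier in this notebook
url = webui_url("https", "arangoml.arangodb.cloud", 8529)
print(url)
```

Open the printed address in a browser and log in with the temporary credentials from the configuration shown above.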