# Arangopipe TFX Metadata Integration

This notebook illustrates how to store TFX artifacts in Arangopipe. Each artifact is a protocol buffer message, so it can be serialized to a JSON representation with the protobuf `json_format` module and stored as metadata. To use a stored artifact with TFX again, the JSON representation is parsed back into the corresponding TFX object, which can then be passed to TFX components and libraries. The example applies this process to the summary statistics and schema of the California housing dataset. The steps are shown below.
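
The round trip described above can be sketched without TFDV installed. This minimal example assumes only the `protobuf` package and uses `struct_pb2.Struct` as a stand-in message. The field names and values are made up for illustration and are not the notebook's real statistics.

```python
from google.protobuf import json_format
from google.protobuf import struct_pb2

# Stand-in message (assumption): Struct ships with protobuf, so it plays the
# role that the TFDV statistics and schema messages play in the notebook.
original = struct_pb2.Struct()
original.update({"feature": "example_feature", "note": "illustrative only"})

# Serialize the message to a JSON string that can be stored as metadata.
encoded = json_format.MessageToJson(original)

# Parse the JSON string back into an empty message of the same type.
restored = json_format.Parse(encoded, struct_pb2.Struct())
```

The same two calls, `MessageToJson` and `Parse`, are the ones applied to the statistics and schema later in the notebook. Only the message type passed to `Parse` changes.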

## Read the Data

In [None]:
import tensorflow_data_validation as tfdv
import os
DATA_DIR = "./"
TRAIN_DATA = os.path.join(DATA_DIR, 'cal_housing.csv')



In [None]:
pwd

## Calculate the Statistics

In [None]:
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA, delimiter=',')

## Visualize the Statistics 

In [None]:
tfdv.visualize_statistics(train_stats)

## Infer the Schema

In [None]:
schema = tfdv.infer_schema(train_stats)

## Use Arangopipe for Metadata Storage

In [None]:
from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = {
    msc.DB_SERVICE_HOST: "localhost",
    msc.DB_SERVICE_END_POINT: "apmdb",
    msc.DB_SERVICE_NAME: "createDB",
    msc.DB_SERVICE_PORT: 8529,
    msc.DB_CONN_PROTOCOL: "http",
}
        
mdb_config = mdb_config.create_connection_config(conn_params)
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)

In [None]:
import pandas as pd

# Path to the same CSV file used for the TFDV statistics above
fp = "cal_housing.csv"
df = pd.read_csv(fp)

## Create JSON Artifact Representation

In [None]:
from google.protobuf import json_format
enc_stats = json_format.MessageToJson(train_stats)
enc_schema = json_format.MessageToJson(schema)

In [None]:
from tensorflow_metadata.proto.v0 import statistics_pb2
from tensorflow_metadata.proto.v0 import schema_pb2

## Store Artifacts in Arangopipe

In [None]:
data = pd.read_csv(fp)
ds_info = {
    "name": "cal_housing_dataset",
    "description": "data about housing in California",
    "encoded_stats": enc_stats,
    "encoded_schema": enc_schema,
    "source": "UCI ML Repository",
}
ds_reg = ap.register_dataset(ds_info)
featureset = data.dtypes.to_dict()
featureset = {k:str(featureset[k]) for k in featureset}
featureset["name"] = "cal_housing_no_transformations"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])

## Reusing an Existing Connection to a Managed Database

To reuse the database from a previous interaction, retrieve the persisted connection information from `ArangoPipeAdmin` and initialize an `ArangoPipe` instance with that configuration. This is illustrated below.

In [None]:
mdb_config.cfg

In [None]:
# Get the last persisted connection
the_admin = ArangoPipeAdmin(reuse_connection=True)
db_config = the_admin.get_config()
db_config.cfg

## Retrieve Stored Artifacts from Arangopipe

In [None]:
# Use the last persisted connection as the database for this interaction
ap_rtrval = ArangoPipe(config = db_config)
dataset = ap_rtrval.lookup_dataset("cal_housing_dataset")

### Note about lookups:
If you look up a non-existent artifact, the return value is `None`.
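
Because a missing artifact comes back as `None` rather than raising, downstream code may want to fail early. The sketch below shows one way to do that. `require_artifact` is a hypothetical helper, not part of Arangopipe, and a plain dictionary stands in for the lookup call so the example runs on its own.

```python
from typing import Callable, Optional


def require_artifact(lookup: Callable[[str], Optional[dict]], name: str) -> dict:
    """Return the looked-up artifact, or raise KeyError if the lookup gave None."""
    artifact = lookup(name)
    if artifact is None:
        raise KeyError(f"no artifact registered under {name!r}")
    return artifact


# Stand-in for a lookup such as ap_rtrval.lookup_dataset (assumption):
# dict.get also returns None when the key is missing.
fake_store = {"cal_housing_dataset": {"name": "cal_housing_dataset"}}
found = require_artifact(fake_store.get, "cal_housing_dataset")
```

In the notebook, the same helper could be called as `require_artifact(ap_rtrval.lookup_dataset, "cal_housing_dataset")`, assuming the lookup behaves as described in the note above.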

In [None]:
dsinfo = ap_rtrval.lookup_dataset("a_non_existent_dataset")
dsinfo is None

## Get the JSON Representation of TFX Artifacts

In [None]:
retrieved_stats = dataset["encoded_stats"]
retrieved_schema = dataset["encoded_schema"]
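
The retrieved values are plain JSON strings, so they can be inspected with the standard library before they are converted back to TFX objects. The sketch below assumes the encoded statistics follow the layout of the `DatasetFeatureStatisticsList` message: a top-level `datasets` list whose entries each carry a `features` list. `summarize_encoded_stats` is a hypothetical helper, and the sample string is hand-written for illustration rather than taken from real output.

```python
import json


def summarize_encoded_stats(encoded_stats):
    """Return (dataset name, feature count) pairs from an encoded statistics string."""
    doc = json.loads(encoded_stats)
    return [
        (ds.get("name", ""), len(ds.get("features", [])))
        for ds in doc.get("datasets", [])
    ]


# Hypothetical, hand-written sample; real MessageToJson output is much larger.
sample = json.dumps(
    {"datasets": [{"name": "train", "features": [{"type": "FLOAT"}, {"type": "FLOAT"}]}]}
)
summary = summarize_encoded_stats(sample)
```

Calling `summarize_encoded_stats(retrieved_stats)` in the notebook would give a quick sanity check that the stored document contains the expected number of features.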

## Convert JSON Representation to TFX Objects

In [None]:
remat_stats = json_format.Parse(retrieved_stats, statistics_pb2.DatasetFeatureStatisticsList())
remat_schema = json_format.Parse(retrieved_schema, schema_pb2.Schema())

## Use TFX Objects

In [None]:
tfdv.visualize_statistics(remat_stats)