# Arangopipe TFX Metadata Integration

This notebook shows how to store TFX artifacts in Arangopipe. Each artifact is a protocol buffer message, so it is first serialized to a JSON representation with protobuf's `json_format` module and then stored as part of the dataset metadata. To use a stored artifact in TFX again, the JSON representation is parsed back into the corresponding TFX object, which can then be passed to TFX components and libraries. The example applies this process to the summary statistics and inferred schema of the California housing dataset. The steps are shown below.

## Read the Data

In [1]:
import tensorflow_data_validation as tfdv
import os
DATA_DIR = "./"
TRAIN_DATA = os.path.join(DATA_DIR, 'cal_housing.csv')





In [2]:
pwd

'/workspace/experiments/tests/TFX'

## Calculate the Statistics

In [3]:
train_stats = tfdv.generate_statistics_from_csv(TRAIN_DATA, delimiter=',')


## Visualize the Statistics 

In [4]:
tfdv.visualize_statistics(train_stats)

## Infer the Schema

In [5]:
schema = tfdv.infer_schema(train_stats)

## Use Arangopipe for Metadata Storage

In [8]:
from arangopipe.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_config import ArangoPipeConfig
conn_config = ArangoPipeConfig()
conn_config.set_dbconnection(hostname="localhost", port=8529,
                             root_user="root", root_user_password="open sesame")

In [9]:
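import pandas as pd
# Create the admin and client handles from the connection configuration.
admin = ArangoPipeAdmin(conn_config)
ap = ArangoPipe(conn_config)
# Load the CSV with pandas.
fp = "cal_housing.csv"
df = pd.read_csv(fp)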

I0722 14:31:03.796445 140640373532480 arangopipe_api.py:177] Arango Pipe ML Graph initialized


## Create JSON Artifact Representation

In [10]:
from google.protobuf import json_format
enc_stats = json_format.MessageToJson(train_stats)
enc_schema = json_format.MessageToJson(schema)

In [11]:
from tensorflow_metadata.proto.v0 import statistics_pb2
from tensorflow_metadata.proto.v0 import schema_pb2

## Store Artifacts in Arangopipe

In [12]:
data = pd.read_csv(fp)
# Attach the JSON-encoded statistics and schema to the dataset metadata.
ds_info = {"name": "cal_housing_dataset",
           "description": "data about housing in California",
           "encoded_stats": enc_stats,
           "encoded_schema": enc_schema,
           "source": "UCI ML Repository"}
ds_reg = ap.register_dataset(ds_info)
# Describe the featureset by column name and dtype, then link it to the dataset.
featureset = data.dtypes.to_dict()
featureset = {k: str(featureset[k]) for k in featureset}
featureset["name"] = "cal_housing_no_transformations"
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])

I0722 14:31:10.011958 140640373532480 arangopipe_api.py:218] Recording dataset dataset link {'_id': 'datasets/30138', '_key': '30138', '_rev': '_Z_mV35m---'}
I0722 14:31:10.016732 140640373532480 arangopipe_api.py:228] Recording featureset {'_id': 'featuresets/30141', '_key': '30141', '_rev': '_Z_mV356---'}
I0722 14:31:10.021649 140640373532480 arangopipe_api.py:236] Recording featureset dataset link {'_id': 'featureset_dataset/30141-30138', '_key': '30141-30138', '_rev': '_Z_mV36O---'}
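
Because the encoded statistics and schema travel as plain strings inside `ds_info`, it can help to confirm that they are well-formed JSON before registering the dataset. The following is a minimal sketch using only the standard library. `build_dataset_info` is a hypothetical helper, not part of Arangopipe, and the sample strings are placeholders rather than real TFDV output.

```python
import json


def build_dataset_info(name, description, source, encoded_stats, encoded_schema):
    """Assemble a dataset metadata dict after checking the encoded payloads.

    Hypothetical helper for illustration; it is not part of Arangopipe.
    """
    for label, payload in (("encoded_stats", encoded_stats),
                           ("encoded_schema", encoded_schema)):
        try:
            json.loads(payload)  # raises if the payload is not valid JSON
        except (TypeError, ValueError) as exc:
            raise ValueError(f"{label} is not a valid JSON string") from exc
    return {
        "name": name,
        "description": description,
        "source": source,
        "encoded_stats": encoded_stats,
        "encoded_schema": encoded_schema,
    }


# Placeholder payloads standing in for enc_stats and enc_schema.
ds_info_example = build_dataset_info(
    name="example_dataset",
    description="placeholder description",
    source="placeholder source",
    encoded_stats='{"datasets": []}',
    encoded_schema='{"feature": []}',
)
```

In the notebook, a dict built this way would be passed to `ap.register_dataset` in the same way as `ds_info`.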


## Retrieve Stored Artifacts from Arangopipe

In [13]:
dataset = ap.lookup_dataset("cal_housing_dataset")

## Get the JSON Representation of TFX Artifacts

In [14]:
retrieved_stats = dataset["encoded_stats"]
retrieved_schema = dataset["encoded_schema"]
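
The retrieved values are ordinary JSON strings, so they can be inspected with the standard library before being converted back to protocol buffers. The sketch below assumes a small hand-written string in place of `retrieved_stats`; the field names in it are illustrative only, and `top_level_summary` is a hypothetical helper.

```python
import json


def top_level_summary(encoded):
    """Return each top-level key with the type name of its decoded value."""
    decoded = json.loads(encoded)
    return {key: type(value).__name__ for key, value in decoded.items()}


# Hand-written stand-in for a stored artifact; not real TFDV output.
sample_encoded = '{"datasets": [{"name": "example", "features": []}]}'
summary = top_level_summary(sample_encoded)
print(summary)  # prints {'datasets': 'list'}
```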

## Convert JSON Representation to TFX Objects

In [15]:
remat_stats = json_format.Parse(retrieved_stats, statistics_pb2.DatasetFeatureStatisticsList())
remat_schema = json_format.Parse(retrieved_schema, schema_pb2.Schema())
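
One way to check that the JSON round trip preserved a message is to compare the parsed copy with the original. The sketch below uses protobuf's built-in `Struct` message as a stand-in so that it runs without TFDV installed; the field values are made up for illustration. In this notebook the analogous comparison would be `remat_schema == schema`, assuming both objects are still in scope.

```python
from google.protobuf import json_format
from google.protobuf import struct_pb2

# Build a small message; Struct is used only as a stand-in for a TFX artifact.
original = struct_pb2.Struct()
original.update({"dataset": "example", "rows": 3})

# Serialize to JSON, as done before storing the artifact.
encoded = json_format.MessageToJson(original)

# Parse the JSON back into a fresh message of the same type.
restored = json_format.Parse(encoded, struct_pb2.Struct())

# Protobuf messages support equality comparison field by field.
round_trip_ok = restored == original
```

Equality can fail for messages that contain NaN values, since NaN never compares equal to itself, so a mismatch on statistics is worth inspecting before treating it as data loss.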

## Use TFX Objects

In [16]:
tfdv.visualize_statistics(remat_stats)