## Data Validation in Feast

Feast lets users define a **schema** that describes the value, shape, and presence constraints
of the features they ingest. This schema is compatible with the schema defined in TensorFlow
Metadata.

See https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto.

This means that you can import an existing TensorFlow Metadata schema into Feast, and Feast can
then check that ingested features conform to it. In Feast v0.5, however, only feature
value domains and presence are validated during ingestion.

For more information about TensorFlow Data Validation, see:
- https://www.tensorflow.org/tfx/data_validation/get_started
- https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb

### 1. Importing a TensorFlow Metadata schema into Feast

In [1]:
from feast import Client, FeatureSet
import tensorflow_data_validation as tfdv
from google.protobuf import text_format
import pandas as pd

In [2]:
%%bash
# Sample data from BigQuery public dataset: bikeshare stations
# https://cloud.google.com/bigquery/public-data
wget https://raw.githubusercontent.com/davidheryanto/feast/update-ingestion-metrics-for-validation/examples/data_validation/bikeshare_stations.csv
ls *.csv

bikeshare_stations.csv



In [3]:
pd.read_csv("bikeshare_stations.csv").head(3)

| Unnamed: 0 | station_id | name | status | latitude | longitude | location |
|---|---|---|---|---|---|---|
| 0 | 3793 | Rio Grande & 28th | active | 30.29333 | -97.74412 | (30.29333, -97.74412) |
| 1 | 3291 | 11th & San Jacinto | active | 30.27193 | -97.73854 | (30.27193, -97.73854) |
| 2 | 4058 | Hollow Creek & Barton Hills | active | 30.26139 | -97.77234 | (30.26139, -97.77234) |


In [4]:
%%bash
cat <<EOF > bikeshare_stations_feature_set.yaml

spec:
  name: bikeshare_stations
  entities:
  - name: station_id
    valueType: INT64
  features:
  - name: name
    valueType: STRING
  - name: status
    valueType: STRING
  - name: latitude
    valueType: FLOAT
  - name: longitude
    valueType: FLOAT
  - name: location
    valueType: STRING
  maxAge: 3600s

EOF

In [5]:
# Create a FeatureSet bikeshare_stations
bikeshare_stations_feature_set = FeatureSet.from_yaml("bikeshare_stations_feature_set.yaml")
print(bikeshare_stations_feature_set)

{
  "spec": {
    "name": "bikeshare_stations",
    "entities": [
      {
        "name": "station_id",
        "valueType": "INT64"
      }
    ],
    "features": [
      {
        "name": "name",
        "valueType": "STRING"
      },
      {
        "name": "status",
        "valueType": "STRING"
      },
      {
        "name": "latitude",
        "valueType": "FLOAT"
      },
      {
        "name": "longitude",
        "valueType": "FLOAT"
      },
      {
        "name": "location",
        "valueType": "STRING"
      }
    ],
    "maxAge": "3600s"
  },
  "meta": {
    "createdTimestamp": "1970-01-01T00:00:00Z"
  }
}


In [6]:
# Use TensorFlow Data Validation (tfdv) to infer a schema from the CSV
train_stats = tfdv.generate_statistics_from_csv(data_location="bikeshare_stations.csv")
schema = tfdv.infer_schema(statistics=train_stats, max_string_domain_size=10)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'station_id' | INT | required | | - |
| 'name' | BYTES | required | | - |
| 'status' | STRING | required | | 'status' |
| 'latitude' | FLOAT | required | | - |
| 'longitude' | FLOAT | required | | - |
| 'location' | BYTES | required | | - |


| Domain | Values |
|---|---|
| 'status' | 'active', 'closed' |


In [7]:
# Import the schema into the FeatureSet
bikeshare_stations_feature_set.import_tfx_schema(schema)
print(bikeshare_stations_feature_set)

{
  "spec": {
    "name": "bikeshare_stations",
    "entities": [
      {
        "name": "station_id",
        "valueType": "INT64",
        "presence": {
          "minFraction": 1.0,
          "minCount": "1"
        },
        "shape": {
          "dim": [
            {
              "size": "1"
            }
          ]
        }
      }
    ],
    "features": [
      {
        "name": "name",
        "valueType": "STRING",
        "presence": {
          "minFraction": 1.0,
          "minCount": "1"
        },
        "shape": {
          "dim": [
            {
              "size": "1"
            }
          ]
        }
      },
      {
        "name": "status",
        "valueType": "STRING",
        "presence": {
          "minFraction": 1.0,
          "minCount": "1"
        },
        "shape": {
          "dim": [
            {
              "size": "1"
            }
          ]
        },
        "stringDomain": {
          "name": "status",
          "value": [
           

Now that the FeatureSet has imported the schema, Feast exports Prometheus metrics during ingestion.
These metrics can be used to check whether the ingested features satisfy the schema's constraints.
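To make the imported constraints concrete, the following is a minimal pure-Python sketch of what presence (`minFraction`, `minCount`) and string-domain checks mean for a batch of rows. It is not Feast's ingestion code: the function `check_feature` and the sample rows are hypothetical and serve only to illustrate the semantics of the constraints shown in the schema above.

```python
from typing import Any, Dict, List, Optional


def check_feature(
    rows: List[Dict[str, Any]],
    name: str,
    min_fraction: float,
    min_count: int,
    string_domain: Optional[List[str]] = None,
) -> List[str]:
    """Return human-readable constraint violations for one feature.

    Hypothetical helper for illustration; not part of the Feast API.
    """
    # A value counts as present when the key exists and is not None.
    present = [row[name] for row in rows if row.get(name) is not None]
    fraction = len(present) / len(rows) if rows else 0.0

    violations = []
    if fraction < min_fraction:
        violations.append(
            f"{name}: present in {fraction:.2f} of rows, below minFraction {min_fraction}"
        )
    if len(present) < min_count:
        violations.append(
            f"{name}: present in {len(present)} rows, below minCount {min_count}"
        )
    if string_domain is not None:
        # Report each distinct out-of-domain value once.
        unexpected = sorted({v for v in present if v not in string_domain})
        if unexpected:
            violations.append(f"{name}: values outside domain: {unexpected}")
    return violations


# Made-up rows: one value outside the 'status' domain, one missing value.
rows = [
    {"station_id": 3793, "status": "active"},
    {"station_id": 3291, "status": "closed"},
    {"station_id": 4058, "status": "moved"},
    {"station_id": 1001, "status": None},
]

violations = check_feature(
    rows, "status", min_fraction=1.0, min_count=1, string_domain=["active", "closed"]
)
for violation in violations:
    print(violation)
```

With `minFraction` set to 1.0, a single missing value is enough to violate the presence constraint, which is why the inferred schema marks every column of the sample CSV as `required`.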

### 2. Exporting a TensorFlow Metadata schema from Feast

This scenario applies to users who have created a FeatureSet and used Feast to ingest features. During training,
they want to run batch validation with TensorFlow Data Validation. Rather than recreating the
schema from scratch, they can export the existing one from the FeatureSet.

This ensures that the schema applied during Feast ingestion is consistent with the one used for
batch validation with TensorFlow Data Validation.

In [8]:
exported_tfx_schema = bikeshare_stations_feature_set.export_tfx_schema()
tfdv.display_schema(exported_tfx_schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'name' | BYTES | required | | - |
| 'status' | STRING | required | | 'status' |
| 'latitude' | FLOAT | required | | - |
| 'longitude' | FLOAT | required | | - |
| 'location' | BYTES | required | | - |
| 'station_id' | INT | required | | - |


| Domain | Values |
|---|---|
| 'status' | 'active', 'closed' |
