# Historical Feature Statistics with Feast, TFDV and Facets

This tutorial covers how Feast can be used in conjunction with TFDV and Facets to retrieve statistics about feature datasets. 

The notebook showcases how Feast's integration with TFDV allows users to:

1. Define TFX feature schemas and persist these properties in the Feature Store
2. Validate new data against the defined schema
3. Validate data already in Feast against the defined schema

**Prerequisites**:

- Feast running with at least one BigQuery warehouse store. This example uses a BigQuery store named `historical`.

In [1]:
import pandas as pd
import pytest
import pytz
import uuid
import time
from datetime import datetime, timedelta

from feast.client import Client
from feast.entity import Entity
from feast.feature import Feature
from feast.feature_set import FeatureSet
from feast.type_map import ValueType
from google.protobuf import json_format
from google.protobuf.duration_pb2 import Duration
from tensorflow_metadata.proto.v0 import statistics_pb2
from tensorflow_metadata.proto.v0 import schema_pb2
import tensorflow_data_validation as tfdv

PROJECT_NAME = "statistics"
IRIS_DATASET = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
BIGQUERY_STORE_NAME = "historical"
client = Client(core_url="localhost:6565")
print(f"setting project to {PROJECT_NAME}...")
client.set_project(PROJECT_NAME)

setting project to statistics...


This example uses the Iris dataset. More information about this dataset can be found [here](http://archive.ics.uci.edu/ml/datasets/iris).

In [2]:
iris_feature_names = ["sepal_length","sepal_width","petal_length","petal_width"]
df = pd.read_csv(IRIS_DATASET, names=iris_feature_names + ["class"])

# Add datetime to satisfy Feast
current_datetime = datetime.utcnow().replace(tzinfo=pytz.utc)
df['datetime'] = current_datetime - timedelta(days=1)

df.head()

| | sepal_length | sepal_width | petal_length | petal_width | class | datetime |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 2020-05-25 07:31:28.230582+00:00 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 2020-05-25 07:31:28.230582+00:00 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 2020-05-25 07:31:28.230582+00:00 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 2020-05-25 07:31:28.230582+00:00 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 2020-05-25 07:31:28.230582+00:00 |


## TFDV schema as part of the feature set definition

Feature [schemas](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto) are an integral part of TFDV. They describe the expected properties of the data in a dataset, such as:
- expected feature presence
- type
- expected domains of features

These schemas, which can be [manually defined or generated by TFDV](https://www.tensorflow.org/tfx/data_validation/get_started#inferring_a_schema_over_the_data), can then be used to extend the definition of features within the feature set. As part of the spec, the schema is persisted within Feast and is used both for in-flight data validation and for offline integration with TFDV.
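
To make the three schema properties listed above (presence, type, domain) more concrete, here is a minimal sketch that uses only the Python standard library. It is not TFDV's schema proto or its validation logic: `FeatureSpec` and `check_column` are hypothetical names introduced for illustration, and the checks are deliberately simplified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeatureSpec:
    """Hypothetical, simplified stand-in for one feature entry in a schema."""
    name: str
    dtype: type                        # expected type
    min_fraction: float = 1.0          # expected presence: share of non-missing values
    min_value: Optional[float] = None  # expected domain: lower bound, if any


def check_column(spec: FeatureSpec, values: List[object]) -> List[str]:
    """Return human-readable problems found in one column of values."""
    problems = []
    present = [v for v in values if v is not None]

    # Presence: enough rows must carry a value.
    fraction = len(present) / len(values) if values else 0.0
    if fraction < spec.min_fraction:
        problems.append(f"{spec.name}: present in {fraction:.0%} of rows")

    # Type: every present value must match the expected type.
    if any(not isinstance(v, spec.dtype) for v in present):
        problems.append(f"{spec.name}: unexpected type")
    # Domain: values must not fall below the lower bound (checked only if types are fine).
    elif spec.min_value is not None and any(v < spec.min_value for v in present):
        problems.append(f"{spec.name}: value below {spec.min_value}")

    return problems


petal_width = FeatureSpec("petal_width", float, min_fraction=1.0, min_value=0.0)
print(check_column(petal_width, [0.2, 0.4, 1.8]))    # clean column
print(check_column(petal_width, [0.1, None, -1.0]))  # missing value and negative width
```

In the notebook itself these properties live in the TFX schema, so the sketch only illustrates the kind of constraint each property expresses.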


In [3]:
# Infer a schema over the iris dataset. These values can be tweaked as necessary.
stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=stats)
width_domain = schema_pb2.FloatDomain(min=0)
tfdv.set_domain(schema, 'petal_width', width_domain)

# Create a new FeatureSet or retrieve an existing FeatureSet in Feast
feature_set = FeatureSet(name="iris")
feature_set.infer_fields_from_df(df[['datetime'] + iris_feature_names], 
                        entities=[Entity(name="class", dtype=ValueType.STRING)])

# Update the entities and features with constraints defined in the schema
feature_set.import_tfx_schema(schema)
print(feature_set)

Entity class(ValueType.STRING) manually updated (replacing an existing field).
Feature sepal_length (ValueType.DOUBLE) added from dataframe.
Feature sepal_width (ValueType.DOUBLE) added from dataframe.
Feature petal_length (ValueType.DOUBLE) added from dataframe.
Feature petal_width (ValueType.DOUBLE) added from dataframe.

{
  "spec": {
    "name": "iris",
    "entities": [
      {
        "name": "class",
        "valueType": "STRING"
      }
    ],
    "features": [
      {
        "name": "sepal_length",
        "valueType": "DOUBLE",
        "presence": {
          "minFraction": 1.0,
          "minCount": "1"
        },
        "shape": {
          "dim": [
            {
              "size": "1"
            }
          ]
        }
      },
      {
        "name": "sepal_width",
        "valueType": "DOUBLE",
        "presence": {
          "minFraction": 1.0,
          "minCount": "1"
        },
        "shape": {
          "dim": [
            {
              "size": "1"
      

## Computing statistics over an ingested dataset

Feast can compute statistics for any data that has been ingested into the system. Statistics can be computed either over discrete datasets, identified by their `ingestion_ids`, or over a specified time range.

These statistics are computed at a historical store (caveat: only BigQuery is supported at the moment). The feature statistics are returned in the form of TFX's `DatasetFeatureStatisticsList`, which can be fed directly back into TFDV methods to either visualise the data statistics or validate the dataset.
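
The two ways of selecting data, by ingestion ID or by time range, can be sketched in plain Python as follows. This is only an illustration of the idea, not how Feast computes statistics in BigQuery: the in-memory `ROWS`, the helper names, and the choice of an inclusive start and exclusive end for the time range are all assumptions.

```python
from datetime import datetime, timezone
from statistics import mean

# Hypothetical in-memory rows standing in for data held in the warehouse store.
ROWS = [
    {"ingestion_id": "a", "ts": datetime(2020, 5, 25, tzinfo=timezone.utc), "petal_width": 0.5},
    {"ingestion_id": "a", "ts": datetime(2020, 5, 25, tzinfo=timezone.utc), "petal_width": 1.5},
    {"ingestion_id": "b", "ts": datetime(2020, 5, 26, tzinfo=timezone.utc), "petal_width": -1.0},
]


def by_ingestion_ids(rows, ingestion_ids):
    """Select the rows that belong to the given discrete ingestions."""
    wanted = set(ingestion_ids)
    return [r for r in rows if r["ingestion_id"] in wanted]


def by_time_range(rows, start, end):
    """Select rows with start <= ts < end (the boundary handling is an assumption)."""
    return [r for r in rows if start <= r["ts"] < end]


def summarise(rows, feature):
    """Compute a few basic numeric statistics for one feature."""
    values = [r[feature] for r in rows]
    return {"count": len(values), "min": min(values), "max": max(values), "mean": mean(values)}


print(summarise(by_ingestion_ids(ROWS, ["a"]), "petal_width"))
print(summarise(
    by_time_range(
        ROWS,
        datetime(2020, 5, 26, tzinfo=timezone.utc),
        datetime(2020, 5, 27, tzinfo=timezone.utc),
    ),
    "petal_width",
))
```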

In [4]:
# Apply the featureset
client.apply(feature_set)

# When a dataset is ingested into Feast, a unique ingestion id referencing the ingested dataset is returned. 
ingestion_id = client.ingest(feature_set, df)
print("\ningestion id: " + ingestion_id)

Feature set created: "iris"
Waiting for feature set to be ready for ingestion...


100%|██████████| 150/150 [00:01<00:00, 122.33rows/s]

Ingestion complete!

Ingestion statistics:
Success: 150/150
Removing temporary file(s)...

ingestion id: 73ed84b1-1218-3702-b4c6-673503233264





In [6]:
# Get statistics from Feast for the ingested dataset.
# The statistics are calculated over the data in the store specified.
stats = client.get_statistics(
    feature_set_id='iris', 
    store=BIGQUERY_STORE_NAME, 
    features=iris_feature_names, 
    ingestion_ids=[ingestion_id])

## Visualising statistics with Facets

Because Feast outputs statistics in a format compatible with the TFDV API, the `stats` object can be passed directly to `tfdv.visualize_statistics()`. This renders the statistics in-line with [Facets](https://pair-code.github.io/facets/), allowing interactive exploration of the shape and distribution of the data inside Feast.

In [7]:
tfdv.visualize_statistics(stats)

## Validating correctness of subsequent datasets

Exploring dataset statistics with Facets is useful, but we have already defined a schema that specifies the bounds of correctness for a dataset. We can therefore use TFDV's `validate_statistics` to check whether subsequent datasets are problematic.

It is possible to validate correctness of a new dataset prior to ingestion by retrieving the schema from the feature set, and comparing computed statistics against that schema. 

This can be useful if we want to avoid ingesting problematic data into Feast.
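
The "validate first, ingest only if clean" pattern can be sketched with the standard library alone. Everything here is a simplified assumption: `LOWER_BOUNDS` stands in for the domains stored in the schema, `find_anomalies` for the validation step, and `ingest` is a hypothetical callable standing in for the real ingestion call.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

# Hypothetical lower bounds per feature, standing in for domains in the schema.
LOWER_BOUNDS: Dict[str, float] = {"petal_width": 0.0}


def find_anomalies(rows: List[Row], lower_bounds: Dict[str, float]) -> Dict[str, str]:
    """Map each offending feature to a short description of the problem."""
    anomalies = {}
    for feature, bound in lower_bounds.items():
        lowest = min(row[feature] for row in rows)
        if lowest < bound:
            anomalies[feature] = f"lowest value {lowest} is below {bound}"
    return anomalies


def ingest_if_valid(rows: List[Row], ingest: Callable[[List[Row]], None]) -> bool:
    """Call `ingest` only when validation finds no anomalies."""
    anomalies = find_anomalies(rows, LOWER_BOUNDS)
    if anomalies:
        for feature, description in anomalies.items():
            print(f"rejected: {feature}: {description}")
        return False
    ingest(rows)
    return True


ingested: List[Row] = []  # stands in for the feature store
print(ingest_if_valid([{"petal_width": 0.1}], ingested.extend))   # clean data is ingested
print(ingest_if_valid([{"petal_width": -1.0}], ingested.extend))  # bad data is rejected
```

The gate keeps rejected rows out of the store entirely, at the cost of computing the statistics on the client before ingestion.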

In [8]:
# Create a new dataset with obviously incorrect data
df_2 = pd.DataFrame(
    {
        "datetime": current_datetime,
        "class": ["Iris-setosa", "Iris-virginica", "Iris-nonsensica"],
        "sepal_length": [4.3, 6.9, 12],
        "sepal_width": [3.0, 2.8, 1.1],
        "petal_length": [1.2, 4.9, 2.2],
        "petal_width": [0.1, 1.8, -1.0]
    }
)

# Validate correctness
stats_2 = tfdv.generate_statistics_from_dataframe(df_2)
anomalies = tfdv.validate_statistics(statistics=stats_2, schema=feature_set.export_tfx_schema())
tfdv.display_anomalies(anomalies)



| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'petal_width' | Out-of-range values | Unexpectedly low values: -1<-1(upto six significant digits) |
| 'class' | New column | New column (column in data but not in schema) |


Alternatively, the data can be ingested into Feast, and the statistics computed at the store. This has the benefit of offloading statistics computation for large datasets to Feast.

In [9]:
# Ingest the data into Feast
ingestion_id_2 = client.ingest(feature_set, df_2)
time.sleep(10)  # Wait for ingestion to finish; only needed when using DirectRunner

# Compute statistics over the new dataset
stats_2 = client.get_statistics(
    feature_set_id='iris', 
    store=BIGQUERY_STORE_NAME, 
    features=iris_feature_names, 
    ingestion_ids=[ingestion_id_2])

# Detect anomalies in the dataset
anomalies = tfdv.validate_statistics(statistics=stats_2, schema=feature_set.export_tfx_schema())
tfdv.display_anomalies(anomalies)

  0%|          | 0/3 [00:00<?, ?rows/s]

Waiting for feature set to be ready for ingestion...


100%|██████████| 3/3 [00:01<00:00,  2.85rows/s]


Ingestion complete!

Ingestion statistics:
Success: 3/3
Removing temporary file(s)...


| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'petal_width' | Out-of-range values | Unexpectedly low values: -1<-1(upto six significant digits) |
