# Data Validation using TFX

## Task & Data
In this notebook, we validate our **Booking** data. The goal is to predict the number of reservations made, which is a regression problem. The target column is **bookings**.

Data validation is the first step of the machine learning pipeline: we get an overview of the data fields and define a schema.
We use the TensorFlow Data Validation library (TFDV) for this, as it can:
* Generate statistics about the data (types, missing values, distributions) and visualize them.
* Infer a schema from the training file.
* Compare statistics between the reference data (the training data in our case) and newly arriving data (the test data in our case).

In [2]:
"""Compute stats, infer schema, and validate stats for booking data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
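# tf.compat.v1.logging is available in TF 1.13+ and TF 2.x, unlike the TF1-only tf.logging
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)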

import tensorflow_data_validation as tfdv

from google.protobuf import text_format
from tensorflow.python.lib.io import file_io  # pylint: disable=g-direct-tensorflow-import

import os

In [3]:
# Define path for input
DATA_DIR = "../data"
train_data = os.path.join(DATA_DIR, 'train/train.csv')
test_data = os.path.join(DATA_DIR, 'test/test.csv')
# Define output dir
OUTPUT_DIR = "../data/tfdv_output"
file_io.recursive_create_dir(OUTPUT_DIR)

## Compute and visualize statistics

We'll start by computing statistics for our training data.
TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

In [3]:
# Generate statistics from training data
train_stats = tfdv.generate_statistics_from_csv(train_data, delimiter=',')
# Visualize stats
tfdv.visualize_statistics(train_stats)



Among the findings:
* Some numeric features have a **high percentage of missing values** (shown in red in the "missing" column): "country_hotel" (92%), "airport_hotel" (95%), "senior_hotel" (99%), "club_club_hotel" (98%), "echo_friendly_hotel" (96%)... We will most likely drop these columns (we can choose a threshold on the percentage of missing values above which a column is dropped).
* For numeric features without a very high percentage of missing values, we may simply drop the corresponding records.
* The target column, **bookings**, has a very high percentage of zero values (78%). We may need to deal with that imbalance.
* The feature "poi_image" is **100% zeros**, so we don't need this column.
* Among the categorical features, **"market" has only one unique value**: "IM". So we'll drop this column as well.
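
The column-dropping rules in the list above can be sketched without any extra dependency. This is a minimal illustration, not the notebook's actual preprocessing: the sample rows, the `MISSING_THRESHOLD` value and the `columns_to_drop` helper are all assumptions made up for the example; only the column names come from the findings.

```python
import csv
import io

# Tiny invented sample standing in for train.csv; values are for illustration only.
SAMPLE_CSV = """hotel_id,market,poi_image,senior_hotel,clicks
1,IM,0,,10
2,IM,0,,3
3,IM,0,1,
4,IM,0,,7
"""

MISSING_THRESHOLD = 0.5  # hypothetical cut-off; tune it for the real data


def columns_to_drop(rows, missing_threshold):
    """Return {column: reason} for columns that are mostly missing or constant."""
    reasons = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v != ""]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > missing_threshold:
            reasons[col] = "missing"
        elif len(set(present)) <= 1:
            # A single distinct value carries no information for the model.
            reasons[col] = "constant"
    return reasons


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
drop = columns_to_drop(rows, MISSING_THRESHOLD)
print(drop)
```

On the real data the same rule would be applied to the full training file, with the threshold chosen after looking at the TFDV statistics.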


## Infer a schema

Now, we will create a schema for our data. A schema defines constraints for the data that are relevant for ML. 
Luckily, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

In [4]:
# Infer schema
schema = tfdv.infer_schema(train_stats)
# Display schema
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'id' | BYTES | required | | - |
| 'yyear' | INT | required | | - |
| 'week_of_year' | INT | required | | - |
| 'advertiser_id' | INT | required | | - |
| 'market' | STRING | required | | 'market' |
| 'hotel_id' | INT | required | | - |
| 'clicks' | INT | required | | - |
| 'cost' | INT | required | | - |
| 'bookings' | INT | required | | - |
| 'top_pos' | INT | required | | - |


| Domain | Values |
|---|---|
| 'market' | 'IM' |


## Check test data 

It's important that our test data is consistent with our training data, including that it uses the same schema. Comparing the two lets us check that numeric features cover the same range of values, and that categorical features in the test data do not contain values we never saw during training.

In [5]:
# Generate statistics from test data
test_stats = tfdv.generate_statistics_from_csv(test_data, delimiter=',')
# Compare test data with training data
tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats,
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

Given the comparison above, the train and test data are consistent:
* The target exists only in the train data and is absent from the test data, which is expected.
* The features with a very high percentage of missing values in the train data have a similarly high percentage in the test data. This supports the choice of dropping those columns.
* The features with a low percentage of missing values in the train data have no missing values in the test data. Thus, we should keep those features and handle the missing values in the train data.
* For most numeric features, the maximum value in the train data is higher than the one in the test data, so the test data does not exceed the upper range seen during training. That should not cause any issue.
* The categorical feature "market", which has only one value in the train data, has the same single value in the test data. Thus, dropping that column was the right decision.
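
Two of the checks in the list above (unseen categorical values and numeric ranges) can also be approximated by hand. The sketch below is only an illustration: the helper names `unseen_categories` and `outside_train_range` and the sample values are hypothetical, and TFDV's own comparison is richer than this.

```python
def unseen_categories(train_values, test_values):
    """Categories that occur in the test data but never in the train data."""
    return sorted(set(test_values) - set(train_values))


def outside_train_range(train_values, test_values):
    """Test values below the train minimum or above the train maximum."""
    low, high = min(train_values), max(train_values)
    return [v for v in test_values if v < low or v > high]


# Invented values for illustration; only the column names come from the notebook.
train = {"market": ["IM", "IM", "IM"], "clicks": [0, 12, 40]}
test = {"market": ["IM", "IM"], "clicks": [3, 25]}

new_markets = unseen_categories(train["market"], test["market"])
out_of_range_clicks = outside_train_range(train["clicks"], test["clicks"])
print(new_markets, out_of_range_clicks)
```

Both lists are empty for this sample, which mirrors the "consistent" verdict; a non-empty result would point at the feature to investigate.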

## Check for anomalies

In [6]:
# Check test data for errors by validating the test data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(test_stats, schema)
tfdv.display_anomalies(anomalies)


| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'lose' | Expected data of type: FLOAT but got INT | |
| 'beat' | Expected data of type: FLOAT but got INT | |
| 'bookings' | Column dropped | The feature was not present in any examples. |
| 'meet' | Expected data of type: FLOAT but got INT | |


### Fix anomalies related to data type

The features "lose", "beat" and "meet" contain INT values in the test data, where our schema expected FLOAT. By making us aware of that difference, TFDV helps uncover inconsistencies in the way the data is generated for training and serving.
In this case, we can safely convert INT values to FLOAT, so we tell TFDV to use our schema to infer the type. Let's do that now.

In [7]:
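# Take feature types from the schema instead of inferring them from the test values
options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)
# Reuse the test_data path defined earlier rather than a second hard-coded path
test_stats = tfdv.generate_statistics_from_csv(test_data, delimiter=',', stats_options=options)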
anomalies = tfdv.validate_statistics(test_stats, schema)
tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'bookings' | Column dropped | The feature was not present in any examples. |
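
A plausible cause of the INT/FLOAT mismatch, assuming column types are inferred from the values found in each CSV file, is that a column holding only whole numbers in the test file is read as INT even though the train file contained fractional values. The sketch below imitates that kind of value-based inference in plain Python; `infer_column_type` and the sample values are hypothetical, and this is not TFDV's actual implementation.

```python
def infer_column_type(values):
    """Infer 'INT', 'FLOAT' or 'STRING' from the raw string values of a column.

    Returns None when every value is missing.
    """
    inferred = None
    for value in values:
        if value == "":
            continue  # missing values do not influence the type
        try:
            int(value)
            kind = "INT"
        except ValueError:
            try:
                float(value)
                kind = "FLOAT"
            except ValueError:
                return "STRING"
        # Once a fractional value has been seen, the column stays FLOAT.
        if inferred != "FLOAT":
            inferred = kind
    return inferred


# Invented values: the same feature, as it might appear in each file.
train_meet = ["0", "0.5", "1"]
test_meet = ["0", "1", "1"]

train_type = infer_column_type(train_meet)
test_type = infer_column_type(test_meet)
print(train_type, test_type)
```

Under that assumption, taking the type from the schema (as done with `infer_type_from_schema=True`) removes the dependence on which values happen to be present in a given file.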


### Fix anomaly related to the target column -- different environments

Now we just have the "bookings" feature (which is our target) showing up as an anomaly ('Column dropped'). Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.

In [8]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Specify that the 'bookings' feature (our target) is not in the SERVING environment.
tfdv.get_feature(schema, 'bookings').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    test_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)
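
Conceptually, an environment restricts which features the validator expects to find. The toy model below illustrates the idea in plain Python; it is not how TFDV implements environments, and `toy_schema`, `missing_feature_anomalies` and the column set are assumptions made for the example.

```python
def missing_feature_anomalies(schema, present_features, environment):
    """List schema features expected in `environment` but absent from the data."""
    anomalies = []
    for name, spec in schema.items():
        if environment in spec.get("not_in_environment", ()):
            continue  # the feature is not expected in this environment
        if name not in present_features:
            anomalies.append(name)
    return anomalies


# Hypothetical three-feature schema; 'bookings' is excluded from SERVING.
toy_schema = {
    "clicks": {},
    "cost": {},
    "bookings": {"not_in_environment": ["SERVING"]},
}
test_columns = {"clicks", "cost"}  # the test data has no target column

serving_result = missing_feature_anomalies(toy_schema, test_columns, "SERVING")
training_result = missing_feature_anomalies(toy_schema, test_columns, "TRAINING")
print(serving_result, training_result)
```

The same data passes in the SERVING environment and fails in TRAINING, which is the behaviour we rely on when validating the test statistics with `environment='SERVING'`.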

## Freeze schema

Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state.

In [9]:
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
# Write schema
tfdv.write_schema_text(schema, schema_file)