# Data split & validation

## Abstract

This notebook analyses the data created by the StatisticsGen component of the pipeline.

The results of a pipeline run are stored in Google Cloud Storage. This notebook downloads them and then analyses them using TensorFlow Data Validation (TFDV).

More specifically, we will assess if there are any noticeable differences in the distributions of the variables used for modelling (features & target) across the different data sets (training, evaluation & test sets).

In a nutshell, this validation phase assesses whether the data at hand is representative, which is crucial for a trained model to generalize well to new data.

## Train/eval/test split

To mimic a real-world scenario of serving the fare pricing engine, we hold out all data from January 1st, 2021 onwards as __test data__.
<br>A hypothetical motivation behind such a holdout split could be that the business wants to retrain its production model every so often, meaning that the production model needs to run and be tested for a certain amount of time (in our case, up to May 2022).

Data collected from January 1st, 2020 to December 31st, 2020 is then used for __training & model evaluation__.
<br>More specifically, the __train/eval split__ is made in a standard fashion: randomly, with a ratio of 95% / 5%.
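
The split logic described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the function name `assign_split`, the constants and the sample dates are all hypothetical, and the real split is performed inside the pipeline.

```python
import random
from datetime import date

# Assumption taken from the text: everything from this date onwards is test data.
TEST_START = date(2021, 1, 1)
# Assumption taken from the text: 5% of the remaining data goes to evaluation.
EVAL_RATIO = 0.05


def assign_split(trip_date, rng):
    """Return 'test', 'eval' or 'train' for a single trip (hypothetical helper)."""
    if trip_date >= TEST_START:
        return "test"  # time-based holdout
    # Random 95% / 5% split for the data collected in 2020.
    return "eval" if rng.random() < EVAL_RATIO else "train"


rng = random.Random(42)  # fixed seed so the random split is reproducible
trips = [date(2020, 3, 14), date(2020, 11, 2), date(2021, 1, 1), date(2022, 5, 30)]
splits = [assign_split(d, rng) for d in trips]
```

The time-based holdout is deterministic, whereas the train/eval assignment depends on the random seed.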

In [1]:
%%bash

mkdir -p analysis_data
mkdir -p anomalies_data
mkdir -p schemas

In [2]:
!gsutil cp -r gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/StatisticsGen_-6421068191618826240/statistics analysis_data

Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/StatisticsGen_-6421068191618826240/statistics/Split-eval/FeatureStats.pb...
Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/StatisticsGen_-6421068191618826240/statistics/Split-test/FeatureStats.pb...
Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/StatisticsGen_-6421068191618826240/statistics/Split-train/FeatureStats.pb...
| [3 files][ 32.5 KiB/ 32.5 KiB]                                                
Operation completed over 3 objects/32.5 KiB.                                     


In [3]:
!gsutil cp -r gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/ExampleValidator_5108146854449643520/anomalies anomalies_data

Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/ExampleValidator_5108146854449643520/anomalies/Split-eval/SchemaDiff.pb...
- [1 files][  520.0 B/  520.0 B]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/ExampleValidator_5108146854449643520/anomalies/Split-test/SchemaDiff.pb...
Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/ExampleValidator_5108146854449643520/anomalies/Split-train/SchemaDiff.pb...
| [3 files][  1.5 KiB/  1.5 KiB]                                        

In [4]:
!gsutil cp -r gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/SchemaGen_2802303845235949568/schema schemas

Copying gs://aliz-ml-spec-2022/demo-1/pipeline_root/taxi-vertex-pipelines/53911330556/taxi-vertex-pipelines-20220608212029/SchemaGen_2802303845235949568/schema/schema.pbtxt...
/ [1 files][  1.8 KiB/  1.8 KiB]                                                
Operation completed over 1 objects/1.8 KiB.                                      


## Generate TFDV reports

### Importing the created data

In [5]:
import tensorflow_data_validation as tfdv

In [6]:
train_stats = tfdv.load_stats_binary('analysis_data/statistics/Split-train/FeatureStats.pb')
eval_stats = tfdv.load_stats_binary('analysis_data/statistics/Split-eval/FeatureStats.pb')
test_stats = tfdv.load_stats_binary('analysis_data/statistics/Split-test/FeatureStats.pb')

In [7]:
from tensorflow_metadata.proto.v0 import anomalies_pb2
from tfx.utils import io_utils

# Parse the serialized Anomalies proto written by ExampleValidator for the test split.
test_anomalies = anomalies_pb2.Anomalies()
anomalies_bytes = io_utils.read_bytes_file('anomalies_data/anomalies/Split-test/SchemaDiff.pb')
test_anomalies.ParseFromString(anomalies_bytes)

# Same for the eval split; ParseFromString returns the number of bytes read.
eval_anomalies = anomalies_pb2.Anomalies()
anomalies_bytes = io_utils.read_bytes_file('anomalies_data/anomalies/Split-eval/SchemaDiff.pb')
eval_anomalies.ParseFromString(anomalies_bytes)

520

In [8]:
schema = tfdv.load_schema_text('schemas/schema/schema.pbtxt')
tfdv.display_schema(schema)

Feature name,Type,Presence,Valency,Domain
'TripStartHour',INT,required,,-
'TripStartMinute',INT,required,,-
'TripStartMonth',INT,required,,-
'TripStartYear',INT,required,,-
'dropoff_census_tract',INT,required,,-
'fare',FLOAT,required,,-
'histOneWeek_tripDistance',FLOAT,required,,-
'histOneWeek_tripDuration',FLOAT,required,,-
'historical_tripDistance',FLOAT,required,,-
'historical_tripDuration',FLOAT,required,,-


### TFDV for train & eval sets

In [9]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=eval_stats, rhs_name='eval')

For the numeric features and the target variable, no discrepancy (drift, or schema/feature/distribution skew) is noticeable.
<br>As a remark, the red-highlighted zero ratios of `histOneWeek_tripDistance` & `histOneWeek_tripDuration` come from original `NULL` values that were filled in with zeros. Such `NULL` values are a by-product of the rolling-window feature engineering and are expected.
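
To illustrate why a rolling-window feature produces `NULL` values that later show up as zeros, here is a minimal sketch in plain Python. It assumes the feature is a trailing mean over previous observations; the helper `trailing_mean`, the window size and the sample values are hypothetical and are not the pipeline's actual feature engineering.

```python
def trailing_mean(values, window):
    """Mean of up to `window` values strictly before each position.

    Returns None when there is no history yet (the NULL case).
    """
    out = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        out.append(sum(history) / len(history) if history else None)
    return out


daily_distance = [2.0, 4.0, 6.0, 8.0]  # made-up trip distances
raw = trailing_mean(daily_distance, window=3)

# Filling the missing history with zeros is what inflates the zero ratio.
filled = [0.0 if v is None else v for v in raw]
```

The first position has no history, so it is the only one replaced by a zero here; on real data every row at the start of a window behaves the same way.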

The categorical features are already transformed at this stage. Their distributions in the two datasets are similar, and there is no drift between them.
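
The visual comparison of categorical distributions can be complemented with a number. A common choice, and to our knowledge the one TFDV uses for categorical drift and skew, is the L-infinity distance: the largest absolute difference in relative frequency over all categories. The sketch below is a hand-rolled illustration in plain Python with made-up values; `linf_distance` is a hypothetical helper, not a TFDV function.

```python
from collections import Counter


def linf_distance(left, right):
    """Largest absolute gap in relative frequency between two samples."""
    if not left or not right:
        raise ValueError("both samples must be non-empty")
    left_counts, right_counts = Counter(left), Counter(right)
    categories = set(left_counts) | set(right_counts)
    return max(
        abs(left_counts[c] / len(left) - right_counts[c] / len(right))
        for c in categories
    )


# Made-up category values for two splits.
train_values = ["a", "a", "b", "c"]
eval_values = ["a", "b", "b", "c"]
distance = linf_distance(train_values, eval_values)
```

A distance close to zero means the two splits have nearly identical category frequencies; what counts as "too large" is a threshold that has to be chosen per feature.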

In [10]:
tfdv.display_anomalies(eval_anomalies)

### TFDV for train & test sets

In [11]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=test_stats, rhs_name='test')

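When comparing the train and test sets, we can observe one discrepancy in `TripStartMonth`, which is due to how we split our dataset: the training data spans the twelve months of 2020, whereas the test data runs from January 2021 to May 2022, so the calendar months are not covered evenly.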
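
The effect of the date ranges on `TripStartMonth` can be reproduced with a small sketch. It counts calendar days per month number for each period, which implicitly assumes the same number of trips every day (a simplification). The exact end of the test period is also an assumption: the text only says "up to May 2022", so May 31st is used here. `month_counts` is a hypothetical helper.

```python
from collections import Counter
from datetime import date, timedelta


def month_counts(start, end):
    """Count calendar days per month number between start and end (inclusive)."""
    counts = Counter()
    day = start
    while day <= end:
        counts[day.month] += 1
        day += timedelta(days=1)
    return counts


train_months = month_counts(date(2020, 1, 1), date(2020, 12, 31))
# Assumed end date: the source only states "up to May 2022".
test_months = month_counts(date(2021, 1, 1), date(2022, 5, 31))

# Relative share of January and June in each period.
train_jan_share = train_months[1] / sum(train_months.values())
test_jan_share = test_months[1] / sum(test_months.values())
train_jun_share = train_months[6] / sum(train_months.values())
test_jun_share = test_months[6] / sum(test_months.values())
```

Under these assumptions, January to May are counted twice in the test period while June to December are counted once, so the month histogram of the test set differs from that of the training set even if the underlying behaviour has not changed.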

In [12]:
tfdv.display_anomalies(test_anomalies)