# Data split & validation

## Abstract

This notebook analyses the data created by the StatisticsGen component of the pipeline.

The results of a pipeline run are stored in Google Cloud Storage. This notebook downloads them and then analyses them with TensorFlow Data Validation (TFDV).

More specifically, we will assess if there are any noticeable differences in the distributions of the variables used for modelling (features & target) across the different data sets (training, evaluation & test sets).

In a nutshell, this validation phase assesses whether the data at hand is representative, which is crucial for a trained model to generalize well to new data.

## Train/eval/test split

In order to mimic a real-world scenario of serving the fare pricing engine, we hold out all data from January 1st, 2021 onward as __test data__.
<br>A hypothetical motivation behind such a holdout split could be that the business wants to retrain its production model every so often, meaning that the production model needs to run and be tested for a certain amount of time (in our case, up to May 2022).

Data collected from January 1st, 2020 to December 31st, 2020 is then used for __training & model evaluation__.
<br>More specifically, the __train/eval split__ is made in a standard fashion: randomly, with a ratio of 95% / 5%.
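
The split logic described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the notebook does not show how the pipeline actually implements the split, and the function name `assign_split`, the seed, and the example dates are hypothetical. The cutoff dates and the 5% eval ratio come from the text.

```python
import random
from datetime import date

TRAIN_START = date(2020, 1, 1)  # first day used for training & evaluation
TEST_START = date(2021, 1, 1)   # first day held out as test data (assumed inclusive)
EVAL_RATIO = 0.05               # 95% / 5% random train/eval split


def assign_split(trip_date, rng):
    """Return 'train', 'eval' or 'test' for one example, or None if it predates the data window."""
    if trip_date >= TEST_START:
        return "test"
    if trip_date < TRAIN_START:
        return None
    # Within the 2020 window, the assignment is purely random.
    return "eval" if rng.random() < EVAL_RATIO else "train"


rng = random.Random(42)  # hypothetical seed, only for reproducibility of the sketch
example_dates = [date(2020, 3, 14), date(2020, 12, 31), date(2021, 1, 1), date(2022, 5, 2)]
splits = [assign_split(d, rng) for d in example_dates]
```

The time-based holdout keeps the test set strictly in the future relative to training, while the random draw keeps the train and eval sets identically distributed in expectation.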

In [66]:
!gsutil cp gs://mormota/pipeline_root/taxi-vertex-pipelines/639006805448/taxi-vertex-pipelines-20220608060724/ExampleValidator_1833185500421160960/anomalies/Split-test/SchemaDiff.pb analysis_data/test_anomalies.pb

Copying gs://mormota/pipeline_root/taxi-vertex-pipelines/639006805448/taxi-vertex-pipelines-20220608060724/ExampleValidator_1833185500421160960/anomalies/Split-test/SchemaDiff.pb...
/ [1 files][  685.0 B/  685.0 B]                                                
Operation completed over 1 objects/685.0 B.                                      


## Generate TFDV reports

### Importing the created data

In [59]:
import tensorflow_data_validation as tfdv

In [60]:
# Load the serialized StatisticsGen output of each split.
train_stats = tfdv.load_stats_binary('analysis_data/train.pb')
eval_stats = tfdv.load_stats_binary('analysis_data/eval.pb')
test_stats = tfdv.load_stats_binary('analysis_data/test.pb')



In [67]:
from tensorflow_metadata.proto.v0 import anomalies_pb2
from tfx.utils import io_utils

# Parse the serialized Anomalies proto written by ExampleValidator for the test split.
test_anomalies = anomalies_pb2.Anomalies()
anomalies_bytes = io_utils.read_bytes_file('analysis_data/test_anomalies.pb')
test_anomalies.ParseFromString(anomalies_bytes)

# Same for the eval split.
eval_anomalies = anomalies_pb2.Anomalies()
anomalies_bytes = io_utils.read_bytes_file('analysis_data/eval_anomalies.pb')
eval_anomalies.ParseFromString(anomalies_bytes)

685

In [56]:
schema = tfdv.load_schema_text('train_schema.pbtxt')
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'TripStartDay' | STRING | required | | 'TripStartDay' |
| 'TripStartHour' | INT | required | | - |
| 'TripStartMinute' | INT | required | | - |
| 'TripStartMonth' | INT | required | | - |
| 'TripStartYear' | INT | required | | - |
| 'dropoff_census_tract' | INT | required | | - |
| 'fare' | FLOAT | required | | - |
| 'histOneWeek_tripDistance' | FLOAT | required | | - |
| 'histOneWeek_tripDuration' | FLOAT | required | | - |
| 'historical_tripDistance' | FLOAT | required | | - |


| Domain | Values |
|---|---|
| 'TripStartDay' | 'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday' |


### TFDV for train & eval sets

In [62]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=eval_stats, rhs_name='eval')

For the numeric features & target variable, no discrepancy (drift or schema/feature/distribution skew) is noticeable.
<br>As a remark, the red-highlighted zero ratios of `histOneWeek_tripDistance` & `histOneWeek_tripDuration` are due to the original `NULL` values being filled in with zeros. Such `NULL` values are caused by the rolling-window feature engineering and are expected.
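
The comparison above is visual. To put a number on "no noticeable discrepancy" for a numeric feature, one option is the Jensen-Shannon divergence between the two histograms, a measure for which TFDV's skew and drift comparators also accept a threshold. The sketch below is a plain-Python illustration, not TFDV's implementation; the helper names, bin edges, and fare values are made up for the example.

```python
import math


def histogram(values, bin_edges):
    """Share of values per bin; bins are [lo, hi), the last bin also includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            upper_ok = v <= bin_edges[i + 1] if is_last else v < bin_edges[i + 1]
            if bin_edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    total = sum(counts)  # assumes at least one value falls inside the bin range
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2): 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


edges = [0, 10, 20, 30]                  # hypothetical fare bins
train_fares = [5.0, 12.0, 15.0, 25.0]    # made-up values
eval_fares = [6.0, 11.0, 14.0, 27.0]     # same shape as train
shifted_fares = [25.0, 26.0, 27.0, 30.0]  # clearly shifted distribution

same_js = js_divergence(histogram(train_fares, edges), histogram(eval_fares, edges))
shifted_js = js_divergence(histogram(train_fares, edges), histogram(shifted_fares, edges))
```

A value close to 0 supports the visual conclusion; which threshold counts as "drift" is a judgment call that depends on the feature.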

For the categorical features, no discrepancy is noticeable except for `pickup_census_tract` & `dropoff_census_tract`.
<br>In a nutshell, many locations present in the __eval set__ are never seen during training.
<br>This could cause a real drop in performance if the features `pickup_census_tract` & `dropoff_census_tract` turn out to be _important_ during the training phase.
<br>One solution would then be to replace the __census_tract__ fields with coarser location features such as the __community_area__ fields.
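
To quantify how many locations are missing from the training set, a small check along the following lines could help. It is a minimal sketch in plain Python; the function name `unseen_share` and the census tract values are invented for illustration and do not come from the pipeline data.

```python
def unseen_share(train_values, other_values):
    """Summarize the categories of `other_values` that never occur in `train_values`."""
    train_vocab = set(train_values)
    unseen_examples = [v for v in other_values if v not in train_vocab]
    unseen_categories = set(unseen_examples)
    return {
        "unseen_categories": sorted(unseen_categories),
        # Fraction of distinct categories in the other split that training never saw.
        "share_of_categories": len(unseen_categories) / len(set(other_values)),
        # Fraction of examples in the other split that carry an unseen category.
        "share_of_examples": len(unseen_examples) / len(other_values),
    }


# Made-up census tract identifiers, for illustration only.
train_tracts = [17031081500, 17031081500, 17031320100, 17031839100]
eval_tracts = [17031081500, 17031980000, 17031980000, 17031320100]

report = unseen_share(train_tracts, eval_tracts)
```

The share of examples matters more than the share of categories: many unseen tracts that each appear once affect the model less than a single unseen tract that dominates the split.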

However, at this stage, let's not make this replacement, since the __census_tract__ information is intertwined with the feature engineering of numeric features such as `historical_tripDistance`, `historical_tripDuration`, `histOneWeek_tripDistance` & `histOneWeek_tripDuration`, which are consistent across the train & eval sets.

The field `trip_id` is discarded as it is not a feature, but it still needs to be serialized by TFDV as the unique identifier of examples.

In [68]:
tfdv.display_anomalies(eval_anomalies)

### TFDV for train & test sets

In [63]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=test_stats, rhs_name='test')

For the numeric features & target variable, no discrepancy (drift or schema/feature/distribution skew) is noticeable.
<br>As a remark, the red-highlighted zero ratios of `histOneWeek_tripDistance` & `histOneWeek_tripDuration` are due to the original `NULL` values being filled in with zeros. Such `NULL` values are caused by the rolling-window feature engineering and are expected.
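
To see why a rolling-window feature naturally produces `NULL` values, consider the toy sketch below. It assumes the one-week feature is a mean over the trips of the preceding seven days, which is a guess: the actual feature engineering is not shown in this notebook, and the function names and trip values are hypothetical.

```python
from datetime import date, timedelta


def one_week_history_mean(trips, window_days=7):
    """For each (trip_date, distance), average the distances of earlier trips in the window.

    `trips` must be sorted by date. Trips without any history in the window get None (NULL).
    """
    features = []
    for i, (current_date, _) in enumerate(trips):
        window_start = current_date - timedelta(days=window_days)
        past = [dist for d, dist in trips[:i] if window_start <= d < current_date]
        features.append(sum(past) / len(past) if past else None)
    return features


def fill_nulls_with_zero(values):
    """Replace missing values with 0.0, which is what inflates the zero ratio."""
    return [0.0 if v is None else v for v in values]


trips = [
    (date(2020, 1, 1), 2.0),  # first trip ever: no history
    (date(2020, 1, 3), 4.0),
    (date(2020, 1, 5), 6.0),
    (date(2020, 2, 1), 3.0),  # more than a week after the previous trip: no history
]

raw = one_week_history_mean(trips)  # [None, 2.0, 3.0, None]
filled = fill_nulls_with_zero(raw)  # [0.0, 2.0, 3.0, 0.0]
```

Every example without trips in its look-back window ends up as a zero, so a high zero ratio here reflects sparse history rather than a data quality problem.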

For the categorical features, no discrepancy is noticeable except for `pickup_census_tract` & `dropoff_census_tract`.
<br>In a nutshell, many locations present in the __test set__ are never seen during training.
<br>This could cause a real drop in performance if the features `pickup_census_tract` & `dropoff_census_tract` turn out to be _important_ during the training phase.
<br>One solution would then be to replace the __census_tract__ fields with coarser location features such as the __community_area__ fields.

However, at this stage, let's not make this replacement, since the __census_tract__ information is intertwined with the feature engineering of numeric features such as `historical_tripDistance`, `historical_tripDuration`, `histOneWeek_tripDistance` & `histOneWeek_tripDuration`, which are consistent across the train & test sets.

The field `trip_id` is discarded as it is not a feature, but it still needs to be serialized by TFDV as the unique identifier of examples.

In [69]:
tfdv.display_anomalies(test_anomalies)