# Data split & validation

## Abstract

This notebook splits the extracted data and validates it with [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) (TFDV), covering both the feature set & the target variable.

At this stage, the full dataset is read from BigQuery, split into the train/eval & test datasets, and serialized to Cloud Storage under the dedicated section `raw_examples`.

More specifically, we check whether the distributions of the variables used for modelling (features & target) differ noticeably across the different data sets (training, evaluation & test sets).

In a nutshell, this validation phase checks that the data at hand is representative, which is crucial for a trained model to generalize well to new data.

## Caution

The `DataflowRunner` used by TFDV is sensitive to `NULL` values.
<br>We therefore fill them with `0` in the numeric fields.

Furthermore, to force the automatic detection of __categorical features__, we cast such variables from their initial data type to `STRING` directly when querying BigQuery.
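Both precautions can be applied in the query itself. The following is a minimal sketch, not the query actually used by this notebook: the helper `build_select` and the table name are hypothetical, the column lists are only examples, and the sketch relies on nothing more than the standard BigQuery expressions `IFNULL` and `CAST(... AS STRING)`.

```python
def build_select(table, numeric_cols, categorical_cols):
    """Build a SELECT that zero-fills numeric NULLs and casts categoricals to STRING.

    `table` and both column lists are supplied by the caller (hypothetical helper).
    """
    # Numeric fields: replace NULL with 0 so that no missing value reaches TFDV.
    numeric = ["IFNULL({c}, 0) AS {c}".format(c=c) for c in numeric_cols]
    # Categorical fields: cast to STRING so that TFDV infers them as categorical.
    categorical = ["CAST({c} AS STRING) AS {c}".format(c=c) for c in categorical_cols]
    columns = ",\n  ".join(numeric + categorical)
    return "SELECT\n  {columns}\nFROM `{table}`".format(columns=columns, table=table)


# Example call; the table name below is a placeholder, not the real source table.
query = build_select(
    "my_project.my_dataset.trips",
    numeric_cols=["histOneWeek_tripDistance", "histOneWeek_tripDuration"],
    categorical_cols=["pickup_census_tract", "dropoff_census_tract"],
)
```

Keeping the fill and the cast in the query means the serialized examples are already in the shape TFDV expects, at the cost of losing the distinction between a true zero and a missing value.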

## Train/eval/test split

To mimic a realistic scenario of serving the fare pricing engine, we hold out all data from July 1st, 2019 onward as __test data__.
<br>A hypothetical motivation behind such a holdout split could be that the business wants to retrain its production model every 2 months, meaning that the model in production has to run & be tested over 2 months before retraining (in our case July & August 2019).

Data collected from January 1st, 2016 to June 30th, 2019 is then used for __training & model evaluation__.
<br>More specifically, the __train/eval split__ is made in a standard fashion: randomly, with a ratio of 95% / 5%.
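The split rule described above can be sketched in a few lines. This is an illustration only: the function `assign_split`, the constant `TEST_START`, the seed and the sample dates are assumptions made for the example, and the actual split in this project may well be performed in the BigQuery query rather than in Python.

```python
import datetime
import random

# Assumption: the test period starts on July 1st, 2019 (training data ends June 30th).
TEST_START = datetime.date(2019, 7, 1)


def assign_split(trip_date, rng, eval_ratio=0.05):
    """Return 'test', 'eval' or 'train' for one example."""
    # Time-based holdout: everything from TEST_START onward is test data.
    if trip_date >= TEST_START:
        return "test"
    # Random split of the remaining data: about 5% eval, 95% train.
    return "eval" if rng.random() < eval_ratio else "train"


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
dates = [
    datetime.date(2019, 6, 30),
    datetime.date(2019, 7, 1),
    datetime.date(2019, 8, 15),
]
splits = [assign_split(d, rng) for d in dates]
```

The two July and August dates always land in the test set, while the June date falls in train or eval depending on the random draw.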

## Generate TFDV reports

In [10]:
%%bash

sh analyze_and_validate/tfdv_dataflow.sh

Starting distributed TFDV stats computation and schema generation...


## Reports analysis

In [1]:
import tensorflow_data_validation as tfdv


In [2]:
train_stats = tfdv.load_statistics('gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfdv/stats/train_stats')
eval_stats = tfdv.load_statistics('gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfdv/stats/eval_stats')
test_stats = tfdv.load_statistics('gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfdv/stats/test_stats')


In [3]:
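from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import anomalies_pb2


def load_anomalies_text(input_path):
    """Load a text-format Anomalies protobuf from a local or GCS path."""
    anomalies = anomalies_pb2.Anomalies()
    # file_io reads gs:// paths as well as local files.
    anomalies_text = file_io.read_file_to_string(input_path)
    # Parse the pbtxt content into the (initially empty) Anomalies message.
    text_format.Parse(anomalies_text, anomalies)
    return anomalies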

eval_anomalies = load_anomalies_text('gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfdv/anomalies/eval_anomalies.pbtxt')
test_anomalies = load_anomalies_text('gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfdv/anomalies/test_anomalies.pbtxt')

### TensorFlow Data Validation results

The data is analysed using [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started).

The original dataset was split into a train, eval and test set:
- the train/eval sets are created from a random 95%/5% split of the original data before July 1st, 2019
- the test set contains the data from July 1st, 2019 onward

### TFDV for train & eval sets

In [7]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=eval_stats, rhs_name='eval')

For the numeric features & the target variable, no discrepancy (drift, or schema/feature/distribution skew) is noticeable.
<br>As a remark, the red-highlighted zero ratios of `histOneWeek_tripDistance` & `histOneWeek_tripDuration` are due to the original `NULL` values filled in with zeros; such `NULL` values are caused by the rolling-window feature engineering and are expected.

For the categorical features, no discrepancy is noticeable except for `pickup_census_tract` & `dropoff_census_tract`.
<br>In a nutshell, many locations present in the __eval set__ are never seen during training.
<br>This could cause a real drop in performance if the features `pickup_census_tract` & `dropoff_census_tract` turn out to be _important_ during the training phase.
<br>One solution would then be to replace the __census_tract__ fields with coarser location fields like the __community_area__ fields.

However, at this stage we do not proceed with this replacement, since the __census_tract__ information is also used in the feature engineering of numeric features like `historical_tripDistance`, `historical_tripDuration`, `histOneWeek_tripDistance` & `histOneWeek_tripDuration`, which are consistent across the train & eval sets.
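To put a number on how many locations of a split are never seen during training, one can compare the sets of distinct values directly. The sketch below is a plain-Python illustration and is not part of TFDV: the helper `unseen_ratio` is hypothetical and the tract identifiers are made-up toy values.

```python
def unseen_ratio(train_values, other_values):
    """Share of distinct values in `other_values` that never occur in `train_values`."""
    train_vocab = set(train_values)
    other_vocab = set(other_values)
    if not other_vocab:
        # Nothing to compare: by convention, report no unseen value.
        return 0.0
    unseen = other_vocab - train_vocab
    return len(unseen) / float(len(other_vocab))


# Toy example with made-up identifiers: 'C' and 'D' are absent from training.
train_tracts = ["A", "B", "B"]
eval_tracts = ["B", "C", "C", "D"]
ratio = unseen_ratio(train_tracts, eval_tracts)
```

A ratio close to zero for the coarser `community_area` fields and a high one for the `census_tract` fields would support the replacement discussed above.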

The field `trip_id` is ignored here since it is not a feature; it is still serialized for TFDV as the unique identifier of the examples.

In [5]:
tfdv.display_anomalies(eval_anomalies)
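Besides the rendered table, the anomalies can also be inspected programmatically, for instance to fail a pipeline step when something is reported. The sketch below assumes only that the loaded message exposes an `anomaly_info` map whose values carry a `short_description` field, as the `Anomalies` proto does; the helper `summarize_anomalies` and the stand-in objects are hypothetical, and the descriptions in the demo are invented for illustration.

```python
from collections import namedtuple


def summarize_anomalies(anomalies):
    """Return sorted (feature_name, short_description) pairs from an Anomalies message."""
    return sorted(
        (name, info.short_description)
        for name, info in anomalies.anomaly_info.items()
    )


# Stand-ins mimicking the two fields used above, so the sketch runs without TFDV.
FakeInfo = namedtuple("FakeInfo", ["short_description"])
FakeAnomalies = namedtuple("FakeAnomalies", ["anomaly_info"])
demo = FakeAnomalies({
    "feature_b": FakeInfo("New column"),
    "feature_a": FakeInfo("Unexpected string values"),
})
summary = summarize_anomalies(demo)
```

With the objects loaded in this notebook, the call would be `summarize_anomalies(eval_anomalies)`; an empty list means that no anomaly was reported.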

### TFDV for train & test sets

In [6]:
tfdv.visualize_statistics(lhs_statistics=train_stats, lhs_name='train', rhs_statistics=test_stats, rhs_name='test')

For the numeric features & the target variable, no discrepancy (drift, or schema/feature/distribution skew) is noticeable.
<br>As a remark, the red-highlighted zero ratios of `histOneWeek_tripDistance` & `histOneWeek_tripDuration` are due to the original `NULL` values filled in with zeros; such `NULL` values are caused by the rolling-window feature engineering and are expected.

For the categorical features, no discrepancy is noticeable except for `pickup_census_tract` & `dropoff_census_tract`.
<br>In a nutshell, many locations present in the __test set__ are never seen during training.
<br>This could cause a real drop in performance if the features `pickup_census_tract` & `dropoff_census_tract` turn out to be _important_ during the training phase.
<br>One solution would then be to replace the __census_tract__ fields with coarser location fields like the __community_area__ fields.

However, at this stage we do not proceed with this replacement, since the __census_tract__ information is also used in the feature engineering of numeric features like `historical_tripDistance`, `historical_tripDuration`, `histOneWeek_tripDistance` & `histOneWeek_tripDuration`, which are consistent across the train & test sets.
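A single distance between the value distributions of two splits gives a compact view of how far a categorical feature has moved between training and test periods. As far as we know, TFDV's skew and drift comparators for categorical features are based on the L-infinity distance; the sketch below computes that distance in plain Python on raw values, independently of TFDV. The helper `linf_distance` is hypothetical and the sample values are toy data.

```python
from collections import Counter


def linf_distance(values_a, values_b):
    """L-infinity distance between the empirical distributions of two categorical samples."""
    if not values_a or not values_b:
        raise ValueError("both samples must be non-empty")
    count_a, count_b = Counter(values_a), Counter(values_b)
    n_a, n_b = float(len(values_a)), float(len(values_b))
    # Largest absolute gap in relative frequency over all observed values;
    # a Counter returns 0 for a value missing from one of the samples.
    return max(
        abs(count_a[v] / n_a - count_b[v] / n_b)
        for v in set(count_a) | set(count_b)
    )


# Toy example: 'b' goes from 25% of the first sample to 75% of the second.
distance = linf_distance(["a", "a", "b", "c"], ["a", "b", "b", "b"])
```

The distance is 0 for identical distributions and grows towards 1 as a single value dominates one sample while being absent from the other.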

The field `trip_id` is ignored here since it is not a feature; it is still serialized for TFDV as the unique identifier of the examples.

In [8]:
tfdv.display_anomalies(test_anomalies)