<a href="https://colab.research.google.com/github/beekal/MachieneLearningProjects/blob/master/0%20Basics%20-%20TF/TFX_Production_Scale_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TFX : Why ?
  - Provides an end-to-end ML solution that takes a model from research to production, including model versioning.
  - Bundles every task into one package:
    - Reading data
    - Preprocessing
    - Model building / validation
    - Model deployment
    - Model versioning with rollback

If we do not use TFX and instead keep the preprocessing steps separate from the model, we cannot roll back reliably: when the preprocessing changes between model versions, restoring an older model does not restore the preprocessing it was trained with.
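
The rollback argument can be illustrated with a minimal, library-free sketch. The `ModelBundle` and `Registry` classes below are hypothetical names invented for this example (they are not part of TFX), and the lambdas stand in for real preprocessing and models. Each version stores its preprocessing together with its model, so rolling back restores both at once.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ModelBundle:
    """One deployable version: preprocessing and model travel together."""
    version: int
    preprocess: Callable[[float], float]
    predict: Callable[[float], float]

    def serve(self, raw_value: float) -> float:
        # Always apply the preprocessing that this version was trained with.
        return self.predict(self.preprocess(raw_value))


class Registry:
    """Hypothetical in-memory registry that keeps every pushed version."""

    def __init__(self) -> None:
        self._versions: List[ModelBundle] = []

    @property
    def current(self) -> ModelBundle:
        return self._versions[-1]

    def push(self, bundle: ModelBundle) -> None:
        self._versions.append(bundle)

    def rollback(self) -> ModelBundle:
        # Drop the newest version; the previous bundle becomes current again.
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current


registry = Registry()
# Illustrative assumption: v1 was trained on miles, v2 on kilometres.
registry.push(ModelBundle(1, preprocess=lambda miles: miles, predict=lambda x: 2.0 * x))
registry.push(ModelBundle(2, preprocess=lambda miles: miles * 1.609, predict=lambda x: 1.25 * x))

fare_v2 = registry.current.serve(10.0)     # uses v2 preprocessing + v2 model
fare_v1 = registry.rollback().serve(10.0)  # uses v1 preprocessing + v1 model
```

If the preprocessing lived outside the bundle, rolling back to version 1 would feed kilometres into a model trained on miles, which is the failure mode described above.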

### TFX Components :
A TFX pipeline consists of the following components, which we discuss here:
  - ExampleGen: reads data into the TFX pipeline
  - StatisticsGen: calculates exploratory statistics about the data
  - SchemaGen: creates a data schema based on the statistics
  - ExampleValidator: analyses the data for abnormalities / inconsistencies
  - Transform: performs the necessary transformations on the data
  - Trainer: trains the model
  - Evaluator: evaluates model performance to decide whether the model is ready for deployment or should be discarded
  - Pusher: deploys the model to production
  ![TFX component diagram](https://www.tensorflow.org/tfx/guide/images/diag_all.png)

## TF Libraries for the components:
  - StatisticsGen / SchemaGen / ExampleValidator: [TensorFlow Data Validation (TFDV)](https://www.tensorflow.org/tfx/guide/tfdv) generates statistics, inspects the schema, and analyses / validates the data. It is also used to calculate and record drift and anomalies, which helps identify whether a model needs retraining.
  - ExampleValidator: TensorFlow Metadata (TFMD) provides and stores the metadata that validation relies on: the schema of the data and summary statistics of the data.
  - Transform: [TensorFlow Transform (TFT)](https://www.tensorflow.org/tfx/guide/tft)
  - Evaluator: TensorFlow Model Analysis (TFMA) evaluates models. It can run the evaluation over large amounts of data in a distributed way.
  - MLMD: [ML Metadata](https://www.tensorflow.org/tfx/guide/mlmd) stores all relevant ML information other than data statistics, including the workflow.
  - Pusher: [SavedModel](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/saved_model) + [TF Serving](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple). SavedModel is the universal, portable, and recommended serialization format for deploying TF models to mobile, JavaScript apps, or any other infrastructure. The trained model is exported as a SavedModel, and TF Serving loads and serves it.
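
To make the SchemaGen / ExampleValidator idea concrete, here is a toy, library-free sketch. It is not TFDV's implementation: TFDV works on computed statistics (via `tfdv.infer_schema` and `tfdv.validate_statistics`), whereas the hypothetical `infer_schema` and `validate` helpers below work directly on a handful of made-up rows. The sketch only shows the principle: learn what "normal" looks like from training data, then flag serving data that deviates.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def infer_schema(rows: List[Row]) -> Dict[str, dict]:
    """Derive a toy schema: per column, whether it is required and its allowed values."""
    schema: Dict[str, dict] = {}
    columns = {name for row in rows for name in row}
    for name in sorted(columns):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in (None, "")]
        schema[name] = {
            "required": len(present) == len(values),  # no missing values seen
            "domain": set(present),                   # values seen in training
        }
    return schema


def validate(rows: List[Row], schema: Dict[str, dict]) -> List[str]:
    """Return human-readable anomalies found when checking rows against the schema."""
    anomalies = []
    for index, row in enumerate(rows):
        for name, rule in schema.items():
            value = row.get(name)
            if value in (None, ""):
                if rule["required"]:
                    anomalies.append(f"row {index}: '{name}' is missing")
            elif value not in rule["domain"]:
                anomalies.append(f"row {index}: unexpected value '{value}' for '{name}'")
    return anomalies


# Made-up rows for illustration only; they are not taken from the dataset.
train_rows = [
    {"payment_type": "Cash", "company": "Company A"},
    {"payment_type": "Credit Card", "company": ""},
]
serving_rows = [
    {"payment_type": "Cash", "company": "Company A"},
    {"payment_type": "Bitcoin", "company": ""},
    {"payment_type": "", "company": ""},
]

schema = infer_schema(train_rows)
anomalies = validate(serving_rows, schema)
for message in anomalies:
    print(message)
```

Here `company` is inferred as optional because it was missing in training, so its absence in serving data is not flagged, while the unseen `payment_type` value and the missing `payment_type` are.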



## Visualisation :
We can inspect the data visually with the TFDV library, using:
  - `tfdv.load_statistics()` to load previously saved statistics
  - `tfdv.visualize_statistics()` to render them

## Evaluator:

## [TFX Guidelines](https://www.tensorflow.org/tfx/guide/train):
If using TFX to develop pipelines, then:
- The model's input layer must consume from the SavedModel created by the Transform component.
- The Transform layers must be included in the model, so that the transformations are exported along with the model as part of the SavedModel.
- The model must be saved as both a SavedModel (used by TF Serving) and an EvalSavedModel (used by TF Model Analysis for evaluation).


## Additional Dependency :
  - Apache Beam: develop on a single node, run on multiple nodes.

  We want our ML workloads to run in parallel for greater speed while also being scalable. For example, we would likely develop the ML model on a single computer, but in production we would like to run it on a multi-node cluster to process a large volume of work in parallel. Apache Beam provides this abstraction: whatever we research / develop on a single node can be scaled to a multi-node cluster with little or no extra effort or code modification.
  - Apache Airflow / Kubeflow: deploy, scale, and manage ML applications automatically.

  Some example cases that Airflow / Kubeflow handle:
    - Define 100 nodes, the input file path, the checkpoint path, and the source code GitHub path.
    - Install the required libraries on all 100 nodes of the cluster.
    - Then deploy the ML model to all of them.
    - The ML model is receiving a tremendous volume of requests, so scale it up.
    - Disaster strikes and 50 nodes have gone down, so we need to bring another 50 up to compensate.
    - Utilise CPU / GPU efficiently and minimise cost.
    - Train your model cheaply using AWS spot instances (i.e. use lower-cost spot capacity if available).

REF : https://docs.agilestacks.com/article/gkyq26pzmr-creating-an-ml-pipeline
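
The Apache Beam idea above (develop on a single node, run on many) can be illustrated with an analogy built only on the Python standard library. This is not Beam's API: in Beam the execution backend is chosen through the pipeline's runner, whereas the hypothetical `run_local` and `run_parallel` helpers below simply show that one pipeline definition can be executed serially or fanned out over workers without being rewritten.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def parse_fare(line: str) -> float:
    """Pipeline step 1: parse one raw record."""
    return float(line.strip())


def add_tip(fare: float) -> float:
    """Pipeline step 2: a stand-in transformation (adds 10 percent)."""
    return round(fare * 1.10, 2)


def pipeline(record: str) -> float:
    # The pipeline definition knows nothing about where it runs.
    return add_tip(parse_fare(record))


def run_local(records: Iterable[str]) -> List[float]:
    """'Single node' runner: process records one after another."""
    return [pipeline(record) for record in records]


def run_parallel(records: Iterable[str], workers: int = 4) -> List[float]:
    """'Multi-worker' runner: the same pipeline, fanned out over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the records.
        return list(pool.map(pipeline, records))


records = ["10.00", "25.50", " 7.25\n"]
local_result = run_local(records)
parallel_result = run_parallel(records)
print(local_result == parallel_result)
```

The design point is that only the runner changes; `pipeline` itself is identical in both cases.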

### TFX End to End
- ExampleGen: reads data into the TFX pipeline
- StatisticsGen: calculates exploratory statistics about the data
- SchemaGen: creates a data schema based on the statistics
- ExampleValidator: analyses the data for abnormalities / inconsistencies
- Transform: performs the necessary transformations on the data
- Trainer: trains the model
- Evaluator: evaluates model performance to decide whether the model is ready for deployment or should be discarded
- Pusher: deploys the model to production

### Install Libraries

In [0]:
!pip install -q pyarrow==0.15.1
!pip install -q tfx-bsl==0.21.4
!pip install -Uq apache_beam==2.20.0
!pip install -q tensorflow==2.1.0
!pip install -q tensorflow_data_validation==0.21.5


# Import the libraries and print their versions
import pyarrow
import tensorflow as tf
import apache_beam as beam
import apache_beam.io.iobase
import tensorflow_data_validation as tfdv

print('Pyarrow : ', pyarrow.__version__)
print('Beam : ', beam.__version__)
print('Tensorflow : ', tf.__version__)
print('TFDV :', tfdv.version.__version__)


ERROR: apache-beam 2.20.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.3 which is incompatible.
ERROR: google-api-python-client 1.7.12 has requirement httplib2<1dev,>=0.17.0, but you'll have httplib2 0.12.0 which is incompatible.
Pyarrow :  0.15.1
Beam :  2.20.0
Tensorflow :  2.1.0
TFDV : 0.21.5


###  ExampleGen : Read data into TFX Pipeline
We will download the Chicago Taxi CSV data and read it into the TFX pipeline.

In [0]:
import os, tempfile, urllib.request, zipfile

# Temporary working directory for the downloaded data and the outputs.
ROOT_DIR = tempfile.mkdtemp()
OUTPUT_DIR = os.path.join(ROOT_DIR, 'output')

def get_chicago_taxi_data():
  """Download and unzip the dataset; return the train, eval, and serving CSV paths."""
  # urlretrieve returns (local_path, headers); the zip is saved to a temporary file.
  zip_path, _ = urllib.request.urlretrieve('https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/chicago_data.zip')
  with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(ROOT_DIR)

  train_data = os.path.join(ROOT_DIR, 'data', 'train', 'data.csv')
  eval_data = os.path.join(ROOT_DIR, 'data', 'eval', 'data.csv')
  serving_data = os.path.join(ROOT_DIR, 'data', 'serving', 'data.csv')
  return train_data, eval_data, serving_data

TRAIN_DATA, EVAL_DATA, SERVING_DATA = get_chicago_taxi_data()

print('Inspecting Train, eval  and serving Data Files')
!ls -R -all {ROOT_DIR}

Inspecting Train, eval  and serving Data Files
/tmp/tmpme9ybo7h:
total 12
drwx------ 3 root root 4096 May 14 03:43 .
drwxrwxrwt 1 root root 4096 May 14 03:43 ..
drwxr-xr-x 5 root root 4096 May 14 03:43 data

/tmp/tmpme9ybo7h/data:
total 20
drwxr-xr-x 5 root root 4096 May 14 03:43 .
drwx------ 3 root root 4096 May 14 03:43 ..
drwxr-xr-x 2 root root 4096 May 14 03:43 eval
drwxr-xr-x 2 root root 4096 May 14 03:43 serving
drwxr-xr-x 2 root root 4096 May 14 03:43 train

/tmp/tmpme9ybo7h/data/eval:
total 636
drwxr-xr-x 2 root root   4096 May 14 03:43 .
drwxr-xr-x 5 root root   4096 May 14 03:43 ..
-rw-r--r-- 1 root root 641083 May 14 03:43 data.csv

/tmp/tmpme9ybo7h/data/serving:
total 24
drwxr-xr-x 2 root root  4096 May 14 03:43 .
drwxr-xr-x 5 root root  4096 May 14 03:43 ..
-rw-r--r-- 1 root root 12727 May 14 03:43 data.csv

/tmp/tmpme9ybo7h/data/train:
total 1260
drwxr-xr-x 2 root root    4096 May 14 03:43 .
drwxr-xr-x 5 root root    4096 May 14 03:43 ..
-rw-r--r-- 1 root root 1281883 May

## StatisticsGen : Perform Exploratory Analysis
- For large datasets, TFDV uses Apache Beam to parallelise and scale the statistics computation.

In [0]:
train_statistics = tfdv.generate_statistics_from_csv(data_location=TRAIN_DATA)


- company: values are missing for 34% of rows.
    - Create dummy variables.
- payment_type: create dummy variables.
- company: normalisation is necessary.
- fare: 0.17% of fares are zero, which seems odd. How can a fare be zero?
- trip_start_day: should not contain float values. Confirm.
- trip_start_timestamp: wrong format.
- dropoff_latitude / dropoff_longitude: missing values.
- trip_miles: 27% of values are zero. How can the trip distance be zero?
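
Percentages like the missing-value and zero-value shares noted above can be cross-checked without TFDV. The sketch below is an assumption-laden illustration, not how TFDV computes its statistics: `column_summary` is a hypothetical helper, and the in-memory sample is made up and merely stands in for `data.csv`.

```python
import csv
import io
from typing import Dict


def column_summary(csv_file, column: str) -> Dict[str, float]:
    """Return the fraction of missing and of zero values in one CSV column."""
    total = missing = zeros = 0
    for row in csv.DictReader(csv_file):
        total += 1
        value = (row.get(column) or "").strip()
        if value == "":
            missing += 1
            continue
        try:
            if float(value) == 0.0:
                zeros += 1
        except ValueError:
            pass  # Non-numeric values (e.g. company names) cannot be zero.
    if total == 0:
        return {"missing": 0.0, "zero": 0.0}
    return {"missing": missing / total, "zero": zeros / total}


# Made-up sample standing in for data.csv; the values are illustrative only.
sample = io.StringIO(
    "trip_miles,company\n"
    "0.0,\n"
    "1.2,Company A\n"
    "0,Company B\n"
    "3.4,\n"
)
miles = column_summary(sample, "trip_miles")
sample.seek(0)  # rewind before reading the same buffer again
company = column_summary(sample, "company")
print(miles, company)
```

On the real data the same helper could be called as `column_summary(open(TRAIN_DATA, newline=""), "trip_miles")`, where `TRAIN_DATA` is the path returned earlier in this notebook.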


In [0]:
tfdv.visualize_statistics(train_statistics)

## Observation:
  - pickup_census_tract: all values are missing. FIX: remove the column.
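
The fix for a column with no values at all can be sketched in plain Python. `drop_empty_columns` is a hypothetical helper written for this example, and the rows are made up; in a TFX pipeline this kind of cleanup would normally live in the Transform step.

```python
from typing import Dict, List

Row = Dict[str, str]


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove every column whose value is empty in all rows."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Keep a column as soon as one row holds a non-blank value for it.
    keep = [c for c in columns if any((row.get(c) or "").strip() for row in rows)]
    return [{c: row.get(c, "") for c in keep} for row in rows]


# Made-up rows for illustration only.
rows = [
    {"fare": "12.5", "pickup_census_tract": "", "company": ""},
    {"fare": "0", "pickup_census_tract": "", "company": "Company A"},
]
cleaned = drop_empty_columns(rows)
print(cleaned)
```

Note that `company` survives because it is only partially missing; only columns that are empty in every row are dropped.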