
# TF 2.0 Topics Covered
1. tf.data: A single entry point for loading varied data sources such as pandas DataFrames, CSV, images, text, and TFRecord files.
2. TFX (TensorFlow Extended): Aims to provide an end-to-end TF ecosystem, from research/prototyping to server deployment. It includes the following components, all of which we will cover in this notebook:
  - TF Data Validation: Used to validate the data
  - TF Transform: Used to transform the data into the state needed for modeling
  - TF Modeling / Analysis: Used to create the model and evaluate its performance. Iterate until a satisfactory metric is reached.
  - TF Serving: Used to serve the model in production, with versioning and rollback capabilities.

REF: https://www.tensorflow.org/tfx 

In [2]:
import tensorflow as tf
import numpy

print('TF version : ', tf.__version__)

TF version :  2.2.0-rc3


## TF.Data
Depending on the data source, you will need different tf.data loading calls:
  - Data in memory:
      - tf.data.Dataset.from_tensors() or
      - tf.data.Dataset.from_tensor_slices()
  - tf.data.TFRecordDataset(): For large files
  - tf.data.TextLineDataset(): For text files
  - From CSV (tf.data.experimental.make_csv_dataset()): Supports lazy, incremental data loading
  - From a DataFrame (via from_tensor_slices()): Loads all data into memory
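
The two in-memory constructors differ in how they treat the first axis of the input. A minimal sketch, assuming a small nested list as stand-in data (the variable names are illustrative):

```python
import tensorflow as tf

data = [[1, 2], [3, 4], [5, 6]]

# from_tensors: the whole input becomes ONE dataset element of shape (3, 2)
whole = tf.data.Dataset.from_tensors(data)
# from_tensor_slices: the input is sliced along its first axis into 3 elements of shape (2,)
sliced = tf.data.Dataset.from_tensor_slices(data)

whole_elems = [e.numpy().tolist() for e in whole]
sliced_elems = [e.numpy().tolist() for e in sliced]
print(len(whole_elems), whole_elems)
print(len(sliced_elems), sliced_elems)
```

from_tensor_slices() is usually the one you want when each row is a separate training example.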

1. **TFRecordDataset**: To read data efficiently, or when dealing with large files, it is best to use TFRecordDataset. It reads records stored in a binary format, typically sharded into smaller files (100-200 MB each), which makes them faster and easier to read. To learn more, see [how to create a TFRecord file](https://www.tensorflow.org/tutorials/load_data/tfrecord).
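
As a rough sketch of the round trip, the snippet below writes two records to a TFRecord file and parses them back. The file path, the `to_example` helper, and the `age`/`name` features are assumptions made for illustration, not part of any dataset used in this notebook.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")

def to_example(age, name):
    """Pack one (age, name) record into a tf.train.Example proto."""
    return tf.train.Example(features=tf.train.Features(feature={
        "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
        "name": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name.encode("utf-8")])),
    }))

# Write: serialize each Example to bytes and append it to the file
with tf.io.TFRecordWriter(path) as writer:
    for age, name in [(18, "Hary"), (30, "Sam")]:
        writer.write(to_example(age, name).SerializeToString())

# Read: describe the expected features, then parse each serialized record
feature_spec = {
    "age": tf.io.FixedLenFeature([], tf.int64),
    "name": tf.io.FixedLenFeature([], tf.string),
}
parsed = tf.data.TFRecordDataset([path]).map(
    lambda record: tf.io.parse_single_example(record, feature_spec))

records = [(int(r["age"].numpy()), r["name"].numpy().decode("utf-8"))
           for r in parsed]
print(records)
```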

In [44]:
def hprint(text):
  print("\n"+"="*80+"\n\t\t"+text+"\n"+"="*80)

hprint('READING FROM MEMORY')
d_mem = tf.data.Dataset.from_tensor_slices([1,2,3])
print(' Tf.Data from memory : ' ,[e.numpy() for e in d_mem])
it = iter(d_mem)
print(' Tf.Data from memory using iterator : ' ,next(it).numpy())

# French Street Name Signs (FSNS) TFRecord dataset
hprint('READING FROM TF RECORD DATASET')
fsns_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
fsns_tfrecord = tf.data.TFRecordDataset(filenames= [fsns_file])
print(fsns_tfrecord)
byte_example = next(iter(fsns_tfrecord))
# Displays all the example data
# print(tf.train.Example.FromString(byte_example.numpy()) )
print(tf.train.Example.FromString(byte_example.numpy()).features.feature['image/text'] ) 

hprint('READING TEXT FILE')
# Named base_url to avoid shadowing the built-in dir()
base_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = [tf.keras.utils.get_file(f, base_url+f) for f in file_names ]
f_dataset =   tf.data.TextLineDataset(file_paths)
# Files are read one after another, so take(3) yields the first three lines of cowper.txt
for line in f_dataset.take(3):
  print('First line of each file  :', line.numpy() )
# To learn how to shuffle text lines between files, see https://www.tensorflow.org/guide/data#consuming_text_data

hprint('READING FROM DATAFRAME')
import pandas as pd
titanic_csv = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df =  pd.read_csv(titanic_csv, index_col=None)
t_dataset = tf.data.Dataset.from_tensor_slices(dict(df))
for feat in t_dataset.take(1):
  for col_name, val in feat.items():
    print("{:20s}: {}".format(col_name, val))

hprint('READING FROM CSV')
t_batch = tf.data.experimental.make_csv_dataset(
                titanic_csv, batch_size=2, label_name="survived",
                select_columns=['sex','age', 'class', 'fare', 'survived'] )
for feat, label in t_batch.take(1):
  print('Label (Survived) :', label)
  for col_name, val in feat.items():
    print("{:20s} : {}".format(col_name, val))


		READING FROM MEMORY
 Tf.Data from memory :  [1, 2, 3]
 Tf.Data from memory using iterator :  1

		READING FROM TF RECORD DATASET
<TFRecordDatasetV2 shapes: (), types: tf.string>
bytes_list {
  value: "Rue Perreyon"
}


		READING TEXT FILE
First line of each file  : b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
First line of each file  : b'His wrath pernicious, who ten thousand woes'
First line of each file  : b"Caused to Achaia's host, sent many a soul"

		READING FROM DATAFRAME
survived            : 0
sex                 : b'male'
age                 : 22.0
n_siblings_spouses  : 1
parch               : 0
fare                : 7.25
class               : b'Third'
deck                : b'unknown'
embark_town         : b'Southampton'
alone               : b'n'

		READING FROM CSV
Label (Survived) : tf.Tensor([0 1], shape=(2,), dtype=int32)
sex                  : [b'male' b'female']
age                  : [36.  4.]
fare                 : [10.5    13.4167]
class                : 

### TF.Data: Generator
  - **Read mixed data types with lazy data loading**

  - Use a Python generator to feed elements into a tf.data Dataset one at a time
  - Example: Load image data with a generator, so that we do not need to read 5+ GB of data into memory at once.
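
To make the lazy-loading point concrete, here is a minimal sketch with a true generator function. `count_squares` is a hypothetical stand-in for something heavier, such as reading images from disk; each element is only produced when the dataset asks for it.

```python
import tensorflow as tf

def count_squares(limit):
    """Yield (n, n*n) pairs one at a time; nothing is precomputed."""
    n = 0
    while n < limit:
        yield n, n * n
        n += 1

squares = tf.data.Dataset.from_generator(
    count_squares,
    args=[5],                            # forwarded to count_squares as `limit`
    output_types=(tf.int32, tf.int32),   # dtype of each yielded component
    output_shapes=((), ()),              # both components are scalars
)

# take(3) pulls only the first three elements from the generator
first_three = [(int(a), int(b)) for a, b in squares.take(3).as_numpy_iterator()]
print(first_three)
```

Newer TensorFlow releases also accept an `output_signature` argument in place of `output_types` and `output_shapes`.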

In [4]:
person_data = [{ 'age':18,'name':'Hary' },
               { 'age':30,'name':'Sam' }]
person =  tf.data.Dataset.from_generator( lambda: person_data, {"age": tf.int32, "name":tf.string}  )
print(person)
print(list(person.as_numpy_iterator()))


<FlatMapDataset shapes: {age: <unknown>, name: <unknown>}, types: {age: tf.int32, name: tf.string}>
[{'age': 18, 'name': b'Hary'}, {'age': 30, 'name': b'Sam'}]


## TF Transform :
We can apply different types of transformations to a Dataset as needed
  - Dataset.map(): Applies a function to each element
  - Dataset.filter(): Keeps only the elements that satisfy a predicate
  - Dataset.reduce(): Reduces all elements to a single value
  - Dataset.batch(): Groups consecutive elements into batches
  - Dataset.shuffle(): Randomly shuffles the elements
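
A short sketch of Dataset.filter(), assuming a toy range dataset; the predicate must return a scalar boolean tensor for each element:

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)

# Keep only the elements for which the predicate evaluates to True
evens = numbers.filter(lambda x: x % 2 == 0)

even_values = [int(v) for v in evens.as_numpy_iterator()]
print(even_values)
```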

### Dataset.map: Parallelisation / Speed-up
To speed up Dataset.map(), specify num_parallel_calls.
  - Not defined: Elements are processed sequentially
  - tf.data.experimental.AUTOTUNE: The number of parallel calls is tuned dynamically based on available CPU
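
A minimal sketch of the AUTOTUNE option, assuming a toy range dataset. By default the output order is still deterministic even though the calls may run in parallel.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

ds = tf.data.Dataset.range(5)
# Let tf.data pick the level of parallelism at runtime
doubled = ds.map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)

doubled_values = [int(v) for v in doubled.as_numpy_iterator()]
print(doubled_values)
```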


In [21]:
def f(x):
  return x+2

print('Dataset.map(): All elements Increment by 1')
print(list(d_mem.map(lambda x: x+1, num_parallel_calls=2).as_numpy_iterator() ))
# A named function can be passed to map() directly
print('Using function ',list(d_mem.map(f, num_parallel_calls=2).as_numpy_iterator() ))


print('\nDataset.reduce():  With initial_state or starting_val')
print('=10 ', d_mem.reduce( 10,  lambda x,y: x+y).numpy() )
print('=0 ', d_mem.reduce( 0,  lambda x,y: x+y).numpy() )

print('\nDataset.shuffle()')
d_mem1 = tf.data.Dataset.from_tensor_slices([1,2,3,4,5,6,7,8,9]) 
print('Pre-Shuffle ', list(d_mem1.as_numpy_iterator() ))
# shuffle() is random, so the post-shuffle output differs between runs
print('Post-Shuffle', list(d_mem1.shuffle(buffer_size=2).batch(2).as_numpy_iterator() ))

Dataset.map(): All elements Increment by 1
[2, 3, 4]
Using function  [3, 4, 5]

Dataset.reduce():  With initial_state or starting_val
=10  16
=0  6

Dataset.shuffle()
Pre-Shuffle  [1, 2, 3, 4, 5, 6, 7, 8, 9]
Post-Shuffle [array([1, 3], dtype=int32), array([4, 5], dtype=int32), array([6, 2], dtype=int32), array([7, 9], dtype=int32), array([8], dtype=int32)]
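
The buffer_size=2 used above only mixes elements locally, which is why the shuffled output stays close to the original order. A small sketch of the effect, assuming a toy range dataset and an arbitrary seed:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(9)

# Small buffer: the first output can only come from the first 2 elements
first_small = int(next(iter(ds.shuffle(buffer_size=2))))

# Buffer covering the whole dataset: any element can come out first
full = [int(v) for v in ds.shuffle(buffer_size=9, seed=0).as_numpy_iterator()]
print(first_small, full)
```

For a thorough shuffle, the buffer should be at least as large as the dataset, at the cost of holding that many elements in memory.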


## TF Transform : Using Python function
If you want to use a Python function inside the TensorFlow pipeline, you can do so in two ways
  - AutoGraph: Use the function directly; it is converted into TF graph code.
    - Pro: Easy to use
    - Con: Can convert some, but not all, Python code
  - tf.py_function: Wrap the Python function so that it runs as ordinary Python code called from within the graph, rather than being converted into TF code
    - Pro: Can run arbitrary Python code inside the TF pipeline
    - Con: Generally results in worse performance
      - Not parallelised
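
A minimal sketch of the AutoGraph route: the function below uses a plain Python `if` on a tensor value, and Dataset.map() converts it into graph control flow when it traces the function. `boost_large` is a hypothetical helper written for this illustration.

```python
import tensorflow as tf

def boost_large(x):
    # A data-dependent Python `if`; AutoGraph rewrites it into graph control flow
    if x > 2:
        result = x * 10
    else:
        result = x
    return result

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
boosted = [int(v) for v in ds.map(boost_large).as_numpy_iterator()]
print(boosted)
```

Both branches return the same dtype and shape, which graph control flow requires.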

In [0]:
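# Despite its name, upper1 adds 10 to the age; tf.py_function runs it as regular Python
def upper1(x: tf.Tensor):
  return x+10

# inp lists the tensors passed to func; Tout declares the dtype func returns
person_modf = person.map( lambda x : tf.py_function( func=upper1, inp=[x['age']], Tout=tf.int32) )
print(list(person_modf.as_numpy_iterator()) )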


## TF.py_function vs TF.distribute.Server
A tf.py_function op must run in the same address space (process) as the Python program that calls it. Hence, if you are using distributed TensorFlow, you must run a tf.distribute.Server in the same process as that program and pin the op to a device in that server.

## TF.Batch :
Group dataset elements into batches to be used in training.

**[Caution]**: Batch sizes can be uneven when the dataset size is not divisible by the batch size, e.g. splitting 3 elements into batches of 2 leaves a final batch of 1. The differing sizes may cause warnings or errors downstream.
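
Padding matters most when the elements themselves differ in length, since plain batch() cannot stack them. A small sketch, assuming toy elements of length 1, 2 and 3:

```python
import tensorflow as tf

# Elements of increasing length: [1], [2, 2], [3, 3, 3]
ragged = tf.data.Dataset.range(1, 4).map(lambda x: tf.fill([x], x))

# Each batch is zero-padded to the longest element within that batch
padded = ragged.padded_batch(2, padded_shapes=[None])

padded_batches = [b.tolist() for b in padded.as_numpy_iterator()]
print(padded_batches)
```

Note that the number of elements per batch can still be uneven; here the last batch holds a single element.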

In [0]:
print('Uneven batch size :', list( d_mem.batch(2).as_numpy_iterator() ) )
print('\nEven batch size by dropping remainder :',
      list( d_mem.batch(2, drop_remainder=True).as_numpy_iterator() ) )

# batch(2) yields [1, 2] and [3]; padded_batch then zero-pads [3] to [3, 0] and stacks both
print('\nEven batch size with padding :',
      list( d_mem.batch(2).padded_batch(2, padded_shapes=[None]).as_numpy_iterator() ) )

## Optimisation
There are a couple of things that can be done to speed up the input pipeline.
  - prefetch: When reading data off the disk, disk reads often add latency. Prefetching some batches of data overlaps those reads with training, which hides the latency and speeds up training.
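
In practice prefetch() is usually the last step of the pipeline, after mapping and batching. A minimal sketch, assuming a toy range dataset in place of data read from disk:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

pipeline = (
    tf.data.Dataset.range(6)
    .map(lambda x: x * x, num_parallel_calls=AUTOTUNE)  # per-element preprocessing
    .batch(2)
    .prefetch(AUTOTUNE)  # prepare upcoming batches while the current one is consumed
)

prefetched_batches = [b.tolist() for b in pipeline.as_numpy_iterator()]
print(prefetched_batches)
```

Prefetching changes only when the data is produced, not what is produced, so the batches are identical to those of the same pipeline without prefetch().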

In [0]:
print('Orig d_mem :', list( d_mem.as_numpy_iterator() ))
# prefetch(1) keeps one element ready ahead of the consumer; iterating still yields every element
print('Prefetch 1 element :', list( d_mem.prefetch(1).as_numpy_iterator() ))
# After batch(1), each prefetched item is a batch holding a single element
print('Prefetch 1 batch :', list( d_mem.batch(1).prefetch(1).as_numpy_iterator() ))
