## DATA PREPROCESSING

These are the most common types of data that we are going to work with in TensorFlow (Deep Learning):

* Tables
* Images
* Text

### Preparing Tabular Data for Training

It is important to identify which columns are categorical. Neural networks and other machine learning models expect numeric inputs, so categorical values have to be encoded as numbers first (for example, with one-hot encoding).
Another aspect to take into account is the potential for interactions among multiple features.
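To make the one-hot idea concrete, here is a minimal pure-Python sketch. The helper name `one_hot` and the sample values are made up for illustration; this is not a TensorFlow API, just a picture of what the encoding produces.

```python
# Minimal sketch of one-hot encoding (illustrative helper, not a TensorFlow API).
def one_hot(values, vocabulary):
    """Map each value to a list with a single 1 at the index of its category."""
    index = {category: i for i, category in enumerate(vocabulary)}
    encoded = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index[value]] = 1
        encoded.append(row)
    return encoded

classes = ['Third', 'First', 'Third', 'Second']   # sample values, assumed
vocabulary = ['First', 'Second', 'Third']          # fixed category order, assumed
encoded_classes = one_hot(classes, vocabulary)
for value, row in zip(classes, encoded_classes):
    print(value, row)
```

Each category gets its own position, so the model never sees an artificial ordering such as First < Second < Third.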


🤬 Sick of trying to use Titanic as an example. <br>
🤔 There are lots of datasets out there, right? Bigger ones that could teach us more. <br>
But I am still almost a newbie, so I'll stick to what the books tell us to do.

In [1]:
# Loading libraries
import functools
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

In [2]:
# Load data from Google's public storage
TRAIN_DATA_URL = 'https://storage.googleapis.com/tf-datasets/titanic/train.csv'
TEST_DATA_URL = 'https://storage.googleapis.com/tf-datasets/titanic/eval.csv'

train_file_path = tf.keras.utils.get_file('train.csv',TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file('eval.csv',TEST_DATA_URL)


Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


In [3]:
print(train_file_path)

/root/.keras/datasets/train.csv


**📑 When datasets are bigger, it is a best practice in TensorFlow to convert your table into a streaming dataset.** <br>
This ensures that memory consumption is not affected by data size.
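As a rough picture of what "streaming" means here, the sketch below reads a CSV lazily and yields small batches, so only one batch is held in memory at a time. The function name `stream_batches` is hypothetical and the in-memory CSV is a stand-in for a large file; this only illustrates the idea and is not how TensorFlow implements it.

```python
import csv
import io

def stream_batches(file_obj, batch_size):
    """Yield lists of row dicts, reading the file lazily one row at a time."""
    reader = csv.DictReader(file_obj)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

# A tiny in-memory CSV stands in for a large file on disk.
sample_csv = io.StringIO(
    "survived,sex,age\n"
    "0,male,22.0\n"
    "1,female,38.0\n"
    "1,female,26.0\n"
    "1,female,35.0\n"
    "0,male,28.0\n"
)

batch_sizes = [len(batch) for batch in stream_batches(sample_csv, batch_size=2)]
print(batch_sizes)
```

Because the generator never materializes the whole table, the memory needed depends on `batch_size`, not on the number of rows in the file.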

In [4]:
train_df = pd.read_csv(train_file_path, header='infer')
test_df = pd.read_csv(test_file_path, header='infer')

In [5]:
train_df

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,0,male,22.0,1,0,7.2500,Third,unknown,Southampton,n
1,1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
2,1,female,26.0,0,0,7.9250,Third,unknown,Southampton,y
3,1,female,35.0,1,0,53.1000,First,C,Southampton,n
4,0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
...,...,...,...,...,...,...,...,...,...,...
622,0,male,28.0,0,0,10.5000,Second,unknown,Southampton,y
623,0,male,25.0,0,0,7.0500,Third,unknown,Southampton,y
624,1,female,19.0,0,0,30.0000,First,B,Southampton,y
625,0,female,28.0,1,2,23.4500,Third,unknown,Southampton,n


In [6]:
# With TensorFlow we can designate the target: the column to predict (the label)

LABEL_COLUMN = 'survived'
LABELS = [0,1]

train_ds = tf.data.experimental.make_csv_dataset(
    train_file_path,
    batch_size=3,
    label_name=LABEL_COLUMN,
    na_value='?',
    num_epochs=1)
train_ds = train_ds.ignore_errors()

test_ds = tf.data.experimental.make_csv_dataset(
    test_file_path,
    batch_size=3,
    label_name=LABEL_COLUMN,
    na_value='?',
    num_epochs=1)
test_ds = test_ds.ignore_errors()


In [7]:
# Inspect the data
for batch, label in train_ds.take(1):
  print(label)
  for key, value in batch.items():
    print('{}: {}'.format(key,value.numpy()))

tf.Tensor([0 1 1], shape=(3,), dtype=int32)
sex: [b'male' b'female' b'female']
age: [35. 21. 42.]
n_siblings_spouses: [0 0 0]
parch: [0 0 0]
fare: [ 10.5    10.5   227.525]
class: [b'Second' b'Second' b'First']
deck: [b'unknown' b'unknown' b'unknown']
embark_town: [b'Southampton' b'Southampton' b'Cherbourg']
alone: [b'y' b'y' b'y']


#### Major steps to prepare your training dataset for consumption by the training paradigm:

* Designate columns by feature type
* Decide whether or not to embed or cross columns
* Choose the columns of interest, possibly as an experiment
* Create a 'feature layer' for consumption by the training paradigm

##### Feature types
Four numeric columns: 'age', 'n_siblings_spouses', 'parch', 'fare' <br>
Five categorical columns: 'sex', 'class', 'deck', 'embark_town', 'alone'
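One possible way to get a first guess at this split is to look at the pandas dtypes. The sketch below uses a few rows copied from the table shown earlier and keeps only some of the columns; the variable names are my own. Treat the result as a starting point, since an integer-coded category would be classified as numeric by this check.

```python
import pandas as pd

# A few rows copied from the table shown earlier; only some columns are kept.
sample_df = pd.DataFrame({
    'survived': [0, 1, 1],
    'sex': ['male', 'female', 'female'],
    'age': [22.0, 38.0, 26.0],
    'fare': [7.25, 71.2833, 7.925],
    'class': ['Third', 'First', 'Third'],
})

features = sample_df.drop(columns=['survived'])  # the label is not a feature
numeric_cols = features.select_dtypes(include='number').columns.tolist()
categorical_cols = features.select_dtypes(exclude='number').columns.tolist()
print('numeric:', numeric_cols)
print('categorical:', categorical_cols)
```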

In [8]:
feature_columns = []

# numeric cols
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
  feature_columns.append(layers.Input(shape=(1,), name=header))

# other feature columns (categorical, etc.) could be added here

In [9]:
train_df.describe()

Unnamed: 0,survived,age,n_siblings_spouses,parch,fare
count,627.0,627.0,627.0,627.0,627.0
mean,0.38756,29.631308,0.545455,0.379585,34.385399
std,0.487582,12.511818,1.15109,0.792999,54.59773
min,0.0,0.75,0.0,0.0,0.0
25%,0.0,23.0,0.0,0.0,7.8958
50%,0.0,28.0,0.0,0.0,15.0458
75%,1.0,35.0,1.0,0.0,31.3875
max,1.0,80.0,8.0,5.0,512.3292


In [10]:
age = feature_column.numeric_column('age')
age_input = layers.Input(shape=(1,), name='age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
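The boundaries 23, 28 and 35 are the 25%, 50% and 75% quartiles of `age` from `describe()` above. To see what bucketizing does to individual values, here is a small standard-library sketch. It only mimics the bucket assignment and is not the TensorFlow implementation; as far as I know, the Keras equivalent that the notice points to is `tf.keras.layers.Discretization`.

```python
from bisect import bisect_right

boundaries = [23, 28, 35]               # same boundaries as in the cell above
ages = [22.0, 38.0, 26.0, 35.0, 28.0]   # ages from the first rows of the table

# bisect_right counts how many boundaries are <= age, which is the bucket index:
# 0: age < 23, 1: 23 <= age < 28, 2: 28 <= age < 35, 3: age >= 35
age_bucket_ids = [bisect_right(boundaries, age) for age in ages]
print(age_bucket_ids)
```

Three boundaries produce four buckets, so the bucketized column ends up as a four-way categorical feature.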


In [11]:
h = {}

for col in train_df:
  if col in ['sex','class','deck', 'embark_town', 'alone']:
    print(col, ':', train_df[col].unique())
    h[col] = train_df[col].unique()

sex : ['male' 'female']
class : ['Third' 'First' 'Second']
deck : ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town : ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone : ['n' 'y']


In [12]:
# The first argument must be the name of the column in the dataset, not a generic 'Type' key
sex_type = feature_column.categorical_column_with_vocabulary_list('sex', h.get('sex').tolist())
sex_type_one_hot = feature_column.indicator_column(sex_type)

class_type = feature_column.categorical_column_with_vocabulary_list('class', h.get('class').tolist())
class_type_one_hot = feature_column.indicator_column(class_type)

deck_type = feature_column.categorical_column_with_vocabulary_list('deck', h.get('deck').tolist())
deck_type_one_hot = feature_column.indicator_column(deck_type)

embark_town_type = feature_column.categorical_column_with_vocabulary_list('embark_town', h.get('embark_town').tolist())
embark_town_type_one_hot = feature_column.indicator_column(embark_town_type)

alone_type = feature_column.categorical_column_with_vocabulary_list('alone', h.get('alone').tolist())
alone_type_one_hot = feature_column.indicator_column(alone_type)


Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.


In [13]:
deck = feature_column.categorical_column_with_vocabulary_list('deck',train_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
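An embedding column is essentially a lookup table with one small dense vector per category. The NumPy sketch below shows the lookup for the `deck` values with `dimension=3`. The random numbers are placeholders for weights that would normally be learned during training, and the variable names are my own.

```python
import numpy as np

vocabulary = ['unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E']  # deck values listed above
dimension = 3

# One row of `dimension` floats per category. Random values stand in for
# weights that would normally be learned during training.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocabulary), dimension))

deck_to_index = {deck: i for i, deck in enumerate(vocabulary)}
batch = ['unknown', 'C', 'unknown']
indices = [deck_to_index[deck] for deck in batch]
embedded = embedding_table[indices]   # shape: (batch size, dimension)
print(embedded.shape)
```

Compared with one-hot encoding, the eight deck categories are represented by 3 numbers each instead of 8, and identical categories always map to the same vector.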


In [14]:
class_hashed = feature_column.categorical_column_with_hash_bucket('class', hash_bucket_size=4)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
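A hash bucket column needs no vocabulary: each string is hashed and the hash is reduced modulo the number of buckets. Here is a minimal sketch of that idea. It uses MD5 as a stand-in hash, so the bucket ids will not match the ones TensorFlow computes, and `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Assign a string to one of `num_buckets` buckets using a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

class_values = ['First', 'Second', 'Third']
class_bucket_ids = {value: hash_bucket(value, num_buckets=4) for value in class_values}
print(class_bucket_ids)
```

The trade-off is that two different categories can land in the same bucket, which is why the bucket count is usually chosen larger than the number of expected categories.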


##### Encoding Column Interactions as Possible Features

Sometimes intuition, experience and domain knowledge allow us to create new features by finding interactions between existing features.

In [15]:
cross_type_feature = feature_column.crossed_column(['sex','class'], hash_bucket_size=5)

Instructions for updating:
Use `tf.keras.layers.experimental.preprocessing.HashedCrossing` instead for feature crossing when preprocessing data to train a Keras model.
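Conceptually, a crossed column joins the values of several features into a single combined value and then hashes it into a fixed number of buckets. The sketch below does this for `sex` and `class` with 5 buckets. The joining string, the SHA-256 hash and the helper name `crossed_bucket` are my own choices for illustration, so the bucket ids will differ from TensorFlow's.

```python
import hashlib
from itertools import product

def crossed_bucket(values, num_buckets):
    """Join the feature values into one string and hash it into a bucket."""
    joined = '_X_'.join(values)
    digest = hashlib.sha256(joined.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

sexes = ['male', 'female']
classes = ['Third', 'First', 'Second']

# Every (sex, class) combination becomes one crossed feature value.
cross_buckets = {
    (sex, cls): crossed_bucket((sex, cls), num_buckets=5)
    for sex, cls in product(sexes, classes)
}
for combination, bucket in cross_buckets.items():
    print(combination, '->', bucket)
```

There are 2 x 3 = 6 combinations but only 5 buckets, so at least two combinations must share a bucket. A larger `hash_bucket_size` would reduce such collisions.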


In [16]:
# Start appending modified data
feature_columns = []

# Append numeric columns
for header in ['age', 'n_siblings_spouses', 'parch', 'fare']:
  feature_columns.append(feature_column.numeric_column(header))

# Append bucketized columns
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(age, boundaries=[23, 28, 35])
feature_columns.append(age_buckets)



In [17]:
# indicator_columns
indicator_column_names = ['sex', 'class', 'deck', 'embark_town', 'alone']
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, train_df[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)

In [18]:
# Append embedding columns
deck = feature_column.categorical_column_with_vocabulary_list(
    'deck', train_df.deck.unique())
deck_embedding = feature_column.embedding_column(deck, dimension=3)
feature_columns.append(deck_embedding)

In [19]:
# Append crossed columns
feature_columns.append(feature_column.indicator_column(cross_type_feature))

In [20]:
# Now create a feature layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

This layer will serve as the first (input) layer of the model we are going to build and train. This is how you provide all the feature engineering frameworks to the model's training process.

##### Creating a Cross-Validation Dataset

We need to set aside a small portion of our data for validation purposes, so that we end up with three splits:<br>
(training, validation, testing)
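As a sketch of the idea, the snippet below shuffles a list of row ids and cuts it into three disjoint parts. The helper `three_way_split` and the 70/20/10 proportions are assumptions for illustration only; the notebook itself relies on `train_test_split` from scikit-learn.

```python
import random

def three_way_split(items, val_fraction, test_fraction, seed=0):
    """Shuffle the items and split them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

row_ids = range(100)  # stand-in for the row indices of a table
train_ids, val_ids, test_ids = three_way_split(row_ids, val_fraction=0.2, test_fraction=0.1)
print(len(train_ids), len(val_ids), len(test_ids))
```

Shuffling before cutting matters: if the rows were sorted (for example by class), an unshuffled split would give the three sets different distributions.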

In [21]:
val_df, test_df = train_test_split(test_df, test_size = 0.4)

In [22]:
batch_size = 33
labels = train_df.pop('survived')
working_ds = tf.data.Dataset.from_tensor_slices((dict(train_df), labels))
working_ds = working_ds.shuffle(buffer_size=len(train_df))
train_ds = working_ds.batch(batch_size)

In [23]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def pandas_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('survived')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [24]:
val_ds = pandas_to_dataset(val_df, shuffle=False, batch_size=batch_size)
test_ds = pandas_to_dataset(test_df, shuffle=False, batch_size=batch_size)

In [26]:
# Build the model: the feature layer comes first, followed by dense layers
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dropout(.1),
    layers.Dense(1)])  # single output unit that produces a raw logit

# from_logits=True because the last layer has no sigmoid activation
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7e660d9f8d30>

### Preparing Image Data for Processing

For images, you need to reshape and resample all the images to the same pixel dimensions; this is known as *standardization*.

We also need to ensure that all pixel values fall within the same color range, that is, within the finite range of RGB values for each pixel.

👁 *(ResNet, for example, requires each input image to be 224 x 224 x 3 and to be presented as a NumPy multidimensional array)*

Finally, the important thing is to build a good preprocessing routine to ensure the resampling is done properly.
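To make the two steps concrete (same size, same value range), here is a minimal NumPy sketch. It uses a simple nearest-neighbor resize and assumes 8-bit RGB input scaled to [0, 1]; the function name `standardize_image` and the random test image are made up for illustration. A real pipeline would more likely rely on a library resize routine with proper interpolation.

```python
import numpy as np

def standardize_image(image, target_height=224, target_width=224):
    """Nearest-neighbor resize to a fixed size, then scale pixels to [0, 1]."""
    height, width = image.shape[:2]
    # For each output row/column, pick the source row/column it maps to.
    row_idx = np.arange(target_height) * height // target_height
    col_idx = np.arange(target_width) * width // target_width
    resized = image[row_idx][:, col_idx]
    # Assumes 8-bit pixel values in the range 0..255.
    return resized.astype(np.float32) / 255.0

# A random 300 x 400 RGB image stands in for a real photo.
rng = np.random.default_rng(seed=0)
fake_image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)

standardized = standardize_image(fake_image)
print(standardized.shape, standardized.dtype)
```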

In [27]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pathlib

In [29]:
data_dir = tf.keras.utils.get_file('flower_photos',
                                   'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                   untar=True)  # untar=True because the files come in a compressed tar archive



Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
