# Pet adoptions with deep networks
This simple project aims to build a deep neural network with tensorflow by predicting adoptions of animals. It is also a way for me to play around with tensorflow functionalities, and to have a nice fallback example when I have problems with larger projects. The project follows https://www.tensorflow.org/tutorials/structured_data/feature_columns, with hyperparameter tuning inspired by https://www.tensorflow.org/tutorials/keras/keras_tuner
### Library import
Let's start by importing relevant libraries.

In [1]:
# import complete libraries
import numpy
import pandas
import tensorflow
import kerastuner
import os

# import sub-libraries and specific functions
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from tensorboard.plugins.hparams import api as hp

### Data import and create train, validate, test dataset
Download the dataset with the keras get_file utility, and import it as a pandas dataframe

In [2]:
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tensorflow.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pandas.read_csv(csv_file)


Construct labels upon using the information that AdoptionSpeed = 4 labels animals that were not adopted, and drop columns of no interest. 

In [3]:
# Encode data labels
dataframe['target'] = numpy.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])

Split dataset into train, validation and test datasets. I am 

In [4]:
train, test = train_test_split(dataframe, test_size=0.2, random_state = 0)
train, val = train_test_split(train, test_size=0.2, random_state = 0)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')


7383 train examples
1846 validation examples
2308 test examples


Create datasets from dataframe using utilities from the GCP platform

In [5]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tensorflow.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)


And below, just a few extra utilities that helps with the job of inspecting stuff

In [6]:
# extract one batch to play around
batch, label = iter(train_ds).next()

# Utility to visualize the dataset structure
for key, value in batch.items():
    print(f"{key:20s}: {value}")
print(f"{'label':20s}: {label}")

# utility to inspect the dataset composition
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(batch).numpy())


Type                : [b'Dog' b'Dog' b'Cat' b'Dog' b'Dog' b'Dog' b'Cat' b'Cat' b'Cat' b'Dog'
 b'Dog' b'Dog' b'Dog' b'Cat' b'Dog' b'Dog' b'Cat' b'Cat' b'Cat' b'Cat'
 b'Dog' b'Dog' b'Dog' b'Cat' b'Cat' b'Dog' b'Cat' b'Cat' b'Dog' b'Dog'
 b'Dog' b'Dog']
Age                 : [ 5  4  1 60 60  3  6  2  3 24  6  2  2  2  2  2  1  4  6  5  1  2 60  2
  4  2  2  1  2  4 12  5]
Breed1              : [b'Mixed Breed' b'Silky Terrier' b'Domestic Short Hair'
 b'Golden Retriever' b'Miniature Pinscher' b'Mixed Breed'
 b'Domestic Short Hair' b'Domestic Short Hair' b'Domestic Short Hair'
 b'Mixed Breed' b'Mixed Breed' b'Mixed Breed' b'Mixed Breed'
 b'Domestic Short Hair' b'Mixed Breed' b'Poodle' b'Domestic Medium Hair'
 b'Domestic Short Hair' b'Domestic Short Hair' b'Domestic Short Hair'
 b'Mixed Breed' b'Mixed Breed' b'Poodle' b'Domestic Short Hair'
 b'Domestic Medium Hair' b'Mixed Breed' b'Domestic Short Hair'
 b'Oriental Long Hair' b'Mixed Breed' b'Mixed Breed' b'Mixed Breed'
 b'Mixed Breed']
Gender

### Build feature columns
Okay, now I can start to play around by building the feature column. This means that I will combine diffeten features together. First, let's create the groups of basic features that I want to include.

In [7]:
# purely numeric features
numeric_features = ['PhotoAmt', 
                    'Fee']

# bucketized features, with buckets to use in a feature:bucket dictionary form
bucketized_features = {'Age': [1, 2, 3, 4, 5]}

# indicator features
indicator_features = ['Type', 
                      'Color1', 
                      'Color2', 
                      'Gender', 
                      'MaturitySize',
                      'FurLength', 
                      'Vaccinated', 
                      'Sterilized', 
                      'Health']

# embedded features
embedded_features = ['Breed1']



And now, let's define the feature columns. Note that you can apply the demo utility on each new_feature separately, or on the overall feature_columns array as a whole.

In [8]:
# function to build the feature columns. The original pandas dataframe is referenced as global variable
def build_feature_columns():
    feature_columns = []

    # add numeric features
    for feature_name in numeric_features:
        new_feature = feature_column.numeric_column(feature_name)
        feature_columns.append(new_feature)

    # add bucketized features from numeric    
    for feature_name in bucketized_features:
        new_feature = feature_column.bucketized_column(feature_column.numeric_column(feature_name),
                                                       bucketized_features[feature_name])
        feature_columns.append(new_feature)

    # add indicator feature
    for feature_name in indicator_features:
        new_feature_as_categorical = feature_column.categorical_column_with_vocabulary_list(feature_name, dataframe[feature_name].unique())
        new_feature_as_indicator   = feature_column.indicator_column(new_feature_as_categorical) 
        feature_columns.append(new_feature_as_indicator)

    # add embedded features
    for feature_name in embedded_features:
        naive_embedding_size = int(numpy.round(len(dataframe[feature_name].unique())**(0.25)))
        new_feature_as_categorical = feature_column.categorical_column_with_vocabulary_list(feature_name, dataframe[feature_name].unique())
        new_feature_as_embedding   = feature_column.embedding_column(new_feature_as_categorical, naive_embedding_size)
        feature_columns.append(new_feature_as_embedding)

    return feature_columns

    
print('inspect everything')
demo(build_feature_columns())
# Warnings comes out because conda on macOS can have only tensorflow 2.0.0 an not 2.3.14

inspect everything
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 1. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]]


### Neural model
Now that feature columns are ready, I can now finally train a deep neural network model, albeit a very simple one. Let's start by writing down a funcitons that initialized a model object that I can feed to the hyperparameter tuner. Note that in this function the hyperparameters object does not have specific details on the hyperparameter space. Those are defined within the function initialize_model itself.

In [9]:
def initialize_model(hyperparameters):        
    # specify hyperparameter ranges hyperparameter_object 
    node_units  = hyperparameters.Int('units', min_value = 10, max_value = 50, step = 10)
    dropout_val = hyperparameters.Float('dropout', min_value = 0.05, max_value = 0.25, step = 0.05)
    optimizer   = hyperparameters.Choice('optimizer', ['adam', 'ftrl'])
        
    # build input layer from feature columns
    input_layer = tensorflow.keras.layers.DenseFeatures(build_feature_columns())    
        
    # build sequential model
    model = tensorflow.keras.Sequential([
      input_layer,
      layers.Dense(units = node_units, activation='relu'),
      layers.Dropout(rate = dropout_val),
      layers.Dense(units = node_units, activation='relu'),
      layers.Dropout(rate = dropout_val),
      # sigmoid layer to perfomr classification tasks
      layers.Dense(1,  activation='sigmoid')
    ])

    # compile model
    model.compile(optimizer = optimizer,
                  loss = tensorflow.keras.losses.BinaryCrossentropy(from_logits = True),
                  metrics = ['accuracy',
                            tensorflow.keras.metrics.Precision(name='precision'),
                            tensorflow.keras.metrics.Recall(name='recall')])
    
    return model

Let's now use this initializer function to set up an hyperparameter tuner.

In [10]:
tuner = kerastuner.Hyperband(initialize_model,
                             objective = 'val_accuracy', 
                             max_epochs = 10,
                             factor = 3,
                             directory = 'logs',
                             project_name = 'hyperparameter_tuning')


INFO:tensorflow:Reloading Oracle from existing project logs/hyperparameter_tuning/oracle.json
INFO:tensorflow:Reloading Tuner from logs/hyperparameter_tuning/tuner0.json


We can now run the hyperparameter tuner to find the best hyperparameter configuration

In [11]:
tuner.search(train_ds, epochs = 20, validation_data = val_ds, verbose = 0)

INFO:tensorflow:Oracle triggered exit


Using the log from the tuner, we can now find the best parameter and train the corresponding model

In [12]:
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]
print('best model:')
print(f"""nodes:     {best_hps.get('units')}""")
print(f"""dropout:   {best_hps.get('dropout')}""")
print(f"""optimizer: {best_hps.get('optimizer')}""")

best model:
nodes:     30
dropout:   0.15000000000000002
optimizer: adam


So we can now train the best model

In [13]:
# Build the model with the optimal hyperparameters and train it on the data
model = tuner.hypermodel.build(best_hps)
model.fit(train_ds, epochs = 10, validation_data = val_ds, verbose = 0)

<tensorflow.python.keras.callbacks.History at 0x7ff26ccbcd50>

Using the trained model, we can estimate now the performances of the model. I know that I could play with other hyperparameters in this dataset, such as the number of layers to implement, or the size of the embedding for breeds, or joint variables. Such complex optimization is whereas the model evaluation in the test set is outside the scope of this example, so I will move toward validation on the test set instead.

In [14]:
performances_on_validation = model.evaluate(val_ds)
performances_on_test = model.evaluate(test_ds)



In [15]:
print(f"""Performances on validation:
Accuracy:  {performances_on_validation[1]}
Precision: {performances_on_validation[2]}
Recall:    {performances_on_validation[3]}

Performances on test:
Accuracy:  {performances_on_test[1]}
Precision: {performances_on_test[2]}
Recall:    {performances_on_test[3]}""")

Performances on validation:
Accuracy:  0.744312047958374
Precision: 0.7949735522270203
Recall:    0.8812316656112671

Performances on test:
Accuracy:  0.7370017170906067
Precision: 0.7824000120162964
Recall:    0.8805522322654724


The performances on the validation and the test set seems in agreement, so the predictive model seems to generalize quite well. This means that I can deploy it. This requires first to save the model for production

In [16]:
model.save(os.path.join(os.getcwd(), 'animal_adoption_model'))

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /Users/dabol99/Documents/DS projects/Animal_adoptions/animal_adoption_model/assets


which I will be able to deploy on GCP the day I want to pay for their services. Yay!