## Pet adoptions with deep networks
This simple project aims to build a basic understanding of how to construct a deep neural network with TensorFlow by predicting animal adoptions. It is also a way for me to play around with TensorFlow's functionality, and to have a handy fallback example when I run into problems with larger projects.

Let's start by importing relevant libraries.

In [1]:
# import complete libraries
import numpy
import pandas
import tensorflow

# import sub-libraries and specific functions
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from tensorboard.plugins.hparams import api as hp

Download the dataset with the Keras get_file utility, and load it as a pandas DataFrame.

In [2]:
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tensorflow.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pandas.read_csv(csv_file)


Build the labels using the fact that AdoptionSpeed = 4 marks animals that were not adopted, and drop the columns that are not needed.

In [3]:
# Encode data labels
dataframe['target'] = numpy.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])
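
To make the encoding rule concrete, here is a minimal sketch in plain Python. The list of AdoptionSpeed values is made up for illustration and is not taken from the dataset; `encode_target` is a hypothetical helper that mirrors the `numpy.where` call above.

```python
# Hypothetical AdoptionSpeed values, for illustration only.
adoption_speed = [0, 1, 2, 3, 4, 4, 2]

def encode_target(speed):
    """Return 0 if the animal was not adopted (speed == 4), else 1."""
    return 0 if speed == 4 else 1

targets = [encode_target(s) for s in adoption_speed]
print(targets)  # [1, 1, 1, 1, 0, 0, 1]
```

Every speed from 0 to 3 collapses into the positive class, so the model only learns "adopted or not", not how quickly.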

Split the dataset into train, validation and test sets.

In [4]:
train, test = train_test_split(dataframe, test_size=0.2, random_state = 0)
train, val = train_test_split(train, test_size=0.2, random_state = 0)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')


7383 train examples
1846 validation examples
2308 test examples
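
The printed counts follow from applying a 20% hold-out twice: first on the full dataset, then on what remains. The sketch below reproduces them in plain Python, assuming that `train_test_split` rounds the held-out part up when `test_size` is a fraction. The total is simply the sum of the three printed counts, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out), rounding the held-out part up."""
    held_out = math.ceil(test_size * n_samples)
    return n_samples - held_out, held_out

total = 7383 + 1846 + 2308                     # sum of the printed counts
n_train_val, n_test = split_sizes(total, 0.2)  # first split: 20% for test
n_train, n_val = split_sizes(n_train_val, 0.2) # second split: 20% of the rest

print(n_train, n_val, n_test)  # 7383 1846 2308
print(round(n_train / total, 2), round(n_val / total, 2), round(n_test / total, 2))
```

In other words, the effective proportions are roughly 64% train, 16% validation and 20% test.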


Create tf.data datasets from the dataframes, using a utility from GCP.

In [5]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tensorflow.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
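
As a quick sanity check on the batching, the sketch below computes how many batches each dataset yields per epoch. It assumes that `batch` keeps the final, smaller batch (its default behaviour, as far as I know); both helpers are hypothetical and only do arithmetic on the split sizes printed earlier.

```python
import math

def batches_per_epoch(n_examples, batch_size=32):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(n_examples / batch_size)

def last_batch_size(n_examples, batch_size=32):
    """Size of the final batch (a full batch if the sizes divide evenly)."""
    return n_examples % batch_size or batch_size

for name, n in [('train', 7383), ('validation', 1846), ('test', 2308)]:
    print(f"{name:10s}: {batches_per_epoch(n)} batches, last one has {last_batch_size(n)} rows")
# train     : 231 batches, last one has 23 rows
# validation: 58 batches, last one has 22 rows
# test      : 73 batches, last one has 4 rows
```

So the step counter in a training log should end at 231 for the training set.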


And below, a few extra utilities that help with inspecting the data.

In [6]:
# extract one batch to play around
batch, label = next(iter(train_ds))

# Utility to visualize the dataset structure
for key, value in batch.items():
    print(f"{key:20s}: {value}")
print(f"{'label':20s}: {label}")

# utility to inspect the dataset composition
def demo(columns):
  # the parameter is called `columns` so it does not shadow the feature_column module
  feature_layer = layers.DenseFeatures(columns)
  print(feature_layer(batch).numpy())


Type                : [b'Dog' b'Dog' b'Dog' b'Cat' b'Dog' b'Cat' b'Dog' b'Dog' b'Cat' b'Cat'
 b'Cat' b'Cat' b'Cat' b'Dog' b'Dog' b'Dog' b'Dog' b'Cat' b'Dog' b'Dog'
 b'Dog' b'Dog' b'Cat' b'Cat' b'Cat' b'Dog' b'Dog' b'Dog' b'Dog' b'Cat'
 b'Dog' b'Dog']
Age                 : [22  2  2  4  9  3  3 36  2  4  1 15  4 48  1  1 14  1  3 12  1  2  2  2
  3  2  2  2  8  5 10  2]
Breed1              : [b'Golden Retriever' b'Mixed Breed' b'Mixed Breed' b'Domestic Short Hair'
 b'Dalmatian' b'Domestic Short Hair' b'Mixed Breed' b'Schnauzer'
 b'Domestic Short Hair' b'Domestic Short Hair' b'Calico' b'Siamese'
 b'Domestic Short Hair' b'Mixed Breed' b'Mixed Breed' b'Mixed Breed'
 b'Mixed Breed' b'Domestic Short Hair' b'Mixed Breed' b'Mixed Breed'
 b'Labrador Retriever' b'Labrador Retriever' b'Domestic Medium Hair'
 b'Domestic Medium Hair' b'Domestic Medium Hair' b'Mixed Breed'
 b'Mixed Breed' b'Mixed Breed' b'Mixed Breed' b'Domestic Short Hair'
 b'Mixed Breed' b'Mixed Breed']
Gender              : [b'Fe

Okay, now I can start to play around by building the feature columns. This means that I will combine different features together. First, let's create the groups of basic features that I want to include.

In [7]:
# purely numeric features
numeric_features = ['PhotoAmt', 
                    'Fee']

# bucketized features, as a {feature: bucket boundaries} dictionary
bucketized_features = {'Age': [1, 2, 3, 4, 5]}

# indicator features
indicator_features = ['Type', 
                      'Color1', 
                      'Color2', 
                      'Gender', 
                      'MaturitySize',
                      'FurLength', 
                      'Vaccinated', 
                      'Sterilized', 
                      'Health']

# embedded features
embedded_features = ['Breed1']
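
To see what the bucket boundaries for Age mean in practice, here is a minimal sketch in plain Python. It assumes the usual convention that each bucket includes its left boundary and excludes its right one, so five boundaries produce six buckets. `bucket_index` and `one_hot` are hypothetical helpers, and the ages are arbitrary examples.

```python
import bisect

boundaries = [1, 2, 3, 4, 5]  # same boundaries as bucketized_features['Age']

def bucket_index(value, boundaries):
    """Index of the bucket containing value; buckets include their left edge."""
    return bisect.bisect_right(boundaries, value)

def one_hot(index, size):
    """One-hot list of length size with a 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

for age in [0, 1, 3, 5, 22]:
    idx = bucket_index(age, boundaries)
    print(age, idx, one_hot(idx, len(boundaries) + 1))
# 0 0 [1, 0, 0, 0, 0, 0]
# 1 1 [0, 1, 0, 0, 0, 0]
# 3 3 [0, 0, 0, 1, 0, 0]
# 5 5 [0, 0, 0, 0, 0, 1]
# 22 5 [0, 0, 0, 0, 0, 1]
```

With these boundaries, every animal aged 5 or more lands in the same last bucket, so the model cannot tell a 5-month-old from a 48-month-old through this feature alone.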



And now, let's define the feature columns. Note that you can apply the demo utility to each new_feature separately, or to the whole feature_columns list at once.

In [8]:
feature_columns = []

# add numeric features
for feature_name in numeric_features:
    new_feature = feature_column.numeric_column(feature_name)
    feature_columns.append(new_feature)
    
# add bucketized features from numeric    
for feature_name in bucketized_features:
    new_feature = feature_column.bucketized_column(feature_column.numeric_column(feature_name),
                                                   bucketized_features[feature_name])
    feature_columns.append(new_feature)
       
# add indicator features
for feature_name in indicator_features:
    new_feature_as_categorical = feature_column.categorical_column_with_vocabulary_list(feature_name, dataframe[feature_name].unique())
    new_feature_as_indicator   = feature_column.indicator_column(new_feature_as_categorical) 
    feature_columns.append(new_feature_as_indicator)

# add embedded features
for feature_name in embedded_features:
    print(1)
    naive_embedding_size = int(numpy.round(len(dataframe[feature_name].unique())**(0.25)))
    new_feature_as_categorical = feature_column.categorical_column_with_vocabulary_list(feature_name, dataframe[feature_name].unique())
    new_feature_as_embedding   = feature_column.embedding_column(new_feature_as_categorical, naive_embedding_size)
    feature_columns.append(new_feature_as_embedding)
    
print('inspect everything')
demo(feature_columns)
# Warnings appear because conda on macOS only provides tensorflow 2.0.0 and not 2.3.14

1
inspect everything
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 1. 0. 0.]
 [0. 0. 1. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 1. 0. 0.]]
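
The embedding size above comes from a fourth-root rule of thumb: take the number of distinct categories, raise it to the power 0.25, and round. The sketch below shows how slowly that grows. The vocabulary sizes are hypothetical and are not the actual number of Breed1 values, which is not printed in this notebook.

```python
def naive_embedding_size(vocabulary_size):
    """Fourth root of the vocabulary size, rounded to the nearest integer."""
    return int(round(vocabulary_size ** 0.25))

# Hypothetical vocabulary sizes, for illustration only.
for n in [2, 16, 100, 300, 10000]:
    print(n, naive_embedding_size(n))
# 2 1
# 16 2
# 100 3
# 300 4
# 10000 10
```

This is only a starting point; the embedding dimension is a natural candidate for hyperparameter tuning.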


Now that the feature columns are ready, I can finally train a deep neural network model, albeit a very simple one. Let's also play with some hyperparameter tuning, so that I can choose the best model.

In [49]:
# number of nodes to try: the two discrete values 16 and 32
HP_NUM_NODES = hp.HParam('num_units', hp.Discrete([16, 32]))

hyperparameters = {'num_nodes': 128,
                   'dropout': 0.1,
                   'epochs': 5,
                   'optimizer' : 'ftrl',
                   'metrics' : 'accuracy'}

In [50]:
print(hyperparameters)

{'num_nodes': 128, 'dropout': 0.1, 'epochs': 5, 'optimizer': 'ftrl', 'metrics': 'accuracy'}
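
Note that the dictionary above fixes num_nodes at 128, so HP_NUM_NODES is not used yet. One way to connect the two is a small sweep over the candidate values, sketched below in plain Python. `sweep` and `fake_evaluate` are hypothetical: `fake_evaluate` is a stand-in that returns a made-up score, where the real notebook would train a model and return its validation accuracy. The candidate list could come from `HP_NUM_NODES.domain.values`.

```python
def sweep(candidates, base_hyperparameters, evaluate):
    """Score one trial per candidate; return the best candidate and all scores."""
    scores = {}
    for num_nodes in candidates:
        # copy the base settings and override only the number of nodes
        trial = dict(base_hyperparameters, num_nodes=num_nodes)
        scores[num_nodes] = evaluate(trial)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical stand-in for "train a model and return validation accuracy".
def fake_evaluate(trial):
    return 1.0 - abs(trial['num_nodes'] - 32) / 100

base = {'num_nodes': 128, 'dropout': 0.1, 'epochs': 5}
best, scores = sweep([16, 32], base, fake_evaluate)
print(best, scores)
```

Because each trial works on a copy, the base dictionary is left untouched between runs.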


In [None]:
def train_and_test_model(feature_columns, train_ds, val_ds, hyperparameters):    
    # build input layer from feature columns
    input_layer = tensorflow.keras.layers.DenseFeatures(feature_columns)

    # build sequential model
    model = tensorflow.keras.Sequential([
      input_layer,
      layers.Dense(hyperparameters['num_nodes'], activation='relu'),
      layers.Dense(hyperparameters['num_nodes'], activation='relu'),
      layers.Dropout(hyperparameters['dropout']),
      layers.Dense(1)
    ])

    # compile model
    model.compile(optimizer = hyperparameters['optimizer'],
                  loss = tensorflow.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics = [hyperparameters['metrics']])

    # train model
    model.fit(train_ds,
              validation_data = val_ds,
              epochs = hyperparameters['epochs'])
    
    # evaluate model
    loss, accuracy = model.evaluate(val_ds)
    
    return model, loss, accuracy

model, loss, accuracy = train_and_test_model(feature_columns, train_ds, val_ds, hyperparameters)
print(loss)
print(accuracy)

Epoch 1/5
    213/Unknown - 6s 30ms/step - loss: 0.6821 - accuracy: 0.2918
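
A note on reading these numbers: the last layer is `Dense(1)` with no activation and the loss uses `from_logits=True`, so the model outputs logits rather than probabilities. The sketch below, in plain Python with made-up logits, shows that a 0.5 cut-off on the probability is the same as a 0 cut-off on the logit. Some Keras versions may apply the 0.5 cut-off of the 'accuracy' metric to the raw outputs instead, which is worth checking when the reported accuracy looks suspiciously low; I have not verified that for this run.

```python
import math

def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def predict_label(logit):
    """Class 1 when the probability is at least 0.5, i.e. the logit is >= 0."""
    return 1 if sigmoid(logit) >= 0.5 else 0

# Hypothetical logits, for illustration only.
logits = [-2.0, -0.3, 0.0, 0.4, 3.0]
for z in logits:
    print(z, round(sigmoid(z), 3), predict_label(z), int(z >= 0.5))
```

The last column thresholds the raw logit at 0.5: for a logit of 0.4 it disagrees with the probability-based label, which is exactly the kind of mismatch that can distort an accuracy figure.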