##### Copyright 2019 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Classify structured data with feature columns


Note: If you are starting a new project to classify structured data, we recommend you use [preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers).

This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We will use [Keras](https://www.tensorflow.org/guide/keras) to define the model, and [feature columns](https://www.tensorflow.org/guide/feature_columns) as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:

* Load a CSV file using [Pandas](https://pandas.pydata.org/).
* Build an input pipeline to batch and shuffle the rows using [tf.data](https://www.tensorflow.org/guide/datasets).
* Map from columns in the CSV to features used to train the model using feature columns.
* Build, train, and evaluate a model using Keras.

## The Dataset

We will use a simplified version of the PetFinder [dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). There are several thousand rows in the CSV. Each row describes a pet, and each column describes an attribute. We will use this information to predict the speed at which the pet will be adopted.

Following is a description of this dataset. Notice there are both numeric and categorical columns. There is a free text column which we will not use in this tutorial.

Column | Description| Feature Type | Data Type
------------|--------------------|----------------------|-----------------
Type | Type of animal (Dog, Cat) | Categorical | string
Age |  Age of the pet | Numerical | integer
Breed1 | Primary breed of the pet | Categorical | string
Gender | Gender of the pet | Categorical | string
Color1 | Color 1 of pet | Categorical | string
Color2 | Color 2 of pet | Categorical | string
MaturitySize | Size at maturity | Categorical | string
FurLength | Fur length | Categorical | string
Vaccinated | Pet has been vaccinated | Categorical | string
Sterilized | Pet has been sterilized | Categorical | string
Health | Health Condition | Categorical | string
Fee | Adoption Fee | Numerical | integer
Description | Profile write-up for this pet | Text | string
PhotoAmt | Total uploaded photos for this pet | Numerical | integer
AdoptionSpeed | Speed of adoption | Classification | integer

## Import TensorFlow and other libraries

In [2]:
!pip install scikit-learn



In [3]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

## Use Pandas to create a dataframe

[Pandas](https://pandas.pydata.org/) is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset from a URL, and load it into a dataframe.

About dataset:

https://www.kaggle.com/c/petfinder-adoption-prediction

In [4]:
import pathlib

dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip


AdoptionSpeed

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

0 - Pet was adopted on the same day as it was listed.

1 - Pet was adopted between 1 and 7 days (1st week) after being listed.

2 - Pet was adopted between 8 and 30 days (1st month) after being listed.

3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.

4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

In [5]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,2
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,0
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,3
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,"Good guard dog, very alert, active, obedience ...",8,2
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,This handsome yet cute boy is up for adoption....,3,2


In [6]:
dataframe.shape

(11537, 15)

In [7]:
dataframe.describe()

Unnamed: 0,Age,Fee,PhotoAmt,AdoptionSpeed
count,11537.0,11537.0,11537.0,11537.0
mean,11.743434,23.957268,3.610211,2.486522
std,19.324221,80.024226,3.145872,1.173275
min,0.0,0.0,0.0,0.0
25%,2.0,0.0,2.0,2.0
50%,4.0,0.0,3.0,2.0
75%,12.0,0.0,5.0,4.0
max,255.0,2000.0,30.0,4.0


## Create target variable

The task in the original dataset is to predict the speed at which a pet will be adopted (e.g., in the first week, the first month, the first three months, and so on). Let's simplify this for our tutorial. Here, we will transform this into a binary classification problem, and simply predict whether the pet was adopted, or not.

After modifying the label column, 0 will indicate the pet was not adopted, and 1 will indicate it was.

In [8]:
# In the original dataset "4" indicates the pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed']==4, 0, 1)

# Drop un-used columns.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])

In [9]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,1,1
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,2,1
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,7,1
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,8,1
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,3,1


## Split the dataframe into train, validation, and test

The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets.

In [10]:
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

7383 train examples
1846 validation examples
2308 test examples
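The three sizes printed above follow from applying a 20% hold-out twice. As a rough check, and assuming the held-out portion is rounded up (which matches the counts shown), the arithmetic can be sketched in plain Python. The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up (an assumption)."""
    n_held_out = math.ceil(n_rows * test_fraction)
    return n_rows - n_held_out, n_held_out

total_rows = 11537  # from dataframe.shape above

# First split: hold out 20% of all rows as the test set.
n_train_val, n_test = split_sizes(total_rows, 0.2)
# Second split: hold out 20% of the remaining rows as the validation set.
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

The result is roughly a 64% / 16% / 20% split of the original rows.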


## Create an input pipeline using tf.data

Next, we will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets). This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.

In [11]:
# A utility function to create a tf.data dataset from a Pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()  # avoid mutating the caller's dataframe
  labels = dataframe.pop('target')  # separate the label column from the features
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    # A buffer as large as the dataset shuffles all rows
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [12]:
dict(train)

{'Age': 10723     9
 6297     12
 6675     29
 1258      4
 10199     8
          ..
 2058     10
 1884      1
 5608      4
 7046      1
 6258     12
 Name: Age, Length: 7383, dtype: int64, 'Breed1': 10723    Domestic Short Hair
 6297             Mixed Breed
 6675     Domestic Short Hair
 1258     Domestic Short Hair
 10199                Burmese
                 ...         
 2058     Domestic Short Hair
 1884             Mixed Breed
 5608             Mixed Breed
 7046             Mixed Breed
 6258     Domestic Short Hair
 Name: Breed1, Length: 7383, dtype: object, 'Color1': 10723    Cream
 6297     Black
 6675     Brown
 1258     Brown
 10199    Black
          ...  
 2058     Black
 1884     Brown
 5608     Brown
 7046     Brown
 6258     Black
 Name: Color1, Length: 7383, dtype: object, 'Color2': 10723    No Color
 6297        Brown
 6675        White
 1258        Cream
 10199      Yellow
            ...   
 2058        Brown
 1884        Cream
 5608     No Color
 7046        White

In [13]:
batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

## Understand the input pipeline

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

In [14]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['Age'])
  print('A batch of targets:', label_batch)

Every feature: ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt']
A batch of ages: tf.Tensor([ 0  3  1  2 21], shape=(5,), dtype=int64)
A batch of targets: tf.Tensor([1 1 1 1 0], shape=(5,), dtype=int64)


We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
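To make that format concrete, here is a minimal sketch in plain Python of the same idea: shuffle the row order, then emit batches in which every column name maps to that batch's values. It is only an illustration of the shape of the data, not how tf.data is implemented. The function name and the toy rows are hypothetical and are not taken from the PetFinder CSV.

```python
import random

def batches_from_columns(columns, labels, batch_size, shuffle=True, seed=0):
    """Yield (feature_dict, label_list) batches from column-oriented data."""
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        features = {name: [values[i] for i in rows]
                    for name, values in columns.items()}
        yield features, [labels[i] for i in rows]

# Hypothetical toy data for illustration only.
toy_columns = {
    'Age': [3, 1, 1, 4, 1, 12, 7],
    'Type': ['Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat', 'Dog'],
}
toy_labels = [1, 1, 1, 1, 1, 0, 0]

toy_batches = list(batches_from_columns(toy_columns, toy_labels, batch_size=5))
for features, batch_labels in toy_batches:
    print('Every feature:', list(features.keys()))
    print('A batch of ages:', features['Age'])
    print('A batch of targets:', batch_labels)
```

With seven rows and a batch size of 5, the last batch holds only the two remaining rows, much as the final batch of a `tf.data` pipeline can be smaller than `batch_size`.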

## Demonstrate several types of feature columns
TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe.

In [17]:
next(iter(train_ds)) # one batch, returned as a (features, labels) tuple

({'Age': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([24,  3, 12, 22,  2])>,
  'Breed1': <tf.Tensor: shape=(5,), dtype=string, numpy=
  array([b'Mixed Breed', b'Mixed Breed', b'Mixed Breed',
         b'Golden Retriever', b'Mixed Breed'], dtype=object)>,
  'Color1': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Cream', b'Brown', b'Brown', b'Golden', b'Black'], dtype=object)>,
  'Color2': <tf.Tensor: shape=(5,), dtype=string, numpy=
  array([b'No Color', b'No Color', b'White', b'No Color', b'Brown'],
        dtype=object)>,
  'Fee': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 0, 0, 0])>,
  'FurLength': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Medium', b'Short', b'Short', b'Long', b'Medium'], dtype=object)>,
  'Gender': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Male', b'Male', b'Female', b'Male', b'Female'], dtype=object)>,
  'Health': <tf.Tensor: shape=(5,), dtype=string, numpy=
  array([b'Healthy', b'Healthy', b'Healthy', b'Healthy', b'Heal

In [15]:
# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]

In [16]:
example_batch # contains only the features (x); the labels (y) are not included

{'Age': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([36, 84,  5,  1, 24])>,
 'Breed1': <tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'German Shepherd Dog', b'Mixed Breed', b'Domestic Medium Hair',
        b'Mixed Breed', b'Mixed Breed'], dtype=object)>,
 'Color1': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Black', b'Brown', b'Black', b'Black', b'Yellow'], dtype=object)>,
 'Color2': <tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'No Color', b'No Color', b'No Color', b'Brown', b'No Color'],
       dtype=object)>,
 'Fee': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([  0,   0, 150,  50,   0])>,
 'FurLength': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Medium', b'Short', b'Medium', b'Short', b'Long'], dtype=object)>,
 'Gender': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Female', b'Female', b'Male', b'Female', b'Female'], dtype=object)>,
 'Health': <tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'Healthy', b'Healthy', b'Healthy', b'

In [18]:
# A utility function that builds a DenseFeatures layer from a feature column
# and prints the transformed example batch
def demo(column):
  feature_layer = layers.DenseFeatures(column)
  print(feature_layer(example_batch).numpy())

### Numeric columns
The output of a feature column becomes the input to the model (using the demo function defined above, we will be able to see exactly how each column from the dataframe is transformed). A [numeric column](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) is the simplest type of column. It is used to represent real valued features. When using this column, your model will receive the column value from the dataframe unchanged.

In [21]:
example_batch['PhotoAmt']

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([ 3, 16, 20,  9,  2])>

In [19]:
photo_count = feature_column.numeric_column('PhotoAmt') # numeric column: values pass through as float32
photo_count

NumericColumn(key='PhotoAmt', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

In [20]:
demo(photo_count) # show the example batch after the feature column is applied

[[ 3.]
 [16.]
 [20.]
 [ 9.]
 [ 2.]]


In the PetFinder dataset, most columns from the dataframe are categorical.

### Bucketized columns
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column). Notice the one-hot values below describe which age range each row matches.

In [22]:
example_batch['Age']

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([36, 84,  5,  1, 24])>

In [23]:
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 3, 5])
age_buckets

BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1, 3, 5))

In [24]:
demo(age_buckets)

[[0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]


In [25]:
age_buckets = feature_column.bucketized_column(age, boundaries=[10, 50, 100])
demo(age_buckets)

[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]
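The two outputs above are consistent with a simple rule: `n` boundaries define `n + 1` buckets, and a value falls into the bucket whose index equals the number of boundaries less than or equal to it. The following is a minimal sketch of that rule in plain Python, not the TensorFlow implementation; the function name is hypothetical.

```python
from bisect import bisect_right

def bucketize_one_hot(values, boundaries):
    """One-hot encode each value by the bucket its boundaries place it in."""
    n_buckets = len(boundaries) + 1
    rows = []
    for value in values:
        # Count how many boundaries are <= value; that count is the bucket index.
        bucket = bisect_right(boundaries, value)
        row = [0.0] * n_buckets
        row[bucket] = 1.0
        rows.append(row)
    return rows

ages = [36, 84, 5, 1, 24]  # the example batch of ages shown above

fine_buckets = bucketize_one_hot(ages, [1, 3, 5])
coarse_buckets = bucketize_one_hot(ages, [10, 50, 100])

for row in fine_buckets:
    print(row)
for row in coarse_buckets:
    print(row)
```

Note that an age of exactly 5 lands in the last bucket for boundaries `[1, 3, 5]`, which matches the third row of the first output above.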


### Categorical columns
In this dataset, Type is represented as a string (e.g. 'Dog', or 'Cat'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). The vocabulary can be passed as a list using [categorical_column_with_vocabulary_list](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list), or loaded from a file using [categorical_column_with_vocabulary_file](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file).

In [36]:
example_batch['Type']

<tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Dog', b'Dog', b'Cat', b'Dog', b'Dog'], dtype=object)>

In [37]:
animal_type = feature_column.categorical_column_with_vocabulary_list(
      'Type', ['Cat', 'Dog'])

animal_type

VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [38]:
animal_type_one_hot = feature_column.indicator_column(animal_type)
demo(animal_type_one_hot)

[[0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]
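Conceptually, the vocabulary column maps each string to its position in the vocabulary list, and the indicator column turns that position into a one-hot row. A minimal sketch in plain Python follows. The function name is hypothetical, and the handling of unseen strings (an all-zero row, mirroring `default_value=-1` with no out-of-vocabulary buckets) is an assumption about the behavior rather than something shown above.

```python
def one_hot_from_vocabulary(values, vocabulary):
    """One-hot encode strings by their position in a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0.0] * len(vocabulary)
        position = index.get(value, -1)  # -1 marks an out-of-vocabulary string
        if position >= 0:
            row[position] = 1.0
        rows.append(row)
    return rows

types = ['Dog', 'Dog', 'Cat', 'Dog', 'Dog']  # the example batch shown above
type_one_hot = one_hot_from_vocabulary(types, ['Cat', 'Dog'])
for row in type_one_hot:
    print(row)

# A string missing from the vocabulary yields a row of zeros in this sketch.
unknown_one_hot = one_hot_from_vocabulary(['Rabbit'], ['Cat', 'Dog'])
print(unknown_one_hot)
```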


### Embedding columns
Suppose instead of having just a few possible strings, we have thousands (or more) values per category. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an [embedding column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column) represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.

Key point: using an embedding column is best when a categorical column has many possible values. We are using one here for demonstration purposes, so you have a complete example you can modify for a different dataset in the future.

In [29]:
example_batch['Breed1']

<tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'German Shepherd Dog', b'Mixed Breed', b'Domestic Medium Hair',
       b'Mixed Breed', b'Mixed Breed'], dtype=object)>

In [28]:
# Notice the input to the embedding column is the categorical column
# we previously created
breed1 = feature_column.categorical_column_with_vocabulary_list(
      'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
demo(breed1_embedding)

[[ 0.31189206 -0.33967862 -0.6466614  -0.33414206  0.45192662 -0.21522976
  -0.27996817  0.6959859 ]
 [ 0.32074252  0.10269089  0.3971076   0.11309242 -0.40013707 -0.11581437
  -0.06517407  0.24499024]
 [-0.3786064   0.7015086  -0.34524834 -0.31299487  0.09972376 -0.38989344
  -0.22461942 -0.06823309]
 [ 0.32074252  0.10269089  0.3971076   0.11309242 -0.40013707 -0.11581437
  -0.06517407  0.24499024]
 [ 0.32074252  0.10269089  0.3971076   0.11309242 -0.40013707 -0.11581437
  -0.06517407  0.24499024]]
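In the output above, the three 'Mixed Breed' rows are identical. That is what a lookup table produces: each category owns one row of numbers, and encoding a value means fetching its row. The sketch below illustrates only the lookup, using plain Python and hypothetical function names. In a real embedding column the table is a trainable weight that is updated during training, whereas here it is filled with fixed random numbers.

```python
import random

def make_embedding_table(vocabulary, dimension, seed=0):
    """Build a table with one random dense vector per vocabulary term."""
    rng = random.Random(seed)
    return {term: [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
            for term in vocabulary}

def embed(values, table):
    """Look up the dense vector for each value."""
    return [table[value] for value in values]

breeds = ['German Shepherd Dog', 'Mixed Breed', 'Domestic Medium Hair',
          'Mixed Breed', 'Mixed Breed']  # the example batch shown above

embedding_table = make_embedding_table(sorted(set(breeds)), dimension=8)
breed_vectors = embed(breeds, embedding_table)

for vector in breed_vectors:
    print([round(x, 3) for x in vector])
```

The output width is always `dimension`, no matter how many distinct categories the vocabulary holds, which is why embeddings scale better than one-hot vectors for large vocabularies.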


### Hashed feature columns

Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash_buckets significantly smaller than the number of actual categories to save space.

Key point: An important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.

In [30]:
breed1_hashed = feature_column.categorical_column_with_hash_bucket(
      'Breed1', hash_bucket_size=10)
demo(feature_column.indicator_column(breed1_hashed))

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
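The idea can be sketched in a few lines of plain Python: hash the string to a large integer, then take the remainder modulo the number of buckets. This sketch uses MD5 from the standard library purely for illustration, so its bucket ids will not match the TensorFlow output above. The function name and the choice of three buckets are assumptions made to force a collision.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size

breeds = ['German Shepherd Dog', 'Mixed Breed', 'Domestic Medium Hair',
          'Tabby', 'Persian']

# Five distinct strings but only three buckets: at least two must collide.
bucket_ids = [hash_bucket(breed, hash_bucket_size=3) for breed in breeds]
print(dict(zip(breeds, bucket_ids)))
```

No vocabulary is needed, and the same string always lands in the same bucket, but distinct strings that share a bucket become indistinguishable to the model.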


### Crossed feature columns
Combining features into a single feature, better known as [feature crosses](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of Age and Type. Note that `crossed_column` does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a `hashed_column`, so you can choose how large the table is.

In [31]:
crossed_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
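In the output above, the first and last rows are identical: both pets are dogs whose ages (36 and 24) fall in the same age bucket, so the crossed feature treats them as the same combination. A minimal sketch of that behavior in plain Python follows. The way the pair is joined into a key and the use of MD5 are assumptions for illustration, so the bucket ids differ from the TensorFlow output.

```python
import hashlib
from bisect import bisect_right

def crossed_bucket(age, animal_type, boundaries, hash_bucket_size):
    """Hash the (age bucket, type) pair into one of hash_bucket_size slots."""
    age_bucket = bisect_right(boundaries, age)
    key = f'{age_bucket}_X_{animal_type}'.encode('utf-8')
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size

ages = [36, 84, 5, 1, 24]                    # the example batch shown above
types = ['Dog', 'Dog', 'Cat', 'Dog', 'Dog']
boundaries = [10, 50, 100]

crossed_ids = [crossed_bucket(age, animal_type, boundaries, hash_bucket_size=10)
               for age, animal_type in zip(ages, types)]
print(crossed_ids)
```

Because the combination is hashed, the table size is fixed by `hash_bucket_size` rather than by the number of possible (age bucket, type) pairs, at the cost of possible collisions.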


## Choose which columns to use
We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (i.e. the mechanics) needed to work with feature columns. The columns used to train the model below were selected arbitrarily.

Key point: If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented.

In [61]:
feature_columns = []

# numeric cols
for header in ['PhotoAmt', 'Fee', 'Age']:
  feature_columns.append(feature_column.numeric_column(header))

In [62]:
# bucketized cols
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 2, 3, 4, 5])
feature_columns.append(age_buckets)

In [63]:
# indicator_columns
indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']
for col_name in indicator_column_names:
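  # Map each string to an integer id, using the column's unique values as the vocabulary
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, dataframe[col_name].unique())
  # Wrap the categorical column in an indicator column to one-hot encode it
  indicator_column = feature_column.indicator_column(categorical_column)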
  feature_columns.append(indicator_column)

In [None]:
# embedding columns
breed1 = feature_column.categorical_column_with_vocabulary_list(
      'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
feature_columns.append(breed1_embedding)

In [None]:
# crossed columns
age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)
feature_columns.append(feature_column.indicator_column(age_type_feature))

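### Explanation

Indicator columns: the following cell prints, for each categorical column, its unique values, the vocabulary column built from them, the indicator column, and the one-hot output for the example batch.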

In [42]:
# indicator_columns
indicator_column_names = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                          'FurLength', 'Vaccinated', 'Sterilized', 'Health']
for col_name in indicator_column_names:
  print('-'*5)
  print(col_name)
  print(dataframe[col_name].unique())
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, dataframe[col_name].unique())  # maps strings to integer ids
  print(categorical_column)
  indicator_column = feature_column.indicator_column(categorical_column)  # one-hot encodes the ids
  print(indicator_column)
  print(demo(indicator_column))  # demo() prints the batch and returns None, hence the "None" in the output
  #feature_columns.append(indicator_column)

-----
Type
['Cat' 'Dog']
VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
[[0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]
None
-----
Color1
['Black' 'Brown' 'Cream' 'Gray' 'Golden' 'White' 'Yellow']
VocabularyListCategoricalColumn(key='Color1', vocabulary_list=('Black', 'Brown', 'Cream', 'Gray', 'Golden', 'White', 'Yellow'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Color1', vocabulary_list=('Black', 'Brown', 'Cream', 'Gray', 'Golden', 'White', 'Yellow'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
[[1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]]
None
-----
Color2
['White' 'Brown' 

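Embedding columns: the following cells show the unique `Breed1` values, the embedding column built from them, and its output for the example batch.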

In [43]:
dataframe.Breed1.unique()

array(['Tabby', 'Domestic Medium Hair', 'Mixed Breed',
       'Domestic Short Hair', 'Domestic Long Hair', 'Terrier', 'Persian',
       'Rottweiler', 'Jack Russell Terrier', 'Shih Tzu',
       'Labrador Retriever', 'Silky Terrier', 'Bombay', 'Husky',
       'Schnauzer', 'Golden Retriever', 'Siberian Husky', 'Collie',
       'German Shepherd Dog', 'Siamese', 'Calico',
       'American Staffordshire Terrier', 'Turkish Van',
       'Doberman Pinscher', 'Oriental Short Hair', 'Beagle', 'Ragdoll',
       'Cocker Spaniel', 'Poodle', 'Black Labrador Retriever', 'Bengal',
       'Shar Pei', 'Spitz', 'Birman', 'Belgian Shepherd Malinois',
       'American Shorthair', 'Belgian Shepherd Laekenois', '0',
       'Jack Russell Terrier (Parson Russell Terrier)', 'Shepherd',
       'Corgi', 'Pit Bull Terrier', 'Oriental Tabby',
       'Miniature Pinscher', 'Manx', 'Boxer', 'Dachshund', 'Chihuahua',
       'Snowshoe', 'Rat Terrier', 'Tiger', 'Silver', 'Maine Coon',
       'German Pinscher', 'Russian Bl

In [65]:
# embedding columns
breed1 = feature_column.categorical_column_with_vocabulary_list(
      'Breed1', dataframe.Breed1.unique())
breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
breed1_embedding

EmbeddingColumn(categorical_column=VocabularyListCategoricalColumn(key='Breed1', vocabulary_list=('Tabby', 'Domestic Medium Hair', 'Mixed Breed', 'Domestic Short Hair', 'Domestic Long Hair', 'Terrier', 'Persian', 'Rottweiler', 'Jack Russell Terrier', 'Shih Tzu', 'Labrador Retriever', 'Silky Terrier', 'Bombay', 'Husky', 'Schnauzer', 'Golden Retriever', 'Siberian Husky', 'Collie', 'German Shepherd Dog', 'Siamese', 'Calico', 'American Staffordshire Terrier', 'Turkish Van', 'Doberman Pinscher', 'Oriental Short Hair', 'Beagle', 'Ragdoll', 'Cocker Spaniel', 'Poodle', 'Black Labrador Retriever', 'Bengal', 'Shar Pei', 'Spitz', 'Birman', 'Belgian Shepherd Malinois', 'American Shorthair', 'Belgian Shepherd Laekenois', '0', 'Jack Russell Terrier (Parson Russell Terrier)', 'Shepherd', 'Corgi', 'Pit Bull Terrier', 'Oriental Tabby', 'Miniature Pinscher', 'Manx', 'Boxer', 'Dachshund', 'Chihuahua', 'Snowshoe', 'Rat Terrier', 'Tiger', 'Silver', 'Maine Coon', 'German Pinscher', 'Russian Blue', 'Tuxedo',

In [47]:
demo(breed1_embedding)

[[-0.2212519   0.42945105 -0.07446266  0.34322447 -0.01476592 -0.48085988
   0.03153189 -0.26152351]
 [-0.3640167  -0.23997332  0.4086902   0.06860685  0.41826752  0.11606763
  -0.6059803   0.345743  ]
 [ 0.42649218  0.09538587  0.67222625 -0.12674467  0.47199896 -0.28783375
   0.02352233 -0.06842715]
 [-0.3640167  -0.23997332  0.4086902   0.06860685  0.41826752  0.11606763
  -0.6059803   0.345743  ]
 [-0.3640167  -0.23997332  0.4086902   0.06860685  0.41826752  0.11606763
  -0.6059803   0.345743  ]]


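Crossed columns: the following cells show the two source columns (`age_buckets` and `animal_type`), the crossed column built from them, and its one-hot output for the example batch.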

In [51]:
age_buckets

BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1, 2, 3, 4, 5))

In [52]:
animal_type

VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)

In [53]:
# crossed columns
age_type_feature = feature_column.crossed_column([age_buckets, animal_type], hash_bucket_size=100)
age_type_feature

CrossedColumn(keys=(BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1, 2, 3, 4, 5)), VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=100, hash_key=None)

In [54]:
feature_column.indicator_column(age_type_feature)

IndicatorColumn(categorical_column=CrossedColumn(keys=(BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1, 2, 3, 4, 5)), VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=100, hash_key=None))

In [55]:
demo(feature_column.indicator_column(age_type_feature))

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

### End of explanation

### Create a feature layer
Now that we have defined our feature columns, we will use a [DenseFeatures](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to input them to our Keras model.

In [67]:
feature_columns

[NumericColumn(key='PhotoAmt', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='Fee', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 BucketizedColumn(source_column=NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(1, 2, 3, 4, 5)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Type', vocabulary_list=('Cat', 'Dog'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Color1', vocabulary_list=('Black', 'Brown', 'Cream', 'Gray', 'Golden', 'White', 'Yellow'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Color2', vocabulary_list=('White', 'Brown', 'No Color', 'Gray', 'Cream', 'Golden', 'Ye

In [68]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [69]:
feature_layer

<tensorflow.python.keras.feature_column.dense_features_v2.DenseFeatures at 0x7f6db5bd2940>

Earlier, we used a small batch size to demonstrate how feature columns work. Now we create a new input pipeline with a larger batch size.

In [71]:
print(type(train))
train

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
10723,Cat,9,Domestic Short Hair,Female,Cream,No Color,Medium,Short,No,No,Healthy,0,2,1
6297,Dog,12,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,Not Sure,Healthy,0,2,0
6675,Cat,29,Domestic Short Hair,Female,Brown,White,Medium,Short,Not Sure,No,Healthy,0,3,0
1258,Cat,4,Domestic Short Hair,Female,Brown,Cream,Medium,Short,No,No,Healthy,0,2,1
10199,Cat,8,Burmese,Female,Black,Yellow,Medium,Medium,Yes,Yes,Healthy,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2058,Cat,10,Domestic Short Hair,Female,Black,Brown,Medium,Short,Yes,Yes,Healthy,0,2,1
1884,Dog,1,Mixed Breed,Female,Brown,Cream,Medium,Short,No,No,Healthy,0,3,1
5608,Dog,4,Mixed Breed,Female,Brown,No Color,Medium,Short,Yes,Yes,Healthy,0,5,0
7046,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,No,No,Healthy,0,6,1


In [72]:
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

## Create, compile, and train the model

In [74]:
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dropout(.1),
  layers.Dense(1)
])

In [75]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
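The final `Dense(1)` layer has no activation, so the model outputs a raw score (a logit) per example, and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below shows, in plain Python, how a logit relates to a probability and how binary cross-entropy can be computed from the logit directly. The function names are hypothetical, and the numerically stable formula is a common formulation, not a copy of the Keras code.

```python
import math

def sigmoid(logit):
    """Convert a raw logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy_from_logit(label, logit):
    """Cross-entropy for one example, computed from the logit in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

logit = 2.0  # a hypothetical model output
probability = sigmoid(logit)

loss_if_adopted = binary_cross_entropy_from_logit(1, logit)
loss_if_not_adopted = binary_cross_entropy_from_logit(0, logit)

print('probability:', probability)
print('loss when target is 1:', loss_if_adopted)
print('loss when target is 0:', loss_if_not_adopted)
```

To turn the trained model's predictions into probabilities, the same sigmoid would be applied to its outputs, for example with `tf.nn.sigmoid`.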

In [76]:
model.fit(train_ds,
          validation_data=val_ds,
          epochs=10)

Epoch 1/10
Consider rewriting this model with the Functional API.
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f6db5be3a58>

In [77]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.7023397088050842


In [78]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_features_34 (DenseFeat multiple                  1328      
_________________________________________________________________
dense_3 (Dense)              multiple                  19328     
_________________________________________________________________
dense_4 (Dense)              multiple                  16512     
_________________________________________________________________
dropout_1 (Dropout)          multiple                  0         
_________________________________________________________________
dense_5 (Dense)              multiple                  129       
Total params: 37,297
Trainable params: 37,297
Non-trainable params: 0
_________________________________________________________________
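The parameter counts in the summary can be checked by hand. A dense layer with `n` inputs and `u` units has `(n + 1) * u` parameters (weights plus one bias per unit). Working backwards from the first dense layer gives the width of the feature vector produced by the feature layer. Attributing the 1,328 feature-layer parameters to the `Breed1` embedding table is an inference from the numbers, not something the summary states.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return (n_inputs + 1) * n_units

# Counts copied from the model summary above.
feature_layer_params = 1328
first_dense_params = 19328

# Solve (n + 1) * 128 = 19328 for the feature vector width n.
feature_width = first_dense_params // 128 - 1

second_dense_params = dense_params(128, 128)
output_params = dense_params(128, 1)

# If the feature layer's parameters are the 8-dimensional embedding table
# (an assumption), this is the number of rows in that table.
embedding_rows = feature_layer_params // 8

total_params = (feature_layer_params + first_dense_params
                + second_dense_params + output_params)

print('feature vector width:', feature_width)
print('embedding rows:', embedding_rows)
print('total params:', total_params)
```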


Key point: You will typically see best results with deep learning with much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.

## Next steps
The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented.