# **Basic Feature Engineering in Keras**

**Learning Objectives**
1. Create an input pipeline using `tf.data`
2. Engineer features to create categorical, crossed, and numerical feature columns

## **Introduction**

In this lab, we utilise feature engineering to improve the prediction of housing prices using a Keras sequential model.

Start by importing the necessary libraries for this lab.

In [1]:
import os

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow import feature_column
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model

print("TensorFlow version", tf.__version__)

TensorFlow version 2.4.1


Many of the programming exercises in Google's ML courses use the [California Housing data set](https://developers.google.com/machine-learning/crash-course/california-housing-data-description), which contains data drawn from the 1990 U.S. Census.

Let's read in the data set and create a Pandas DataFrame.

In [2]:
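# Read the CSV file into a DataFrame.
# Note: `error_bad_lines` was removed in pandas 2.0; use `on_bad_lines="skip"` there.
housing_df = pd.read_csv("data/housing.csv", error_bad_lines=False)
# `head()` returns the first n rows of the DataFrame (five by default)
housing_df.head()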

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


We can use `.describe()` to see summary statistics for the numeric fields in our DataFrame. Note, for example, the `count` row. It shows 20640 for every column except `total_bedrooms`, which shows 20433, so 207 rows have no `total_bedrooms` value.

In [3]:
# `describe()` is used to get the statistical summary of the DataFrame
housing_df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [28]:
# Select the rows where `total_bedrooms` is missing
housing_df.loc[housing_df["total_bedrooms"].isnull(), :]

NotImplementedError: iLocation based boolean indexing on an integer type is not available
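
The `describe()` output above reports 20433 values for `total_bedrooms` against 20640 for the other columns, so some rows lack that field. The sketch below shows one common way to count and handle such gaps with pandas. The toy DataFrame `toy_df` and the choice between dropping rows and median imputation are assumptions made for illustration; the lab itself does not prescribe a strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing data with one missing value.
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Count the missing values in each column.
missing_counts = toy_df.isnull().sum()

# Option 1: drop the rows that lack `total_bedrooms`.
dropped_df = toy_df.dropna(subset=["total_bedrooms"])

# Option 2: fill the gaps with the column median
# (assumption: median imputation is acceptable for this feature).
median_bedrooms = toy_df["total_bedrooms"].median()
filled_df = toy_df.fillna({"total_bedrooms": median_bedrooms})
```

In a real pipeline, the fill value would normally be computed from the training split only and then reused for the validation and test splits.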

## **Split the data set for ML**

The data set we loaded was a single CSV file. We will split this into train, validation and test sets.

In [4]:
# Let's split the data set into train, validation, and test sets
train, test = train_test_split(housing_df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)

print(len(train), "train examples")
print(len(val), "validation examples")
print(len(test), "test examples")

13209 train examples
3303 validation examples
4128 test examples
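
The three counts above follow from applying a 20% hold-out twice. The arithmetic can be sketched as below, assuming that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training side.

```python
import math

n_rows = 20640        # rows in the full data set, from `describe()`
test_fraction = 0.2   # the `test_size` used in both splits

# First split: hold out the test set.
# Assumption: the fractional test size is rounded up to whole rows.
n_test = math.ceil(n_rows * test_fraction)
n_train_val = n_rows - n_test

# Second split: hold out the validation set from what remains.
n_val = math.ceil(n_train_val * test_fraction)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)  # prints: 13209 3303 4128
```

Because the rows are shuffled randomly, the sizes are stable across runs but the membership of each split is not. Passing a fixed `random_state` to `train_test_split` makes the split reproducible.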


Now we write the split data sets to CSV files. We will need `housing-test.csv` later for testing.

In [5]:
train.to_csv("data/housing-train.csv", encoding="utf8", index=False)

In [6]:
val.to_csv("data/housing-val.csv", encoding="utf8", index=False)

In [7]:
test.to_csv("data/housing-test.csv", encoding="utf8", index=False)

In [8]:
!head data/housing*.csv

==> data/housing.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY

==> data/housing-test.csv <==
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-117.06,32.61,34.0,4325.0,1015.0,2609.0,979.0,2.8

## **Create an input pipeline using `tf.data`**

Next, we will wrap the DataFrames with `tf.data`. This will enable us to use feature columns as a bridge to map from the columns in the Pandas DataFrame to the features used to train the model.

In [9]:
# Here, we create an input pipeline using `tf.data`
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop("median_house_value")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Next we initialise the training and validation Datasets.

In [10]:
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

In [11]:
for feature_batch, label_batch in train_ds.take(1):
    print("Every feature:", list(feature_batch.keys()))
    print("A batch of households:", feature_batch["households"])
    print("A batch of ocean_proximity:", feature_batch["ocean_proximity"])
    print("A batch of targets:", label_batch)

Every feature: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity']
A batch of households: tf.Tensor(
[217.  40. 281. 243. 119. 431. 475.  42. 314. 409. 220. 254. 386. 660.
 494. 586. 315. 615. 322. 244. 348. 677. 302. 456. 230. 568.  42. 158.
 571. 790.  89. 365.], shape=(32,), dtype=float64)
A batch of ocean_proximity: tf.Tensor(
[b'INLAND' b'NEAR BAY' b'INLAND' b'INLAND' b'<1H OCEAN' b'NEAR OCEAN'
 b'INLAND' b'INLAND' b'NEAR OCEAN' b'<1H OCEAN' b'<1H OCEAN' b'INLAND'
 b'INLAND' b'NEAR BAY' b'<1H OCEAN' b'<1H OCEAN' b'<1H OCEAN' b'<1H OCEAN'
 b'INLAND' b'<1H OCEAN' b'NEAR BAY' b'<1H OCEAN' b'<1H OCEAN' b'NEAR BAY'
 b'<1H OCEAN' b'<1H OCEAN' b'NEAR BAY' b'INLAND' b'<1H OCEAN' b'<1H OCEAN'
 b'INLAND' b'INLAND'], shape=(32,), dtype=string)
A batch of targets: tf.Tensor(
[141000. 350000.  86100.  54300. 145500. 179800. 178500. 170000. 360300.
 372000. 133000.  68400.  64500. 297300. 141400. 186200

We can see that the Dataset returns a dictionary of column names (from the DataFrame) that map to column values from rows in the DataFrame.
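
To make that structure concrete, here is a minimal sketch on a four-row DataFrame. The name `toy_df` and its values are made up for illustration, and the columns are converted with `to_numpy()` before slicing, which is an assumption chosen for robustness rather than something the lab does.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical four-row stand-in for the housing DataFrame.
toy_df = pd.DataFrame({
    "households": [126.0, 1138.0, 177.0, 219.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
})

features = toy_df.copy()
labels = features.pop("median_house_value")

# Each element is a (features_dict, label) pair; batching stacks them.
feature_arrays = {name: column.to_numpy() for name, column in features.items()}
toy_ds = tf.data.Dataset.from_tensor_slices(
    (feature_arrays, labels.to_numpy())
).batch(2)

feature_batch, label_batch = next(iter(toy_ds))
batch_keys = sorted(feature_batch.keys())
households_shape = tuple(feature_batch["households"].shape)
proximity_dtype = feature_batch["ocean_proximity"].dtype
```

Each key in `feature_batch` maps to a tensor with one entry per row in the batch, and string columns arrive as `tf.string` tensors.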

### **Numeric columns**

The output of a `tf.feature_column` becomes the input to the model. A numeric column is the simplest type of column. It represents real-valued features. When using this column, your model receives the column value from the DataFrame unchanged unless a `normalizer_fn` is supplied.

In the California Housing Prices data set, most columns are numeric. Let's create a variable called `numeric_cols` to hold only the numerical feature columns.

In [12]:
# Let's create a variable called `numeric_cols` to hold only the numerical feature columns
numeric_cols = ["longitude", "latitude", "housing_median_age", "total_rooms", 
                "total_bedrooms", "population", "households", "median_income"]

### **Scaler function**

Numerical variables should be scaled before they are fed into a neural network. Here we use *min-max scaling*. We create a function named `get_scal` that takes the name of a numerical feature and returns a `minmax` function, which is passed to `tf.feature_column.numeric_column()` as the `normalizer_fn` parameter. The `minmax` function takes a value of that feature and returns the scaled value, using the minimum and maximum of that feature in the training set.

Next, we scale the numerical feature columns that we assigned to the variable `numeric_cols`.

In [13]:
# `get_scal` takes the name of a numerical feature and returns a `minmax` function.
# `minmax` takes a value of that feature and scales it using the
# minimum and maximum observed in the training set.
def get_scal(feature):
    def minmax(x):
        mini = train[feature].min()
        maxi = train[feature].max()
        return (x - mini) / (maxi - mini)
    return minmax
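
As a worked example of the min-max formula, the sketch below scales `housing_median_age` using the minimum (1.0) and maximum (52.0) from the summary statistics of the full data set. The helper name `make_minmax` is hypothetical, and the training split may have slightly different bounds.

```python
def make_minmax(minimum, maximum):
    """Return a function that maps [minimum, maximum] linearly onto [0, 1]."""
    span = maximum - minimum

    def minmax(x):
        return (x - minimum) / span

    return minmax


# Bounds taken from the `describe()` summary of the full data set.
scale_age = make_minmax(1.0, 52.0)

scaled_min = scale_age(1.0)    # 0.0
scaled_mid = scale_age(26.5)   # 0.5
scaled_max = scale_age(52.0)   # 1.0

# A value outside the fitted range falls outside [0, 1].
scaled_outside = scale_age(60.0)
```

Values in the validation or test sets that lie outside the training range will scale to slightly below 0 or above 1, which is usually harmless but worth knowing.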

In [14]:
feature_columns = []
for header in numeric_cols:
    scal_input_fn = get_scal(header)
    feature_columns.append(tf.feature_column.numeric_column(header,
                                                            normalizer_fn=scal_input_fn))
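
One practical effect of building each normalizer through a factory function such as `get_scal` is that every returned function remembers its own feature name. Closures defined directly inside a loop would instead all look up the loop variable when they are eventually called. The sketch below illustrates the difference in plain Python; `make_getter` and the two lists are hypothetical names.

```python
names = ["longitude", "latitude"]

# Closures created directly in the loop share the loop variable,
# so they all see its final value when called later.
late_bound = [lambda: name for name in names]


def make_getter(name):
    """Bind `name` at creation time by giving each closure its own scope."""
    def getter():
        return name
    return getter


early_bound = [make_getter(name) for name in names]

late_results = [f() for f in late_bound]     # ['latitude', 'latitude']
early_results = [f() for f in early_bound]   # ['longitude', 'latitude']
```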

In [15]:
feature_columns

[NumericColumn(key='longitude', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef37680ca0>),
 NumericColumn(key='latitude', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef37680dc0>),
 NumericColumn(key='housing_median_age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef37680e50>),
 NumericColumn(key='total_rooms', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef37680ee0>),
 NumericColumn(key='total_bedrooms', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef37680f70>),
 NumericColumn(key='population', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=<function get_scal.<locals>.minmax at 0x7fef3744a040>),
 NumericColumn(key='households', shape=(1,), default_value=None, dtype=t

Next, we should validate the total number of feature columns. Compare this number to the number of numeric features you defined earlier.

In [16]:
print("Total number of feature columns:", len(feature_columns))

Total number of feature columns: 8


### **Using the Keras sequential model**

Next, we will compile and fit a Keras sequential model.

In [17]:
# Model creation
# `tf.keras.layers.DenseFeatures()` is a layer that produces a dense Tensor based on given `feature_columns`
feature_layer = tf.keras.layers.DenseFeatures(feature_columns, dtype="float64")

# `tf.keras.Sequential()` groups a linear stack of layers into a `tf.keras` model
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(12, input_dim=8, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="linear", name="median_house_value")
])

# Model compilation
model.compile(optimizer="adam",
              loss="mse",
              metrics=["mse"])

# Model fit
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=32)

Epoch 1/32
Consider rewriting this model with the Functional API.
Epoch 2/32
Epoch 3/32
Epoch 4/32
Epoch 5/32
Epoch 6/32
Epoch 7/32
Epoch 8/32
Epoch 9/32
Epoch 10/32
Epoch 11/32
Epoch 12/32
Epoch 13/32
Epoch 14/32
Epoch 15/32
Epoch 16/32
Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
 84/413 [=====>........................] - ETA: 0s - loss: nan - mse: nan

KeyboardInterrupt: 
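
The `loss: nan` in the log above is consistent with missing feature values reaching the model, since a single NaN propagates through every arithmetic step that touches it. This is a possible explanation rather than a confirmed diagnosis. The sketch below shows the propagation with a hand-written mean squared error; the function and the sample numbers are illustrative only.

```python
import math


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(squared_errors)


targets = [452600.0, 358500.0, 352100.0]
clean_predictions = [450000.0, 360000.0, 350000.0]
# One prediction is NaN, as it would be if an input feature were NaN.
nan_predictions = [450000.0, math.nan, 350000.0]

clean_loss = mean_squared_error(targets, clean_predictions)
nan_loss = mean_squared_error(targets, nan_predictions)

print(math.isnan(clean_loss), math.isnan(nan_loss))  # prints: False True
```

Checking the input DataFrame with `isnull().sum()` before building the datasets is a quick way to rule this cause in or out.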