# House Price Prediction With TensorFlow

[![open_in_colab][colab_badge]][colab_notebook_link]
<!-- [![open_in_binder][binder_badge]][binder_notebook_link] -->

[colab_badge]: https://colab.research.google.com/assets/colab-badge.svg
[colab_notebook_link]: https://colab.research.google.com/github/foursquare/fsq-studio-sdk-examples/blob/master/python-notebooks/08%20-%20Tensorflow_prediction.ipynb
<!-- [binder_badge]: https://mybinder.org/badge_logo.svg
[binder_notebook_link]: https://mybinder.org/v2/gh/foursquare/fsq-studio-sdk-examples/master?urlpath=lab/tree/python-notebooks/08%20-%20Tensorflow_prediction.ipynb -->

This example demonstrates how the Studio Map SDK allows for more engaging exploratory data visualization, helping to simplify the process of building a machine learning model for predicting median house prices in California.

## Dependencies 

This notebook uses the following dependencies:

- pandas
- numpy
- scikit-learn
- scipy
- seaborn
- matplotlib
- tensorflow

If running this notebook in Binder, these dependencies should already be installed. If running in Colab, the next cell will install these dependencies. In another environment, you'll need to make sure these dependencies are available by running the following `pip` command in a shell.

```bash
pip install pandas numpy scikit-learn scipy seaborn matplotlib tensorflow
```

This notebook was originally tested with the following package versions, but likely works with a broad range of versions:

- pandas==1.3.2
- numpy==1.19.5
- scikit-learn==0.24.2
- scipy==1.7.1
- seaborn==0.11.2
- matplotlib==3.4.3
- tensorflow==2.6.0

In [None]:
# If in Colab, install this notebook's required dependencies
import sys
if "google.colab" in sys.modules:
    !pip install 'unfolded.map_sdk>=1.0' pandas numpy scikit-learn scipy seaborn matplotlib tensorflow

## Imports

If you're running this notebook on Binder, you may see a notification like the following when running the next cell.
```
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Ignore above cudart dlerror if you do not have a GPU set up on your machine.
```
This is expected behavior because the machines on which Binder runs are not equipped with GPUs. The notebook will still work; it will just run slightly slower than on a machine with a GPU.

In [None]:
from uuid import uuid4

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

from unfolded.map_sdk import create_map

## Data Loading

For this example we'll use data from Kaggle's [California Housing Prices](https://www.kaggle.com/camnugent/california-housing-prices) dataset under the CC0 license. This dataset contains information about the housing in each census area in California, as of the 1990 census.

In [None]:
dataset_url = "https://4sq-studio-public.s3.us-west-2.amazonaws.com/sdk/examples/sample-data/housing.csv"
housing = pd.read_csv(dataset_url)
housing.head()

## Feature Engineering

First, let's take a look at the input data and visualize different aspects of it on a map.

### Population Clustering

In the next cell we'll create a map that clusters rows of the dataset according to population. Note that since the clustering happens within Foursquare Studio, the clusters are re-computed as you zoom in, allowing you to explore your data at various resolutions.

In [None]:
population_in_CA = create_map()
population_in_CA

In [None]:
# Create a persistent dataset ID that we can reference in both add_dataset and add_layer
dataset_id = str(uuid4())

population_in_CA.add_dataset(
    id=dataset_id,
    label="Population_in_CA",
    data=housing,
    auto_create_layers=False,
)

population_in_CA.add_layer(
    {
        "id": "population_CA",
        "type": "cluster",
        "label": "population in CA",
        "data_id": dataset_id,
        "fields": {"lat": "latitude", "lng": "longitude"},
        "config": {
            "visual_channels": {
                "colorScale": "quantize",
                "colorField": {"name": "population", "type": "real"},
            }
        },
    }
)

population_in_CA.set_view(longitude=-119.417931, latitude=36.778259, zoom=5)

### Distances from housing areas to largest cities

Next, we want to explore where the housing areas in our dataset are located in comparison to the largest cities in California. For example purposes, we'll take the five largest cities in California and compare our input data against these locations.

In [None]:
# Longitude-latitude pairs for large cities
cities = {
    "Los Angeles": (-118.244, 34.052),
    "San Diego": (-117.165, 32.716),
    "San Jose": (-121.895, 37.339),
    "San Francisco": (-122.419, 37.775),
    "Fresno": (-119.772, 36.748),
}

Next we need to find the closest city for each row in our dataset. First we'll define two helper functions: one computes the distance between two sets of points, and the other finds the city closest to each point. Then we'll apply these functions to our data.

In [None]:
def distance(lng1, lat1, lng2, lat2):
    """Vectorized Haversine formula

    Computes distances between two sets of points.

    From: https://stackoverflow.com/a/51722117
    """
    # approximate radius of earth in km
    R = 6371.009

    lat1 = np.deg2rad(lat1)
    lng1 = np.deg2rad(lng1)
    lat2 = np.deg2rad(lat2)
    lng2 = np.deg2rad(lng2)

    d = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin((lng2 - lng1)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))
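
As a quick sanity check of the Haversine formula, the following self-contained sketch recomputes the distance between Los Angeles and San Francisco using the coordinates from the `cities` dictionary. The function name `haversine_km` and its `radius_km` parameter are illustrative stand-ins, not part of the notebook; the great-circle distance between the two cities is commonly quoted as a little under 560 km.

```python
import numpy as np


def haversine_km(lng1, lat1, lng2, lat2, radius_km=6371.009):
    """Great-circle distance in kilometers between points given in degrees."""
    lng1, lat1, lng2, lat2 = map(np.deg2rad, (lng1, lat1, lng2, lat2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Scalar call: Los Angeles to San Francisco
la_to_sf = haversine_km(-118.244, 34.052, -122.419, 37.775)

# Vectorized call: Los Angeles and San Diego, each measured against San Francisco
lngs = np.array([-118.244, -117.165])
lats = np.array([34.052, 32.716])
to_sf = haversine_km(lngs, lats, -122.419, 37.775)

print(f"LA to SF: {la_to_sf:.0f} km; vectorized result shape: {to_sf.shape}")
```

Because the function only uses NumPy ufuncs, the same code handles scalars, arrays, and pandas Series without an explicit loop.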

In [None]:
def closest_city(lng_array, lat_array, cities):
    """Find the closest city for each row of lng_array and lat_array.

    `cities` maps city names to (longitude, latitude) pairs.
    """
    distances = []

    # Compute distance from each row of arrays to each of our city inputs
    for city_name, coord in cities.items():
        distances.append(distance(lng_array, lat_array, *coord))

    # Convert this list of numpy arrays into a 2D numpy array
    distances = np.array(distances)

    # Find the shortest distance value for each row
    shortest_distances = np.amin(distances, axis=0)

    # Find the _index_ of the shortest distance for each row. We use this index below
    # to look up the longitude-latitude pair of the closest city
    city_index = np.argmin(distances, axis=0)

    # Create a 2D numpy array of city coordinates, then use the indexes from above to
    # look up each row's closest city. (Note: this relies on dictionaries preserving
    # insertion order, which is guaranteed in Python 3.7+)
    input_coords = np.array(list(cities.values()))
    closest_city_coords = input_coords[city_index]

    # Return a 2D array with three columns:
    # - Distance to closest city
    # - Longitude of closest city
    # - Latitude of closest city
    return np.hstack((shortest_distances[:, np.newaxis], closest_city_coords))
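
The lookup step in `closest_city` can be easier to follow on a tiny example. The sketch below uses a made-up distance matrix (two cities, three data points) rather than real Haversine distances; the values and the variable names are purely illustrative.

```python
import numpy as np

# Hypothetical distances in km: one row per city, one column per data point
distances = np.array(
    [
        [10.0, 250.0, 40.0],  # city 0
        [300.0, 20.0, 35.0],  # city 1
    ]
)

# (longitude, latitude) of each city, in the same order as the rows above
city_coords = np.array([[-118.244, 34.052], [-122.419, 37.775]])

nearest_index = np.argmin(distances, axis=0)  # which city is nearest, per point
nearest_distance = np.amin(distances, axis=0)  # distance to that city, per point
nearest_coords = city_coords[nearest_index]  # integer-array indexing, shape (3, 2)

# One row per data point: distance, then longitude and latitude of the nearest city
result = np.hstack((nearest_distance[:, np.newaxis], nearest_coords))
print(result)
```

Indexing `city_coords` with an integer array returns one coordinate pair per data point, which is why the result can be stacked next to the distances column.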

Then use the `closest_city` function on our data to create three new columns:

In [None]:
housing[['closest_city_dist', 'closest_city_lng', 'closest_city_lat']] = closest_city(
    housing['longitude'], housing['latitude'], cities
)

The map created in the next cell uses the new columns we computed above in relation to the largest cities in California:

In [None]:
distance_to_big_cities = create_map()
distance_to_big_cities

In [None]:
dist_data_id = str(uuid4())

distance_to_big_cities.add_dataset(
    id=dist_data_id,
    label="Distance to closest big city",
    data=housing,
    auto_create_layers=False,
)

distance_to_big_cities.add_layer(
    {
        "id": "closest_distance",
        "type": "arc",
        "data_id": dist_data_id,
        "label": "distance to closest big city",
        "fields": {
            "lng0": "longitude",
            "lat0": "latitude",
            "lng1": "closest_city_lng",
            "lat1": "closest_city_lat",
        },
        "config": {
            "vis_config": {"opacity": 0.8, "thickness": 0.3},
        },
    }
)

distance_to_big_cities.set_view(longitude=-119.417931, latitude=36.778259, zoom=4.5)

## Data Preprocessing

In this next section, we want to prepare our dataset to be used for training a TensorFlow model. First, we'll drop rows with null values, since they're quite rare in the dataset.

In [None]:
pct_null_rows = housing.isnull().any(axis=1).sum() / len(housing) * 100
print(f'{pct_null_rows:.1f}% of rows have null values')

housing = housing.dropna()

In the model we're training, we want to predict the median house value of an area. Thus we split the columns from our dataset `housing` into a dataset `y` with the column `median_house_value` and a dataset `X` with all other columns.

In [None]:
predicted_column = ['median_house_value']
other_columns = housing.columns.difference(predicted_column)

X = housing.loc[:, other_columns]
y = housing.loc[:, predicted_column]

Most of the columns in `X` are numeric, but one is not. `ocean_proximity` is of type `object`, which here is a string.

In [None]:
X.dtypes

Looking closer, we see that `ocean_proximity` is a categorical string with only five values.

In [None]:
X['ocean_proximity'].value_counts()

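In order to use this column in our numeric model, we call [`pandas.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to replace it with indicator columns. Because we pass `drop_first=True`, the first category is dropped, so four new columns are created rather than five; a row with zeros in all four columns belongs to the dropped category. Each of these columns contains a `1` if the value of `ocean_proximity` is equal to the value that's now part of the column name.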

In [None]:
X = pd.get_dummies(
    data=X, columns=["ocean_proximity"], prefix=["ocean_proximity"], drop_first=True
)
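
To see the effect of `drop_first=True` in isolation, here is a minimal sketch on a toy column with five categories. The toy DataFrame is an assumption for illustration and stands in for the real `X`.

```python
import pandas as pd

# Toy stand-in for the real data: five distinct categories
toy = pd.DataFrame(
    {
        "ocean_proximity": [
            "INLAND",
            "NEAR BAY",
            "ISLAND",
            "<1H OCEAN",
            "NEAR OCEAN",
            "INLAND",
        ]
    }
)

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=["ocean_proximity"])

# With drop_first: the first category (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

print(full.shape[1], "columns without drop_first")
print(reduced.shape[1], "columns with drop_first")
print(sorted(set(full.columns) - set(reduced.columns)), "was dropped")
```

Dropping one category avoids a redundant column, since its value can always be inferred from the remaining indicators.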

## Data Splitting

In line with standard machine learning practice, we split our dataset into training, validation and test sets. We first take out 20% of our full dataset to use for testing the model after training. Then of the remaining 80%, we take out 75% to use for training the model and 25% to use for validation. This yields a 60/20/20 split of the full dataset into training, validation and test sets.

In [None]:
# Split the full dataset into training, validation and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1
)

# We save a copy of our test data to use after model prediction
start_values = X_test.copy(deep=True)
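
The resulting proportions can be checked on a toy array. This sketch assumes 100 dummy rows in place of the housing data and reuses the same `test_size` and `random_state` values as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows standing in for the real dataset
rows = np.arange(100)

# First split: hold out 20% for testing
train_val, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% becomes the validation set
train, val = train_test_split(train_val, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))  # 60 20 20
```

The three subsets are disjoint, so no row used for validation or testing is seen during training.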

## Feature Scaling

We use standard scaling with mean and standard deviation from our training dataset to avoid data leakage.

In [None]:
# feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
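
The arithmetic behind this step can be sketched with plain NumPy. The arrays below are made-up values; the point is that the mean and standard deviation come from the training rows only and are then reused for the other sets.

```python
import numpy as np

# Made-up training and test rows with two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

# Statistics are computed from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same statistics are applied to every subset
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, the training features have zero mean and unit variance, while the test features generally do not, which is expected because the test data played no part in computing the statistics.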

## Price Prediction Model

Next we specify the parameters for the TensorFlow model:

In [None]:
# We use a Sequential model from Keras
# https://keras.io/api/models/sequential/
model = Sequential()

# Each column from X is an input feature into our model.
number_of_features = len(X.columns)

# First dense layer; `input_dim` tells Keras how many input features to expect
model.add(Dense(number_of_features, activation="relu", input_dim=number_of_features))

# Hidden layers
model.add(Dense(512, activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(32, activation="relu"))

# output Layer
model.add(Dense(1, activation="linear"))

In [None]:
model.compile(loss="mse", optimizer="adam", metrics=["mse", "mae"])
model.summary()

### Training

Next we train the model. Training can take a long time: more epochs generally fit the training data more closely but take longer, and too many epochs can lead to overfitting. Here we default to only 10 epochs because the focus of this notebook is integration with Foursquare Studio, not the machine learning itself.

In [None]:
EPOCHS = 10
# Or uncomment the following line if you're happy to wait longer for a better model fit.
# EPOCHS = 70

In [None]:
history = model.fit(
    X_train,
    y_train.to_numpy(),
    batch_size=10,
    epochs=EPOCHS,
    verbose=1,
    validation_data=(X_val, y_val.to_numpy()),
)

### Evaluation

Next we want to find out how well the model was trained:

In [None]:
# summarize history for loss
loss_train = history.history["loss"]
loss_val = history.history["val_loss"]
epochs = range(1, EPOCHS + 1)
plt.figure(figsize=(10, 8))
plt.plot(epochs, loss_train, "g", label="Training loss")
plt.plot(epochs, loss_val, "b", label="Validation loss")
plt.title("Training and Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

In the above chart we can see that the training loss and validation loss are quite close to each other.

Now we can use our trained model to predict home prices on the _test_ data, which was not used in the training process.

In [None]:
y_pred = model.predict(X_test)

Evaluating the model on the test data, we can see that the loss is similar to the loss on the training data:

In [None]:
model.evaluate(X_test, y_test)
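
Since the model was compiled with `loss="mse"` and `metrics=["mse", "mae"]`, `model.evaluate` reports the loss followed by those two metrics. The MSE is in squared dollars, which is hard to interpret; its square root (RMSE) and the MAE are in dollars. The sketch below shows how these quantities relate, using made-up prices and predictions rather than real model output.

```python
import numpy as np

# Made-up true and predicted median house values, in dollars
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 90_000.0])
y_hat = np.array([210_000.0, 140_000.0, 300_000.0, 100_000.0])

errors = y_hat - y_true

mse = np.mean(errors**2)  # mean squared error, in squared dollars
mae = np.mean(np.abs(errors))  # mean absolute error, in dollars
rmse = np.sqrt(mse)  # root mean squared error, in dollars

print(f"MSE={mse:.3g}  MAE={mae:,.0f}  RMSE={rmse:,.0f}")
```

RMSE is never smaller than MAE, and a large gap between the two suggests that a few areas have much larger errors than the rest.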

### Prediction

Let's now visualize our housing price predictions using Foursquare Studio. Here we create a dataframe with predicted values obtained from the model.

In [None]:
predict_data = start_values.loc[:, ['longitude', 'latitude']].copy()
# model.predict returns an array of shape (n, 1); flatten it into a 1D column
predict_data["price"] = y_pred.ravel()

### Visualization

The map we create in the next cell depicts the prices we've predicted for houses in each census area in California.

In [None]:
housing_predict_prices = create_map()
housing_predict_prices

In [None]:
price_data_id = str(uuid4())

housing_predict_prices.add_dataset(
    id=price_data_id,
    label="Predict housing prices in CA",
    data=predict_data,
    auto_create_layers=False,
)

housing_predict_prices.add_layer(
    {
        "id": "housing_prices",
        "type": "hexagon",
        "label": "housing prices",
        "data_id": price_data_id,
        "fields": {"lat": "latitude", "lng": "longitude"},
        "config": {
            "visual_channels": {
                "color_scale": "quantize",
                "color_field": {"name": "price", "type": "real"},
            },
            "vis_config": {
                "colorRange": {
                    "colors": [
                        "#E6F598",
                        "#ABDDA4",
                        "#66C2A5",
                        "#3288BD",
                        "#5E4FA2",
                        "#9E0142",
                        "#D53E4F",
                        "#F46D43",
                        "#FDAE61",
                        "#FEE08B",
                    ]
                }
            },
        },
    }
)

housing_predict_prices.set_view(longitude=-119.417931, latitude=36.6, zoom=6)

## Clustering Model

We'll now cluster the predicted data by price levels using the KMeans algorithm.

In [None]:
k = 5
km = KMeans(n_clusters=k, init="k-means++")
X = predict_data.loc[:, ["latitude", "longitude", "price"]]

# Run clustering and add the cluster labels to the prediction dataset
predict_data["cluster"] = km.fit_predict(X)
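
Note that the three clustering features are not scaled. Prices are in the hundreds of thousands while coordinates span only a few degrees, so the Euclidean distances used by KMeans are dominated by price, which is why the clusters correspond to price levels. The following sketch illustrates this on synthetic data (random coordinates and two made-up price levels); the `n_init` and `random_state` values are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 50

# Random coordinates roughly within California's bounding box
lat = rng.uniform(32.5, 42.0, size=2 * n)
lng = rng.uniform(-124.0, -114.0, size=2 * n)

# Two made-up price levels, unrelated to location
price = np.concatenate([np.full(n, 100_000.0), np.full(n, 500_000.0)])

features = np.column_stack([lat, lng, price])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Each price level ends up in exactly one cluster, regardless of location
same_within_groups = len(set(labels[:n])) == 1 and len(set(labels[n:])) == 1
print(same_within_groups)
```

If clusters that also reflect geography were wanted, the features could be standardized first, but that would be a different analysis from the one in this notebook.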

### Visualization

Let's show the price clusters in a chart:

In [None]:
fig, ax = plt.subplots()
# Longitude goes on the x-axis so the chart has the same orientation as a map
sns.scatterplot(
    x="longitude",
    y="latitude",
    data=predict_data,
    palette=sns.color_palette("bright", k),
    hue="cluster",
    ax=ax,
).set_title(f"Clustering (k={k})")

The next map shows the same clusters in a geographic context. Here we can see that house prices are highest for areas close to the largest cities.

In [None]:
unfolded_map_prices = create_map()
unfolded_map_prices

In [None]:
prices_dataset_id = str(uuid4())

unfolded_map_prices.add_dataset(
    id=prices_dataset_id,
    label="Prices",
    data=predict_data,
    auto_create_layers=False,
)

unfolded_map_prices.add_layer(
    {
        "id": "prices_CA",
        "type": "point",
        "data_id": prices_dataset_id,
        "label": "clustering of prices",
        "fields": {"lat": "latitude", "lng": "longitude"},
        "config": {
            "visual_channels": {
                "colorScale": "quantize",
                "colorField": {"name": "cluster", "type": "real"},
            },
            "vis_config": {
                "colorRange": {
                    "colors": ["#7FFFD4", "#8A2BE2", "#00008B", "#FF8C00", "#FF1493"]
                }
            },
        },
    }
)

unfolded_map_prices.set_view(longitude=-119.417931, latitude=36.778259, zoom=4)