# House Price Prediction With TensorFlow

This example shows how the Unfolded Map SDK supports exploratory data visualization while building a machine learning model that predicts median house prices in California.

## Dependencies

In addition to the Unfolded Map SDK, this notebook uses the following dependencies:

- pandas
- numpy
- scikit-learn
- scipy
- seaborn
- matplotlib
- tensorflow

If these aren't already installed, run the following command:

```bash
pip install pandas numpy scikit-learn scipy seaborn matplotlib tensorflow
```

This notebook was originally tested with the following package versions, but likely works with a broad range of versions:

- pandas==1.3.2
- numpy==1.19.5
- scikit-learn==0.24.2
- scipy==1.7.1
- seaborn==0.11.2
- matplotlib==3.4.3
- tensorflow==2.6.0

## Imports

In [None]:
from uuid import uuid4

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from scipy.cluster.vq import vq
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from tensorflow import keras
from tensorflow.keras import Sequential, optimizers
from tensorflow.keras.layers import Dense, Flatten, Softmax

from unfolded.map_sdk import UnfoldedMap

## Data Loading

For this example we'll use data from XXXXXX under the CC0 license.

In [None]:
dataset_url = "https://actionengine-public.s3.us-east-2.amazonaws.com/housing.csv"
housing = pd.read_csv(dataset_url)
housing.head()

## Feature Engineering

First, let's take a look at the input data and visualize different aspects of it on a map.

### Population Clustering

Here we'll create a map that clusters the locations and colors each cluster by population. Because the clustering happens within Unfolded Studio, the clusters change as you zoom in, letting you explore the data at various resolutions.

In [None]:
population_in_CA = UnfoldedMap()

# Create a persistent dataset ID that we can reference in both add_dataset and add_layer
dataset_id = uuid4()

population_in_CA.add_dataset(
    {"uuid": dataset_id, "label": "Population_in_CA", "data": housing},
    auto_create_layers=False,
)

population_in_CA.add_layer(
    {
        "id": "population_CA",
        "type": "cluster",
        "config": {
            "label": "population in CA",
            "data_id": dataset_id,
            "columns": {"lat": "latitude", "lng": "longitude"},
            "is_visible": True,
            "color_scale": "quantize",
            "color_field": {"name": "population", "type": "real"},
        },
    }
)

population_in_CA.set_view_state(
    {"longitude": -119.417931, "latitude": 36.778259, "zoom": 5}
)

population_in_CA

### Distance To Largest Cities

For example purposes, we'll take the five largest cities in California and compare our input data against these locations.

In [None]:
# Longitude-latitude pairs for large cities
cities = {
    "Los Angeles": (-118.244, 34.052),
    "San Diego": (-117.165, 32.716),
    "San Jose": (-121.895, 37.339),
    "San Francisco": (-122.419, 37.775),
    "Fresno": (-119.772, 36.748),
}

Next we need to find the closest city for each row in our data sample.

In [None]:
def distance(lng1, lat1, lng2, lat2):
    """Vectorized Haversine formula

    Computes distances between two sets of points.

    From: https://stackoverflow.com/a/51722117
    """
    # approximate radius of earth in km
    R = 6372.8

    lat1 = np.deg2rad(lat1)
    lng1 = np.deg2rad(lng1)
    lat2 = np.deg2rad(lat2)
    lng2 = np.deg2rad(lng2)

    d = np.sin((lat2 - lat1)/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin((lng2 - lng1)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))
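
As a quick sanity check of the Haversine formula, the sketch below recomputes it in a standalone helper and applies it to two of the city coordinates used in this notebook. The name `haversine_km` and its `radius_km` parameter are illustrative assumptions, not part of the notebook; the body mirrors the `distance` function above.

```python
import numpy as np


def haversine_km(lng1, lat1, lng2, lat2, radius_km=6372.8):
    """Great-circle distance in km (hypothetical standalone copy of `distance`)."""
    # Convert every input from degrees to radians
    lng1, lat1, lng2, lat2 = map(np.deg2rad, (lng1, lat1, lng2, lat2))
    d = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * radius_km * np.arcsin(np.sqrt(d))


# Longitude-latitude pairs taken from the `cities` dictionary
los_angeles = (-118.244, 34.052)
san_francisco = (-122.419, 37.775)

la_to_sf = haversine_km(*los_angeles, *san_francisco)
same_point = haversine_km(*los_angeles, *los_angeles)

# The function is vectorized: arrays of points against a single city
lngs = np.array([-118.244, -122.419])
lats = np.array([34.052, 37.775])
to_la = haversine_km(lngs, lats, *los_angeles)
```

The Los Angeles to San Francisco distance should come out in the neighborhood of 560 km, and the distance from a point to itself should be zero.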

In [None]:
def closest_city(lng_array, lat_array, cities):
    """Find the closest city for each row of lng_array and lat_array."""
    distances = []

    # Compute distance from each row of arrays to each of our city inputs
    for city_name, coord in cities.items():
        distances.append(distance(lng_array, lat_array, *coord))

    # Convert this list of numpy arrays into a 2D numpy array
    distances = np.array(distances)

    # Find the shortest distance value for each row
    shortest_distances = np.amin(distances, axis=0)

    # Find the _index_ of the shortest distance for each row. Then use this value to
    # lookup the longitude-latitude pair of the closest city
    city_index = np.argmin(distances, axis=0)

    # Create a 2D numpy array of location coordinates
    # Then use the indexes from above to perform a lookup against the order of cities as
    # input. (Note: this relies on the fact that in Python 3.6+ dictionaries are
    # ordered)
    input_coords = np.array(list(cities.values()))
    closest_city_coords = input_coords[city_index]

    # Return a 2D array with three columns:
    # - Distance to closest city
    # - Longitude of closest city
    # - Latitude of closest city
    return np.hstack((shortest_distances[:, np.newaxis], closest_city_coords))
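
The lookup inside `closest_city` combines `np.argmin` with integer-array indexing. The toy example below is a minimal sketch of that step, using made-up distances for two cities and three rows; the variable names and values are illustrative only.

```python
import numpy as np

# Hypothetical distances in km: one row per city, one column per data row
distances = np.array(
    [
        [10.0, 50.0, 30.0],  # city 0
        [20.0, 5.0, 40.0],  # city 1
    ]
)

# Coordinates of the two cities, in the same order as the rows above
city_coords = np.array([[-118.244, 34.052], [-117.165, 32.716]])

# Smallest distance per data row, and which city it belongs to
shortest = np.amin(distances, axis=0)
city_index = np.argmin(distances, axis=0)

# Indexing with an integer array picks one coordinate pair per data row
closest_coords = city_coords[city_index]

# Three columns: distance, longitude, latitude of the closest city
result = np.hstack((shortest[:, np.newaxis], closest_coords))
```

Each row of `result` corresponds to one input row, which is why it can be assigned directly to three new DataFrame columns.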

Then use the `closest_city` function on our data to create three new columns:

In [None]:
housing[['closest_city_dist', 'closest_city_lng', 'closest_city_lat']] = closest_city(
    housing['longitude'], housing['latitude'], cities
)

This map shows the distance between each location and its nearest big city.

In [None]:
distance_to_big_cities = UnfoldedMap()
dist_data_id = uuid4()

distance_to_big_cities.add_dataset(
    {
        "uuid": dist_data_id,
        "label": "Distance to closest big cities",
        "data": housing,
    },
    auto_create_layers=False,
)

distance_to_big_cities.add_layer(
    {
        "id": "closest_distance",
        "type": "arc",
        "config": {
            "data_id": dist_data_id,
            "label": "distance to closest big cities",
            "columns": {
                "lng0": "longitude",
                "lat0": "latitude",
                "lng1": "closest_city_lng",
                "lat1": "closest_city_lat",
            },
            "vis_config": {"opacity": 0.8, "thickness": 0.3},
            "is_visible": True,
        },
    }
)

distance_to_big_cities.set_view_state(
    {"longitude": -119.417931, "latitude": 36.778259, "zoom": 4.5}
)

distance_to_big_cities

## Data Preprocessing

Here we prepare the data for training a TensorFlow model:

In [None]:
# Drop rows with null values; they make up less than 5% of the data
housing.dropna(inplace=True)

X = pd.DataFrame(
    columns=[
        "longitude",
        "latitude",
        "housing_median_age",
        "total_rooms",
        "total_bedrooms",
        "population",
        "households",
        "median_income",
        "ocean_proximity",
    ],
    data=housing,
)
y = pd.DataFrame(columns=["median_house_value"], data=housing)

# One-hot encode ocean_proximity ('NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND');
# drop_first=True omits the first category so the dummy columns are not redundant
X = pd.get_dummies(
    data=X, columns=["ocean_proximity"], prefix=["ocean_proximity"], drop_first=True
)
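
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-category column. The `toy` DataFrame is an assumption for illustration; the `get_dummies` call uses the same arguments as above.

```python
import pandas as pd

# Hypothetical sample with three of the ocean_proximity categories
toy = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND"]}
)

encoded = pd.get_dummies(
    data=toy, columns=["ocean_proximity"], prefix=["ocean_proximity"], drop_first=True
)

# Categories are sorted before encoding, so '<1H OCEAN' comes first and is dropped
encoded_columns = list(encoded.columns)
```

A row whose category was dropped is represented by zeros (or `False`) in all remaining dummy columns, so no information is lost.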

## Data Splitting

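We split the data into training, validation and test sets. Holding out 20% for testing and then 25% of the remainder for validation gives a 60/20/20 split.

In [None]:
# Split the full dataset into train, validation and test sets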
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1
)

# Keep an unscaled copy of the test features for mapping the predictions later
start_values = X_test.copy(deep=True)
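
The two-step split can be checked on a toy dataset. The sketch below assumes 100 made-up rows and the same `test_size` values as above; the `_toy` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Step 1: hold out 20% of all rows for testing
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

# Step 2: hold out 25% of the remaining 80% for validation
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

With 100 rows, `sizes` works out to 60 training, 20 validation and 20 test rows.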

## Feature Scaling

We standardize the features using the mean and standard deviation of the training set only, which avoids leaking information from the validation and test sets.

In [None]:
# feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
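
The sketch below illustrates what fitting on the training set only means in practice. It uses randomly generated data (an assumption, not the housing data) and compares `StandardScaler` against the same arithmetic done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test samples drawn from slightly different distributions
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=55.0, scale=12.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)  # reuses the train statistics

# Equivalent manual computation with the training mean and standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)
```

The scaled training data has zero mean and unit variance per column, while the scaled test data generally does not, since it is transformed with statistics it did not contribute to.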

## Price Prediction Model

Here we specify the architecture of the TensorFlow model:

In [None]:
model = Sequential()

number_of_features = X.shape[1]

# first layer: takes the input features
model.add(Dense(number_of_features, activation="relu", input_dim=number_of_features))

# hidden layers
model.add(Dense(512, activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(32, activation="relu"))

# output layer: a single linear unit for the predicted price
model.add(Dense(1, activation="linear"))

In [None]:
model.compile(loss="mse", optimizer="adam", metrics=["mse", "mae"])
model.summary()
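
The parameter counts reported by `model.summary()` can be cross-checked by hand, since a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 12 input features (8 numeric columns plus 4 dummy columns, which holds only if all five `ocean_proximity` categories are present); the helper `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + n_units  # weights + biases


# Assumption: 8 numeric features + 4 one-hot columns after drop_first
number_of_features = 12

# Units per layer, in the order the layers are added to the model
layer_units = [number_of_features, 512, 512, 256, 128, 64, 32, 1]

per_layer = []
n_inputs = number_of_features
for units in layer_units:
    per_layer.append(dense_params(n_inputs, units))
    n_inputs = units  # this layer's output feeds the next layer

total_params = sum(per_layer)
```

Under that assumption the total is 444,061 parameters, most of them in the two 512-unit layers.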

### Training

Now we train the model:

In [None]:
history = model.fit(
    X_train,
    y_train.to_numpy(),
    batch_size=10,
    epochs=70,
    verbose=1,
    validation_data=(X_val, y_val.to_numpy()),
)

### Evaluation

Here we look at how well the model was trained:

In [None]:
# summarize history for loss
loss_train = history.history["loss"]
loss_val = history.history["val_loss"]
epochs = range(1, len(loss_train) + 1)
plt.figure(figsize=(10, 8))
plt.plot(epochs, loss_train, "g", label="Training loss")
plt.plot(epochs, loss_val, "b", label="Validation loss")
plt.title("Training and Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

In the above chart we can see that the training loss and validation loss are quite close to each other.

Now we can use the model to predict prices on unseen data.

In [None]:
y_pred = model.predict(X_test)

We can see that the loss value on the test data is similar to the loss value on the training data.

In [None]:
model.evaluate(X_test, y_test)
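
Because the loss is the mean squared error, its value is in squared price units and can look very large. The sketch below uses four made-up prices (not model output) to show how MSE relates to the more interpretable RMSE and MAE.

```python
import numpy as np

# Hypothetical true and predicted house values
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 90_000.0])
y_hat = np.array([210_000.0, 140_000.0, 300_000.0, 100_000.0])

errors = y_hat - y_true

mse = np.mean(errors**2)  # squared price units
rmse = np.sqrt(mse)  # back in price units
mae = np.mean(np.abs(errors))  # price units, less sensitive to large errors
```

Here the MSE is 1.75e8, while the RMSE (about 13,229) and the MAE (12,500) are on the same scale as the prices themselves.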

### Prediction

Let's now visualize the predicted prices on the map.

First, create a DataFrame with the predicted values obtained from the model:

In [None]:
predict_data = pd.DataFrame(
    columns=["longitude", "latitude"], data=start_values[["longitude", "latitude"]]
)
# model.predict returns an array of shape (n, 1); flatten it to one value per row
predict_data["price"] = y_pred.ravel()

### Visualization

This map shows the predicted house prices in California.

In [None]:
housing_predict_prices = UnfoldedMap()
price_data_id = uuid4()

housing_predict_prices.add_dataset(
    {
        "uuid": price_data_id,
        "label": "Predict housing prices in CA",
        "data": predict_data,
    },
    auto_create_layers=False,
)

housing_predict_prices.add_layer(
    {
        "id": "housing_prices",
        "type": "hexagon",
        "config": {
            "label": "housing prices",
            "data_id": price_data_id,
            "columns": {"lat": "latitude", "lng": "longitude"},
            "is_visible": True,
            "color_scale": "quantize",
            "color_field": {"name": "price", "type": "real"},
            "vis_config": {
                "colorRange": {
                    "colors": [
                        "#E6F598",
                        "#ABDDA4",
                        "#66C2A5",
                        "#3288BD",
                        "#5E4FA2",
                        "#9E0142",
                        "#D53E4F",
                        "#F46D43",
                        "#FDAE61",
                        "#FEE08B",
                    ]
                }
            },
        },
    }
)

housing_predict_prices.set_view_state(
    {"longitude": -119.417931, "latitude": 36.6, "zoom": 6}
)

housing_predict_prices

## Clustering Model

Let's cluster the predicted data by price level using the KMeans algorithm.

In [None]:
k = 5
km = KMeans(n_clusters=k, init="k-means++")
X = predict_data[["latitude", "longitude", "price"]]

# clustering
dtf_X = X.copy()
dtf_X["cluster"] = km.fit_predict(X)

# add clustering info to the original dataset
predict_data[["cluster"]] = dtf_X[["cluster"]]
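
Since the features are not rescaled before clustering, the price column, with values in the hundreds of thousands, dominates the Euclidean distance, so the clusters end up reflecting price levels. The toy example below is a sketch of that effect with six made-up points; `n_init` and `random_state` are added here only to make the toy run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical rows of (latitude, longitude, price):
# three cheap and three expensive locations spread across the state
toy = np.array(
    [
        [34.0, -118.2, 100_000.0],
        [37.8, -122.4, 110_000.0],
        [36.7, -119.8, 105_000.0],
        [34.1, -118.3, 400_000.0],
        [37.7, -122.5, 410_000.0],
        [36.8, -119.7, 405_000.0],
    ]
)

km_toy = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km_toy.fit_predict(toy)
```

The cheap points share one label and the expensive points share the other, even though each group is geographically scattered.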

### Visualization

Let's show the price clusters in a chart.

In [None]:
fig, ax = plt.subplots()
# Longitude on the x-axis and latitude on the y-axis matches the map orientation
sns.scatterplot(
    x="longitude",
    y="latitude",
    data=predict_data,
    palette=sns.color_palette("bright", k),
    hue="cluster",
    ax=ax,
).set_title(f"Clustering (k={k})")

This map shows the same clusters in their geographic context.

Here we can see that prices are highest for locations close to the largest cities, and lowest for locations that are far from both the cities and the ocean.

In [None]:
prices_dataset_id = uuid4()
unfolded_map_prices = UnfoldedMap()

unfolded_map_prices.add_dataset(
    {"uuid": prices_dataset_id, "label": "Prices", "data": predict_data},
    auto_create_layers=False,
)

unfolded_map_prices.add_layer(
    {
        "id": "prices_CA",
        "type": "point",
        "config": {
            "data_id": prices_dataset_id,
            "label": "clustering of prices",
            "columns": {"lat": "latitude", "lng": "longitude"},
            "is_visible": True,
            "color_scale": "quantize",
            "color_field": {"name": "cluster", "type": "real"},
            "vis_config": {
                "colorRange": {
                    "colors": ["#7FFFD4", "#8A2BE2", "#00008B", "#FF8C00", "#FF1493"]
                }
            },
        },
    }
)

unfolded_map_prices.set_view_state(
    {"longitude": -119.417931, "latitude": 36.778259, "zoom": 4}
)

unfolded_map_prices