### Learning an Emulator Model for Urban Heat Prediction

This notebook demonstrates, step by step, how to train a machine learning regression model on spatial data. The model uses the distribution of green, blue, and grey elements to predict heat, with an output layer from a physical climate model as the target. Technically, we attempt to "emulate" the behavior of the climate model with a data-driven approach. The model is evaluated against a performance metric and then used to predict scenarios of altered land cover.

The workflow contains the following steps:
1. Reading green-blue-grey elements as vector data and conversion to raster
2. Feature engineering, namely distance-to and density-of the elements
3. Stacking all layers, and spatial tiling
4. Sampling training data (nested cross-validation)
5. Parameter grid search for a RandomForestRegressor
6. Performance evaluation of best model
7. Model inspection by feature importance and partial dependence plots
8. Spatial prediction of full layer

In [None]:
import os
import random
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import rioxarray
import rasterio
from rasterio.features import rasterize
from scipy.ndimage import convolve
from scipy.ndimage import distance_transform_edt
from customFunctions import writeRaster

os.getcwd()

#### 1. Reading vector data and conversion to raster

In [None]:
datafolder = "./data"
os.listdir(datafolder)

In [None]:
# An output layer from the climate model is used as reference for rasterizing the vector data
reference_raster = rioxarray.open_rasterio(datafolder + "/T2M_daily_mean_max_topography_2011_2020_present_30.tif")
reference_raster.plot()

In [None]:
ref_transform = reference_raster.rio.transform()
ref_crs = reference_raster.rio.crs
# print the CRS and transform to check
print("CRS of the reference raster:", ref_crs)
print("Transform of the reference raster:", ref_transform)


In [None]:
# load the datasets to rasterize
railways = gpd.read_file(datafolder + "/osm_railways_without_subway.gpkg")
subway = gpd.read_file(datafolder + "/osm_subway.gpkg")
roads = gpd.read_file(datafolder + "/osm_major_roads.gpkg")
water = gpd.read_file(datafolder + "/osm_water.gpkg")
greens = gpd.read_file(datafolder + "/osm_green_without_forest.gpkg")
forest = gpd.read_file(datafolder + "/osm_forest.gpkg")
buildings = gpd.read_file(datafolder + "/osm_buildings.gpkg")

In [None]:
# create a 'dictionary' for iterating over the datasets
datasets = {
    "railways": railways,
    "subway": subway,
    "roads": roads,
    "water": water,
    "greens": greens,
    "forest": forest,
    "buildings": buildings
}
for name, dataset in datasets.items():
    # rasterize the dataset
    rasterized = rasterize(
        [(geom, 1) for geom in dataset.geometry],
        out_shape=reference_raster.shape[1:],
        transform=ref_transform,
        fill=0,
        all_touched=True
    )
    
    # write the rasterized dataset to a GeoTIFF file
    output_filename = os.path.join(datafolder, f"{name}_raster.tif")
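    # keyword names aligned with the other writeRaster calls in this notebook
    # (assumed signature; adjust if customFunctions uses different names)
    writeRaster(rasterized, output_filename, crs=ref_crs, transform=ref_transform)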


#### 2. Feature engineering

Here we start by defining functions that transform the rasters. Distances and focal mean values will provide spatial context information to the learning algorithm that we want to train.

In [None]:
# Define a function to compute distance to nearest feature
def compute_distance_to_nearest_feature(raster, feature_name):
    # Invert the raster to get the distance to the nearest feature
    inverted_raster = np.where(raster > 0, 0, 1)
    # Compute the distance transform
    distance_raster = distance_transform_edt(inverted_raster)
    # multiply by the grid size (30m) to convert to meters
    distance_raster *= 30  # Assuming each grid cell is 30m x 30m
    # Save the distance raster to a new GeoTIFF file
    output_filename = os.path.join(datafolder, f"distance_to_nearest_{feature_name}.tif")
    crs = raster.rio.crs
    transform = raster.rio.transform()
    writeRaster(distance_raster, output_filename, crs=crs, transform=transform)

def compute_convolution(raster, feature_name, kernel_size=33):
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float32) / (kernel_size * kernel_size)
    # Perform convolution with the defined kernel
    convolved_raster = convolve(raster, kernel, mode='constant', cval=0.0)
    # Save the convolved raster to a new GeoTIFF file
    output_filename = os.path.join(datafolder, f"{feature_name}_convolved_{kernel_size*30}m.tif")
    crs = raster.rio.crs
    transform = raster.rio.transform()
    writeRaster(convolved_raster, output_filename, crs=crs, transform=transform)
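
To make the two transforms more tangible, here is a minimal sketch on a toy 5x5 raster with a single feature cell. The array `toy`, the 3x3 kernel, and the variable names are illustrative assumptions and not part of the workflow; only the 30 m cell size is taken from the functions above.

```python
import numpy as np
from scipy.ndimage import convolve, distance_transform_edt

# Toy 5x5 binary raster with a single feature cell in the centre
toy = np.zeros((5, 5), dtype=np.float32)
toy[2, 2] = 1.0

cell_size = 30  # metres, same assumption as in the functions above

# Distance (in metres) from every cell to the nearest feature cell
toy_distance = distance_transform_edt(toy == 0) * cell_size

# Focal mean over a 3x3 window: share of feature cells around each cell
kernel = np.ones((3, 3), dtype=np.float32) / 9
toy_density = convolve(toy, kernel, mode="constant", cval=0.0)

print(np.round(toy_distance, 1))
print(np.round(toy_density, 3))
```

Because `mode='constant'` with `cval=0.0` treats everything outside the raster as feature-free, density values close to the raster edge are likely biased low. This is worth keeping in mind when interpreting predictions near the border.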


In [None]:
# Load the rasterized datasets, and ensure they are in float32 format
greens_raster = rioxarray.open_rasterio(os.path.join(datafolder, "greens_raster.tif")).squeeze().astype(np.float32)
forest_raster = rioxarray.open_rasterio(os.path.join(datafolder, "forest_raster.tif")).squeeze().astype(np.float32)
water_raster = rioxarray.open_rasterio(os.path.join(datafolder, "water_raster.tif")).squeeze().astype(np.float32)
railways_raster = rioxarray.open_rasterio(os.path.join(datafolder, "railways_raster.tif")).squeeze().astype(np.float32)
subway_raster = rioxarray.open_rasterio(os.path.join(datafolder, "subway_raster.tif")).squeeze().astype(np.float32)
roads_raster = rioxarray.open_rasterio(os.path.join(datafolder, "roads_raster.tif")).squeeze().astype(np.float32)
buildings_raster = rioxarray.open_rasterio(os.path.join(datafolder, "buildings_raster.tif")).squeeze().astype(np.float32)

# collect the rasters in a dictionary for iteration
datasets = {
    "greens": greens_raster,
    "forest": forest_raster,
    "water": water_raster,
    "railways": railways_raster,
    "subway": subway_raster,
    "roads": roads_raster,
    "buildings": buildings_raster
}

# Loop through the datasets and compute distance to / density of
# greens, forest, water, railways, subway, roads and buildings
for name, raster in datasets.items():
    compute_distance_to_nearest_feature(raster, name)
    compute_convolution(raster, name)


#### 3. Stacking and tiling

In [None]:
from customFunctions import slice_into_tiles, save_tiles_as_geotiff, save_stacked_raster_as_geotiff


In [None]:
heat_raster = reference_raster.copy().squeeze()
heat_raster.plot()

In [None]:
# load the convolved feature rasters
greens_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "greens_convolved_990m.tif")).squeeze()
forest_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "forest_convolved_990m.tif")).squeeze()
water_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "water_convolved_990m.tif")).squeeze()
railways_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "railways_convolved_990m.tif")).squeeze()
subway_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "subway_convolved_990m.tif")).squeeze()
roads_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "roads_convolved_990m.tif")).squeeze()
buildings_convolved = rioxarray.open_rasterio(os.path.join(datafolder, "buildings_convolved_990m.tif")).squeeze()

# load the distance feature rasters
greens_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_greens.tif")).squeeze()
forest_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_forest.tif")).squeeze()
water_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_water.tif")).squeeze()
railways_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_railways.tif")).squeeze()
subway_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_subway.tif")).squeeze()
roads_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_roads.tif")).squeeze()
buildings_distance = rioxarray.open_rasterio(os.path.join(datafolder, "distance_to_nearest_buildings.tif")).squeeze()

# stack them all, with heat as the first layer
stacked_raster = np.stack([heat_raster, greens_convolved, forest_convolved,
                           water_convolved, railways_convolved, subway_convolved,
                           roads_convolved, buildings_convolved,
                           greens_distance, forest_distance, water_distance,
                           railways_distance, subway_distance, roads_distance,
                           buildings_distance], axis=0)
stacked_raster.shape # should be (15, height, width) where 15 is the number of layers


In [None]:
# slice the stacked raster into 4 tiles, using the provided custom function
tile_size = stacked_raster.shape[1] // 2  # Assuming we want 2x2 tiles
tiles = slice_into_tiles(stacked_raster, tile_size)
print(tiles.shape)  # an expression in the middle of a cell is not displayed, so print it

# the geolocation of the tiles is based on the original raster's transform and CRS
# but we need to adjust the transform for each tile
output_folder = os.path.join(datafolder, "tiles")
save_tiles_as_geotiff(tiles, heat_raster.rio.transform(), heat_raster.rio.crs, output_folder)

# alternative version: save full stacked raster as a single GeoTIFF
output_filename = os.path.join(datafolder, "stacked_raster.tif")
save_stacked_raster_as_geotiff(stacked_raster, heat_raster.rio.transform(), heat_raster.rio.crs, output_filename)
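
The tiling and the per-tile adjustment of the transform can be sketched as follows. This is not the code from `customFunctions`: `slice_into_tiles_sketch` and `tile_origin` are hypothetical stand-ins, and the sketch assumes a north-up grid with square cells and drops incomplete tiles at the edges.

```python
import numpy as np

def slice_into_tiles_sketch(stack, tile_size):
    """Hypothetical stand-in for customFunctions.slice_into_tiles."""
    _, height, width = stack.shape
    tiles, offsets = [], []
    # step through the raster in blocks of tile_size, skipping partial tiles
    for row in range(0, height - tile_size + 1, tile_size):
        for col in range(0, width - tile_size + 1, tile_size):
            tiles.append(stack[:, row:row + tile_size, col:col + tile_size])
            offsets.append((row, col))
    return np.stack(tiles), offsets

def tile_origin(x_min, y_max, cell_size, row, col):
    """Upper-left corner of a tile for a north-up grid with square cells."""
    return x_min + col * cell_size, y_max - row * cell_size

demo_stack = np.arange(2 * 4 * 6).reshape(2, 4, 6)   # (layers, height, width)
demo_tiles, demo_offsets = slice_into_tiles_sketch(demo_stack, tile_size=2)
print(demo_tiles.shape)
print(tile_origin(500000.0, 5800000.0, 30.0, *demo_offsets[-1]))
```

Note that a tile size derived from the raster height only yields exactly 2x2 tiles if the raster is square. For a wider raster, a sketch like this one returns more than four tiles.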

#### 4. Sampling training data

In [None]:
import pandas as pd  # needed below for tabulating the grid search results
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Import custom functions from another Python script in the same folder
from customFunctions import readRaster, writeRaster


In [None]:
os.getcwd()

In [None]:
data, transform, crs, height, width = readRaster(os.path.join(datafolder, "tiles/tile_0.tif"))
training_data = data.reshape(data.shape[0], -1).T  # Reshape to (num_pixels, num_layers)

In [None]:
training_data.shape

In [None]:
heat_layer = training_data[:, 0]  # First layer is heat
features = training_data[:, 1:]   # Remaining layers are features

# Randomly select a fraction of the pixels for training, e.g. 0.1 for 10%
fraction = 0.10
num_pixels = features.shape[0]
sample_size = int(num_pixels * fraction)
random_indices = random.sample(range(num_pixels), sample_size)
X_sample = features[random_indices]    # the predictors (features)
y_sample = heat_layer[random_indices]  # the target variable (heat layer)

# Split the sample into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=42)
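
The sampling above uses `random.sample` without a seed, so each run draws a different subset of pixels. A possible variant is sketched below with a seeded NumPy generator. It also filters out pixels with missing values, which is an assumption: whether the tiles contain any NaN depends on the data. The array `demo_data` is synthetic and only stands in for `training_data`.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for repeatable draws

# Synthetic stand-in for training_data: 1000 pixels, heat + 3 features
demo_data = rng.normal(size=(1000, 4))
demo_data[::50, 0] = np.nan  # pretend some pixels have no heat value

# Keep only pixels where every layer holds a finite value
valid_idx = np.flatnonzero(np.isfinite(demo_data).all(axis=1))

# Draw 10% of the valid pixels without replacement
fraction = 0.10
sample_idx = rng.choice(valid_idx, size=int(valid_idx.size * fraction), replace=False)

X_demo = demo_data[sample_idx, 1:]  # predictors
y_demo = demo_data[sample_idx, 0]   # target
print(X_demo.shape, y_demo.shape)
```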


#### 5. Parameter grid search

This example is for training a RandomForestRegressor, which is an ensemble of decision trees. Details on the algorithm and its hyperparameters can be looked up, for example, on the scikit-learn website. It is an established method, famous for yielding good results on many datasets without requiring much preprocessing (e.g. no scaling needed), and for being less sensitive to sub-optimal parametrization than other methods. For this exercise, we will only try out a very narrow range for a few parameters. For other algorithms, doing a proper parametrization can be crucial.

In [None]:
# note that the computational effort increases drastically with the number of estimators
# and with the number of hyperparameter combinations to try out
param_grid = {'n_estimators': [10, 50],    # how many decision trees the ensemble consists of
              'max_depth': [3, 10, None],  # maximum depth of each tree (None = no limit)
              'min_samples_split': [2, 5], # minimum number of samples a node needs to be split further
              'n_jobs': [-1]}              # use all available cores for parallel processing
# Create a Random Forest Regressor
rf = RandomForestRegressor(random_state=42)
# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='r2', n_jobs=-1)
# Fit the grid search to the training data
grid_search.fit(X_train, y_train)
# print the ranking of estimators with hyperparameters and skill score as table
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values(by='rank_test_score'))
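
To get a feeling for the computational effort, the number of model fits can be counted before starting the search. The sketch below repeats the grid from above under the illustrative name `demo_grid`. It assumes 5 folds and the default `refit=True` of `GridSearchCV`, which adds one final fit on the full training set.

```python
from itertools import product

# Same hyperparameter values as in the grid above (n_jobs omitted, it has a single value)
demo_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 10, None],
    "min_samples_split": [2, 5],
}
cv_folds = 5

# Every combination of values is fitted once per fold
combinations = list(product(*demo_grid.values()))
n_fits = len(combinations) * cv_folds + 1  # +1 for the final refit on all training data
print(len(combinations), n_fits)
```

Adding one more parameter with three candidate values would triple the number of fits, which is why the grid is kept narrow in this exercise.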


#### 6. Performance evaluation

In [None]:
from customFunctions import plotSideBySide, plotSpatialError

In [None]:
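# hyperparameters are set by hand here; compare them with grid_search.best_params_ from the search above
model = RandomForestRegressor(n_estimators=10, max_depth=None, n_jobs=-1, random_state=42)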
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Model R^2 score: {score:.4f}")
# predict the heat layer using the model
predicted_heat = model.predict(features)
# reshape the predicted heat back to the original tile shape
predicted_heat_reshaped = predicted_heat.reshape(height, width)
# use the custom function to plot the original and predicted heat layers side by side
plotSideBySide(data[0], predicted_heat_reshaped)
plotSpatialError(data[0], predicted_heat_reshaped)


In [None]:
# joblib allows saving and loading of trained models
import joblib

output_path = os.path.join(datafolder, "predicted_heat_tile_0_rf.tif")
writeRaster(predicted_heat_reshaped, output_path, transform, crs)
# save the fitted model to a file
model_path = os.path.join(datafolder, "random_forest_model.pkl")
joblib.dump(model, model_path)


We can now test the performance of our trained model on a different tile, i.e. data that it has not seen during training and that is also spatially separated from the training data. This should tell us whether the model learned any transferable patterns from the data, or whether it simply overfit the training data.

In [None]:
tile_path_1 = os.path.join(datafolder, "tiles/tile_1.tif")
tile_1, transform_1, crs_1, height_1, width_1 = readRaster(tile_path_1)
tile_data_reshaped_1 = tile_1.reshape(tile_1.shape[0], -1).T  # (num_pixels, num_layers)
predicted_heat_tile_1 = model.predict(tile_data_reshaped_1[:, 1:])  # Use features only
predicted_heat_reshaped_1 = predicted_heat_tile_1.reshape(height_1, width_1)
# evaluate the model on tile 1
mse = mean_squared_error(tile_1[0].flatten(), predicted_heat_reshaped_1.flatten())
r2 = r2_score(tile_1[0].flatten(), predicted_heat_reshaped_1.flatten())
print(f"Mean Squared Error for Tile 1: {mse:.4f}")
print(f"R^2 Score for Tile 1: {r2:.4f}")
# plot the original and predicted heat layers side by side
plotSideBySide(tile_1[0], predicted_heat_reshaped_1, title1="Original Heat Layer Tile 1", title2="Predicted Heat Layer Tile 1")
plotSpatialError(tile_1[0], predicted_heat_reshaped_1)
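
For reference, the two metrics used above can be written out by hand. The following sketch uses four made-up values (`y_true`, `y_pred`) and only illustrates the definitions behind `mean_squared_error` and `r2_score`.

```python
import numpy as np

# Made-up observations and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

residuals = y_true - y_pred
mse_demo = np.mean(residuals ** 2)               # mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot                  # coefficient of determination
print(round(mse_demo, 4), round(r2_demo, 4))
```

Since R^2 compares the model against simply predicting the mean of the target, it can become negative on an unseen tile when the predictions are worse than that baseline.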


#### 7. Model inspection

A linear regression model comes with weights that tell how important the individual predictive features (X1, X2, ...) are to the prediction (y-pred). A random forest is a strongly non-linear ensemble method that cannot be interpreted as easily. However, several methods exist to derive a "feature importance", e.g. by permutation, or from how much the splits on a feature reduce impurity across the individual decision trees (this is what `feature_importances_` reports in scikit-learn). The details can be found in the documentation, but are outside the scope of this exercise.

In [None]:
features_names = ['greens_convolved', 'forest_convolved', 'water_convolved',
                  'railways_convolved', 'subway_convolved', 'roads_convolved',
                  'buildings_convolved', 'greens_distance', 'forest_distance',
                  'water_distance', 'railways_distance', 'subway_distance',
                  'roads_distance', 'buildings_distance']

# Visualize the feature importances
feature_importances = model.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances)), feature_importances)
plt.xticks(range(len(feature_importances)), features_names, rotation=45)
plt.title("Feature Importances of Random Forest Model")
plt.xlabel("Features")
plt.ylabel("Importance Value")
plt.grid(axis='y')
plt.tight_layout()
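
The permutation approach mentioned above can be sketched with `permutation_importance` from scikit-learn. The example uses synthetic data in which only the first feature carries signal. The names `X_perm`, `y_perm`, and `perm_model` are illustrative and not part of the workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on the first feature only
rng = np.random.default_rng(0)
X_perm = rng.uniform(size=(300, 3))
y_perm = 3.0 * X_perm[:, 0] + 0.05 * rng.normal(size=300)

perm_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_perm, y_perm)

# Shuffle one column at a time and record how much the score drops
perm = permutation_importance(perm_model, X_perm, y_perm, n_repeats=5, random_state=0)
for name, mean, std in zip(["x0", "x1", "x2"], perm.importances_mean, perm.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In the notebook, the same call could be applied to `model` with the held-out `X_test` and `y_test`, so that the importance reflects performance on data not used for fitting.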


While the feature importance gives an idea of how much the model makes use of particular features, it does not tell anything about the direction of influence. When trying to understand the model behavior, we usually want an answer to questions like "Is more or less heat predicted in areas of high building density?", or even "How far from the forest edge does the model still predict a cooling effect?". Such insights can be derived from partial dependence plots. These plots visualize the marginal effect of a single feature on the prediction, averaged over the observed values of all other features. Keep in mind, though, that interaction effects can cancel out in this average.

In [None]:
fig, ax = plt.subplots(figsize=(12, 18))
PartialDependenceDisplay.from_estimator(
    model,
    X_train,
    features=range(len(features_names)),
    feature_names=features_names,
    ax=ax,
    grid_resolution=200
)
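
What a partial dependence curve computes can be sketched in a few lines: force one feature to a grid value for all samples, predict, and average. The function `manual_partial_dependence` and the toy model `toy_predict` are hypothetical. The toy model is a pure interaction, chosen to show how an effect can vanish in the average.

```python
import numpy as np

def manual_partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite one feature for all samples
        averages.append(predict(X_mod).mean())
    return np.array(averages)

def toy_predict(X):
    # Hypothetical model: pure interaction between feature 0 and feature 1
    return X[:, 0] * X[:, 1]

X_toy = np.array([[0.2, -1.0], [0.5, 1.0], [0.8, -1.0], [0.9, 1.0]])
grid = np.linspace(0.0, 1.0, 5)
pd_x0 = manual_partial_dependence(toy_predict, X_toy, feature=0, grid=grid)
print(pd_x0)
```

Here the curve for feature 0 is flat although the predictions clearly depend on it, because the positive and negative values of feature 1 cancel out when averaging.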

#### 8. Prediction of the full layer

With the model stored in a file and its behavior roughly understood, we can proceed to apply the model to the full spatial extent and/or to different future scenarios.

In [None]:
from customFunctions import predictHeatLayer
help(predictHeatLayer) # check the docstring of the function

In [None]:
rf_model = joblib.load(os.path.join(datafolder, "random_forest_model.pkl"))
predictHeatLayer(datafolder + "/stacked_raster.tif", rf_model, datafolder + "/prediction_subset.tif")
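
The general pattern behind such a full-layer prediction might look like the sketch below. It is not the actual `predictHeatLayer` implementation: `predict_layer_sketch` is a hypothetical outline that assumes the heat layer comes first in the stack and that missing values are encoded as NaN.

```python
import numpy as np

def predict_layer_sketch(stack, predict):
    """Hypothetical outline, not the actual predictHeatLayer implementation."""
    n_layers, height, width = stack.shape
    pixels = stack.reshape(n_layers, -1).T      # (num_pixels, num_layers)
    features = pixels[:, 1:]                    # drop the heat layer
    valid = np.isfinite(features).all(axis=1)   # skip pixels with missing features
    prediction = np.full(height * width, np.nan, dtype=np.float32)
    prediction[valid] = predict(features[valid])
    return prediction.reshape(height, width)

# Synthetic stack with 1 heat layer + 2 feature layers and one missing value
demo = np.random.default_rng(1).uniform(size=(3, 4, 5)).astype(np.float32)
demo[1, 0, 0] = np.nan
out = predict_layer_sketch(demo, lambda X: X.sum(axis=1))
print(out.shape, int(np.isnan(out).sum()))
```

With a fitted model, the second argument would be its `predict` method, e.g. `predict_layer_sketch(stack, rf_model.predict)`.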