This notebook trains a Regression-Kriging model with the daily average number of accidents per m2 for each LOR as the response variable and the daily average traffic volume per LOR as the predictor. First, a RandomForestRegressor is fitted to the data; in a second step, the regression residuals are kriged. We use the RegressionKriging class from the PyKrige package.
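
In symbols, the two steps combine roughly as follows. The notation is ours and is not taken from PyKrige: s_0 is a prediction location, x(s_0) the predictor value there, \hat{f} the fitted regressor, and e(s_i) = y(s_i) - \hat{f}(x(s_i)) the regression residual at training location s_i. The constraint that the weights sum to one assumes ordinary kriging, with n the number of neighbouring training points used.

```latex
\hat{y}(s_0) = \hat{f}\bigl(x(s_0)\bigr) + \hat{e}(s_0),
\qquad
\hat{e}(s_0) = \sum_{i=1}^{n} \lambda_i(s_0)\, e(s_i),
\qquad
\sum_{i=1}^{n} \lambda_i(s_0) = 1
```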

We load the required packages and retrieve the input data for the model from our local database.

In [1]:
import sqlite3
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from pykrige.rk import RegressionKriging
import numpy as np
from joblib import dump

YEAR = "2019"

conn = sqlite3.connect("data/accidents.db")
query = conn.execute("SELECT centroid_lon, centroid_lat, daily_traffic_cars_per_m2, mean_accidents_per_m2_" + YEAR + " FROM accidents_per_lor")
cols = [column[0] for column in query.description]
df = pd.DataFrame.from_records(data = query.fetchall(), columns = cols)
conn.close()

We prepare the data for use in the sklearn environment and create a test set which contains 30% of the total data.

In [2]:
X = np.array(df["daily_traffic_cars_per_m2"]).reshape(-1, 1)
y = np.array(df["mean_accidents_per_m2_" + YEAR])
# Stack the centroid coordinates into an (n_samples, 2) array of (lon, lat) pairs.
lonlat = df[["centroid_lon", "centroid_lat"]].to_numpy()

X_train, X_test, y_train, y_test, lonlat_train, lonlat_test = train_test_split(
    X, y, lonlat, test_size=0.3, random_state=69
    )

We fit the Regression-Kriging model to the predictor, response and coordinates of the training set. The RegressionKriging class from PyKrige takes a scikit-learn regressor as its regression_model argument; here we pass a RandomForestRegressor.

In [3]:
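# No random_state is set, so the fitted forest and the scores reported below
# can differ slightly between runs.
rf = RandomForestRegressor(n_estimators=300)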

rk = RegressionKriging(regression_model=rf, n_closest_points=10)
rk.fit(X_train, lonlat_train, y_train)

Finished learning regression model
Finished kriging residuals
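
To make the two fitting stages concrete, here is a minimal NumPy-only sketch of the same idea. It is not PyKrige's implementation: the toy data are made up, a straight-line fit stands in for the random forest, the variogram is assumed to be linear without a nugget, and all points are used instead of only the 10 closest ones. The helper names (`linear_variogram`, `ordinary_krige`, `rk_predict`) are hypothetical.

```python
import numpy as np

def linear_variogram(h, slope=1.0):
    """Assumed variogram for this sketch: gamma(h) = slope * h, no nugget."""
    return slope * h

def ordinary_krige(coords, values, target):
    """Ordinary-kriging estimate at `target` from `values` observed at `coords`."""
    n = len(coords)
    # Pairwise distances between observations, and distances to the target.
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d_tgt = np.linalg.norm(coords - target, axis=1)
    # Kriging system: variogram matrix bordered by the sum-to-one constraint.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = linear_variogram(d_obs)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = linear_variogram(d_tgt)
    weights = np.linalg.solve(A, b)[:n]  # drop the Lagrange multiplier
    return float(weights @ values)

# Toy data, invented for illustration: 2-D coordinates, one predictor, a response.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(20, 2))
x = rng.uniform(0, 5, size=20)
y = 2.0 * x + 0.3 * coords[:, 0] + rng.normal(0, 0.1, size=20)

# Step 1: regression on the predictor (a line here, a random forest in the notebook).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Step 2: krige the residuals and add them back to the regression prediction.
def rk_predict(x_new, coord_new):
    trend = slope * x_new + intercept
    return trend + ordinary_krige(coords, residuals, np.asarray(coord_new))

print(rk_predict(2.5, [5.0, 5.0]))
```

Because the assumed variogram has no nugget, the kriging step reproduces the training residuals exactly at the training locations, so this sketch interpolates the training responses. Any spatial structure the regression misses (here the dependence on the first coordinate) is picked up by the residual step.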


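On the held-out test set, the random forest alone reaches an R² of about 0.36 and the full Regression-Kriging model about 0.44, so kriging the residuals improves on the regression. That is not a great fit, but it is not bad either, and it is good enough for the purpose of this sample application. We save the trained model to disk.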

In [4]:
print("Regression Score: ", rk.regression_model.score(X_test, y_test))
print("RK score: ", rk.score(X_test, lonlat_test, y_test))

# store fitted model
dump(rk, 'model.joblib')

Regression Score:  0.36284125153357194
RK score:  0.4408776625093902


['model.joblib']
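
The stored model can later be reloaded and queried for a single location. The following is a sketch under assumptions: the helper names `predict_accident_density` and `load_and_predict` are hypothetical, the example traffic value and coordinates are invented, and the argument order of `predict` (predictors first, then coordinates) is assumed to mirror the `fit` and `score` calls used above. Loading the file also requires PyKrige and scikit-learn to be installed in compatible versions.

```python
from pathlib import Path

import numpy as np

def predict_accident_density(model, traffic_per_m2, lon, lat):
    """Query a fitted Regression-Kriging model for one location."""
    p = np.array([[traffic_per_m2]])  # predictor, shape (1, 1)
    x = np.array([[lon, lat]])        # coordinates, shape (1, 2), same order as in training
    return float(model.predict(p, x)[0])

def load_and_predict(path="model.joblib"):
    """Reload the model saved by the notebook and predict for one made-up location."""
    from joblib import load  # imported here so the helper above needs only NumPy
    model = load(path)
    # Hypothetical inputs: a traffic density and a lon/lat pair chosen for illustration.
    return predict_accident_density(model, 0.002, 13.40, 52.52)

if Path("model.joblib").exists():
    print(load_and_predict())
```

Keeping the array shaping in a small helper avoids the most common mistake when calling the model by hand, which is passing 1-D arrays where a 2-D array with one row per location is expected.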