# Predicting hotspots of burglaries

In this tutorial, hotspots of burglaries will be predicted in the city of Dallas. Three different models will be trained and their predictions will be validated.

Start by importing the hotspot library.

In [13]:
import predictivehp.processing.data_processing as dp
import predictivehp.models.models as models

## Importing crime data

The tutorial will consider crime data provided by the City of Dallas. The data can be imported with the Socrata API. The street map and city limits can be obtained from Dallas City Hall (https://gis.dallascityhall.com/shapefileDownload.aspx). For convenience, the data has already been stored in the package.

In [14]:
# TO BE CONFIRMED

# data = hp.import.tutorial('Dallas2017')
# The 'data' output is an instance of an object, which has all the relevant
# information and data such as a dataframe with incidences and shape files of
# the city. Either store the data in a folder or request data from Socrata with
# default parameters.

In general, data can be retrieved from the Socrata API and shape files can be specified as well.

In [15]:
b_path = 'predictivehp/data'
s_shp_p = f'{b_path}/streets.shp'
c_shp_p = f'{b_path}/councils.shp'
cl_shp_p = f'{b_path}/citylimit.shp'

pp = dp.PreProcessing()
shps = pp.shps_processing(s_shp_p, c_shp_p, cl_shp_p)
data = pp.get_data(year=2017, n=150000)

The crime data is stored in a Pandas DataFrame where columns **x** and **y** are the location coordinates in meters and **date** the date of the crime event.

In [16]:
data.head()

| Unnamed: 0 | x | y | date | month1 | y_day |
|---|---|---|---|---|---|
| 0 | 2526789.0 | 6972383.0 | 2017-01-01 | January | 1 |
| 1 | 2525848.0 | 6971430.0 | 2017-01-01 | January | 1 |
| 2 | 2483264.0 | 6927857.0 | 2017-01-01 | January | 1 |
| 3 | 2504030.0 | 6966014.0 | 2017-01-01 | January | 1 |
| 4 | 2479656.0 | 6956554.0 | 2017-01-01 | January | 1 |


## Specify the prediction configuration

Data has been loaded for the entire year of 2017. In this tutorial, the first week of November will be predicted from previous crime data. Since data is available for this week, the prediction can be validated as well.

All three available models (STKDE, ProMap and Random Forest) will be used for the prediction.

In [17]:
from datetime import date
model = models.create_model(
    data=data, shps=shps,
    start_prediction=date(2017, 11, 1), length_prediction=7,
    use_stkde=True, use_promap=True, use_rfr=True,
    read_rfr=True
)
# The output is an object that has all relevant attributes, for example the data
# for the incidences and a separate object for each model (stkde, promap, rfr).
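
The prediction window described above (history before 1 November 2017, seven days to predict) amounts to a split of the events by date. The following is a minimal sketch of that split, not the package's own code: the helper `split_by_window` and the toy events are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def split_by_window(events, start_prediction, length_prediction):
    """Split (x, y, date) tuples into history and prediction-window events.

    Hypothetical helper for illustration; not part of predictivehp.
    """
    end_prediction = start_prediction + timedelta(days=length_prediction)
    # Everything strictly before the window is available for training.
    train = [e for e in events if e[2] < start_prediction]
    # The window is half-open: start day included, end day excluded.
    test = [e for e in events if start_prediction <= e[2] < end_prediction]
    return train, test


# Toy events with made-up coordinates, not real Dallas data.
events = [
    (0.0, 0.0, date(2017, 10, 30)),
    (1.0, 1.0, date(2017, 11, 1)),
    (2.0, 2.0, date(2017, 11, 7)),
    (3.0, 3.0, date(2017, 11, 8)),
]
train, test = split_by_window(events, date(2017, 11, 1), 7)
print(len(train), len(test))
```

With `length_prediction=7` the window covers 1 to 7 November, so the event on 8 November falls outside both sets.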

## Specifying model parameters

Optionally, hyperparameters for the predictive models can be specified.

In [18]:
model.stkde.set_parameters(bw=[100, 200, 20])
# No output.

In [19]:
model.promap.set_parameters(bw=[400, 400, 7], hx=100, hy=100)
# No output.

In [20]:
model.rfr.set_parameters(t_history=4,
                         xc_size=100, yc_size=100, n_layers=7,
                         label_weights=None)
# No output.

In [21]:
model.print_parameters()
# Print all hyperparameters of the models.

STKDE Hyperparameters
bandwith x: 100 ft.
bandwith y: 200 ft.
bandwith t: 20 days

ProMap Hyperparameters
bandwith x: 400 mts
bandwith y: 400 mts
bandwith t: 7 days
hx: 100 mts
hy: 100 mts

RFR Hyperparameters
Training History:   4 weeks
xc_size:            100 m
yc_size:            100 m
n_layers:           7
l_weights:          None



In [22]:
model.preprocessing()
# Instantiates the PreProcessing class.

## Training the models

The models will be trained on historical crime data to optimize the prediction.

In [23]:
model.fit()
# No output other than messages. Perform the calculations for the training.


Finished! (0.0 sec)

Finished! (3.7 sec)


The STKDE model has calculated the optimal space-time bandwidth for the prediction.

In [24]:
model.stkde.bw
# Output is a list of the x-, y-, and t-bandwidth.

array([100, 200,  20])

The ProMap model is a *lazy learner* and does not require any training: the space-time weights are fixed and only a rectangular mesh needs to be created.
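
To illustrate what "lazy" means here, the sketch below scores the cells of a rectangular mesh with a fixed space-time weight and no fitting step. It is an assumption-laden illustration: the functions `mesh_centers` and `fixed_weight_score` are hypothetical, and the inverse-distance, inverse-age weight was chosen for simplicity and is not necessarily the kernel used by the package.

```python
import math


def mesh_centers(x_min, x_max, y_min, y_max, hx, hy):
    """Return the cell centres of a rectangular mesh with cell size hx by hy."""
    nx = int((x_max - x_min) // hx)
    ny = int((y_max - y_min) // hy)
    return [(x_min + (i + 0.5) * hx, y_min + (j + 0.5) * hy)
            for j in range(ny) for i in range(nx)]


def fixed_weight_score(center, events, t_now, bw_x, bw_y, bw_t, hx):
    """Sum fixed weights of past events near a cell; nothing is learned.

    The weight 1 / ((1 + distance in cells) * (1 + age)) is an
    illustrative assumption.
    """
    cx, cy = center
    score = 0.0
    for x, y, t in events:
        age = t_now - t
        if not 0 <= age <= bw_t:
            continue  # outside the temporal bandwidth
        if abs(x - cx) > bw_x or abs(y - cy) > bw_y:
            continue  # outside the spatial bandwidth
        dist_cells = math.hypot(x - cx, y - cy) / hx
        score += 1.0 / ((1.0 + dist_cells) * (1.0 + age))
    return score


# Toy events as (x, y, day); the third one is too old to count.
events = [(50.0, 50.0, 0), (150.0, 50.0, 3), (50.0, 50.0, -20)]
centers = mesh_centers(0, 200, 0, 100, hx=100, hy=100)
scores = [fixed_weight_score(c, events, t_now=4,
                             bw_x=400, bw_y=400, bw_t=7, hx=100)
          for c in centers]
print(scores)
```

Because the weights are fixed in advance, the only preparation needed is building the mesh, which matches the description of the model above.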

The RFR model has created a forest of decision trees. Interestingly, the relative importance of the features can be retrieved. The table shows the feature importance with respect to the units of history and the depth of the layers.

In [None]:
model.rfr.plot_feature_importance()
# Output is a table with the index of the layer and the index of the history
# unit (for example, week -1, week -2, until week -n). Color and values indicate
# the feature importance.

## Predicting future crimes

The crimes in the first week of November will be predicted with the different models. The results are risk scores that indicate the likelihood that a future crime occurs at a given location. These propensity scores are normalized to a scale of 0 to 1.

In [12]:
model.predict()
# No output other than messages. Perform the calculations for the prediction.

  return _prepare_from_string(" ".join(pjargs))
Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: EPSG:2276
Right CRS: None

  valid_inc = gpd.tools.sjoin(geo_inc,


Model not found.


TypeError: predict() argument after * must be an iterable, not NoneType
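
The propensity scores are described above as normalized to a scale of 0 to 1. The tutorial does not say how the package does this, so the following sketch simply assumes min-max scaling; the function `normalize_scores` is hypothetical.

```python
def normalize_scores(scores):
    """Rescale raw risk scores linearly so the smallest is 0 and the largest 1.

    Min-max scaling is an assumption made for illustration.
    """
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All cells carry the same risk; avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


raw = [2.0, 4.0, 6.0]
normalized = normalize_scores(raw)
print(normalized)  # [0.0, 0.5, 1.0]
```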

The STKDE model has created a function that gives the risk associated with an arbitrary point on the map at any time in the prediction window. This score can be calculated at each point in space and time.

In [None]:
model.stkde.score(x=123456, y=987654, t=4)
# Output is the value of the 'pdf' function in this point.
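
For intuition, a space-time kernel density estimate can be evaluated at a single point as sketched below. This is a generic textbook form with a product of Gaussian kernels, which is an assumption; the package may use a different kernel, and `stkde_like_density` is a hypothetical name.

```python
import math


def gaussian(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def stkde_like_density(x, y, t, events, bw):
    """Evaluate a product-Gaussian space-time KDE at the point (x, y, t).

    events: list of (x, y, t) tuples; bw: (hx, hy, ht) bandwidths.
    The Gaussian product kernel is an illustrative assumption.
    """
    hx, hy, ht = bw
    total = 0.0
    for xi, yi, ti in events:
        total += (gaussian((x - xi) / hx)
                  * gaussian((y - yi) / hy)
                  * gaussian((t - ti) / ht))
    return total / (len(events) * hx * hy * ht)


# One toy event at the origin, with the bandwidths set earlier (100, 200, 20).
events = [(0.0, 0.0, 0.0)]
near = stkde_like_density(0.0, 0.0, 0.0, events, (100, 200, 20))
far = stkde_like_density(500.0, 0.0, 0.0, events, (100, 200, 20))
print(near > far)  # True
```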

The ProMap model has created propensity scores for each cell in the rectangular mesh. These are stored in a matrix where each element corresponds to the score of the corresponding cell in the mesh.

In [None]:
model.promap.score()
# Output is the matrix with scores.

Similarly, RFR also creates scores for each cell in a rectangular mesh.

In [None]:
model.rfr.score()
# Output is the column of the dataframe with the predicted values.

## Visualizing the prediction

The prediction can be visualized as a heat map, where the values indicate the likelihood of crime occurrence in the prediction window. Hotspots, that is, areas with a high crime risk, are normally visible.

In [None]:
model.plot_heatmap()
# Plot the heatmaps of all models, with the same axis and the same colorbar.
# Include optional parameters like 'crange=[0.2,0.8]' to plot values between 0.2
# and 0.8 only, cmap='jet', xlim=[123456,234567], savefig=True, etc.

The STKDE method provides space-time prediction, which can be visualized with a four-dimensional plot.

In [None]:
model.stkde.plot_4Dheatmap()
# Output is the 4D heatmap. If not inline, store the data in a file.

## Validation of the prediction

When data is available for the prediction window, the predictions can be validated.

The testing procedure consists of an experiment in which we assume that all areas with a high score are patrolled and that all incidences of crime inside these hotspot areas are detected. Multiple thresholds of the propensity score can be specified at the same time.

In [None]:
model.validate(score=[0.5, 0.9])
# No output other than messages. Perform calculations of the propensity score
# for these values of c.

The models have calculated the number of detected incidences for each score threshold.

In [None]:
model.detected_incidences()
# The output is the number of hits for each model and each value of c.
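
The validation experiment can be sketched for a single model as follows. The function `count_detected` and the toy cells are hypothetical, and the sketch assumes that each incidence has already been assigned to a mesh cell and that a cell is a hotspot when its score is at least the threshold `c`.

```python
def count_detected(cell_scores, incident_cells, thresholds):
    """Count the incidences that fall inside the hotspot for each threshold.

    cell_scores: dict mapping a cell id to its normalized score.
    incident_cells: cell id of every incidence in the prediction window.
    """
    detected = {}
    for c in thresholds:
        # Assumed rule: cells scoring at least c are patrolled.
        hotspot = {cell for cell, score in cell_scores.items() if score >= c}
        detected[c] = sum(1 for cell in incident_cells if cell in hotspot)
    return detected


# Toy mesh with four cells and four incidences.
cell_scores = {"a": 0.95, "b": 0.6, "c": 0.2, "d": 0.1}
incident_cells = ["a", "a", "b", "c"]
hits = count_detected(cell_scores, incident_cells, [0.5, 0.9])
print(hits)  # {0.5: 3, 0.9: 2}
```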

More detected incidences indicate a better performance of the model. However, this could be at the expense of a larger area to be patrolled. Hence, it is important to calculate the area of the hotspots as well.

In [None]:
model.hotspot_area()
# The output is the hotspot area for each model and each value of c.

### Hit rate

Which model performs best can depend on the size of the hotspot area. Therefore, let us plot the hit rate with respect to the area. The *hit rate* is defined as the number of detected incidences inside the hotspots divided by the total number of crime incidences in the prediction window.

In [None]:
model.plot_hr()
# The output is a figure with the HR for each model.
# Optional parameters can be colors=['r','b','g'], models=['stkde','promap'],
# title='mytitle', savefig=True, etc.
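
The points behind such a hit-rate curve can be computed as in the sketch below. It assumes equally sized cells, so that the area fraction is just the share of cells in the hotspot, and at least one incidence in the window; `hit_rate_curve` is a hypothetical helper.

```python
def hit_rate_curve(cell_scores, incident_cells, thresholds):
    """Return (area fraction, hit rate) pairs, one per score threshold.

    Assumes equally sized cells and a non-empty list of incidences.
    """
    n_cells = len(cell_scores)
    n_incidences = len(incident_cells)
    curve = []
    for c in thresholds:
        hotspot = {cell for cell, score in cell_scores.items() if score >= c}
        n_hits = sum(1 for cell in incident_cells if cell in hotspot)
        curve.append((len(hotspot) / n_cells, n_hits / n_incidences))
    return curve


# Toy mesh with four cells and four incidences.
cell_scores = {"a": 0.95, "b": 0.6, "c": 0.2, "d": 0.1}
incident_cells = ["a", "a", "b", "c"]
curve = hit_rate_curve(cell_scores, incident_cells, [0.9, 0.5, 0.0])
print(curve)  # [(0.25, 0.5), (0.5, 0.75), (1.0, 1.0)]
```

Lowering the threshold enlarges the hotspot, so both coordinates grow together until the whole mesh is covered and every incidence is a hit.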

### Predictive accuracy index

The hit rate never decreases when the hotspot area grows. To balance the merits of many hits and the costs of a large area, the predictive accuracy index (PAI) can be used. It is defined as the hit rate divided by the hotspot area expressed as a fraction of the total city area.

In [None]:
model.plot_pai()
# The output is a figure with the PAI for each model.
# Optional parameters as in plot_hr().
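
As a worked example of the definition above, the PAI can be computed from four numbers. The function name `pai` and the toy values are illustrative only.

```python
def pai(n_hits, n_incidences, hotspot_area, city_area):
    """Predictive accuracy index: hit rate divided by the area fraction."""
    hit_rate = n_hits / n_incidences
    area_fraction = hotspot_area / city_area
    return hit_rate / area_fraction


# Toy case: 3 of 4 incidences detected while covering half of the city.
print(pai(3, 4, 2.0, 4.0))  # 1.5
# Covering the whole city detects everything but gives a PAI of only 1.
print(pai(4, 4, 4.0, 4.0))  # 1.0
```

A PAI above 1 means the hotspots capture more crime than their share of the city area would suggest.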

### Heatmap

The performance can also be visualized by a heatmap. For a given score threshold, the hotspots are marked by the blue areas. The markers represent the incidences in the prediction window, where green markers are hits and red markers are misses.

In [None]:
model.plot_heatmap(c=0.5, incidences=True)
# Plot heatmaps for each model. In this case with red (or another color) for
# areas with c larger than the input.
# The rest of the area is transparent. Also, plot the incidences in the
# prediction window with markers (points, crosses, etc.) with different colors
# for the hits and misses.
# In general, you want to specify the area, not the c. For example,
# when area=0.05, you want to plot the hotspots with 5% of the total area. So,
# you need to retrieve the corresponding c for each model.
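
The last comment asks for the score threshold `c` that corresponds to a given area fraction. One way to sketch this, assuming equally sized cells and a hotspot rule of "score at least `c`", is to rank the cell scores and read off the score of the last cell that still fits in the area budget. The helper `threshold_for_area` is hypothetical.

```python
import math


def threshold_for_area(scores, area):
    """Return the score c such that the top `area` fraction of cells has score >= c.

    Assumes equally sized cells; ties at the threshold can make the
    resulting hotspot slightly larger than requested.
    """
    ranked = sorted(scores, reverse=True)
    # Number of cells in the hotspot; the small offset guards against
    # floating-point round-off pushing the product just above an integer.
    k = max(1, math.ceil(area * len(ranked) - 1e-9))
    return ranked[k - 1]


# Toy scores for a mesh of eight cells.
scores = [0.1, 0.8, 0.3, 0.9, 0.5, 0.2, 0.7, 0.4]
c = threshold_for_area(scores, area=0.25)
print(c)  # 0.8
```

Since the models produce different score distributions, the same area fraction generally yields a different `c` for each model.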

## Storing data

The data of the models can be stored, which is convenient to reduce computational time when performing the same simulation another time.

In [None]:
model.store(file_name='model.data')
# Store all useful data to disk. Can be NumPy arrays or Pickle objects and in
# different files.
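
As a rough illustration of the Pickle option mentioned in the comment, the sketch below writes a dictionary of results to a file and reads it back. The contents of `results` are made up, and how the package actually lays out its stored files is not specified here.

```python
import os
import pickle
import tempfile

# Made-up results standing in for whatever the models need to keep.
results = {"thresholds": [0.5, 0.9], "hits": [3, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.data")
    # Store the object on disk.
    with open(path, "wb") as fh:
        pickle.dump(results, fh)
    # Load it again, as a later session would do to skip the computation.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == results)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.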