# Creating Models for Forecasting

This notebook runs several models that estimate the risk of crime in each grid cell over a given area (such as Durham).

The following models are currently implemented:
- Random: Rank all cells completely randomly. Baseline lowest-performing model.
- Naive: Count the number of events per cell.
- PHS: Each event spreads risk radially outward, decreasing in time and space depending on the parameters given to it.
- Ideal: Impossibly ideal model that "cheats" by reading the testing data instead of the training data. Baseline highest-performing model.
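
As a rough illustration of the Naive model described in the list above, the following sketch assigns events to square grid cells and ranks the cells by event count. The function names, the grid origin, and the sample coordinates are assumptions made for this example; they are not taken from `riskModelsGeneric`.

```python
from collections import Counter

def naive_cell_counts(events, origin, cell_width):
    """Count events per square grid cell.

    events: iterable of (easting, northing) pairs in metres (illustrative data).
    origin: (x0, y0) of the grid's lower-left corner (an assumption of this sketch).
    """
    x0, y0 = origin
    counts = Counter()
    for x, y in events:
        col = int((x - x0) // cell_width)
        row = int((y - y0) // cell_width)
        counts[(col, row)] += 1
    return counts

def rank_cells(counts):
    """Order cells from highest to lowest count; ties are broken by cell index."""
    return sorted(counts, key=lambda cell: (-counts[cell], cell))

# Made-up event coordinates, in metres
events = [(100, 100), (450, 20), (600, 100), (120, 480), (1300, 900)]
counts = naive_cell_counts(events, origin=(0, 0), cell_width=500)
ranking = rank_cells(counts)
print(ranking)
```

A Random baseline could be sketched the same way by shuffling the list of cells instead of sorting it by count.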

Specifically, this notebook is designed to simply provide a forecast of the highest-risk locations within a region, given previous history of crime events in the region. Alternatively, if you are interested in evaluating the efficacy of these models by testing them against historic data, see the related "hindcasting" notebook instead.

## Import Modules

Necessary modules are imported here.

In [1]:
### Run this without editing anything

# Import necessary tools from modules.
import sys
import os
import datetime
import json
import pandas as pd
sys.path.insert(0, os.path.abspath(".."))
import riskModelsGeneric
import crimeRiskTimeTools
import geodataTools
import importlib
importlib.reload(riskModelsGeneric)
importlib.reload(crimeRiskTimeTools)
importlib.reload(geodataTools)
from riskModelsGeneric import runModelExperiments, \
                                std_file_name, \
                                graph_cov_vs_hit_from_csv, \
                                loadGenericData
from crimeRiskTimeTools import getSixDigitDate
from geodataTools import list_risk_model_properties, \
                         top_geojson_features, \
                         marker_cluster_from_data, \
                         combine_geojson_features, \
                         json_dict_to_geoframe

print("Successfully imported modules.")

Successfully imported modules.


# Selecting Your Data

Identify one data directory that will contain your input files, and another directory that will contain the output files generated by this notebook. (These can be the same directory.)

In your input directory, you should place these files:
- Input CSV file of crime events. The first line of the file should be a header, with appropriate labels for its columns. Four columns must contain the information listed below for each crime event; any other columns will be ignored. The expected formats of the data can be changed as needed via additional parameters discussed later.
    - Time and date
    - East/West coordinate; could be Eastings or Longitude
    - North/South coordinate; could be Northings or Latitude
    - Crime type, e.g. Burglary
- Geojson file that will generate a polygon of the relevant region
    - For Chicago, this can be found at:
        - https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6
    - For regions of the UK, here is one process of generating a file from Ordnance Survey data:
        - Visit https://www.ordnancesurvey.co.uk/opendatadownload/products.html
        - Scroll down to the "Boundary-Line" data, select ESRI SHAPE format, click the "Download" box, then scroll to the bottom and click "Continue".
        - After requesting the download from the next page, wait for a download link to be sent to your email, which should allow you to download a "force_kmls.zip" file full of .kml files.
        - Use the "ogr2ogr" tool to convert the relevant .kml file to .geojson, as in the following command: "ogr2ogr -f GeoJSON durham.geojson durham.kml"
        - The resulting GeoJSON file will have its coordinates in longitude and latitude. Optionally, you may convert that GeoJSON file to a new one that has a UK-specific projection (EPSG 27700); this can be done with the function convertGeojsonUKCounty in onetimeruns.py.
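
Before running the models, it can help to confirm that the input CSV header contains the four required columns. The sketch below is one way to do that with the standard library. The helper name `missing_columns` is hypothetical, the column names are the defaults this notebook uses for the Fantasy Durham data, and the sample row is made up for illustration.

```python
import csv
import io

def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Illustrative data only: one header line and one invented event row
sample = (
    "Date,Longitude,Latitude,Crime type,Notes\n"
    "01/09/2019,-1.57,54.77,Burglary,example row\n"
)
required = ["Date", "Longitude", "Latitude", "Crime type"]

missing = missing_columns(sample, required)
incomplete = missing_columns("Date,Lon\n", required)
print(missing, incomplete)
```

For a real file, the same check can be applied to the text returned by reading the CSV from the input directory.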

## Set Parameters

Set your parameters here. The current default arguments are for a Fantasy Durham data set; examples of further options are shown on commented-out lines.

In [2]:
### Edit these parameters, then run this.


# Your input directory, containing your input data files discussed above
input_datadir = "../../Input"

# Your intended directory for the output files that will be generated
output_datadir = "../../Output"

# Name of your input csv file of crime data
in_csv_file_name = "Fantasy-Durham-Data.csv"

# Names of the appropriate columns in the header of the csv file
#    column with date (and possibly time)
csv_date_name       = "Date"
#    column with eastings or longitudes
csv_east_name       = "Longitude"
#    column with northings or latitudes
csv_north_name      = "Latitude"
#    column with type of crime
csv_crimetypes_name = "Crime type"

# Name of your input GeoJSON file of the relevant region
area_geojson_file_name = "Police_Force_Areas_December_2016_Durham_fixed.geojson"

# Concise name for your crime data set, to be included as part of output file names
#  Note: Any characters that are not letters or numbers will be removed
dataset_name = "FantDurFORE"

# Relevant crime types, as named in the crime type column of your input file
#  If you want to aggregate multiple crime types, you can separate them by commas
#crime_type_set = "Burglary, Vehicle crime"
crime_type_set = "Burglary"
#crime_type_set = "Vehicle crime"

# Size (in meters) of the side of squares in the grid over the area
cell_width = 500



# The date associated with the experiment
#  This should be the first day AFTER the intended training data window.
#  If set to None or "today", this will default to today's date.
#  (Currently, provide date in format "YYYY-MM-DD")
exp_date = "2019-09-15"

# For time length parameters, use a number followed by a letter.
#  D=days, W=weeks, M=months, Y=years
#  3D=3 days, 12W=12 weeks, 6M=6 months, 1Y=1 year, etc

# Size of the time window of events for the models to be trained on in each experiment
train_len          = "4W"




# Predictive model(s) to run, comma-separated if running multiple different ones
#  Currently recognised names are: random, naive, ideal, phs
models_to_run = "random,naive,ideal,phs"


# Parameters for PHS model, each one comma-separated as needed
#  Atomic unit for time bandwidths
phs_time_units = "1W"
#  Time bandwidths -- should be a multiple of phs_time_units
phs_time_bands = "4W"
#  Atomic unit for distance bandwidths, in meters
#   Recommended to set this equal to cell_width
phs_dist_units = "500"
#  Distance bandwidths, in meters -- should be a multiple of phs_dist_units
phs_dist_bands = "1500"
#  Weight function
#   "classic": 
#   "linear": 
phs_weight = "classic"
#  Spread type
#   "grid": 
#   "continuous": 
phs_spread = "continuous"

# CSV formatting parameters
# If Fantasy Durham data:
local_epsg = 27700
csv_date_format = "%d/%m/%Y"
csv_longlat = True
csv_epsg = 27700
csv_infeet = False



print("Parameter assignment complete.")

Parameter assignment complete.
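
To give a feel for how the PHS time and distance parameters above interact, here is a minimal sketch of a PHS-like risk score for a single cell. The weight used here, an inverse of the elapsed time units multiplied by an inverse of the distance units, is an assumption chosen for illustration; the notebook does not state what its "classic" or "linear" weights compute, and this is not the implementation in `riskModelsGeneric`. The default arguments mirror the parameter values set above (1W, 4W, 500, 1500).

```python
import math

def phs_like_risk(cell_centre, events, now_days,
                  time_unit_days=7, time_band_days=28,
                  dist_unit_m=500, dist_band_m=1500):
    """Sum an illustrative space-time weight over nearby, recent events.

    events: iterable of (x, y, day) with coordinates in metres and
    day counted from an arbitrary start (all values are made up).
    """
    cx, cy = cell_centre
    risk = 0.0
    for x, y, t_days in events:
        age = now_days - t_days
        dist = math.hypot(x - cx, y - cy)
        # Events outside either bandwidth contribute nothing
        if age < 0 or age > time_band_days or dist > dist_band_m:
            continue
        t_units = age // time_unit_days   # whole time units elapsed
        d_units = dist // dist_unit_m     # whole distance units away
        # Assumed weight: decays with both elapsed time and distance
        risk += 1.0 / ((1 + t_units) * (1 + d_units))
    return risk

events = [(250, 250, 27), (1250, 250, 14), (5000, 5000, 27)]
risk = phs_like_risk((250, 250), events, now_days=28)
print(round(risk, 4))
```

In this example the first event is recent and in the same cell, the second is two weeks old and 1000 m away, and the third lies outside the distance bandwidth and is ignored.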


# Run experiments using various models and data subsets

This function (runModelExperiments) takes the parameters from above and runs the models with all desired parameter combinations, using training and testing data sets over sliding-window timeframes.

A CSV output file will appear in the output directory, containing results from each model with each parameter combination on each timeframe's data set.

If only one timeframe is used, as in this forecasting notebook, heatmap visualisations will be generated, appearing below as well as in the output directory.
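
The training timeframe follows from the experiment date and the training length. As a sketch of that arithmetic, the code below parses a time-length string such as "4W" and returns the window that ends the day before the experiment date. The function names and the month handling (stepping back whole calendar months and clamping the day) are assumptions of this example and may differ from what `crimeRiskTimeTools` does.

```python
import calendar
import datetime

UNIT_DAYS = {"D": 1, "W": 7}

def parse_time_length(text):
    """Split a string like '12W' into (12, 'W')."""
    text = text.strip().upper()
    return int(text[:-1]), text[-1]

def training_window(exp_date, train_len):
    """Return (start, end) dates; events with start <= date < end are training data."""
    end = datetime.date.fromisoformat(exp_date)
    number, unit = parse_time_length(train_len)
    if unit in UNIT_DAYS:
        start = end - datetime.timedelta(days=number * UNIT_DAYS[unit])
    elif unit in ("M", "Y"):
        months = number * (12 if unit == "Y" else 1)
        total = end.year * 12 + (end.month - 1) - months
        year, month_index = divmod(total, 12)
        month = month_index + 1
        # Clamp the day for shorter months (e.g. 31 March minus 1M)
        day = min(end.day, calendar.monthrange(year, month)[1])
        start = datetime.date(year, month, day)
    else:
        raise ValueError(f"Unrecognised time unit: {unit}")
    return start, end

start, end = training_window("2019-09-15", "4W")
print(start, end)
```

With the default parameters in this notebook, the window covers the 28 days from 2019-08-18 up to, but not including, 2019-09-15.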

In [3]:
### Run this without editing anything

created_files = runModelExperiments(
        input_datadir_in = input_datadir, 
        output_datadir_in = output_datadir, 
        dataset_name_in = dataset_name, 
        crime_type_set_in = crime_type_set, 
        cell_width_in = cell_width, 
        in_csv_file_name_in = in_csv_file_name, 
        geojson_file_name_in = area_geojson_file_name, 
        local_epsg_in = local_epsg, 
        earliest_exp_date_in = exp_date, 
        train_len_in = train_len, 
        test_len_in = "0D", 
        models_to_run_in = models_to_run, 
        phs_time_units_in = phs_time_units, 
        phs_time_bands_in = phs_time_bands, 
        phs_dist_units_in = phs_dist_units, 
        phs_dist_bands_in = phs_dist_bands, 
        phs_weight_in = phs_weight, 
        phs_spread_in = phs_spread, 
        csv_date_format = csv_date_format, 
        csv_longlat = csv_longlat, 
        csv_epsg = csv_epsg, 
        csv_infeet = csv_infeet, 
        csv_col_names = [csv_date_name, csv_east_name, csv_north_name, csv_crimetypes_name], 
        )

train_geojson_file, details_csv_file, results_geojson_file = created_files
print("Output file names:")
for fname in created_files:
    print(fname)

The input data directory is: ../../Input
The output data directory is: ../../Output
Number of experiments to run: 1
Associated dates of each experiment: ['2019-09-15']
Obtaining full data set and region...
Number of relevant crimes: 987
Number of relevant crimes in area: 987
...Obtained full data set and region.
Time taken to obtain data: 1.263
Running experiment 1/1...
Writing training data to ../../Output/train_200203_FantDurFORE_Burglary_500m_190915_1x_0D_4W_0D_1.geojson
num_crimes_train: 281
num_crimes_test: 0
Running model: random
Running model: naive
Running model: ideal
Running model: phs
 Parameter set #1/1
Saving detailed results to csv:
../../Output/details_200203_FantDurFORE_Burglary_500m_190915_1x_0D_4W_0D_1.csv
       cell_num  coverage random_cell  random_cell_risk  random_found_count  \
0             1  0.000099    (56, 99)          0.999978                   0   
1             2  0.000198    (35, 41)          0.999957                   0   
2             3  0.000297   (

If a detailed csv file has been generated, a line graph of the results can be viewed by running this code, after editing the parameters appropriately.

In [None]:
### Optionally, edit this then run it

# Full path to the details csv file
details_csv = "../../Output/details_200130_FantDur_Burglary_500m_190915_1x_1W_4W_1W_1.csv"

# Bounds of the x-axis, corresponding to coverage rate
coverage_range = [0, 0.1]

# Title for the graph (if no title, set to None)
#graph_title = None
graph_title = "Coverage vs Hit Rate"

# Dimensions of output graph (if None, will be (12,6))
#graph_size = None
graph_size = (12,8)


# If you want to save the line graph, provide a file name here
img_file_path = None
#img_file_path = "../../Output/details_200130_FantDur_Burglary_500m_190915_1x_1W_4W_1W_1.png"


graph_cov_vs_hit_from_csv(details_csv, 
                          x_limits = coverage_range, 
                          title = graph_title, 
                          img_size = graph_size, 
                          out_img_file_path = img_file_path)
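
For readers unfamiliar with the quantities on this graph, the sketch below computes a coverage versus hit-rate curve from a ranked list of cells. It assumes the usual definitions: coverage is the fraction of cells selected so far, and hit rate is the fraction of test events that fall within those cells. The function name and the sample data are illustrative, and this is not how `graph_cov_vs_hit_from_csv` is implemented.

```python
def coverage_hit_curve(ranked_cells, test_counts):
    """Return a list of (coverage, hit_rate) pairs, one per ranked cell.

    ranked_cells: cells ordered from highest to lowest predicted risk.
    test_counts: mapping of cell -> number of test-period events in it.
    """
    total_cells = len(ranked_cells)
    total_events = sum(test_counts.get(cell, 0) for cell in ranked_cells)
    curve = []
    found = 0
    for k, cell in enumerate(ranked_cells, start=1):
        found += test_counts.get(cell, 0)
        hit_rate = found / total_events if total_events else 0.0
        curve.append((k / total_cells, hit_rate))
    return curve

# Made-up ranking of four cells and made-up test-period event counts
ranked = ["A", "B", "C", "D"]
test_counts = {"A": 3, "C": 1}
curve = coverage_hit_curve(ranked, test_counts)
print(curve)
```

A model whose curve rises quickly at low coverage concentrates more of the later events into fewer cells. Note that a pure forecast, with no testing period, has no test events to score.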

# Interactive Display

After installing ipyleaflet with conda, enable its notebook extension and import its functions.

In [None]:
### Run this without editing anything

!jupyter nbextension enable --py --sys-prefix ipyleaflet
from ipyleaflet import *
print("Successfully imported ipyleaflet.")

If you successfully ran runModelExperiments above, performing only one experiment so that a "results" GeoJSON file is generated, then these file paths to the GeoJSON files will already be saved, so you don't need to run this section.

However, if you want to examine a different previously created file, then you can declare the full paths of those GeoJSON files yourself, here.

In [None]:
### Optional; only edit & run this if desired

# Name of GeoJson file with training data's events
train_geojson_file = "../../Data/train_200115_FantDur_190901_1W_1W_4W_1W.geojson"

# Name of GeoJson file with risk scores for relevant cells
results_geojson_file = "../../Data/results_200115_FantDur_190901_1W_1W_4W_1W.geojson"

Read in the data from the results GeoJSON file, and display a list of properties from the results GeoJSON that can be selected for visualization.

In [None]:
### Run this without editing

with open(results_geojson_file) as cg:
    cell_results = json.load(cg)
print("Successfully read geojson data.\n")

print('Properties available to visualize as "property_to_map":\n')
for p in list_risk_model_properties(geojson_file_contents=cell_results):
    print(p)
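
If you want to inspect a results file without the helper above, the property names can be gathered directly from the GeoJSON features. The following is a minimal sketch under the assumption that the file is a standard FeatureCollection; `property_names` is a hypothetical function, the sample data is invented, and unlike `list_risk_model_properties` it does no filtering of non-model properties.

```python
import json

def property_names(feature_collection):
    """Collect distinct property names across all features, in first-seen order."""
    names = []
    for feature in feature_collection.get("features", []):
        for name in feature.get("properties", {}):
            if name not in names:
                names.append(name)
    return names

# Invented stand-in for the contents of a results GeoJSON file
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "0", "geometry": None,
         "properties": {"naive": 2, "phs-4W-1500m": 0.8}},
        {"type": "Feature", "id": "1", "geometry": None,
         "properties": {"naive": 0, "phs-4W-1500m": 0.1}},
    ],
})

names = property_names(json.loads(geojson_text))
print(names)
```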

Only use this section if you want to combine properties to make a new property, which will be saved to a new GeoJSON file.

In [None]:
### Optional; only edit & run this if desired

# The geojson data (or, file names) with the info you want to combine.
# If it's all from the same data, just give the file name or the data object
# If you're pulling from multiple geojson files, make a list of them, 
#  repeating them for as many properties you're taking from them.
# For example, if you're combining 2 properties from geojsonA and 1 property
#  from geojsonB, this would be [geojsonA, geojsonA, geojsonB]

geojson_data_to_combine = ["../../Output/results_200129_FantDur_Burglary_500m_190915_1x_1W_4W_1W.geojson", 
                           "../../Output/results_200129_FantDur_Burglary_500m_190915_1x_1W_4W_1W.geojson"]

# The crime type analysed for each file
crime_type_set = "Burglary, Vehicle crime"

# The names of the properties you're combining.
# If it's the same property name from all the geojsons, just name it here.
# If you want multiple properties, make a list of them here,
#  corresponding to the list of geojsons in geojson_data_to_combine

properties_to_combine = "phs-4W-1500m, phs-4W-1500m"

# The relative weights to place on each property.
# For example, if you consider the second property to be three times as 
#  important as the first property, then use [1, 3] here.
# That would create a new property that is equal to the sum of
#  the first property and treble the second property.

property_weights = [2,1]

# The name for the new combined property you create.
# If this is not included or set to None, then the name of the property
#  will be the names of all the other properties, combined with "_"

combined_property_name = "triple-test"

# The name for the output GeoJSON file containing the combined property.
# If this is not included or set to None, then the name of the file
#  will be "results_{combined_property_name}.geojson" and it will be
#  placed in the same directory as the first GeoJSON file listed in
#  the list geojson_data_to_combine.

combined_geojson_file_path = None

# The combined property is created here

cell_results, new_property_name, new_geojson_file = combine_geojson_features(
                        geojson_data_list     = geojson_data_to_combine, 
                        combine_property_list = properties_to_combine, 
                        multiplier_list       = property_weights, 
                        new_property_name     = combined_property_name, 
                        new_file_name         = combined_geojson_file_path, 
                        )

print(f"Combined properties to make new property: {new_property_name}")
print(f"New GeoJSON file saved at: {new_geojson_file}")
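
The weighting described in the comments above amounts to a weighted sum per cell. As a sketch of that arithmetic only, the code below combines two properties within a single list of features. The function name and sample values are illustrative; `combine_geojson_features` also handles multiple input files and writes an output file, which this example does not attempt.

```python
def combine_properties(features, property_names, weights, new_name):
    """Add new_name to each feature as the weighted sum of the named properties."""
    for feature in features:
        props = feature["properties"]
        props[new_name] = sum(
            weight * props[name]
            for name, weight in zip(property_names, weights)
        )
    return features

# Invented features with two hypothetical risk properties, "a" and "b"
features = [
    {"properties": {"a": 1.0, "b": 2.0}},
    {"properties": {"a": 0.5, "b": 0.0}},
]

# Weights [1, 3]: the second property counts three times as much as the first
combine_properties(features, ["a", "b"], [1, 3], "combo")
print([f["properties"]["combo"] for f in features])
```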

Name the property you want to view on an interactive map here, along with other parameters:

In [None]:
### Edit this, then run it

# The property you want to map from the results file
#property_to_map = "naive"
property_to_map = "phs-4W-1500m"
#property_to_map = "phs-combo"

# The top proportion of cells you want to highlight
highlight_portion = 0.01

# The style of highlighted cells
highlight_cell_style = {'color':'blue',
                        'weight':1.5,
                        'fillColor':'transparent',
                       }

# Whether you want to plot the events from the training data
#  Choose from:
#   "none"    : Do not display the events
#   "point"   : Each event is a slightly transparent black circle
#   "cluster" : Multiple events cluster together as single circles,
#                 changing at different zoom levels
show_training_events = "cluster"

# Whether you want to save the top proportion of cells as a GeoJSON
#  file for later use, perhaps focusing solely on those areas.
#  This will create a file in the output directory, with a name
#  formatted as "top-cells_(property name)_(proportion).geojson"
save_top_cells = True

print("Parameters for map display have been set.")

Create the map:

In [None]:
### Run this without editing anything

# Instantiate a map centred at Durham
#m = Map(center=[54.776100, -1.573300], zoom=10)
m = Map(center=[54.75, -1.573300], zoom=9)

# We now load the police force area GeoJSON file and add it as a layer to the map
with open(std_file_name([input_datadir, area_geojson_file_name]), 'r') as f:
    bounds = json.load(f)
bounds_layer = GeoJSON(data=bounds, style = {'color': 'green', 'opacity':1, 'weight':2, 'fillColor':'transparent'})
m.add_layer(bounds_layer)

# Obtain relevant scores from GeoJSON file, and only include
#  cells with non-zero scores
score_mapping = dict()
nonzero_cell_results = dict()
for key in cell_results:
    if key != 'features':
        nonzero_cell_results[key] = cell_results[key]
nonzero_cell_results['features'] = []
for feat in cell_results['features']:
    property_value = feat['properties'][property_to_map]
    if property_value > 0:
        nonzero_cell_results['features'].append(feat)
    score_mapping[feat['id']] = property_value


# Create map layer with color-coded cells representing risk scores, like a heat map
from branca.colormap import linear
layer = Choropleth(
    geo_data=nonzero_cell_results,
    choro_data=score_mapping,
    colormap=linear.YlOrRd_04,
    border_color='transparent',
    style={'fillOpacity': 0.8})
m.add_layer(layer)


# Create map layer with circles representing crime events
if show_training_events is not None:
    show_training_events = show_training_events.lower()
    if show_training_events not in ["false","no","none"]:
        if show_training_events == "point":
            with open(train_geojson_file) as eg:
                datapoints = json.load(eg)
            geojson_datapoints = GeoJSON(data=datapoints, point_style={'color': 'transparent', 'fillColor': 'black', 'radius': 5})
            m.add_layer(geojson_datapoints)
        elif show_training_events == "cluster":
            cluster_datapoints = marker_cluster_from_data(train_geojson_file)
            m.add_layer(cluster_datapoints)




# Create map layer where top % of cells are highlighted
cell_results_gdf = json_dict_to_geoframe(cell_results)
top_cells_frame = top_geojson_features(cell_results_gdf, property_to_map, highlight_portion)
top_cells_layer = GeoData(geo_dataframe = top_cells_frame,
                         style = highlight_cell_style)
m.add_layer(top_cells_layer)
if save_top_cells:
    top_cells_geojson_file = std_file_name([output_datadir, 
                f"top-cells_{property_to_map}_{str(highlight_portion)[2:]}.geojson"])
    top_cells_frame.to_file(top_cells_geojson_file, driver='GeoJSON')






# Add useful controls to the GUI for full screen mode and layer toggling
m.add_control(FullScreenControl())
m.add_control(LayersControl())


# Display useful information as text above the map
num_cells = len(cell_results['features'])
sqkm_per_cell = ((cell_width*.001)**2)
area_size = num_cells * sqkm_per_cell
num_cells_highlight = int(num_cells * highlight_portion)
area_highlight = num_cells_highlight * sqkm_per_cell
date_time_now = datetime.datetime.now().strftime("%d/%m/%Y, %H:%M")
message = (
    f"Generating map...... {date_time_now}\n", 
    f"Crime Type Analysed: {crime_type_set}\n", 
    f"Mapping: {property_to_map}\n", 
    f"Grid Config:\n",
    f"  Grid cell size: {cell_width}m x {cell_width}m\n",
    f"  {num_cells} grid cells analysed\n",
    f"  Total area: {area_size:.2f} km^2 \n",
    f"Priority Cells:\n",
    f"  Coverage {highlight_portion * 100}%\n",
    f"  Number of cells: {num_cells_highlight}\n",
    f"  Priority area: {area_highlight:.2f} km^2\n",
)
print(*message, sep="")


# Display the map
m
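
The summary printed above the map relies on two small calculations: how many cells the highlight proportion selects, and the area those cells cover. The sketch below reproduces that arithmetic on invented scores. `top_cells` and `area_km2` are hypothetical names, and the selection here is a plain sort, which may not match how `top_geojson_features` breaks ties.

```python
def top_cells(scores, portion):
    """Return the highest-scoring cells, keeping int(len(scores) * portion) of them."""
    n_top = int(len(scores) * portion)  # truncates, so small grids may select 0 cells
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]

def area_km2(num_cells, cell_width_m):
    """Total area of num_cells square cells, in square kilometres."""
    return num_cells * (cell_width_m * 0.001) ** 2

# Invented scores for 200 cells; higher index means higher risk
scores = {f"cell{i}": float(i) for i in range(200)}
selected = top_cells(scores, 0.01)
priority_area = area_km2(len(selected), 500)
print(selected, priority_area)
```

With a 500 m cell width each cell covers 0.25 km^2, so highlighting 1% of 200 cells gives 2 cells and a priority area of 0.5 km^2.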