# Near-Repeat Analysis

This notebook analyses how crimes cluster in space and time. For any given window of time, we can test whether events cluster significantly within a certain distance and a certain length of time of one another. For example, a particular type of crime might cluster within 1500m of earlier events for up to 4 weeks. This information can be useful for the PHS model in the Risk Models notebook.

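The significance of each pairing of space and time bandwidths is assessed with the Knox test. Its test statistic, the Knox statistic, counts the pairs of events that fall within both bandwidths of each other.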
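
As a rough illustration of the idea, the sketch below runs a Monte Carlo Knox test for a single bandwidth pair using only the standard library. It is not the `open_cp.knox` implementation: the function name `knox_test`, its parameters, and the synthetic data are all assumptions made for this example.

```python
import math
import random
from itertools import combinations


def knox_test(points, times, s_max, t_max, n_iter=199, seed=0):
    """Hypothetical Monte Carlo Knox test for one space/time bandwidth pair.

    points: list of (x, y) coordinates in meters
    times:  list of event times in days, same order as points
    """
    rng = random.Random(seed)

    # Spatial closeness is unaffected by shuffling times, so find these pairs once.
    near_pairs = [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= s_max
    ]

    def count_close(ts):
        # Knox statistic: pairs close in space AND close in time.
        return sum(1 for i, j in near_pairs if abs(ts[i] - ts[j]) <= t_max)

    observed = count_close(times)

    # Null distribution: shuffle the times across the fixed locations.
    shuffled = list(times)
    at_least = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if count_close(shuffled) >= observed:
            at_least += 1

    p_value = (1 + at_least) / (1 + n_iter)
    return observed, p_value


# Synthetic data: 10 well-separated locations, each with two events a day apart.
points, times = [], []
for k in range(10):
    points += [(10_000 * k, 0), (10_000 * k + 100, 0)]
    times += [100 * k, 100 * k + 1]

observed, p_value = knox_test(points, times, s_max=500, t_max=7)
print(observed, p_value)
```

Shuffling the event times while keeping the locations fixed breaks any link between where and when events occur, so the shuffled counts approximate what would be seen with no space-time clustering.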

## Import Modules

Necessary modules are imported here.

In [None]:
### Run this without editing anything

import sys
import os
import importlib
import geopandas as gpd
import statistics
import time

# In order to use our local edited versions of open_cp
# scripts, we insert the parent directory of the current
# file ("..") at the start of our sys.path here.
sys.path.insert(0, os.path.abspath(".."))

# Elements from PredictCode's custom "open_cp" package
import open_cp
import open_cp.knox

import knoxAnalysis
importlib.reload(knoxAnalysis)
from knoxAnalysis import make_knox_info_file, make_graphs_from_knox_file

print("Successfully imported modules.")

## Set Parameters

Set your parameters here. The current default arguments are for a Fantasy Durham data set.

In [None]:
### Edit this, then run it


# Location of data file
datadir           = "../../Data"

# Input csv file name
in_csv_file_name  = "Fantasy-Durham-Data.csv"

# Desired output file name for Knox information
knoxrun_file_name = "knoxtestingFD.txt"

# Geojson file
geojson_file_name = "Police_Force_Areas_December_2016_Durham_fixed.geojson"

# Crime types to include
# If you want to consider multiple crime types together, separate them by commas
crime_types       = "Burglary, Vehicle crime"

# Number of iterations of the algorithm to perform
# Around 1000 iterations is ideal; use fewer to shorten the run time.
num_knox_iterations = 1000

# Size of distance bins, in meters
knox_sbin_size = 500
# Number of distance bins to test
knox_sbin_num  = 4
# Size of time bins, in days
knox_tbin_size = 7
# Number of time bins to test
knox_tbin_num  = 4

# Total number of experiments to run
#  (Multiple experiments will be offset by "time_step", below)
num_experiments  = 4
# Size of the time window of events to analyse
time_window_size = "6W"
# End of the first time window
#  (Currently, provide date in format YYYY-MM-DD)
first_test_end   = "2019-09-15"
# Time step between experiments
time_step        = "1W"

#  EPSG value for local geographic region; this is 27700 for most of the UK
local_epsg      = 27700

# CSV formatting parameters; current parameters are for Fantasy Durham data.
#  Names of the appropriate columns in the header of the CSV file
csv_date_name       = "Date"       # column with date (and time)
csv_east_name       = "Longitude"  # column with eastings or longitudes
csv_north_name      = "Latitude"   # column with northings or latitudes
csv_crimetypes_name = "Crime type" # column with type of crime
#  Format of dates in the input CSV file
csv_date_format = "%d/%m/%Y"
#  Whether the input CSV file contains long/lat as its coordinate system
csv_longlat     = True
#  The EPSG used by the input CSV file.
#   If the input CSV file uses long/lat, ignore this.
csv_epsg        = 27700
#  Whether the coordinates in the input CSV file use feet instead of meters.
#   If the input CSV file uses long/lat, ignore this.
csv_infeet      = False



# Significance thresholds we're interested in examining for crime clustering
signif_cutoff = [0.01, 0.05, 0.1]


print("Parameter assignment complete.")

## Generate Knox data

The cell below first generates a text file containing the significance of each cluster bandwidth pairing tested. There will be a separate set of results for each time window.
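
To make the time windows and bandwidth bins concrete, here is a minimal sketch of how the default parameters could translate into windows and bin edges. The helpers `parse_weeks`, `experiment_windows`, and `bin_edges` are hypothetical and are not part of `knoxAnalysis`; the exact endpoint handling in the real code may differ.

```python
from datetime import date, timedelta


def parse_weeks(spec):
    """Turn a string such as "6W" into a timedelta (weeks only, for this sketch)."""
    if not spec.endswith("W"):
        raise ValueError(f"expected a value like '6W', got {spec!r}")
    return timedelta(weeks=int(spec[:-1]))


def experiment_windows(first_end, window, step, n):
    """Return (start, end) dates for n windows, each shifted by `step`."""
    end0 = date.fromisoformat(first_end)
    size, shift = parse_weeks(window), parse_weeks(step)
    return [(end0 + i * shift - size, end0 + i * shift) for i in range(n)]


def bin_edges(size, num):
    """Edges of `num` consecutive bins of width `size`, starting at 0."""
    return [i * size for i in range(num + 1)]


# Values mirror the defaults in the parameter cell.
windows = experiment_windows("2019-09-15", "6W", "1W", 4)
for start, end in windows:
    print(start, "to", end)

print("distance bin edges (m):", bin_edges(500, 4))
print("time bin edges (days):", bin_edges(7, 4))
```

Under these assumptions, each window yields one result for every combination of a distance bin and a time bin.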

Next, the text file is processed to generate charts that illustrate the significance of these cluster bandwidth pairings. Regions marked with a run of A's are bandwidth pairings that passed the strictest significance threshold listed; pairings marked with B's (one fewer letter than the A's) passed the next strictest threshold, and so on.
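
One possible reading of this labelling scheme is sketched below. The function `significance_marker` is hypothetical, and both the number of letters and the use of a strict `<` comparison are assumptions; the charts produced by `make_graphs_from_knox_file` may differ in detail.

```python
def significance_marker(p_value, cutoffs):
    """Map a p-value to a letter marker (assumed scheme, for illustration only).

    The strictest cutoff gets as many A's as there are cutoffs; each looser
    cutoff moves to the next letter and drops one character.
    """
    ordered = sorted(cutoffs)
    for rank, cutoff in enumerate(ordered):
        if p_value < cutoff:
            letter = chr(ord("A") + rank)
            return letter * (len(ordered) - rank)
    return ""  # not significant at any listed threshold


for p in (0.003, 0.03, 0.07, 0.5):
    print(p, repr(significance_marker(p, [0.01, 0.05, 0.1])))
```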

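Colour represents the Knox ratio: the observed number of close event pairs divided by the number expected if event times were unrelated to event locations. It is related to the significance but measures the strength of the clustering, with ratios well above 1 indicating more close pairs than expected at those bandwidths.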
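
As a small worked example, a Knox ratio can be computed from an observed count and a set of counts from shuffled data. The sketch assumes the expected count is taken as the median of the shuffled counts (the mean is another common choice); `knox_ratio` and the numbers used are hypothetical.

```python
import statistics


def knox_ratio(observed, permuted_counts):
    """Observed close-pair count divided by the median count under shuffling."""
    expected = statistics.median(permuted_counts)
    if expected == 0:
        # Avoid dividing by zero when shuffled data produce no close pairs.
        return float("inf") if observed > 0 else float("nan")
    return observed / expected


# Made-up counts, purely for illustration.
print(knox_ratio(30, [10, 20, 30]))
```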

In [None]:
### Run this without editing anything

# Perform Knox runs and generate file of resulting Knox info
knox_info_file = make_knox_info_file(datadir=datadir, 
                    in_csv_file_name=in_csv_file_name, 
                    out_knox_file_name=knoxrun_file_name, 
                    geojson_file_name=geojson_file_name, 
                    local_epsg_in=local_epsg, 
                    crime_types=crime_types, 
                    num_knox_iterations=num_knox_iterations, 
                    knox_sbin_size=knox_sbin_size, 
                    knox_sbin_num=knox_sbin_num, 
                    knox_tbin_size=knox_tbin_size, 
                    knox_tbin_num=knox_tbin_num, 
                    earliest_exp_time=first_test_end, 
                    num_exp=num_experiments, 
                    time_step=time_step, 
                    time_len=time_window_size, 
                    csv_date_format=csv_date_format, 
                    csv_longlat=csv_longlat, 
                    csv_epsg=csv_epsg, 
                    csv_infeet=csv_infeet, 
                    csv_col_names=[csv_date_name, csv_east_name, csv_north_name, csv_crimetypes_name], 
                    )

print("Finished generating Knox data.")
print(f"Output file should be available at: {knox_info_file}")



make_graphs_from_knox_file(knox_info_file, 
                           signif_cutoff=signif_cutoff)