# Crime Forecast in Vancouver

by Ben Chen, Mo Norouzi, Orix Au Yeung, Yiwei Zhang

In [1]:
import pandas as pd
import warnings
from myst_nb import glue
warnings.filterwarnings("ignore")
pd.set_option('display.precision', 2)
pd.set_option('styler.format.precision', 2)
pd.set_option('styler.render.max_columns', 10)

In [2]:
data = pd.read_csv("../data/raw/crimedata_csv_AllNeighbourhoods_AllYears.csv", encoding="utf-8").round(2).fillna('nan').astype("string")

#format the preview of dataset
data_head = data.head()
                                     
glue("data_head", data_head.T, display=False)
glue("data_no_cols", data.shape[1], display=False)
glue("data_no_rows", data.shape[0], display=False)
# Cast to int so the inline text renders as e.g. "20" rather than "20.0"
years = data["YEAR"].astype("float")
glue("no_year_unique", int(years.max() - years.min()), display=False)

data_description = pd.read_csv("../results/tables/description.csv", encoding="utf-8", index_col=0).round(2).astype("string")
glue("data_description", data_description.T, display=False)

missing_data = pd.read_csv("../results/tables/missing_values.csv", encoding="utf-8").round(2).astype("string")
glue("missing_values", missing_data, display=False)

correlation = pd.read_csv("../results/tables/correlation.csv", encoding="utf-8", index_col=0)
glue("data_correlation", correlation.style.background_gradient(), display=False)

preprocessed_head = pd.read_csv("../data/processed/preprocessed_data_head.csv", encoding="utf-8").round(2)
glue("preprocessed_head", preprocessed_head, display=False)

theft_from_vehicle_full = pd.read_csv("../data/processed/preprocessed_theft_from_vehicle_full.csv", encoding="utf-8")
glue("theft_from_vehicle_rows", theft_from_vehicle_full.shape[0], display=False)

theft_from_vehicle_head = pd.read_csv("../data/processed/preprocessed_theft_from_vehicle_head.csv", encoding="utf-8")
glue("theft_from_vehicle_head", theft_from_vehicle_head, display=False)

all_predictions = pd.read_csv("../results/tables/all_predictions.csv", encoding="utf-8", index_col=0).round(2).fillna('nan').astype("string")
glue("all_predictions_head", all_predictions.head(15), display=False)
glue("all_predictions_tail", all_predictions.tail(5), display=False)

observations = pd.read_csv("../results/tables/observations.csv", encoding="utf-8", index_col=0).round(2).astype("string")
glue("observations", observations, display=False)
glue("arima_mae", observations.query("Forecast_Column == 'ARIMA_Forecast'")["MAE"].squeeze(), display=False)


## Summary

In this notebook, we build a time-series forecasting model to predict monthly crime counts in Vancouver. We focus on one of the most prevalent crime types in Vancouver over the past two decades: theft from vehicles. We evaluate three basic forecasting models: simple moving average, exponential smoothing, and ARIMA(1,1,0). The ARIMA(1,1,0) model performs best, with a Mean Absolute Error (MAE) of {glue:text}`arima_mae`. Given that monthly counts of "theft from a vehicle" crimes often range in the hundreds to thousands, this is a reasonable level of accuracy for such a simple model. Further parameter tuning and the integration of additional external variables could yield even more accurate forecasts.

Moreover, the ARIMA(1,1,0) model's performance suggests promising potential for practical deployment in crime prevention strategies by law enforcement agencies. However, it's important to acknowledge the limitations inherent in solely relying on historical crime data; external socio-economic factors, seasonal fluctuations, and policy changes might significantly impact future crime trends, necessitating continual model adaptation and refinement for robust predictive accuracy.

## Introduction

Vehicle-related theft remains an ongoing concern across Canada, with a vehicle theft occurring roughly every six minutes nationwide {cite}`CTVnews`. This pervasive issue extends into Vancouver, presenting formidable challenges to both community safety and law enforcement efforts. Theft from vehicles, a prevalent form of this crime, significantly affects neighborhoods, inflicting distress and substantial financial losses on local residents. In response to this pressing concern, this project is dedicated to forecasting occurrences of theft from vehicles specifically within Vancouver.

The primary objective of this project is to forecast instances of theft from vehicles in Vancouver by analyzing historical data. Leveraging a comprehensive dataset sourced from the Vancouver Police Department, encompassing diverse crime records in Vancouver over the past {glue:text}`no_year_unique` years alongside incident locations, our goal is to construct a reliable predictive model. This model aims to anticipate the frequency and patterns specific to theft from vehicles. An accurate forecast holds the potential to empower the City of Vancouver to proactively allocate law enforcement resources, thereby curbing the occurrence of such crimes and enhancing community safety.

## Methods

### Data

The dataset utilized for this project originates from the Vancouver Police Department, available through the following link: https://geodash.vpd.ca/opendata/ {cite}`crimedata2023`. It comprises {glue:text}`data_no_cols` columns/variables and encompasses a substantial volume of data, totaling {glue:text}`data_no_rows` rows. Each row corresponds to a distinct crime incident recorded within the dataset. The available information includes details about the crime type, the corresponding date of occurrence, and the specific location or neighborhood where the crime took place. These data points serve as crucial elements for our analysis and forecasting efforts.

### Analysis

We apply three time-series forecasting models: Simple Moving Average (SMA), Exponential Smoothing (ES), and Autoregressive Integrated Moving Average (ARIMA). These models rely solely on the timestamp and the value being forecast. Although the dataset includes location data, which is potentially valuable, we defer its use to a later phase of the project. Using a rolling-window approach, we predict and assess model performance across the {glue:text}`no_year_unique`-year period with a window size of 12 months, so that each forecast draws on the preceding year's data. For ARIMA, the hyperparameters (p, d, q) are set to (1, 1, 0): the series is differenced once, and the model uses the single most recent lagged value of the differenced series, with no moving-average terms, to predict the next value. Our analysis was carried out in Python using the following libraries: NumPy {cite}`harris2020`, pandas {cite}`mckinney2010`, Altair {cite}`vanderplas2018`, scikit-learn {cite}`pedregosa`, Matplotlib {cite}`hunter2007`, Seaborn {cite}`seaborn`, and Statsmodels {cite}`seabold2010`.
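
As a sketch of what the (1, 1, 0) specification means, write {math}`y_t` for the number of incidents in month {math}`t`. Assuming no constant term (an assumption on our part; the fitted model may be configured differently), the model and its one-step-ahead forecast can be written as:

```{math}
\Delta y_t = y_t - y_{t-1}, \qquad \Delta y_t = \phi_1 \, \Delta y_{t-1} + \varepsilon_t
```

```{math}
\hat{y}_{t+1} = y_t + \hat{\phi}_1 \, (y_t - y_{t-1})
```

Here {math}`\phi_1` is the single autoregressive coefficient and {math}`\varepsilon_t` is a white-noise error term. Because {math}`q = 0`, there are no moving-average terms.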

## Results & Discussions

First, we'll take a brief look at the dataset at hand. A preview of the dataset ({numref}`Figure {number} <data_head>`) shows the variables and data types that we are dealing with. Summary statistics ({numref}`Figure {number} <data_description>`) are displayed below, and we can see that there are also some missing values ({numref}`Figure {number} <missing_values>`) in our dataset.

### Data Preview and Summary

```{glue:figure} data_head
:name: "data_head"

A preview of the dataset.
```

```{glue:figure} data_description
:name: "data_description"

The summary statistics of the dataset.
```

### Missing values

From the missing value table below ({numref}`Figure {number} <missing_values>`), we can see that we have missing values in four variables: `NEIGHBORHOOD`, `X`, `Y` and `HUNDRED_BLOCK`. Considering the immense size of our dataset, these missing values make up only a small proportion. In a future analysis, when we want to incorporate geographical information into our model, we could impute these missing values (or simply drop them). However, since at this stage of the project we are applying pure time-series models that depend only on lagged values, the missing values will not affect our forecasting models. We can ignore them for now.

```{glue:figure} missing_values
:name: "missing_values"

Missing value in the dataset.
```
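
A table like the one above can be produced along the following lines. This is a minimal sketch on a made-up toy frame, not the code that generated `missing_values.csv`; the names `toy` and `missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime records; the values are made up for illustration.
toy = pd.DataFrame({
    "TYPE": ["Theft from Vehicle", "Mischief", "Theft from Vehicle", "Mischief"],
    "NEIGHBORHOOD": ["West End", None, "Kitsilano", "West End"],
    "X": [491000.0, np.nan, 488500.0, 491200.0],
    "Y": [5459000.0, np.nan, 5457000.0, 5459100.0],
})

# Count missing entries per column and express them as a share of all rows.
missing = toy.isna().sum().to_frame("n_missing")
missing["proportion"] = missing["n_missing"] / len(toy)
print(missing)
```

Reporting the proportion alongside the raw count makes it easier to judge whether dropping the affected rows would be harmless.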

### Distribution

Exploratory data analysis (EDA) reveals conspicuous anomalies in the dataset. The `HOUR` and `MINUTE` columns contain an unusually high number of zero values, and the `MINUTE` column has a disproportionate number of records at '30'. Additionally, the `DAY` column has an excessive number of records logged on the 31st of the month ({numref}`Figure {number} <numeric_dist>`). These irregularities likely stem from convenience in data recording and cast doubt on the accuracy of these three columns. In light of these inconsistencies, the most prudent approach is to exclude the `DAY`, `HOUR`, and `MINUTE` columns from the analysis and forecast crime occurrences at the monthly level using the `MONTH` variable. Moreover, the two bar graphs showing the distribution of `TYPE` and `NEIGHBORHOOD` ({numref}`Figure {number} <categ_dist>`) give a good idea of the types of crime happening in Vancouver and the neighborhoods in which they are concentrated.

```{figure} ../results/figures/numeric_dist.png
---
width: 800px
name: numeric_dist
---
The distribution of numerical variables.
```

```{figure} ../results/figures/categ_dist.png
---
width: 600px
name: categ_dist
---
The distribution of categorical variables.
```

### Correlation

```{glue:figure} data_correlation
:name: "data_correlation"

Correlation of numerical variables.
```

### Preprocessing

We'll start the data preprocessing phase by grouping the rows according to the `TYPE`, `YEAR`, and `MONTH` columns to aggregate the counts of specific crimes occurring in each month. Additionally, we'll adjust the datetime variable format for consistency. However, as the latest month (November 2023) is incomplete, we'll exclude this month from the dataset ({numref}`Figure {number} <preprocessed_head>`). Finally, we'll filter the data so that we focus only on `Theft from Vehicle` crimes, the most common crime in Vancouver in the past {glue:text}`no_year_unique` years ({numref}`Figure {number} <theft_from_vehicle_head>`). This initial processing sets the groundwork for our subsequent time-series forecasting models.

```{glue:figure} preprocessed_head
:name: "preprocessed_head"

A preview of the preprocessed data.
```

```{glue:figure} theft_from_vehicle_head
:name: "theft_from_vehicle_head"

Filtered data of theft from vehicle crimes.
```
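
The preprocessing steps described above can be sketched as follows. This is an illustration on made-up records, assuming the `TYPE`, `YEAR`, and `MONTH` column names from the dataset; the `COUNT` and `DATE` column names and the variable names are hypothetical, and the actual preprocessing script may differ.

```python
import pandas as pd

# Toy incident-level records (made-up values); one row per incident.
records = pd.DataFrame({
    "TYPE": ["Theft from Vehicle"] * 4 + ["Mischief"] * 2,
    "YEAR": [2023, 2023, 2023, 2023, 2023, 2023],
    "MONTH": [9, 9, 10, 11, 9, 10],
})

# Aggregate to one row per crime type and month.
monthly = (
    records.groupby(["TYPE", "YEAR", "MONTH"])
    .size()
    .reset_index(name="COUNT")
)

# Build a datetime column from YEAR and MONTH (first day of each month).
monthly["DATE"] = pd.to_datetime(
    monthly[["YEAR", "MONTH"]].rename(columns=str.lower).assign(day=1)
)

# Drop the latest, incomplete month (November 2023 in the real data).
monthly = monthly[monthly["DATE"] < pd.Timestamp("2023-11-01")]

# Keep only the crime type of interest, ordered in time.
theft = (
    monthly[monthly["TYPE"] == "Theft from Vehicle"]
    .sort_values("DATE")
    .reset_index(drop=True)
)
print(theft[["DATE", "COUNT"]])
```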

Looking into our preprocessed and filtered data, we see that we have {glue:text}`theft_from_vehicle_rows` months of data. We can also see that the trend for the counts of `Theft from Vehicle` crimes actually fluctuated quite a lot over the years ({numref}`Figure {number} <original_plot>`).

```{figure} ../results/figures/original_plot.png
---
width: 600px
name: original_plot
---
The time-series trend of theft from vehicle crimes in Vancouver.
```

### Simple Moving Average & Exponential Smoothing

As mentioned at the beginning, we apply three time-series models (Simple Moving Average, Exponential Smoothing and ARIMA(1,1,0)) with a rolling window of 12 months to forecast the count of `Theft from Vehicle` crimes on a monthly basis. We do not have a forecasted value for the first 12 time units due to the size of the rolling window ({numref}`Figure {number} <all_predictions_head>`), but we do forecast one time unit into the future ({numref}`Figure {number} <all_predictions_tail>`).

```{glue:figure} all_predictions_head
:name: "all_predictions_head"

A preview (head) of the predicted values.
```

```{glue:figure} all_predictions_tail
:name: "all_predictions_tail"

A preview (tail) of the predicted values.
```
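
To make the rolling-window setup concrete, here is a minimal sketch of SMA and ES forecasts on a made-up series. It assumes that each forecast uses only the preceding 12 months, and it uses `span=WINDOW` as the smoothing setting for ES; both the smoothing setting and all variable names are assumptions for illustration, not the configuration used to produce the tables above.

```python
import numpy as np
import pandas as pd

WINDOW = 12  # rolling-window size in months, as in the analysis

# Toy monthly counts (made up): 24 months with values 1..24.
counts = pd.Series(
    np.arange(1, 25, dtype=float),
    index=pd.date_range("2020-01-01", periods=24, freq="MS"),
)

# shift(1) ensures the forecast for a month uses only earlier months.
history = counts.shift(1)

# SMA: plain average of the previous 12 months.
sma_forecast = history.rolling(WINDOW).mean()

# ES: exponentially weighted average; span=WINDOW is an assumed setting.
es_forecast = history.ewm(span=WINDOW, adjust=False, min_periods=WINDOW).mean()

# One step beyond the last observed month.
next_month = counts.index[-1] + pd.offsets.MonthBegin(1)
next_sma = counts.iloc[-WINDOW:].mean()
print(next_month.date(), next_sma)
```

With this setup the first 12 months have no forecast, which matches the missing values at the head of the predictions table.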

Based on a visual assessment of the Simple Moving Average (SMA) ({numref}`Figure {number} <sma_prediction_plot>`) and Exponential Smoothing (ES) ({numref}`Figure {number} <es_prediction_plot>`) forecasts, it's evident that both methods broadly capture the general trend of the actual values. However, neither forecast method appears to be highly accurate. The Exponential Smoothing approach demonstrates a slightly improved performance compared to SMA.

```{figure} ../results/figures/sma_prediction_plot.png
---
width: 600px
name: sma_prediction_plot
---
SMA predictions to actual values.
```

```{figure} ../results/figures/es_prediction_plot.png
---
width: 600px
name: es_prediction_plot
---
ES predictions to actual values.
```

### ARIMA(1,1,0)

Before fitting ARIMA, we first check the assumption of stationarity. The autocorrelation plot of our data ({numref}`Figure {number} <autocorrelation_plot>`) shows that the assumption is not met, so we have to apply differencing to make the data stationary. After applying first-order differencing, the assumption appears to be largely met ({numref}`Figure {number} <autocorrelation_with_diff_plot>`).

```{figure} ../results/figures/autocorrelation_plot.png
---
width: 600px
name: autocorrelation_plot
---
Autocorrelation plot for the observations.
```

```{figure} ../results/figures/autocorrelation_with_diff_plot.png
---
width: 600px
name: autocorrelation_with_diff_plot
---
Autocorrelation plot with 1st order differencing for the observations.
```
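
The effect of first-order differencing on autocorrelation can be illustrated with a small sketch. The series below is a simulated random walk with drift, not the crime data, and the variable names are hypothetical; it only shows the kind of check described above using pandas' `Series.diff` and `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Toy non-stationary series (simulated): a random walk with upward drift.
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(1.0 + rng.normal(size=200)))

# First-order differencing, the "d = 1" step of ARIMA(1,1,0).
differenced = series.diff().dropna()

# Lag-1 autocorrelation before and after differencing.
acf_raw = series.autocorr(lag=1)
acf_diff = differenced.autocorr(lag=1)
print(f"lag-1 autocorrelation, raw: {acf_raw:.2f}, differenced: {acf_diff:.2f}")
```

A lag-1 autocorrelation close to 1 in the raw series, dropping towards 0 after differencing, is the same pattern the two autocorrelation plots are used to check.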

The forecast from the ARIMA model looks much better ({numref}`Figure {number} <arima_prediction_plot>`). We can see clear overlaps between the forecasted values and the actual values.

```{figure} ../results/figures/arima_prediction_plot.png
---
width: 600px
name: arima_prediction_plot
---
ARIMA predictions to actual values.
```
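
To show the mechanics of a rolling ARIMA(1,1,0)-style forecast, the sketch below uses a simplified stand-in: an AR(1) fitted to the first differences by least squares, with no constant, refitted on each 12-month window. This is not the Statsmodels estimator used in the analysis, and the function names are hypothetical; it only illustrates the idea of forecasting from the most recent lagged difference.

```python
import numpy as np

WINDOW = 12  # months of history used for each refit (assumed)


def ar1_on_differences_forecast(window: np.ndarray) -> float:
    """One-step forecast y_t + phi * (y_t - y_{t-1}), phi fitted by least squares."""
    diffs = np.diff(window)
    prev, curr = diffs[:-1], diffs[1:]
    denom = float(np.dot(prev, prev))
    # Fall back to phi = 0 (a "last value" forecast) when the window is flat.
    phi = float(np.dot(prev, curr)) / denom if denom > 0 else 0.0
    return float(window[-1] + phi * diffs[-1])


def rolling_forecasts(y: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Forecast each month from the preceding `window` months; NaN where unavailable."""
    out = np.full(len(y), np.nan)
    for t in range(window, len(y)):
        out[t] = ar1_on_differences_forecast(y[t - window:t])
    return out


# Toy series (made up): a perfectly linear trend, which this model tracks exactly.
y = 100.0 + 10.0 * np.arange(24)
forecasts = rolling_forecasts(y)
print(forecasts[12:15])
```

On real data the fitted coefficient would vary from window to window, and a maximum-likelihood fit such as the one in Statsmodels would generally give somewhat different values.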

Finally, we compare the performance of the three models in terms of mean absolute error (MAE) and mean squared error (MSE). Performance improves markedly from Simple Moving Average (SMA) to Exponential Smoothing (ES) and then to ARIMA. The final MAE achieved by the ARIMA model is {glue:text}`arima_mae`. Given that crime incidents per month often range in the hundreds to thousands, an MAE of {glue:text}`arima_mae` could be considered within an acceptable range, especially considering the complexities inherent in forecasting crime data. This pattern agrees with the earlier visualizations, which showed forecasting accuracy improving across the models.

```{glue:figure} observations
:name: "observations"

The observed error terms from different forecasting methods.
```
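
For reference, the two error metrics can be computed as in the sketch below. The numbers are made up for illustration and are unrelated to the results in the table; scikit-learn's `mean_absolute_error` and `mean_squared_error` compute the same quantities.

```python
import numpy as np

# Toy actual and forecast values (made up for illustration).
actual = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([110.0, 115.0, 95.0, 100.0])

errors = actual - forecast
mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}")
```

When applying this to rolling forecasts, months without a forecast (the first 12 in our setup) would need to be excluded before computing the errors.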

## Conclusion

In this notebook, we built a model that forecasts the number of theft from vehicle crimes in Vancouver one month ahead. Such a forecast could enable the City of Vancouver to allocate law enforcement resources preemptively in preparation for an upcoming surge in crimes. An early warning system of this kind could help improve the safety of the people of Vancouver.

Our analysis of three distinct forecasting models sheds light on the nature of predicting theft from vehicle crimes in Vancouver. Interestingly, our ARIMA model, relying solely on the most recent lagged differenced observation, outperformed both the Simple Moving Average (SMA) and Exponential Smoothing (ES) models, which incorporate information from all lagged observations over the past 12 months. This suggests that historical values from further back in time might not contribute significantly to the predictive accuracy of theft from vehicle crime forecasts. The superiority of the ARIMA model, despite its simplicity, underscores the importance of prioritizing recent data for more accurate and effective crime forecasting.

While the ARIMA model is the most effective of the three forecasting models, outperforming both simple moving average and exponential smoothing, there is still room to improve our predictive capabilities. Future work could fine-tune the ARIMA hyperparameters or explore alternative models to see whether further accuracy gains are attainable. Additionally, integrating exogenous variables, such as socioeconomic indicators or weather data, might improve the predictive power of our models and offer a more comprehensive understanding of crime dynamics. This analysis also prompts further questions, including how specific external factors influence crime occurrences and whether spatial-temporal models can predict crime hotspots within Vancouver. These avenues aim to refine our forecasting precision and deepen our insight into crime trends, paving the way for more informed law enforcement strategies and proactive crime prevention measures.

## References

```{bibliography}
```