Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights (ACM SIGSPATIAL 2019)
One important step is to transform the raw input data into the input format for a machine learning model. Here we employ the following steps:
- Step 1: Run `1-CreateInputForAccidentPrediction.ipynb` from `1-GenerateFeatureVector` to generate raw feature vectors. Each vector represents a geographical region of size 5km x 5km (which we call a geohash) during a 15-minute time interval. This code uses the LSTW dataset for traffic events data, raw weather observation records for weather-related attributes (see `data/Sample_Weather.tar.gz` for sample data), and daylight information (see `data/sample_daylight.csv` for sample data).
- Step 2: Run `2-CreateNaturalLanguageRepresentationForGeoHashes.ipynb` to generate description-to-vector representations for geographical regions. The main inputs for this process are LSTW and GloVe. A sample output can be found in `data/geohash_to_text_vec.csv`.
- Step 3: Run `3-DataCleaningAndIntegration.ipynb` for data cleaning and preparation for integration with POI data.
- Step 4: Run `4-FinalTrainAndTestDataPreparation.ipynb` to prepare the final train and test data. This includes creating sample entries and negative sampling for non-accident data samples. There are two versions of the code: single-thread and multi-thread. The multi-thread version uses more system cores but also needs more memory, which makes it more suitable for servers. The single-thread version is intended for desktop machines and for generating smaller train-test sets.
Implementations of these steps can be found in `1-GenerateFeatureVector`. Also note that the sample data and code cover the cities that we used in the paper (i.e., Atlanta, Austin, Charlotte, Dallas, Houston, and Los Angeles).
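The spatial and temporal binning in Step 1 can be sketched as follows. This is a hypothetical illustration, not the notebook's actual code: the grid-cell width `CELL_DEG` and the helper names are assumptions chosen only to show the idea of snapping a record to a 5km x 5km region and a 15-minute interval.

```python
from datetime import datetime

# Roughly 5 km expressed in latitude degrees; an assumed constant for illustration.
CELL_DEG = 0.045

def to_region(lat, lng, cell_deg=CELL_DEG):
    """Snap a coordinate to the lower-left corner of its grid cell (the 'geohash')."""
    return (round(lat // cell_deg * cell_deg, 4),
            round(lng // cell_deg * cell_deg, 4))

def to_interval(ts):
    """Floor a timestamp to the start of its 15-minute interval."""
    return ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)

record_time = datetime(2019, 8, 14, 10, 47, 12)
print(to_region(33.7490, -84.3880))   # an Atlanta coordinate mapped to its grid cell
print(to_interval(record_time))       # -> 2019-08-14 10:45:00
```

Every raw event record can then be keyed by the `(region, interval)` pair before feature aggregation.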
To train and test our proposed model and the baselines, you can use our pre-generated train and test files for six cities: Atlanta, Austin, Charlotte, Dallas, Houston, and Los Angeles. The time frame used to generate the sample data for these cities is the same as described in our paper. You can find these files in `data/train_set.7z`. Use `7za e train_set.7z` to decompress this file and obtain 4 numpy (`.npy`) files per city: two files contain the train and test feature vectors, and two contain the train and test labels. These sample files are the result of the input generation process described above.
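Once decompressed, the `.npy` files can be loaded with `numpy.load`. A minimal sketch, using dummy file names and dummy arrays since the actual file names inside the archive may differ:

```python
import numpy as np

# Hypothetical file names; adjust to the actual per-city names in train_set.7z.
# For a self-contained demo, first write small dummy arrays to disk.
np.save("demo_train_x.npy", np.zeros((8, 5), dtype=np.float32))
np.save("demo_train_y.npy", np.zeros(8, dtype=np.int64))

X_train = np.load("demo_train_x.npy")  # feature vectors
y_train = np.load("demo_train_y.npy")  # matching labels

# Feature rows and labels must stay aligned one-to-one.
assert X_train.shape[0] == y_train.shape[0]
print(X_train.shape, y_train.shape)  # -> (8, 5) (8,)
```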
Our Deep Accident Prediction (DAP) model comprises several important components: a Recurrent Component, an Embedding Component, a Description-to-Vector Component, a Points-Of-Interest Component, and a Fully-Connected Component. The following image illustrates the model:

The implementation of this model can be found in `2-DAP/DAP.ipynb`.
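A DAP-style multi-input architecture can be sketched with the Keras functional API. This is a hedged illustration of how the named components could be wired together, not the notebook's actual architecture; all layer sizes, input shapes, and vocabulary sizes below are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Assumed input shapes: 8 time intervals x 20 features, a geohash id,
# and a 100-d description/POI vector.
seq_in = keras.Input(shape=(8, 20), name="timeseries")
geo_in = keras.Input(shape=(1,), name="geohash_id")
vec_in = keras.Input(shape=(100,), name="desc2vec_poi")

r = layers.LSTM(64)(seq_in)                               # Recurrent Component
e = layers.Flatten()(layers.Embedding(5000, 16)(geo_in))  # Embedding Component
d = layers.Dense(32, activation="relu")(vec_in)           # Desc2Vec / POI branch

h = layers.concatenate([r, e, d])                         # merge branches
h = layers.Dense(64, activation="relu")(h)                # Fully-Connected Component
out = layers.Dense(1, activation="sigmoid", name="accident_risk")(h)

model = keras.Model([seq_in, geo_in, vec_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Sanity check on random inputs: one risk score per sample.
pred = model.predict([np.zeros((2, 8, 20)), np.zeros((2, 1)), np.zeros((2, 100))],
                     verbose=0)
print(pred.shape)  # -> (2, 1)
```

Separating the branches this way lets each data modality (time series, region identity, text and POI features) be processed by the layer type suited to it before the final fully-connected merge.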
In terms of baselines, we employed the following models:
- Logistic Regression (LR): find sample code in `3-Baselines/Traditional_Models_Sklearn.py`.
- Gradient Boosting Classifier (GBC): find sample code in `3-Baselines/Traditional_Models_Sklearn.py`.
- Feed-Forward Neural Network Model (DNN): an implementation of this model can be found in `3-Baselines/DNN.ipynb`.
- DAP Without Embedding Component (DAP-NoEmbed): an implementation of this model can be found in `2-DAP/DAP-NoEmbed.ipynb`.
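The LR and GBC baselines can be sketched with a few lines of scikit-learn. This is an illustrative sketch with random stand-in data, not the configuration used in `Traditional_Models_Sklearn.py`; in practice the arrays would come from the train/test `.npy` files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Random stand-in data; replace with the real train/test arrays.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 10)), rng.integers(0, 2, 50)

for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("GBC", GradientBoostingClassifier(n_estimators=100))]:
    clf.fit(X_train, y_train)
    print(name, "F1:", round(f1_score(y_test, clf.predict(X_test)), 3))
```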
We recommend using Python 3.x and installing the following packages to properly run the code:
pip install tensorflow==1.14.0
pip install keras==2.3.1
pip install keras_metrics
pip install keras_self_attention
pip install scikit-learn==0.20.0
Please note that you may choose to use other versions of `tensorflow` and/or `keras`, but make sure that they are compatible.
All implementations are in Python, with the deep learning models developed in Keras using TensorFlow as the backend. The non-deep-learning baselines (i.e., LR and GBC) can be run on CPU machines, but for the deep learning models we recommend GPU machines to speed up training.
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2019.