# ShotSpotter Alert Data Cleaning
This notebook documents the process I'm using to clean the ShotSpotter alert data Max and Jim obtained.

## Setup

In [1]:
import datetime as dt

import pandas as pd

from shotspotter import settings

## Source Data
The Chicago Police Department publishes ShotSpotter alert data on the [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Shotspotter-Alerts/3h7q-7mdb/about_data).

From the data portal description:
> This dataset contains all ShotSpotter alerts since the introduction of ShotSpotter to some Chicago Police Department (CPD) districts in 2017. ShotSpotter is a gunshot detection system designed to automatically determine the location of potential outdoor gunfire. ShotSpotter audio sensors are placed in several CPD districts throughout the city (specific districts are noted below). If at least three sensors detect a sound that the ShotSpotter software determines to be potential gunfire, a location is determined and the alert is sent to human ShotSpotter analysts for review. Either the alert is sent to CPD, or it is dismissed. Each alert can contain multiple rounds of gunfire; sometimes there are multiple alerts for what may be determined to be one incident. More detail on the technology and its accuracy can be found on the company’s website here. It should also be noted that ShotSpotter alerts may increase year-over-year while gun violence did not necessarily increase accordingly because of improvements in detection sensors.
>
> ShotSpotter does not exist in every CPD district, and it was not rolled out in every district at the same time. ShotSpotter was first deployed in Chicago in 2017, and sensors exist in the following districts as of the May 2021 launch of this dataset: 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 015, and 025.


### Chicago Data Portal Version

I exported the entire dataset available on the data portal to a CSV file, saved as `Violence_Reduction_-_Shotspotter_Alerts_20240905.csv` in the raw data directory.

In [2]:
shotspotter_portal = pd.read_csv(
    settings.DATA_DIR_SRC / "Violence_Reduction_-_Shotspotter_Alerts_20240905.csv",
    dtype=str,
)
shotspotter_portal.head()

| | DATE | BLOCK | ZIP_CODE | WARD | COMMUNITY_AREA | AREA | DISTRICT | BEAT | STREET_OUTREACH_ORGANIZATION | UNIQUE_ID | MONTH | DAY_OF_WEEK | HOUR | INCIDENT_TYPE_DESCRIPTION | ROUNDS | ILLINOIS_HOUSE_DISTRICT | ILLINOIS_SENATE_DISTRICT | LATITUDE | LONGITUDE | LOCATION |
| - | ---- | ----- | -------- | ---- | -------------- | ---- | -------- | ---- | ---------------------------- | --------- | ----- | ----------- | ---- | ------------------------- | ------ | ----------------------- | ------------------------ | -------- | --------- | -------- |
| 0 | 04/08/2021 12:25:50 PM | 1600 N HARLEM AVE | | | | | | | | SST-359776 | 4 | 5 | 12 | MULTIPLE GUNSHOTS | 15 | 78 | 39 | 41.909239723 | -87.806192521 | POINT (-87.806192520613 41.909239722745) |
| 1 | 07/13/2018 11:58:53 PM | 5400 S WESTERN AVE | 60609.0 | 15.0 | GAGE PARK | 1.0 | 9.0 | 923.0 | | SST-63963 | 7 | 6 | 23 | SINGLE GUNSHOT | 1 | 1 | 1 | 41.794573312 | -87.683425433 | POINT (-87.683425432688 41.794573311857) |
| 2 | 12/31/2023 09:16:10 PM | 4600 W 16TH ST, | | | | | | | | SST-79100107733 | 12 | 1 | 21 | MULTIPLE GUNSHOTS | 7 | 23 | 12 | 41.857748513 | -87.740012652 | POINT (-87.740012651927 41.857748513273) |
| 3 | 05/31/2020 10:07:30 PM | 7500 S MAY ST | 60620.0 | 17.0 | AUBURN GRESHAM | 2.0 | 6.0 | 612.0 | | SST-103382 | 5 | 1 | 22 | SINGLE GUNSHOT | 1 | 31 | 16 | 41.756619925 | -87.652472979 | POINT (-87.652472979075 41.756619925448) |
| 4 | 05/19/2018 12:16:33 AM | 600-600 E 60TH ST | 60637.0 | 20.0 | WOODLAWN | 1.0 | 3.0 | 313.0 | | SST-1039 | 5 | 7 | 0 | SINGLE GUNSHOT | 1 | 5 | 3 | 41.786282586 | -87.610490921 | POINT (-87.610490921492 41.786282585604) |


The point of this analysis is to determine whether each crime incident involving a shooting has a corresponding ShotSpotter alert. Per our methodology, the way to do that is to check whether a ShotSpotter alert occurred within 0.5 miles of the incident's location and within a one-hour window of the incident's time. I've dealt with the shooting data separately, so all I need here is the following:

| Variable | Data Type | Description |
| -------- | --------- | ----------- |
| `id` | `str` | A unique identifier for the alert |
| `date_time` | `datetime.datetime` | The time and date for the alert |
| `latitude` | `np.float64` | The latitude for the alert |
| `longitude` | `np.float64` | The longitude for the alert |
| `type` | `pd.Categorical` | The type of alert |
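
To make the matching rule concrete, here is a minimal sketch of how a single alert could be compared against a single shooting incident using the variables above. It is only an illustration, not the matching code used in the analysis. The function names, the dictionary layout, the haversine distance, and the reading of "one-hour window" as "at most one hour before or after the incident" are all assumptions. The alert values come from the sample rows shown earlier; the incidents are made up.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def alert_matches_incident(alert, incident, max_miles=0.5, max_gap=timedelta(hours=1)):
    """Hypothetical helper: True if the alert is close in both space and time.

    Assumption: the time window is symmetric (alert within max_gap before
    or after the incident). The actual methodology may define it differently.
    """
    distance = haversine_miles(
        alert["latitude"], alert["longitude"],
        incident["latitude"], incident["longitude"],
    )
    gap = abs(alert["date_time"] - incident["date_time"])
    return distance <= max_miles and gap <= max_gap


# Alert SST-63963 from the sample rows above.
alert = {
    "date_time": datetime(2018, 7, 13, 23, 58, 53),
    "latitude": 41.794573312,
    "longitude": -87.683425433,
}
# Made-up incident: same coordinates, 20 minutes later.
nearby_incident = {
    "date_time": datetime(2018, 7, 14, 0, 18, 53),
    "latitude": 41.794573312,
    "longitude": -87.683425433,
}
# Made-up incident: same time as the first, but several miles east.
distant_incident = {
    "date_time": datetime(2018, 7, 14, 0, 18, 53),
    "latitude": 41.786282586,
    "longitude": -87.610490921,
}

nearby_match = alert_matches_incident(alert, nearby_incident)
distant_match = alert_matches_incident(alert, distant_incident)
```

Because the portal coordinates are deliberately offset by roughly a city block, any distance computed this way is approximate near the 0.5-mile boundary.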

The relevant columns in the source data are as follows:

| Column Name | Data Type | Description |
| ----------- | --------- | ----------- |
| `UNIQUE_ID` | Text | The unique identifier generated by shotspotter for each alert. |
| `DATE` | Floating timestamp | Date when the alert was generated. |
| `LATITUDE` | Number | The latitude of the potential gunfire detection. In order to preserve anonymity, the given coordinates are not the actual location of the crime. To produce slightly altered coordinates, a circle roughly the size of an average city block was drawn around the original point location, and a new location was picked randomly from a spot around the circumference of that circle. |
| `LONGITUDE` | Number | The longitude of the potential gunfire detection. This has been slightly altered to preserve anonymity (see details under LATITUDE). |
| `INCIDENT_TYPE_DESCRIPTION` | Text | ShotSpotter code describing the type of alert. Alert types are “Single Gunshot,” “Multiple Gunshot,” and “Gunshot or Firecracker.” |

(Data types and descriptions from the data dictionary on the data portal.)

## Cleaning

### Data Conversion

First, we convert the raw text data to the appropriate data types.

In [3]:
converted_portal = (
    pd.DataFrame(
        {
            "id": shotspotter_portal["UNIQUE_ID"],
            "date_time": pd.to_datetime(shotspotter_portal["DATE"], format="%m/%d/%Y %I:%M:%S %p"),
            # The CSV was read with dtype=str, so convert coordinates to floats.
            "latitude": pd.to_numeric(shotspotter_portal["LATITUDE"]),
            "longitude": pd.to_numeric(shotspotter_portal["LONGITUDE"]),
            "type": pd.Categorical(shotspotter_portal["INCIDENT_TYPE_DESCRIPTION"]),
        }
    )
    .set_index("id")
)
converted_portal.head()

| id | date_time | latitude | longitude | type |
| -- | --------- | -------- | --------- | ---- |
| SST-359776 | 2021-04-08 12:25:50 | 41.909239723 | -87.806192521 | MULTIPLE GUNSHOTS |
| SST-63963 | 2018-07-13 23:58:53 | 41.794573312 | -87.683425433 | SINGLE GUNSHOT |
| SST-79100107733 | 2023-12-31 21:16:10 | 41.857748513 | -87.740012652 | MULTIPLE GUNSHOTS |
| SST-103382 | 2020-05-31 22:07:30 | 41.756619925 | -87.652472979 | SINGLE GUNSHOT |
| SST-1039 | 2018-05-19 00:16:33 | 41.786282586 | -87.610490921 | SINGLE GUNSHOT |
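
As a quick sanity check on the conversion, the sketch below parses two of the raw values shown earlier with the same 12-hour format string and confirms they land on the expected 24-hour timestamps. The `sample` frame is a hand-built stand-in for the real data, and using `pd.to_numeric` for the coordinates is one reasonable option rather than a requirement.

```python
import pandas as pd

# Hand-copied raw strings from the first and last sample rows above.
sample = pd.DataFrame(
    {
        "DATE": ["04/08/2021 12:25:50 PM", "05/19/2018 12:16:33 AM"],
        "LATITUDE": ["41.909239723", "41.786282586"],
    }
)

# %I is the 12-hour clock and %p is the AM/PM marker, so "12:16:33 AM"
# becomes 00:16:33 and "12:25:50 PM" stays 12:25:50.
parsed_dates = pd.to_datetime(sample["DATE"], format="%m/%d/%Y %I:%M:%S %p")

# Text read with dtype=str stays text until it is converted explicitly.
parsed_latitude = pd.to_numeric(sample["LATITUDE"])

print(parsed_dates.tolist())
print(parsed_latitude.dtype)
```

With an explicit `format`, `pd.to_datetime` raises an error on any value that does not match, which surfaces malformed timestamps early instead of letting them through silently.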


### Filtering

Our analysis only considers shooting incidents and alerts in 2023 and 2024, so we need to filter the dataset to only include that time period.

In [4]:
converted_portal_2023_2024 = converted_portal.loc[
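    # Half-open interval: keep 2023-01-01 00:00:00 up to, but not including,
    # 2025-01-01 00:00:00 (between() includes both endpoints by default).
    converted_portal["date_time"].between(
        dt.datetime(2023, 1, 1),
        dt.datetime(2025, 1, 1),
        inclusive="left",
    )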
]
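
One detail worth checking in a date filter like this is how the endpoints are treated. `Series.between` includes both endpoints unless told otherwise, so an alert stamped exactly at midnight on 2025-01-01 would be kept by default. The sketch below uses four made-up timestamps around the boundaries to show the difference. It assumes pandas 1.3 or later, where `inclusive` accepts string values.

```python
import datetime as dt

import pandas as pd

# Made-up timestamps sitting on either side of each boundary.
stamps = pd.Series(
    pd.to_datetime(
        [
            "2022-12-31 23:59:59",
            "2023-01-01 00:00:00",
            "2024-12-31 23:59:59",
            "2025-01-01 00:00:00",
        ]
    )
)

start = dt.datetime(2023, 1, 1)
end = dt.datetime(2025, 1, 1)

# Default behaviour: both endpoints count as inside the range.
mask_both = stamps.between(start, end)

# Half-open interval: include the start, exclude the end.
mask_left = stamps.between(start, end, inclusive="left")

print(mask_both.tolist())
print(mask_left.tolist())
```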

### Saving

In [5]:
converted_portal_2023_2024.to_csv(
    settings.DATA_DIR_PROCESSED / "shotspotter_alerts_2023_2024.csv" 
)
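
CSV files do not store data types, so whatever reads this file later has to restore them. The following sketch round-trips a one-row stand-in frame through an in-memory buffer to show one way of doing that. The buffer replaces the real file path for illustration, and the `read_csv` arguments are a suggestion, not something the downstream analysis is known to use.

```python
import io

import pandas as pd

# One-row stand-in with the same columns and dtypes as the cleaned data.
alerts = pd.DataFrame(
    {
        "id": ["SST-79100107733"],
        "date_time": pd.to_datetime(["2023-12-31 21:16:10"]),
        "latitude": [41.857748513],
        "longitude": [-87.740012652],
        "type": pd.Categorical(["MULTIPLE GUNSHOTS"]),
    }
).set_index("id")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
alerts.to_csv(buffer)
buffer.seek(0)

# Restore the index, timestamps, and categorical column on the way back in.
reloaded = pd.read_csv(
    buffer,
    index_col="id",
    parse_dates=["date_time"],
    dtype={"latitude": "float64", "longitude": "float64", "type": "category"},
)

print(reloaded.dtypes)
```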