# Reckless Firearm Discharge Data Cleaning Notebook

This notebook documents the process I'm using to clean the reckless firearm discharge data Max and Jim obtained.

## Setup

In [1]:
import datetime as dt

import geodatasets
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

from shotspotter import settings

## Source Data

The original source for the reckless firearm discharge data is the Chicago Police Department's "Crimes - 2001 to Present" dataset on the [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data). Per the description on the portal:

>This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Data Fulfillment and Analysis Division of the Chicago Police Department at DFA@ChicagoPolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. 
> The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data are updated daily. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e


In [2]:
crimes = pd.read_csv(settings.DATA_DIR_SRC / "Crimes_-_2001_to_Present_20240905.csv", dtype=str)
crimes.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5741943 | HN549294 | 08/25/2007 09:22:18 AM | 074XX N ROGERS AVE | 560 | ASSAULT | SIMPLE | OTHER | False | False | ... | 49 | 1 | 08A | NaN | NaN | 2007 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 1 | 25953 | JE240540 | 05/24/2021 03:06:00 PM | 020XX N LARAMIE AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | True | False | ... | 36 | 19 | 01A | 1141387.0 | 1913179.0 | 2021 | 11/18/2023 03:39:49 PM | 41.917838056 | -87.755968972 | (41.917838056, -87.755968972) |
| 2 | 26038 | JE279849 | 06/26/2021 09:24:00 AM | 062XX N MC CORMICK RD | 110 | HOMICIDE | FIRST DEGREE MURDER | PARKING LOT | True | False | ... | 50 | 13 | 01A | 1152781.0 | 1941458.0 | 2021 | 11/18/2023 03:39:49 PM | 41.995219444 | -87.713354912 | (41.995219444, -87.713354912) |
| 3 | 13279676 | JG507211 | 11/09/2023 07:30:00 AM | 019XX W BYRON ST | 620 | BURGLARY | UNLAWFUL ENTRY | APARTMENT | False | False | ... | 47 | 5 | 05 | 1162518.0 | 1925906.0 | 2023 | 11/18/2023 03:39:49 PM | 41.952345086 | -87.677975059 | (41.952345086, -87.677975059) |
| 4 | 13274752 | JG501049 | 11/12/2023 07:59:00 AM | 086XX S COTTAGE GROVE AVE | 454 | BATTERY | AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN... | SMALL RETAIL STORE | True | False | ... | 6 | 44 | 08B | 1183071.0 | 1847869.0 | 2023 | 12/09/2023 03:41:24 PM | 41.737750767 | -87.604855911 | (41.737750767, -87.604855911) |


The point of this analysis is to determine whether each crime incident involving a shooting has a corresponding ShotSpotter alert. Per our methodology, we check whether a ShotSpotter alert occurred within 0.5 miles of the incident's location and within a one-hour window of the incident's time. We also filter out shootings that occurred indoors or outside the ShotSpotter coverage area.

That means we need the following information for each shooting:

| Variable | Data Type | Description |
| -------- | --------- | ----------- |
| `id` | `str` | A unique identifier for the incident |
| `case_number` | `str` | The case number for the shooting |
| `date_time` | `datetime.datetime` | The time and date on which the incident occurred |
| `latitude` | `np.float64` | The latitude for the shooting |
| `longitude` | `np.float64` | The longitude for the shooting |
| `type` | `pd.Categorical` | The type of incident |
| `indoors` | `bool` | Whether or not the incident happened indoors — see `shotspotter.settings.INDOOR_LOCATIONS` for list of locations we count as "indoors" |
| `place_description` | `str` | A text description of the place where the crime occurred. |
| `in_coverage_area` | `bool` | Whether or not the incident happened within the ShotSpotter coverage area. This is defined contractually by police district. See [map](https://www.documentcloud.org/documents/24388755-mo-emails_factsheet_guidice?responsive=1&title=1) for details. |
| `police_district` | `pd.Categorical` | The CPD police district in which the incident occurred. |

The relevant columns in the source data are as follows (descriptions and data types from the data dictionary on the Chicago Data Portal):

| Column Name | Data Type | Description |
| ----------- | --------- | ----------- |
| `ID` | Number | Unique identifier for the record. |
| `Case Number` | Text | The Chicago Police Department RD Number (Records Division Number), which is unique to the incident. |
| `Date` | Floating timestamp | Date when the victimization occurred. This is sometimes a best estimate. |
| `Description` | Text | The secondary description of the IUCR code, a subcategory of the primary description. |
| `Latitude` | Number | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block. |
| `Longitude` | Number | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block. |
| `Location Description` | Text | Description of the location where the incident occurred. |
| `District` | Text | Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r. |

## Cleaning

The close correspondence between the source columns and the desired output makes cleaning fairly simple. 

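### Conversion

First, we convert the raw text data to the appropriate data types. Note that `latitude` and `longitude` are passed through as read, so they remain strings at this stage because the CSV was loaded with `dtype=str`.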

In [3]:
converted_df = (
    pd.DataFrame(
        {
            "id": crimes["ID"],
            "case_number": crimes["Case Number"],
            "date_time": pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p"),
            "latitude": crimes["Latitude"], 
            "longitude": crimes["Longitude"],
            "type": pd.Categorical(crimes["Description"]),
            "indoors": crimes["Location Description"].str.strip().isin(settings.INDOOR_LOCATIONS),
            "place_description": crimes["Location Description"].str.strip(),
            "in_coverage_area": crimes["District"].isin(settings.SHOTSPOTTER_DISTRICTS),
            "police_district": pd.Categorical(crimes["District"]),
        }
    )
    .set_index("id")
)
converted_df.head()

| id | case_number | date_time | latitude | longitude | type | indoors | place_description | in_coverage_area | police_district |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5741943 | HN549294 | 2007-08-25 09:22:18 | NaN | NaN | SIMPLE | False | OTHER | False | 24 |
| 25953 | JE240540 | 2021-05-24 15:06:00 | 41.917838056 | -87.755968972 | FIRST DEGREE MURDER | False | STREET | True | 25 |
| 26038 | JE279849 | 2021-06-26 09:24:00 | 41.995219444 | -87.713354912 | FIRST DEGREE MURDER | False | PARKING LOT | False | 17 |
| 13279676 | JG507211 | 2023-11-09 07:30:00 | 41.952345086 | -87.677975059 | UNLAWFUL ENTRY | True | APARTMENT | False | 19 |
| 13274752 | JG501049 | 2023-11-12 07:59:00 | 41.737750767 | -87.604855911 | AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN... | True | SMALL RETAIL STORE | True | 6 |
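
The target schema lists `latitude` and `longitude` as `np.float64`. One way to get there is `pd.to_numeric`. The sketch below applies it to a small frame whose second row mimics a record with missing coordinates. The frame and the `coords` name are illustrative only, and this is a suggestion rather than a step the notebook currently performs.

```python
import pandas as pd

# Illustrative frame: string coordinates as produced by read_csv(..., dtype=str),
# with one record missing its location.
sample = pd.DataFrame(
    {"Latitude": ["41.917838056", None], "Longitude": ["-87.755968972", None]},
    dtype=object,
)

# pd.to_numeric parses the strings to floats and leaves missing values as NaN.
coords = sample.apply(pd.to_numeric)
print(coords.dtypes)
```

Numeric coordinates are also what a distance calculation against ShotSpotter alerts would need.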


### Filtering

Our analysis only considers reckless firearm discharge incidents in 2023 and 2024, so we filter the dataset to that period (January 1, 2023 up to but not including January 1, 2025) and to incidents whose description is `RECKLESS FIREARM DISCHARGE`. We also exclude incidents that happened indoors, as well as those that fall outside the ShotSpotter coverage area. Finally, we drop duplicate rows, defined as incidents with the same time, latitude, and longitude.

In [4]:
shootings_2023_2024 = (
    converted_df.loc[
        converted_df["date_time"].between(dt.datetime(2023, 1, 1), dt.datetime(2025, 1, 1), inclusive="left")
        & ~converted_df["indoors"]
        & (converted_df["type"] == "RECKLESS FIREARM DISCHARGE")
        & converted_df["in_coverage_area"]
    ]
    .drop(columns=["indoors", "in_coverage_area"])
    .drop_duplicates(subset=["date_time", "latitude", "longitude"])
)
shootings_2023_2024.describe()

| | date_time |
| --- | --- |
| count | 3173 |
| mean | 2023-10-29 07:59:58.581783808 |
| min | 2023-01-01 00:00:00 |
| 25% | 2023-06-03 03:17:00 |
| 50% | 2023-10-17 05:08:00 |
| 75% | 2024-04-03 19:08:00 |
| max | 2024-08-27 18:19:00 |
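
To make the date boundary and the duplicate rule concrete, here is a small sketch on made-up rows (the `toy` frame and its values are hypothetical). It uses the same `between(..., inclusive="left")` and `drop_duplicates(subset=...)` calls as the cell above: the lower bound is included, the upper bound is excluded, and rows sharing a time and coordinates collapse to the first occurrence.

```python
import datetime as dt

import pandas as pd

# Made-up rows: one just before the window, two identical rows on the lower
# boundary, and one exactly on the (excluded) upper boundary.
toy = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            [
                "2022-12-31 23:59:00",
                "2023-01-01 00:00:00",
                "2023-01-01 00:00:00",
                "2025-01-01 00:00:00",
            ]
        ),
        "latitude": ["41.9"] * 4,
        "longitude": ["-87.7"] * 4,
    }
)

# inclusive="left" keeps the start of the window and drops the end.
in_window = toy["date_time"].between(
    dt.datetime(2023, 1, 1), dt.datetime(2025, 1, 1), inclusive="left"
)

# Rows with the same time and coordinates count as one incident.
kept = toy.loc[in_window].drop_duplicates(subset=["date_time", "latitude", "longitude"])
print(in_window.tolist())
print(len(kept))
```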


### Saving

In [5]:
shootings_2023_2024.to_csv(
    settings.DATA_DIR_PROCESSED / "reckless_firearm_discharges_2023_2024.csv" 
)
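
CSV does not record column types, so whoever loads this file later has to declare them again. The sketch below round-trips a single made-up row through an in-memory buffer to show the idea. The row values (including the zero-padded district) and the `restored` name are hypothetical.

```python
import io

import pandas as pd

# A made-up row shaped like the cleaned output.
frame = pd.DataFrame(
    {
        "id": ["13279676"],
        "case_number": ["JG507211"],
        "date_time": pd.to_datetime(["2023-11-09 07:30:00"]),
        "police_district": ["019"],
    }
).set_index("id")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Re-declare identifier columns as strings and parse the timestamp column;
# without dtype=str the district "019" would be read back as the integer 19.
restored = pd.read_csv(
    buffer,
    dtype={"id": str, "case_number": str, "police_district": str},
    parse_dates=["date_time"],
).set_index("id")
print(restored.dtypes)
```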