# Looking at the Data from the National Incident-Based Reporting System

National Incident-Based Reporting System, 2016: Extract Files (ICPSR 37066)

https://www.icpsr.umich.edu/web/NACJD/studies/37066#

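You need to download the data from the website, which requires creating an account. The code below expects the archive to be saved as `ICPSR_37066-V2.zip` in your `Downloads` directory.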

In [25]:
import os
import pathlib
import zipfile
import pandas as pd

## Variables of Interest

The field names in the data files are mostly opaque codes (for example, `V4018`). The cell below maps the fields of interest to more meaningful names.


In [45]:
# Field names and short names for variables of interest

variables_info = """
    INCNUM - INCIDENT NUMBER
    INCDATE - INCIDENT DATE
    BH007 - CITY NAME
    BH008 - STATE ABBREVIATION
    V1007 - INCIDENT DATE HOUR
    V1010 - TOTAL OFFENDER SEGMENTS
    V1013 - CLEARED EXCEPTIONALLY
    V20061 - UCR OFFENSE CODE
    V20071 - OFFENSE ATTEMPTED/COMPLETED
    V20081 - OFFENDER(S) SUSPECTED OF USING
    V20111 - LOCATION TYPE
    V20141 - TYPE OF CRIMINAL ACTIVITY
    V20171 - TYPE WEAPON/FORCE INVOLVED
    V20201 - BIAS MOTIVATION
    V4017 - TYPE OF VICTIM
    V4018 - AGE OF VICTIM
    V4019 - SEX OF VICTIM
    V4020 - RACE OF VICTIM
    V4032 - RELATIONSHIP OF VICTIM TO OFFENDER
    V1009 - TOTAL VICTIM SEGMENTS
"""

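# Parse each "CODE - DESCRIPTION" line into a {code: description} mapping.
# Blank lines do not split into two parts and are skipped.
fields = {}
for line in variables_info.split("\n"):
    parts = line.strip().split(" - ")
    if len(parts) == 2:
        fields[parts[0]] = parts[1]

fields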

{'INCNUM': 'INCIDENT NUMBER',
 'INCDATE': 'INCIDENT DATE',
 'BH007': 'CITY NAME',
 'BH008': 'STATE ABBREVIATION',
 'V1007': 'INCIDENT DATE HOUR',
 'V1010': 'TOTAL OFFENDER SEGMENTS',
 'V1013': 'CLEARED EXCEPTIONALLY',
 'V20061': 'UCR OFFENSE CODE',
 'V20071': 'OFFENSE ATTEMPTED/COMPLETED',
 'V20081': 'OFFENDER(S) SUSPECTED OF USING',
 'V20111': 'LOCATION TYPE',
 'V20141': 'TYPE OF CRIMINAL ACTIVITY',
 'V20171': 'TYPE WEAPON/FORCE INVOLVED',
 'V20201': 'BIAS MOTIVATION',
 'V4017': 'TYPE OF VICTIM',
 'V4018': 'AGE OF VICTIM',
 'V4019': 'SEX OF VICTIM',
 'V4020': 'RACE OF VICTIM',
 'V4032': 'RELATIONSHIP OF VICTIM TO OFFENDER',
 'V1009': 'TOTAL VICTIM SEGMENTS'}

## Extract the raw data

The cell below extracts the zip file from the download directory into a data directory. Extraction is skipped if the data directory already exists.

In [54]:
download_dir = os.path.join(pathlib.Path.home(), "Downloads")
zip_download_file = os.path.join(download_dir, "ICPSR_37066-V2.zip")
national_incident_data_dir = "data/national_incident_data"

# Check that the file has been downloaded
if not os.path.isfile(zip_download_file):
    raise FileNotFoundError(f"Please download the data file to {zip_download_file}")

# Extract only once; skip if the data directory already exists
if not os.path.isdir(national_incident_data_dir):
    with zipfile.ZipFile(zip_download_file, "r") as zfh:
        zfh.extractall(national_incident_data_dir)
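
Before extracting a large archive, it can help to check which data files it contains. The following is a minimal sketch using `ZipFile.namelist()` on a small in-memory archive. The member names (`ICPSR_00000/...`) are made up for illustration and are not the layout of the real download.

```python
import io
import zipfile

# Build a tiny in-memory zip with hypothetical member names (not the real archive layout).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfh:
    zfh.writestr("ICPSR_00000/DS0001/example-Data.tsv", "A\tB\n1\t2\n")
    zfh.writestr("ICPSR_00000/README.txt", "placeholder")

# List the members and keep only the tab-separated data files.
with zipfile.ZipFile(buffer, "r") as zfh:
    tsv_members = [name for name in zfh.namelist() if name.endswith(".tsv")]

print(tsv_members)
```

The same `namelist()` call works on a `ZipFile` opened from a path such as `zip_download_file`.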

## Examine Incident File

Load the fields of interest from the incident file and look at some basic counts.


In [23]:
incident_file = os.path.join(national_incident_data_dir, "ICPSR_37066", "DS0001", "37066-0001-Data.tsv")

In [65]:
# Read 1 row to get list of all fields present.
data_1row = pd.read_csv(incident_file, sep="\t", nrows=1)

# Figure out which of the fields in the incident file we want.
fields_to_pull = [c for c in data_1row.columns.values if c in fields]

In [70]:
incident_data = pd.read_csv(incident_file, sep="\t", usecols=fields_to_pull)
incident_data.columns = [fields[c] for c in fields_to_pull]
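
The two-step read above (one row to discover the columns, then a second read with `usecols`) can also be done in one pass, because `pd.read_csv` accepts a callable for `usecols`. The following is a minimal sketch on a made-up three-column TSV. The names `sample_tsv`, `sample_fields`, and the `EXTRA` column are hypothetical and stand in for the real file and the `fields` mapping.

```python
import io
import pandas as pd

# Hypothetical miniature TSV standing in for the incident file.
sample_tsv = "INCNUM\tBH007\tEXTRA\n1\tSpringfield\tx\n2\tShelbyville\ty\n"

# Small subset of the field mapping, for illustration only.
sample_fields = {"INCNUM": "INCIDENT NUMBER", "BH007": "CITY NAME"}

# A callable passed to usecols keeps only the columns for which it returns True.
sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=lambda c: c in sample_fields,
)

# Rename by mapping rather than by position, so column order does not matter.
sample = sample.rename(columns=sample_fields)
print(list(sample.columns))
```

Renaming with a mapping avoids relying on the column order returned by `read_csv` matching the order of a separately built list.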

In [71]:
# There are > 5 million incidents.
len(incident_data)

5293536

In [75]:
# About 83% of incidents involve exactly 1 victim and 1 offender.

victim_to_offender_counts = (incident_data
    .groupby(['TOTAL VICTIM SEGMENTS', 'TOTAL OFFENDER SEGMENTS'])
    .size()
    .reset_index()
    .rename(columns={0: 'Count'})
    .sort_values(by='Count', ascending=False)
)
victim_to_offender_counts['Percent'] = 100.0 * victim_to_offender_counts.Count / victim_to_offender_counts.Count.sum()
victim_to_offender_counts

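| | TOTAL VICTIM SEGMENTS | TOTAL OFFENDER SEGMENTS | Count | Percent |
|---|---|---|---|---|
| 1 | 1 | 1 | 4397938 | 83.081290 |
| 2 | 1 | 2 | 309035 | 5.837969 |
| 40 | 2 | 1 | 297505 | 5.620156 |
| 41 | 2 | 2 | 87040 | 1.644270 |
| 3 | 1 | 3 | 63430 | 1.198254 |
| ... | ... | ... | ... | ... |
| 69 | 2 | 51 | 1 | 0.000019 |
| 70 | 2 | 66 | 1 | 0.000019 |
| 84 | 3 | 15 | 1 | 0.000019 |
| 86 | 3 | 28 | 1 | 0.000019 |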
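
The count-and-percent breakdown above can also be written more compactly with `DataFrame.value_counts(normalize=True)`. The following is a sketch on a four-row toy frame, not the real data. The frame `toy` and its values are made up, and `DataFrame.value_counts` is assumed to be available (pandas 1.1 or later).

```python
import pandas as pd

# Toy data: four hypothetical incidents with victim/offender segment counts.
toy = pd.DataFrame({
    "TOTAL VICTIM SEGMENTS": [1, 1, 1, 2],
    "TOTAL OFFENDER SEGMENTS": [1, 1, 2, 1],
})

# normalize=True returns each combination's share of all rows, sorted descending.
share = (
    toy.value_counts(["TOTAL VICTIM SEGMENTS", "TOTAL OFFENDER SEGMENTS"], normalize=True)
    .mul(100)
    .rename("Percent")
    .reset_index()
)
print(share)
```

This yields only the percentages. The `groupby(...).size()` form is still the one to use when the raw counts are needed alongside them.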


In [78]:
incident_data["INCIDENT DATE"].min()

20160101

In [79]:
incident_data["INCIDENT DATE"].max()

20161231
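
The minimum and maximum above show `INCIDENT DATE` stored as integers in `YYYYMMDD` form. To group by month or weekday, those integers can be converted to real timestamps. The following is a minimal sketch on three toy values. It assumes every value in the column follows the same format, which has not been checked here; passing `errors="coerce"` would turn any malformed value into `NaT` instead of raising.

```python
import pandas as pd

# Toy values in the same YYYYMMDD integer form as the min/max shown above.
dates = pd.Series([20160101, 20160615, 20161231], name="INCIDENT DATE")

# Convert to strings first so the format string can be applied digit by digit.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")

print(parsed.min().date(), parsed.max().date())
print(parsed.dt.month.tolist())
```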