### NIBRS Data
An incident can have many different "offenses" attached to it (e.g., Robbery, Motor Vehicle Theft). Each offense, in turn, can have many different offenders, arrestees, victims, and property records attached.

To account for this, the data is split across 6 separate datasets (incident, offense, offender, arrestee, victim, property). A full join across all of them would duplicate rows: an offense with multiple arrestees, for example, would be listed once per arrestee.
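The duplication described above is easy to reproduce on toy data. The sketch below is illustrative only: the values are made up, and `ARRESTEE_ID` and `VICTIM_ID` are hypothetical column names, not confirmed fields of the Virginia datasets.

```python
import pandas as pd

# Toy data (made up): one incident with one offense, two arrestees, two victims.
offense = pd.DataFrame({"INCIDENT_ID": [1], "OFFENSE_TYPE_NAME": ["Robbery"]})
arrestee = pd.DataFrame({"INCIDENT_ID": [1, 1], "ARRESTEE_ID": [10, 11]})  # hypothetical column
victim = pd.DataFrame({"INCIDENT_ID": [1, 1], "VICTIM_ID": [20, 21]})      # hypothetical column

# Joining everything on INCIDENT_ID repeats the offense once per
# arrestee/victim combination.
joined = offense.merge(arrestee, on="INCIDENT_ID").merge(victim, on="INCIDENT_ID")
print(len(offense), len(joined))

# Counting offenses from the joined table overcounts; de-duplicating on the
# offense columns recovers the true count.
n_offenses = joined.drop_duplicates(subset=["INCIDENT_ID", "OFFENSE_TYPE_NAME"]).shape[0]
print(n_offenses)
```

This is why the rest of the notebook works from the offense table and only joins in the one extra table it needs at each step.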

Virginia first started using the NIBRS system in 1994; however, it was still up to each individual police department to decide the format in which it reported crimes.

Goals with consistency in NIBRS data:
- Are we consistent with police departments we include?
- Are we still representative of the state by population and demographics (rural, urban)?
- Do we include enough years?
    - Crime trends play out over decades, so we need enough years to tell a real change apart from year-to-year noise.

Extracted Data goals:
- Homicides involving a Firearm
- Aggravated Assaults involving a Firearm
- Theft of Guns.
    - This is *exceptionally* common and a prominent way to obtain firearms if you're not allowed to purchase them (straw purchases are a federal offense).
    - It's usually guns left in unsecured cars that are "broken into" (AKA: the door is opened and the gun is stolen).

Theft of Guns can be used alongside gun laws, I think.

Full data links from VA's Open Data Portal: https://data.virginia.gov/browse?q=NIBRS&sortBy=relevance

In [1]:
import numpy as np
import os
import pandas as pd
from sodapy import Socrata

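```python
# The app token is read from the environment (None if the variable is unset).
client = Socrata("data.virginia.gov", app_token=os.environ.get('socrata_api_key'))
# Each key becomes the CSV file name; 'incidents' and 'offenses' match
# the file names read later in this notebook.
datasets = {
    'incidents': 'uc5j-xp8k',
    'victim': 'v8hz-zj7n',
    'offender': 'g6a7-yn7n',
    'arrestee': 't3a6-79sg',
    'offenses': '6kj6-fuus',
    'property': 'iy5e-x83h'}

for name, key in datasets.items():
    res = client.get_all(key)  # generator that pages through every record
    df = pd.DataFrame.from_records(list(res))
    df.to_csv(f'../data/raw/{name}.csv')
```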

### What does the offense data look like?
This will be used to derive the actual crime data, since it contains the offense type attached to an incident (e.g., aggravated assault, homicide).

In [2]:
offenses = pd.read_csv('../data/raw/offenses.csv', low_memory=False)

In [3]:
offenses.columns

Index(['DATA_YEAR', 'OFFENSE_ID', 'INCIDENT_ID', 'OFFENSE_TYPE_NAME',
       'CRIMINAL_ACT_DESC', 'ATTEMPT_COMPLETE', 'LOCATION_NAME',
       'NUM_PREMISES_ENTERED', 'METHOD_ENTRY', 'REPORTING_YEAR', 'SOURCE_FILE',
       'LOAD_DATE'],
      dtype='object')

In [4]:
offenses.head()

Unnamed: 0,DATA_YEAR,OFFENSE_ID,INCIDENT_ID,OFFENSE_TYPE_NAME,CRIMINAL_ACT_DESC,ATTEMPT_COMPLETE,LOCATION_NAME,NUM_PREMISES_ENTERED,METHOD_ENTRY,REPORTING_YEAR,SOURCE_FILE,LOAD_DATE
0,2006,35779798,33853567,,,Completed,,,,2006,NIBRS_OFFENSE,10/12/2021
1,2007,42405041,41016049,,,Completed,,,,2007,NIBRS_OFFENSE,10/12/2021
2,2007,45312575,38053229,,,Completed,,,,2007,NIBRS_OFFENSE,10/12/2021
3,2006,37085497,35420723,,,Completed,,,,2006,NIBRS_OFFENSE,10/12/2021
4,2002,20845355,19589045,,,Completed,,,,2002,NIBRS_OFFENSE,10/12/2021


### What share of offense records have an offense type?

In [5]:
offenses[~offenses['OFFENSE_TYPE_NAME'].isna()].shape[0]/offenses.shape[0]

0.22991838193759842

### How many years have offense data attached to them?

In [6]:
offenses[offenses['OFFENSE_TYPE_NAME'].isna()].groupby('DATA_YEAR').agg(CtNoOffense=('DATA_YEAR', 'count'))

Unnamed: 0_level_0,CtNoOffense
DATA_YEAR,Unnamed: 1_level_1
1994,9624
1995,22455
1996,51231
1997,114108
1998,163148
1999,266975
2000,452767
2001,483699
2002,500537
2003,486990


##### Okay, so complete NIBRS data for Virginia only exists for 2016 onwards
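One way to check completeness directly is to compute, per year, the share of rows with no offense type. The sketch below runs on a small made-up stand-in for the offense table; only the column names `DATA_YEAR` and `OFFENSE_TYPE_NAME` come from the real data.

```python
import pandas as pd

# Toy stand-in for the offense table (values are made up).
toy = pd.DataFrame({
    "DATA_YEAR": [2003, 2003, 2015, 2015, 2016, 2016],
    "OFFENSE_TYPE_NAME": [None, None, None, "Robbery", "Robbery", "Aggravated Assault"],
})

# Share of rows per year with no offense type; 0.0 means the year is complete.
missing_share = toy["OFFENSE_TYPE_NAME"].isna().groupby(toy["DATA_YEAR"]).mean()
print(missing_share)

# Years in which every row has an offense type.
complete_years = missing_share[missing_share == 0].index.tolist()
print(complete_years)
```

Unlike a raw count of missing rows, a share is comparable across years with very different reporting volumes.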

#### How much data do we have for 2016 onwards?

In [7]:
offenses[offenses['DATA_YEAR']>2015].groupby('DATA_YEAR').agg(OffensesPerYear=('DATA_YEAR', 'count'))

Unnamed: 0_level_0,OffensesPerYear
DATA_YEAR,Unnamed: 1_level_1
2016,438314
2017,438170
2018,422399
2019,418956
2020,379163
2021,361876


### Restrict offenses to 2016 onwards and save

In [8]:
offenses = offenses[offenses['DATA_YEAR'] > 2015].copy()

In [9]:
offenses.to_csv('../data/processed/offenses_2016+.csv')

#### How many distinct police departments per year?
To answer this question, we now need the Incidents data.

In [11]:
incidents = pd.read_csv('../data/raw/incidents.csv', low_memory=False)

In [12]:
incidents = incidents[incidents['DATA_YEAR'] > 2015].copy()

In [13]:
incidents.head()

Unnamed: 0,DATA_YEAR,INCIDENT_ID,PUB_AGENCY_NAME,FIPS,CARGO_THEFT_FLAG,SUBMISSION_DATE,INCIDENT_DATE,REPORT_DATE_FLAG,INCIDENT_HOUR,INCIDENT_STATUS,DATA_HOME,ORIG_FORMAT,DID,REPORTING_YEAR,SOURCE_FILE,LOAD_DATE
37,2017,91016486,Chesterfield County Police Department,51041,N,08/17/2018,04/13/2017,,4.0,0,C,F,7857615.0,2017,NIBRS_incident,10/07/2021
46,2016,87798397,Hampton Police Department,51650,,,10/15/2016,,23.0,0,C,,,2016,NIBRS_incident,10/07/2021
48,2019,117413659,Norfolk Police Department,51710,,11/04/2019,08/28/2019,,10.0,0,C,F,59398285.0,2019,NIBRS_incident,10/07/2021
52,2016,86317943,Roanoke Police Department,51770,,,07/21/2016,,14.0,0,C,,,2016,NIBRS_incident,10/07/2021
56,2017,97110782,Fairfax County Police Department,51059,N,08/20/2018,09/11/2017,,0.0,0,C,F,19644756.0,2017,NIBRS_incident,10/07/2021


#### Combine Incidents and Offense Data

In [14]:
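# Columns present in both tables get a '_drop' suffix on the incident side
offenses = offenses.merge(incidents, how='left', on='INCIDENT_ID', suffixes=['', '_drop'])
# Match on the suffix only, so a column that merely contains 'drop' is kept
offenses.drop(columns=[col for col in offenses.columns if col.endswith('_drop')], inplace=True)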
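A left merge like the one above silently keeps offenses whose incident is not found and silently multiplies rows if an `INCIDENT_ID` repeats in the incident table. The following is a minimal sketch of how both cases could be checked, using made-up toy frames; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy frames (made-up values); the real tables share INCIDENT_ID.
toy_offenses = pd.DataFrame({"INCIDENT_ID": [1, 1, 2, 3], "OFFENSE_ID": [10, 11, 12, 13]})
toy_incidents = pd.DataFrame({"INCIDENT_ID": [1, 2], "PUB_AGENCY_NAME": ["A PD", "B PD"]})

# validate= raises MergeError if an INCIDENT_ID appears more than once in
# the incident table; indicator= adds a _merge column with the match status.
merged = toy_offenses.merge(
    toy_incidents, how="left", on="INCIDENT_ID",
    validate="many_to_one", indicator=True,
)

# Offenses whose incident was not found end up with a NaN agency name.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

Unmatched rows of this kind are one plausible source of offenses with no police department attached.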

### Now answer the question: how many unique police departments report each year?

In [17]:
offenses.groupby('DATA_YEAR').agg(distinctpolice=('PUB_AGENCY_NAME', 'nunique'))

Unnamed: 0_level_0,distinctpolice
DATA_YEAR,Unnamed: 1_level_1
2016,413
2017,412
2018,412
2019,411
2020,408
2021,410


### Okay let's find Police departments that didn't participate every year and exclude them

In [18]:
departments = {} # format is gonna be year: set of departments
for year in list(pd.unique(offenses['DATA_YEAR'])):
    departments[year] = set(offenses[offenses['DATA_YEAR']==year]['PUB_AGENCY_NAME'])

In [30]:
total_depts = set(pd.unique(offenses['PUB_AGENCY_NAME']))
missing_depts = set()
for year, dept in departments.items():
    print(f'For year {year}, missing departments: {total_depts.difference(dept)}')
    missing_depts = missing_depts.union(total_depts.difference(dept))

For year 2017, missing departments: {'Eastville', 'Remington Police Department', 'Virginia Marine Resources Commission Law Enforcement Division', nan, 'Bowling Green Police Department', 'Wintergreen', 'Boykins Police Department', 'Central State Hospital', 'Pocahontas Police Department'}
For year 2018, missing departments: {'State Police: Poquoson', 'Remington Police Department', 'Eastville', nan, 'Bowling Green Police Department', 'Wintergreen', 'Boykins Police Department', 'Central State Hospital', 'Quantico Police Department'}
For year 2019, missing departments: {'Eastville', 'Remington Police Department', 'State Police: Manassas Park', nan, 'Wintergreen', 'Boykins Police Department', 'Central State Hospital', 'Glen Lyn Police Department', 'Thomas Nelson Community College', 'Quantico Police Department'}
For year 2016, missing departments: {'Eastville', 'State Police: Manassas Park', 'Virginia Marine Resources Commission Law Enforcement Division', 'Occoquan Police Department', nan, 'B

In [32]:
print(f'Combined Missing Departments ({len(missing_depts)}):\n')
missing_depts

Combined Missing Departments (24):



{'Appalachia Police Department',
 'Bowling Green Police Department',
 'Boykins Police Department',
 'Bridgewater College',
 'Central State Hospital',
 'Eastville',
 'Emory and Henry College',
 'Ferrum College',
 'Germanna Community College',
 'Glen Lyn Police Department',
 'Lord Fairfax Community College',
 'Occoquan Police Department',
 'Pocahontas Police Department',
 'Quantico Police Department',
 'Remington Police Department',
 'State Police: Manassas Park',
 'State Police: Poquoson',
 'State Police: Williamsburg',
 'Thomas Nelson Community College',
 'University of Mary Washington',
 'Virginia Marine Resources Commission Law Enforcement Division',
 'White Stone Police Department',
 'Wintergreen',
 nan}
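The same set of inconsistent departments can also be found without building a per-year dictionary, by counting the distinct years each department appears in. This is a sketch on made-up data, not the notebook's own method; only the column names are taken from the real table.

```python
import pandas as pd

# Toy data (made up): "C PD" reports in only one of the two years.
toy = pd.DataFrame({
    "DATA_YEAR": [2016, 2017, 2016, 2017, 2017],
    "PUB_AGENCY_NAME": ["A PD", "A PD", "B PD", "B PD", "C PD"],
})

n_years = toy["DATA_YEAR"].nunique()

# Number of distinct years in which each department appears.
# groupby drops NaN keys by default, so rows without an agency are excluded.
years_per_dept = toy.groupby("PUB_AGENCY_NAME")["DATA_YEAR"].nunique()

consistent = set(years_per_dept[years_per_dept == n_years].index)
dropped = set(years_per_dept.index) - consistent
print(consistent, dropped)
```

Filtering with `isin(consistent)` then keeps only departments that reported every year and removes rows with a missing department in the same step.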

### Remove Inconsistently Reported Police Departments and Data without a Police Department (nan)
This removes only 1,801 rows, which is negligible in a dataset of this size. That makes sense, given that most of these departments were likely low volume (e.g., a university, or a department that reported in only one year).

In [36]:
offenses = offenses[(~offenses['PUB_AGENCY_NAME'].isin(missing_depts)) & (offenses['PUB_AGENCY_NAME'].notna())]

In [38]:
offenses.to_csv('../data/processed/offenses_2016+_cons_dpts.csv')

### Okay now let's restrict to our crime types we care about

In [40]:
care = ['Murder and Nonnegligent Manslaughter', 'Aggravated Assault']
offenses = offenses[offenses['OFFENSE_TYPE_NAME'].isin(care)]

### lol... the Virginia-provided data doesn't include the weapon type in its offense records
New Data Source: https://www.openicpsr.org/openicpsr/project/118281/version/V8/view?path=/openicpsr/118281/fcr:versions/V8/nibrs_1991_2021_offense_segment_rds.zip&type=file

I was redirected to this site from the FBI's anyways. This is just a compilation of all the years to prevent having to download each individual year. It is provided as an `rds` file.

### This also provides an opportunity for us to do the entire US, if we identify a reason.
For now, we will only keep offenses within Virginia and departments that participated in all years.

In [1]:
from pyreadr import read_r
import pandas as pd

In [2]:
keep_crimes = ['robbery', 'aggravated assault', 'murder/nonnegligent manslaughter']
dfs = []
for year in range(2016, 2021):  # 2016 through 2020 inclusive
    # read_r returns a dict-like object; an .rds file holds one object under the key None
    df = read_r(f'../data/raw/nibrs_offense_segment_{year}.rds')[None]
    # filter to VA and to the crime types we care about
    df = df[(df['state_abb']=='VA') & (df['ucr_offense_code'].isin(keep_crimes))].copy()
    dfs.append(df)
offenses = pd.concat(dfs)

In [3]:
offenses.to_csv('../data/processed/nibrs_offense_segments.csv')

In [4]:
offenses.head()

Unnamed: 0,ori,year,state,state_abb,incident_number,incident_date,ucr_offense_code,offense_attempted_or_completed,offender_suspected_of_using_1,offender_suspected_of_using_2,...,type_criminal_activity_2,type_criminal_activity_3,type_weapon_force_involved_1,automatic_weapon_indicator_1,type_weapon_force_involved_2,automatic_weapon_indicator_2,type_weapon_force_involved_3,automatic_weapon_indicator_3,bias_motivation,unique_incident_id
4932027,VA0010000,2016.0,virginia,VA,5F10-0WNU52S,2015-04-13,robbery,completed,not applicable,,...,,,handgun,,,,,,no bias motivation,VA0010000 5F10-0WNU52S
4932041,VA0010000,2016.0,virginia,VA,5F10-0W0U45S,2015-12-02,robbery,completed,not applicable,,...,,,handgun,,,,,,unknown bias motivation,VA0010000 5F10-0W0U45S
4932046,VA0010000,2016.0,virginia,VA,5F10-0A5H54S,2015-12-25,robbery,completed,not applicable,,...,,,"personal weapons (hands, feet, teeth, etc.)",,,,,,unknown bias motivation,VA0010000 5F10-0A5H54S
4932047,VA0010000,2016.0,virginia,VA,5F10-0A5MW6S,2015-12-26,murder/nonnegligent manslaughter,completed,not applicable,,...,,,firearm (type not stated),,,,,,no bias motivation,VA0010000 5F10-0A5MW6S
4932048,VA0010000,2016.0,virginia,VA,5F10-0A5JW6S,2015-12-30,robbery,completed,not applicable,,...,,,handgun,,,,,,no bias motivation,VA0010000 5F10-0A5JW6S


In [5]:
offenses.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67036 entries, 4932027 to 7826945
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ori                             67036 non-null  object 
 1   year                            67036 non-null  float64
 2   state                           67036 non-null  object 
 3   state_abb                       67036 non-null  object 
 4   incident_number                 67036 non-null  object 
 5   incident_date                   67036 non-null  object 
 6   ucr_offense_code                67036 non-null  object 
 7   offense_attempted_or_completed  67036 non-null  object 
 8   offender_suspected_of_using_1   67036 non-null  object 
 9   offender_suspected_of_using_2   349 non-null    object 
 10  offender_suspected_of_using_3   2 non-null      object 
 11  location_type                   67036 non-null  object 
 12  number_of_premises_entered   

In [9]:
df.head()

Unnamed: 0,ori,year,state,state_abb,incident_number,incident_date,ucr_offense_code,offense_attempted_or_completed,offender_suspected_of_using_1,offender_suspected_of_using_2,...,type_criminal_activity_2,type_criminal_activity_3,type_weapon_force_involved_1,automatic_weapon_indicator_1,type_weapon_force_involved_2,automatic_weapon_indicator_2,type_weapon_force_involved_3,automatic_weapon_indicator_3,bias_motivation,unique_incident_id
7448148,VA0010000,2020.0,virginia,VA,2W2HGU7 7NBA,2020-07-04,aggravated assault,completed,not applicable,,...,,,handgun,automatic weapon,,,,,no bias motivation,VA0010000 2W2HGU7 7NBA
7448160,VA0010000,2020.0,virginia,VA,2W2HGU7 8G8A,2020-06-01,robbery,completed,not applicable,,...,,,handgun,automatic weapon,,,,,no bias motivation,VA0010000 2W2HGU7 8G8A
7448195,VA0010000,2020.0,virginia,VA,2W2HGU7 JNBA,2020-06-20,aggravated assault,completed,alcohol,,...,,,"personal weapons (hands, feet, teeth, etc.)",,,,,,no bias motivation,VA0010000 2W2HGU7 JNBA
7448202,VA0010000,2020.0,virginia,VA,2W2HGU7 L66A,2020-06-09,robbery,completed,alcohol,,...,,,none,,,,,,no bias motivation,VA0010000 2W2HGU7 L66A
7448205,VA0010000,2020.0,virginia,VA,2W2HGU7 LC6A,2020-06-12,aggravated assault,completed,not applicable,,...,,,other firearm,,,,,,no bias motivation,VA0010000 2W2HGU7 LC6A


### Let's restrict it to Police Departments that participated each year

In [10]:
departments = {} # format is gonna be year: set of departments
for year in list(pd.unique(offenses['year'])):
    departments[year] = set(offenses[offenses['year']==year]['ori'])
total_depts = set(pd.unique(offenses['ori']))
missing_depts = set()
for year, dept in departments.items():
    print(f'For year {year}, missing departments: {total_depts.difference(dept)}')
    missing_depts = missing_depts.union(total_depts.difference(dept))

For year 2016.0, missing departments: {'VA0110000', 'VA023SP00', 'VA0810300', 'VA0750700', 'VA115SP00', 'VA0330300', 'VA0520100', 'VA067SP00', 'VA093SP00', 'VA075S100', 'VA0820300', 'VA1220400', 'VA0430700', 'VA0380200', 'VA079SP00', 'VA045SP00', 'VA1170800', 'VA019SP00', 'VA0780000', 'VA0750600', 'VA048SP00', 'VA0860200', 'VA0730200', 'VA082099E', 'VA0690300', 'VA1090100', 'VA062019P', 'VA0350300', 'VA0140100', 'VA1300900', 'VA0580400', 'VA0260200', 'VA098SP00', 'VA0680200', 'VA070SP00', 'VA0430300', 'VA090SP00', 'VA046SP00', 'VA114SP00', 'VA078SP00', 'VA0690200', 'VA091SP00', 'VA0750500', 'VA0740200', 'VA0820100', 'VA101SP00', 'VA0710100', 'VA0860300', 'VA127SP00', 'VA049SP00', 'VA044SP00', 'VA0410100', 'VA0840200', 'VA0010500', 'VA100SP00', 'VA102SP00', 'VA030SP00', 'VA107SP00', 'VA0840100', 'VA050SP00', 'VA0710200'}
For year 2017.0, missing departments: {'VA0550100', 'VA0810300', 'VA0750700', 'VA115SP00', 'VA067SP00', 'VA075S100', 'VA0820300', 'VA118SP00', 'VA1220400', 'VA022SP00',

Although the list of missing departments looks long, it only removes 745 offenses.

In [15]:
offenses = offenses[~offenses['ori'].isin(missing_depts)].copy()

### Let's identify weapons involved in the offense
Within the data, this is located in up to 3 fields (`type_weapon_force_involved_1` through `type_weapon_force_involved_3`) because of how it was prepared.

In [16]:
pd.unique(offenses['type_weapon_force_involved_1'])

array(['handgun', 'personal weapons (hands, feet, teeth, etc.)',
       'firearm (type not stated)',
       'knife/cutting instrument (ice pick, screwdriver, ax, etc.)',
       'blunt object (club, hammer, etc.)', 'unknown', 'none',
       'asphyxiation (by drowning, strangulation, suffocation, gas, etc.)',
       'other', 'rifle', 'motor vehicle', 'fire/incendiary device',
       'drugs/narcotics/sleeping pills', 'shotgun', 'other firearm',
       'poison (include gas)', 'explosives'], dtype=object)

In [17]:
pd.unique(offenses['type_weapon_force_involved_2'])

array([nan, 'rifle', 'personal weapons (hands, feet, teeth, etc.)',
       'shotgun', 'blunt object (club, hammer, etc.)', 'other', 'handgun',
       'knife/cutting instrument (ice pick, screwdriver, ax, etc.)',
       'firearm (type not stated)', 'explosives',
       'asphyxiation (by drowning, strangulation, suffocation, gas, etc.)',
       'motor vehicle', 'unknown', 'other firearm',
       'drugs/narcotics/sleeping pills', 'fire/incendiary device',
       'poison (include gas)'], dtype=object)

In [18]:
pd.unique(offenses['type_weapon_force_involved_3'])

array([nan, 'blunt object (club, hammer, etc.)',
       'knife/cutting instrument (ice pick, screwdriver, ax, etc.)',
       'handgun', 'personal weapons (hands, feet, teeth, etc.)', 'rifle',
       'asphyxiation (by drowning, strangulation, suffocation, gas, etc.)',
       'drugs/narcotics/sleeping pills', 'shotgun', 'other',
       'other firearm', 'firearm (type not stated)', 'unknown',
       'poison (include gas)', 'fire/incendiary device', 'motor vehicle'],
      dtype=object)

In [39]:
def weapon_involved(values):
    '''Accepts a list and returns 1 if any value mentions a firearm, gun, or rifle, else 0'''
    check_string = ', '.join(str(value) for value in values).lower() # str() also handles NaN
    # 'gun' also matches 'handgun' and 'shotgun'
    if 'firearm' in check_string or 'gun' in check_string or 'rifle' in check_string:
        return 1
    else:
        return 0

In [40]:
offenses['firearm_involved'] = offenses.apply(lambda x: weapon_involved([x['type_weapon_force_involved_1'], x['type_weapon_force_involved_2'], x['type_weapon_force_involved_3']]), axis=1)

In [41]:
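# Assign the result back; inplace fillna on a selected column may not modify the frame
offenses['firearm_involved'] = offenses['firearm_involved'].fillna(0)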
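The row-wise `apply` above can be slow on large frames. A column-wise alternative is sketched below on made-up rows; it assumes the same three weapon columns and the same substring rule (`firearm`, `gun`, `rifle`), and it is an alternative formulation rather than the code that produced the results in this notebook.

```python
import pandas as pd

# Toy rows (made up) covering a first-column hit, a second-column hit, and no hit.
toy = pd.DataFrame({
    "type_weapon_force_involved_1": ["handgun", "none", "blunt object (club, hammer, etc.)"],
    "type_weapon_force_involved_2": [None, "rifle", None],
    "type_weapon_force_involved_3": [None, None, None],
})

weapon_cols = [f"type_weapon_force_involved_{i}" for i in (1, 2, 3)]
pattern = "firearm|gun|rifle"  # 'gun' also matches 'handgun' and 'shotgun'

# Replace missing values with empty strings, then test each column at once.
text = toy[weapon_cols].fillna("").astype(str)
flags = text.apply(lambda col: col.str.contains(pattern, case=False))

# A row is flagged if any of the three columns matched.
toy["firearm_involved"] = flags.any(axis=1).astype(int)
print(toy["firearm_involved"].tolist())
```

Because missing values become empty strings before matching, the result never contains NaN and needs no `fillna` afterwards.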

### If an offense lists a firearm, gun, or rifle in any of the 3 type_weapon_force_involved columns, it's a 1, else 0

In [42]:
offenses.groupby(['year', 'firearm_involved']).agg(CtCrimes=('firearm_involved','count')).head(15)

Unnamed: 0_level_0,Unnamed: 1_level_0,CtCrimes
year,firearm_involved,Unnamed: 2_level_1
2016.0,0,8782
2016.0,1,5529
2017.0,0,8146
2017.0,1,4888
2018.0,0,8165
2018.0,1,4465
2019.0,0,8350
2019.0,1,4701
2020.0,0,7829
2020.0,1,5447


In [44]:
offenses.to_csv('../data/processed/offenses_firearm_indicator.csv')