<style>div.container { width: 100% }</style>
<img style="float:left;  vertical-align:text-bottom;" height="65" width="172" src="../NIBRS Logo 500.jpg" />
<div style="float:right; vertical-align:text-bottom;"><h2>NIBRS DATA</h2></div>


This notebook shows how to obtain the NIBRS datasets for 2010 through 2019 and walks through the NIBRS data cleaning / 
manipulation steps required before joining the sports data. 

&nbsp;

### Download Data

The first step is to download the datasets and store them in an appropriate directory structure. \
These were downloaded manually from the [NIBRS Crime Data Explorer website](https://crime-data-explorer.fr.cloud.gov/pages/downloads). \
\
On the website:
 1. Navigate to the section labeled "Crime Incident-Based Data by State"
 1. Specify 'Michigan' as state name and add the needed year, e.g. '2010' (repeat for each year needed)
 1. Click "Download" - this downloads that year's dataset to the location on your hard drive specified by your browser settings (usually a folder named "Downloads")
 1. Once the datasets are downloaded and you can navigate to them locally, move them to a single folder under your project folder
   - The project folder **must** have a parent folder titled **'Data'** and a subfolder titled **'NIBRS'** (e.g. /Project/Data/NIBRS)
 1. Proceed to the following code steps, which import the necessary libraries and extract the files of interest into 4 separate lists of DataFrame objects
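
Before running the loader, it can help to confirm that every year actually made it into the download folder. The following is a minimal sketch, not part of the original workflow: the function name `missing_year_folders` is hypothetical, and it assumes one folder per year named `<state>-<year>` (for example `MI-2013`), which is the naming visible in the loader output further down.

```python
from pathlib import Path


def missing_year_folders(root, state="MI", years=range(2010, 2020)):
    """Return the expected '<state>-<year>' folder names not found under root."""
    root = Path(root)
    expected = [f"{state}-{year}" for year in years]
    # Keep only the names that do not exist as directories
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_year_folders('../00_nibrs_downloads')` would list any year that still needs to be downloaded; an empty list means the folder is complete for the requested range.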


In [86]:
# The following libraries are used for the NIBRS data manipulation. The os library is part of the Python standard library, so it should already be available
# pandas can be installed with either pip or conda (if you're using the Anaconda data science distribution)

# Uncomment one of the following two lines to install pandas (only if you're not able to import pandas)
# !pip install pandas
# !conda install pandas

import os
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns',100)

In [87]:
# Initialize each topic's list of dataframes (i.e. each list will store one dataframe per year for that type of data)
incident_lst = []
agencies_lst = []
offenses_lst = []
offense_type_lst = []


for folder in os.listdir('../00_nibrs_downloads/'):
    filepath = f'../00_nibrs_downloads/{folder}'
    if os.path.isdir(filepath):
        for file in os.listdir(filepath):
            if file == 'nibrs_incident.csv' or file=='NIBRS_incident.csv':
                print('found incident file in', filepath)
                incident_lst.append(pd.read_csv(filepath + "/" + file))
            if file == 'nibrs_offense.csv' or file =='NIBRS_OFFENSE.csv':
                print('found offense file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                if 'DATA_YEAR' not in temp.columns:
                    temp.insert(0,'DATA_YEAR',filepath.split('-')[1]) 
                offenses_lst.append(temp)
            if file == 'cde_agencies.csv' or file == 'agencies.csv':
                print('found agency file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                temp['DATA_YEAR'] = filepath.split('-')[1]
                agencies_lst.append(temp)
            if file == 'nibrs_offense_type.csv' or file == 'NIBRS_OFFENSE_TYPE.csv':
                print('found offense_type file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                temp['DATA_YEAR'] = filepath.split('-')[1]
                offense_type_lst.append(temp)
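
The loop above matches a fixed set of file-name spellings and takes the year from `filepath.split('-')[1]`, which fails for a folder without a dash in its name. A hedged alternative is sketched below. It is not the notebook's loader: `TOPIC_FILES`, `year_from_folder`, and `load_topic_frames` are hypothetical names, file names are matched case-insensitively, and `DATA_YEAR` is only added when the column is missing (the loop above overwrites it for the agency and offense type files).

```python
import re
from pathlib import Path

import pandas as pd

# Lower-cased file names (taken from the loop above) mapped to a topic label
TOPIC_FILES = {
    "nibrs_incident.csv": "incident",
    "nibrs_offense.csv": "offense",
    "nibrs_offense_type.csv": "offense_type",
    "cde_agencies.csv": "agency",
    "agencies.csv": "agency",
}


def year_from_folder(folder_name):
    """Return the 4-digit year ending a name like 'MI-2013', or None."""
    match = re.search(r"-(\d{4})$", folder_name)
    return match.group(1) if match else None


def load_topic_frames(root):
    """Read every recognised CSV under root into a dict of lists of DataFrames."""
    frames = {topic: [] for topic in set(TOPIC_FILES.values())}
    for folder in sorted(Path(root).iterdir()):  # sorted: years load in order
        if not folder.is_dir():
            continue
        year = year_from_folder(folder.name)
        for path in folder.iterdir():
            topic = TOPIC_FILES.get(path.name.lower())
            if topic is None:
                continue
            frame = pd.read_csv(path, low_memory=False)
            # Only add the year when the folder name has one and the file lacks it
            if year is not None and "DATA_YEAR" not in frame.columns:
                frame.insert(0, "DATA_YEAR", year)
            frames[topic].append(frame)
    return frames
```

As in the original loop, the year is kept as a string. A folder whose name carries no year is still read, but its frames are left without a `DATA_YEAR` column, so they are easy to spot afterwards.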

found offense_type file in ../00_nibrs_downloads/MI-2009
found incident file in ../00_nibrs_downloads/MI-2009
found agency file in ../00_nibrs_downloads/MI-2009
found offense file in ../00_nibrs_downloads/MI-2009
found offense_type file in ../00_nibrs_downloads/MI-2013
found incident file in ../00_nibrs_downloads/MI-2013


found agency file in ../00_nibrs_downloads/MI-2013
found offense file in ../00_nibrs_downloads/MI-2013
found offense_type file in ../00_nibrs_downloads/MI-2014
found incident file in ../00_nibrs_downloads/MI-2014
found agency file in ../00_nibrs_downloads/MI-2014
found offense file in ../00_nibrs_downloads/MI-2014
found offense_type file in ../00_nibrs_downloads/MI-2015
found incident file in ../00_nibrs_downloads/MI-2015
found agency file in ../00_nibrs_downloads/MI-2015
found offense file in ../00_nibrs_downloads/MI-2015
found offense_type file in ../00_nibrs_downloads/MI-2012
found incident file in ../00_nibrs_downloads/MI-2012
found agency file in ../00_nibrs_downloads/MI-2012
found offense file in ../00_nibrs_downloads/MI-2012
found agency file in ../00_nibrs_downloads/MI-2019
found offense_type file in ../00_nibrs_downloads/MI-2019
found incident file in ../00_nibrs_downloads/MI-2019
found offense file in ../00_nibrs_downloads/MI-2019
found agency file in ../00_nibrs_downloads/MI


---
 
&nbsp;

## DATA CLEANING

Now we need to look at each of the 4 lists of 10 dataframes and compare how they are set up. \
We'll start by looking at the columns of incidents frames. \
We need to evaluate whether they are formatted the same way across the years and if they contain the same data.



&nbsp;

## INCIDENTS


In [88]:
for frame in incident_lst:
    print(frame.dtypes, len(frame.columns))


agency_id                int64
incident_id              int64
nibrs_month_id           int64
incident_number          int64
cargo_theft_flag       float64
submission_date        float64
incident_date           object
report_date_flag        object
incident_hour          float64
cleared_except_id        int64
cleared_except_date     object
incident_status          int64
data_home               object
ddocname                object
orig_format            float64
ff_line_number         float64
did                    float64
dtype: object 17
agency_id                int64
incident_id              int64
nibrs_month_id           int64
incident_number          int64
cargo_theft_flag        object
submission_date        float64
incident_date           object
report_date_flag        object
incident_hour          float64
cleared_except_id        int64
cleared_except_date     object
incident_status          int64
data_home               object
ddocname                object
orig_format           

Looking at the columns, a couple of things stand out. In some of the files the column names are in ALL CAPS, which could cause problems when the data is joined/concatenated. \
\
The other thing is that the years with capitalized column names also have an additional column at the start called "DATA_YEAR". \
It is worth replicating this column across all of the data frames.

In [89]:
for frame in incident_lst:
    # Frames without a DATA_YEAR column are the older, lower-case files: add the column
    # (empty for now) and drop three columns that are not carried forward
    if "DATA_YEAR" not in frame.columns:
        frame.insert(0, 'DATA_YEAR', value=np.nan)  # np.nan: the np.NaN alias was removed in NumPy 2.0
        frame.drop(["incident_number", "ddocname", 'ff_line_number'], axis=1, inplace=True)

    # Capitalize the column names of every frame so they line up when concatenated
    frame.columns = [column.upper() for column in frame.columns]


In [90]:
for frame in incident_lst:
    print(frame.columns, len(frame.columns))

Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
 

In [91]:
inc = pd.concat(incident_lst)

In [92]:
# We'd like to see what the data looks like across the fully combined frame. We'll use the sample method (and use the random_state parameter to make it consistent) to see across the dataset

A few additional cleaning steps are needed: the dates (particularly the incident dates) come in very different formats, \
and `DATA_YEAR` is missing for the frames we added the column to.

In [93]:
inc['INCIDENT_DATE'] = pd.to_datetime(inc['INCIDENT_DATE'])
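
The `pd.to_datetime` call relies on pandas inferring the layout of each string, and how it treats a column holding several layouts varies between pandas versions. One hedged alternative is to try a short list of explicit formats in turn. The sketch below is not part of the original notebook: `parse_mixed_dates` and its default format list are assumptions based on the layouts visible in the sample output (`2017-11-07 00:00:00`, `17-AUG-18`), and `%b` assumes English month abbreviations.

```python
import pandas as pd


def parse_mixed_dates(values, formats=("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d-%b-%y")):
    """Parse strings that may follow any one of several explicit formats."""
    values = pd.Series(values, dtype="object")
    # First pass: anything that does not match the first format becomes NaT
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        # Fill only the rows that are still missing
        parsed = parsed.fillna(attempt)
    return parsed
```

Rows that match none of the formats stay `NaT`, so `parse_mixed_dates(inc['INCIDENT_DATE']).isna().sum()` gives a quick count of values that need a closer look.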

In [94]:
print(inc.shape)

(5704160, 15)


In [95]:
inc.sample(30)

Unnamed: 0,DATA_YEAR,AGENCY_ID,INCIDENT_ID,NIBRS_MONTH_ID,CARGO_THEFT_FLAG,SUBMISSION_DATE,INCIDENT_DATE,REPORT_DATE_FLAG,INCIDENT_HOUR,CLEARED_EXCEPT_ID,CLEARED_EXCEPT_DATE,INCIDENT_STATUS,DATA_HOME,ORIG_FORMAT,DID
281100,2017.0,9069,93190644,7748214,,17-AUG-18,2017-07-17,R,2.0,6,,0,C,F,11831588.0
352000,,9069,56551316,5229877,,,2010-05-13,R,12.0,6,,0,C,,
206182,,9049,90615419,7440808,N,2017-11-07 00:00:00,2016-03-04,R,,6,,0,C,F,17393381.0
229187,2017.0,8534,101076259,7913414,N,28-SEP-18,2017-12-01,,12.0,6,,0,C,F,27432700.0
359128,,9064,51481272,4995703,,,2009-09-11,R,13.0,6,,0,C,,
271765,2017.0,8628,94359057,7785469,,17-AUG-18,2017-10-12,,12.0,6,,0,C,F,14386011.0
126318,2018.0,8859,114176148,8132181,N,04-SEP-19,2018-07-08,,0.0,4,11-JUL-19,0,C,F,51917862.0
174489,2018.0,9030,98394827,8032454,N,20-AUG-18,2018-03-10,,16.0,6,,0,C,F,22112547.0
355076,2017.0,8850,92827117,7728070,N,17-AUG-18,2017-06-11,,12.0,6,,0,C,F,11090579.0
260655,,9049,90665945,7572453,,2017-11-07 00:00:00,2016-11-15,R,,6,,0,C,F,17501030.0


In [96]:
inc.groupby(inc['INCIDENT_DATE'].dt.year)['INCIDENT_ID'].agg('count')

INCIDENT_DATE
2009    617838
2010    597205
2011    564424
2012    562587
2013    522879
2014    492028
2015    487547
2016    504997
2017    477382
2018    458209
2019    419064
Name: INCIDENT_ID, dtype: int64

In [97]:
inc['DATA_YEAR'] = inc['INCIDENT_DATE'].dt.year
inc = inc[['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID','INCIDENT_DATE','INCIDENT_HOUR']]

In [98]:
inc.to_csv('../01_nibrs_rawdata/Combined_Incidents.csv',index=False)


&nbsp;

## AGENCIES

We'll take a quick look at the agencies to see whether their names changed over the course of the 10 years.

In [99]:
# Let's check what agencies we care to focus on. I'll use Ann Arbor as a search term
test = agencies_lst[0]
test[test['agency_name'].str.contains('Ann Arbor')==True]


Unnamed: 0,agency_id,ori,legacy_ori,agency_name,short_name,agency_type_id,agency_type_name,tribe_id,campus_id,city_id,city_name,state_id,state_abbr,primary_county_id,primary_county,primary_county_fips,agency_status,submitting_agency_id,submitting_sai,submitting_name,submitting_state_abbr,start_year,dormant_year,current_year,revised_rape_start,current_nibrs_start_year,population,population_group_code,population_group_desc,population_source_flag,suburban_area_flag,core_city_flag,months_reported,nibrs_months_reported,past_10_years_reported,covered_by_id,covered_by_ori,covered_by_name,staffing_year,total_officers,total_civilians,icpsr_zip,icpsr_lat,icpsr_lng,DATA_YEAR
542,9030,MI8121800,MI8121800,Ann Arbor Police Department,Ann Arbor,1,City,,,4504.0,Ann Arbor,26,MI,1324,Washtenaw,26161.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,118730,2,"Cities from 100,000 thru 249,000",L,N,Y,12,12,10,,,,2016.0,125.0,26.0,48104,42.252327,-83.844634,2009
567,9040,MI8190300,MI8190300,University of Michigan: Ann Arbor,University of Michigan: Ann Arbor,3,University or College,,764.0,,,26,MI,1324,Washtenaw,26161.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1991,,2016,2013.0,2016.0,0,7,"Cities under 2,500",L,Y,N,12,12,10,,,,2016.0,59.0,17.0,48502,42.252327,-83.844634,2009


In [100]:
# Let's do the same for MSU. Using East Lansing is not going to get us MSU so I'm searching for both the EL police and the university

test[(test['agency_name'].str.contains('East Lansing')==True) | (test['agency_name'].str.contains('Michigan State University')==True)]

Unnamed: 0,agency_id,ori,legacy_ori,agency_name,short_name,agency_type_id,agency_type_name,tribe_id,campus_id,city_id,city_name,state_id,state_abbr,primary_county_id,primary_county,primary_county_fips,agency_status,submitting_agency_id,submitting_sai,submitting_name,submitting_state_abbr,start_year,dormant_year,current_year,revised_rape_start,current_nibrs_start_year,population,population_group_code,population_group_desc,population_source_flag,suburban_area_flag,core_city_flag,months_reported,nibrs_months_reported,past_10_years_reported,covered_by_id,covered_by_ori,covered_by_name,staffing_year,total_officers,total_civilians,icpsr_zip,icpsr_lat,icpsr_lng,DATA_YEAR
111,8554,MI3358100,MI3358100,Michigan State University,Michigan State University,3,University or College,,365.0,,,26,MI,1276,Ingham,26065.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,0,7,"Cities under 2,500",L,Y,N,12,12,10,,,,2016.0,85.0,28.0,48824,42.603534,-84.373811,2009
427,8550,MI3336400,MI3336400,East Lansing Police Department,East Lansing,1,City,,,4623.0,East Lansing,26,MI,1276,Ingham,26065.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,48668,4,"Cities from 25,000 thru 49,999",L,N,Y,12,12,10,,,,2016.0,54.0,13.0,48823,42.603534,-84.373811,2009


We have the agency ids that we care about, now let's see how the names/info changes for each agency as we move across dataframes

In [101]:
first_agency = agencies_lst[1] #2013 dataset
# first_agency[first_agency['agency_name'].str.contains("Michigan State")==True]
first_agency[first_agency['agency_id'].isin([9030,9040])== True]
# first_agency[first_agency['agency_id']== 8286]

Unnamed: 0,agency_id,ori,legacy_ori,agency_name,short_name,agency_type_id,agency_type_name,tribe_id,campus_id,city_id,city_name,state_id,state_abbr,primary_county_id,primary_county,primary_county_fips,agency_status,submitting_agency_id,submitting_sai,submitting_name,submitting_state_abbr,start_year,dormant_year,current_year,revised_rape_start,current_nibrs_start_year,population,population_group_code,population_group_desc,population_source_flag,suburban_area_flag,core_city_flag,months_reported,nibrs_months_reported,past_10_years_reported,covered_by_id,covered_by_ori,covered_by_name,staffing_year,total_officers,total_civilians,icpsr_zip,icpsr_lat,icpsr_lng,DATA_YEAR
546,9030,MI8121800,MI8121800,Ann Arbor Police Department,Ann Arbor,1,City,,,4504.0,Ann Arbor,26,MI,1324,Washtenaw,26161.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,118730,2,"Cities from 100,000 thru 249,000",L,N,Y,12,12,10,,,,2016.0,125.0,26.0,48104,42.252327,-83.844634,2013
571,9040,MI8190300,MI8190300,University of Michigan: Ann Arbor,University of Michigan: Ann Arbor,3,University or College,,764.0,,,26,MI,1324,Washtenaw,26161.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1991,,2016,2013.0,2016.0,0,7,"Cities under 2,500",L,Y,N,12,12,10,,,,2016.0,59.0,17.0,48502,42.252327,-83.844634,2013


In [102]:
second_agency = agencies_lst[-1]  # 2018 dataset
# second_agency.head()
second_agency[second_agency['AGENCY_ID'].isin([9030,9040])== True]

Unnamed: 0,YEARLY_AGENCY_ID,AGENCY_ID,DATA_YEAR,ORI,LEGACY_ORI,COVERED_BY_LEGACY_ORI,DIRECT_CONTRIBUTOR_FLAG,DORMANT_FLAG,DORMANT_YEAR,REPORTING_TYPE,UCR_AGENCY_NAME,NCIC_AGENCY_NAME,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_STATUS,STATE_ID,STATE_NAME,STATE_ABBR,STATE_POSTAL_ABBR,DIVISION_CODE,DIVISION_NAME,REGION_CODE,REGION_NAME,REGION_DESC,AGENCY_TYPE_NAME,POPULATION,SUBMITTING_AGENCY_ID,SAI,SUBMITTING_AGENCY_NAME,SUBURBAN_AREA_FLAG,POPULATION_GROUP_ID,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,PARENT_POP_GROUP_CODE,PARENT_POP_GROUP_DESC,MIP_FLAG,POP_SORT_ORDER,SUMMARY_RAPE_DEF,PE_REPORTED_FLAG,MALE_OFFICER,MALE_CIVILIAN,PED.MALE_OFFICER+PED.MALE_CIVILIAN,FEMALE_OFFICER,FEMALE_CIVILIAN,PED.FEMALE_CIVILIAN+PED.FEMALE_OFFICER,0,0.1,NIBRS_CERT_DATE,NIBRS_START_DATE,NIBRS_LEOKA_START_DATE,NIBRS_CT_START_DATE,NIBRS_MULTI_BIAS_START_DATE,NIBRS_OFF_ETH_START_DATE,COVERED_FLAG,COUNTY_NAME,MSA_NAME,PUBLISHABLE_FLAG,PARTICIPATED,NIBRS_PARTICIPATED
549,90302018,9030,2018,MI8121800,MI8121800,,N,N,,I,ANN ARBOR,ANN ARBOR PD,Ann Arbor,,A,26,Michigan,MI,MI,3,East North Central,2,Midwest,Region II,City,122571,23374,MIUCR0001,Michigan State Police Criminal Justice Informa...,N,6,2,"Cities from 100,000 thru 249,999",2,"Cities from 100,000 thru 249,999",Y,1,R,Y,92.0,8.0,100.0,27.0,18.0,45.0,0,0,01-OCT-94,01-JAN-03,01-JUN-09,01-JUL-12,01-JAN-17,01-JAN-17,N,WASHTENAW,"Ann Arbor, MI",Y,Y,Y
556,90402018,9040,2018,MI8190300,MI8190300,,N,N,,I,UNIV OF MI: ANN ARBOR,UNIV OF MICH DEPT OF PUBLIC SAFETY ANN ARBOR,University of Michigan:,Ann Arbor,A,26,Michigan,MI,MI,3,East North Central,2,Midwest,Region II,University or College,0,23374,MIUCR0001,Michigan State Police Criminal Justice Informa...,Y,11,7,"Cities under 2,500",7,"Cities under 2,500",N,2,R,Y,48.0,8.0,56.0,12.0,8.0,20.0,0,0,01-OCT-94,01-JAN-95,01-JUL-09,01-JUL-12,01-JAN-17,01-JAN-17,N,WASHTENAW,"Ann Arbor, MI",Y,Y,Y


In [103]:
second_agency[second_agency['AGENCY_ID'].isin([8554, 8550])==True][['AGENCY_ID','UCR_AGENCY_NAME','NCIC_AGENCY_NAME', 'PUB_AGENCY_UNIT','AGENCY_TYPE_NAME','DATA_YEAR']]

Unnamed: 0,AGENCY_ID,UCR_AGENCY_NAME,NCIC_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,DATA_YEAR
201,8550,EAST LANSING,EAST LANSING PD,,City,2018
205,8554,MICHIGAN STATE UNIVERSIT,MI STATE UNIV PD EAST LANSING,,University or College,2018


In [104]:
first_agency[first_agency['agency_id'].isin([8554,8550])==True]

Unnamed: 0,agency_id,ori,legacy_ori,agency_name,short_name,agency_type_id,agency_type_name,tribe_id,campus_id,city_id,city_name,state_id,state_abbr,primary_county_id,primary_county,primary_county_fips,agency_status,submitting_agency_id,submitting_sai,submitting_name,submitting_state_abbr,start_year,dormant_year,current_year,revised_rape_start,current_nibrs_start_year,population,population_group_code,population_group_desc,population_source_flag,suburban_area_flag,core_city_flag,months_reported,nibrs_months_reported,past_10_years_reported,covered_by_id,covered_by_ori,covered_by_name,staffing_year,total_officers,total_civilians,icpsr_zip,icpsr_lat,icpsr_lng,DATA_YEAR
112,8554,MI3358100,MI3358100,Michigan State University,Michigan State University,3,University or College,,365.0,,,26,MI,1276,Ingham,26065.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,0,7,"Cities under 2,500",L,Y,N,12,12,10,,,,2016.0,85.0,28.0,48824,42.603534,-84.373811,2013
430,8550,MI3336400,MI3336400,East Lansing Police Department,East Lansing,1,City,,,4623.0,East Lansing,26,MI,1276,Ingham,26065.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1960,,2016,2013.0,2016.0,48668,4,"Cities from 25,000 thru 49,999",L,N,Y,12,12,10,,,,2016.0,54.0,13.0,48823,42.603534,-84.373811,2013



Comparing the two dataframes (the older layout, whose `current_year` column reads 2016, vs. 2018), there are quite a few differences. \
\
The biggest challenge is that the naming of the agency (which provides attribution for the incident/offense) changes. \
\
Where the older layout simply has an agency name, 2018 splits the name across different reporting systems / agencies, so it's really just a matter of choosing the column that is closest to the `agency_name` column found in the older datasets. \
\
Fortunately, it appears that the core agency ids are still intact, so we can assume that **9040 is University of Michigan Police and 8554 is MSU**.

---

Of less concern but worth noting, both types of datasets contain a field for county, but the columns are labeled a bit differently.\
\
That might also be helpful to include if we want to investigate a little more broadly geographically. \
\
Also, `DATA_YEAR` is not found in the old dataset under that name, but we can use the current year column to get that information.\
\
Now that we have a sense of the differences, we can reduce each dataframe down to only the fields we care about and, in the process, ensure that the columns are consistently named so that they concatenate cleanly.



In [105]:
clean_agencies_lst = []
for frame in agencies_lst:
    if len(frame.columns) < 59:
        frame = frame[['agency_id','agency_name','primary_county', 'agency_type_name', 'DATA_YEAR']]
    else:
        frame = frame[['AGENCY_ID','UCR_AGENCY_NAME','PUB_AGENCY_UNIT','AGENCY_TYPE_NAME','DATA_YEAR']]
        
    frame.columns = ['AGENCY_ID','AGENCY_NAME','COUNTY','TYPE','YEAR']
    clean_agencies_lst.append(frame)
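
The cell above tells the two layouts apart by column count and fills `COUNTY` from `PUB_AGENCY_UNIT` for the newer files. A hedged alternative is to detect the layout by column name and use an explicit rename map per layout. The sketch below is an assumption, not the notebook's method: `OLD_COLUMNS`, `NEW_COLUMNS`, and `harmonize_agencies` are hypothetical names, and it takes the county from `COUNTY_NAME`, which the 2018 rows shown earlier list as `WASHTENAW`.

```python
import pandas as pd

# Source column -> harmonized column, one map per layout
OLD_COLUMNS = {
    "agency_id": "AGENCY_ID",
    "agency_name": "AGENCY_NAME",
    "primary_county": "COUNTY",
    "agency_type_name": "TYPE",
    "DATA_YEAR": "YEAR",
}
NEW_COLUMNS = {
    "AGENCY_ID": "AGENCY_ID",
    "UCR_AGENCY_NAME": "AGENCY_NAME",
    "COUNTY_NAME": "COUNTY",  # assumption: county lives here in the newer layout
    "AGENCY_TYPE_NAME": "TYPE",
    "DATA_YEAR": "YEAR",
}


def harmonize_agencies(frame):
    """Select and rename the agency columns, whichever layout the frame uses."""
    mapping = OLD_COLUMNS if "agency_id" in frame.columns else NEW_COLUMNS
    # Selecting by name raises a KeyError if an expected column is absent
    return frame[list(mapping)].rename(columns=mapping)
```

Switching to this version would change the `COUNTY` values for the newer years, so the outputs shown below would no longer match exactly.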

In [106]:
# Concatenate across the years and see what the output is for one of the agencies we care about
agen = pd.concat(clean_agencies_lst)
agen[agen['AGENCY_ID'] == 9040]

Unnamed: 0,AGENCY_ID,AGENCY_NAME,COUNTY,TYPE,YEAR
567,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2009
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2013
580,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2014
705,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2015
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2012
560,9040,UNIV OF MI: ANN ARBOR,Ann Arbor,University or College,2019
553,9040,UNIV OF MI: ANN ARBOR,Ann Arbor,University or College,2017
567,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2010
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2011
527,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2016


In [107]:
# and again for MSU
agen[agen['AGENCY_ID'] == 8554]

Unnamed: 0,AGENCY_ID,AGENCY_NAME,COUNTY,TYPE,YEAR
111,8554,Michigan State University,Ingham,University or College,2009
112,8554,Michigan State University,Ingham,University or College,2013
115,8554,Michigan State University,Ingham,University or College,2014
39,8554,Michigan State University,Ingham,University or College,2015
112,8554,Michigan State University,Ingham,University or College,2012
211,8554,MICHIGAN STATE UNIVERSIT,,University or College,2019
208,8554,MICHIGAN STATE UNIVERSIT,,University or College,2017
111,8554,Michigan State University,Ingham,University or College,2010
112,8554,Michigan State University,Ingham,University or College,2011
30,8554,Michigan State University,Ingham,University or College,2016


Since there's a lot of duplication in this data (the same agencies appear in multiple years), we want to condense it down to the unique agencies and pull in the other details such as county.\
\
We have to account for agency ids that exist in earlier years but not in later ones.  We'll group by the agency id and take the highest (most recent) year on record for each.  We'll convert the result to a dataframe and give it the same column names as the larger dataframe in preparation for a merge.

In [108]:
maxagens = agen.groupby('AGENCY_ID')['YEAR'].agg('max')
unique_agens = maxagens.to_frame().reset_index()
unique_agens.columns = ['AGENCY_ID','YEAR']

Next, we'll merge the two datasets on the agency id and year (common to both) to pull the details from the table that still contains duplicates.  We'll save the result as a .csv file.

In [109]:
combined = unique_agens.merge(agen, how='left', on=['AGENCY_ID','YEAR'])
combined.to_csv('../01_nibrs_rawdata/agencies.csv',index=False)
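
The group-by and merge above keep the most recent year per agency, but the merge returns every detail row that shares that agency id and year. If exactly one row per agency is required, a sort followed by `drop_duplicates` is one way to get it. This is a minimal sketch with a hypothetical helper name, `latest_per_key`, offered as an alternative rather than a description of the cells above.

```python
import pandas as pd


def latest_per_key(frame, key, order_by):
    """Keep one row per key: the row with the largest order_by value."""
    return (
        frame.sort_values([key, order_by])          # latest row sorts last within each key
             .drop_duplicates(subset=key, keep="last")
             .reset_index(drop=True)
    )
```

For the agencies table this would be called as `latest_per_key(agen, 'AGENCY_ID', 'YEAR')`. When two rows tie on both columns, which of them survives is not guaranteed.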

In [110]:
combined

Unnamed: 0,AGENCY_ID,YEAR,AGENCY_NAME,COUNTY,TYPE
0,8286,2019,SP: ALCONA COUNTY,Alcona County,State Police
1,8287,2019,ALCONA,,County
2,8288,2015,Harrisville Police Department,Alcona,City
3,8289,2015,Lincoln Police Department,Alcona,City
4,8290,2019,SP: ALGER COUNTY,Alger County,State Police
...,...,...,...,...,...
838,26021,2019,"STATE POLICE, DETROIT",,State Police
839,26624,2019,METRO POL AUTH GENESEE CNTY,,City
840,28034,2019,DEPT NAT RESOURCES LAW ENF DIV,,Other State Agency
841,28154,2019,WASHTENAW COMMUNITY COLLEGE,,University or College



&nbsp;

## OFFENSE TYPE

In [111]:
for frame in offense_type_lst:
    print(frame.columns, len(frame.columns))

Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['OFFENSE_TYPE_ID', 'OFFENSE_CODE', 'OFFENSE_NAME', 'CRIME_AGAINST',
       'CT_FLAG', 'HC_FLAG', 'HC_CO

Looks like there is some variation here: the column names are capitalized in later files, and those files add a column ('OFFENSE_GROUP'). To keep it simple, we'll extract only the ID, name, crime-against, and category columns, plus the year.

In [112]:
offense_type_lst[0].head()

Unnamed: 0,offense_type_id,offense_code,offense_name,crime_against,ct_flag,hc_flag,hc_code,offense_category_name,DATA_YEAR
0,58,23*,Not Specified,Property,N,Y,6.0,Larceny/Theft Offenses,2009
1,1,09C,Justifiable Homicide,Not a Crime,N,N,,Homicide Offenses,2009
2,2,26A,False Pretenses/Swindle/Confidence Game,Property,Y,Y,,Fraud Offenses,2009
3,3,36B,Statutory Rape,Person,N,Y,,Sex Offenses,2009
4,4,11C,Sexual Assault With An Object,Person,N,Y,2.0,Sex Offenses,2009


In [113]:
offense_type_lst[-1].sample(10)

Unnamed: 0,OFFENSE_TYPE_ID,OFFENSE_CODE,OFFENSE_NAME,CRIME_AGAINST,CT_FLAG,HC_FLAG,HC_CODE,OFFENSE_CATEGORY_NAME,OFFENSE_GROUP,DATA_YEAR
51,27,13A,Aggravated Assault,Person,N,Y,4.0,Assault Offenses,A,2018
40,16,35A,Drug/Narcotic Violations,Society,N,Y,,Drug/Narcotic Offenses,A,2018
44,20,200,Arson,Property,N,Y,8.0,Arson,A,2018
64,41,26B,Credit Card/Automated Teller Machine Fraud,Property,Y,Y,,Fraud Offenses,A,2018
56,32,09A,Murder and Nonnegligent Manslaughter,Person,N,Y,1.0,Homicide Offenses,A,2018
26,1,09C,Justifiable Homicide,Person,N,N,,Homicide Offenses,A,2018
75,57,510,Bribery,Property,Y,Y,,Bribery,A,2018
1,72,30B,False Citizenship,Society,N,N,,Other Offenses,A,2018
10,81,526,Explosives Violation,Society,N,N,,Other Offenses,A,2018
31,6,90F,"Family Offenses, Nonviolent",Person,N,N,,"Family Offenses, Nonviolent",B,2018


In [114]:
clean_off_type_lst = []
for frame in offense_type_lst:
    if len(frame.columns) == 9:
        temp = frame[['offense_type_id','offense_name', 'crime_against','offense_category_name','DATA_YEAR']]
    else:
        temp = frame[['OFFENSE_TYPE_ID','OFFENSE_NAME','CRIME_AGAINST','OFFENSE_CATEGORY_NAME','DATA_YEAR']]
        
    temp.columns = ['OFFENSE_TYPE_ID','NAME','AGAINST','CATEGORY','YEAR']
    clean_off_type_lst.append(temp)


In [115]:
off_type = pd.concat(clean_off_type_lst)
print(off_type.shape)
off_type.sample(10)

(748, 5)


Unnamed: 0,OFFENSE_TYPE_ID,NAME,AGAINST,CATEGORY,YEAR
29,4,Sexual Assault With An Object,Person,Sex Offenses,2019
62,64,Hacking/Computer Invasion,Property,Fraud Offenses,2013
18,18,Purse-snatching,Property,Larceny/Theft Offenses,2013
34,34,Trespass of Real Property,Society,,2017
17,51,Simple Assault,Person,Assault Offenses,2019
41,41,Credit Card/Automated Teller Machine Fraud,Property,Fraud Offenses,2013
23,23,Shoplifting,Property,Larceny/Theft Offenses,2017
35,35,Drug Equipment Violations,Society,Drug/Narcotic Offenses,2012
13,13,Pocket-picking,Property,Larceny/Theft Offenses,2015
44,20,Arson,Property,Arson,2019


In [116]:
#Quick check of the number of unique offense types. 

len(off_type.OFFENSE_TYPE_ID.unique())

86

In [117]:
off_type_max = off_type.groupby('OFFENSE_TYPE_ID')['YEAR'].max().to_frame()

In [118]:
off_types_full = off_type_max.merge(off_type,how='left',on=['OFFENSE_TYPE_ID','YEAR'])

In [119]:
off_types_full.to_csv('../01_nibrs_rawdata/NIBRS_OFFENSE_TYPE.csv',index=False)


&nbsp;

## OFFENSES


In [120]:
for frame in offenses_lst:
    print(frame.columns, len(frame.columns))

Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_

Looks like everything is consistent apart from some capitalization, which is no problem.  We only need the first four columns, so we'll use positional indexing to extract them and put them into a list of clean frames.

In [121]:
clean_offense_lst = []
for frame in offenses_lst:
    frame = frame.iloc[:,:4]
    frame.columns = ['YEAR','OFFENSE_ID','INCIDENT_ID','OFFENSE_TYPE_ID']
    clean_offense_lst.append(frame)
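
Positional indexing works here because `DATA_YEAR` is either already the first column or was inserted at position 0 when the files were loaded. A version that selects by name does not depend on column order. The sketch below is an assumption-laden alternative: `OFFENSE_COLUMNS` and `select_offense_columns` are hypothetical names, and column names are lower-cased first so both capitalizations match.

```python
import pandas as pd

# Lower-cased source column -> harmonized column
OFFENSE_COLUMNS = {
    "data_year": "YEAR",
    "offense_id": "OFFENSE_ID",
    "incident_id": "INCIDENT_ID",
    "offense_type_id": "OFFENSE_TYPE_ID",
}


def select_offense_columns(frame):
    """Pick the four offense columns by name, ignoring capitalization and order."""
    lowered = frame.rename(columns=str.lower)
    return lowered[list(OFFENSE_COLUMNS)].rename(columns=OFFENSE_COLUMNS)
```

A missing column raises a `KeyError` instead of silently picking up whatever happens to sit in the first four positions.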

In [122]:
offenses = pd.concat(clean_offense_lst)
offenses.sample(10)

Unnamed: 0,YEAR,OFFENSE_ID,INCIDENT_ID,OFFENSE_TYPE_ID
118609,2009,51900651,50694254,23
317036,2017,116957914,94360118,21
159416,2019,139427064,114197519,14
292535,2010,56950956,54132533,5
105650,2019,145343316,119951752,36
127094,2017,117827032,95122195,45
32996,2012,71240535,64068048,5
42661,2010,57690153,56407214,21
235834,2009,55808157,50983314,45
293049,2009,55828624,49074886,51


In [123]:
offenses.to_csv('../01_nibrs_rawdata/NIBRS_OFFENSE.csv',index=False)

In [124]:
offenses.shape

(6002034, 4)

In [125]:
inc.shape

(5704160, 5)
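
The two shapes show more offense rows than incident rows, which is what we would expect if an incident can carry more than one offense. Before joining the two tables it may be worth checking that the `INCIDENT_ID` values actually line up. The following is a minimal sketch of such a check; `incident_coverage` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def incident_coverage(offenses, incidents, key="INCIDENT_ID"):
    """Summarise how well the offense rows map onto the incident rows."""
    offense_keys = pd.Index(offenses[key].unique())
    incident_keys = pd.Index(incidents[key].unique())
    return {
        # Offense rows whose incident id is absent from the incident table
        "offenses_without_incident": int((~offenses[key].isin(incident_keys)).sum()),
        # Distinct incident ids that have no offense row at all
        "incidents_without_offense": int((~incident_keys.isin(offense_keys)).sum()),
        # Average number of offense rows per distinct incident id seen in offenses
        "offenses_per_incident": len(offenses) / offense_keys.size,
    }
```

Running `incident_coverage(offenses, inc)` returns the three figures as a dictionary; non-zero counts in the first two entries would point to rows that a join on `INCIDENT_ID` will drop or leave unmatched.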