## NIBRS DATA ##

This notebook walks through obtaining the NIBRS dataset for 2010-2019 and the data cleaning and manipulation steps required before joining it with the sports data.



### Download Data ###

The first step is to download the data and store it in the appropriate folder.  This step was completed manually from the https://crime-data-explorer.fr.cloud.gov/pages/downloads website.  Once at the website, you'll need to do the following:
 1) Go to the section labeled "Crime Incident-Based Data by State"
 2) Select "Michigan" as the state and one of the years in the analysis (2010-2019); repeat steps 2 and 3 for each year
 3) Click "Download". The file will be saved to the location on your hard drive specified by your browser settings (usually a folder named "Downloads")
 4) Once the folders are downloaded and you can navigate to them on your drive, move them to a folder under your project folder. The project folder MUST have a "Data" folder with a subfolder called "NIBRS" (e.g. Project/Data/NIBRS)
 5) Proceed to the following code steps, which load the required libraries and extract the files of interest into 4 separate lists of dataframes (incidents, agencies, offenses and offense types).
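
Before running the loading cell, it can help to confirm that every year's folder is in place. The following is a minimal sketch, assuming the downloaded folders are named `MI-<year>` (as in the printed output further down); the function name `missing_year_folders` is a hypothetical helper, not part of the original notebook.

```python
import os

def missing_year_folders(root="Data/NIBRS", state="MI", years=range(2010, 2020)):
    """Return the expected '<state>-<year>' folder names that are absent under root."""
    expected = [f"{state}-{year}" for year in years]
    # A name counts as present only if it exists and is a directory.
    return [name for name in expected
            if not os.path.isdir(os.path.join(root, name))]

# Example (run from the project folder):
# print(missing_year_folders())   # an empty list means all ten folders were found
```

An empty result means all ten yearly folders are available under Data/NIBRS.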



In [1]:
# The following libraries are used for the NIBRS data manipulation.  The os library is part of Python's standard library and is always available.
# To install pandas, uncomment one of the lines below.  You can use either pip or conda (if you're using the Anaconda data science distribution).

# Uncomment ONE of the following two lines to install pandas (only if pandas fails to import)
# !pip install pandas
# !conda install pandas

import os
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns',100)

In [2]:
# !pwd

incident_lst = []
agencies_lst = []
offenses_lst = []
offense_type_lst = []


for folder in os.listdir('Data/NIBRS/'):
    filepath = f'Data/NIBRS/{folder}'
    if os.path.isdir(filepath):
        for file in os.listdir(filepath):
            if file == 'nibrs_incident.csv' or file=='NIBRS_incident.csv':
                print('found incident file in', filepath)
                incident_lst.append(pd.read_csv(filepath + "/" + file))
            if file == 'nibrs_offense.csv' or file =='NIBRS_OFFENSE.csv':
                print('found offense file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                if 'DATA_YEAR' not in temp.columns:
                    temp.insert(0,'DATA_YEAR',filepath.split('-')[1]) 
                offenses_lst.append(temp)
            if file == 'cde_agencies.csv' or file == 'agencies.csv':
                print('found agency file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                temp['DATA_YEAR'] = filepath.split('-')[1]
                agencies_lst.append(temp)
            if file == 'nibrs_offense_type.csv' or file == 'NIBRS_OFFENSE_TYPE.csv':
                print('found offense_type file in', filepath)
                temp = pd.read_csv(filepath + "/" + file)
                temp['DATA_YEAR'] = filepath.split('-')[1]
                offense_type_lst.append(temp)

found offense_type file in Data/NIBRS/MI-2013
found incident file in Data/NIBRS/MI-2013




found agency file in Data/NIBRS/MI-2013
found offense file in Data/NIBRS/MI-2013
found offense_type file in Data/NIBRS/MI-2014
found incident file in Data/NIBRS/MI-2014
found agency file in Data/NIBRS/MI-2014
found offense file in Data/NIBRS/MI-2014
found offense_type file in Data/NIBRS/MI-2015
found incident file in Data/NIBRS/MI-2015
found agency file in Data/NIBRS/MI-2015
found offense file in Data/NIBRS/MI-2015
found offense_type file in Data/NIBRS/MI-2012
found incident file in Data/NIBRS/MI-2012
found agency file in Data/NIBRS/MI-2012
found offense file in Data/NIBRS/MI-2012
found agency file in Data/NIBRS/MI-2019
found offense_type file in Data/NIBRS/MI-2019
found incident file in Data/NIBRS/MI-2019
found offense file in Data/NIBRS/MI-2019
found agency file in Data/NIBRS/MI-2017
found offense_type file in Data/NIBRS/MI-2017
found incident file in Data/NIBRS/MI-2017
found offense file in Data/NIBRS/MI-2017
found offense_type file in Data/NIBRS/MI-2010
found incident file in Data/

### Data Cleaning ###

Now we need to look at each of the 4 lists of 10 dataframes and compare how they are set up.  We'll start with the columns of the incident frames.  We need to check whether they are formatted the same way across the years and contain the same data.


In [148]:
for frame in incident_lst:
    print(frame.dtypes, len(frame.columns))


agency_id                int64
incident_id              int64
nibrs_month_id           int64
incident_number          int64
cargo_theft_flag        object
submission_date        float64
incident_date           object
report_date_flag        object
incident_hour          float64
cleared_except_id        int64
cleared_except_date     object
incident_status          int64
data_home               object
ddocname                object
orig_format            float64
ff_line_number         float64
did                    float64
dtype: object 17
agency_id                int64
incident_id              int64
nibrs_month_id           int64
incident_number          int64
cargo_theft_flag        object
submission_date        float64
incident_date           object
report_date_flag        object
incident_hour          float64
cleared_except_id        int64
cleared_except_date     object
incident_status          int64
data_home               object
ddocname                object
orig_format           

Looking at the columns, a couple of things stand out. In some of the files the column names are in ALL CAPS, which could cause problems when the data is joined/concatenated.  The files for those same years also have an additional column at the start called "DATA_YEAR".  It is worth making sure this column exists in all of the data frames.
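
Reading ten printed dtype listings by eye is error-prone, so a small summary can help. The sketch below is one possible way to do it, not the notebook's own method: the hypothetical helper `column_differences` compares column names case-insensitively and reports, for each frame, which columns it lacks relative to the union across all frames.

```python
import pandas as pd

def column_differences(frames):
    """For each frame, list the columns (upper-cased) missing relative to the union."""
    upper_sets = [{str(col).upper() for col in frame.columns} for frame in frames]
    all_columns = set().union(*upper_sets)
    return [sorted(all_columns - cols) for cols in upper_sets]

# Toy frames standing in for an older and a newer incident file.
old = pd.DataFrame(columns=["agency_id", "incident_id", "ddocname"])
new = pd.DataFrame(columns=["DATA_YEAR", "AGENCY_ID", "INCIDENT_ID"])
print(column_differences([old, new]))  # [['DATA_YEAR'], ['DDOCNAME']]
```

Calling `column_differences(incident_lst)` would give the same kind of report for the real frames.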

In [3]:
for frame in incident_lst:
    # Older files lack DATA_YEAR: add it as a placeholder (it is filled in later
    # from INCIDENT_DATE) and drop the columns that the newer files do not have.
    if "DATA_YEAR" not in frame.columns:
        frame.insert(0, 'DATA_YEAR', value=np.nan)
        frame.drop(["incident_number", "ddocname", 'ff_line_number'], axis=1, inplace=True)

    # Upper-case every column name so all frames share the same labels.
    frame.columns = [column.upper() for column in frame.columns]

In [4]:
for frame in incident_lst:
    print(frame.columns, len(frame.columns))

Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
       'CARGO_THEFT_FLAG', 'SUBMISSION_DATE', 'INCIDENT_DATE',
       'REPORT_DATE_FLAG', 'INCIDENT_HOUR', 'CLEARED_EXCEPT_ID',
       'CLEARED_EXCEPT_DATE', 'INCIDENT_STATUS', 'DATA_HOME', 'ORIG_FORMAT',
       'DID'],
      dtype='object') 15
Index(['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID', 'NIBRS_MONTH_ID',
 

In [5]:
inc = pd.concat(incident_lst)

In [6]:
# We'd like to see what the data looks like across the fully combined frame. We'll use the sample method (with the random_state parameter, if a repeatable sample is wanted) to look across the dataset

Some additional cleaning steps remain. The dates (particularly the incident dates) are in very different formats, and DATA_YEAR is missing for the frames we added the column to.
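
When a column mixes date formats, one cautious option is to parse each format explicitly and keep the first successful result per value. The sketch below illustrates that idea; the two formats (`%Y-%m-%d` and `%d-%b-%y`, the latter matching values such as `21-DEC-18`) are assumptions for illustration and should be adjusted to whatever the files actually contain. `parse_mixed_dates` is a hypothetical helper.

```python
import pandas as pd

def parse_mixed_dates(values, formats=("%Y-%m-%d", "%d-%b-%y")):
    """Parse with each format in turn; unparseable values end up as NaT."""
    values = pd.Series(values, dtype="object")
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only fill the positions that earlier formats could not parse.
        parsed = parsed.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return parsed

example = parse_mixed_dates(["2013-07-08", "21-DEC-18", "not a date"])
print(example)
```

Counting the NaT values afterwards (`parsed.isna().sum()`) shows how many entries matched none of the assumed formats.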

In [7]:
inc['INCIDENT_DATE'] = pd.to_datetime(inc['INCIDENT_DATE'])

In [8]:
inc.sample(30)

Unnamed: 0,DATA_YEAR,AGENCY_ID,INCIDENT_ID,NIBRS_MONTH_ID,CARGO_THEFT_FLAG,SUBMISSION_DATE,INCIDENT_DATE,REPORT_DATE_FLAG,INCIDENT_HOUR,CLEARED_EXCEPT_ID,CLEARED_EXCEPT_DATE,INCIDENT_STATUS,DATA_HOME,ORIG_FORMAT,DID
246409,,9049,71198909,6258022,,,2013-07-08,,22.0,6,,0,C,,
453205,,8814,66630891,5992155,,,2012-10-13,,1.0,6,,0,C,,
350089,,8638,75645108,6501896,,,2014-03-20,,20.0,6,,0,C,,
341874,2018.0,9041,101847696,8380183,,21-DEC-18,2018-10-03,,20.0,6,,0,C,F,29061114.0
203893,2017.0,8626,88812073,7487399,,17-AUG-18,2017-01-18,,17.0,6,,0,C,F,3215221.0
235398,2017.0,9063,93513338,7644880,,17-AUG-18,2017-04-11,,17.0,6,,0,C,F,12551144.0
62386,,8630,53751476,5251331,,,2010-06-12,,20.0,6,,0,C,,
452979,,8814,57721350,5707882,,,2011-12-08,,22.0,6,,0,C,,
124614,,8710,81797336,6823517,,,2015-03-26,R,,6,,0,C,,
88872,,8613,79340565,6912263,,,2015-10-08,R,20.0,6,,0,C,,


In [9]:
inc.groupby(inc['INCIDENT_DATE'].dt.year)['INCIDENT_ID'].agg('count')

INCIDENT_DATE
2010    597205
2011    564424
2012    562587
2013    522879
2014    492028
2015    487547
2016    504997
2017    477382
2018    458209
2019    419064
Name: INCIDENT_ID, dtype: int64

In [10]:
inc['DATA_YEAR'] = inc['INCIDENT_DATE'].dt.year
inc = inc[['DATA_YEAR', 'AGENCY_ID', 'INCIDENT_ID','INCIDENT_DATE','INCIDENT_HOUR']]

In [11]:
inc.to_csv('Data/Combined_Incidents.csv',index=False)

In [92]:
first_agency = agencies_lst[1]
# first_agency[first_agency['agency_name'].str.contains("Michigan State")==True]
first_agency[first_agency['agency_id']==9040]
# first_agency[first_agency['agency_id']== 8286]

Unnamed: 0,agency_id,ori,legacy_ori,agency_name,short_name,agency_type_id,agency_type_name,tribe_id,campus_id,city_id,city_name,state_id,state_abbr,primary_county_id,primary_county,primary_county_fips,agency_status,submitting_agency_id,submitting_sai,submitting_name,submitting_state_abbr,start_year,dormant_year,current_year,revised_rape_start,current_nibrs_start_year,population,population_group_code,population_group_desc,population_source_flag,suburban_area_flag,core_city_flag,months_reported,nibrs_months_reported,past_10_years_reported,covered_by_id,covered_by_ori,covered_by_name,staffing_year,total_officers,total_civilians,icpsr_zip,icpsr_lat,icpsr_lng,DATA_YEAR
580,9040,MI8190300,MI8190300,University of Michigan: Ann Arbor,University of Michigan: Ann Arbor,3,University or College,,764.0,,,26,MI,1324,Washtenaw,26161.0,A,23374,MIUCR0001,Michigan State Police Statistical Records Divi...,MI,1991,,2016,2013.0,2016.0,0,7,"Cities under 2,500",L,Y,N,12,12,10,,,,2016.0,59.0,17.0,48502,42.252327,-83.844634,2014


In [71]:
second_agency = agencies_lst[-1]
# second_agency.head()
second_agency[second_agency['AGENCY_ID']==9040]

Unnamed: 0,YEARLY_AGENCY_ID,AGENCY_ID,DATA_YEAR,ORI,LEGACY_ORI,COVERED_BY_LEGACY_ORI,DIRECT_CONTRIBUTOR_FLAG,DORMANT_FLAG,DORMANT_YEAR,REPORTING_TYPE,UCR_AGENCY_NAME,NCIC_AGENCY_NAME,PUB_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_STATUS,STATE_ID,STATE_NAME,STATE_ABBR,STATE_POSTAL_ABBR,DIVISION_CODE,DIVISION_NAME,REGION_CODE,REGION_NAME,REGION_DESC,AGENCY_TYPE_NAME,POPULATION,SUBMITTING_AGENCY_ID,SAI,SUBMITTING_AGENCY_NAME,SUBURBAN_AREA_FLAG,POPULATION_GROUP_ID,POPULATION_GROUP_CODE,POPULATION_GROUP_DESC,PARENT_POP_GROUP_CODE,PARENT_POP_GROUP_DESC,MIP_FLAG,POP_SORT_ORDER,SUMMARY_RAPE_DEF,PE_REPORTED_FLAG,MALE_OFFICER,MALE_CIVILIAN,PED.MALE_OFFICER+PED.MALE_CIVILIAN,FEMALE_OFFICER,FEMALE_CIVILIAN,PED.FEMALE_CIVILIAN+PED.FEMALE_OFFICER,0,0.1,NIBRS_CERT_DATE,NIBRS_START_DATE,NIBRS_LEOKA_START_DATE,NIBRS_CT_START_DATE,NIBRS_MULTI_BIAS_START_DATE,NIBRS_OFF_ETH_START_DATE,COVERED_FLAG,COUNTY_NAME,MSA_NAME,PUBLISHABLE_FLAG,PARTICIPATED,NIBRS_PARTICIPATED
556,90402018,9040,2018,MI8190300,MI8190300,,N,N,,I,UNIV OF MI: ANN ARBOR,UNIV OF MICH DEPT OF PUBLIC SAFETY ANN ARBOR,University of Michigan:,Ann Arbor,A,26,Michigan,MI,MI,3,East North Central,2,Midwest,Region II,University or College,0,23374,MIUCR0001,Michigan State Police Criminal Justice Informa...,Y,11,7,"Cities under 2,500",7,"Cities under 2,500",N,2,R,Y,48.0,8.0,56.0,12.0,8.0,20.0,0,0,01-OCT-94,01-JAN-95,01-JUL-09,01-JUL-12,01-JAN-17,01-JAN-17,N,WASHTENAW,"Ann Arbor, MI",Y,Y,Y


In [61]:
second_agency[second_agency['AGENCY_ID']==8554][['AGENCY_ID','UCR_AGENCY_NAME','NCIC_AGENCY_NAME', 'PUB_AGENCY_UNIT','AGENCY_TYPE_NAME','DATA_YEAR']]

Unnamed: 0,AGENCY_ID,UCR_AGENCY_NAME,NCIC_AGENCY_NAME,PUB_AGENCY_UNIT,AGENCY_TYPE_NAME,DATA_YEAR
205,8554,MICHIGAN STATE UNIVERSIT,MI STATE UNIV PD EAST LANSING,,University or College,2018


Comparing the two dataframes (the older layout, shown above for the file from the MI-2014 folder, and the newer layout, shown for 2018), there are quite a few differences.  The biggest challenge is that the naming of the agency (which provides attribution for the incident/offense) changes.  Where the older layout simply has an agency_name, the newer layout splits the name across different reporting systems/agencies (UCR_AGENCY_NAME, NCIC_AGENCY_NAME, PUB_AGENCY_NAME), so it's really just a matter of choosing the column that is closest to the agency_name column found in the older datasets. Fortunately, the core agency IDs appear to be intact, so we can assume that 9040 is the University of Michigan police and 8554 is Michigan State University's.

Of less concern but worth noting, both layouts contain a county field, but the columns are labeled differently (primary_county in the older files, COUNTY_NAME in the newer ones).  That might be helpful to include if we want to investigate a little more broadly geographically. Also, DATA_YEAR is not a native column of the older files; the loading loop above added it from the folder name. The older files do have a current_year column, but in the sample row above it reads 2016 for the file from the MI-2014 folder, so it is not a reliable substitute.

Now that we have a sense of the differences, we can reduce each dataframe to only the fields we care about and, in the process, name the columns consistently so that the frames concatenate cleanly.
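
As an illustration of the reduction step, here is a minimal sketch that tells the two layouts apart by column name rather than by column count. It is a variant, not the notebook's own cell: it takes the county from COUNTY_NAME in the newer layout (which holds WASHTENAW in the 2018 sample row) and title-cases it to match the older files. The mapping dictionaries and the function name are hypothetical.

```python
import pandas as pd

# Target column -> source column, for each layout (assumed from the samples above).
OLD_LAYOUT = {"AGENCY_ID": "agency_id", "AGENCY_NAME": "agency_name",
              "COUNTY": "primary_county", "TYPE": "agency_type_name",
              "YEAR": "DATA_YEAR"}
NEW_LAYOUT = {"AGENCY_ID": "AGENCY_ID", "AGENCY_NAME": "UCR_AGENCY_NAME",
              "COUNTY": "COUNTY_NAME", "TYPE": "AGENCY_TYPE_NAME",
              "YEAR": "DATA_YEAR"}

def harmonize_agency_frame(frame):
    """Return a copy with the five common columns, whichever layout frame uses."""
    layout = OLD_LAYOUT if "agency_name" in frame.columns else NEW_LAYOUT
    out = frame[list(layout.values())].copy()
    out.columns = list(layout.keys())
    # Newer files store counties in upper case; title-case for consistency.
    out["COUNTY"] = out["COUNTY"].astype("string").str.title()
    return out
```

Because the columns are selected by name, a file with an unexpected layout raises a `KeyError` instead of being silently mis-mapped.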



In [93]:
clean_agencies_lst = []
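for frame in agencies_lst:
    # The older layout has fewer columns than the newer one, so the column count
    # is used here to tell the two layouts apart.
    if len(frame.columns) < 59:
        frame = frame[['agency_id','agency_name','primary_county', 'agency_type_name', 'DATA_YEAR']]
    else:
        # Note: PUB_AGENCY_UNIT is mapped to COUNTY here, but in the 2018 sample row
        # above it holds "Ann Arbor" while COUNTY_NAME holds "WASHTENAW".
        frame = frame[['AGENCY_ID','UCR_AGENCY_NAME','PUB_AGENCY_UNIT','AGENCY_TYPE_NAME','DATA_YEAR']]
        
    frame.columns = ['AGENCY_ID','AGENCY_NAME','COUNTY','TYPE','YEAR']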
    clean_agencies_lst.append(frame)

In [94]:
agen = pd.concat(clean_agencies_lst)
agen[agen['AGENCY_ID']==9040]

Unnamed: 0,AGENCY_ID,AGENCY_NAME,COUNTY,TYPE,YEAR
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2013
580,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2014
705,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2015
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2012
560,9040,UNIV OF MI: ANN ARBOR,Ann Arbor,University or College,2019
553,9040,UNIV OF MI: ANN ARBOR,Ann Arbor,University or College,2017
567,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2010
571,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2011
527,9040,University of Michigan: Ann Arbor,Washtenaw,University or College,2016
556,9040,UNIV OF MI: ANN ARBOR,Ann Arbor,University or College,2018


In [95]:
agen[agen['AGENCY_ID']==8554]

Unnamed: 0,AGENCY_ID,AGENCY_NAME,COUNTY,TYPE,YEAR
112,8554,Michigan State University,Ingham,University or College,2013
115,8554,Michigan State University,Ingham,University or College,2014
39,8554,Michigan State University,Ingham,University or College,2015
112,8554,Michigan State University,Ingham,University or College,2012
211,8554,MICHIGAN STATE UNIVERSIT,,University or College,2019
208,8554,MICHIGAN STATE UNIVERSIT,,University or College,2017
111,8554,Michigan State University,Ingham,University or College,2010
112,8554,Michigan State University,Ingham,University or College,2011
30,8554,Michigan State University,Ingham,University or College,2016
205,8554,MICHIGAN STATE UNIVERSIT,,University or College,2018


Since there's a lot of duplication in this data (the same agencies appear for multiple years), we want to condense it down to the unique agencies and pull in the other details such as county.  We have to account for the fact that some agency IDs may exist in earlier years but not in later ones.  We'll group by the agency ID and take the highest (most recent) year on record for each agency.  We'll convert this to a dataframe and give its columns the same names as in the larger dataframe in preparation for a merge.

In [121]:
maxagens = agen.groupby('AGENCY_ID')['YEAR'].agg('max')
unique_agens = maxagens.to_frame().reset_index()
unique_agens.columns = ['AGENCY_ID','YEAR']

Next, we'll merge the two datasets on the agency ID and year (common to both) to pull the details for each agency's most recent year from the table that contains the duplicates.  We'll save the result as a CSV file.

In [128]:
combined = unique_agens.merge(agen, how='left', on=['AGENCY_ID','YEAR'])
combined.to_csv('Data/agencies.csv',index=False)
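
An alternative way to get the same "most recent row per agency" result in one step is to sort and drop duplicates. The sketch below shows that approach under two assumptions: every YEAR value can be converted to an integer, and one row per agency is wanted even if a year contains duplicate rows. `latest_record_per_key` is a hypothetical helper.

```python
import pandas as pd

def latest_record_per_key(frame, key="AGENCY_ID", year="YEAR"):
    """Keep one row per key: the row with the largest year value."""
    # Years taken from folder names are strings, so convert before sorting.
    ordered = frame.assign(**{year: frame[year].astype(int)}).sort_values([key, year])
    return ordered.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

toy = pd.DataFrame({"AGENCY_ID": [1, 1, 2],
                    "YEAR": ["2015", "2019", "2012"],
                    "AGENCY_NAME": ["a", "A", "b"]})
print(latest_record_per_key(toy))
```

Unlike the merge, this cannot produce more than one row per agency, which makes it a convenient cross-check of the row count of the merged table.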

In [129]:
combined

Unnamed: 0,AGENCY_ID,YEAR,AGENCY_NAME,COUNTY,TYPE
0,8286,2019,SP: ALCONA COUNTY,Alcona County,State Police
1,8287,2019,ALCONA,,County
2,8288,2015,Harrisville Police Department,Alcona,City
3,8289,2015,Lincoln Police Department,Alcona,City
4,8290,2019,SP: ALGER COUNTY,Alger County,State Police
...,...,...,...,...,...
838,26021,2019,"STATE POLICE, DETROIT",,State Police
839,26624,2019,METRO POL AUTH GENESEE CNTY,,City
840,28034,2019,DEPT NAT RESOURCES LAW ENF DIV,,Other State Agency
841,28154,2019,WASHTENAW COMMUNITY COLLEGE,,University or College


## OFFENSE TYPE ##

In [156]:
for frame in offense_type_lst:
    print(frame.columns, len(frame.columns))

Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['offense_type_id', 'offense_code', 'offense_name', 'crime_against',
       'ct_flag', 'hc_flag', 'hc_code', 'offense_category_name', 'DATA_YEAR'],
      dtype='object') 9
Index(['OFFENSE_TYPE_ID', 'OFFENSE_CODE', 'OFFENSE_NAME', 'CRIME_AGAINST',
       'CT_FLAG', 'HC_FLAG', 'HC_CODE', 'OFFENSE_CATEGORY_NAME',
       'OFFENSE_GROUP', 'DATA_YEAR'],
      dtype='object') 10
Index(['OFFENSE_TYPE_ID', 'OFFENSE_CODE', 'OFFENSE_NAME', 'CRIME_AGAINST',
       'CT

Looks like there is some variation here: the column names are capitalized in the later files, which also add a column ('OFFENSE_GROUP'). To keep it simple, we'll extract only the ID, name, crime-against and category columns, plus the year.

In [157]:
offense_type_lst[0].head()

Unnamed: 0,offense_type_id,offense_code,offense_name,crime_against,ct_flag,hc_flag,hc_code,offense_category_name,DATA_YEAR
0,58,23*,Not Specified,Property,N,Y,6.0,Larceny/Theft Offenses,2013
1,1,09C,Justifiable Homicide,Not a Crime,N,N,,Homicide Offenses,2013
2,2,26A,False Pretenses/Swindle/Confidence Game,Property,Y,Y,,Fraud Offenses,2013
3,3,36B,Statutory Rape,Person,N,Y,,Sex Offenses,2013
4,4,11C,Sexual Assault With An Object,Person,N,Y,2.0,Sex Offenses,2013


In [161]:
offense_type_lst[-1].sample(10)

Unnamed: 0,OFFENSE_TYPE_ID,OFFENSE_CODE,OFFENSE_NAME,CRIME_AGAINST,CT_FLAG,HC_FLAG,HC_CODE,OFFENSE_CATEGORY_NAME,OFFENSE_GROUP,DATA_YEAR
32,7,23G,Theft of Motor Vehicle Parts or Accessories,Property,N,N,,Larceny/Theft Offenses,A,2018
65,42,90B,Curfew/Loitering/Vagrancy Violations,Society,N,N,,Curfew/Loitering/Vagrancy Violations,B,2018
76,59,64A,"Human Trafficking, Commercial Sex Acts",Person,N,Y,12.0,Human Trafficking,A,2018
17,51,13B,Simple Assault,Person,N,Y,9.0,Assault Offenses,A,2018
7,78,49C,Flight to Avoid Deportation,Society,N,N,,Other Offenses,A,2018
16,46,26C,Impersonation,Property,Y,Y,,Fraud Offenses,A,2018
35,11,250,Counterfeiting/Forgery,Property,N,Y,,Counterfeiting/Forgery,A,2018
84,69,101,Treason,Society,N,N,,Other Offenses,A,2018
10,81,526,Explosives Violation,Society,N,N,,Other Offenses,A,2018
56,32,09A,Murder and Nonnegligent Manslaughter,Person,N,Y,1.0,Homicide Offenses,A,2018


In [162]:
clean_off_type_lst = []
for frame in offense_type_lst:
    if len(frame.columns) == 9:
        temp = frame[['offense_type_id','offense_name', 'crime_against','offense_category_name','DATA_YEAR']]
    else:
        temp = frame[['OFFENSE_TYPE_ID','OFFENSE_NAME','CRIME_AGAINST','OFFENSE_CATEGORY_NAME','DATA_YEAR']]
        
    temp.columns = ['OFFENSE_TYPE_ID','NAME','AGAINST','CATEGORY','YEAR']
    clean_off_type_lst.append(temp)


In [163]:
off_type = pd.concat(clean_off_type_lst)
print(off_type.shape)
off_type.sample(10)

(684, 5)


Unnamed: 0,OFFENSE_TYPE_ID,NAME,AGAINST,CATEGORY,YEAR
38,38,Negligent Manslaughter,Person,Homicide Offenses,2017
24,48,All Other Offenses,Society,All Other Offenses,2019
5,5,Destruction/Damage/Vandalism of Property,Property,Destruction/Damage/Vandalism of Property,2012
3,3,Statutory Rape,Person,Sex Offenses,2016
62,64,Hacking/Computer Invasion,Property,Fraud Offenses,2011
42,42,Curfew/Loitering/Vagrancy Violations,Society,,2014
42,42,Curfew/Loitering/Vagrancy Violations,Society,,2015
37,37,Embezzlement,Property,Embezzlement,2017
13,84,Federal Liquor Offenses,Society,Other Offenses,2019
34,34,Trespass of Real Property,Society,,2010


In [166]:
#Quick check of the number of unique offense types. 

len(off_type.OFFENSE_TYPE_ID.unique())

86

In [171]:
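# reset_index() turns OFFENSE_TYPE_ID back into a regular column, as in the agencies step
off_type_max = off_type.groupby('OFFENSE_TYPE_ID')['YEAR'].max().to_frame().reset_index()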

In [174]:
off_types_full = off_type_max.merge(off_type,how='left',on=['OFFENSE_TYPE_ID','YEAR'])

In [175]:
off_types_full.to_csv('Data/NIBRS/off_types.csv',index=False)

## Offenses ##

In [196]:
for frame in offenses_lst:
    print(frame.columns, len(frame.columns))

Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'offense_id', 'incident_id', 'offense_type_id',
       'attempt_complete_flag', 'location_id', 'num_premises_entered',
       'method_entry_code', 'ff_line_number'],
      dtype='object') 9
Index(['DATA_YEAR', 'OFFENSE_ID', 'INCIDENT_ID', 'OFFENSE_TYPE_ID',
       'ATTEMPT_COMPLETE_FLAG', 'LOCATION_ID', 'NUM_PREMISES_ENTERED',
       'METHOD_ENTRY_

Looks like everything is consistent apart from capitalization, which is not a problem.  We only need the first four columns, so we'll use positional indexing to extract them and put the results into a list of clean frames.

In [197]:
clean_offense_lst = []
for frame in offenses_lst:
    frame = frame.iloc[:,:4]
    frame.columns = ['YEAR','OFFENSE_ID','INCIDENT_ID','OFFENSE_TYPE_ID']
    clean_offense_lst.append(frame)
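
Positional indexing relies on every file listing its columns in the same order. A slightly more defensive sketch, assuming the four column names shown in the output above, selects them by name after upper-casing; `select_offense_columns` is a hypothetical helper and not the notebook's own code.

```python
import pandas as pd

WANTED = ["DATA_YEAR", "OFFENSE_ID", "INCIDENT_ID", "OFFENSE_TYPE_ID"]

def select_offense_columns(frame):
    """Pick the four offense columns by name, regardless of case or position."""
    upper = frame.rename(columns=str.upper)
    # A missing column raises KeyError rather than silently shifting positions.
    out = upper[WANTED].copy()
    return out.rename(columns={"DATA_YEAR": "YEAR"})

toy = pd.DataFrame({"offense_id": [1], "DATA_YEAR": ["2013"],
                    "incident_id": [2], "offense_type_id": [3],
                    "location_id": [4]})
print(select_offense_columns(toy).columns.tolist())
# ['YEAR', 'OFFENSE_ID', 'INCIDENT_ID', 'OFFENSE_TYPE_ID']
```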

In [198]:
offenses = pd.concat(clean_offense_lst)
offenses.sample(10)

Unnamed: 0,YEAR,OFFENSE_ID,INCIDENT_ID,OFFENSE_TYPE_ID
456000,2011,67438164,57667684,51
8250,2011,65064510,59560132,5
479031,2016,94466074,86370052,56
346270,2013,75151265,69000066,27
311940,2017,121512046,98318253,5
319917,2016,97575188,89130009,39
33733,2010,57627404,56355968,49
214375,2015,92047288,84260835,27
92198,2014,79803458,75451535,35
42512,2014,79679711,75493746,20


In [199]:
offenses.to_csv('Data/Offenses.csv',index=False)

In [200]:
offenses.shape

(5354701, 4)

In [203]:
inc.shape

(5086322, 15)