### Clean Crime Data

This table may not be usable for county-level analysis because "These data do not represent county totals as they exclude crime counts for city agencies and other types of agencies that have jurisdiction within each county." Source: https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-10/table-10-data-declaration

"General comments: The Metropolitan Counties classification encompasses jurisdictions covered by noncity law enforcement agencies located within currently designated Metropolitan Statistical Areas (MSAs). The Nonmetropolitan Counties classification encompasses jurisdictions covered by noncity agencies located outside currently designated MSAs. This table provides the volume of violent crime (murder and nonnegligent manslaughter, rape, robbery, and aggravated assault) and property crime (burglary, larceny-theft, and motor vehicle theft) as reported by law enforcement agencies (such as individual sheriffs’ offices and/or county police departments) in metropolitan counties and nonmetropolitan counties (listed alphabetically by state) that contributed data to the UCR Program. (Note: Arson is not included in the property crime total in this table; however, if complete arson data were provided, it will appear in the Arson column.) These data do not represent county totals as they exclude crime counts for city agencies and other types of agencies that have jurisdiction within each county."

In [None]:
import pandas as pd

# Crime data https://www.fbi.gov/services/cjis/ucr/publications table 10 for years 2014 - 2019


In [None]:
mapper = pd.read_csv('/work/cleaned-csvs/us_counties.csv')

crime_df = pd.DataFrame(columns=['year'])

for year in range(2014, 2019):  # NOTE: the end of range() is exclusive, so this reads 2014-2018; use range(2014, 2020) to include 2019
    # Build the file path and read the sheet
    yr_range = str(year)

    filepath = 'assets/crime_counties_{}.xls'.format(yr_range)
    df = pd.read_excel(filepath)

    df = df.iloc[3:, :]  # drop the unnecessary rows above the header

    new_header = df.iloc[0]  # grab the first remaining row for the header
    df = df[1:].copy()  # take the data below the header row (copy avoids SettingWithCopyWarning)
    df.columns = new_header  # set the header row as the df header
    df['year'] = yr_range

    df['State'] = df['State'].ffill()  # fill the state label down to every county row

    crime_df = pd.concat([crime_df, df], ignore_index=True)


# Remove footnotes: sort_values puts rows with a missing County last, then the final 50 rows are dropped
crime_df.sort_values('County',inplace = True)

crime_df = crime_df.iloc[:-50,:]

len(crime_df)

12642
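
Dropping a fixed number of trailing rows depends on the footnote count staying at 50. A minimal sketch of an alternative, assuming the footnote rows are exactly the rows with no `County` value (an assumption, not verified against the spreadsheets); the toy frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the concatenated Table 10 sheets (all values are hypothetical).
toy = pd.DataFrame({
    "State": ["ALABAMA - Metropolitan Counties", "ALABAMA - Metropolitan Counties",
              "1 footnote text (hypothetical)", "2 footnote text (hypothetical)"],
    "County": ["Autauga", "Baldwin", None, None],
    "Violent\ncrime": [10, 20, None, None],
})

# Keep only rows that name a county; rows without a County are treated as footnotes.
no_footnotes = toy.dropna(subset=["County"]).reset_index(drop=True)

print(len(toy) - len(no_footnotes), "footnote rows dropped")
```

This removes however many footnote rows are present, so it would not need updating if a yearly file has more or fewer footnotes.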

### There's a lot of missing data

Consider bringing in Metropolitan Statistical Area (MSA) data to fill the gaps, although it looks too messy to be worth the effort.

In [None]:
crime_msa = pd.read_excel('/work/assets/crime_msa_2019.xls')

crime_msa

crime_msa = crime_msa.iloc[2:,:] #get rid of top few unnecessary rows
    
new_header = crime_msa.iloc[0] #grab the first row for the header
crime_msa = crime_msa[1:] #take the data less the header row
crime_msa.columns = new_header #set the header row as the df header


crime_msa['Metropolitan Statistical Area'] = crime_msa['Metropolitan Statistical Area'].ffill()

crime_msa[crime_msa['Counties/principal cities']=='Rate per 100,000 inhabitants']

crime_msa = crime_msa[['Metropolitan Statistical Area', 'Counties/principal cities','Violent\ncrime', 'Property\ncrime']]

crime_msa.head(5)

2,Metropolitan Statistical Area,Counties/principal cities,Violent\ncrime,Property\ncrime
3,"Abilene, TX M.S.A.",,,
4,"Abilene, TX M.S.A.","Includes Callahan, Jones, and Taylor Counties",,
5,"Abilene, TX M.S.A.",City of Abilene,458.0,3112.0
6,"Abilene, TX M.S.A.",Total area actually reporting,543.0,3603.0
7,"Abilene, TX M.S.A.","Rate per 100,000 inhabitants",317.3,2105.5
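
The county and MSA spreadsheets both go through the same "skip the preamble, promote the next row to the header" step. A small sketch of a shared helper for that step, where `promote_header` and the toy frame are hypothetical names and data introduced only for illustration:

```python
import pandas as pd

def promote_header(raw: pd.DataFrame, n_skip: int) -> pd.DataFrame:
    """Drop the first n_skip rows, then use the next row as the column names."""
    body = raw.iloc[n_skip:]
    out = body.iloc[1:].copy()           # data rows below the header row
    out.columns = body.iloc[0].tolist()  # header row becomes the column names
    return out

# Toy sheet: two preamble rows, one header row, two data rows (hypothetical layout).
raw = pd.DataFrame([
    ["Table title (hypothetical)", None],
    [None, None],
    ["Metropolitan Statistical Area", "Violent\ncrime"],
    ["Abilene, TX M.S.A.", None],
    [None, 458.0],
])

clean = promote_header(raw, n_skip=2)
clean["Metropolitan Statistical Area"] = clean["Metropolitan Statistical Area"].ffill()
print(clean)
```

The helper keeps the original row index, which matches the outputs shown in this notebook.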


### Clean Up Crime Data - find inconsistencies in names

In [None]:
crime_df['state']=crime_df['State'].str.split(' -').str[0].str.lower() #isolate state in one column

crime_df['county']=crime_df['County'].str.lower() #make county lowercase


#There's got to be a better way --> dictionary?
crime_df['county'] = crime_df['county'].str.replace(r'\d+', '', regex=True) #strip footnote digits; regex=True is required in pandas >= 2.0
crime_df['county'] = crime_df['county'].str.replace(' county police department', '')
crime_df['county'] = crime_df['county'].str.replace(',', '')
crime_df['county'] = crime_df['county'].str.replace(' police department', '')
crime_df['county'] = crime_df['county'].str.replace(' public safety', '')
crime_df['county'] = crime_df['county'].str.replace('hartsville/trousdale', 'trousdale')
crime_df['county'] = crime_df['county'].str.replace('de witt', 'dewitt')
crime_df['county'] = crime_df['county'].str.replace('la porte', 'laporte')
crime_df['county'] = crime_df['county'].str.replace('la salle', 'lasalle')
crime_df['county'] = crime_df['county'].str.replace('de kalb', 'dekalb')
crime_df['county'] = crime_df['county'].str.replace(' county unified', '')
crime_df['county'] = crime_df['county'].str.replace('dona ana', 'doña ana')
crime_df['county'] = crime_df['county'].str.replace('snohomish ', 'snohomish')
crime_df['county'] = crime_df['county'].str.replace('duchess', 'dutchess')
#crime_df['county'] = crime_df['county'].str.replace('story', 'storey')
crime_df['county'] = crime_df['county'].str.replace('de soto ', 'de soto')
crime_df['county'] = crime_df['county'].str.replace('livingston ', 'livingston')
crime_df['county'] = crime_df['county'].str.replace('vermilion ', 'vermilion')
crime_df['county'] = crime_df['county'].str.replace('allen ', 'allen')
crime_df['county'] = crime_df['county'].str.replace('tulare ', 'tulare')
crime_df['county'] = crime_df['county'].str.replace('hinds ', 'hinds')
crime_df['county'] = crime_df['county'].str.replace('carson city', 'cars')
crime_df['county'] = crime_df['county'].str.replace('butte-silver bow', 'silver bow')
crime_df['county'] = crime_df['county'].str.replace('baltimore county', 'baltimore')
crime_df['county'] = crime_df['county'].str.replace("prince george's ", "prince george's")
crime_df['county'] = crime_df['county'].str.replace('bartholemew', 'bartholomew')
#crime_df['county'] = crime_df['county'].str.replace('st. bernard', 'st. bernard')
crime_df['county'] = crime_df['county'].str.replace('augusta-richmond', 'richmond')


crime_df = crime_df [[ 'county', 'Violent\ncrime', 'Property\ncrime', 'state','year']]

dict_crime = {'Violent\ncrime':'violent_crime', 'Property\ncrime':'property_crime','county':'county_name','state':'state_name'}

crime_df.rename(columns=dict_crime,inplace = True)


# The column is called county_name after the rename above
crime_df['county_name'] = crime_df['county_name'].str.replace(r'\d+', '', regex=True)
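
The long chain of `str.replace` calls can be driven from a list and a dictionary instead. This is a sketch rather than a drop-in replacement: `SUFFIXES`, `RENAMES`, and `normalize_county` are hypothetical names, the rules are only a subset of the ones above, and the sample names are made up. It also assumes the stray spaces being fixed are always leading or trailing, so one `str.strip()` can stand in for rules like `'snohomish '` to `'snohomish'`.

```python
import pandas as pd

# Agency suffixes to strip; longer suffixes come first so they match before shorter ones.
SUFFIXES = [" county police department", " police department",
            " public safety", " county unified"]

# Spelling fixes, taken from a subset of the replacements above.
RENAMES = {"de witt": "dewitt", "la porte": "laporte",
           "duchess": "dutchess", "augusta-richmond": "richmond"}

def normalize_county(names: pd.Series) -> pd.Series:
    out = names.str.lower()
    out = out.str.replace(r"\d+", "", regex=True)  # footnote digits
    out = out.str.replace(",", "", regex=False)
    for suffix in SUFFIXES:
        out = out.str.replace(suffix, "", regex=False)
    for old, new in RENAMES.items():
        out = out.str.replace(old, new, regex=False)
    return out.str.strip()  # assumption: stray spaces are only leading/trailing

sample = pd.Series(["Clayton County Police Department", "De Witt", "Duchess3", "Snohomish "])
print(normalize_county(sample).tolist())
```

Adding a new fix then means adding one dictionary entry rather than another line of code.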


### Match county names with the fips code for states and counties

In [None]:
# The mapper is needed to add the state/county FIPS codes to the crime data

mapper['county_name'] = mapper['county_name'].str.replace('la salle', 'lasalle')

mapper['county_name'] = mapper['county_name'].str.replace('de witt', 'dewitt')
mapper['county_name'] = mapper['county_name'].str.replace('de kalb', 'dekalb')



In [None]:
crime_df = pd.merge(mapper, crime_df, how='right', on=['state_name', 'county_name'])

In [None]:
# Get list of rows without state/county index

crime_df = crime_df[~crime_df['county_name'].isna()]
missing_1 = crime_df[crime_df['state'].isna()].sort_values('state_name')

missing_1.head()

Unnamed: 0,state,county,county_name,state_name,violent_crime,property_crime,year
10662,,,st. bernard,louisiana,86,833,2014
10933,,,story,nevada,31,95,2015
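
Unmatched rows like these can also be flagged at merge time with the `indicator` parameter of `pd.merge`, which adds a `_merge` column. A minimal sketch on toy frames; the frame names and the crime counts for Clayton are made up, and the Storey FIPS codes are included only for illustration:

```python
import pandas as pd

# Toy mapper: FIPS codes plus lowercase names.
mapper_toy = pd.DataFrame({
    "state": [13, 32],
    "county": [63, 29],
    "state_name": ["georgia", "nevada"],
    "county_name": ["clayton", "storey"],
})

# Toy crime rows: "story" is misspelled, so it has no match in the mapper.
crime_toy = pd.DataFrame({
    "state_name": ["georgia", "nevada"],
    "county_name": ["clayton", "story"],
    "violent_crime": [100, 31],
})

merged = pd.merge(mapper_toy, crime_toy, how="right",
                  on=["state_name", "county_name"], indicator=True)

# "right_only" marks crime rows that found no FIPS code in the mapper.
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched[["state_name", "county_name"]])
```

Filtering on `_merge` avoids relying on a particular mapper column being null to detect a failed match.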


### Deal with Duplicates

There appear to be 113 rows that share a county, state, and year with another row. Most of these duplicates come from counties that report county crimes and county police department crimes separately. Counties that disaggregate this way seem to attribute nearly all of their crimes to the police department and almost none to the county; see Clayton below for an example. The current fix is to keep the row with the higher crime count (sorted by property crime) for each county, state, and year. I'm open to other solutions too.
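
A minimal sketch of the "keep the higher row" rule on toy data, using `sort_values` followed by `drop_duplicates`. The numbers are made up; only the FIPS pairs (63, 13) and (3, 24) come from the examples in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": [63, 63, 3],
    "state": [13, 13, 24],
    "year": [2016, 2016, 2017],
    "violent_crime": [0, 950, 400],       # hypothetical counts
    "property_crime": [2, 7000, 9000],    # hypothetical counts
})

keys = ["county", "state", "year"]

# Sort so the largest property_crime comes first, then keep the first row per key.
deduped = (
    toy.sort_values("property_crime", ascending=False)
       .drop_duplicates(subset=keys, keep="first")
       .sort_values(keys)
       .reset_index(drop=True)
)
print(deduped)
```

Unlike `groupby(...).first()`, `drop_duplicates` keeps whole rows intact and does not discard rows whose key columns are NaN. If the two agencies in a county turn out to report disjoint counts, summing the duplicate rows with `groupby(keys).sum()` would be another option to weigh.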

In [None]:
crime_df = crime_df.set_index(['county','state','year'],)

crime_df.index.value_counts()[:112]

(177.0, 21.0, 2015)    2
(3.0, 24.0, 2017)      2
(117.0, 21.0, 2016)    2
(103.0, 36.0, 2017)    2
(127.0, 13.0, 2016)    2
                      ..
(99.0, 38.0, 2016)     1
(53.0, 30.0, 2014)     1
(57.0, 41.0, 2017)     1
(77.0, 22.0, 2014)     1
(41.0, 8.0, 2016)      1
Length: 112, dtype: int64

In [None]:
crime_df.reset_index(inplace=True)

In [None]:
# Anne arundel - used police department + county
# crime_df[(crime_df['county']==3) & (crime_df['state']==24)]


# Clayton police vs. Clayton County
# crime_df[(crime_df['county']==63) & (crime_df['state']==13)]

#crime_df[(crime_df['county']==37) & (crime_df['state']==21)]


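# Sort so the row with the most property crime comes first within each group
crime_df = crime_df.sort_values('property_crime',ascending=False)

# Keep one row per county/state/year. Note that first() takes the first non-null value
# per column, and groupby drops rows whose keys are NaN (e.g. rows with no FIPS match).
crime_df = crime_df.groupby(['county', 'state', 'year', 'county_name', 'state_name']).first().reset_index()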

In [None]:
14854 - len(crime_df) 

2313

### Export to csv

In [None]:
crime_df.to_csv('/work/cleaned-csvs/crime.csv')
