# Use of Fatal Force by the US Police

In the United States, the use of deadly force by police is a high-profile and contentious issue. Roughly 1,000 people are shot and killed by US police officers each year. A growing argument holds that the US has a flawed law enforcement system that costs too many innocent civilians their lives. In this project, we analyze one of America's most heated political topics, which encompasses issues ranging from institutional racism to the role of law enforcement personnel in society.

We will use five data sets in this study. Four of them describe the demographics of US cities (city data sets), while the remaining one records the fatal incidents (police data set). The police data set is supplied as separate training and test files.

In [49]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 

## Data Preprocessing

In [50]:
# Forward slashes avoid invalid escape sequences such as '\e' and work on every platform
education = pd.read_csv('data/education.csv', encoding="ISO-8859-1")
income = pd.read_csv('data/income.csv', encoding="ISO-8859-1")
poverty = pd.read_csv('data/poverty.csv', encoding="ISO-8859-1")
race = pd.read_csv('data/share_race_by_city.csv', encoding="ISO-8859-1")
test = pd.read_csv('data/police_killings_test.csv', encoding="ISO-8859-1")
train = pd.read_csv('data/police_killings_train.csv', encoding="ISO-8859-1")

We first inspect the data for missing values so that they can be cleaned.

In [51]:
test.isnull().sum()

id                           0
name                         0
date                         0
manner_of_death              0
armed                        3
age                         40
gender                       0
race                       104
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        38
body_camera                  0
dtype: int64
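
The counts above show gaps only in `armed`, `age`, `race` and `flee`. A minimal sketch of one possible clean-up follows. The toy frame `sample`, its values, and the choice to fill categorical gaps with `'Unknown'` while leaving `age` as NaN are all assumptions made for illustration, not necessarily the strategy used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the police data set (values are made up for illustration)
sample = pd.DataFrame({
    'armed': ['gun', None, 'knife'],
    'age': [34.0, np.nan, 27.0],
    'race': ['W', 'B', None],
    'flee': ['Not fleeing', None, 'Car'],
})

# Assumption: treat a missing categorical value as its own 'Unknown' category
categorical = ['armed', 'race', 'flee']
cleaned = sample.copy()
cleaned[categorical] = cleaned[categorical].fillna('Unknown')

# age is left as NaN so that mean(), median() and similar methods skip it
print(cleaned.isnull().sum())
```

Keeping an explicit `'Unknown'` category preserves the rows for later counting, at the cost of mixing genuinely unknown cases with unrecorded ones.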

We also look for invalid values.

In [52]:
# The education data set uses '-' as a placeholder where a number is expected
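
One way to handle such placeholders is to coerce the column to numeric so that `'-'` becomes NaN. The sketch below uses a toy frame `edu` with made-up rows (the place `Example town` is hypothetical) and assumes the affected column is `percent_completed_hs`.

```python
import pandas as pd

# Toy stand-in for the education data set (rows are made up for illustration)
edu = pd.DataFrame({
    'Geographic Area': ['AL', 'AL', 'AL'],
    'City': ['Abanda CDP', 'Abbeville city', 'Example town'],
    'percent_completed_hs': ['21.2', '69.1', '-'],
})

# errors='coerce' turns anything that cannot be parsed, such as '-', into NaN
edu['percent_completed_hs'] = pd.to_numeric(edu['percent_completed_hs'], errors='coerce')

# Number of placeholder values that were converted to NaN
print(edu['percent_completed_hs'].isnull().sum())
```

After the conversion the column is numeric, so the placeholders can be dropped or imputed like any other missing value.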

The police data set and the city data sets encode place names differently. For example, the former refers to Los Angeles, California as Los Angeles, while the latter use Los Angeles city.

We also observe that the police data set is less granular: it lists only Chicago, while the city data sets distinguish Chicago city, Chicago Heights city and Chicago Ridge village.

Assuming that places with similar names are geographically and demographically close to each other, we distribute the fatal incidents evenly among them.
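
As a worked illustration of this rule, the sketch below splits the 21 incidents recorded under Chicago, IL in the training data across the three places named above. The frame `places` is a toy stand-in that contains only those three rows, and prefix matching on the name is assumed as the similarity test.

```python
import pandas as pd

# The three Illinois places named above; all other columns are omitted
places = pd.DataFrame({
    'Geographic Area': ['IL', 'IL', 'IL'],
    'City': ['Chicago city', 'Chicago Heights city', 'Chicago Ridge village'],
})

incidents = 21  # incidents recorded under 'Chicago', IL in the training data

# A place matches when it is in the same state and its name starts with 'Chicago'
matches = (places['Geographic Area'] == 'IL') & places['City'].str.startswith('Chicago')

# Split the incidents evenly across every matching place
places['Counts'] = 0.0
places.loc[matches, 'Counts'] = incidents / matches.sum()
print(places)
```

Each of the three places receives 7 incidents, so the total of 21 is preserved.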

In [53]:
# Count incidents per (city, state) pair, because the same city name can occur in several states
city_count = (
    train.value_counts(['city', 'state'])
    .rename_axis(['City', 'Geographic Area'])
    .reset_index(name='Counts')
)
city_count.head()

Unnamed: 0,City,Geographic Area,Counts
0,Los Angeles,CA,31
1,Phoenix,AZ,24
2,Houston,TX,22
3,Chicago,IL,21
4,Las Vegas,NV,16


In [54]:
# Join the four city data sets on state and place name
keys = ['Geographic Area', 'City']
merged = (
    education.merge(income, on=keys)
    .merge(poverty, on=keys)
    .merge(race, on=keys)
)
merged.head()

Unnamed: 0,Geographic Area,City,percent_completed_hs,Median Income,poverty_rate,share_white,share_black,share_native_american,share_asian,share_hispanic
0,AL,Abanda CDP,21.2,11207,78.8,67.2,30.2,0.0,0.0,1.6
1,AL,Abbeville city,69.1,25615,29.1,54.4,41.4,0.1,1.0,3.1
2,AL,Adamsville city,78.9,42575,25.5,52.3,44.9,0.5,0.3,2.3
3,AL,Addison town,81.4,37083,30.7,99.1,0.1,0.0,0.1,0.4
4,AL,Akron town,68.6,21667,42.0,13.2,86.5,0.0,0.0,0.3
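
Because `merge` performs an inner join by default, a place missing from any one of the four city data sets is silently dropped. A hedged sketch of how this could be checked follows. The frames `education_demo` and `income_demo` are toy stand-ins whose gaps are contrived for illustration, so they say nothing about the real files.

```python
import pandas as pd

# Toy stand-ins for two of the city data sets (the gaps are contrived)
education_demo = pd.DataFrame({
    'Geographic Area': ['AL', 'AL'],
    'City': ['Abanda CDP', 'Abbeville city'],
    'percent_completed_hs': [21.2, 69.1],
})
income_demo = pd.DataFrame({
    'Geographic Area': ['AL', 'AL'],
    'City': ['Abbeville city', 'Adamsville city'],
    'Median Income': [25615, 42575],
})

# An outer join with indicator=True labels each row 'both', 'left_only' or 'right_only'
check = education_demo.merge(
    income_demo, on=['Geographic Area', 'City'], how='outer', indicator=True
)

# Rows that an inner join would have discarded
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Geographic Area', 'City', '_merge']])
```

Running the same check on the real data sets would show how many places are lost at each join.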


In [None]:
def merge_count(record):
    """Spread one police-data count over every matching place in `merged`."""
    # Match places in the same state whose name starts with the police-data city name
    match_city = merged['City'].str.startswith(record['City'])
    match_state = merged['Geographic Area'] == record['Geographic Area']
    match_both = match_city & match_state
    # Number of matching places
    length = match_both.sum()
    if length > 0:
        # A single match receives the full count; several matches share it evenly.
        # Note: a place matched by more than one record keeps the last value assigned.
        merged.loc[match_both, 'Counts'] = record['Counts'] / length

# Start from a float column, since evenly split counts may be fractional
merged['Counts'] = 0.0
# Plain loop: merge_count is called only for its side effect on `merged`
for _, record in city_count.iterrows():
    merge_count(record)

In [77]:
merged.sort_values(by='Counts', ascending=False).head()

Unnamed: 0,Geographic Area,City,percent_completed_hs,Median Income,poverty_rate,share_white,share_black,share_native_american,share_asian,share_hispanic,Counts
2701,CA,Los Angeles city,75.5,50205,22.1,49.8,9.6,0.7,11.3,48.5,31.0
1198,AZ,Phoenix city,80.7,47326,23.1,65.9,6.5,2.2,3.2,40.8,24.0
25036,TX,Houston city,76.7,46187,22.5,50.5,23.7,0.7,6.0,43.8,22.0
15596,NV,Las Vegas city,83.3,50202,17.5,62.1,11.1,0.7,6.1,31.5,16.0
25744,TX,San Antonio city,81.4,46744,19.8,72.6,6.9,0.9,2.4,63.2,15.0


## Exploratory Data Analysis