# Data Preparation
---
This notebook takes a first look at The Washington Post's fatal police shootings dataset, using the version of the data from January 11, 2019. It also geocodes each city/state pair into latitude/longitude coordinates to make some of the visualizations easier, and creates a few features that are used in those visualizations.

In [1]:
import pandas as pd
import numpy as np
import geopy
from IPython.display import display
import utils

print('Pandas version:', pd.__version__)
print('Numpy version:', np.__version__)
print('Geopy version:', geopy.__version__)

Pandas version: 0.20.3
Numpy version: 1.14.2
Geopy version: 1.14.0


In [2]:
# Load data
data = pd.read_csv('fatal-police-shootings-data.csv')

# Print shape
print(data.shape)
data.info()

(3959, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3959 entries, 0 to 3958
Data columns (total 14 columns):
id                         3959 non-null int64
name                       3959 non-null object
date                       3959 non-null object
manner_of_death            3959 non-null object
armed                      3710 non-null object
age                        3810 non-null float64
gender                     3956 non-null object
race                       3576 non-null object
city                       3959 non-null object
state                      3959 non-null object
signs_of_mental_illness    3959 non-null bool
threat_level               3959 non-null object
flee                       3819 non-null object
body_camera                3959 non-null bool
dtypes: bool(2), float64(1), int64(1), object(10)
memory usage: 379.0+ KB


In [3]:
display(data.head())

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,2015-01-02,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False


In [4]:
data.describe()

Unnamed: 0,id,age
count,3959.0,3810.0
mean,2218.298813,36.839633
std,1240.046747,13.085869
min,3.0,6.0
25%,1145.5,27.0
50%,2207.0,35.0
75%,3293.5,45.0
max,4361.0,91.0


In [5]:
# Print some info about dataset
print('Total number of fatal shootings:', len(data))
print('Types of columns:', ', '.join(data.columns.tolist()))
# Unidentified people are labeled TK TK
print('Number of unidentified people:', len(data.loc[data['name'].str.upper() == 'TK TK']))
print('*'*50)
print('Missing data...')
for column in data.columns:
    print('Number of NaN (missing data) for', column, 'column:', data[column].isnull().sum())

# Convert date column to datetime
# (assign the column directly so its dtype actually becomes datetime64)
data['date'] = pd.to_datetime(data['date'])

print('*'*50)
print('Latest date in data: %d/%d' % (data.loc[data['date'].dt.year == data['date'].dt.year.max()]['date'].dt.month.max(), data['date'].dt.year.max()))
print('5 most common weapons ("armed" column):', ', '.join(data['armed'].value_counts().index.tolist()[:5]))

# Geographic info
print(len(data['city'].value_counts()), 'unique cities')
print(len(data['state'].value_counts()), 'unique states')

Total number of fatal shootings: 3959
Types of columns: id, name, date, manner_of_death, armed, age, gender, race, city, state, signs_of_mental_illness, threat_level, flee, body_camera
Number of unidentified people: 119
**************************************************
Missing data...
Number of NaN (missing data) for id column: 0
Number of NaN (missing data) for name column: 0
Number of NaN (missing data) for date column: 0
Number of NaN (missing data) for manner_of_death column: 0
Number of NaN (missing data) for armed column: 249
Number of NaN (missing data) for age column: 149
Number of NaN (missing data) for gender column: 3
Number of NaN (missing data) for race column: 383
Number of NaN (missing data) for city column: 0
Number of NaN (missing data) for state column: 0
Number of NaN (missing data) for signs_of_mental_illness column: 0
Number of NaN (missing data) for threat_level column: 0
Number of NaN (missing data) for flee column: 140
Number of NaN (missing data) for body_camera column: 0

In [6]:
# Look at first few elements of ID
print(sorted(data['id'])[:5])
# Look at last few elements of ID
print(sorted(data['id'])[-5:])

[3, 4, 5, 8, 9]
[4357, 4358, 4359, 4360, 4361]


The IDs don't start at 1 and some values are skipped, so `id` cannot be used as a row position. The dataframe's own index (`data.index`) runs from 0 to `len(data) - 1`.
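As a small illustration of the gaps, the sketch below uses only the first five IDs printed above and lists the integers that are skipped between them. The variable names are mine and are not part of the notebook.

```python
# First five IDs as printed by the cell above
ids = [3, 4, 5, 8, 9]

# Every integer between the smallest and largest ID that never appears
missing = sorted(set(range(min(ids), max(ids) + 1)) - set(ids))
print(missing)  # [6, 7]

# Row positions a 0-based index assigns to those same five rows
positions = list(range(len(ids)))
print(positions)  # [0, 1, 2, 3, 4]
```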

In [7]:
# Display data for min and max age
display(data[data['age'] == data['age'].min()])
display(data[data['age'] == data['age'].max()])

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
833,980,Jeremy Mardis,2015-11-03,shot,unarmed,6.0,M,W,Marksville,LA,False,other,Car,True
2910,3229,Kameron Prescott,2017-12-21,shot,unarmed,6.0,M,W,Schertz,TX,False,other,Not fleeing,False


Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
2164,2407,Frank W. Wratny,2017-03-08,shot,gun,91.0,M,W,Union Township,PA,False,attack,Not fleeing,False


The youngest people killed were 6 years old (there are two of them in the dataset), and the oldest person was 91 years old.

## Fixing Typos in Data

The entry with id 1536 places a shooting in "Jacksonsville, FL." I believe this is a misspelling of "Jacksonville, FL": this [article](https://www.actionnewsjax.com/news/local/19-year-old-man-armed-with-knives-shot-by-officers/282740202) details the shooting and mentions Jacksonville, FL. The code cell below corrects the city name, along with the other typos described in this section.

Entry id 2858 lists the location as "Frederickstown, WA," but the article [here](https://www.thenewstribune.com/news/local/crime/article203114119.html) detailing the shooting gives the city name as "Frederickson, WA," and there is no Frederickstown in Washington state.

Entry id 3108 says the shooting took place in "Rudioso, NM," but according to this [article](https://www.krqe.com/news/new-mexico/video-ruidoso-officer-involved-shooting/1170040807) it happened in "Ruidoso, NM." I also could not find a city named "Rudioso" in New Mexico, so I think this is a typo.

Entry id 4233 lists the city as "Haynesville, AL," but the shooting happened in "Hayneville, AL." This [article](https://www.alabamanews.net/2018/11/15/update-two-hayneville-police-officers-on-leave-after-shooting/) covers the shooting.

Entry id 4271 spells "Philadelphia" as "Philadephia." I think this [article](https://patch.com/pennsylvania/philadelphia/details-dec-5-fatal-philly-police-shooting-released) is about that shooting; the data lists the name as unknown (TK TK), while the article names the person as Justin Smith.

Entry id 196 places a shooting in "Joilet, IL," but this [article](https://abc7chicago.com/news/suspect-in-joliet-theft-fatally-shot-by-officer/543807/) reports that it happened in "Joliet, IL."

Entry id 4153 places a shooting in "Scarbo, WV," but this [article](https://www.wvnews.com/news/wvnews/man-fatally-shot-during-hostage-situation-in-fayette-county-wv/article_5deedabf-c3d0-5981-8975-c6136221f803.html) and this [article](http://wvmetronews.com/2018/10/04/man-accused-of-hostage-exchange-killed/) both report that it happened in "Scarbro, WV."

Entry id 4292 places a shooting in "Rangley, CO," but this [article](https://www.denverpost.com/2018/12/11/rangely-fatal-officer-involved-shooting/) and this [article](https://kdvr.com/2018/12/11/suspect-killed-in-shooting-with-officers-deputies-in-rangely/) both describe it as happening in "Rangely, CO."

In [8]:
print("Original:", data.loc[data['id'] == 1536, "city"].values[0])
data.loc[(data['city'] == "Jacksonsville") & (data['state'] == "FL"), "city"] = "Jacksonville"
print("After change:", data.loc[data['id'] == 1536, "city"].values[0])

print("Original:", data.loc[data['id'] == 2858, "city"].values[0])
data.loc[(data['city'] == "Frederickstown") & (data['state'] == "WA"), "city"] = "Frederickson"
print("After change:", data.loc[data['id'] == 2858, "city"].values[0])

print("Original:", data.loc[data['id'] == 3108, "city"].values[0])
data.loc[(data['city'] == "Rudioso") & (data['state'] == "NM"), "city"] = "Ruidoso"
print("After change:", data.loc[data['id'] == 3108, "city"].values[0])

print("Original:", data.loc[data['id'] == 4233, "city"].values[0])
data.loc[(data['city'] == "Haynesville") & (data['state'] == "AL"), "city"] = "Hayneville"
print("After change:", data.loc[data['id'] == 4233, "city"].values[0])

print("Original:", data.loc[data['id'] == 4271, "city"].values[0])
data.loc[(data['city'] == "Philadephia") & (data['state'] == "PA"), "city"] = "Philadelphia"
print("After change:", data.loc[data['id'] == 4271, "city"].values[0])

print("Original:", data.loc[data['id'] == 196, "city"].values[0])
data.loc[(data['city'] == "Joilet") & (data['state'] == "IL"), "city"] = "Joliet"
print("After change:", data.loc[data['id'] == 196, "city"].values[0])

print("Original:", data.loc[data['id'] == 4292, "city"].values[0])
data.loc[(data['city'] == "Rangley") & (data['state'] == "CO"), "city"] = "Rangely"
print("After change:", data.loc[data['id'] == 4292, "city"].values[0])

print("Original:", data.loc[data['id'] == 4153, "city"].values[0])
data.loc[(data['city'] == "Scarbo") & (data['state'] == "WV"), "city"] = "Scarbro"
print("After change:", data.loc[data['id'] == 4153, "city"].values[0])

Original: Jacksonsville
After change: Jacksonville
Original: Frederickstown
After change: Frederickson
Original: Rudioso
After change: Ruidoso
Original: Haynesville
After change: Hayneville
Original: Philadephia
After change: Philadelphia
Original: Joilet
After change: Joliet
Original: Rangley
After change: Rangely
Original: Scarbo
After change: Scarbro
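The eight corrections above all follow the same pattern, so they could also be expressed as a lookup table and applied in a loop. The sketch below is one possible way to do that, not the code this notebook ran. The names `CITY_CORRECTIONS` and `fix_city_typos` are hypothetical, and the small `demo` frame is made up to exercise the function.

```python
import pandas as pd

# (misspelled city, state) -> corrected city, taken from the fixes described above
CITY_CORRECTIONS = {
    ("Jacksonsville", "FL"): "Jacksonville",
    ("Frederickstown", "WA"): "Frederickson",
    ("Rudioso", "NM"): "Ruidoso",
    ("Haynesville", "AL"): "Hayneville",
    ("Philadephia", "PA"): "Philadelphia",
    ("Joilet", "IL"): "Joliet",
    ("Rangley", "CO"): "Rangely",
    ("Scarbo", "WV"): "Scarbro",
}


def fix_city_typos(df, corrections=CITY_CORRECTIONS):
    """Return a copy of df with each misspelled city replaced, matching on city and state."""
    fixed = df.copy()
    for (wrong_city, state), right_city in corrections.items():
        mask = (fixed["city"] == wrong_city) & (fixed["state"] == state)
        fixed.loc[mask, "city"] = right_city
    return fixed


# Tiny made-up frame, just to exercise the function
demo = pd.DataFrame({
    "city": ["Jacksonsville", "Joilet", "Wichita"],
    "state": ["FL", "IL", "KS"],
})
fixed_demo = fix_city_typos(demo)
print(fixed_demo["city"].tolist())  # ['Jacksonville', 'Joliet', 'Wichita']
```

Like the cell above, this matches on the city/state pair rather than on `id`, so every row with the same misspelling would be corrected.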


## Geocode Locations
---
In the `PoliceShootingsAnalysis.ipynb` notebook we are going to overlay the areas where shootings occurred on a map. To do this we need latitude and longitude values for each city/state pair. Here we use [geopy](https://geopy.readthedocs.io/en/stable/) to look up those coordinates. Geopy offers several geocoding services (Google, Bing, Nominatim, etc.). I first used Google, but only the first few geocoding requests were free and after that a premium upgrade was required. I switched to Nominatim, but some locations weren't found, so I recorded the city/state pairs that failed and manually looked up their coordinates. The `add_coordinates` function adds these missing coordinates to the dataframe.

The manually collected latitude and longitude values were obtained from this website: https://www.latlong.net. I searched for the city and state as given in the dataset. The one exception was *300 block of State Line Road, TN*, for which the site returned a location in Bedford County, TN, although the actual shooting occurred in Weakley County, TN, about 180 miles away. The site was probably confused by the address-like city value. I found the county information for this particular shooting in articles [here](http://www.wpsdlocal6.com/2017/12/26/investigation-underway-officer-involved-shooting-near-ky-tn-line/) and [here](http://www.kfvs12.com/story/37142089/officer-involved-shooting-under-investigation-in-weakley-county-tn).

In [9]:
print("Original data shape:", data.shape)
data = utils.get_coordinates(data)
print("Updated data shape:", data.shape)

Original data shape: (3959, 14)
Error 300 block of State Line Road, Tennessee
Error Barona Indian Reservation, California
Error Bristol, Virginia
Error Clear Creek Canyon, Colorado
Error Columbua, Indiana
Error Franklin, Tennessee
Error Killeen, Alabama
Error Lower Mount Bethel, Pennsylvania
Error McKinneyville, California
Error Muckleshoot Indian Reservation, Washington
Error North Shore, Hawaii
Error Pinion Hills, California
Error Pueblo of Laguna, New Mexico
Error Roxand Township, Michigan
Error Simpsonsville, Kentucky
Error Standing Rock Reservation, North Dakota
Error Watagua, Texas
Error Watsonsville, California
Error Weeki Wachi, Florida
Updated data shape: (3959, 16)
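The source of `utils.get_coordinates` is not shown in this notebook, so the following is only a sketch of the general pattern: look up each unique city/state pair once and record the pairs that return nothing, much like the `Error ...` lines above. The function `geocode_pairs`, the `FakeLocation` class, and `fake_geocode` are hypothetical. The fake geocoder stands in for a real service so the example runs offline.

```python
def geocode_pairs(pairs, geocode):
    """Look up (lat, lon) for each unique (city, state) pair.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when nothing is found
    (geopy's `Nominatim(...).geocode` behaves this way).
    """
    coords, failed = {}, []
    for city, state in sorted(set(pairs)):
        location = geocode(f"{city}, {state}")
        if location is None:
            failed.append((city, state))
        else:
            coords[(city, state)] = (location.latitude, location.longitude)
    return coords, failed


# Stand-in for a geocoding result so the sketch runs without network access
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    # Only one known location; the values are the Shelton, WA coordinates
    # that appear later in this notebook
    known = {"Shelton, Washington": FakeLocation(47.215094, -123.100707)}
    return known.get(query)


coords, failed = geocode_pairs(
    [("Shelton", "Washington"), ("Columbua", "Indiana")], fake_geocode
)
print(coords)  # {('Shelton', 'Washington'): (47.215094, -123.100707)}
print(failed)  # [('Columbua', 'Indiana')]
```

With geopy installed, the `geocode` method of a `Nominatim` instance (created with a `user_agent` string) could be passed in place of `fake_geocode`. The public Nominatim service expects requests to be spaced out, so a real run would also need a delay between calls.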


In [3]:
# Add missing coordinates to dataframe
utils.add_coordinates(data)

In [4]:
# Saving data
utils.serialize_data(data)

Nominatim sometimes geocodes locations incorrectly: a city in one state can get mixed up with a city of the same name in a different state. I plotted each state on its own map to find the incorrect ones. The code and maps for this are in the `FindIncorrectCoord.ipynb` notebook. The next code cell fixes these incorrect lat/lon coordinates.

For one of the rows the city and state are listed as Big Bear, MO. I found Big Bear to be the name of a nearby resort, but the actual shooting happened on Boo Boo Boulevard in Hollister, MO, as stated in this [article](http://bransontrilakesnews.com/news_free/article_c1b49402-7c0c-11e6-aad2-2f3a4225e09a.html) and [here](https://www.news-leader.com/story/news/crime/2016/09/16/taney-county-shootout-law-enforcement-leaves-1-man-dead-1-deputy-injured/90478972/). I therefore set the lat/lon coordinates to that street in Hollister; otherwise, different geocoding services would give me coordinates for Big Bear, CA.

In [3]:
# Fix incorrectly geocoded coordinates
utils.fix_coordinates(data)
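The incorrect coordinates in this notebook were found by inspecting per-state maps. A cruder automated check, sketched below, flags any point that falls outside a rough bounding box for its state. This is a different technique from the one used here, and `STATE_BOUNDS`, `outside_state`, the box values, and the two sample coordinates are all approximate, illustrative assumptions.

```python
# Rough bounding box per state: (min_lat, max_lat, min_lon, max_lon).
# Approximate, illustrative values only.
STATE_BOUNDS = {
    "MO": (35.9, 40.7, -95.8, -89.0),
}


def outside_state(lat, lon, state, bounds=STATE_BOUNDS):
    """Return True when (lat, lon) falls outside the rough box for `state`."""
    min_lat, max_lat, min_lon, max_lon = bounds[state]
    return not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)


# Approximate coordinates, for illustration only
big_bear_ca = (34.24, -116.91)
hollister_mo = (36.62, -93.22)

print(outside_state(*big_bear_ca, "MO"))   # True: a Big Bear, CA result is not in Missouri
print(outside_state(*hollister_mo, "MO"))  # False
```

A box check like this cannot catch a city confused with another one in the same state, so it would complement the maps rather than replace them.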

## Engineer Features for Plotting
---
Here we create two extra columns to make plotting easier in the next notebook. The first, `num_shootings`, is the total number of shootings that occurred in a given city/state pair. The second, `common_armed`, is the most common `armed` value for that pair. Both are displayed in the geographic plot in the next notebook. If there is a tie between `armed` values, whichever comes first in alphabetical order is chosen. For example, in Brooklyn, NY there were 2 "gun" and 2 "knife" shootings, so "gun" is chosen as the most common.

In [5]:
# Create 2 new columns
# Iterate over all unique (city, state) locations
for (city, state), _ in data.groupby(['city', 'state']):
    # Boolean mask selecting every row for this location
    mask = (data['city'] == city) & (data['state'] == state)
    num = mask.sum()
    # Ties are broken alphabetically: sort_index() orders the labels,
    # then idxmax() returns the first label with the highest count
    common_armed = data.loc[mask, 'armed'].value_counts(dropna=False).sort_index().idxmax()
    data.loc[mask, 'num_shootings'] = num
    data.loc[mask, 'common_armed'] = common_armed
# Assign the column directly so it is stored with an integer dtype
data['num_shootings'] = data['num_shootings'].astype(int)
display(data.head())

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,lat,lon,num_shootings,common_armed
0,3,Tim Elliot,2015-01-02,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False,47.215094,-123.100707,1,gun
1,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False,45.494284,-122.867045,2,gun
2,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False,37.692236,-97.337545,6,gun
3,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False,37.779281,-122.419236,15,gun
4,9,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False,40.37637,-104.692187,2,gun
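The same two columns can also be computed without an explicit loop by using `groupby(...).transform`. The sketch below is an alternative under that assumption, not the code that produced the table above. `most_common_alphabetical` is a hypothetical helper, and the `demo` frame is made up to reproduce the Brooklyn tie described earlier.

```python
import pandas as pd


def most_common_alphabetical(values):
    """Most frequent value in a Series; ties go to the label that sorts first."""
    counts = values.value_counts(dropna=False)
    # sort_index() orders labels alphabetically, idxmax() returns the first maximum
    return counts.sort_index().idxmax()


# Made-up rows: a 2-2 tie in Brooklyn, NY and a single row in Wichita, KS
demo = pd.DataFrame({
    "city": ["Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Wichita"],
    "state": ["NY", "NY", "NY", "NY", "KS"],
    "armed": ["knife", "gun", "gun", "knife", "knife"],
})

grouped = demo.groupby(["city", "state"])["armed"]
num_shootings = grouped.transform("size")                    # rows per location
common_armed = grouped.transform(most_common_alphabetical)   # tie-broken mode per location
demo = demo.assign(num_shootings=num_shootings, common_armed=common_armed)

print(demo["num_shootings"].tolist())  # [4, 4, 4, 4, 1]
print(demo["common_armed"].tolist())   # ['gun', 'gun', 'gun', 'gun', 'knife']
```

`transform` returns a result aligned with the original rows, so no masking is needed and `num_shootings` is an integer column from the start.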


## Serialize Data
---
Serialize the data to a pickle file for easy loading. Also save an updated version of the dataset with the fixed latitude and longitude coordinates.

In [6]:
# Saving data
utils.serialize_data(data)

In [2]:
# Loading data
data = utils.load_data()
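The bodies of `utils.serialize_data` and `utils.load_data` are not shown here. As a rough sketch of what a pickle round trip plus a CSV copy can look like with plain pandas, assuming nothing about the actual file names or layout used by `utils`:

```python
import os
import tempfile

import pandas as pd

# Made-up single-row frame with a datetime column and coordinates
demo = pd.DataFrame({
    "city": ["Shelton"],
    "state": ["WA"],
    "date": pd.to_datetime(["2015-01-02"]),
    "lat": [47.215094],
    "lon": [-123.100707],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "data.pkl")  # hypothetical file name
    csv_path = os.path.join(tmp_dir, "data.csv")     # hypothetical file name

    demo.to_pickle(pickle_path)         # keeps dtypes such as datetime64
    demo.to_csv(csv_path, index=False)  # plain-text copy; dates are written as text

    restored = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

print(restored.equals(demo))  # True
print(from_csv.shape)         # (1, 5)
```

The pickle preserves column dtypes, which is why it is convenient for reloading in the next notebook, while the CSV copy is easier to inspect or share but needs its date column parsed again on load.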