# ENGRD 2700 Class Project

*   Using the Stanford Open Policing Project to reveal racial bias in policing
*   Author: Brian Bobby (btb68)

The Stanford Open Policing Project describes its mission as "collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country... and making that information freely available," (www.openpolicing.stanford.edu). As no such dataset readily exists on a national level, the Stanford Open Policing Project is filling a crucial void in the path to open and public analysis of our police forces in a time when racial biases are increasingly being brought to light and such analysis is critically important.

For my class project, I thought that an interesting and telling way to discover racial biases in policing would be to see if I could create a simple machine learning model that could, given information about any traffic stop, accurately predict the race of the citizen involved. If I succeed in that, can I possibly predict other attributes of the citizen involved? Gender? Age? Does my ability to make these predictions accurately change depending on what location the data comes from?

To start, I will simply choose the city that seems to have the largest and most complete dataset, which is Nashville, Tennessee. The Nashville dataset has a response rate of 70% or more in every field that the Stanford Open Policing Project categorizes data into (it seems to be the only city with this distinction), and it has more than 3 million rows of data (a relatively large amount compared to the rest of the datasets):
![title](ssfromwebsite.png)

<br>
(a box indicates 70% or more response rate in that column)

For these reasons, I will begin my modelling with the Nashville dataset and, if need be, switch to another good candidate. I will first import the dataset and then remove the columns that are unnecessary for my project. There are 42 columns, and I will be removing the following:

*   location, lat, lng, reporting_area, zone (these five can all be more usefully generalized by the column "precinct", which I will be keeping)
*   officer_id_hash (it would be too tedious to analyze on an officer-by-officer basis)
*   all the uncleaned, raw columns included as the last few columns, as these have all been recategorized into more useful columns by the Stanford Open Policing Project
*   a few other columns I just don't need

Additionally, I will be removing all rows that have nonresponses in order to have complete data.

In [1]:
import numpy as np
import pandas as pd

# import downloaded csv file as pandas dataframe
nashville_df = pd.read_csv('tn_nashville_2020_04_01.csv')

# drop unnecessary columns
nashville_df.drop(['raw_row_number','location','lat','lng','reporting_area',
                   'zone','officer_id_hash','type','reason_for_stop',
                   'vehicle_registration_state','notes','raw_verbal_warning_issued',
                   'raw_written_warning_issued','raw_traffic_citation_issued',
                   'raw_misd_state_citation_issued','raw_suspect_ethnicity',
                   'raw_driver_searched','raw_passenger_searched','raw_search_consent',
                   'raw_search_arrest','raw_search_warrant','raw_search_inventory',
                   'raw_search_plain_view'],axis=1,inplace=True)

print('Rows remaining before removing rows with missing values:',
      len(nashville_df))

# drop rows that have missing values
print('Rows remaining after removing rows with missing values:',
      len(nashville_df.dropna()))



Rows remaining before removing rows with missing values: 3092351
Rows remaining after removing rows with missing values: 111364


Obviously, one or more columns have a lot of nonresponses, and I'd rather have much more data to work with. I will check which columns are this incomplete and remove them.

In [2]:
nashville_df.isna().sum()

date                        0
time                     5467
precinct               390222
subject_age               839
subject_race             1850
subject_sex             12822
violation                8020
arrest_made                28
citation_issued           320
outcome                  1935
contraband_found      2964646
contraband_drugs      2964646
contraband_weapons    2964646
frisk_performed            22
search_conducted           39
search_person              43
search_vehicle             41
search_basis          2964646
dtype: int64

It looks like the three contraband columns and the "search_basis" column are each missing in about 2.96 million of the roughly 3.09 million rows, so I will remove them as well.
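The per-column counts above can also be turned into fractions, which makes a cutoff explicit. The following is a minimal sketch, not the method used in this notebook: the helper name `columns_above_missing_threshold`, the 0.5 threshold, and the toy values are all assumptions chosen for illustration.

```python
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.5):
    """Return the missing-value fraction of every column that exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # The mean of a boolean mask is the share of missing entries per column
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > threshold].sort_values(ascending=False)

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "subject_age": [18, 52, None, 26],
    "search_basis": [None, None, None, "consent"],
    "outcome": ["citation", "warning", "warning", "warning"],
})

sparse = columns_above_missing_threshold(toy, threshold=0.5)
print(sparse)
```

On the real dataframe, `nashville_df.drop(columns=sparse.index)` would then remove exactly the columns that were flagged.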

In [3]:
# drop columns
nashville_df.drop(['contraband_found','contraband_drugs','contraband_weapons',
                   'search_basis'],axis=1,inplace=True)

# drop rows with nonresponses
nashville_df.dropna(inplace=True)

# drop rows where 'precinct'='U', as these also signify unknown data
nashville_df.drop(nashville_df[nashville_df['precinct']=='U'].index.tolist(),
                  inplace=True)

# drop rows where the subject race is listed as "unknown"
nashville_df.drop(nashville_df[nashville_df['subject_race']=='unknown']
                  .index.tolist(),inplace=True)

# reset index
nashville_df.reset_index(drop=True,inplace=True)

print('Rows remaining after removing rows with missing values:',
      len(nashville_df))

Rows remaining after removing rows with missing values: 2647717


Now I have over 2.6 million complete rows of data, each representing a single traffic stop. My dataframe now looks something like this:

In [4]:
nashville_df.head()

| | date | time | precinct | subject_age | subject_race | subject_sex | violation | arrest_made | citation_issued | warning_issued | outcome | frisk_performed | search_conducted | search_person | search_vehicle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-10-10 | 10:00:00 | 5 | 18.0 | white | male | moving traffic violation | False | True | False | citation | False | False | False | False |
| 1 | 2010-10-10 | 10:00:00 | 1 | 52.0 | white | male | vehicle equipment violation | False | False | True | warning | False | False | False | False |
| 2 | 2010-10-10 | 22:00:00 | 3 | 25.0 | white | male | registration | False | False | True | warning | False | False | False | False |
| 3 | 2010-10-10 | 01:00:00 | 7 | 26.0 | white | female | moving traffic violation | False | False | True | warning | False | False | False | False |
| 4 | 2010-10-10 | 10:04:00 | 7 | 33.0 | white | male | seatbelt violation | False | False | True | warning | False | False | False | False |


Next, I will turn all of my categorical features into numerical ones using One Hot Encoding. This strategy breaks up a column such as "violation" into several columns, each signifying one specific violation, such as "moving traffic violation", "vehicle equipment violation", or "registration". A row gets a 1 in a new column if its original violation matches that column, and a 0 otherwise. I will be employing this strategy on the columns "subject_sex", "violation", and "outcome". I will turn all the columns containing booleans into numerical columns simply by replacing each "True" with a 1 and each "False" with a 0. After this, all of my columns should be numerical besides "date", "time", and my target column, "subject_race".
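As a small illustration of the encoding just described, here is a sketch on a made-up three-row frame. It uses the `columns=` argument of `pd.get_dummies` to encode several columns at once, which is a more compact alternative and not the exact approach taken in the cell that follows; the toy values are assumptions.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    "subject_sex": ["male", "female", "male"],
    "violation": ["registration", "moving traffic violation", "registration"],
    "arrest_made": [False, True, False],
})

# One new 0/1 column per category; the original categorical columns are dropped
encoded = pd.get_dummies(toy, columns=["subject_sex", "violation"], dtype=int)

# Booleans become 1 (True) and 0 (False)
encoded["arrest_made"] = encoded["arrest_made"].astype(int)

# Replace spaces in the generated column names with underscores
encoded.columns = encoded.columns.str.replace(" ", "_")

print(encoded)
```

Each row ends up with exactly one 1 among the `subject_sex_*` columns and exactly one 1 among the `violation_*` columns.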

In [5]:
# convert boolean columns to numerical (True -> 1, False -> 0)
bool_columns = ['arrest_made','citation_issued','warning_issued',
                'frisk_performed','search_conducted','search_person',
                'search_vehicle']
nashville_df[bool_columns] = nashville_df[bool_columns].astype(int)


# convert categorical columns into numerical using One Hot Encoding

nashville_df['subject_sex']=pd.Categorical(nashville_df['subject_sex'])
dfDummies=pd.get_dummies(nashville_df['subject_sex'],prefix='subject_sex')
nashville_df.insert(6,'subject_sex_male',dfDummies['subject_sex_male'],)
nashville_df.insert(7,'subject_sex_female',dfDummies['subject_sex_female'],)
nashville_df.drop('subject_sex',axis=1,inplace=True)

nashville_df['violation']=pd.Categorical(nashville_df['violation'])
dfDummies=pd.get_dummies(nashville_df['violation'],prefix='violation')
nashville_df.insert(8,'violation_child_restraint',
                    dfDummies['violation_child restraint'],)
nashville_df.insert(9,'violation_investigative_stop',
                    dfDummies['violation_investigative stop'],)
nashville_df.insert(10,'violation_moving_traffic_violation',
                    dfDummies['violation_moving traffic violation'],)
nashville_df.insert(11,'violation_parking_violation',
                    dfDummies['violation_parking violation'],)
nashville_df.insert(12,'violation_registration',
                    dfDummies['violation_registration'],)
nashville_df.insert(13,'violation_safety_violation',
                    dfDummies['violation_safety violation'],)
nashville_df.insert(14,'violation_seatbelt_violation',
                    dfDummies['violation_seatbelt violation'],)
nashville_df.insert(15,'violation_vehicle_equipment_violation',
                    dfDummies['violation_vehicle equipment violation'],)
nashville_df.drop('violation',axis=1,inplace=True)

nashville_df['outcome']=pd.Categorical(nashville_df['outcome'])
dfDummies=pd.get_dummies(nashville_df['outcome'],prefix='outcome')
nashville_df.insert(19,'outcome_arrest',dfDummies['outcome_arrest'],)
nashville_df.insert(20,'outcome_citation',dfDummies['outcome_citation'],)
nashville_df.insert(21,'outcome_warning',dfDummies['outcome_warning'],)
nashville_df.drop('outcome',axis=1,inplace=True)

Apart from "date" and "time", I now have all numerical columns besides my target, "subject_race". The "subject_sex" column was replaced by two columns, one for "male" and one for "female". The "violation" column was replaced by eight separate columns, one for each type of violation. The "outcome" column was separated into three columns for "arrest", "citation", and "warning". My dataframe now looks something like this:

In [6]:
nashville_df.head()

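| | date | time | precinct | subject_age | subject_race | subject_sex_male | subject_sex_female | violation_child_restraint | violation_investigative_stop | violation_moving_traffic_violation | ... | arrest_made | citation_issued | warning_issued | outcome_arrest | outcome_citation | outcome_warning | frisk_performed | search_conducted | search_person | search_vehicle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-10-10 | 10:00:00 | 5 | 18.0 | white | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2010-10-10 | 10:00:00 | 1 | 52.0 | white | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2010-10-10 | 22:00:00 | 3 | 25.0 | white | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2010-10-10 | 01:00:00 | 7 | 26.0 | white | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2010-10-10 | 10:04:00 | 7 | 33.0 | white | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |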


At this point, I am also going to reduce the number of "race" categories to just two: "white" and "person of color". I made this decision mainly to give my model an easier time predicting race — it is a fact that all people of color in this country experience discrimination, and I think it is more realistic to ask my model to discern between prejudice and no prejudice on a basis of race than to ask it to discern between slightly differing levels of prejudice, whatever those may be. For this reason, my model will have to predict whether the subject is white or a person of color, and the accuracy of such predictions should still be extremely telling.

In [7]:
# categorize "subject_race" into "white" and "poc"
# (.loc assigns directly on the dataframe, avoiding chained assignment
# and the SettingWithCopyWarning it triggers)
nashville_df.loc[nashville_df['subject_race']!='white','subject_race']='poc'


It is now time to try a model and see how successful it is. As a side note, I will not be including the columns "date" or "time" in my first model. However, they could be an interesting addition in a later model. I will be implementing a Naive Bayes model, mostly because its name reminded me of this class and I wanted to look into how it relates to the Bayes' Theorem we learned.

According to Wikipedia, the algorithm stems directly from the simple theorem we learned in class that calculates conditional probabilities. The classifier algorithm simply expands the theorem to take in multiple features as the condition, and then chooses the outcome with the highest likelihood for each prediction. Interestingly enough, the specific classifier that I will use, called a Gaussian Naive Bayes' classifier, uses the normal distribution to approximate each numerical feature, making calculations of conditional probabilities much more efficient. 
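To make the description above concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier using only the standard library. It is an illustration of the idea, not scikit-learn's implementation: the function names, the small variance floor of `1e-9`, and the toy data are all assumptions. For each class it multiplies the prior by one normal density per feature (done in log space to avoid underflow) and picks the class with the largest result.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Small constant (an assumption) keeps every variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the largest log prior plus summed log densities."""
    def log_posterior(prior, means, variances):
        # Naive assumption: features are independent given the class,
        # so their log densities simply add up
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood  # proportional to the posterior

    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up data: two numerical features, two classes
rows = [[20.0, 1.0], [22.0, 0.0], [25.0, 1.0],
        [50.0, 0.0], [55.0, 1.0], [60.0, 0.0]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(rows, labels)
print(predict(model, [23.0, 1.0]), predict(model, [58.0, 0.0]))
```

One consequence worth keeping in mind is that the normal approximation is applied to every feature, including 0/1 columns like the one-hot encoded ones, where it is only a rough fit.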

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# establish features and target parameter
X = nashville_df.drop(['date','time','subject_race'], axis=1)
Y = nashville_df['subject_race']

# split data into a training set (80% of the data) and a testing set
# (20% of the data)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = .2,
                                                    random_state = 42)

# create the model
model = GaussianNB()
model.fit(x_train, y_train)
print("Naive Bayes' model training accuracy score:",
      model.score(x_train,y_train))
print("Naive Bayes' model testing accuracy score:",
      model.score(x_test,y_test))

Naive Bayes' model training accuracy score: 0.5583774318717121
Naive Bayes' model testing accuracy score: 0.5588241959119544


This means that the model correctly predicted the subject's race for 56% of the stops it trained on, which made up 80% of the original dataset. Additionally, when faced with entirely new data that it had not encountered before, the testing set consisting of the other 20% of the original dataset, it again scored an accuracy of 56%. Because the two scores are nearly identical, the model does not appear to be overfitted to the training data, and our random split yielded consistent results in both sets.

While astounding, as I will discuss later, I believe these results could be improved slightly. To do so, I will try a few other models, and compare their accuracies. The next model I will try is a Decision Tree Classifier, and to enhance it I will iterate through maximum depths to assign to the model.

In [9]:
from sklearn.tree import DecisionTreeClassifier

# establish features and target parameter
X = nashville_df.drop(['date','time','subject_race'], axis=1)
Y = nashville_df['subject_race']

# split data into a training set (80% of the data) and a testing set
# (20% of the data)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = .2,
                                                    random_state = 42)

# create the model, iterating through various depths
max_score=0
training_score=0
max_score_depth=0
for i in range(1,15):
    model = DecisionTreeClassifier(max_depth=i)
    model.fit(x_train,y_train)
    score=model.score(x_test,y_test)
    trscore=model.score(x_train,y_train)
    if(score>max_score):
        max_score=score
        training_score=trscore
        max_score_depth=i

# max_score holds the best testing accuracy; training_score holds the
# training accuracy of the model at that same depth
print('Decision Tree model training accuracy score: ',training_score)
print('Decision Tree model testing accuracy score: ',max_score)
print('depth = ',max_score_depth)

Decision Tree model training accuracy score:  0.6412861461268744
Decision Tree model testing accuracy score:  0.6385984922877042
depth =  13


This means that the model correctly predicted the subject's race for 64% of the stops it trained on, which made up 80% of the original dataset. Additionally, when faced with entirely new data that it had not encountered before, the testing set consisting of the other 20% of the original dataset, it again scored an accuracy of 64%. Because the two scores are so close, the model does not appear to be overfitted to the training data, and our random split yielded consistent results in both sets. One caveat: the depth was chosen by its score on the testing set, so the testing accuracy is likely slightly optimistic.
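A common way to keep the depth search from touching the testing set is to carve out a separate validation set and choose the depth there. The sketch below shows that pattern on synthetic data from `make_classification`; it is not what was run above, and the 60/20/20 split, the variable names, and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (made-up data)
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# 60% training, 20% validation, 20% testing
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Choose the depth using the validation set only
best_depth, best_val_score = None, -1.0
for depth in range(1, 15):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Refit at the chosen depth and score once on the untouched testing set
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print('depth =', best_depth)
print('validation accuracy:', best_val_score)
print('testing accuracy:', test_score)
```

Because the testing set plays no part in choosing the depth, its score is a fairer estimate of how the model would do on new stops.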

Given two choices, the first model predicts correctly almost 56% of the time. Given the same two choices, the second model predicts correctly 64% of the time. This result is noticeably higher than the 50% expected from a coin flip between two choices, especially considering that the model achieved this success rate across a dataset of more than 2.6 million stops. If police interactions truly had no pattern to them, the best strategy would be just that: guessing! If there really were no difference between traffic stops whose subjects are of different races, then no model, no matter how complex, would be able to accurately discern between them, as there would be no difference to discern. But this model can accurately predict 64% of occurrences, and it was shockingly easy for me to make. There is an obvious pattern, because if there weren't, I wouldn't be able to make accurate predictions about a subject's race! I only wonder what an expert in machine learning could discover further.
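How far 64% sits above guessing depends on how the two classes are split, since a "model" that always predicts the more common class already scores that class's share of the data. The class proportions for Nashville are not reported in this notebook, so the sketch below uses made-up labels; the helper name `majority_class_baseline` is hypothetical.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Made-up labels for illustration only; NOT the Nashville proportions
toy_labels = ["white"] * 6 + ["poc"] * 4

label, baseline = majority_class_baseline(toy_labels)
print(label, baseline)
```

On the real data, calling `majority_class_baseline(y_test)` would give the number that the 56% and 64% accuracies should be compared against.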

Again, this model can accurately predict the race of the subject in 64% of traffic stops given solely situational information about the stop. Let that sink in for a moment. Race should be playing no role in a police interaction, and instead, it's playing enough of a role that significant, *measurable* patterns are being picked up in the data on an incredibly large scale: the scale of an entire large American city, across more than 2.6 million traffic stops over a number of years. This speaks volumes as to the current state of our country. Racism reaches so much farther than just the cases that make the news. It is prevalent at a systemic level, and the data proves it.

<br>
<br>
<br>
<br>
<br>
**Sources**
*   https://openpolicing.stanford.edu
*   https://en.wikipedia.org/wiki/Naive_Bayes_classifier