# Defining a bias score that is not biased by county characteristics

Each stop that contributes to an officer's score is weighted by parameters derived from the county where the stop took place.
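
One way to write this down, matching the `score_by_county` function defined later in this notebook, is the following. The symbols are notation introduced here for illustration, not taken from the original analysis: $C_o$ is the set of counties where officer $o$ made stops, $p_{o,c,m}$ is the share of officer $o$'s stops in county $c$ whose subject belongs to group $m$, and $\bar{p}_{c,m}$ is the same share over all stops in county $c$.

$$
\text{score}_{o,m} = \frac{1}{|C_o|} \sum_{c \in C_o} \frac{p_{o,c,m} - \bar{p}_{c,m}}{\bar{p}_{c,m}}
$$

Under this reading, a score of 0 means the officer stops group $m$ at the county rate on average, and a positive score means more often than the county rate.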

We first apply the score to a small dataset, for speed and memory, before running it on the large one.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from copy import copy

from tqdm import tqdm
tqdm.pandas()

import statsmodels.api as sm
import statsmodels.formula.api as smf



In [2]:
folder = '../data/'
state = folder + 'fl_statewide.csv.zip'
state_reduced = folder + 'fl_statewide_reduced.csv.zip'

keep_columns = ['date', 'time', 'county_name', 'subject_age', 'subject_race', 'subject_sex', 'officer_id_hash', 'officer_age', 'officer_race', 'officer_sex', 'officer_years_of_service', 'arrest_made', 'citation_issued', 'warning_issued', 'frisk_performed', 'search_conducted']
mandatory_columns = ['date', 'time', 'subject_age', 'subject_race', 'subject_sex', 'officer_id_hash', 'officer_age', 'officer_race', 'officer_sex', 'officer_years_of_service', 'arrest_made', 'citation_issued', 'warning_issued', 'search_conducted']
minorities = ['white', 'hispanic', 'black']
boolean_columns = ['arrest_made', 'citation_issued', 'warning_issued', 'frisk_performed', 'search_conducted']

Load the data

In [3]:
# Load the reduced dataset
df = pd.read_csv(state_reduced)
print(len(df))

72975


Drop the unused columns and fix the types.

In [4]:
df.drop(columns=df.columns.difference(keep_columns), inplace=True) # drop unused columns
df.dropna(subset=mandatory_columns, how='any', inplace=True) # drop nan values in mandatory columns
df['date'] = pd.to_datetime(df['date']) # to datetime
df['year'] = df['date'].dt.to_period('y')

df = df[df['officer_race'].isin(minorities)]
print(len(df))

28052


Add a per-calendar-year average to gain precision with the large dataset.

In [5]:
def create_year_hash(df):
    df['officer_hash_year'] = df['officer_id_hash'] + '-' + df['year'].astype(str)

In [6]:
create_year_hash(df)

We need to remove officers with too few stops.

In [7]:
(df.groupby('officer_hash_year')['date'].count() > 5).mean()

0.3739992375142966

When each officer is split into one entry per year, most groups become too small: only about 37% of officer-year pairs have more than 5 stops. For now we keep a single entry per officer, regardless of time. (On the larger dataset, the per-year split may become feasible.)
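
As a minimal sketch of why the per-year split loses so many groups, the toy example below compares the share of groups above the threshold under both groupings. The data and the names `toy` and `min_stops` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data: officer "a" has 8 stops spread over two years,
# officer "b" has 3 stops in a single year.
toy = pd.DataFrame({
    'officer_id_hash': ['a'] * 8 + ['b'] * 3,
    'year': ['2015'] * 4 + ['2016'] * 4 + ['2015'] * 3,
})
toy['officer_hash_year'] = toy['officer_id_hash'] + '-' + toy['year']

min_stops = 5  # same threshold as used in this notebook

# Share of groups with more than `min_stops` rows under each grouping
share_by_officer_year = (toy.groupby('officer_hash_year').size() > min_stops).mean()
share_by_officer = (toy.groupby('officer_id_hash').size() > min_stops).mean()

print(share_by_officer_year)  # 0.0: no officer-year pair exceeds 5 stops
print(share_by_officer)       # 0.5: officer "a" qualifies once years are pooled
```

Pooling the years keeps officer "a" in the sample, at the cost of ignoring any change in behavior over time.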

In [8]:
stops_per_officer = df.groupby('officer_id_hash')['year'].count()
officers_to_keep = stops_per_officer[stops_per_officer > 5].index

df = df[df.officer_id_hash.isin(officers_to_keep)]

Flag, for each race in `minorities`, whether the stopped subject belongs to that group.

In [9]:
for minority in minorities:
    df[minority + '_stoped'] = (df['subject_race'] == minority)

In [10]:
county_means = {}
for minority in minorities:
    county_means[minority] = df.groupby('county_name')[minority + '_stoped'].mean()

In [11]:
def score_by_county(officer_df, minority):
    # Share of this officer's stops involving `minority`, per county
    county_stop_proportion = officer_df.groupby('county_name')[minority + '_stoped'].mean()
    baseline = county_means[minority].loc[county_stop_proportion.index]  # county-wide share
    # Mean relative deviation from the county baseline
    return ((county_stop_proportion - baseline) / baseline).mean()
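
To make the score concrete, here is a small worked example on made-up data. It follows the same logic as `score_by_county` for a single group; the frame `toy`, the helper `toy_score`, and the officer and county labels are all hypothetical.

```python
import pandas as pd

# Hypothetical stops: two officers, each with two stops in counties X and Y.
toy = pd.DataFrame({
    'officer_id_hash': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'county_name':     ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
    'subject_race':    ['black', 'black', 'black', 'white',
                        'white', 'white', 'white', 'white'],
})
toy['black_stoped'] = toy['subject_race'] == 'black'

# County baselines: X -> 2 of 4 stops = 0.5, Y -> 1 of 4 stops = 0.25
county_mean = toy.groupby('county_name')['black_stoped'].mean()

def toy_score(officer_df):
    # Per-county share for this officer, compared with the county baseline
    officer_share = officer_df.groupby('county_name')['black_stoped'].mean()
    baseline = county_mean.loc[officer_share.index]
    return ((officer_share - baseline) / baseline).mean()

toy_scores = {officer: toy_score(group)
              for officer, group in toy.groupby('officer_id_hash')}

# Officer a: (1.0 - 0.5) / 0.5 = 1 in X and (0.5 - 0.25) / 0.25 = 1 in Y -> mean 1.0
# Officer b: (0 - 0.5) / 0.5 = -1 in X and (0 - 0.25) / 0.25 = -1 in Y -> mean -1.0
print(toy_scores)
```

Because the deviation is divided by the county baseline, both counties contribute equally to officer "a" even though the raw shares differ. One caveat of this form: a county whose baseline is 0 yields 0/0, which pandas treats as NaN and skips in the mean.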

In [13]:
scores = {}
for minority in minorities:
    scores[minority] = df.groupby('officer_id_hash').apply(score_by_county, minority)

In [14]:
officer_numerics = ['officer_age', 'officer_years_of_service']
officer_cat = ['officer_race', 'officer_sex']

In [15]:
# Create a dataframe with the characteristics of officers.
officer_df = df.groupby('officer_id_hash')[officer_numerics].mean()

officer_df[officer_cat] = (df[['officer_id_hash'] + officer_cat].drop_duplicates()).set_index('officer_id_hash')

# Add the bias score
for minority in minorities:
    officer_df[minority + '_bias'] = scores[minority]

In [16]:
for minority in minorities:
    print()
    print(f'--------------{minority.upper()}--------------')
    res = smf.ols(formula=f'{minority}_bias ~ C(officer_race) + C(officer_sex) + officer_age + officer_years_of_service', data=officer_df).fit()
    print(res.summary())
    print()
    print()


--------------WHITE--------------
                            OLS Regression Results                            
Dep. Variable:             white_bias   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     3.343
Date:                Fri, 18 Dec 2020   Prob (F-statistic):            0.00531
Time:                        01:56:32   Log-Likelihood:                -462.73
No. Observations:                1072   AIC:                             937.5
Df Residuals:                    1066   BIC:                             967.3
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------