# New Orleans State Police records of people stopped in the street

This analysis compares how the two main race groups in the data, white (Caucasian) people and black (African-American) people, are treated when stopped by the police. The aim is to understand the main differences between the two groups and whether, based on these differences, we can detect bias in how the police treat the black community. The following questions guide the analysis.

#### Q1. Do Black individuals get stopped by the police for suspicious activities more often than whites?

##### Q1.2. Given that someone is stopped for a suspicious reason, how often is that person black or white?

#### Q3. Do white people get more warnings for the same violations than black people do?

#### Q4. How much more likely is a black person to be searched than a white person?

#### Q5. Once a black or white person is searched by the police, how likely is it that the person actually had contraband?

#### Q6. For which stop reasons is a black or white person more likely to be searched?

#### Q7. Which district had the most arrests, and what is the racial composition of that district?

#### Q8. What are the main reasons for stops in District 8 that end in arrests?




In [1]:
#importing libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 32)
#pd.set_option('display.max_rows', 1000)
import datetime
import time
from scipy import stats as ss

In [2]:
#Importing the data from New Orleans State patrol 

data = pd.read_csv('../data/la_new_orleans_2019_08_13.csv')



In [3]:
data.head()

Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,zone,subject_age,subject_race,subject_sex,officer_assignment,type,arrest_made,citation_issued,warning_issued,outcome,contraband_found,contraband_drugs,contraband_weapons,frisk_performed,search_conducted,search_person,search_vehicle,search_basis,reason_for_stop,vehicle_color,vehicle_make,vehicle_model,vehicle_year,raw_actions_taken,raw_subject_race
0,1,2010-01-01,01:11:00,,,,6,E,26.0,black,female,6th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLACK,DODGE,CARAVAN,2005.0,,BLACK
1,9087,2010-01-01,01:29:00,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
2,9086,2010-01-01,01:29:00,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
3,267,2010-01-01,14:00:00,,,,7,I,96.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,GRAY,JEEP,GRAND CHEROKEE,2003.0,,BLACK
4,2,2010-01-01,02:06:00,,,,5,D,17.0,black,male,5th District,,False,False,False,,,,,False,False,False,False,,CALL FOR SERVICE,,,,,,BLACK


In [4]:
data.shape

(512092, 32)

### 1. Checking null values

In [5]:
data.isna().sum()

raw_row_number             0
date                       4
time                       0
location               95986
lat                   251684
lng                   251684
district                   0
zone                       0
subject_age            12786
subject_race           11730
subject_sex            11730
officer_assignment       123
type                  149907
arrest_made                0
citation_issued            0
outcome               176487
contraband_found      436301
contraband_drugs      436301
contraband_weapons    436301
frisk_performed            0
search_conducted           0
search_person              0
search_vehicle             0
search_basis          436301
reason_for_stop            0
vehicle_color         239138
vehicle_make          235765
vehicle_model         252982
vehicle_year          240388
raw_actions_taken     122455
raw_subject_race       11730
dtype: int64

In [6]:
#Removing columns where more than 90% of the values are null (keeps columns with at least 10% non-null values)
data = data.dropna(thresh=data.shape[0]*0.1, axis=1)

In [7]:
data.shape

(512092, 32)
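
As a small illustration of how `thresh` behaves in `dropna`, here is a sketch on a made-up frame (the toy data and column names are assumptions, not part of the New Orleans dataset). `thresh` is the minimum number of non-null values a column needs in order to be kept, which is why no column is dropped above: even the sparsest columns are more than 10% populated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 10 rows, columns with different amounts of missing data
toy = pd.DataFrame({
    "full": range(10),                    # 10 non-null values
    "half": [1.0] * 5 + [np.nan] * 5,     # 5 non-null values
    "empty": [np.nan] * 10,               # 0 non-null values
})

# Keep columns with at least 10% non-null values (here: at least 1 value)
min_non_null = int(len(toy) * 0.1)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Only the completely empty column is removed; the half-empty one survives the 10% threshold.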

### 2. Checking data types and making corrections if necessary

In [8]:
data.dtypes

raw_row_number         object
date                   object
time                   object
location               object
lat                   float64
lng                   float64
district               object
zone                   object
subject_age           float64
subject_race           object
subject_sex            object
officer_assignment     object
type                   object
arrest_made              bool
citation_issued          bool
outcome                object
contraband_found       object
contraband_drugs       object
contraband_weapons     object
frisk_performed          bool
search_conducted         bool
search_person            bool
search_vehicle           bool
search_basis           object
reason_for_stop        object
vehicle_color          object
vehicle_make           object
vehicle_model          object
vehicle_year          float64
raw_actions_taken      object
raw_subject_race       object
dtype: object

In [9]:
#Changing date and time column to datetime type
data = data.astype({'date': 'datetime64[ns]', 'time':'datetime64[ns]' })

In [10]:
#Getting full hour from the time column
data['time'] = pd.to_datetime(data['time']).dt.hour
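
A minimal sketch of the same hour extraction on a few sample strings, assuming the `time` values are formatted as `HH:MM:SS` as in the preview above. Passing an explicit `format` avoids relying on pandas guessing how to parse the strings.

```python
import pandas as pd

# Sample time strings in the same HH:MM:SS layout as the `time` column
times = pd.Series(["01:11:00", "14:00:00", "02:06:00"])

# Parse with an explicit format, then keep only the hour of the day
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour
print(hours.tolist())
```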

In [11]:
data.head()

Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,zone,subject_age,subject_race,subject_sex,officer_assignment,type,arrest_made,citation_issued,warning_issued,outcome,contraband_found,contraband_drugs,contraband_weapons,frisk_performed,search_conducted,search_person,search_vehicle,search_basis,reason_for_stop,vehicle_color,vehicle_make,vehicle_model,vehicle_year,raw_actions_taken,raw_subject_race
0,1,2010-01-01,1,,,,6,E,26.0,black,female,6th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLACK,DODGE,CARAVAN,2005.0,,BLACK
1,9087,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
2,9086,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
3,267,2010-01-01,14,,,,7,I,96.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,GRAY,JEEP,GRAND CHEROKEE,2003.0,,BLACK
4,2,2010-01-01,2,,,,5,D,17.0,black,male,5th District,,False,False,False,,,,,False,False,False,False,,CALL FOR SERVICE,,,,,,BLACK


### Q1. Do Black individuals get stopped by the police for suspicious activities more often than whites?


In [12]:
data.subject_race.value_counts(normalize=True).to_frame().apply(lambda x: x*100).applymap("{0:.2f}%".format)

Unnamed: 0,subject_race
black,69.91%
white,25.92%
hispanic,2.70%
asian/pacific islander,0.76%
unknown,0.64%
other,0.07%


In [13]:
(data.reason_for_stop.value_counts(normalize=True).to_frame()
.apply(lambda x: x*100)
.applymap("{0:.2f}%".format).head(10)
)

Unnamed: 0,reason_for_stop
TRAFFIC VIOLATION,57.08%
CALL FOR SERVICE,13.86%
SUSPECT PERSON,12.46%
CRIMINAL VIOLATION,5.88%
OTHER,4.79%
CITIZEN CONTACT,3.10%
SUSPECT VEHICLE,1.18%
FLAGGED DOWN,0.88%
JUVENILE VIOLATION,0.61%
PRESENT AT CRIME SCENE,0.15%


In [14]:
data[data.reason_for_stop =="SUSPECT PERSON"]["subject_race"].value_counts().to_frame()

Unnamed: 0,subject_race
black,48386
white,13966
hispanic,911
asian/pacific islander,134
unknown,93
other,20


#### To get relative proportions instead of absolute counts, presented in a neat way, I would use the following:

In [15]:
data_race = data[data.reason_for_stop =="SUSPECT PERSON"].subject_race.value_counts(normalize = True).to_frame()

In [16]:
data_race["subject_race"] = data_race["subject_race"] * 100

In [17]:
format_dict = {"subject_race":"{:.2f}%", 'reason_for_stop': '{:.2f}%' }

In [18]:
(data_race.style.format(format_dict)
.apply(lambda x: ['background: tomato' if c == x.loc['black'] or c == x.loc['white'] else "" for c in x], axis=0)
)

Unnamed: 0,subject_race
black,76.19%
white,21.99%
hispanic,1.43%
asian/pacific islander,0.21%
unknown,0.15%
other,0.03%


#### Comments:

The data indicate that 76% of the people stopped by the police for suspicious reasons were African-American, against only 22% who were white. It is important to mention that, as seen before, the distribution of races is not balanced, especially between the two races of interest: black people make up ~70% of all stops and white people ~26%. For this reason, it is important to compute the percentage of stops for suspicious reasons within each group separately, to see whether the police are biased toward treating a person as a suspect just because of the color of their skin.
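
The two ways of normalizing can be sketched with `pd.crosstab` on a made-up set of stops (the toy rows are an assumption for illustration, not the real data). `normalize="columns"` answers "who is stopped for this reason?", while `normalize="index"` answers "why are people of this race stopped?".

```python
import pandas as pd

# Hypothetical toy stops, not the New Orleans data
toy = pd.DataFrame({
    "subject_race": ["black"] * 6 + ["white"] * 4,
    "reason_for_stop": (["SUSPECT PERSON"] * 2 + ["TRAFFIC VIOLATION"] * 4
                        + ["SUSPECT PERSON"] * 1 + ["TRAFFIC VIOLATION"] * 3),
})

# Share of each race among stops with a given reason (each column sums to 1)
by_reason = pd.crosstab(toy.subject_race, toy.reason_for_stop, normalize="columns")

# Share of each reason among stops of a given race (each row sums to 1)
by_race = pd.crosstab(toy.subject_race, toy.reason_for_stop, normalize="index")

print(by_reason)
print(by_race)
```

In this toy example, black people are two thirds of the "SUSPECT PERSON" stops simply because they are the larger group; within each race the share is much closer (one third against one quarter).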

#### Q1.2. Given that someone is stopped for a suspicious reason, how often is that person black or white?


#### For a black person

In [19]:
data_suspect_black = data[data.subject_race == 'black'].reason_for_stop.value_counts(normalize=True).to_frame()

In [20]:
data_suspect_black.reason_for_stop = data_suspect_black.reason_for_stop * 100

In [21]:
(data_suspect_black.style.format(format_dict)
.apply(lambda x: ['background: tomato' if c == x.loc['SUSPECT PERSON'] else "" for c in x], axis=0)
)

Unnamed: 0,reason_for_stop
TRAFFIC VIOLATION,56.41%
CALL FOR SERVICE,13.97%
SUSPECT PERSON,13.83%
OTHER,5.11%
CRIMINAL VIOLATION,4.59%
CITIZEN CONTACT,3.21%
SUSPECT VEHICLE,1.21%
JUVENILE VIOLATION,0.81%
FLAGGED DOWN,0.67%
PRESENT AT CRIME SCENE,0.18%


#### For a white person

In [22]:
data_suspect_white = data[data.subject_race == 'white'].reason_for_stop.value_counts(normalize=True).to_frame()

In [23]:
data_suspect_white.reason_for_stop = data_suspect_white.reason_for_stop * 100 

In [24]:
(data_suspect_white.style.format(format_dict)
.apply(lambda x: ['background: tomato' if c == x.loc['SUSPECT PERSON'] else "" for c in x], axis=0)
)

Unnamed: 0,reason_for_stop
TRAFFIC VIOLATION,54.54%
CALL FOR SERVICE,14.43%
SUSPECT PERSON,10.77%
CRIMINAL VIOLATION,10.06%
OTHER,4.32%
CITIZEN CONTACT,3.22%
FLAGGED DOWN,1.46%
SUSPECT VEHICLE,0.90%
JUVENILE VIOLATION,0.20%
PRESENT AT CRIME SCENE,0.10%


#### Comments:

The data indicate that, when we analyze white and black people separately, there seems to be little evidence of bias in whom the police treat as a suspect: ~14% of the black people stopped were stopped for suspicious reasons, compared to ~11% of the white people. Still, it would be a good idea to dig deeper into this subject, as it is one of the main reasons why the black community accuses the police of being racist. The practice of stopping black people just because they are black is called racial "profiling".

#### We can have it all in the same dataframe using groupby

In [25]:
data_group_reasons = data.groupby('subject_race').reason_for_stop.value_counts(normalize=True) \
                                                            .unstack().apply(lambda x: x*100)

In [26]:
data_group_reasons = data_group_reasons.applymap("{0:.3f}%".format)

In [27]:
data_group_reasons = pd.DataFrame(data_group_reasons, index=['black', 'white', 'hispanic',
                                                             'asian/pacific islander', 'other', 'unknown'])

In [28]:
(data_group_reasons.style.apply(lambda x: ['background: tomato' if c == x.loc['black']
                                           or c == x.loc['white'] else "" for c in x] , axis=0)
)

reason_for_stop,CALL FOR SERVICE,CITIZEN CONTACT,CRIMINAL VIOLATION,FLAGGED DOWN,JUVENILE VIOLATION,OTHER,PRESENT AT CRIME SCENE,SUSPECT PERSON,SUSPECT VEHICLE,TRAFFIC VIOLATION
black,13.973%,3.211%,4.594%,0.673%,0.810%,5.109%,0.179%,13.832%,1.206%,56.415%
white,14.432%,3.217%,10.056%,1.457%,0.200%,4.322%,0.103%,10.768%,0.901%,54.545%
hispanic,13.379%,2.113%,5.048%,1.134%,0.126%,3.736%,0.178%,6.753%,0.882%,66.652%
asian/pacific islander,9.966%,1.687%,3.296%,0.422%,0.026%,3.137%,0.026%,3.533%,0.580%,77.327%
other,11.111%,2.047%,4.386%,0.585%,nan%,4.094%,0.292%,5.848%,nan%,71.637%
unknown,8.805%,1.493%,2.209%,0.685%,0.031%,2.862%,0.093%,2.894%,0.653%,80.274%


### Q3. Do white people get more warnings for the same violations than black people do?


In [29]:
data_b_w_only = data[((data.subject_race == 'black') | (data.subject_race == 'white'))]

In [30]:
data_b_w_only.head()

Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,zone,subject_age,subject_race,subject_sex,officer_assignment,type,arrest_made,citation_issued,warning_issued,outcome,contraband_found,contraband_drugs,contraband_weapons,frisk_performed,search_conducted,search_person,search_vehicle,search_basis,reason_for_stop,vehicle_color,vehicle_make,vehicle_model,vehicle_year,raw_actions_taken,raw_subject_race
0,1,2010-01-01,1,,,,6,E,26.0,black,female,6th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLACK,DODGE,CARAVAN,2005.0,,BLACK
1,9087,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
2,9086,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
3,267,2010-01-01,14,,,,7,I,96.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,GRAY,JEEP,GRAND CHEROKEE,2003.0,,BLACK
4,2,2010-01-01,2,,,,5,D,17.0,black,male,5th District,,False,False,False,,,,,False,False,False,False,,CALL FOR SERVICE,,,,,,BLACK


In [31]:
data_reasons_xtab = pd.crosstab(index=data_b_w_only.subject_race, columns=data_b_w_only.warning_issued)
data_reasons_xtab

warning_issued,False,True
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1
black,263384,86435
white,95342,34361


In [32]:
ss.fisher_exact(data_reasons_xtab)

(1.0981996203546454, 2.2920499294281307e-36)

The p-value is about 2.29e-36, so the probability of observing this or an even more imbalanced ratio by chance, if race and warnings were unrelated, is essentially zero. A commonly used significance level is 5%; if we adopt that, we can conclude that the observed imbalance is statistically significant. **It seems that there is a significant association between race and being issued a warning during a police stop.**
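
To make the first number returned by `fisher_exact` less opaque, the sketch below recomputes the sample odds ratio by hand from the counts in the crosstab above. It also runs `scipy.stats.chi2_contingency`, which is swapped in here as a cheaper alternative for counts this large; it is not the test used in this notebook.

```python
import numpy as np
from scipy import stats as ss

# Counts copied from the crosstab above: rows = black, white; columns = False, True
table = np.array([[263384, 86435],
                  [95342, 34361]], dtype=float)

# Sample odds ratio: (a * d) / (b * c) for the 2x2 table [[a, b], [c, d]]
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

# Chi-square test of independence (alternative to the Fisher exact test)
chi2, p_value, dof, expected = ss.chi2_contingency(table)

print(round(odds_ratio, 4), p_value < 0.05)
```

An odds ratio slightly above 1 means the odds of *not* receiving a warning are somewhat higher for black people than for white people.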

#### For traffic violations only

In [33]:
data_traffic_vio = (data[(data.reason_for_stop == 'TRAFFIC VIOLATION')
                        & ((data.subject_race == 'black') | (data.subject_race == 'white'))]
                   )

In [34]:
data_traf_xtab = pd.crosstab(index=data_traffic_vio.subject_race, columns=data_traffic_vio.warning_issued)
data_traf_xtab

warning_issued,False,True
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1
black,128088,69261
white,43975,26771


In [35]:
ss.fisher_exact(data_traf_xtab)

(1.1258445576737157, 8.279524369857724e-39)

Same as before: **black people received fewer warnings than white people for the same violation.**

#### Proportions in general for warnings

In [36]:
data_reasons_xtab['proportions'] = data_reasons_xtab.apply(lambda x: (x.iloc[1] / x.iloc[0]), axis=1)

In [37]:
data_reasons_xtab

warning_issued,False,True,proportions
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
black,263384,86435,0.328171
white,95342,34361,0.360397


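Another way to get the proportions (simpler if I don't want to run a significance test). Note that the `proportions` column above is the ratio of True to False counts (the odds), while the group mean below is the share of all stops that ended in a warning, so the two sets of numbers differ.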

In [38]:
data_b_w_only.groupby('subject_race').warning_issued.mean().to_frame()

Unnamed: 0_level_0,warning_issued
subject_race,Unnamed: 1_level_1
black,0.247085
white,0.264921


Note: the mean of a boolean column is the fraction of rows where the value is True (the count of True values divided by the number of rows).
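
A short sketch of both points, using the warning counts for black subjects from the crosstab above. It shows what the mean of a boolean Series returns and how the True/False ratio (odds) relates to the share of True values; the variable names are made up for this example.

```python
import pandas as pd

# The mean of a boolean Series is the share of True values
small = pd.Series([True, False, False, True])
share_true = small.mean()

# Counts for black subjects from the warning crosstab above
warned, not_warned = 86435, 263384

odds = warned / not_warned                    # ratio True / False
proportion = warned / (warned + not_warned)   # share of all stops
proportion_from_odds = odds / (1 + odds)      # converting odds to a share

print(share_true, round(odds, 6), round(proportion, 6))
```

The odds and the proportion describe the same data; they are simply different scales, related by `p = odds / (1 + odds)`.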

In [39]:
(data_b_w_only.groupby(['reason_for_stop', 'subject_race']).warning_issued.mean().to_frame().unstack()
.sort_values(by=('warning_issued','black'), ascending=False))

Unnamed: 0_level_0,warning_issued,warning_issued
subject_race,black,white
reason_for_stop,Unnamed: 1_level_2,Unnamed: 2_level_2
TRAFFIC VIOLATION,0.350957,0.37841
SUSPECT VEHICLE,0.210002,0.244863
FLAGGED DOWN,0.158811,0.203175
SUSPECT PERSON,0.143389,0.182157
CITIZEN CONTACT,0.116097,0.123682
OTHER,0.10033,0.126472
CALL FOR SERVICE,0.093476,0.105882
PRESENT AT CRIME SCENE,0.089457,0.075188
CRIMINAL VIOLATION,0.073496,0.0884
JUVENILE VIOLATION,0.025768,0.023077


#### Comments: 

There seems to be a systematic difference in police behavior when issuing warnings for the same "violations" to black and to white people.

#### Extra. Displaying both absolute and relative values in the same df


In [40]:
data_warnings_abs = data_b_w_only.warning_issued.value_counts().to_frame()

In [41]:
data_warnings_rel = data_b_w_only.warning_issued.value_counts(normalize=True).to_frame()

In [42]:
data_warnings_comb = pd.concat([data_warnings_rel, data_warnings_abs], axis=1)

In [43]:
data_warnings_comb.columns = ['data_warnings_rel','data_warnings_abs']

In [44]:
data_warnings_comb

Unnamed: 0,data_warnings_rel,data_warnings_abs
False,0.748091,358726
True,0.251909,120796


### Q4. How much more likely is a black person to be searched than a white person?

In [45]:
data_b_w_only.head()

Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,zone,subject_age,subject_race,subject_sex,officer_assignment,type,arrest_made,citation_issued,warning_issued,outcome,contraband_found,contraband_drugs,contraband_weapons,frisk_performed,search_conducted,search_person,search_vehicle,search_basis,reason_for_stop,vehicle_color,vehicle_make,vehicle_model,vehicle_year,raw_actions_taken,raw_subject_race
0,1,2010-01-01,1,,,,6,E,26.0,black,female,6th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLACK,DODGE,CARAVAN,2005.0,,BLACK
1,9087,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
2,9086,2010-01-01,1,,,,7,C,37.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,BLUE,NISSAN,MURANO,2005.0,,BLACK
3,267,2010-01-01,14,,,,7,I,96.0,black,male,7th District,vehicular,False,False,False,,,,,False,False,False,False,,TRAFFIC VIOLATION,GRAY,JEEP,GRAND CHEROKEE,2003.0,,BLACK
4,2,2010-01-01,2,,,,5,D,17.0,black,male,5th District,,False,False,False,,,,,False,False,False,False,,CALL FOR SERVICE,,,,,,BLACK


In [46]:
data_search_xtab = pd.crosstab(index=data_b_w_only.subject_race, columns=data_b_w_only.search_conducted)
data_search_xtab

search_conducted,False,True
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1
black,291879,57940
white,114407,15296


In [47]:
ss.fisher_exact(data_search_xtab)

(0.6735186656650447, 0.0)

There is clearly an association between being white or black and being searched by the police during a stop.

In [48]:
data_b_w_only.groupby('subject_race').search_conducted.mean().to_frame()

Unnamed: 0_level_0,search_conducted
subject_race,Unnamed: 1_level_1
black,0.165629
white,0.117931


#### Comments:

The Fisher test indicates that being black or white is associated with whether the police perform a search, whether for drugs or weapons. Also, ~17% of black people were searched, compared to only ~12% of white people. It is important to mention that all reasons for the stop are included in this analysis, so we cannot be certain that this police behavior is as general as it seems. Maybe the reasons for stopping black people were more serious, so the police had to search more often than for white people's "violations".

Note: I am not implying anything with this statement, just an example of what could be causing the differences between white and black in this manner. 

In Q5.1 I try to understand why black people are searched more often than their white counterparts by calculating the probability that a person, white or black, who is stopped and searched by the police actually has contraband.




### Q5. Once a black or white person is searched by the police, how likely is it that the person actually had contraband?

In [49]:
data_search_true = (data[(data.search_conducted == True) & 
                    ((data.subject_race == 'black') | (data.subject_race == 'white'))])

In [50]:
data_search_true_xtab = pd.crosstab(index=data_search_true.subject_race, columns=data_search_true.contraband_found)

In [51]:
data_search_true_xtab

contraband_found,False,True
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1
black,46502,11438
white,12833,2463


In [52]:
ss.fisher_exact(data_search_true_xtab)

(0.7802930831888363, 3.8481194063687466e-25)

In [53]:
data_search_true_xtab_prop = data_search_true_xtab.copy()

In [55]:
data_search_true_xtab_prop['proportions'] = (data_search_true_xtab_prop.apply(lambda x: 
                                                                              (x.iloc[1] / x.iloc[0]), axis=1))

In [56]:
data_search_true_xtab_prop

contraband_found,False,True,proportions
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
black,46502,11438,0.245968
white,12833,2463,0.191927



#### Q5.1 Calculating the probability of having contraband for black and white people once the police have decided to search the person.

In [57]:
total_rows = list(data_search_true_xtab.sum(axis= 1, skipna = True))
data_search_true_xtab['total'] = total_rows

In [58]:
data_search_true_xtab

contraband_found,False,True,total
subject_race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
black,46502,11438,57940
white,12833,2463,15296


In [59]:
new_index= ['black','white','totals']
data_search_true_xtab = data_search_true_xtab.reindex(new_index)

In [60]:
data_search_true_xtab.loc['totals'] = data_search_true_xtab.select_dtypes(np.number).sum()

In [61]:
#Probability that a search of a black person finds contraband
#(dividing both counts by the grand total cancels out, so it is omitted)

prob_black_contraband = data_search_true_xtab[True]['black'] / data_search_true_xtab['total']['black']

In [62]:
#Probability that a search of a white person finds contraband
prob_white_contraband = data_search_true_xtab[True]['white'] / data_search_true_xtab['total']['white']

In [63]:
#Calculating the probabilities of not having contraband for both cases
prob_black_no_contraband = 1 - prob_black_contraband
prob_white_no_contraband = 1 - prob_white_contraband

In [64]:
prob_contraband = [prob_black_contraband, prob_white_contraband]
prob_no_contraband = [prob_black_no_contraband,prob_white_no_contraband]

In [65]:
prob_df = pd.DataFrame(prob_contraband)

In [66]:
prob_df['1'] = prob_no_contraband

In [67]:
prob_df.columns = ['prob_contraband', 'prob_no_contraband']
prob_df.index = ['black', 'white']

In [68]:
prob_df

Unnamed: 0,prob_contraband,prob_no_contraband
black,0.197411,0.802589
white,0.161022,0.838978


#### Comments:

A Fisher test indicates an association between race and whether contraband is found. I therefore calculated the probability that a person who is searched actually had contraband, as shown above. The police are more likely to find contraband when they search a black person (~20%) than when they search a white person (~16%).
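
The same hit rates can be obtained in one step by dividing the "contraband found" counts by the row totals. The sketch below rebuilds the counts from the crosstab above into a small frame; the column names `no_contraband` and `contraband` are made up for this example.

```python
import pandas as pd

# Search outcomes copied from the crosstab above
counts = pd.DataFrame(
    {"no_contraband": [46502, 12833], "contraband": [11438, 2463]},
    index=pd.Index(["black", "white"], name="subject_race"),
)

# Hit rate: share of searches in which contraband was found
hit_rate = counts["contraband"] / counts.sum(axis=1)
print(hit_rate.round(6))
```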

### Q6. For which stop reasons is a black or white person more likely to be searched?

#### All races together

In [69]:
(data.groupby('reason_for_stop').search_conducted.value_counts(normalize=True)
.unstack().sort_values(by=True,ascending=False).head(10)
)

search_conducted,False,True
reason_for_stop,Unnamed: 1_level_1,Unnamed: 2_level_1
CRIMINAL VIOLATION,0.648484,0.351516
PRESENT AT CRIME SCENE,0.693182,0.306818
CALL FOR SERVICE,0.74144,0.25856
FLAGGED DOWN,0.753507,0.246493
SUSPECT PERSON,0.808833,0.191167
SUSPECT VEHICLE,0.832172,0.167828
OTHER,0.847888,0.152112
JUVENILE VIOLATION,0.859205,0.140795
CITIZEN CONTACT,0.89261,0.10739
TRAFFIC VIOLATION,0.909643,0.090357


#### Now just black people

In [70]:
data_blacks = data[data.subject_race == 'black']

In [71]:
(data_blacks.groupby('reason_for_stop').search_conducted.value_counts(normalize=True)
.unstack().sort_values(by=True,ascending=False).head(10)
)

search_conducted,False,True
reason_for_stop,Unnamed: 1_level_1,Unnamed: 2_level_1
CRIMINAL VIOLATION,0.566557,0.433443
PRESENT AT CRIME SCENE,0.683706,0.316294
FLAGGED DOWN,0.726539,0.273461
CALL FOR SERVICE,0.727183,0.272817
SUSPECT PERSON,0.794982,0.205018
SUSPECT VEHICLE,0.799005,0.200995
OTHER,0.825583,0.174417
JUVENILE VIOLATION,0.85563,0.14437
CITIZEN CONTACT,0.879363,0.120637
TRAFFIC VIOLATION,0.892835,0.107165


#### Now just white people


In [72]:
data_whites = data[data.subject_race == 'white']

In [73]:
(data_whites.groupby('reason_for_stop').search_conducted.value_counts(normalize=True)
.unstack().sort_values(by=True,ascending=False).head(10)
)

search_conducted,False,True
reason_for_stop,Unnamed: 1_level_1,Unnamed: 2_level_1
CRIMINAL VIOLATION,0.749904,0.250096
PRESENT AT CRIME SCENE,0.75188,0.24812
CALL FOR SERVICE,0.766494,0.233506
FLAGGED DOWN,0.778307,0.221693
SUSPECT PERSON,0.851353,0.148647
SUSPECT VEHICLE,0.879281,0.120719
OTHER,0.90421,0.09579
JUVENILE VIOLATION,0.911538,0.088462
CITIZEN CONTACT,0.923058,0.076942
TRAFFIC VIOLATION,0.941862,0.058138


#### Comments:

It is interesting to see that, for the same stop reason, black people get searched more often than their white counterparts. For example, only ~25% of white people stopped for a criminal violation were searched, compared to ~43% of black people. The same holds for ordinary traffic violations, where ~11% of black people were searched compared to only ~6% of white people.
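
One way to summarize the gap is the ratio of the two search rates for each stop reason. The sketch below uses three rows copied from the tables above; the `ratio` column is an addition for illustration and does not appear in the original analysis.

```python
import pandas as pd

# Search rates (share of stops with a search) copied from the tables above
rates = pd.DataFrame(
    {"black": [0.433443, 0.205018, 0.107165],
     "white": [0.250096, 0.148647, 0.058138]},
    index=["CRIMINAL VIOLATION", "SUSPECT PERSON", "TRAFFIC VIOLATION"],
)

# How many times more often black people are searched for the same reason
rates["ratio"] = rates["black"] / rates["white"]
print(rates.round(2))
```

A ratio above 1 for a given reason means black people stopped for that reason were searched more often than white people stopped for the same reason.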

### Q7. Which district had the most arrests, and what is the racial composition of that district?

In [74]:
data_dist = data.copy()

In [75]:
data_dist.district.unique()

array(['6', '7', '5', '8', '3', '2', '4', '1', '1|7', '5|3|3', 6, 5, 3, 4,
       7, 8, 1, 2, '3|2', '6|2'], dtype=object)

In [76]:
data_dist.district = data_dist.district.apply(lambda x: x if len(str(x)) < 2 else np.nan)

In [77]:
data_dist.district = pd.to_numeric(data_dist.district, errors='coerce')
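
A minimal sketch of this cleaning step on a few sample values like those listed by `unique()` above, assuming that multi-district entries such as `'1|7'` should simply be treated as missing. With `errors="coerce"`, `pd.to_numeric` turns anything it cannot parse into NaN, so the string and integer district codes end up in a single numeric column.

```python
import pandas as pd

# Sample of the mixed values seen in the district column
raw = pd.Series(["6", 6, "1|7", "5|3|3", 8], dtype=object)

# Unparseable entries become NaN; the rest become floats
clean = pd.to_numeric(raw, errors="coerce")
print(clean.tolist())
```

The result is a float column because NaN cannot be stored in a plain integer column, which is why the districts show up as `8.0`, `1.0`, and so on in the tables that follow.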

In [78]:
districs_pivot = (pd.pivot_table(data_dist, values=['raw_row_number', 'arrest_made'], 
                index='district',
                aggfunc={'raw_row_number': 'count','arrest_made': np.sum})
                .rename(columns={'raw_row_number':'total_stops'})
                .sort_values(by='total_stops',ascending=False)
                )

In [79]:
districs_pivot.arrest_made = districs_pivot.arrest_made.astype(int)

In [80]:
districs_pivot['proportions'] = round(districs_pivot.arrest_made / districs_pivot.total_stops * 100,2)

In [81]:
districs_pivot['proportions'] = districs_pivot['proportions'].apply("{0:.2f}%".format)

In [82]:
(districs_pivot.sort_values(by='proportions', ascending=False)
.style.apply(lambda x: ['background: tomato' if c == x.max() else "" for c in x] , axis=0))

Unnamed: 0_level_0,arrest_made,total_stops,proportions
district,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8.0,22015,99266,22.18%
1.0,11099,50718,21.88%
4.0,9348,45344,20.62%
2.0,10467,52778,19.83%
6.0,12356,67537,18.30%
7.0,9015,60409,14.92%
3.0,11644,78879,14.76%
5.0,8220,57157,14.38%
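
A caveat about sorting on the formatted column: once the percentages are strings, `sort_values` orders them as text rather than as numbers. It works here only because every value has two digits before the decimal point. The sketch below, on made-up rates, shows how the two orders can diverge.

```python
import pandas as pd

# Hypothetical arrest rates in percent for three made-up districts
rates = pd.Series({"A": 9.5, "B": 22.18, "C": 14.38})

# Formatting first turns the values into strings
as_text = rates.map("{:.2f}%".format)

# Text sort compares character by character, so "9.50%" ranks above "22.18%"
text_order = as_text.sort_values(ascending=False).index.tolist()

# Numeric sort gives the intended ranking
numeric_order = rates.sort_values(ascending=False).index.tolist()

print(text_order, numeric_order)
```

Sorting on the numeric values and formatting only for display avoids the problem.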


#### Checking the race distribution in the top 3 districts by proportion of arrests per stop

In [83]:
(data_dist[data_dist.district == 8.0]
.subject_race.value_counts(normalize=True)
.apply(lambda x: x*100)
.to_frame()
.applymap("{0:.2f}%".format)
.style.apply(lambda x: ['background: tomato' if c == x.max() else "" for c in x] , axis=0)
)

Unnamed: 0,subject_race
black,49.13%
white,46.47%
hispanic,2.62%
unknown,1.02%
asian/pacific islander,0.66%
other,0.10%


In [84]:
(data_dist[data_dist.district == 1.0]
.subject_race.value_counts(normalize=True)
.apply(lambda x: x*100)
.to_frame()
.applymap("{0:.2f}%".format)
.style.apply(lambda x: ['background: tomato' if c == x.max() else "" for c in x] , axis=0)
)

Unnamed: 0,subject_race
black,72.90%
white,21.71%
hispanic,4.26%
unknown,0.64%
asian/pacific islander,0.42%
other,0.06%


In [85]:
(data_dist[data_dist.district == 4.0]
.subject_race.value_counts(normalize=True)
.apply(lambda x: x*100)
.to_frame()
.applymap("{0:.2f}%".format)
.style.apply(lambda x: ['background: tomato' if c == x.max() else "" for c in x] , axis=0)
)

Unnamed: 0,subject_race
black,84.60%
white,12.17%
hispanic,2.02%
asian/pacific islander,0.85%
unknown,0.30%
other,0.06%


#### Comments:

District 8 had the highest number of arrests per person stopped by the police. This might indicate that the district is not very safe, as the police found reasons to arrest 22% of the people they stopped. Also, the race distribution in this district is fairly balanced, so it is a good place to check whether there is a difference between black and white people.

### Q8. What are the main reasons for stops in District 8 that end in arrests?


In [86]:
(data_dist[(data_dist.district == 8) & (data_dist.arrest_made == True)]
.reason_for_stop.value_counts(normalize=True)
.to_frame().sort_values(by='reason_for_stop', ascending=False)
.apply(lambda x: x*100)
.applymap("{0:.2f}%".format)
.style.apply(lambda x: ['background: tomato' if c == x.iloc[0] else "" for c in x] , axis=0))

Unnamed: 0,reason_for_stop
CRIMINAL VIOLATION,42.97%
CALL FOR SERVICE,20.28%
TRAFFIC VIOLATION,13.72%
SUSPECT PERSON,11.32%
FLAGGED DOWN,3.67%
OTHER,3.16%
CITIZEN CONTACT,2.33%
JUVENILE VIOLATION,2.08%
SUSPECT VEHICLE,0.28%
PRESENT AT CRIME SCENE,0.19%


#### Q8.1 Comparing black and white people across all districts and then in District 8 only.

In [87]:
data_b_w_arrest = (data_dist[((data_dist.subject_race == 'black') | (data_dist.subject_race == 'white'))
                                & (data_dist.arrest_made == True)])

data_d8_b_w_arrest = (data_dist[((data_dist.subject_race == 'black') | (data_dist.subject_race == 'white'))
                               & (data_dist.district == 8.0) & (data_dist.arrest_made == True)])

In [88]:
data_b_w_arrest = (data_b_w_arrest.groupby(['subject_race']).reason_for_stop.value_counts(normalize=True)
.unstack().apply(lambda x: x*100).transpose().sort_values(by='black', ascending=False))

data_d8_b_w_arrest = (data_d8_b_w_arrest.groupby(['subject_race']).reason_for_stop.value_counts(normalize=True)
.unstack().apply(lambda x: x*100).transpose().sort_values(by='black', ascending=False))

In [89]:
data_b_w_arrest['differences'] = abs(data_b_w_arrest.black - data_b_w_arrest.white)

data_d8_b_w_arrest['differences'] = abs(data_d8_b_w_arrest.black - data_d8_b_w_arrest.white)

In [90]:
data_b_w_arrest = (data_b_w_arrest.applymap("{0:.2f}".format).astype(float)
                      .sort_values(by='differences',ascending=False))

data_d8_b_w_arrest = (data_d8_b_w_arrest.applymap("{0:.2f}".format).astype(float)
                      .sort_values(by='differences',ascending=False))

In [91]:
data_b_w_arrest = data_b_w_arrest.applymap("{0:.2f}%".format)

data_d8_b_w_arrest = data_d8_b_w_arrest.applymap("{0:.2f}%".format)

In [92]:
data_b_w_arrest_style = (data_b_w_arrest.style.apply(lambda x: ['background: tomato' 
                    if c == x.iloc[0] else "" for c in x] , axis=0))

data_d8_b_w_arrest_style = (data_d8_b_w_arrest.style.apply(lambda x: ['background: tomato' 
                    if c == x.iloc[0] else "" for c in x] , axis=0))

#### Differences in reason for stop between black and white people who were arrested, in all districts and in District 8

In [93]:
# Taking into account all the districts
data_b_w_arrest_style

subject_race,black,white,differences
reason_for_stop,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CRIMINAL VIOLATION,14.02%,28.42%,14.39%
TRAFFIC VIOLATION,36.63%,22.99%,13.64%
CALL FOR SERVICE,24.42%,26.51%,2.09%
OTHER,5.77%,4.04%,1.73%
FLAGGED DOWN,1.06%,2.45%,1.39%
JUVENILE VIOLATION,1.93%,0.63%,1.30%
SUSPECT PERSON,13.07%,12.07%,1.00%
SUSPECT VEHICLE,0.97%,0.56%,0.41%
CITIZEN CONTACT,1.83%,2.15%,0.31%
PRESENT AT CRIME SCENE,0.30%,0.19%,0.10%


In [94]:
#Filtering by district 8 only
data_d8_b_w_arrest_style

subject_race,black,white,differences
reason_for_stop,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
TRAFFIC VIOLATION,18.26%,8.16%,10.10%
CRIMINAL VIOLATION,38.86%,47.96%,9.10%
CALL FOR SERVICE,18.20%,22.46%,4.25%
JUVENILE VIOLATION,3.69%,0.31%,3.38%
SUSPECT PERSON,12.24%,10.58%,1.66%
FLAGGED DOWN,3.08%,4.28%,1.20%
CITIZEN CONTACT,1.90%,2.78%,0.88%
SUSPECT VEHICLE,0.38%,0.18%,0.19%
OTHER,3.21%,3.09%,0.13%
PRESENT AT CRIME SCENE,0.18%,0.21%,0.03%


#### Comments:

In District 8, ~43% of arrests follow a stop for a criminal violation. When we break this down by race, a criminal violation accounts for a larger share of the arrests of white people (~48%) than of black people (~39%), a difference of ~9 percentage points.

Additionally, arrests of black people are more likely to follow a traffic violation than arrests of white people. In our sample, ~18% of the black people arrested in District 8 had been stopped for a traffic violation, compared to only ~8% of the white people arrested. This is an important insight because, as we saw before, the probability of being searched is significantly higher for a black person than for a white person. The probability of finding contraband in a search is also higher for black people, so this may be one of the reasons why this community gets more arrests for the same violation.
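
The tables above give, for each race, the share of arrests that followed each stop reason. A complementary quantity is the arrest rate within each stop reason, and the two can tell different stories. The sketch below contrasts them on a made-up set of stops (the toy rows are an assumption, not the District 8 data).

```python
import pandas as pd

# Hypothetical toy stops: 8 traffic stops with 1 arrest, 2 criminal stops with 2 arrests
toy = pd.DataFrame({
    "reason_for_stop": ["TRAFFIC VIOLATION"] * 8 + ["CRIMINAL VIOLATION"] * 2,
    "arrest_made": [True] + [False] * 7 + [True, True],
})

# Composition: share of all arrests that followed each reason (sums to 1)
composition = toy[toy.arrest_made].reason_for_stop.value_counts(normalize=True)

# Arrest rate: share of stops for each reason that ended in an arrest
arrest_rate = toy.groupby("reason_for_stop").arrest_made.mean()

print(composition)
print(arrest_rate)
```

In the toy data, traffic violations account for a third of the arrests even though only one traffic stop in eight ends in an arrest, because traffic stops are so common.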

In [98]:
data_dist.to_csv('/Users/abreualberto91/IRONHACK/Datasets/new_orleans_police.csv')