# Traffic stops by Police in Los Angeles

#### Los Angeles
We use the Stanford Open Policing Project dataset to examine traffic stops by police officers in <a href ="https://stacks.stanford.edu/file/druid:yg821jf8611/yg821jf8611_ca_los_angeles_2020_04_01.csv.zip">LA</a>. Before analyzing the dataset, we need to prepare the data.


#### Data Processing

In [1]:
import pandas as pd
df = pd.read_csv('ca_los_angeles_2020_04_01.csv')
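
Large CSV files whose columns mix numeric-looking and text values can make pandas guess column types chunk by chunk. The following is a minimal sketch, not the loading code used here, of reading every column as text so that identifiers keep their exact form. The in-memory `sample_csv` and its rows are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; these rows are illustrative, not real data.
sample_csv = io.StringIO(
    "raw_row_number,date,time,district\n"
    "1,2010-01-01,00:05:00,665\n"
    "2|3,2010-01-01,00:10:00,0646\n"
)

# dtype=str reads every column as text, so a zero-padded district code
# keeps its leading zero and pandas does not have to infer mixed types.
sample = pd.read_csv(sample_csv, dtype=str)
print(sample["district"].tolist())
```

On the full file the same idea would be `pd.read_csv('ca_los_angeles_2020_04_01.csv', dtype=str)`, at the cost of converting numeric columns later if they are needed as numbers.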



In [2]:
df.head(2)

Unnamed: 0,raw_row_number,date,time,district,region,subject_race,subject_sex,officer_id_hash,type,raw_descent_description
0,5692933,,13:59:00,753,WILSHIRE,hispanic,male,15ecd81b00,pedestrian,HISPANIC
1,240731,2010-01-01,00:05:00,665,WEST TRAFFIC,other,male,b707de41e0,vehicular,OTHER


Each row represents one traffic stop. We need to locate missing values.

In [3]:
df.isnull()

Unnamed: 0,raw_row_number,date,time,district,region,subject_race,subject_sex,officer_id_hash,type,raw_descent_description
0,False,True,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
5418397,False,False,False,False,False,False,False,False,False,False
5418398,False,False,False,False,False,False,False,False,False,False
5418399,False,False,False,False,False,False,False,False,False,False
5418400,False,False,False,False,False,False,False,False,False,False


Any `True` value marks a missing entry. We can count how many values are missing in each column.

In [4]:
df.isnull().sum()

raw_row_number                0
date                          2
time                          0
district                      0
region                     1911
subject_race                  0
subject_sex                   0
officer_id_hash             161
type                          1
raw_descent_description       0
dtype: int64
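
Raw counts are easier to judge next to the table size. As a sketch on a made-up `toy` frame (not the real data), `isnull().mean()` gives the share of missing values per column. For scale, 1911 missing regions out of 5,418,402 rows is roughly 0.04%.

```python
import pandas as pd

# Illustrative frame with a few missing entries (values are made up).
toy = pd.DataFrame({
    "date": ["2010-01-01", None, "2010-01-02", "2010-01-03"],
    "region": [None, None, "WILSHIRE", "WEST TRAFFIC"],
    "type": ["vehicular", "pedestrian", "vehicular", "vehicular"],
})

# The mean of a boolean mask is the fraction of True values,
# so this is the percentage of missing entries in each column.
missing_share = toy.isnull().mean() * 100
print(missing_share.round(1))
```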

Checking the dataset size

In [24]:
df.shape

(5418402, 10)

The region and officer ID hash columns have the most missing data, so we drop them.

In [7]:
df.drop(['region', 'officer_id_hash'], axis = 1, inplace = True)

In [8]:
df.shape

(5418402, 8)

In [9]:
df.columns

Index(['raw_row_number', 'date', 'time', 'district', 'subject_race',
       'subject_sex', 'type', 'raw_descent_description'],
      dtype='object')

We still have to deal with missing values in the date and type columns. We can drop the rows that contain missing information.

In [10]:
df.dropna(subset =['date','type'],inplace = True)

In [12]:
df.isnull().sum()

raw_row_number             0
date                       0
time                       0
district                   0
subject_race               0
subject_sex                0
type                       0
raw_descent_description    0
dtype: int64
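
To see what `dropna(subset=...)` does, here is a small sketch on a made-up `toy` frame: only rows missing a value in one of the listed columns are removed, and comparing lengths shows how many rows were lost.

```python
import pandas as pd

# Illustrative rows: one is missing a date, one is missing a type.
toy = pd.DataFrame({
    "date": [None, "2010-01-01", "2010-01-01"],
    "type": ["pedestrian", "vehicular", None],
    "district": ["753", "665", "1258"],
})

rows_before = len(toy)
# Keep only rows where both 'date' and 'type' are present.
cleaned = toy.dropna(subset=["date", "type"])
print(f"dropped {rows_before - len(cleaned)} of {rows_before} rows")
```

Returning a new frame instead of using `inplace=True` keeps the original available for this kind of before-and-after check.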

#### Examining the data types

In [14]:
df.dtypes

raw_row_number             object
date                       object
time                       object
district                   object
subject_race               object
subject_sex                object
type                       object
raw_descent_description    object
dtype: object

The good news is that every column has the same dtype, `object` (strings or mixed types). But is that good after all? Looking at the data, we see that date and time are both stored as objects, so we combine them into one column and convert it to datetime format.

In [15]:
# Concatenate 'date' and 'time' (separated by a space)
combined = df.date.str.cat(df.time, sep=' ')

In [18]:
# Convert 'combined' to datetime format
df['date'] = pd.to_datetime(combined)
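
The same two steps can be tried on a small made-up frame. This sketch also passes an explicit `format`, which is an optional addition here: it assumes every row follows `YYYY-MM-DD HH:MM:SS`, as the rows shown above do, and spares pandas from inferring the layout.

```python
import pandas as pd

# Illustrative date and time strings in the same layout as the dataset.
toy = pd.DataFrame({
    "date": ["2010-01-01", "2017-12-31"],
    "time": ["00:05:00", "23:57:00"],
})

# Join the two text columns with a space, then parse them in one pass.
combined = toy["date"].str.cat(toy["time"], sep=" ")
stamps = pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S")
print(stamps.dt.year.tolist())
```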

In [19]:
df.head(2)

Unnamed: 0,raw_row_number,date,time,district,subject_race,subject_sex,type,raw_descent_description
1,240731,2010-01-01 00:05:00,00:05:00,665,other,male,vehicular,OTHER
2,240592|240593,2010-01-01 00:10:00,00:10:00,1258,hispanic,male,pedestrian,HISPANIC


In [21]:
# Examine the data types of the DataFrame
print(df.dtypes)

raw_row_number                     object
date                       datetime64[ns]
time                               object
district                           object
subject_race                       object
subject_sex                        object
type                               object
raw_descent_description            object
dtype: object


In [22]:
# Set the combined 'date' column (now datetime) as the index
df.set_index('date', inplace=True)

In [23]:
df.head()

date,raw_row_number,time,district,subject_race,subject_sex,type,raw_descent_description
2010-01-01 00:05:00,240731,00:05:00,665,other,male,vehicular,OTHER
2010-01-01 00:10:00,240592|240593,00:10:00,1258,hispanic,male,pedestrian,HISPANIC
2010-01-01 00:10:00,241116,00:10:00,1635,hispanic,male,vehicular,HISPANIC
2010-01-01 00:15:00,240681,00:15:00,882,other,male,vehicular,OTHER
2010-01-01 00:20:00,240602,00:20:00,559,hispanic,male,vehicular,HISPANIC
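
A datetime index makes time-based selection and grouping straightforward. The sketch below uses a made-up `toy` frame to show two operations such an index allows: selecting a whole year with a partial date string, and counting stops per year.

```python
import pandas as pd

# Illustrative frame indexed by stop timestamp (values are made up).
toy = pd.DataFrame(
    {"type": ["vehicular", "pedestrian", "vehicular"]},
    index=pd.to_datetime(
        ["2010-01-01 00:05:00", "2010-06-15 12:00:00", "2017-12-31 23:57:00"]
    ),
)
toy.index.name = "date"

# Partial string indexing: every row whose timestamp falls in 2010.
stops_2010 = toy.loc["2010"]

# Group by the year component of the index and count rows.
stops_per_year = toy.groupby(toy.index.year).size()
print(stops_per_year)
```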


In [25]:
df

date,raw_row_number,time,district,subject_race,subject_sex,type,raw_descent_description
2010-01-01 00:05:00,240731,00:05:00,665,other,male,vehicular,OTHER
2010-01-01 00:10:00,240592|240593,00:10:00,1258,hispanic,male,pedestrian,HISPANIC
2010-01-01 00:10:00,241116,00:10:00,1635,hispanic,male,vehicular,HISPANIC
2010-01-01 00:15:00,240681,00:15:00,882,other,male,vehicular,OTHER
2010-01-01 00:20:00,240602,00:20:00,559,hispanic,male,vehicular,HISPANIC
...,...,...,...,...,...,...,...
2017-12-31 23:54:00,6132737,23:54:00,1273,hispanic,male,vehicular,HISPANIC
2017-12-31 23:55:00,6135479,23:55:00,0646,hispanic,male,pedestrian,HISPANIC
2017-12-31 23:55:00,6134247,23:55:00,2134,hispanic,male,vehicular,HISPANIC
2017-12-31 23:57:00,6135064|6135065|6135066|6135067,23:57:00,1806,black,male,pedestrian,BLACK


In [26]:
df.drop('time', axis = 1, inplace = True)

In [28]:
df.head(2)

date,raw_row_number,district,subject_race,subject_sex,type,raw_descent_description
2010-01-01 00:05:00,240731,665,other,male,vehicular,OTHER
2010-01-01 00:10:00,240592|240593,1258,hispanic,male,pedestrian,HISPANIC


#### Stop type by race

In [30]:
print(df.subject_race.value_counts())

hispanic                  2301825
black                     1297883
white                     1275788
other                      337342
asian/pacific islander     205561
Name: subject_race, dtype: int64


In [32]:
white = df[df.subject_race == 'white']
white.shape

(1275788, 6)

In [33]:
white.type.value_counts(normalize=True)

vehicular     0.807409
pedestrian    0.192591
Name: type, dtype: float64

In [34]:
black = df[df.subject_race == 'black']
black.type.value_counts(normalize=True)

vehicular     0.678953
pedestrian    0.321047
Name: type, dtype: float64

In [35]:
hispanic = df[df.subject_race == 'hispanic']
hispanic.type.value_counts(normalize=True)

vehicular     0.752602
pedestrian    0.247398
Name: type, dtype: float64

In [36]:
other = df[df.subject_race == 'other']
other.type.value_counts(normalize=True)

vehicular     0.901681
pedestrian    0.098319
Name: type, dtype: float64

In [37]:
misc = df[df.subject_race == 'asian/pacific islander']
misc.type.value_counts(normalize=True)

vehicular     0.912308
pedestrian    0.087692
Name: type, dtype: float64

Overall results by race (rounded from the proportions above):


| Race | Vehicular | Pedestrian |
| --- | --- | --- |
| Hispanic | 75% | 25% |
| White    | 81% | 19% |
| Black    | 68% | 32% |
| Other    | 90% | 10% |
| Asian/Pacific Islander | 91% | 9% |
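
The per-race filtering can also be done in a single step. As a sketch on a made-up `toy` frame, `pd.crosstab` with `normalize="index"` returns one row per race whose values sum to 1, which is the same quantity as calling `value_counts(normalize=True)` on each subset.

```python
import pandas as pd

# Illustrative stops (values are made up).
toy = pd.DataFrame({
    "subject_race": ["white", "white", "white", "white", "black", "black"],
    "type": ["vehicular", "vehicular", "vehicular", "pedestrian",
             "vehicular", "pedestrian"],
})

# Rows are races, columns are stop types, each row normalized to sum to 1.
shares = pd.crosstab(toy["subject_race"], toy["type"], normalize="index")
print((shares * 100).round(1))
```

On the real data this would be `pd.crosstab(df.subject_race, df.type, normalize="index")`.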

#### Gender based Studies

In [38]:
print(df.subject_sex.value_counts())

male      3861190
female    1557209
Name: subject_sex, dtype: int64


In [39]:
# Create a DataFrame of female subjects (vehicular and pedestrian stops)
female = df[df.subject_sex=='female']

# Create a DataFrame of male subjects (vehicular and pedestrian stops)
male = df[df.subject_sex=='male']

In [40]:
# Compute the stop types for male subjects (as proportions)
print(male.type.value_counts(normalize=True))


vehicular     0.729301
pedestrian    0.270699
Name: type, dtype: float64


In [41]:
# Compute the stop types for female subjects (as proportions)
print(female.type.value_counts(normalize=True))

vehicular     0.847275
pedestrian    0.152725
Name: type, dtype: float64


Overall results by gender (rounded from the proportions above):


| Gender | Vehicular | Pedestrian |
| --- | --- | --- |
| Female | 85% | 15% |
| Male   | 73% | 27% |
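
An equivalent way to get both rows at once is to group by sex before counting. This is a sketch on a made-up `toy` frame, not the cells used above.

```python
import pandas as pd

# Illustrative stops (values are made up).
toy = pd.DataFrame({
    "subject_sex": ["female", "female", "female", "female", "male", "male"],
    "type": ["vehicular", "vehicular", "vehicular", "pedestrian",
             "vehicular", "pedestrian"],
})

# Proportion of each stop type within each sex, reshaped so that
# sexes are rows and stop types are columns.
shares = (
    toy.groupby("subject_sex")["type"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print((shares * 100).round(1))
```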


#### Correlation of Gender and Race

In [43]:
# Create a DataFrame of female
asian_female = df[(df.subject_sex=='female') & (df.subject_race=='asian/pacific islander')]

# Create a DataFrame of male 
asian_male = df[(df.subject_sex=='male') & (df.subject_race=='asian/pacific islander')]

In [44]:
# Compute the stop types for Asian/Pacific Islander female subjects (as proportions)
print(asian_female.type.value_counts(normalize=True))

vehicular     0.922836
pedestrian    0.077164
Name: type, dtype: float64


In [45]:
# Compute the stop types for Asian/Pacific Islander male subjects (as proportions)
print(asian_male.type.value_counts(normalize=True))

vehicular     0.90551
pedestrian    0.09449
Name: type, dtype: float64


In [46]:
# Create a DataFrame of female
hispanic_female = df[(df.subject_sex=='female') & (df.subject_race=='hispanic')]

# Create a DataFrame of male 
hispanic_male = df[(df.subject_sex=='male') & (df.subject_race=='hispanic')]
# Compute the stop types for Hispanic female, then Hispanic male, subjects (as proportions)
print(hispanic_female.type.value_counts(normalize=True))
print(hispanic_male.type.value_counts(normalize=True))

vehicular     0.851749
pedestrian    0.148251
Name: type, dtype: float64
vehicular     0.720875
pedestrian    0.279125
Name: type, dtype: float64


In [47]:
# Create a DataFrame of female
white_female = df[(df.subject_sex=='female') & (df.subject_race=='white')]

# Create a DataFrame of male 
white_male = df[(df.subject_sex=='male') & (df.subject_race=='white')]
# Compute the stop types for white female subjects (as proportions)
print(white_female.type.value_counts(normalize=True))
# Note: this prints all white subjects (`white`), not the male-only `white_male`
print(white.type.value_counts(normalize=True))

vehicular     0.863704
pedestrian    0.136296
Name: type, dtype: float64
vehicular     0.807409
pedestrian    0.192591
Name: type, dtype: float64


In [48]:
# Create a DataFrame of female
black_female = df[(df.subject_sex=='female') & (df.subject_race=='black')]

# Create a DataFrame of male 
black_male = df[(df.subject_sex=='male') & (df.subject_race=='black')]
# Compute the stop types for Black female subjects (as proportions)
print(black_female.type.value_counts(normalize=True))
# Note: this prints all Black subjects (`black`), not the male-only `black_male`
print(black.type.value_counts(normalize=True))

vehicular     0.782769
pedestrian    0.217231
Name: type, dtype: float64
vehicular     0.678953
pedestrian    0.321047
Name: type, dtype: float64


In [49]:
# Create a DataFrame of female
other_female = df[(df.subject_sex=='female') & (df.subject_race=='other')]

# Create a DataFrame of male 
other_male = df[(df.subject_sex=='male') & (df.subject_race=='other')]
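# Compute the stop types for female subjects in the 'other' category (as proportions)
print(other_female.type.value_counts(normalize=True))
# Note: this prints all 'other' subjects (`other`), not the male-only `other_male`
print(other.type.value_counts(normalize=True))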

vehicular     0.924577
pedestrian    0.075423
Name: type, dtype: float64
vehicular     0.901681
pedestrian    0.098319
Name: type, dtype: float64


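Overall results by race and gender (rounded from the proportions above):


| Race | Gender | Vehicular | Pedestrian |
| --- | --- | --- | --- |
| Asian/Pacific Islander | Female | 92% | 8% |
| Asian/Pacific Islander | Male | 91% | 9% |
| White | Female | 86% | 14% |
| White | All | 81% | 19% |
| Black | Female | 78% | 22% |
| Black | All | 68% | 32% |
| Hispanic | Female | 85% | 15% |
| Hispanic | Male | 72% | 28% |
| Other | Female | 92% | 8% |
| Other | All | 90% | 10% |

The rows marked "All" cover both sexes: for White, Black, and Other, the cells above print the proportions for `white`, `black`, and `other` rather than for the male-only subsets, so male-only figures for those groups are not available here.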
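
Building one subset per race and sex by hand is easy to get wrong. As a sketch on a made-up `toy` frame, passing two row keys to `pd.crosstab` computes every race and sex combination in one call, so no subset can be mixed up with another.

```python
import pandas as pd

# Illustrative stops (values are made up).
toy = pd.DataFrame({
    "subject_race": ["white", "white", "white", "white",
                     "black", "black", "black", "black"],
    "subject_sex": ["female", "female", "male", "male",
                    "female", "female", "male", "male"],
    "type": ["vehicular", "vehicular", "vehicular", "pedestrian",
             "vehicular", "pedestrian", "pedestrian", "pedestrian"],
})

# Rows are (race, sex) pairs, columns are stop types,
# and each row is normalized to sum to 1.
shares = pd.crosstab(
    [toy["subject_race"], toy["subject_sex"]],
    toy["type"],
    normalize="index",
)
print((shares * 100).round(1))
```

On the real data this would be `pd.crosstab([df.subject_race, df.subject_sex], df.type, normalize="index")`.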

