<h1><p style="text-align: center;">Data Analysis with Python <br>Project - Traffic Police Stops</p></h1> <img src="https://docs.google.com/uc?id=17CPCwi3_VvzcS87TOsh4_U8eExOhL6Ki" class="img-fluid" alt="CLRSWY" width="200" height="100">

Does the ``gender`` of a driver have an impact on police behavior during a traffic stop? **In this chapter**, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

***

## Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the ``violations`` committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

> Before starting this section, **repeat the data preparation steps from the previous chapter**, then continue from where you left off at the end of that chapter.

In [2]:
import pandas as pd

ri = pd.read_csv("police.csv", nrows=50000)
ri.head(5)

,id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
0,RI-2005-00001,RI,2005-01-02,01:55,Zone K1,,,,600,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K1
1,RI-2005-00002,RI,2005-01-02,20:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,16-30 Min,False,False,Zone X4
2,RI-2005-00003,RI,2005-01-04,11:30,Zone X1,,,,0,,...,False,,,False,,,,,False,Zone X1
3,RI-2005-00004,RI,2005-01-04,12:55,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
4,RI-2005-00005,RI,2005-01-06,01:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4


In [3]:
ri.drop(["county_name", "state"], axis = 1, inplace = True)

In [4]:
ri.dropna(subset = ["driver_gender"], inplace=True)

In [5]:
ri["is_arrested"] = ri["is_arrested"].astype(bool)

In [6]:
ri['combined'] = ri["stop_date"] + " " + ri["stop_time"]

In [7]:
ri["stop_datetime"] = pd.to_datetime(ri["combined"])  

In [8]:
ri.set_index("stop_datetime", inplace = True)

In [9]:
ri.index

DatetimeIndex(['2005-01-02 01:55:00', '2005-01-02 20:30:00',
               '2005-01-04 12:55:00', '2005-01-06 01:30:00',
               '2005-01-12 08:05:00', '2005-01-18 08:15:00',
               '2005-01-18 17:13:00', '2005-01-23 23:15:00',
               '2005-01-24 20:32:00', '2005-02-09 03:05:00',
               ...
               '2006-08-08 22:22:00', '2006-08-08 22:25:00',
               '2006-08-08 22:30:00', '2006-08-08 22:30:00',
               '2006-08-08 22:45:00', '2006-08-08 22:45:00',
               '2006-08-08 22:45:00', '2006-08-08 22:53:00',
               '2006-08-08 23:00:00', '2006-08-08 23:00:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=48010, freq=None)

In [10]:
ri.columns

Index(['id', 'stop_date', 'stop_time', 'location_raw', 'county_fips',
       'fine_grained_location', 'police_department', 'driver_gender',
       'driver_age_raw', 'driver_age', 'driver_race_raw', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type_raw',
       'search_type', 'contraband_found', 'stop_outcome', 'is_arrested',
       'stop_duration', 'out_of_state', 'drugs_related_stop', 'district',
       'combined'],
      dtype='object')

In [11]:
ri.head()

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined
2005-01-02 01:55:00,RI-2005-00001,2005-01-02,01:55,Zone K1,,,600,M,1985.0,20.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K1,2005-01-02 01:55
2005-01-02 20:30:00,RI-2005-00002,2005-01-02,20:30,Zone X4,,,500,M,1987.0,18.0,...,,,False,Citation,False,16-30 Min,False,False,Zone X4,2005-01-02 20:30
2005-01-04 12:55:00,RI-2005-00004,2005-01-04,12:55,Zone X4,,,500,M,1986.0,19.0,...,,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-04 12:55
2005-01-06 01:30:00,RI-2005-00005,2005-01-06,01:30,Zone X4,,,500,M,1978.0,27.0,...,,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-06 01:30
2005-01-12 08:05:00,RI-2005-00006,2005-01-12,08:05,Zone X1,,,0,M,1973.0,32.0,...,,,False,Citation,False,30+ Min,True,False,Zone X1,2005-01-12 08:05


In [12]:
ri["violation"].unique()

array(['Speeding', 'Equipment', 'Other', 'Moving violation',
       'Registration/plates'], dtype=object)

In [13]:
ri["violation"].nunique()

5

In [14]:
ri.groupby('violation').size()

violation
Equipment               3022
Moving violation        6522
Other                    892
Registration/plates     1463
Speeding               36111
dtype: int64

In [15]:
v = ri["violation"].value_counts()
v

Speeding               36111
Moving violation        6522
Equipment               3022
Registration/plates     1463
Other                    892
Name: violation, dtype: int64

In [16]:
for i, j in v.items():
    print(f"{[i]}:  {j / v.sum()}")

['Speeding']:  0.7521558008748177
['Moving violation']:  0.13584669860445742
['Equipment']:  0.06294521974588628
['Registration/plates']:  0.030472818162882734
['Other']:  0.01857946261195584
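
The loop above computes each proportion by hand. As a shorter alternative, `value_counts()` accepts `normalize=True`, which divides every count by the total for you. The sketch below uses a small made-up Series (the values are toy data, not the `police.csv` data), so the numbers it prints will not match the output above.

```python
import pandas as pd

# Toy data standing in for ri["violation"] (hypothetical values).
violation = pd.Series(
    ["Speeding", "Speeding", "Speeding", "Equipment", "Other"],
    name="violation",
)

counts = violation.value_counts()                     # absolute counts
proportions = violation.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(proportions)
```

On the real data, `ri["violation"].value_counts(normalize=True)` should give the same proportions as the loop, as a Series you can keep working with.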


***

## Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

You'll first create a ``DataFrame`` for each gender, and then analyze the ``violations`` in each ``DataFrame`` separately.

In [17]:
ri_female_dg = ri[ri["driver_gender"] == "F"]
ri_female_dg.head(3)

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined
2005-02-24 01:20:00,RI-2005-00016,2005-02-24,01:20,Zone X3,,,200,F,1983.0,22.0,...,,,False,Citation,False,0-15 Min,True,False,Zone X3,2005-02-24 01:20
2005-03-14 10:00:00,RI-2005-00019,2005-03-14,10:00,Zone K3,,,300,F,1984.0,21.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K3,2005-03-14 10:00
2005-03-29 23:20:00,RI-2005-00026,2005-03-29,23:20,Zone K3,,,300,F,1971.0,34.0,...,,,False,Citation,False,0-15 Min,True,False,Zone K3,2005-03-29 23:20


In [18]:
ri_male_dg = ri[ri["driver_gender"] == "M"]
ri_male_dg.head(3)

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined
2005-01-02 01:55:00,RI-2005-00001,2005-01-02,01:55,Zone K1,,,600,M,1985.0,20.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K1,2005-01-02 01:55
2005-01-02 20:30:00,RI-2005-00002,2005-01-02,20:30,Zone X4,,,500,M,1987.0,18.0,...,,,False,Citation,False,16-30 Min,False,False,Zone X4,2005-01-02 20:30
2005-01-04 12:55:00,RI-2005-00004,2005-01-04,12:55,Zone X4,,,500,M,1986.0,19.0,...,,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-04 12:55


In [19]:
fml = ri_female_dg["violation"].value_counts()
fml

Speeding               10796
Moving violation        1318
Equipment                607
Registration/plates      367
Other                    221
Name: violation, dtype: int64

In [20]:
for i, j in fml.items():
    print(f"{[i]}:  {j / fml.sum()}")

['Speeding']:  0.8111804042377339
['Moving violation']:  0.09903073108422872
['Equipment']:  0.045608235028927795
['Registration/plates']:  0.027575324968066722
['Other']:  0.016605304681042904


In [21]:
ml = ri_male_dg["violation"].value_counts()
ml

Speeding               25315
Moving violation        5204
Equipment               2415
Registration/plates     1096
Other                    671
Name: violation, dtype: int64

In [22]:
for i, j in ml.items():
    print(f"{[i]}:  {j / ml.sum()}")

['Speeding']:  0.7295178813290684
['Moving violation']:  0.14996685974467594
['Equipment']:  0.06959453618051353
['Registration/plates']:  0.03158410420448978
['Other']:  0.019336618541252414
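
In the output above, speeding accounts for about 81% of stops of female drivers and about 73% of stops of male drivers. If you would rather see both genders in one table, one possible approach is `pd.crosstab()` with `normalize="index"`. The sketch below runs on a made-up `toy` DataFrame (hypothetical values, only the column names come from the dataset).

```python
import pandas as pd

# Toy data (hypothetical), mimicking the two columns used above.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Other"],
})

# normalize="index" divides each row by its row total,
# so the proportions for each gender sum to 1.
by_gender = pd.crosstab(toy["driver_gender"], toy["violation"], normalize="index")
print(by_gender)
```

Each row of `by_gender` plays the role of one of the per-gender loops above, which makes the two distributions easier to compare at a glance.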


***

## Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two ``DataFrames`` of drivers who were stopped for ``speeding``: one containing ***females*** and the other containing ***males***.

Then, for each **gender**, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a ``"Citation"`` (meaning a ticket) versus a ``"Warning"``.

In [23]:
ri_female_dg.sample(5)

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined
2005-11-28 20:20:00,RI-2005-09586,2005-11-28,20:20,Zone K2,,,900,F,1957.0,48.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K2,2005-11-28 20:20
2005-11-16 18:56:00,RI-2005-07642,2005-11-16,18:56,Zone K1,,,600,F,1955.0,50.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K1,2005-11-16 18:56
2005-11-16 12:57:00,RI-2005-07578,2005-11-16,12:57,Zone K2,,,202,F,1987.0,18.0,...,,,False,Citation,False,0-15 Min,False,False,Zone K2,2005-11-16 12:57
2006-04-02 20:38:00,RI-2006-15138,2006-04-02,20:38,Zone K3,,,300,F,1986.0,20.0,...,,,False,Citation,False,0-15 Min,True,False,Zone K3,2006-04-02 20:38
2005-12-20 23:20:00,RI-2005-12702,2005-12-20,23:20,Zone X4,,,500,F,1985.0,20.0,...,,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-12-20 23:20


In [24]:
female_and_speeding = ri_female_dg[ri_female_dg["violation"] == "Speeding"]

In [25]:
f = female_and_speeding["stop_outcome"].value_counts()
f

Citation            10509
Arrest Driver          80
N/D                    39
Arrest Passenger       25
No Action               3
Name: stop_outcome, dtype: int64

In [26]:
for i, j in f.items():
    print(f"{[i]}:  {j / f.sum()}")

['Citation']:  0.9734160800296406
['Arrest Driver']:  0.007410151908114116
['N/D']:  0.0036124490552056315
['Arrest Passenger']:  0.0023156724712856615
['No Action']:  0.00027788069655427936


In [27]:
male_and_speeding = ri_male_dg[ri_male_dg["violation"] == "Speeding"]

In [28]:
m = male_and_speeding["stop_outcome"].value_counts()
m

Citation            24234
Arrest Driver         664
N/D                    86
Arrest Passenger       51
No Action              10
Name: stop_outcome, dtype: int64

In [29]:
for i, j in m.items():
    print(f"{[i]}:  {j / m.sum()}")

['Citation']:  0.9572980446375666
['Arrest Driver']:  0.02622950819672131
['N/D']:  0.003397195338731977
['Arrest Passenger']:  0.002014615840410824
['No Action']:  0.00039502271380604387
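
The cells above filter in two steps: first by gender, then by violation. The same subset can be selected in one step by combining both conditions with `&`. This is a minimal sketch on a made-up `toy` DataFrame (hypothetical values), not a replacement for the cells above.

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as the dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Speeding"],
    "stop_outcome": ["Citation", "Citation", "Citation",
                     "Citation", "Arrest Driver", "Citation"],
})

# Each condition needs its own parentheses; & is the element-wise "and".
female_and_speeding = toy[(toy["driver_gender"] == "F") & (toy["violation"] == "Speeding")]
male_and_speeding = toy[(toy["driver_gender"] == "M") & (toy["violation"] == "Speeding")]

# Proportion of each outcome within each group.
female_outcomes = female_and_speeding["stop_outcome"].value_counts(normalize=True)
male_outcomes = male_and_speeding["stop_outcome"].value_counts(normalize=True)

print(female_outcomes)
print(male_outcomes)
```

The parentheses matter because `&` binds more tightly than `==` in Python, so leaving them out changes the meaning of the expression.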


***

## Calculating the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops that result in a vehicle search, also known as the **search rate**.

In [30]:
ri["search_conducted"].dtype

dtype('bool')

In [31]:
ri["search_conducted"] = ri["search_conducted"].astype(bool)

In [32]:
ri["search_conducted"].sample(10)

stop_datetime
2006-02-16 13:25:00    False
2006-07-31 14:23:00    False
2006-01-04 00:20:00     True
2005-12-15 20:17:00    False
2006-03-09 08:40:00    False
2005-10-10 22:30:00    False
2005-10-02 06:10:00    False
2006-04-23 03:30:00    False
2006-04-10 15:55:00    False
2005-12-30 07:32:00    False
Name: search_conducted, dtype: bool

In [33]:
a = ri["search_conducted"].value_counts()
a

False    45998
True      2012
Name: search_conducted, dtype: int64

In [34]:
for i, j in a.items():
    print(f"{[i]}:  {j / a.sum()}")

[False]:  0.9580920641533014
[True]:  0.04190793584669861


In [35]:
v = ri["search_conducted"]

In [36]:
sum([i for i in v.values]) / len(v)

0.04190793584669861
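
Because `True` counts as 1 and `False` as 0, the mean of a Boolean Series is the fraction of `True` values. That gives a shorter way to get the search rate than the list comprehension above. A minimal sketch on a made-up Series (hypothetical values):

```python
import pandas as pd

# Toy Boolean Series standing in for ri["search_conducted"] (hypothetical values).
search_conducted = pd.Series([False, False, True, False, True, False, False, False])

n_searches = search_conducted.sum()    # number of True values
search_rate = search_conducted.mean()  # fraction of True values

print(n_searches, search_rate)
```

On the real data, `ri["search_conducted"].mean()` should reproduce the search rate computed above.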

***

## Comparing search rates by gender

You'll compare the rates at which **female** and **male** drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about **4.2%**.

First, you'll filter the ``DataFrame`` by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a ``.groupby()``.

In [37]:
ri_female_dg["search_conducted"]

stop_datetime
2005-02-24 01:20:00    False
2005-03-14 10:00:00    False
2005-03-29 23:20:00    False
2005-06-06 13:20:00    False
2005-06-18 16:30:00    False
                       ...  
2006-08-08 22:20:00    False
2006-08-08 22:25:00    False
2006-08-08 22:45:00    False
2006-08-08 22:45:00    False
2006-08-08 23:00:00    False
Name: search_conducted, Length: 13309, dtype: bool

In [38]:
sf = ri_female_dg["search_conducted"].value_counts()
sf

False    13072
True       237
Name: search_conducted, dtype: int64

In [39]:
for i, j in sf.items():
    print(f"{[i]}:  {j / sf.sum()}")

[False]:  0.9821925013148997
[True]:  0.017807498685100308


In [40]:
sm = ri_male_dg["search_conducted"].value_counts()

In [41]:
for i, j in sm.items():
    print(f"{[i]}:  {j / sm.sum()}")

[False]:  0.9488487363476557
[True]:  0.05115126365234431


In [42]:
ri.groupby("driver_gender")[["search_conducted"]].mean()

driver_gender,search_conducted
F,0.017807
M,0.051151
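
As a sanity check, the overall search rate should be the average of the two group rates, weighted by group size. The sketch below redoes that arithmetic using only counts copied from the outputs above. The male counts are derived by subtraction, which assumes every remaining row is either `F` or `M`.

```python
# Counts copied from the value_counts() outputs above.
female_searched, female_total = 237, 13309
all_searched, all_total = 2012, 48010

# Derived by subtraction (assumes only "F" and "M" remain after dropna).
male_searched = all_searched - female_searched
male_total = all_total - female_total

female_rate = female_searched / female_total
male_rate = male_searched / male_total
overall_rate = all_searched / all_total

# Group rates weighted by group size recover the overall rate.
weighted = (female_total * female_rate + male_total * male_rate) / all_total

print(female_rate, male_rate, overall_rate, weighted)
```

Because male drivers make up most of the stops, the overall rate sits much closer to the male rate than to the female rate.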


***

## Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

In [43]:
ri.groupby(["driver_gender","violation"])[["search_conducted"]].mean()

driver_gender,violation,search_conducted
F,Equipment,0.079077
F,Moving violation,0.0478
F,Other,0.045249
F,Registration/plates,0.114441
F,Speeding,0.006854
M,Equipment,0.123395
M,Moving violation,0.088778
M,Other,0.154993
M,Registration/plates,0.171533
M,Speeding,0.02856


In [44]:
ri.groupby(["violation", "driver_gender"])[["search_conducted"]].mean()

violation,driver_gender,search_conducted
Equipment,F,0.079077
Equipment,M,0.123395
Moving violation,F,0.0478
Moving violation,M,0.088778
Other,F,0.045249
Other,M,0.154993
Registration/plates,F,0.114441
Registration/plates,M,0.171533
Speeding,F,0.006854
Speeding,M,0.02856
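
In the output above, the male search rate is higher than the female rate for every violation type, so the difference in violations does not explain the gap on its own. To make that comparison easier to read, one option is to pivot the gender level into columns with `.unstack()`. A minimal sketch on a made-up `toy` DataFrame (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as the dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding",
                  "Equipment", "Equipment", "Equipment", "Equipment"],
    "search_conducted": [False, False, True, False, True, False, True, True],
})

# Search rate for every (violation, gender) combination.
rates = toy.groupby(["violation", "driver_gender"])["search_conducted"].mean()

# Move the gender level into columns so F and M sit side by side.
side_by_side = rates.unstack("driver_gender")
print(side_by_side)

# True if the male rate exceeds the female rate for every violation.
print((side_by_side["M"] > side_by_side["F"]).all())
```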


***

## Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a ``"protective frisk."``

You'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

In [46]:
ri["search_type"].value_counts()

Incident to Arrest                                          958
Probable Cause                                              244
Protective Frisk                                            204
Inventory                                                   117
Incident to Arrest,Inventory                                116
Incident to Arrest,Probable Cause                            76
Incident to Arrest,Protective Frisk                          63
Reasonable Suspicion                                         43
Probable Cause,Protective Frisk                              36
Incident to Arrest,Inventory,Protective Frisk                33
Inventory,Protective Frisk                                   23
Incident to Arrest,Probable Cause,Protective Frisk           20
Incident to Arrest,Inventory,Probable Cause                  19
Protective Frisk,Reasonable Suspicion                        16
Inventory,Probable Cause                                     16
Probable Cause,Reasonable Suspicion     

In [47]:
len(ri[ri["search_type"] == "Protective Frisk"])

204

In [48]:
ri['frisk'] = ri["search_type"].str.contains('Protective Frisk', na = False)

In [49]:
ri["frisk"].value_counts()

False    47607
True       403
Name: frisk, dtype: int64

In [50]:
sum(ri["frisk"])

403
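
The exact comparison finds 204 rows, while `.str.contains()` finds 403, because the latter also matches rows where "Protective Frisk" is one of several comma-separated search types. The `na=False` argument makes rows with a missing `search_type` count as `False` rather than as missing. The sketch below shows both effects on a made-up Series (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical): one exact match, one combined type,
# one unrelated type, and one missing value.
search_type = pd.Series(
    ["Protective Frisk", "Inventory,Protective Frisk", "Probable Cause", None],
    dtype="object",
)

# Exact comparison: only rows where the whole string matches.
exact = search_type == "Protective Frisk"

# Substring match: also catches combined search types;
# na=False turns missing values into False.
frisk = search_type.str.contains("Protective Frisk", na=False)

print(exact.sum(), frisk.sum())
```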

***

## Comparing frisk rates by gender

You'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the ``DataFrame`` to only include the relevant subset of data, namely stops in which a search was conducted.

In [51]:
searched = ri[ri["search_conducted"] == True]

In [52]:
searched["frisk"].mean()

0.20029821073558648

In [53]:
searched.groupby("driver_gender")[["frisk"]].mean()

driver_gender,frisk
F,0.164557
M,0.20507
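
To see why the filtering step matters, compare the frisk rate over all stops with the frisk rate over searched stops only. A frisk can only happen during a search, so including unsearched stops dilutes the rate. A minimal sketch on a made-up `toy` DataFrame (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as above.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "search_conducted": [True, False, False, True, True, False, False, False],
    "frisk": [True, False, False, True, False, False, False, False],
})

# Frisk rate over all stops: diluted by stops with no search.
rate_all_stops = toy.groupby("driver_gender")["frisk"].mean()

# Frisk rate over searched stops only: the relevant subset.
searched = toy[toy["search_conducted"]]
rate_searched = searched.groupby("driver_gender")["frisk"].mean()

print(rate_all_stops)
print(rate_searched)
```

Indexing with the Boolean column directly, as in `toy[toy["search_conducted"]]`, selects the same rows as comparing it to `True`.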
