___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<h1><p style="text-align: center;">Data Analysis with Python <br>Project - 1</p></h1> - Traffic Police Stops <img src="https://docs.google.com/uc?id=17CPCwi3_VvzcS87TOsh4_U8eExOhL6Ki" class="img-fluid" alt="CLRSWY" width="200" height="100"> 

Does the ``gender`` of a driver have an impact on police behavior during a traffic stop? **In this chapter**, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

***

## Examining traffic violations

Before comparing the violations being committed by each gender, you should examine the ``violations`` committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

> Before starting this section, **repeat the data preparation steps from the previous chapter**, then continue from where that chapter left off.

In [3]:
# Reading the data
import pandas as pd
ri = pd.read_csv('police.csv')
ri.head()



Unnamed: 0,id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
0,RI-2005-00001,RI,2005-01-02,01:55,Zone K1,,,,600,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K1
1,RI-2005-00002,RI,2005-01-02,20:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,16-30 Min,False,False,Zone X4
2,RI-2005-00003,RI,2005-01-04,11:30,Zone X1,,,,0,,...,False,,,False,,,,,False,Zone X1
3,RI-2005-00004,RI,2005-01-04,12:55,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
4,RI-2005-00005,RI,2005-01-06,01:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4


In [4]:
# Drop columns that contain many NaN values
ri.drop(['county_name', 'county_fips','search_type_raw', 'search_type', 'state', 'fine_grained_location'], axis=1, inplace=True)
ri.shape

(509681, 20)

In [5]:
# Drop rows that have missing values
ri.dropna(axis=0, inplace=True)
ri.shape

(478238, 20)

In [6]:
# Convert "is_arrested" to bool; astype returns a new object, so assign the result back
ri['is_arrested'] = ri['is_arrested'].astype('bool')
ri.dtypes

id                     object
stop_date              object
stop_time              object
location_raw           object
police_department      object
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race_raw        object
driver_race            object
violation_raw          object
violation              object
search_conducted       object
contraband_found         bool
stop_outcome           object
is_arrested              bool
stop_duration          object
out_of_state           object
drugs_related_stop       bool
district               object
dtype: object
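
A conversion with `astype` only persists if its result is assigned back, because the method returns a new object. The sketch below illustrates this on a small stand-in frame; `toy` and its values are hypothetical and not taken from `police.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real data: a column stored as generic objects
toy = pd.DataFrame({'is_arrested': [True, False, True]}, dtype='object')

# astype returns a converted copy; the original frame is left unchanged
converted = toy.astype({'is_arrested': 'bool'})
dtype_before = toy['is_arrested'].dtype
dtype_after = converted['is_arrested'].dtype

# Assigning the converted column back makes the change persist
toy['is_arrested'] = toy['is_arrested'].astype('bool')

print(dtype_before, dtype_after, toy['is_arrested'].dtype)
```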

In [7]:
# Creating datetime feature from "stop_date" and "stop_time" columns
ri['stop_datetime'] = pd.to_datetime(ri['stop_date'] + ' ' + ri['stop_time'])

In [8]:
# Drop the original date and time columns, now combined in "stop_datetime"
ri.drop(['stop_date', 'stop_time'], axis=1, inplace=True)
ri.shape

(478238, 19)

**INSTRUCTIONS**

*   Count the unique values in the ``violation`` column, to see what violations are being committed by all drivers.
*   Express the violation counts as proportions of the total.

In [9]:
ri['violation'].value_counts()

Speeding               267931
Moving violation        89738
Equipment               61068
Other                   23431
Registration/plates     19780
Seat belt               16290
Name: violation, dtype: int64

In [10]:
ri['violation'].value_counts(normalize=True) * 100

Speeding               56.024615
Moving violation       18.764297
Equipment              12.769374
Other                   4.899443
Registration/plates     4.136016
Seat belt               3.406254
Name: violation, dtype: float64

***

## Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

You'll first create a ``DataFrame`` for each gender, and then analyze the ``violations`` in each ``DataFrame`` separately.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``female``, that only contains rows in which ``driver_gender`` is ``'F'``.
*   Create a ``DataFrame``, ``male``, that only contains rows in which ``driver_gender`` is ``'M'``.
*   Count the ``violations`` committed by female drivers and express them as proportions.
*   Count the violations committed by male drivers and express them as proportions.

In [11]:
# Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.
female = ri[ri['driver_gender'] == 'F']
female.head(3)

Unnamed: 0,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,stop_datetime
15,RI-2005-00016,Zone X3,200,F,1983.0,22.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,True,False,Zone X3,2005-02-24 01:20:00
18,RI-2005-00019,Zone K3,300,F,1984.0,21.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,False,False,Zone K3,2005-03-14 10:00:00
25,RI-2005-00026,Zone K3,300,F,1971.0,34.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,True,False,Zone K3,2005-03-29 23:20:00


In [12]:
# Create a DataFrame, male, that only contains rows in which driver_gender is 'M'.
male = ri[ri['driver_gender'] == 'M']
male.head(3)

Unnamed: 0,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,stop_datetime
0,RI-2005-00001,Zone K1,600,M,1985.0,20.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,False,False,Zone K1,2005-01-02 01:55:00
1,RI-2005-00002,Zone X4,500,M,1987.0,18.0,W,White,Speeding,Speeding,False,False,Citation,False,16-30 Min,False,False,Zone X4,2005-01-02 20:30:00
3,RI-2005-00004,Zone X4,500,M,1986.0,19.0,W,White,Equipment/Inspection Violation,Equipment,False,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-04 12:55:00


In [13]:
# Count the violations committed by female drivers and express them as proportions.
ri[ri['driver_gender']=='F']['violation'].value_counts(normalize=True) * 100

Speeding               65.767482
Moving violation       13.654591
Equipment              10.715215
Registration/plates     4.314554
Other                   2.838361
Seat belt               2.709797
Name: violation, dtype: float64

In [14]:
# Count the violations committed by male drivers and express them as proportions.
ri[ri['driver_gender']=='M']['violation'].value_counts(normalize=True) * 100

Speeding               52.361579
Moving violation       20.685399
Equipment              13.541679
Other                   5.674351
Registration/plates     4.068891
Seat belt               3.668101
Name: violation, dtype: float64

In [15]:
## Analysis
### Speeding makes up a larger share of violations for female drivers (about 65.8%) than for male drivers (about 52.4%), but the overall mix of violation types is broadly similar for both genders.

In [16]:
#import numpy as np
#ri.groupby(['driver_gender', 'violation']).sum().transform(lambda x: x/np.sum(x)*100)
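
The per-gender proportions can also be computed in a single step. The sketch below uses `pd.crosstab` with `normalize='index'` as one possible approach; it runs on a small hypothetical frame named `toy`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the driver_gender and violation columns
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Other'],
})

# normalize='index' divides each row by its own total,
# so every gender's percentages sum to 100
violation_pct = pd.crosstab(toy['driver_gender'], toy['violation'],
                            normalize='index') * 100
print(violation_pct)
```

Applied to the real data, the same call would take `ri['driver_gender']` and `ri['violation']` as its two arguments.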

***

## Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two ``DataFrames`` of drivers who were stopped for ``speeding``: one containing ***females*** and the other containing ***males***.

Then, for each **gender**, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a ``"Citation"`` (meaning a ticket) versus a ``"Warning"``.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``female_and_speeding``, that only includes female drivers who were stopped for speeding.
*   Create a ``DataFrame``, ``male_and_speeding``, that only includes male drivers who were stopped for speeding.
*   Count the **stop outcomes** for the female drivers and express them as proportions.
*   Count the **stop outcomes** for the male drivers and express them as proportions.

In [18]:
# female drivers who were stopped for speeding
df = ri[ri['driver_gender']=='F'] 
df[df.violation== "Speeding"].head(3)

Unnamed: 0,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,stop_datetime
15,RI-2005-00016,Zone X3,200,F,1983.0,22.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,True,False,Zone X3,2005-02-24 01:20:00
18,RI-2005-00019,Zone K3,300,F,1984.0,21.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,False,False,Zone K3,2005-03-14 10:00:00
25,RI-2005-00026,Zone K3,300,F,1971.0,34.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,True,False,Zone K3,2005-03-29 23:20:00


In [19]:
# male drivers who were stopped for speeding
dfM = ri[ri['driver_gender']=='M'] 
dfM[dfM.violation== "Speeding"].head(3)

Unnamed: 0,id,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,violation_raw,violation,search_conducted,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,stop_datetime
0,RI-2005-00001,Zone K1,600,M,1985.0,20.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,False,False,Zone K1,2005-01-02 01:55:00
1,RI-2005-00002,Zone X4,500,M,1987.0,18.0,W,White,Speeding,Speeding,False,False,Citation,False,16-30 Min,False,False,Zone X4,2005-01-02 20:30:00
6,RI-2005-00007,Zone K3,300,M,1965.0,40.0,W,White,Speeding,Speeding,False,False,Citation,False,0-15 Min,True,False,Zone K3,2005-01-18 08:15:00


In [20]:
# stop outcomes for the female drivers as proportions
df[df.violation== "Speeding"]['stop_outcome'].value_counts(normalize=True) * 100

Citation            95.430586
Arrest Driver        0.529433
Arrest Passenger     0.103559
N/D                  0.089596
No Action            0.052362
Name: stop_outcome, dtype: float64

In [21]:
# stop outcomes for the male drivers as proportions
dfM[dfM.violation== "Speeding"]['stop_outcome'].value_counts(normalize=True) * 100

Citation            94.574427
Arrest Driver        1.578658
Arrest Passenger     0.125831
N/D                  0.118138
No Action            0.104401
Name: stop_outcome, dtype: float64
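
The two filters (gender, then violation) can also be combined into one Boolean mask with `&`, which yields the `female_and_speeding` and `male_and_speeding` frames named in the instructions. This is a minimal sketch on hypothetical toy data, not the real stop records.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'F'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding',
                  'Speeding', 'Speeding', 'Speeding', 'Speeding'],
    'stop_outcome': ['Citation', 'Warning', 'Citation', 'Citation',
                     'Citation', 'Citation', 'Warning', 'Citation'],
})

# Each condition needs its own parentheses because & binds tighter than ==
female_and_speeding = toy[(toy['driver_gender'] == 'F') & (toy['violation'] == 'Speeding')]
male_and_speeding = toy[(toy['driver_gender'] == 'M') & (toy['violation'] == 'Speeding')]

# Stop outcomes as percentages within each group
female_pct = female_and_speeding['stop_outcome'].value_counts(normalize=True) * 100
male_pct = male_and_speeding['stop_outcome'].value_counts(normalize=True) * 100
print(female_pct)
print(male_pct)
```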

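## Citation rates for speeding stops are nearly identical for female (about 95.4%) and male (about 94.6%) drivers, so this data shows little evidence that gender affects whether a driver receives a ticket or a warning.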

***

## Calculating the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops that result in a vehicle search, also known as the **search rate**.

**INSTRUCTIONS**

*   Check the data type of ``search_conducted`` to confirm that it's a ``Boolean Series``.
*   Calculate the search rate by counting the ``Series`` values and expressing them as proportions.
*   Calculate the search rate by taking the mean of the ``Series``. (It should match the proportion of ``True`` values calculated above.)

In [28]:
# search rate as proportions
ri['search_conducted'].value_counts(normalize=True) * 100

False    96.301423
True      3.698577
Name: search_conducted, dtype: float64

In [31]:
# the search rate by taking the mean
ri['search_conducted'].mean() * 100

3.698576859220723
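
The two results agree because `True` counts as 1 and `False` as 0, so the mean of a Boolean ``Series`` equals the proportion of `True` values. The sketch below also covers the data type check from the instructions. It uses a hypothetical Series named `searches`. In the prepared data, the earlier `dtypes` output lists `search_conducted` as `object`, so a conversion such as `ri['search_conducted'].astype(bool)` may be needed before the check passes.

```python
import pandas as pd

# Hypothetical Boolean Series: 2 searches out of 8 stops
searches = pd.Series([True, False, False, False, True, False, False, False])

# Confirm the Series really is Boolean before doing Boolean math on it
is_bool = searches.dtype == bool

# Proportion of True values, taken from the normalized counts
proportions = searches.value_counts(normalize=True)
rate_from_counts = proportions[True]

# Mean of a Boolean Series: True is treated as 1, False as 0
rate_from_mean = searches.mean()

print(is_bool, rate_from_counts, rate_from_mean)
```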

***

## Comparing search rates by gender

You'll compare the rates at which **female** and **male** drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about **3.7%**.

First, you'll filter the ``DataFrame`` by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a ``.groupby()``.

**INSTRUCTIONS 1/3**

*   Filter the ``DataFrame`` to only include **female** drivers, and then calculate the search rate by taking the mean of ``search_conducted``.

In [32]:
df['search_conducted'].mean() * 100

1.8748947763135744

**INSTRUCTIONS 2/3**

*   Filter the ``DataFrame`` to only include **male** drivers, and then repeat the search rate calculation.

In [40]:
dfM['search_conducted'].mean() * 100

4.384228516186947

**INSTRUCTIONS 3/3**

*   Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)

In [41]:
ri.groupby('driver_gender')['search_conducted'].value_counts(normalize=True) * 100

driver_gender  search_conducted
F              False               98.125105
               True                 1.874895
M              False               95.615771
               True                 4.384229
Name: search_conducted, dtype: float64
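
Because the search rate is simply the mean of a Boolean column, the grouped calculation can also be written with `.mean()`, which returns one value per gender instead of a `True`/`False` pair. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'search_conducted': [False, False, False, True, True, True, False, False],
})

# One search rate (in percent) per gender
search_rate = toy.groupby('driver_gender')['search_conducted'].mean() * 100
print(search_rate)
```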

***

## Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

**INSTRUCTIONS 1/2**

*   Use a ``.groupby()`` to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?

**INSTRUCTIONS 2/2**

*   Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.
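
Both steps can be sketched by passing a list of columns to `.groupby()`. The order of the list controls which factor forms the outer level of the result. The frame `toy` below is hypothetical, so its rates say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Equipment',
                  'Speeding', 'Speeding', 'Equipment', 'Equipment'],
    'search_conducted': [False, False, True, False, False, True, True, True],
})

# Search rate (percent) for each gender, broken down by violation
by_gender = toy.groupby(['driver_gender', 'violation'])['search_conducted'].mean() * 100

# Same numbers, grouped by violation first so the genders sit side by side
by_violation = toy.groupby(['violation', 'driver_gender'])['search_conducted'].mean() * 100

print(by_gender)
print(by_violation)
```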

***

## Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a ``"protective frisk."``

You'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

**INSTRUCTIONS**

*   Count the ``search_type`` values to see how many times ``"Protective Frisk"`` was the only search type.
*   Create a new column, ``frisk``, that is ``True`` if ``search_type`` contains the string ``"Protective Frisk"`` and ``False`` otherwise.
*   Check the data type of ``frisk`` to confirm that it's a ``Boolean Series``.
*   Take the sum of ``frisk`` to count the total number of frisks.
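
These steps might look like the sketch below, which uses `str.contains` to flag any search that includes a frisk. Note that the preparation cell earlier in this chapter drops ``search_type``, so the column would have to be kept for this to run on `ri`. The toy values here are hypothetical, including the comma-separated format for multiple search types.

```python
import pandas as pd

# Hypothetical toy data; None marks stops without a search
toy = pd.DataFrame({
    'search_type': [None, 'Protective Frisk', 'Incident to Arrest',
                    'Incident to Arrest,Protective Frisk', None, 'Inventory'],
})

# How often each exact search_type value occurs (missing values are skipped)
type_counts = toy['search_type'].value_counts()

# True wherever the text contains "Protective Frisk"; na=False maps missing values to False
toy['frisk'] = toy['search_type'].str.contains('Protective Frisk', na=False)

# Summing a Boolean column counts its True values
frisk_total = toy['frisk'].sum()

print(type_counts)
print(toy['frisk'].dtype, frisk_total)
```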

***

## Comparing frisk rates by gender

You'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the ``DataFrame`` to only include the relevant subset of data, namely stops in which a search was conducted.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``searched``, that only contains rows in which ``search_conducted`` is ``True``.
*   Take the mean of the ``frisk`` column to find out what percentage of searches included a frisk.
*   Calculate the frisk rate for each gender using a ``.groupby()``.
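
A minimal sketch of these steps follows, again on hypothetical toy data. It assumes a Boolean ``frisk`` column already exists, as described in the previous section.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'M'],
    'search_conducted': [True, True, False, True, True, True, True, False],
    'frisk': [False, True, False, True, True, True, False, False],
})

# Keep only stops in which a search was conducted
searched = toy[toy['search_conducted']]

# Percentage of searches that included a frisk
frisk_rate = searched['frisk'].mean() * 100

# Frisk rate (percent) for each gender, among searched drivers only
frisk_rate_by_gender = searched.groupby('driver_gender')['frisk'].mean() * 100

print(frisk_rate)
print(frisk_rate_by_gender)
```

Filtering first matters: taking the mean of ``frisk`` over all stops would instead give the share of stops, not searches, that involved a frisk.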