In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_columns = 200
plt.style.use('seaborn-whitegrid')

## Stanford Open Policing Project Data Analysis
- 1. [Create dataframe from imported data](#-1.Create-dataframe-from-imported-data)
- 2. [Explore violations](#-2.Explore-violations)
    - i. [What are the top three violation types a driver is pulled over based on gender?](#-i.What-are-the-top-three-violation-types-a-driver-is-pulled-over-based-on-gender?)
    - ii. [Compare the violation type stats based on gender.](#-ii.Compare-the-violation-type-stats-based-on-gender.)

### 1.Create dataframe from imported data

In [2]:
police_df = pd.read_csv('../data/police_data.csv')
police_df.shape

(91741, 15)

In [3]:
police_df.head()

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [4]:
police_df.dtypes

stop_date              object
stop_time              object
county_name           float64
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race            object
violation_raw          object
violation              object
search_conducted         bool
search_type            object
stop_outcome           object
is_arrested            object
stop_duration          object
drugs_related_stop       bool
dtype: object
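
Note that `is_arrested` is loaded as `object` rather than `bool`. This usually happens when a column of `True`/`False` values also contains missing entries. A minimal sketch, using a small hypothetical series `arrest_flags` rather than the real column, of converting such data to pandas' nullable `boolean` dtype (this assumes pandas 1.0 or later):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: True/False values plus one missing entry,
# stored as object dtype the way read_csv would store them.
arrest_flags = pd.Series([False, True, np.nan, False], dtype="object")

# The nullable boolean dtype keeps the missing entry as <NA>
# instead of forcing the whole column to object.
arrest_flags = arrest_flags.astype("boolean")

print(arrest_flags.dtype)
print(arrest_flags.isna().sum())
```

With a nullable boolean column, aggregations such as `.sum()` and `.mean()` skip the missing entries by default.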

- Check the value range for the `stop_date` column.

In [5]:
police_df['stop_date'].min(), police_df['stop_date'].max()

('2005-01-02', '2015-12-31')
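
The `min()`/`max()` call above compares strings, which works here only because `YYYY-MM-DD` dates sort lexicographically in chronological order. A minimal sketch, on a hypothetical three-row frame `sample_df` that mimics the `stop_date` and `stop_time` columns, of parsing them into a real datetime column so that date arithmetic becomes possible:

```python
import pandas as pd

# Hypothetical sample that mimics the stop_date / stop_time columns.
sample_df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2015-12-31", "2010-06-15"],
    "stop_time": ["01:55", "21:42", "10:00"],
})

# Join the date and time strings, then parse them with an explicit format.
sample_df["stop_datetime"] = pd.to_datetime(
    sample_df["stop_date"] + " " + sample_df["stop_time"],
    format="%Y-%m-%d %H:%M",
)

first_stop = sample_df["stop_datetime"].min()
last_stop = sample_df["stop_datetime"].max()

# Subtracting timestamps yields a Timedelta; .days gives whole days.
span_days = (last_stop - first_stop).days
print(first_stop, last_stop, span_days)
```

The column name `stop_datetime` is an assumption for illustration; it does not exist in the original data.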

- Check for null value counts.

In [6]:
police_df.isna().sum()

stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64
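
Raw null counts are easier to judge as a share of all rows. A minimal sketch, on a hypothetical four-row frame `sample_df`, that uses `isna().mean()` to get the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one fully empty column, one partly empty, one complete.
sample_df = pd.DataFrame({
    "county_name": [np.nan, np.nan, np.nan, np.nan],
    "driver_gender": ["M", "F", np.nan, "M"],
    "search_conducted": [False, False, True, False],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values.
null_share = sample_df.isna().mean().sort_values(ascending=False)
print(null_share)
```

A share of exactly 1.0 marks a column with no usable values at all.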

In [7]:
police_df.shape[0] == police_df.county_name.isnull().sum()

True

- The `county_name` column contains only null values, so drop it.

In [8]:
police_df = police_df.drop('county_name', axis=1)

police_df.columns

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')
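
Dropping the column by name works when the empty column is already known. As a sketch of a more general approach, `dropna` with `axis="columns"` and `how="all"` removes every column that consists entirely of nulls. The frame `sample_df` below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: county_name is entirely null, driver_gender only partly.
sample_df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "county_name": [np.nan, np.nan],
    "driver_gender": ["M", np.nan],
})

# how="all" drops a column only if every value in it is missing,
# so partly filled columns such as driver_gender are kept.
cleaned_df = sample_df.dropna(axis="columns", how="all")
print(cleaned_df.columns.tolist())
```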

### 2.Explore violations

- Which gender is stopped for speeding more often?

In [9]:
police_df.violation.value_counts()

Speeding               48463
Moving violation       16224
Equipment              11020
Other                   4317
Registration/plates     3432
Seat belt               2952
Name: violation, dtype: int64

In [24]:
police_df.query('violation == "Speeding"').driver_gender.value_counts()

M    32979
F    15482
Name: driver_gender, dtype: int64

In [25]:
police_df.query('violation == "Speeding"').driver_gender.value_counts(normalize=True)

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

- Alternative methods: boolean indexing and `.loc`

In [29]:
police_df[police_df.violation == 'Speeding']

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
4,2005-03-14,10:00,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
6,2005-04-01,17:30,M,1969.0,36.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91734,2015-12-31,20:20,M,1993.0,22.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
91735,2015-12-31,20:25,M,1992.0,23.0,Hispanic,Speeding,Speeding,False,,Citation,False,0-15 Min,False
91736,2015-12-31,20:27,M,1986.0,29.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
91739,2015-12-31,21:42,M,1993.0,22.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [30]:
police_df.loc[police_df.violation == 'Speeding', 'driver_gender']

0        M
1        M
2        M
4        F
6        M
        ..
91734    M
91735    M
91736    M
91739    M
91740    M
Name: driver_gender, Length: 48463, dtype: object

In [32]:
police_df.loc[police_df.violation == "Speeding", "driver_gender"].value_counts(normalize=True)

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64
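
The gender counts for speeding stops add up to 48461, while 48463 rows have `violation == "Speeding"`. The difference comes from `value_counts()` silently skipping rows where `driver_gender` is missing. A minimal sketch, on a hypothetical frame `sample_df`, showing how `dropna=False` makes those rows visible:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one speeding stop has no recorded gender.
sample_df = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Equipment"],
    "driver_gender": ["M", "M", "F", np.nan, "F"],
})

speeding_gender = sample_df.loc[sample_df["violation"] == "Speeding", "driver_gender"]

# Default behaviour: rows with a missing gender are excluded from the counts.
default_counts = speeding_gender.value_counts()

# dropna=False adds a NaN entry, so the counts sum to the number of rows.
all_counts = speeding_gender.value_counts(dropna=False)

print(default_counts.sum(), all_counts.sum())
```

With `normalize=True`, the same choice decides whether the shares are computed over stops with a known gender or over all speeding stops.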

#### i.What are the top three violation types a driver is pulled over based on gender?

In [38]:
police_df.query('driver_gender == "M"').violation.value_counts(normalize=True).head(3)

Speeding            0.524350
Moving violation    0.207012
Equipment           0.135671
Name: violation, dtype: float64

In [39]:
police_df.query('driver_gender == "F"').violation.value_counts(normalize=True)

Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64

- Alternative method using `.loc`

In [42]:
police_df.loc[police_df.driver_gender == "M", 'violation'].value_counts()

Speeding               32979
Moving violation       13020
Equipment               8533
Other                   3627
Registration/plates     2419
Seat belt               2317
Name: violation, dtype: int64
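
A further alternative avoids running one query per gender. The sketch below uses `pd.crosstab` with `normalize="columns"` to get the violation shares for every gender in one table, then picks the three largest per column. The frame `sample_df` and its counts are hypothetical, and `top3_by_gender` is a name chosen for illustration:

```python
import pandas as pd

# Hypothetical sample counts per (gender, violation) pair.
rows = (
    [("M", "Speeding")] * 5
    + [("M", "Moving violation")] * 3
    + [("M", "Equipment")] * 2
    + [("M", "Other")] * 1
    + [("F", "Speeding")] * 7
    + [("F", "Moving violation")] * 3
    + [("F", "Equipment")] * 2
    + [("F", "Other")] * 1
)
sample_df = pd.DataFrame(rows, columns=["driver_gender", "violation"])

# Rows are violation types, columns are genders; each column sums to 1.
shares = pd.crosstab(
    sample_df["violation"], sample_df["driver_gender"], normalize="columns"
)

# The three most frequent violation types for each gender.
top3_by_gender = {gender: shares[gender].nlargest(3) for gender in shares.columns}

for gender, top3 in top3_by_gender.items():
    print(gender)
    print(top3)
```

Rows with a missing gender or violation are left out of the table, which matches the default behaviour of `value_counts()`.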

#### ii.Compare the violation type stats based on gender.