In [148]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_columns = 200
plt.style.use('seaborn-whitegrid')

## Stanford Open Policing Project Data Analysis
- 1. [Create dataframe from imported data](#-1.Create-dataframe-from-imported-data)
- 2. [Explore violations](#-2.Explore-violations)
    - i. [What are the top three violation types a driver is pulled over based on gender?](#-i.What-are-the-top-three-violation-types-a-driver-is-pulled-over-based-on-gender?)
    - ii. [Compare the violation type stats based on gender.](#-ii.Compare-the-violation-type-stats-based-on-gender.)
    - iii. [What is the relationship between gender and violation stops?](#-iii.What-is-the-relationship-between-gender-and-violation-stops?)
    - iv. [What is the percentage of police stops in which a search was conducted?](#-iv.What-is-the-percentage-of-police-stops-in-which-a-search-was-conducted?)
- 3. [How does pandas handle NULL values?](#-3.How-does-pandas-handle-NULL-values?)

### 1.Create dataframe from imported data

In [149]:
police_df = pd.read_csv('../data/police_data.csv')
police_df.shape

(91741, 15)

In [150]:
police_df.head()

Unnamed: 0,stop_date,stop_time,county_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2005-02-20,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,2005-03-14,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [151]:
police_df.dtypes

stop_date              object
stop_time              object
county_name           float64
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race            object
violation_raw          object
violation              object
search_conducted         bool
search_type            object
stop_outcome           object
is_arrested            object
stop_duration          object
drugs_related_stop       bool
dtype: object
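
The `is_arrested` column shows up as `object` rather than `bool`, most likely because it mixes booleans with missing values. A minimal sketch, on a hypothetical stand-in Series rather than the real column, of converting such data to pandas' nullable `boolean` dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `is_arrested`: booleans plus a missing value,
# which pandas stores with dtype `object`.
is_arrested = pd.Series([True, False, np.nan, False], dtype=object)

# The nullable "boolean" extension dtype keeps True / False / <NA> distinct.
is_arrested_bool = is_arrested.astype("boolean")

n_arrests = int(is_arrested_bool.sum())         # missing values are skipped
n_missing = int(is_arrested_bool.isna().sum())  # number of <NA> entries
```

Whether this conversion is worthwhile depends on how the column is used later; it is shown here only to explain the `object` dtype.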

- Check the value range for the `stop_date` column.

In [152]:
police_df['stop_date'].min(), police_df['stop_date'].max()

('2005-01-02', '2015-12-31')
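
The min/max above compares strings, which only gives the right answer because the dates are in ISO `YYYY-MM-DD` form. A hedged sketch, using a small made-up frame with the same two column names, of parsing the values into real timestamps first:

```python
import pandas as pd

# Toy rows in the same format as `stop_date` / `stop_time` (values are illustrative).
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2015-12-31", "2010-06-15"],
    "stop_time": ["01:55", "21:42", "10:00"],
})

# Combine the two string columns and parse them with an explicit format.
toy["stop_datetime"] = pd.to_datetime(
    toy["stop_date"] + " " + toy["stop_time"], format="%Y-%m-%d %H:%M"
)

date_range = (toy["stop_datetime"].min(), toy["stop_datetime"].max())
```

The `stop_datetime` column name is an assumption made for this example; it does not exist in the original data.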

- Check for null value counts.

In [153]:
police_df.isna().sum()

stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64

In [154]:
police_df.shape[0] == police_df.county_name.isnull().sum()

True

- The `county_name` column contains only null values, so drop it.

In [155]:
police_df = police_df.drop('county_name', axis=1)

police_df.columns

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')
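
As an alternative to naming the column explicitly, `dropna` can remove every column that is entirely null. A small sketch on toy data (not the real file):

```python
import numpy as np
import pandas as pd

# Toy frame: `county_name` is all-null, `driver_gender` is only partly null.
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "county_name": [np.nan, np.nan],
    "driver_gender": ["M", np.nan],
})

# how="all" drops a column only if every value in it is missing.
cleaned = toy.dropna(axis="columns", how="all")
```

Columns with some valid values, such as `driver_gender` here, are kept.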

### 2.Explore violations

- Which gender accounts for more speeding stops?

In [156]:
police_df.violation.value_counts()

Speeding               48463
Moving violation       16224
Equipment              11020
Other                   4317
Registration/plates     3432
Seat belt               2952
Name: violation, dtype: int64

In [157]:
police_df.query('violation == "Speeding"').driver_gender.value_counts()

M    32979
F    15482
Name: driver_gender, dtype: int64

In [158]:
police_df.query('violation == "Speeding"').driver_gender.value_counts(normalize=True)

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

- Alternative methods

In [159]:
police_df[police_df.violation == 'Speeding']

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,2005-01-02,01:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,2005-01-18,08:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,2005-01-23,23:15,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
4,2005-03-14,10:00,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
6,2005-04-01,17:30,M,1969.0,36.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91734,2015-12-31,20:20,M,1993.0,22.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
91735,2015-12-31,20:25,M,1992.0,23.0,Hispanic,Speeding,Speeding,False,,Citation,False,0-15 Min,False
91736,2015-12-31,20:27,M,1986.0,29.0,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False
91739,2015-12-31,21:42,M,1993.0,22.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [160]:
police_df.loc[police_df.violation == 'Speeding', 'driver_gender']

0        M
1        M
2        M
4        F
6        M
        ..
91734    M
91735    M
91736    M
91739    M
91740    M
Name: driver_gender, Length: 48463, dtype: object

In [161]:
police_df.loc[police_df.violation == "Speeding", "driver_gender"].value_counts(normalize=True)

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64
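
The proportions above answer "what share of speeding stops involve each gender", which is different from "what share of each gender's stops are for speeding". A sketch with `pd.crosstab` on made-up rows shows both normalizations side by side:

```python
import pandas as pd

# Toy data: 6 stops of men, 3 stops of women (counts are illustrative only).
toy = pd.DataFrame({
    "driver_gender": ["M"] * 6 + ["F"] * 3,
    "violation": ["Speeding"] * 3 + ["Equipment"] * 3 + ["Speeding"] * 2 + ["Equipment"],
})

# Each column sums to 1: share of a violation's stops by gender.
share_of_violation = pd.crosstab(
    toy["driver_gender"], toy["violation"], normalize="columns"
)

# Each row sums to 1: share of a gender's stops by violation.
share_within_gender = pd.crosstab(
    toy["driver_gender"], toy["violation"], normalize="index"
)
```

In this toy data men make up most speeding stops, yet speeding is a larger share of women's stops, which mirrors the pattern in the real counts.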

#### i.What are the top three violation types a driver is pulled over based on gender?

In [162]:
police_df.query('driver_gender == "M"').violation.value_counts(normalize=True).head(3)

Speeding            0.524350
Moving violation    0.207012
Equipment           0.135671
Name: violation, dtype: float64

In [163]:
police_df.query('driver_gender == "F"').violation.value_counts(normalize=True)

Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64

- Alternative methods

In [164]:
police_df.loc[police_df.driver_gender == "M", 'violation'].value_counts()

Speeding               32979
Moving violation       13020
Equipment               8533
Other                   3627
Registration/plates     2419
Seat belt               2317
Name: violation, dtype: int64


#### ii.Compare the violation type stats based on gender.

In [165]:
police_df.groupby('driver_gender').violation

<pandas.core.groupby.generic.SeriesGroupBy object at 0x14796cb80>

In [166]:
police_df.groupby('driver_gender').violation.value_counts(normalize=True)

driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64

- Generate markdown

In [173]:
%run ../playground/00_generate_markdown.ipynb import generate_markdown_text
generate_markdown_text('iii.What is the relationship between gender and violation stops?')

'- iii. [What is the relationship between gender and violation stops?](#-iii.What-is-the-relationship-between-gender-and-violation-stops?)'

#### iii.What is the relationship between gender and violation stops?

In [175]:
police_df.search_conducted.value_counts()

False    88545
True      3196
Name: search_conducted, dtype: int64

In [179]:
police_df.search_conducted.value_counts(normalize=True)

False    0.965163
True     0.034837
Name: search_conducted, dtype: float64

- What is the percentage of police stops in which a search was conducted?

In [184]:
generate_markdown_text("iv.What is the percentage of police stops in which a search was conducted?")

'- iv. [What is the percentage of police stops in which a search was conducted?](#-iv.What-is-the-percentage-of-police-stops-in-which-a-search-was-conducted?)'

#### iv.What is the percentage of police stops in which a search was conducted?

In [182]:
police_df.search_conducted.mean()

0.03483720473942948
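
Taking the mean of a boolean column works as a rate because `True` counts as 1 and `False` as 0. A quick sketch on a toy Series:

```python
import pandas as pd

# Toy boolean column: 2 searches in 8 stops.
search_conducted = pd.Series([True, False, False, False, True, False, False, False])

rate_from_mean = search_conducted.mean()                         # mean of 0/1 values
rate_by_hand = search_conducted.sum() / len(search_conducted)    # count of True / total
```

Both expressions give the fraction of stops in which a search was conducted.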

- Compare the `search_conducted` rate by gender.

In [181]:
police_df.groupby('driver_gender').search_conducted.value_counts(normalize=True)

driver_gender  search_conducted
F              False               0.979967
               True                0.020033
M              False               0.956674
               True                0.043326
Name: search_conducted, dtype: float64

In [187]:
police_df.groupby(['driver_gender', 'violation']).search_conducted.value_counts(normalize=True)

driver_gender  violation            search_conducted
F              Equipment            False               0.957378
                                    True                0.042622
               Moving violation     False               0.963795
                                    True                0.036205
               Other                False               0.943478
                                    True                0.056522
               Registration/plates  False               0.933860
                                    True                0.066140
               Seat belt            False               0.987402
                                    True                0.012598
               Speeding             False               0.991280
                                    True                0.008720
M              Equipment            False               0.929919
                                    True                0.070081
               Moving violation     F

In [188]:
police_df.groupby(['driver_gender', 'violation']).search_conducted.mean()

driver_gender  violation          
F              Equipment              0.042622
               Moving violation       0.036205
               Other                  0.056522
               Registration/plates    0.066140
               Seat belt              0.012598
               Speeding               0.008720
M              Equipment              0.070081
               Moving violation       0.059831
               Other                  0.047146
               Registration/plates    0.110376
               Seat belt              0.037980
               Speeding               0.024925
Name: search_conducted, dtype: float64
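
The grouped means are easier to compare when the genders sit in adjacent columns. A sketch on toy data, where the `M_minus_F` column name is an assumption made for this example:

```python
import pandas as pd

# Toy stops: two violation types, four stops per gender (illustrative values).
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment", "Equipment"] * 2,
    "search_conducted": [False, False, True, False, True, False, True, True],
})

# Search rate per (gender, violation) pair, as in the cell above.
rates = toy.groupby(["driver_gender", "violation"])["search_conducted"].mean()

# Pivot gender into columns, then add the gap between the two rates.
wide = rates.unstack("driver_gender")
wide["M_minus_F"] = wide["M"] - wide["F"]
```

A positive `M_minus_F` means men were searched at a higher rate for that violation type in the toy data.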

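- Conclusions:
    - No search was conducted in about 96.5% of stops.
    - Men were searched more often than women overall (4.3% vs. 2.0%) and for every violation type except `Other`. This is an association only; these rates alone cannot establish that gender causes a search.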

In [190]:
generate_markdown_text('3.How does pandas handle "Null" values?')

'- 3. [How does pandas handle "Null" values?](#-3.How-does-pandas-handle-"Null"-values?)'

### 3.How does pandas handle NULL values?

In [207]:
police_df.query('search_conducted == True').search_type.value_counts()

Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Incident to Arrest,Protective Frisk                           33
Probable Cause,Protective Frisk                               33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Inventory,Protective Frisk                                    11
Incident to Arrest,Inventory,Protective Frisk                 11
Protective Frisk,Reasonab

In [208]:
police_df.query('search_conducted == True').search_type.value_counts(dropna=False)

Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Incident to Arrest,Protective Frisk                           33
Probable Cause,Protective Frisk                               33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Inventory,Protective Frisk                                    11
Incident to Arrest,Inventory,Protective Frisk                 11
Protective Frisk,Reasonab

In [203]:
police_df.query('search_conducted == True').search_type

24       Incident to Arrest,Protective Frisk
40                            Probable Cause
41                            Probable Cause
80                        Incident to Arrest
106                           Probable Cause
                        ...                 
91494                     Incident to Arrest
91548                     Incident to Arrest
91672                     Incident to Arrest
91700    Probable Cause,Reasonable Suspicion
91708                     Incident to Arrest
Name: search_type, Length: 3196, dtype: object

In [205]:
police_df.query('search_conducted == True').search_type.isnull().sum()

0

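- Conclusion:
    - Many pandas methods, such as `value_counts()`, skip missing values by default; pass `dropna=False` to count them. Here both calls return the same counts because every stop with `search_conducted == True` has a non-null `search_type`.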
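
Since the filtered column above has no nulls, the default behaviour is not visible there. A sketch on toy Series that do contain missing values shows the difference:

```python
import numpy as np
import pandas as pd

# Toy `search_type`-like column with two missing entries.
search_type = pd.Series(["Probable Cause", np.nan, "Inventory", np.nan, "Probable Cause"])

counts_default = search_type.value_counts()               # NaN rows are left out
counts_with_nan = search_type.value_counts(dropna=False)  # NaN gets its own row

# Reductions behave the same way: NaN is skipped unless skipna=False.
ages = pd.Series([20.0, np.nan, 40.0])
mean_default = ages.mean()
mean_strict = ages.mean(skipna=False)
```

With `skipna=False` a single missing value makes the whole result `NaN`, which can be useful for detecting incomplete data.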