# Analyzing Police Activity with pandas

Exercise created by Kevin Markham, founder of Data School
Download full dataset for any of the 31 states involved at: [https://openpolicing.stanford.edu/](https://openpolicing.stanford.edu/)

**Goals:**
* Explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. 
* Gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data.

**About the data:**

* Each row represents one traffic stop
* Missing values are represented as `NaN`

* **`.isnull()`** : returns a boolean mask of the same shape as the data, `True` wherever a value is missing; **`.isna()`** is an equivalent alias
* **`.drop()`** : `ri.drop('column_name', axis = 'columns', inplace = True)`
* **`.dropna()`** : `ri.dropna(subset=['stop_date', 'stop_time'], inplace = True)` drops only the rows that have `NaN` in the column `stop_date` or `stop_time`
* **`.astype()`** : to change data type

* **Note:** you must use **bracket notation** on the left side of an assignment statement to create a new series or overwrite an existing one
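
The cleaning steps above can be sketched on a tiny, made-up stand-in for `ri`. The sample rows and the all-missing `county_name` column are assumptions for illustration only; the real dataset has many more rows and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the `ri` DataFrame (sample values are made up)
ri = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', None, '2005-02-20'],
    'stop_time': ['12:55', '23:15', '01:30', '17:15'],
    'county_name': [np.nan] * 4,  # assumed to be entirely missing
    'is_arrested': pd.Series([False, True, False, False], dtype='object'),
})

# Count the missing values in each column
missing_counts = ri.isnull().sum()

# Drop a column that carries no information
ri.drop('county_name', axis='columns', inplace=True)

# Drop only the rows missing a stop date or a stop time
ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)

# Bracket notation on the left side overwrites the existing column
ri['is_arrested'] = ri.is_arrested.astype('bool')

print(missing_counts)
print(ri.dtypes)
```

Summing the result of `.isnull()` works because each `True` counts as 1, so the total per column is the number of missing values.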

* `datetime` columns provide date-based attributes through the `.dt` accessor (for example, `.dt.month`)
* **`.str.cat()`** : concatenates strings element-wise (as opposed to `pd.concat()`, which combines Series or DataFrames); use the argument `sep=' '` to separate with a space (or a character of your choice)
* **`pd.to_datetime()`** : convert a string to a datetime object; if no time is entered, defaults to midnight
* **`.set_index()`** to set a column as the index; use `inplace = True` to avoid assignment statement
* **Note:** When an existing column becomes the index, it is no longer considered to be one of the dataframe columns.
* **`.value_counts()`** : counts the unique values in a Series; best suited for categorical rather than numerical data; use **`normalize = True`** to express counts as proportions
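
As a rough sketch of how these pieces fit together, the example below builds a datetime index from separate date and time strings and then counts violations. The three sample rows and the column name `stop_datetime` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature version of `ri` (sample values are made up)
ri = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', '2005-02-17'],
    'stop_time': ['12:55', '23:15', '04:15'],
    'violation': ['Speeding', 'Speeding', 'Equipment'],
})

# Join the date and time strings with a space in between
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert the combined strings to datetimes and store them in a new column
ri['stop_datetime'] = pd.to_datetime(combined)

# Make the new column the index; it is no longer one of the columns
ri.set_index('stop_datetime', inplace=True)

# A datetime index exposes date-based attributes directly
months = ri.index.month

# Share of stops per violation type
proportions = ri.violation.value_counts(normalize=True)

print(ri.columns)
print(months)
print(proportions)
```

Note that a datetime column uses the `.dt` accessor (`ri.some_column.dt.month`), while a datetime index exposes the same attributes without it (`ri.index.month`).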

#### Filtering on a single condition
`female = ri[ri.driver_gender == 'F']`


#### Filtering on multiple conditions
`female_and_arrested = ri[(ri.driver_gender == 'F') & (ri.is_arrested == True)]`
* Each condition must be wrapped in parentheses
* Use an ampersand (`&`) between conditions to require all of them ("and"); use a pipe (`|`) to require at least one of them ("or")

#### Other rules for filtering on multiple conditions
* Can use more than two conditions
* Conditions can also check for equality, inequality, greater than, less than, etc
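
A small sketch of these rules, using an "or" filter and a three-condition filter with an inequality. The five sample rows are made up, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of `ri` (sample values are made up)
ri = pd.DataFrame({
    'driver_gender': ['F', 'M', 'F', 'M', 'F'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Other', 'Speeding'],
    'is_arrested': [False, True, True, False, False],
})

# "Or": keep rows where the driver is female, was arrested, or both
female_or_arrested = ri[(ri.driver_gender == 'F') | (ri.is_arrested == True)]

# Three conditions, one of them an inequality check
male_other_not_arrested = ri[
    (ri.driver_gender == 'M')
    & (ri.violation != 'Speeding')
    & (ri.is_arrested == False)
]

print(female_or_arrested)
print(male_other_not_arrested)
```

The parentheses are required because `&` and `|` bind more tightly than `==` and `!=` in Python.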


```
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
```

#### Math with boolean values
* **`np.mean([0, 1, 0, 0])` = `np.mean([False, True, False, False])`**
* True = 1; False = 0
* **The mean of a boolean series represents the proportion of `True` values** (multiply by 100 for a percentage)
* `ri.is_arrested.mean()` works because `ri.is_arrested` is type `bool`
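
A minimal check of the boolean arithmetic above, using a made-up four-element Series in place of `ri.is_arrested`:

```python
import numpy as np
import pandas as pd

# Booleans behave like 1 (True) and 0 (False) in arithmetic
mean_of_ints = np.mean([0, 1, 0, 0])
mean_of_bools = np.mean([False, True, False, False])

# Hypothetical stand-in for ri.is_arrested
is_arrested = pd.Series([False, True, False, False])

arrest_rate = is_arrested.mean()   # proportion of True values
arrest_count = is_arrested.sum()   # number of True values

print(mean_of_ints, mean_of_bools)
print(arrest_rate, arrest_count)
```

Here one of the four values is `True`, so both means and the arrest rate come out to 0.25.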

#### Comparing groups using groupby
* `ri.groupby('district').is_arrested.mean()`

#### Grouping by multiple categories 
* `ri.groupby(['district', 'driver_gender']).is_arrested.mean()`
    * Note that if you switch the order of 'district' and 'driver_gender', the computed values are the same, but the index levels are nested in the opposite order, so the output is laid out differently
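
The grouping calls above can be tried on a small made-up sample. The district labels and the six rows are assumptions chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical miniature version of `ri` (sample values are made up)
ri = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K1',
                 'Zone X4', 'Zone X4', 'Zone X4'],
    'driver_gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'is_arrested': [False, True, False, True, True, False],
})

# Arrest rate per district
by_district = ri.groupby('district').is_arrested.mean()

# Arrest rate per district, split by gender within each district
by_district_gender = ri.groupby(['district', 'driver_gender']).is_arrested.mean()

# Same values, but gender is now the outer index level
by_gender_district = ri.groupby(['driver_gender', 'district']).is_arrested.mean()

print(by_district)
print(by_district_gender)
print(by_gender_district)
```

Both multi-level results hold the same four arrest rates; only the order of the index levels, and therefore the way the rows are nested when printed, differs.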