# Analyzing Police Activity with pandas

Exercise created by Kevin Markham, founder of Data School.

Download the full dataset for any of the 31 states involved at: [https://openpolicing.stanford.edu/](https://openpolicing.stanford.edu/)

**Goals:**
* Explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. 
* Gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data.

**About the data:**

* Each row represents one traffic stop
* Missing values are recorded as `NaN`

* **`.isnull()`** : returns a boolean object of the same shape, with `True` marking each missing value; **`.isna()`** is an alias
* **`.drop()`** : `ri.drop('column_name', axis='columns', inplace=True)` drops a column
* **`.dropna()`** : `ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)` drops only the rows that have `NaN` in `stop_date` or `stop_time`
* **`.astype()`** : changes the data type of a Series
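As a rough sketch of the cleaning methods above, the snippet below builds a tiny made-up stand-in for `ri`. The rows and the all-missing `county_name` column are assumptions for illustration, not values from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the `ri` DataFrame (invented rows)
ri = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', None, '2005-02-20'],
    'stop_time': ['12:55', '23:15', '01:20', None],
    'county_name': [np.nan, np.nan, np.nan, np.nan],
    'driver_gender': ['M', 'F', 'F', 'M'],
})

# .isnull() marks missing values with True; .sum() then counts them per column
missing_counts = ri.isnull().sum()
print(missing_counts)

# Drop a column that holds nothing but missing values
ri.drop('county_name', axis='columns', inplace=True)

# Drop only the rows missing 'stop_date' or 'stop_time'
ri.dropna(subset=['stop_date', 'stop_time'], inplace=True)
print(ri.shape)
```

Checking `ri.isnull().sum()` before and after dropping is a quick way to confirm that the cleanup did what was intended.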

* **Note:** use **bracket notation** (`ri['column_name']`) on the left side of an assignment statement to create a new column or overwrite an existing one; dot notation is fine on the right side
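A minimal sketch of `.astype()` combined with the bracket-notation rule, assuming an `is_arrested` column that was read in as `object` even though it only holds `True`/`False` (the three rows are invented):

```python
import pandas as pd

# Hypothetical column stored as generic Python objects rather than booleans
ri = pd.DataFrame({
    'is_arrested': pd.Series([True, False, True], dtype='object'),
})
print(ri.is_arrested.dtype)

# Bracket notation on the left side overwrites the column;
# dot notation is used on the right side to read it
ri['is_arrested'] = ri.is_arrested.astype('bool')
print(ri.is_arrested.dtype)
```

Note that this conversion is only safe when the column holds actual boolean values; strings such as `'False'` would need a different approach.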

* `datetime` columns provide date-based attributes
* **`.str.cat()`** : concatenates strings element-wise (as opposed to `pd.concat()`, which joins pandas objects such as Series); use the argument `sep=' '` to separate with a space (or a character of your choice)
* **`pd.to_datetime()`** : converts strings to datetime values; if no time is given, it defaults to midnight
* **`.set_index()`** : sets a column as the index; use `inplace=True` to modify the DataFrame without an assignment statement
* **Note:** When an existing column becomes the index, it is no longer considered to be one of the dataframe columns.
* **`value_counts()`** : counts how often each unique value occurs in a Series; best suited for categorical rather than numerical data; use **`normalize=True`** to express counts as proportions
    * excludes missing values by default
    * to change this: **`.value_counts(dropna=False)`**
    * sometimes a single entry holds multiple values (for example, several search types for one stop), in which case they're separated by commas
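The datetime and counting steps above might be combined as in the following sketch. The three rows are invented, and `stop_datetime` and `combined` are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows; the real dataset has many more columns
ri = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', '2005-02-20'],
    'stop_time': ['12:55', '23:15', '01:20'],
    'stop_outcome': ['Citation', 'Citation', None],
})

# Join date and time strings element-wise, separated by a space
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert the combined strings to datetimes and use them as the index
ri['stop_datetime'] = pd.to_datetime(combined)
ri.set_index('stop_datetime', inplace=True)

# The column moved into the index, so it is no longer listed as a column
print('stop_datetime' in ri.columns)

# Date-based attributes are now available directly on the index
print(ri.index.month)

# Proportions (missing values excluded by default)
print(ri.stop_outcome.value_counts(normalize=True))

# Counts including missing values
print(ri.stop_outcome.value_counts(dropna=False))
```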

#### Filtering on a single condition
`female = ri[ri.driver_gender == 'F']`


#### Filtering on multiple conditions
`female_and_arrested = ri[(ri.driver_gender == 'F') & (ri.is_arrested == True)]`
* Each condition is wrapped in parentheses
* An ampersand (`&`) between conditions means logical AND; use a pipe (`|`) for logical OR

#### Other rules for filtering on multiple conditions
* Can use more than two conditions
* Conditions can also check for equality, inequality, greater than, less than, etc
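To illustrate these rules, here is a small sketch on invented rows. The `driver_age` column is assumed purely to demonstrate an inequality condition.

```python
import pandas as pd

# Hypothetical rows for illustration only
ri = pd.DataFrame({
    'driver_gender': ['F', 'M', 'F', 'M'],
    'is_arrested': [True, False, False, True],
    'driver_age': [22, 35, 41, 58],
})

# OR: keep rows where the driver is female or was arrested (or both)
female_or_arrested = ri[(ri.driver_gender == 'F') | (ri.is_arrested == True)]
print(len(female_or_arrested))

# Three conditions combined with AND, including a greater-than comparison
older_female_not_arrested = ri[
    (ri.driver_gender == 'F')
    & (ri.is_arrested == False)
    & (ri.driver_age > 30)
]
print(older_female_not_arrested)
```

The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python; without them the expression is evaluated in the wrong order.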


```
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender =='F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender =='M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
```

#### Math with boolean values
* **`np.mean([0, 1, 0, 0])` = `np.mean([False, True, False, False])`**
* True = 1; False = 0
* **The mean of a boolean Series is the proportion of `True` values**
* `ri.is_arrested.mean()` works because `ri.is_arrested` is type `bool`
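A short sketch of the arithmetic, using a made-up four-element boolean Series in place of `ri.is_arrested`:

```python
import numpy as np
import pandas as pd

# Booleans behave like 1 (True) and 0 (False) in arithmetic
mean_of_ints = np.mean([0, 1, 0, 0])
mean_of_bools = np.mean([False, True, False, False])
print(mean_of_ints == mean_of_bools)

# Hypothetical stand-in for ri.is_arrested
is_arrested = pd.Series([False, True, False, False])

# .sum() counts the True values; .mean() gives their proportion
arrest_count = is_arrested.sum()
arrest_rate = is_arrested.mean()
print(arrest_count, arrest_rate)
```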

#### Comparing groups using groupby
* `ri.groupby('district').is_arrested.mean()`

#### Grouping by multiple categories 
* `ri.groupby(['district', 'driver_gender']).is_arrested.mean()`
    * Note that if you switch ordering of 'district' and 'driver_gender', the results will end up being the same, but the presentation will be different
    
* `print(ri[ri.driver_gender == 'F'].search_conducted.mean())`
* `print(ri[ri.driver_gender == 'M'].search_conducted.mean())`
* `print(ri.groupby('driver_gender').search_conducted.mean())`
* `print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())`
* `print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())`
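The effect of switching the grouping order can be checked on a few invented rows, as in this sketch. The variable names `by_gender_first` and `by_violation_first` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only
ri = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'M', 'F'],
    'violation': ['Speeding', 'Equipment', 'Speeding',
                  'Speeding', 'Equipment', 'Speeding'],
    'search_conducted': [False, True, False, True, True, False],
})

# Search rate by gender, then violation
by_gender_first = ri.groupby(['driver_gender', 'violation']).search_conducted.mean()
print(by_gender_first)

# Same rates, but the outer index level is now the violation
by_violation_first = ri.groupby(['violation', 'driver_gender']).search_conducted.mean()
print(by_violation_first)
```

Both results carry a two-level index, so an individual rate is looked up with a tuple, for example `by_gender_first.loc[('M', 'Speeding')]`.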

#### Examining search types
* Goal: locate "Inventory" among multiple search types.
* to do this: search for a string

#### Searching for a string
* `ri['Inventory'] = ri.search_type.str.contains('Inventory', na=False)`
* **`str.contains()`** : returns `True` if the string is found and `False` if it is not
* the argument `na=False` makes it return `False` when it encounters a missing value
* as expected, the data type of `ri.Inventory` is `bool`
* a `True` value means an inventory was done; `False` means no inventory was done
* `ri.Inventory.sum()` = number of `True` values, i.e. the number of searches in which an inventory was the sole purpose **and/or** one of multiple purposes
* To calculate the proportion of searches that included an inventory:

```
searched = ri[ri.search_conducted == True]
searched.Inventory.mean()
```
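The string search described in this section can be tried end to end on a few made-up rows. The `search_type` values below are assumptions that follow the comma-separated pattern, not records from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: search_type is missing when no search was conducted
ri = pd.DataFrame({
    'search_conducted': [False, True, True, True],
    'search_type': [
        np.nan,
        'Inventory',
        'Incident to Arrest,Inventory',
        'Protective Frisk',
    ],
})

# na=False turns missing search types into False instead of NaN
ri['Inventory'] = ri.search_type.str.contains('Inventory', na=False)
print(ri.Inventory.dtype)

# Number of stops whose search included an inventory
print(ri.Inventory.sum())

# Restrict to stops with a search before taking the mean,
# so the denominator is searches rather than all stops
searched = ri[ri.search_conducted == True]
print(searched.Inventory.mean())
```

Taking the mean over all of `ri` instead of `searched` would answer a different question: the share of all stops that involved an inventory.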

```
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri.frisk.dtype)

# Take the sum of 'frisk'
print(ri.frisk.sum())
```

```
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())
```

#### Analyzing datetime data


* get special datetime attributes via the `dt` accessor
    * For example: `apple.date_time.dt.month`
* If you have a `DatetimeIndex`, you can access the same attributes, but *don't* need the `dt` accessor

#### Calculating monthly mean price
* `monthly_price = apple.groupby(apple.index.month).price.mean()`
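A minimal sketch of both access styles, assuming a small `apple` DataFrame with a `date_time` column and a `price` column (all values invented for illustration):

```python
import pandas as pd

# Hypothetical price data
apple = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2018-01-08 16:00', '2018-01-09 16:00',
        '2018-02-08 16:00', '2018-02-09 16:00',
    ]),
    'price': [10.0, 20.0, 30.0, 50.0],
})

# While date_time is a column, the dt accessor is required
months_from_column = apple.date_time.dt.month

# Once it is the index, the same attribute is available directly
apple.set_index('date_time', inplace=True)
months_from_index = apple.index.month

# Group by the month of the index and average the price
monthly_price = apple.groupby(apple.index.month).price.mean()
print(monthly_price)
```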

```
# Calculate the overall arrest rate
print(ri.is_arrested.mean())

# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())

# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
```

#### Plotting analyzed datetime data

```
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()

# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')

# Display the plot
plt.show()
```