# Analyzing Police Activity with pandas

Exercise created by Kevin Markham, founder of Data School
Download full dataset for any of the 31 states involved at: [https://openpolicing.stanford.edu/](https://openpolicing.stanford.edu/)

**Goals:**
* Explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. 
* Gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data.

**About the data:**

* Each row represents one traffic stop
* `NaN` values present

* **`.isnull()`** : returns a boolean array; see also **`.isna()`**
* **`.drop()`** : `ri.drop('column_name', axis = 'columns', inplace = True)`
* **`.dropna()`** : `ri.dropna(subset=['stop_date', 'stop_time'], inplace = True)` drops only the rows that have `NaN` in `stop_date` or `stop_time`
* **`.astype()`** : to change data type

* **Note:** you must use **bracket notation** on the left side of an assignment statement to create a new series or overwrite an existing one

* `datetime` columns provide date-based attributes
* **`.str.cat()`** : concatenates strings element-wise (as opposed to `pd.concat()`, which joins whole Series or DataFrames); use argument `sep=' '` to separate with a space (or character of your choice)
* **`pd.to_datetime()`** : convert a string to a datetime object; if no time is entered, defaults to midnight
* **`.set_index()`** to set a column as the index; use `inplace = True` to avoid assignment statement
* **Note:** When an existing column becomes the index, it is no longer considered to be one of the dataframe columns.
* **`value_counts()`** : counts the unique values in a Series; best suited for categorical rather than numerical data; use **`normalize = True`** to express counts as proportions
    * excludes missing values by default
    * to change this: **`.value_counts(dropna=False)`**
    * sometimes multiple values are relevant for a single entry, in which case they're separated by commas
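
The cleaning steps listed above can be sketched on a tiny made-up frame. This is only an illustration: the rows and values are invented, and only the column names `stop_date`, `stop_time`, and `driver_gender` come from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the real dataset (values are illustrative only)
ri = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23', None, '2005-02-20'],
    'stop_time': ['12:55', '23:15', '10:00', '17:15'],
    'driver_gender': ['M', 'F', 'F', np.nan],
})

# Drop rows that are missing either the date or the time
ri = ri.dropna(subset=['stop_date', 'stop_time'])

# Join the two string columns with a space, then parse the result as datetime
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)

# Move the new column into the index; it is no longer a regular column
ri = ri.set_index('stop_datetime')

# Proportion of each gender, keeping missing values in the tally
gender_share = ri.driver_gender.value_counts(normalize=True, dropna=False)
print(gender_share)
```

Assigning the result of `.dropna()` and `.set_index()` back to `ri` is an alternative to `inplace = True`; either style gives the same frame here.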

#### Filtering on a single condition
`female = ri[ri.driver_gender == 'F']`


#### Filtering on multiple conditions
`female_and_arrested = ri[(ri.driver_gender == 'F') & (ri.is_arrested == True)]`
* Each condition is wrapped in parentheses
* Use an ampersand (`&`) between conditions for AND, or a pipe (`|`) for OR

#### Other rules for filtering on multiple conditions
* Can use more than two conditions
* Conditions can also check for equality, inequality, greater than, less than, etc
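
A small sketch of these rules, using `|` for OR and a three-condition filter that mixes equality and inequality checks. The rows are made up, and `driver_age` is an assumed column name used only for illustration.

```python
import pandas as pd

# Made-up rows; 'driver_age' is a hypothetical column for this example
ri = pd.DataFrame({
    'driver_gender': ['F', 'M', 'F', 'M', 'F'],
    'is_arrested': [True, False, False, True, False],
    'driver_age': [22, 35, 41, 19, 30],
})

# OR: female drivers, or anyone who was arrested
female_or_arrested = ri[(ri.driver_gender == 'F') | (ri.is_arrested == True)]

# Three conditions, mixing equality and inequality checks
young_female_not_arrested = ri[
    (ri.driver_gender == 'F')
    & (ri.driver_age < 35)
    & (ri.is_arrested != True)
]

print(len(female_or_arrested))
print(len(young_female_not_arrested))
```

The parentheses matter because `&` and `|` bind more tightly than `==` and `<` in Python, so leaving them out changes what gets compared.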


```
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender =='F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender =='M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
```

#### Math with boolean values
* **`np.mean([0, 1, 0, 0])` = `np.mean([False, True, False, False])`**
* True = 1; False = 0
* **The mean of a boolean series represents the proportion of `True` values** (multiply by 100 for a percentage)
* `ri.is_arrested.mean()` works because `ri.is_arrested` is type `bool`
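
To make the arithmetic concrete, here is a minimal sketch with a made-up boolean Series standing in for `ri.is_arrested`:

```python
import numpy as np
import pandas as pd

# The two calls agree because True counts as 1 and False counts as 0
assert np.mean([0, 1, 0, 0]) == np.mean([False, True, False, False])

# Made-up boolean Series standing in for ri.is_arrested
is_arrested = pd.Series([False, True, False, False])

arrest_count = is_arrested.sum()   # number of True values
arrest_rate = is_arrested.mean()   # proportion of True values

print(arrest_count, arrest_rate)
```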

#### Comparing groups using groupby
* `ri.groupby('district').is_arrested.mean()`

#### Grouping by multiple categories 
* `ri.groupby(['district', 'driver_gender']).is_arrested.mean()`
    * Note that if you switch ordering of 'district' and 'driver_gender', the results will end up being the same, but the presentation will be different
    
* `print(ri[ri.driver_gender == 'F'].search_conducted.mean())`
* `print(ri[ri.driver_gender == 'M'].search_conducted.mean())`
* `print(ri.groupby('driver_gender').search_conducted.mean())`
* `print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())`
* `print(ri.groupby(['violation', 'driver_gender']).search_conducted.mean())`
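
The note about ordering can be checked directly. The sketch below uses made-up rows and shows that swapping the grouping keys changes only how the result is nested, not the numbers.

```python
import pandas as pd

# Made-up rows; column names follow the notes
ri = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K2', 'Zone K2', 'Zone K2'],
    'driver_gender': ['F', 'M', 'F', 'M', 'M'],
    'is_arrested': [False, True, False, False, True],
})

by_district = ri.groupby(['district', 'driver_gender']).is_arrested.mean()
by_gender = ri.groupby(['driver_gender', 'district']).is_arrested.mean()

# Same numbers, different nesting: swap the levels of one result and re-sort
same = by_gender.swaplevel().sort_index().equals(by_district)
print(by_district)
print(by_gender)
print(same)
```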

#### Examining search types
* Goal: locate "Inventory" among multiple search types.
* to do this: search for a string

#### Searching for a string
* `ri['Inventory'] = ri.search_type.str.contains('Inventory', na=False)`
* **`str.contains()`** : returns `True` if string is found and `False` if not found.
* argument `na=False` makes it return `False` when it finds a missing value
* as should be expected, the data type for `ri.Inventory` is `bool`
* a `True` value means an inventory was done; `False` = no inventory done
* `ri.Inventory.sum()` = number of `True` values; this number of `True` values represents the number of times a search was done for Inventory as the sole purpose **and/or** as one of multiple purposes
* To calculate the proportion of searches that included an inventory:

```
searched = ri[ri.search_conducted == True]
searched.Inventory.mean()
```

```
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri.frisk.dtype)

# Take the sum of 'frisk'
print(ri.frisk.sum())
```

```
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())
```

#### Analyzing datetime data


* get special datetime attributes via the `dt` accessor
    * For example: `apple.date_time.dt.month`
* If you have a `DatetimeIndex`, you can access the same attributes, but *don't* need the `dt` accessor

#### Calculating monthly mean price
* `monthly_price = apple.groupby(apple.index.month).price.mean()`
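
A minimal sketch of both access styles, using a made-up `apple` frame (the prices are invented; the names `apple`, `date_time`, and `price` mirror the notes):

```python
import pandas as pd

# Made-up prices for illustration
apple = pd.DataFrame({
    'date_time': pd.to_datetime(['2018-01-08 16:00', '2018-01-09 16:00',
                                 '2018-02-08 16:00', '2018-02-09 16:00']),
    'price': [174.35, 174.33, 155.15, 156.41],
})

# Column of datetimes: attributes live behind the .dt accessor
months_from_column = apple.date_time.dt.month

# DatetimeIndex: the same attribute is available directly on the index
apple = apple.set_index('date_time')
months_from_index = apple.index.month

# Group by the month number taken from the index
monthly_price = apple.groupby(apple.index.month).price.mean()
print(monthly_price)
```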

```
# Calculate the overall arrest rate
print(ri.is_arrested.mean())

# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())

# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
```

#### Plotting analyzed datetime data

```
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()

# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')

# Display the plot
plt.show()
```

#### Using subplots to examine the relationship between multiple variables over time
* `apple.groupby(apple.index.month).price.mean()`
* Alternative method: Resampling
* **Resampling** is when you change the frequency of your time series operations
* Resample the price column by month:
    * `apple.price.resample('M').mean()`
* Resample to find the mean daily volume for each month:
    * `apple.volume.resample('M').mean()`
* **Concatenating price and volume:**

```
monthly_price = apple.price.resample('M').mean()
monthly_volume = apple.volume.resample('M').mean()
monthly = pd.concat([monthly_price, monthly_volume], axis='columns')
monthly.plot(subplots=True)
plt.show()
```
* Above results in two separate plots with independent y-axes.
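
One difference between the two approaches shows up once the data spans more than one year. The sketch below uses made-up prices. It also assumes the `'MS'` (month-start) alias instead of the `'M'` used in these notes, because the spelling of the month-end alias differs between pandas versions (`'M'` in older releases, `'ME'` in newer ones), whereas `'MS'` is accepted by both.

```python
import pandas as pd

# Made-up prices spanning two years
idx = pd.to_datetime(['2017-01-10', '2017-01-20', '2017-02-15',
                      '2018-01-10', '2018-02-15'])
apple = pd.DataFrame({'price': [100.0, 110.0, 120.0, 200.0, 220.0]}, index=idx)

# Grouping by month number pools January 2017 with January 2018
by_month_number = apple.groupby(apple.index.month).price.mean()

# Resampling keeps each calendar month separate; months with no rows become NaN
by_calendar_month = apple.price.resample('MS').mean()

print(by_month_number)
print(by_calendar_month.dropna())
```

With a single year of data the two results contain the same means, which is why the notes treat resampling as an alternative.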

```
# Calculate the annual rate of drug-related stops
print(ri.drugs_related_stop.resample('A').mean())

# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()

# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()

# Display the plot
plt.show()
```

```
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()

# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')

# Create subplots from 'annual'
annual.plot(subplots=True)

# Display the subplots
plt.show()
```

#### Computing a frequency table
* **`pd.crosstab()`** : pass two pandas series that represent categories and this outputs a frequency table in the form of a dataframe
    * `pd.crosstab(ri.driver_race, ri.driver_gender)`
    * first argument will be along the index, second argument will be along the columns
    * switch order of arguments to transpose df
* A **frequency table** is like a tally of how many times each combination of values occurs in the dataset

* `table = pd.crosstab(ri.driver_race, ri.driver_gender)`
* **`.loc[]`** accessor allows you to select from a dataframe by label

* `table.loc['Asian':'Hispanic']`
* `table.plot(kind='bar')` : a bar plot, because the values are categorical, not continuous

* For a variation of the bar plot: **Stacked Bar Plot**
    * `table.plot(kind='bar', stacked=True)`
    * The strength of this plot is that it helps you to see the totals for each category
    * However, this plot slightly deemphasizes the individual components of each bar and makes them harder to compare against one another
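
The points about argument order, label-based slicing, and category totals can be sketched without plotting. The rows below are made up; the column names follow the notes.

```python
import pandas as pd

# Made-up rows for illustration
ri = pd.DataFrame({
    'driver_race': ['Asian', 'Black', 'Hispanic', 'White', 'White', 'Black'],
    'driver_gender': ['F', 'M', 'M', 'F', 'M', 'M'],
})

# First argument runs along the index, second along the columns
table = pd.crosstab(ri.driver_race, ri.driver_gender)

# Swapping the arguments gives the transposed table
swapped = pd.crosstab(ri.driver_gender, ri.driver_race)

# Label-based slicing with .loc includes BOTH endpoints
subset = table.loc['Asian':'Hispanic']

# Row totals are what a stacked bar plot makes visible
totals = table.sum(axis='columns')
print(table)
print(totals)
```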
    
```
# Create a frequency table of districts and violations
print(pd.crosstab(ri.district, ri.violation))

# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)

# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1':'Zone K3'])

# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1':'Zone K3']
```

#### Mapping one set of values to another
* Dictionary maps the values you have to the values you want

```
mapping = {'up': True, 'down': False}
apple['is_up'] = apple.change.map(mapping)
```
* Now that we have a Boolean column, we can calculate how often apple stock prices went up by taking the mean of the Boolean column.

* `apple.is_up.mean()`
* Outputs: 0.5 = 50%
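
One detail worth knowing when mapping with a dictionary: a value that is not a key in the dictionary becomes `NaN` instead of raising an error. A minimal sketch with made-up data (the `'flat'` value is invented to trigger that case):

```python
import pandas as pd

# Made-up values; 'change' mirrors the column name used in the notes
apple = pd.DataFrame({'change': ['up', 'down', 'up', 'down', 'flat']})

mapping = {'up': True, 'down': False}
apple['is_up'] = apple.change.map(mapping)

# 'flat' is not a key in the mapping, so it becomes NaN rather than raising
n_unmapped = apple.is_up.isna().sum()

# Drop the unmapped rows before averaging so the result is a clean proportion
share_up = apple.is_up.dropna().astype(bool).mean()
print(n_unmapped, share_up)
```

Checking for `NaN` after a `.map()` call is a cheap way to catch values the dictionary missed.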

#### Calculating the search rate
* Visualize how often searches were done after each violation type
* `search_rate = ri.groupby('violation').search_conducted.mean()`
* This calculates the search rate for each of the six violation types and returns a series that is sorted in alphabetical order
* Visualize with: (since we're comparing the search rate across categories)
    * `search_rate.plot(kind='bar')`

#### Ordering the bars left to right by size:
* Makes the plot easier to understand
* Makes values easier to compare
* `search_rate.sort_values().plot(kind='bar')`

#### Rotating the bars
* Draw the bars horizontally, so the category labels run along the y-axis
* `search_rate.sort_values().plot(kind='barh')`

```
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())

# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)

# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())
```

```
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())

# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()

# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')

# Display the plot
plt.show()
```


# Analyzing the effect of weather on policing
* Hypothesis: Weather impacts police behavior during traffic stops

#### Creating a box plot 
* A box plot is essentially a visual representation of the summary statistics
* Box plots in place of `.describe()`
* **Many natural phenomena have a normal distribution**
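
To see how a box plot relates to the summary statistics, the sketch below computes the quantities a box plot draws. The temperatures are made up, and the 1.5 × IQR whisker rule is matplotlib's default setting, stated here as an assumption about the plotting backend.

```python
import pandas as pd

# Made-up temperatures in degrees Fahrenheit
tavg = pd.Series([30, 42, 45, 51, 55, 58, 63, 70, 95], name='TAVG')

summary = tavg.describe()
q1, median, q3 = summary['25%'], summary['50%'], summary['75%']
iqr = q3 - q1  # height of the box

# Whisker limits under the default 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a box plot draws individually
outliers = tavg[(tavg < lower_fence) | (tavg > upper_fence)]
print(summary)
print(outliers)
```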

```
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv('weather.csv')

# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())

# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')

# Display the plot
plt.show()
```
`weather.WDIFF.plot(kind='hist', bins=20)`

```
# Create a 'TDIFF' column that represents temperature difference
weather['TDIFF'] = weather.TMAX - weather.TMIN

# Describe the 'TDIFF' column
print(weather.TDIFF.describe())

# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind='hist', bins=20)

# Display the plot
plt.show()
```

#### Categorizing the weather
* `temp = weather.loc[:,'TAVG':'TMAX']`
* if you use `.sum()` on a dataframe, pandas will return the sum of each of the columns in the dataframe
* to sum along rows: `temp.sum(axis='columns')`. This looks backwards, but `axis` names the array dimension being aggregated: to get one total per row, you combine values across the columns
* Whenever you have an object column with a small number of possible values, you may want to change its data type to category (this stores the data more efficiently than the object type)
* Also: **changing to category type allows you to specify a logical order for the categories**
* To calculate memory usage: `ri.stop_length.memory_usage(deep=True)`
* changing data type from object to category:

```
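cats = ['short', 'medium', 'long']  # in logical order of the categories
# Passing 'ordered' and 'categories' straight to .astype() was removed from pandas;
# build a CategoricalDtype and pass that instead
cat_type = pd.CategoricalDtype(categories=cats, ordered=True)
ri['stop_length'] = ri.stop_length.astype(cat_type)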
```
* When categories are ordered, you can use the comparison operators
    * Additionally, pandas will automatically sort categories logically instead of alphabetically (as it would with unordered categories)
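
A minimal sketch of what an ordered category allows, using a made-up Series in place of the `stop_length` column:

```python
import pandas as pd

# Made-up labels standing in for a 'stop_length' column
stop_length = pd.Series(['short', 'long', 'medium', 'short', 'long'])

cats = ['short', 'medium', 'long']  # logical order, not alphabetical
ordered = stop_length.astype(pd.CategoricalDtype(categories=cats, ordered=True))

# Comparison operators now follow the declared order
longer_than_short = ordered[ordered > 'short']

# Sorting follows the declared order as well
sorted_labels = ordered.sort_values().tolist()

# Memory footprint of each version; on a handful of rows the category version
# may not be smaller, since the saving only shows up on large columns
object_bytes = stop_length.memory_usage(deep=True)
category_bytes = ordered.memory_usage(deep=True)
print(sorted_labels, object_bytes, category_bytes)
```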
    
```
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, 'WT01':'WT22']

# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis ='columns')

# Replace missing values in 'bad_conditions' with '0'
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')

# Create a histogram to visualize 'bad_conditions'
weather.bad_conditions.plot(kind='hist')

# Display the plot
plt.show()
```
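
The row-wise sum used above can be tried on a small made-up frame. The sketch assumes indicator columns where `1` means a condition was observed and `NaN` means it was not; the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up indicator columns: 1 = condition observed, NaN = not observed
WT = pd.DataFrame({
    'WT01': [1, np.nan, np.nan],
    'WT02': [1, 1, np.nan],
    'WT03': [np.nan, np.nan, np.nan],
})

column_totals = WT.sum()              # default: one value per column
row_totals = WT.sum(axis='columns')   # one value per row

# NaN values are skipped, so a row with no observations sums to 0.0
bad_conditions = row_totals.astype('int')
print(column_totals.tolist())
print(bad_conditions.tolist())
```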
    

#### Merging datasets

* `apple_high = pd.merge(left = apple, right = high, left_on = 'date', right_on = 'DATE', how = 'left')`
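
A minimal sketch of this merge on two made-up frames (the values are invented; the names `apple`, `high`, `date`, and `DATE` mirror the line above, and `HIGH` is an assumed column name):

```python
import pandas as pd

# Made-up frames for illustration
apple = pd.DataFrame({
    'date': ['2018-02-14', '2018-02-15', '2018-02-16'],
    'price': [167.37, 172.99, 172.43],
})
high = pd.DataFrame({
    'DATE': ['2018-02-14', '2018-02-15'],
    'HIGH': [167.54, 173.09],  # hypothetical column name
})

# left_on / right_on name the key column in each frame
apple_high = pd.merge(left=apple, right=high,
                      left_on='date', right_on='DATE', how='left')

# how='left' keeps every row of 'apple'; unmatched rows get NaN from 'high'
print(apple_high)
```

Because the key columns have different names, both `date` and `DATE` appear in the result.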

#### Driver gender and vehicle searches
* `ri.search_conducted.mean()`
* `ri.groupby('driver_gender').search_conducted.mean()`
* `ri.groupby(['violation', 'driver_gender']).search_conducted.mean()`

----------------

* `search_rate=ri.groupby(['violation', 'driver_gender']).search_conducted.mean()`
* `type(search_rate)` = `pandas.core.series.Series` (with a multi-index)
* violation and driver_gender are not columns but indices
* With a dataframe, which usually has 2 dimensions, a multi-index adds a 3rd dimension
* With a pandas Series, which usually has 1 dimension, a multi-index adds a 2nd dimension
* Working with a multi-index series is actually very similar to working with a dataframe
    * You can think of the outer index level as the dataframe rows and the inner index level as the dataframe column 
    * Can use `.loc` accessor on multi-index series
* To convert a multi-index series into a DataFrame, use **`.unstack()`**
    * `search_rate.unstack()`
* **OR**
* Use **`.pivot_table()`**
    * `ri.pivot_table(index='violation', columns='driver_gender', values='search_conducted')`
        * Note: values are the **mean of search_conducted**
        * mean is the default aggregation function for a pivot table, but you can choose another function instead
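
The two routes to the same table can be compared on a few made-up rows (values invented; column names follow the notes):

```python
import pandas as pd

# Made-up rows for illustration
ri = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Equipment', 'Equipment'],
    'driver_gender': ['F', 'M', 'M', 'F', 'M'],
    'search_conducted': [False, True, False, True, True],
})

search_rate = ri.groupby(['violation', 'driver_gender']).search_conducted.mean()

# .loc on a multi-index Series: outer level first, then inner level
speeding_male = search_rate.loc['Speeding', 'M']

# Route 1: unstack the inner index level into columns
unstacked = search_rate.unstack()

# Route 2: build the same table directly with a pivot table (mean by default)
pivoted = ri.pivot_table(index='violation', columns='driver_gender',
                         values='search_conducted')
print(unstacked)
print(pivoted)
```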