
# Traffic Stops Data Analysis Assignment

In this session we will explore a **real** dataset from the Stanford Open Policing Project.
The project collects and standardizes data on vehicle and pedestrian stops from law enforcement departments across the US and makes that information freely available. It has already gathered 130 million records from 31 state police agencies and has begun collecting data on stops from law enforcement agencies in major cities as well.

You can read more about the project on the [project website](https://openpolicing.stanford.edu).

![open police](https://openpolicing.stanford.edu/img/logo_large_w_padding.png?width=20pc)


## Objective

In this assignment, you will explore and analyze a dataset of traffic stops. Your objective is to answer questions on five themes: Demographic Disparities, Types and Outcomes of Violations, Temporal Patterns, Stop Duration and Location, and Hypothesis Testing and Social Implications.

## Data

You will use the `RI-clean.csv.gz` dataset, a cleaned dataset of traffic stops, the drivers involved, and the outcomes. A code snippet to load and preprocess it is provided.

## Requirements

- Import required libraries
- Load and preprocess the dataset
- Perform exploratory data analysis (EDA)
- Answer specific questions related to the themes provided
- Use visualizations to support your findings
- Conduct hypothesis testing
- Interpret the results

## Libraries to Import

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
```

## Load and Preprocess Data

Use the provided code to load and preprocess the dataset. This includes handling missing values, dropping irrelevant columns, and converting columns to appropriate data types.
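
Before deciding which columns to drop or fill, it can help to look at how much of each column is missing. The sketch below shows one way to do that on a tiny made-up frame; the name `toy` and its values are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from RI-clean.csv.gz.
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", None, "2005-02-01"],
    "driver_gender": ["M", None, "M", "F"],
    "county_name": [None, None, None, None],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns that are entirely empty are candidates for dropping.
empty_cols = missing_share[missing_share == 1.0].index.tolist()
print(empty_cols)
```

The same two lines should work on the full `data` frame once it is loaded.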

## Questions to Answer

### Theme 1: Demographic Disparities in Traffic Stops

- What is the distribution of driver genders and races among the traffic stops?
- Are certain races disproportionately represented?
- What is the distribution of stops by driver age?
- Are certain groups more likely to be searched than others?
- Does the likelihood of arrest vary by race, gender, or age?

### Theme 2: Types and Outcomes of Violations

- What types of violations lead to stops?
- What are the most common outcomes of a stop (e.g., warning, citation, arrest)?
- Does the outcome of stops vary by violation type?
- Does the type of violation differ by driver gender or driver race?

### Theme 3: Temporal Patterns in Traffic Stops

- Has the number of stops increased or decreased over time?
- Are there any seasonal trends in the number or type of stops?
- Have arrest rates or search rates changed over time?
- At what times of the day are stops most frequent?

### Theme 4: Stop Duration and Location

- What is the average duration of a stop?
- Are certain areas more prone to specific types of stops or outcomes?

### Theme 5: Hypothesis Testing and Social Implications

- Is there a significant difference in stop outcomes or search rates between two specific demographic groups? (Use chi-squared tests)
- Can you disprove any commonly held beliefs or biases about traffic stops? (Use chi-squared tests)

## Visualization

Use Matplotlib and Seaborn libraries to visualize your findings.

## Interpretation

For each chi-squared test, compare the p-value against a significance level chosen in advance (0.05 in the code below) and state whether the difference between the groups in question is statistically significant.
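
As a rough illustration of that decision rule, here is a minimal sketch on a made-up 2x2 table. The counts, the group names, and the variable `alpha` are assumptions for illustration; they are not results from the traffic stops data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts for illustration only; not taken from the dataset.
table = pd.DataFrame(
    {"searched": [30, 60], "not_searched": [970, 940]},
    index=["group_a", "group_b"],
)

# Returns the test statistic, the p-value, the degrees of freedom,
# and the counts expected under independence.
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # conventional threshold, assumed here
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis of independence at the 5% level.")
else:
    print("No evidence against independence at the 5% level.")
```

A small p-value indicates that the two variables are associated; it does not say why, and with very large samples even small differences in rates can come out as significant, so it is worth reporting the rates themselves alongside the test.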


In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Set the Seaborn theme (whitegrid style, pastel palette)
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Load data
data_path = 'https://sds-aau.github.io/SDS-master/M1/data/RI-clean.csv.gz'
data = pd.read_csv(data_path, low_memory=False)

In [None]:
# Check data
data.info()

In [None]:
# Handle missing and irrelevant columns
data.drop(['id', 'state', 'county_name', 'county_fips', 'fine_grained_location'], axis='columns', inplace=True)
data.dropna(subset=['stop_date', 'stop_time', 'driver_gender'], inplace=True)

In [None]:
# Convert to proper data types and create datetime index
# Note: astype('bool') turns NaN into True, so check these columns for missing values first
data['is_arrested'] = data['is_arrested'].astype('bool')
data['search_conducted'] = data['search_conducted'].astype('bool')
data['datetime'] = pd.to_datetime(data['stop_date'] + ' ' + data['stop_time'])
data.set_index('datetime', inplace=True)

In [None]:
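# Handle missing values
# Assign the result back: fillna(inplace=True) on a column selection may only modify a temporary copy
data['out_of_state'] = data['out_of_state'].fillna(data['out_of_state'].mode()[0])
age_cols = ['driver_age', 'driver_age_raw']
data[age_cols] = data[age_cols].fillna(data[age_cols].median())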

In [None]:
# Create age categories (right-closed bins: (10, 20], (20, 30], ..., (50, 100])
labels = ["teen", "20s", "30s", "40s", "50+"]
bins = [10, 20, 30, 40, 50, 100]
# Passing bin edges (rather than an IntervalIndex) lets pd.cut apply the labels
data['age_cat'] = pd.cut(data['driver_age'], bins=bins, labels=labels)
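
To see how `pd.cut` treats bin edges and out-of-range values, here is a small sketch on made-up ages. The series `ages` is an assumption for illustration; the bin edges and labels mirror the ones used for `age_cat`.

```python
import pandas as pd

# Made-up ages chosen to hit bin edges and fall outside the range.
ages = pd.Series([16, 20, 21, 45, 73, 105])
labels = ["teen", "20s", "30s", "40s", "50+"]

# Bins are right-closed by default, so 20 falls into (10, 20].
age_cat = pd.cut(ages, bins=[10, 20, 30, 40, 50, 100], labels=labels)
print(age_cat.tolist())

# Values outside all bins (here 105) become NaN.
print(age_cat.isna().sum())
```

Note that with these edges a 20-year-old is labeled "teen"; shifting the edges or passing `right=False` would change that.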

In [None]:
# Theme 1: Demographic Disparities in Traffic Stops
## Distribution of driver genders and races among the traffic stops
gender_distribution = data['driver_gender'].value_counts(normalize=True) * 100
race_distribution = data['driver_race'].value_counts(normalize=True) * 100

In [None]:
gender_distribution

In [None]:
race_distribution

In [None]:
# Plot the gender distribution
plt.figure(figsize=(8, 6))
sns.barplot(x=gender_distribution.index, y=gender_distribution.values)
plt.title('Gender Distribution in Traffic Stops (%)')
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.show()

In [None]:
# Create subfigures
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Plot the gender distribution on the first subplot
sns.barplot(x=gender_distribution.index, y=gender_distribution.values, ax=axes[0])
axes[0].set_title('Gender Distribution in Traffic Stops (%)')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Percentage')

# Plot the race distribution on the second subplot
sns.barplot(x=race_distribution.index, y=race_distribution.values, ax=axes[1])
axes[1].set_title('Race Distribution in Traffic Stops (%)')
axes[1].set_xlabel('Race')
axes[1].set_ylabel('Percentage')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
## Distribution of stops by driver age
age_distribution_stops = data['age_cat'].value_counts(normalize=True) * 100
age_distribution_stops

In [None]:
## Are certain groups more likely to be searched than others?
search_rate_by_group = data.groupby(['driver_race', 'driver_gender', 'age_cat'])['search_conducted'].mean() * 100
search_rate_by_group

In [None]:
search_rate_by_group.sort_values(ascending=False)

In [None]:
# Convert the grouped Series to a DataFrame and reset the index for plotting
# (the bar plot below averages the per-gender rates within each race and age category)
search_rate_df = search_rate_by_group.reset_index()
search_rate_df.columns = ['Driver Race', 'Driver Gender', 'Age Category', 'Search Rate']

plt.figure(figsize=(12, 8))
sns.barplot(x='Driver Race', y='Search Rate', hue='Age Category', data=search_rate_df)
plt.title('Search Rate by Race and Age (%)')
plt.xlabel('Driver Race')
plt.ylabel('Search Rate (%)')
plt.show()

In [None]:
## Does the likelihood of arrest vary by race, gender, or age?
arrest_rate_by_group = data.groupby(['driver_race', 'driver_gender', 'age_cat'])['is_arrested'].mean() * 100
arrest_rate_by_group

In [None]:
# Theme 2: Types and Outcomes of Violations
## What types of violations lead to stops?
violation_distribution = data['violation'].value_counts(normalize=True) * 100
violation_distribution

In [None]:
## What are the most common outcomes of a stop?
outcome_distribution = data['stop_outcome'].value_counts(normalize=True) * 100
outcome_distribution

In [None]:
## Does the outcome of stops vary by violation type?
outcome_by_violation = data.groupby('violation')['stop_outcome'].value_counts(normalize=True) * 100
outcome_by_violation

In [None]:
## Does the type of violation differ by driver gender or driver race?
violation_by_demographics = data.groupby(['driver_race', 'driver_gender'])['violation'].value_counts(normalize=True) * 100
violation_by_demographics

In [None]:
# Theme 3: Temporal Patterns in Traffic Stops
## Has the number of stops increased or decreased over time?
plt.figure(figsize=(12, 6))
data.resample('Y').size().plot()
plt.title('Number of Traffic Stops Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Stops')
plt.show()

In [None]:
## Are there any seasonal trends in the number of stops?
plt.figure(figsize=(12, 6))
data.resample('M').size().plot()
plt.title('Seasonal Trends in Traffic Stops')
plt.xlabel('Month')
plt.ylabel('Number of Stops')
plt.show()

In [None]:
## What's the difference here?
plt.figure(figsize=(12, 6))
data.groupby(data.index.month).size().plot()
plt.title('Seasonal Trends in Traffic Stops')
plt.xlabel('Month')
plt.ylabel('Number of Stops')
plt.show()
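
On the question posed in the cell above: resampling by month yields one count per calendar month of each year (a time series), whereas grouping by `index.month` pools the same month number across all years (a seasonal profile). A minimal sketch on made-up timestamps, where `toy` is a hypothetical stand-in for `data` and `"MS"` (month start) is used as the resampling frequency:

```python
import pandas as pd

# Made-up timestamps: two stops in Jan 2005, one in Feb 2005, one in Jan 2006.
idx = pd.to_datetime(["2005-01-03", "2005-01-20", "2005-02-11", "2006-01-05"])
toy = pd.DataFrame({"stop": 1}, index=idx)

# One count per calendar month per year, including empty months in between.
per_month_in_time = toy.resample("MS").size()

# One count per month number (1-12), pooled across all years.
per_month_of_year = toy.groupby(toy.index.month).size()

print(len(per_month_in_time))
print(per_month_of_year)
```

The first result has an entry for every month from January 2005 to January 2006; the second has only two entries, for months 1 and 2.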

In [None]:
## Have arrest rates or search rates changed over time?
# Omitted for brevity, can be implemented similarly to the above plots

## At what times of the day are stops most frequent?
data['hour'] = data.index.hour
hourly_stops = data['hour'].value_counts().sort_index()
hourly_stops
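
For the arrest and search rate question left open in the cell above, one possible approach is to group by year and take the mean of the boolean columns. The sketch below uses made-up rows; `toy` and `yearly_rates` are hypothetical names, and the values are not results from the dataset.

```python
import pandas as pd

# Made-up stops for illustration only.
idx = pd.to_datetime(["2005-03-01", "2005-09-15", "2006-02-10", "2006-11-30"])
toy = pd.DataFrame(
    {
        "is_arrested": [True, False, False, False],
        "search_conducted": [True, True, False, True],
    },
    index=idx,
)

# The mean of a boolean column is the share of True values.
rate_cols = ["is_arrested", "search_conducted"]
yearly_rates = toy.groupby(toy.index.year)[rate_cols].mean() * 100
print(yearly_rates)
```

Applied to `data`, the same `groupby` followed by `.plot()` should give one line per rate over the years.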

In [None]:
data['stop_duration'].value_counts()

In [None]:
# Theme 4: Stop Duration and Location
## What is the average duration of a stop?

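# Map each 'stop_duration' bin to an approximate number of minutes.
# The first two values are bin midpoints; 45 for the open-ended '30+ Min' bin is an arbitrary choice.
duration_mapping = {
    '0-15 Min': 7.5,
    '16-30 Min': 23,
    '30+ Min': 45,
    '2': None,  # Unclear what this represents; treated as missing
    '1': None   # Unclear what this represents; treated as missing
}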

# Apply the mapping to create a new 'stop_duration_num' column
data['stop_duration_num'] = data['stop_duration'].map(duration_mapping)

# Drop rows where 'stop_duration_num' or 'district' is NaN
# Note: this overwrites `data`, so all later cells see only the remaining rows
data = data.dropna(subset=['stop_duration_num', 'district'])

# Check the first few rows to see if the mapping worked
data[['stop_duration', 'stop_duration_num']].head()

In [None]:
average_duration = data['stop_duration_num'].mean()
average_duration

In [None]:
# Create a boxplot to visualize 'stop_duration_num' against 'district'
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='district', y='stop_duration_num')
plt.title('Stop Duration by District')
plt.xlabel('District')
plt.ylabel('Stop Duration (minutes)')
plt.show()


In [None]:
# Theme 5: Hypothesis Testing and Social Implications

# Question 1: Is there a significant difference in stop outcomes between White and Black drivers?
## Chi-Squared Test: Race and Stop Outcomes
filtered_data_race = data[data['driver_race'].isin(['White', 'Black'])]
crosstab_race_outcome = pd.crosstab(filtered_data_race['driver_race'], filtered_data_race['stop_outcome'])
crosstab_race_outcome

In [None]:
chi2, p_value_race_outcome, _, _ = chi2_contingency(crosstab_race_outcome)
print("Chi-squared test result for race and stop outcomes:")
print("Chi2:", chi2)
print("P-value:", p_value_race_outcome)

In [None]:
# Question 2: Are younger drivers more likely to be searched? (Commonly held belief)
## Chi-Squared Test: Age and Search Rates
crosstab_age_search = pd.crosstab(data['age_cat'], data['search_conducted'])
chi2, p_value_age_search, _, _ = chi2_contingency(crosstab_age_search)
print("Chi-squared test result for age categories and search rates:")
print("Chi2:", chi2)
print("P-value:", p_value_age_search)
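
The chi-squared test above detects an association between age category and searches but not its direction. To judge whether younger drivers are searched more often, the row-normalized crosstab can be inspected alongside the test. A minimal sketch on made-up rows, where `toy` and the column names `not_searched` and `searched` are assumptions for illustration:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age_cat": ["teen", "teen", "20s", "20s", "20s", "30s"],
    "search_conducted": [True, False, False, False, True, False],
})

# normalize="index" turns each row into shares that sum to 1.
ct = pd.crosstab(toy["age_cat"], toy["search_conducted"], normalize="index")

# Columns come out ordered False, True; rename them for readability.
ct.columns = ["not_searched", "searched"]
print(ct["searched"].sort_values(ascending=False))
```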

In [None]:
# Interpretation of the results
print("\nInterpretation:")
if p_value_race_outcome < 0.05:
    print("There is a significant difference in stop outcomes between White and Black drivers.")
else:
    print("There is no significant difference in stop outcomes between White and Black drivers.")

if p_value_age_search < 0.05:
    print("There is a significant difference in search rates among different age groups.")
else:
    print("There is no significant difference in search rates among different age groups.")