## **Exploratory Data Analysis in Python: Stanford Open Policing Project**

* This is a dataset on traffic and pedestrian stops by police in Rhode Island, from [The Stanford Open Policing Project](https://openpolicing.stanford.edu/).

* This project is an exercise in exploratory data analysis with pandas in Python.

* In this project, I'm going to answer the following questions:

1. Do men or women speed more often? 
2. Does gender affect who gets a ticket for speeding?
3. Does gender affect whose vehicle is searched?
4. Does gender affect who is frisked during a search?
5. Does time of day affect arrest rate?
6. Are drug-related stops on the rise?
7. Which violations are recorded for each driver race?
8. How long might you be stopped for a violation?
9. Which year had the fewest stops?

* Most of the analysis is based on the course [Analyzing Police Activity with pandas](https://campus.datacamp.com/courses/analyzing-police-activity-with-pandas/preparing-the-data-for-analysis?ex=1).


In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
pd.set_option("display.precision", 3)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.max_rows", 30)

In [None]:
Input = '/kaggle/input/stanford-open-policing-project/police_project.csv'
data= pd.read_csv(Input)
data.head()

In [None]:
print(data.isnull().sum())
print(data.shape)

* The data contains 91741 rows and 15 columns, as the shape printed above shows.

In [None]:
print(data.columns)

In [None]:
profile = ProfileReport(data, 
                        title="Policing Profiling Report",
                       interactions=None,
                       duplicates=None)
profile

## **Preparing the Data (ETL)**

Before beginning your analysis, it is critical that you first examine and clean the dataset.

### 1. Drop county_name column

The `county_name` column has no values (91741 missing, the same as the number of rows in the data), so I decided to drop it.

In [None]:
data.drop('county_name', axis='columns', inplace=True)
print(data.shape)

Now there are 91741 rows and 14 columns in the data.
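A column that is missing everywhere can also be found programmatically instead of by reading the `isnull().sum()` output. The sketch below uses a made-up three-row frame (the column names `county`, `age` and `gender` are placeholders, not the real dataset) to show one way to list such columns and drop them in one step.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 'county' is entirely missing, 'age' only partly.
toy = pd.DataFrame({
    "county": [np.nan, np.nan, np.nan],
    "age": [25.0, np.nan, 40.0],
    "gender": ["F", "M", "M"],
})

# Boolean Series: True for columns in which every value is missing
all_missing = toy.isnull().all()
empty_cols = all_missing[all_missing].index.tolist()
print(empty_cols)

# Drop only the columns that are entirely missing; 'age' is kept
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.shape)
```

With `how="all"`, a column that has at least one value is left alone, so partially missing columns still need a separate decision.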

### 2. Drop missing rows

In [None]:
print(data.isnull().sum())

Since `driver_gender` will be critical to many of these analyses, and only a small fraction of rows are missing it, I decided to delete those rows.

In [None]:
data.dropna(subset=['driver_gender'],inplace= True)

In [None]:
print(data.isnull().sum())

In [None]:
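# Assign the result back instead of calling fillna(inplace=True) on the attribute,
# which can act on a temporary copy and leave `data` unchanged in newer pandas.
data['driver_age'] = data['driver_age'].fillna(data['driver_age'].mean())
data['driver_age_raw'] = data['driver_age_raw'].fillna(data['driver_age_raw'].mean())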
print(data.isnull().sum())
print(data.shape)

### 3. Fix Data Type

In [None]:
print(data.dtypes)

In [None]:
# When assigning to columns, only the square brackets notation works.
data['is_arrested'] = data.is_arrested.astype('bool')

In [None]:
print(data.dtypes)
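One thing worth checking before a cast like this: `astype('bool')` does not preserve missing values. The sketch below uses a made-up three-element column to show that `NaN` is converted to `True`, which is why I look at `isnull().sum()` first. Whether the real `is_arrested` column still has missing values at this point depends on the earlier cleaning steps.

```python
import numpy as np
import pandas as pd

# Toy column (made up) with one missing value; dtype is object
arrested = pd.Series([True, False, np.nan], dtype="object")

# NaN is truthy in Python, so a plain cast turns the missing value into True
as_bool = arrested.astype("bool")
print(as_bool.tolist())

# Counting missing values first tells you whether the cast is safe
n_missing = arrested.isnull().sum()
print(n_missing)
```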

### 4. Create a Date-time Index

1. Combine `stop_date` and `stop_time` into one column
2. Convert it to the date-time format
3. Set it as index
4. Drop the redundant columns

In [None]:
combine = data.stop_date.str.cat(data.stop_time, sep=' ')

# Convert to datetime
data['stop_datetime']= pd.to_datetime(combine)
print(data.stop_datetime.dtypes)

# Set as Index
data.set_index('stop_datetime', inplace=True)
print(data.index)

In [None]:
# Drop redundant columns
data.drop(['stop_date','stop_time'], axis='columns', inplace= True)

View the data again before analyzing.

In [None]:
print(data.info())

## Q1. Do men or women speed more often?

Plot the number of men and women:

In [None]:
sns.catplot(x='driver_gender', data=data, kind='count',height=5)
plt.title('Number of Male and Female',fontsize=15)
plt.show()

print(data.driver_gender.value_counts())

The numbers of male and female drivers are unequal, so we should compare **proportions** rather than raw counts.

* Create dataframes by different genders:

In [None]:
female= data[data.driver_gender == 'F']
male= data[data.driver_gender == 'M']

* Compute the violations by different drivers as proportions: 

`data.column.value_counts(normalize=True)` returns the proportion of each category in the column.
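As a quick illustration of the difference between counts and proportions, here is a small sketch on a made-up Series of four violations (not taken from the dataset).

```python
import pandas as pd

# Toy data (made up): four stops, three of them for speeding
violations = pd.Series(["Speeding", "Speeding", "Equipment", "Speeding"])

# Raw counts per category
counts = violations.value_counts()
print(counts)

# Proportions per category; they sum to 1
shares = violations.value_counts(normalize=True)
print(shares)
```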

In [None]:
print('Female Violations')
print(female.violation.value_counts(normalize= True))

In [None]:
print('Male Violations')
print(male.violation.value_counts(normalize= True))

About 2/3 of female traffic stops are for speeding, whereas for males the share is about half.
This doesn't mean that females speed more often than males, since these are shares of each gender's stops and we didn't take into account the total number of stops or drivers.

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2, 2, 1)
female.violation.value_counts(normalize=True).plot(kind="bar")
plt.title("Violation of Women")
plt.xticks(rotation=90)

plt.subplot(2, 2, 2)
male.violation.value_counts(normalize=True).plot(kind="bar")
plt.title("Violation of Men")
plt.xticks(rotation=90)

plt.show()

### Filtering by multiple conditions

When filtering by multiple conditions, combine them with logical operators:
* `&` represents an `and` operation
* `|` represents an `or` operation

Each condition should be enclosed in parentheses `( )` inside the brackets `[]`.
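A minimal sketch of both operators on a made-up four-row frame. The parentheses matter because `&` and `|` bind more tightly than `==` in Python, so without them the comparison would be evaluated in the wrong order.

```python
import pandas as pd

# Toy data (made up) using the same column names as the dataset
toy = pd.DataFrame({
    "driver_gender": ["F", "M", "F", "M"],
    "violation": ["Speeding", "Speeding", "Equipment", "Other"],
})

# Rows where BOTH conditions hold
both = toy[(toy.driver_gender == "F") & (toy.violation == "Speeding")]
print(len(both))

# Rows where AT LEAST ONE condition holds
either = toy[(toy.driver_gender == "F") | (toy.violation == "Speeding")]
print(len(either))
```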

## Q2. Does gender affect who gets a ticket for speeding?

### Speeding

In [None]:
# Create DataFrame
female_speeding= data[(data.driver_gender == 'F') & (data.violation == 'Speeding')]
male_speeding= data[(data.driver_gender == 'M') & (data.violation == 'Speeding')]

In [None]:
# Print the count result
print("Female Stop")
print(female_speeding.stop_outcome.value_counts(normalize= True))

In [None]:
print('Male Stop')
print(male_speeding.stop_outcome.value_counts(normalize= True))

The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. <br>
The data doesn't show that gender has an impact on who gets a ticket for speeding.

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2, 2, 1)
female_speeding.stop_outcome.value_counts(normalize= True).plot(kind="bar")
plt.title("Women")
plt.xticks(rotation=90)

plt.subplot(2, 2, 2)
male_speeding.stop_outcome.value_counts(normalize= True).plot(kind="bar")
plt.title("Men")
plt.xticks(rotation=90)

plt.show()

## Q3. Does gender affect whose vehicle is searched?

### Search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. <br>
Calculate the percentage of all stops in the data that result in a vehicle search, also known as the search rate.

In [None]:
search_rate= data.search_conducted.value_counts(normalize= True)
print(search_rate)

It can also be calculated with the mean:

In [None]:
search_rate= data.search_conducted.mean()
print(search_rate)

**About 3.7% of stops result in a vehicle search.**
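The mean works here because pandas treats `True` as 1 and `False` as 0, so the mean of a Boolean column is the share of `True` values. A small sketch on a made-up column of five stops:

```python
import pandas as pd

# Toy data (made up): 2 of 5 stops involved a search
searched = pd.Series([True, False, False, False, True])

# Share of True values computed by hand
rate_by_hand = searched.sum() / len(searched)

# The same share via the mean
rate_from_mean = searched.mean()

print(rate_by_hand, rate_from_mean)
```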

Next, compare search rates by gender. <br>
Use the `groupby()` method to do this:

In [None]:
search_rate_gender= data.groupby(data.driver_gender).search_conducted.mean()
print(search_rate_gender)
search_rate_gender.plot(kind='bar')
plt.title('Search Rate by Genders')
plt.show()

**Male drivers are searched more than twice as often as female drivers (about 4% vs. 2%, respectively).**

### Adding a second factor

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by **violation** type, and the difference in search rate between males and females is because they tend to commit different violations.

**You can test this hypothesis by examining the search rate for each combination of gender and violation**. If the hypothesis was true, you would find that males and females are searched at about the same rate for each violation.

In [None]:
second= data.groupby(['violation', 'driver_gender']).search_conducted.mean()
print(second)

The search rate is higher for males than for females for all types of specified violations, which contradicts the hypothesis, at least for this dataset.
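For contrast, here is a sketch of what the result would look like if the hypothesis did hold. The data is entirely made up (violations `A` and `B` are placeholders): males and females are searched at the same rate within each violation, yet the overall rates differ because the genders commit the two violations in different proportions. Calling `unstack()` puts the genders side by side, which makes the comparison easier to read.

```python
import pandas as pd

# Toy data (made up): violation A is searched 50% of the time, B never,
# for both genders. Males mostly commit A, females mostly commit B.
rows = (
    [("M", "A", True)] * 2 + [("M", "A", False)] * 2 + [("M", "B", False)] * 2
    + [("F", "A", True)] * 1 + [("F", "A", False)] * 1 + [("F", "B", False)] * 4
)
toy = pd.DataFrame(rows, columns=["driver_gender", "violation", "search_conducted"])

# Overall search rate by gender: males look more likely to be searched
overall = toy.groupby("driver_gender")["search_conducted"].mean()
print(overall)

# Search rate per violation and gender, with genders as columns
by_violation = (
    toy.groupby(["violation", "driver_gender"])["search_conducted"].mean().unstack()
)
print(by_violation)
```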

## Q4. Does gender affect who is frisked during a search?

### Protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

1. Check to see how many times "Protective Frisk" was the only search type. 
2. Locate all instances in which the driver was frisked.

In [None]:
print(data.search_type.value_counts())

In [None]:
data['frisk']= data.search_type.str.contains('Protective Frisk', na=False)
print(data.frisk.sum())

**274 drivers were frisked.**
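The reason for using `str.contains` rather than `==` is that a single stop can list several search types in one string. The sketch below uses a made-up four-element Series to show the difference, and how `na=False` turns missing values (stops without a search) into `False` instead of `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one string may hold several comma-separated search types
search_type = pd.Series([
    "Protective Frisk",
    "Incident to Arrest,Protective Frisk",
    "Probable Cause",
    np.nan,  # no search was conducted
])

# Exact match only finds rows where the frisk was the sole search type
exact = search_type == "Protective Frisk"
print(exact.sum())

# Substring match finds every row that mentions a frisk
frisk = search_type.str.contains("Protective Frisk", na=False)
print(frisk.sum())
```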

Comparing frisk rates by gender:

In [None]:
searched= data[data.search_conducted == True]
print(searched.groupby('driver_gender').frisk.mean())

Males are frisked more often than females, **though we can't conclude that this difference is caused by the driver's gender.**

## Q5. Does time of day affect arrest rate?

### Hourly arrest rate

When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. Let's check whether the arrest rate varies by time of day.

In [None]:
print(data.is_arrested.mean())

In [None]:
hourly_arrest_rate = data.groupby(data.index.hour).is_arrested.mean()
print(hourly_arrest_rate)

In [None]:
hourly_arrest_rate.plot()

plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')
plt.grid(alpha=0.5)
plt.show()

**The arrest rate has a significant spike overnight, and then dips in the early morning hours.**
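Grouping by `data.index.hour` works because a `DatetimeIndex` exposes the hour of each timestamp as an integer. A small sketch on four made-up stops:

```python
import pandas as pd

# Toy data (made up): two stops around 1 am, two around 2 pm
idx = pd.to_datetime([
    "2010-01-01 01:30", "2010-01-02 01:45",
    "2010-01-03 14:00", "2010-01-04 14:20",
])
toy = pd.DataFrame({"is_arrested": [True, False, False, False]}, index=idx)

# Group rows by the hour of the index, ignoring the date
hourly = toy.groupby(toy.index.hour)["is_arrested"].mean()
print(hourly)
```

Note that stops from different days fall into the same group as long as they share the same hour.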

## Q6. Are drug-related stops on the rise?

### Drug-related stops

In a small portion of traffic stops, drugs are found in the vehicle during a search.

1. Calculate the annual rate of drug-related stops: Use `resample`
2. Plot the annual drug rate
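`resample` with an annual frequency groups the rows by calendar year of the datetime index. The name of the annual alias has changed across pandas versions (as far as I know, newer releases prefer `"YE"` over `"A"`), so the sketch below shows the same yearly mean computed by grouping on `index.year` instead, which does not depend on an alias. The data is made up, and unlike `resample` this variant labels each group with the integer year and only lists years that actually have rows.

```python
import pandas as pd

# Toy data (made up): two stops in 2005, three in 2006
idx = pd.to_datetime([
    "2005-03-01", "2005-09-15",
    "2006-02-10", "2006-07-04", "2006-11-30",
])
drugs = pd.Series([False, False, True, False, False], index=idx)

# Yearly rate of drug-related stops, grouped by the year of the index
annual = drugs.groupby(drugs.index.year).mean()
print(annual)
```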

In [None]:
annual_drug= data.drugs_related_stop.resample("A").mean()
print(annual_drug)

In [None]:
annual_drug.plot()
plt.grid(alpha=0.5)
plt.show()

**The rate of drug-related stops nearly doubled over the course of 10 years. Why?**

Comparing drug and search rates:

In [None]:
annual_search= data.search_conducted.resample('A').mean()

# Concat the two columns
annual= pd.concat([annual_drug, annual_search],axis='columns')
print(annual)

In [None]:
# DataFrame.plot creates its own figure, so pass figsize to it directly
annual.plot(subplots=True, figsize=(10, 8))
plt.show()

I hypothesized that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops. However, the rate of drug-related stops increased even though the search rate decreased, which disproves my hypothesis.

## Q7. Which violations are recorded for each driver race?

### Violation By Race

* Frequency Table: `pd.crosstab()`
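Because the groups differ in size, raw counts in a frequency table can be hard to compare, for the same reason proportions were used for gender earlier. `pd.crosstab` accepts `normalize="index"` to turn each row into proportions. The sketch below uses made-up data with placeholder categories `A` and `B`.

```python
import pandas as pd

# Toy data (made up): group A has four stops, group B has two
toy = pd.DataFrame({
    "driver_race": ["A", "A", "A", "A", "B", "B"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Equipment"],
})

# Raw counts per group and violation
counts = pd.crosstab(toy.driver_race, toy.violation)
print(counts)

# Row-wise proportions: each row sums to 1
shares = pd.crosstab(toy.driver_race, toy.violation, normalize="index")
print(shares)
```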

In [None]:
pd.crosstab(data.driver_race, data.violation)

In [None]:
race= pd.crosstab(data.driver_race, data.violation)

In [None]:
race.plot(kind='bar')
plt.show()

## Q8. How long might you be stopped for a violation?

### Time of Stop

The `stop_duration` column tells you approximately how long the driver was detained by the officer. 

Since the durations are stored as strings (such as `'0-15 Min'`), we should convert them to numbers by passing a **dictionary** to `map`.

* Convert `'0-15 Min'` to 8
* Convert `'16-30 Min'` to 23
* Convert `'30+ Min'` to 45
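One detail of `map` with a dictionary is that any value not found among the keys becomes `NaN`. The sketch below shows this on a made-up Series in which `'2'` stands in for an unexpected entry. This is why it helps to check the unique values before and after mapping.

```python
import pandas as pd

mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# Toy data (made up): '2' is not a key in the mapping
durations = pd.Series(["0-15 Min", "30+ Min", "2", "16-30 Min"])

# Values without a matching key are mapped to NaN
minutes = durations.map(mapping)
print(minutes)

# Number of values that could not be mapped
n_unmapped = minutes.isnull().sum()
print(n_unmapped)
```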

In [None]:
mapping = {'0-15 Min':8, '16-30 Min': 23, '30+ Min': 45}

In [None]:
print(data.stop_duration.unique())
data['stop_minutes']= data.stop_duration.map(mapping)
print(data.stop_minutes.unique())

In [None]:
data.dropna(subset=['stop_minutes'],inplace=True)
print(data.stop_minutes.unique())

**Plotting stop length:**

In [None]:
stop_length= data.groupby(data.violation_raw).stop_minutes.mean()
print(stop_length)

In [None]:
stop_length.sort_values().plot(kind='barh')

plt.xlabel('Stopping time')
plt.ylabel('Violation')
plt.title('Stopping Time by Violation')
plt.show()

In [None]:
# Plot the top 10 
top10= stop_length.sort_values(ascending=False).head(10)
top10.sort_values().plot(kind='barh')

plt.xlabel('Stopping time')
plt.ylabel('Violation')
plt.title('Stopping Time by Violation (Top 10)')
plt.show()

In [None]:
data.reset_index(inplace=True)
data.head()

## Q9. Which year had the fewest stops?

In [None]:
data['year']= data.stop_datetime.dt.year
year= data.year.value_counts()
print(year)

year.sort_values().plot(kind='barh')
plt.xlabel('Number of Stops')
plt.ylabel('Year')
plt.title('Number of Stops in each Year')
plt.show()