At first: 
* I wanted to identify factors associated with the actual outcomes ('warning', 'citation', 'arrest')
* but after a quick glimpse at the raw dataset, I realized this inference analysis might be boring, since most warnings and citations are tied to either a moving or a mechanical/non-moving "reason_for_stop"
* i.e. "reason_for_stop" seems to almost always determine the "outcome," so there wouldn't be much point in looking at other factors (gender, age, etc.) as "predictors"
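
A minimal sketch of how this impression could be checked rather than eyeballed: a row-normalized cross-tab of "reason_for_stop" against "outcome". The rows below are made-up toy values (only the column names come from the dataset), so the printed shares say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows: column names match the dataset, values are invented.
toy = pd.DataFrame({
    "reason_for_stop": ["Moving Violation", "Moving Violation", "Moving Violation",
                        "DUI Check", "DUI Check"],
    "outcome": ["citation", "citation", "warning", "arrest", "citation"],
})

# normalize="index" makes each row sum to 1, i.e. the share of each
# outcome within a given reason for the stop.
outcome_share = pd.crosstab(toy["reason_for_stop"], toy["outcome"], normalize="index")
print(outcome_share.round(2))
```

If one outcome dominates every row in the real data, that would support dropping "outcome" as the target.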

Now: 
* another column I want to analyze more closely is whether or not a search was conducted ("search_conducted")
* after glancing at some of the corresponding values in "reason_for_stop," I see variety: searches are not tied to a single reason for the stop
    * e.g. for the same reason (a moving violation, mechanical/non-moving, etc.), one person gets searched while another does not (perhaps due to human bias, etc.)
* so this column seems likely to give more promising/interesting results (hopefully)

---

## Overview of Plan 

**Guiding Question for Doing EDA (purpose):**

After accounting for the legal reason for the stop (moving violation, DUI, etc.), *what other factors (gender, location, hour of day, etc.) can statistically explain the choice to conduct a search?*

e.g: 
* I will control for "reason_for_stop" by grouping on it, then, say, analyze whether gender is associated with being searched
* suppose I find that males have a higher search rate than females within the "moving violation" group
* ==> even when the reason for being stopped is the same, gender is associated w/ a higher likelihood of being searched
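
The example above can be sketched as a grouped search rate. This is only an illustration on invented toy rows (the column names are from the dataset, the values are not), and a gap between groups here would still need a proper statistical test before I read anything into it.

```python
import pandas as pd

# Hypothetical toy rows: values are invented for illustration only.
toy = pd.DataFrame({
    "reason_for_stop": ["Moving Violation"] * 4 + ["DUI Check"] * 4,
    "subject_sex": ["male", "male", "female", "female"] * 2,
    "search_conducted": [True, False, False, False, True, True, True, False],
})

# The mean of a boolean column is the share of True values, so this gives
# the search rate for each (reason, sex) pair; unstack puts sexes side by side.
search_rate = (
    toy.groupby(["reason_for_stop", "subject_sex"])["search_conducted"]
    .mean()
    .unstack("subject_sex")
)
print(search_rate)
```

Reading across a row compares search rates between sexes while holding the reason for the stop fixed.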

**Rough Workflow**

Recall specs:
1. Data Overview
    * Load the dataset.
    * Summarize rows, columns, variable types, missing values, and duplicates.
2. Descriptive Statistics
    * Calculate summary metrics (mean, median, standard deviation, min, max, counts).
3. Visual Exploration
    * Create histograms, boxplots, scatterplots, and a correlation heatmap.
4. Data Quality Review
    * Identify missing data, outliers, and unusual values.
5. Model-Relevant Insights
    * Identify which variables may be useful predictors.
    * Provide a plain-language summary of takeaways.

* instead of deeply analyzing every column (i.e. following the step-by-step specs above ^) to identify the best features/predictors, I will pick 4-5 features after briefly analyzing all the columns w/ steps 1, 2, and 4
* I will then visualize the chosen features to examine their distributions and identify possible relationships with being searched (steps 3 and 4)
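
A rough sketch of what the brief pass over all columns (steps 1, 2, 4) might look like. It runs on a small made-up frame so it stands alone; the `overview` table and the toy values are my own illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "subject_age": [25.0, np.nan, 41.0, 41.0],
    "subject_sex": ["male", "female", "male", "male"],
    "search_conducted": [False, False, True, True],
})

# Steps 1 and 4: one row per column with its type, missing count,
# and number of distinct non-missing values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
n_duplicates = int(toy.duplicated().sum())  # fully identical rows after the first

# Step 2: summary metrics for a numeric column (count, mean, std, min, quartiles, max).
age_stats = toy["subject_age"].describe()

print(overview)
print("duplicate rows:", n_duplicates)
print(age_stats)
```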

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
sf_stops_df = pd.read_csv("ca_san_francisco_2020_04_01.csv")
sf_stops_df.head()


Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,subject_age,subject_race,subject_sex,...,citation_issued,warning_issued,outcome,contraband_found,search_conducted,search_vehicle,search_basis,reason_for_stop,raw_search_vehicle_description,raw_result_of_contact_description
0,869921,2014-08-01,00:01:00,MASONIC AV & FELL ST,37.773004,-122.445873,,,asian/pacific islander,female,...,False,True,warning,,False,False,,Mechanical or Non-Moving Violation (V.C.),No Search,Warning
1,869922,2014-08-01,00:01:00,GEARY&10TH AV,37.780898,-122.468586,,,black,male,...,True,False,citation,,False,False,,Mechanical or Non-Moving Violation (V.C.),No Search,Citation
2,869923,2014-08-01,00:15:00,SUTTER N OCTAVIA ST,37.786919,-122.426718,,,hispanic,male,...,True,False,citation,,False,False,,Mechanical or Non-Moving Violation (V.C.),No Search,Citation
3,869924,2014-08-01,00:18:00,3RD ST & DAVIDSON,37.74638,-122.392005,,,hispanic,male,...,False,True,warning,,False,False,,Mechanical or Non-Moving Violation (V.C.),No Search,Warning
4,869925,2014-08-01,00:19:00,DIVISADERO ST. & BUSH ST.,37.786348,-122.440003,,,white,male,...,True,False,citation,,False,False,,Mechanical or Non-Moving Violation (V.C.),No Search,Citation


In [7]:
sf_stops_df["district"].unique()

array([nan, 'C', 'B', 'I', 'A', 'J', 'E', 'H', 'G', 'F', 'D', 'K', 'T',
       'S'], dtype=object)

In [8]:
sf_stops_df["outcome"].unique()



In [10]:
sf_stops_df["reason_for_stop"].unique()

array(['Mechanical or Non-Moving Violation (V.C.)', 'Moving Violation',
       'MPC Violation', 'DUI Check', nan, 'Traffic Collision',
       'Assistance to Motorist', 'BOLO/APB/Warrant',
       'Moving Violation|Mechanical or Non-Moving Violation (V.C.)',
       'DUI Check|MPC Violation', 'Moving Violation|NA',
       'Mechanical or Non-Moving Violation (V.C.)|Moving Violation',
       'Moving Violation|Assistance to Motorist',
       'Moving Violation|DUI Check', 'Moving Violation|BOLO/APB/Warrant',
       'NA|Traffic Collision',
       'Mechanical or Non-Moving Violation (V.C.)|Assistance to Motorist',
       'Mechanical or Non-Moving Violation (V.C.)|DUI Check',
       'Moving Violation|MPC Violation',
       'Moving Violation|Mechanical or Non-Moving Violation (V.C.)|MPC Violation',
       'Moving Violation|MPC Violation|MPC Violation',
       'Moving Violation|Mechanical or Non-Moving Violation (V.C.)|Mechanical or Non-Moving Violation (V.C.)',
       'Moving Violation|Traffic Co

In [9]:
sf_stops_df.columns

Index(['raw_row_number', 'date', 'time', 'location', 'lat', 'lng', 'district',
       'subject_age', 'subject_race', 'subject_sex', 'type', 'arrest_made',
       'search_conducted', 'search_vehicle', 'search_basis', 'reason_for_stop',
       'raw_search_vehicle_description', 'raw_result_of_contact_description'],
      dtype='object')

In [17]:
sf_stops_df[sf_stops_df["reason_for_stop"] == "Moving Violation|Mechanical or Non-Moving Violation (V.C.)|Mechanical or Non-Moving Violation (V.C.)|Mechanical or Non-Moving Violation (V.C.)|Mechanical or Non-Moving Violation (V.C.)"]


Unnamed: 0,raw_row_number,date,time,location,lat,lng,district,subject_age,subject_race,subject_sex,...,citation_issued,warning_issued,outcome,contraband_found,search_conducted,search_vehicle,search_basis,reason_for_stop,raw_search_vehicle_description,raw_result_of_contact_description
808810,818282|818283|818284|818285|818286,2015-07-16,21:50:00,OCEAN AVE & PLYMOUTH AVE,37.723888,-122.456147,I,41.0,other,male,...,True,False,citation,,False,False,,Moving Violation|Mechanical or Non-Moving Viol...,No Search,Citation
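
The row above seems to combine several raw records (its raw_row_number is pipe-delimited too), which is probably why the same reason repeats inside "reason_for_stop". One possible way to make these values groupable is to split on "|", drop repeats, and rejoin. This is a sketch only: `normalize_reason` is a hypothetical helper, and treating the literal "NA" piece as a missing-value placeholder is my assumption.

```python
import pandas as pd

# Sample values shaped like the ones seen in the unique() output above.
reasons = pd.Series([
    "Moving Violation",
    "Moving Violation|Mechanical or Non-Moving Violation (V.C.)|Mechanical or Non-Moving Violation (V.C.)",
    "NA|Traffic Collision",
    None,
])

def normalize_reason(value):
    """Hypothetical helper: de-duplicate pipe-delimited reasons, keeping first-seen order."""
    if pd.isna(value):
        return value
    # Assumption: a literal "NA" piece is a placeholder, not a real reason.
    parts = [p for p in value.split("|") if p != "NA"]
    unique_parts = list(dict.fromkeys(parts))  # dict preserves insertion order
    return "|".join(unique_parts) if unique_parts else None

normalized = reasons.map(normalize_reason)
print(normalized.tolist())
```

Keeping first-seen order means "A|B" and "B|A" still count as different values; sorting the pieces before joining would merge them if the order turns out not to matter.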
