# COGS 108 - Capstone Project

## Project links, files, and basic information

### Websites with datasets:
- San Diego Vehicle Stops:  https://data.sandiego.gov/datasets/police-vehicle-stops/
- San Diego Population Data:  http://www.city-data.com/city/San-Diego-California.html

### Websites of needed information:
- San Diego police service areas: https://www.sandiego.gov/police/services/divisions (vehicle stop data only records the first two digits)
- San Diego zip code map: http://www.city-data.com/zipmaps/San-Diego-California.html

### Names of datasets
#### *Vehicle Stops*
- 'vehicle_stops_2017.csv'
- 'vehicle_stops_2016.csv'
- 'vehicle_stops_2015.csv'
- 'vehicle_stops_2014.csv'

#### *Vehicle Stops Details*
- 'vehicle_stops_search_details_2017.csv'
- 'vehicle_stops_search_details_2016.csv'
- 'vehicle_stops_search_details_2015.csv'
- 'vehicle_stops_search_details_2014.csv'

#### *Files needed to read Vehicle Stops information*
- Race Codes: 'vehicle_stops_race_codes.csv'    
- Title explanations for Vehicle Stops data: 'vehicle_stops_dictionary.csv'
- Title explanations for Vehicle Stops Details data: 'vehicle_stops_search_details_dictionary.csv'
- Possible actions taken when stopped for Vehicle Stops Details data: 'vehicle_stops_search_details_description_list.csv'

# TODO
 - Map zip-code and police area 
 - Combine stops and info datasets (careful not to lose information or count duplicates)
 - Web scrape zip-code demographics
 - Plan how we will correlate the data

In [2]:
# Imports
%matplotlib inline

# Basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Web scraping
import sys
#!conda install --yes --prefix {sys.prefix} beautifulsoup4

# Data analysis
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest



In [6]:
# Read in 2017 datasets
df_stops = pd.read_csv('vehicle_stops_2017.csv')
df_stops_info = pd.read_csv('vehicle_stops_search_details_2017.csv')
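
The 2014 to 2016 files listed at the top follow the same naming pattern as the 2017 files, so they could be loaded with a small helper. This is only a sketch: `load_years`, its `data_dir` parameter, and the added `source_year` column are hypothetical names, and it assumes the yearly files have compatible columns, which has not been checked here.

```python
from pathlib import Path

import pandas as pd


def load_years(pattern, years, data_dir="."):
    """Read one CSV per year and stack them into a single DataFrame.

    `pattern` is a filename template such as 'vehicle_stops_{year}.csv'.
    A 'source_year' column records which file each row came from.
    """
    frames = []
    for year in years:
        path = Path(data_dir) / pattern.format(year=year)
        frame = pd.read_csv(path, low_memory=False)
        frame["source_year"] = year
        frames.append(frame)
    # ignore_index gives the stacked frame a fresh 0..n-1 row index
    return pd.concat(frames, ignore_index=True, sort=False)
```

With the CSV files in the working directory, a call such as `load_years('vehicle_stops_{year}.csv', range(2014, 2018))` would return all four years in one frame.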

In [17]:
df_stops['service_area'].unique()

array(['120', '520', '720', '440', '510', '310', '930', '110', '620',
       '230', '240', 'Unknown', ' Bulletin', '710', '320', '430', '830',
       '820', '810', '610', ' County', '840', '130', '630', '530'], dtype=object)
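
The output above mixes numeric service areas with text labels, two of which carry a leading space (`' Bulletin'`, `' County'`). Below is a minimal sketch of one way to separate them before mapping areas to zip codes. It uses a small sample of the values shown above rather than the real column, and the variable names are illustrative.

```python
import pandas as pd

# A few values copied from the unique() output above
service_area = pd.Series(['120', '520', 'Unknown', ' Bulletin', ' County', '930'])

# Strip stray whitespace, then convert to numbers; text labels become NaN
cleaned = service_area.str.strip()
area_code = pd.to_numeric(cleaned, errors='coerce')

# Labels that are not numeric service areas
non_numeric = sorted(cleaned[area_code.isna()].unique())
print(non_numeric)
```

On the real data the same steps would apply to `df_stops['service_area']`. Whether the non-numeric rows should be dropped or kept as their own category is a decision still to be made.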

In [10]:
df_stops_info.head(20)

| | stop_id | search_details_id | search_details_type | search_details_description | Unnamed: 4 | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | 1444799 | 1628067.0 | ActionTaken | Citation | | |
| 1 | 1444821 | 1628097.0 | ActionTaken | Verbal Warning | | |
| 2 | 1447102 | 1630482.0 | ActionTaken | Citation | | |
| 3 | 1444801 | 1628069.0 | ActionTaken | Verbal Warning | | |
| 4 | 1444802 | 1628070.0 | ActionTaken | Verbal Warning | | |
| 5 | 1444912 | 1628192.0 | ActionTaken | Verbal Warning | | |
| 6 | 1444804 | 1628072.0 | ActionTaken | Verbal Warning | | |
| 7 | 1444805 | 1628073.0 | ActionTaken | Other | | |
| 8 | 1444805 | 1628074.0 | ActionTakenOther | DL 310 | | |
| 9 | 1444811 | 1628083.0 | ActionTaken | Citation | | |


In [None]:
# Drop garbage columns
df_stops.drop(['Unnamed: 15', 'Unnamed: 16', 'timestamp'], axis=1, inplace=True)
df_stops_info.drop(['Unnamed: 5', 'Unnamed: 4'], axis=1, inplace=True)
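
The `Unnamed: N` columns have different numbers in each file, so hard-coding them may not carry over to the other years. One possible alternative is to drop every column whose name starts with `Unnamed`. The helper name `drop_unnamed` is hypothetical, and `timestamp` would still need to be dropped separately.

```python
import pandas as pd


def drop_unnamed(df):
    """Return a copy of df without columns whose name starts with 'Unnamed'."""
    keep = ~df.columns.str.startswith('Unnamed')
    return df.loc[:, keep].copy()
```

Unlike `drop(..., inplace=True)`, this returns a new frame, so the cell can be re-run without raising an error about missing columns.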

In [None]:
# Merge stop information based on 'stop_id'
df_stops_comb = df_stops_info.merge(df_stops, on='stop_id')

In [None]:
print("Stops:\t", df_stops.shape)
print("Info:\t", df_stops_info.shape)
print("Comb:\t", df_stops_comb.shape)
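
Comparing shapes only hints at what the merge did. The default inner merge keeps only `stop_id` values present in both frames, and a stop with several detail rows (such as 1444805 in the output above) appears once per detail row. The sketch below uses toy frames, not the real data, to show one way to make both effects visible. The stop with id 1 and the `service_area` values are made up for illustration.

```python
import pandas as pd

# Toy data: stop 1444805 has two detail rows, stop 1 has none
stops = pd.DataFrame({
    'stop_id': [1444799, 1444805, 1],
    'service_area': ['120', '520', '720'],
})
details = pd.DataFrame({
    'stop_id': [1444799, 1444805, 1444805],
    'search_details_type': ['ActionTaken', 'ActionTaken', 'ActionTakenOther'],
})

# how='left' keeps every stop, even those without detail rows.
# validate raises an error if stop_id is duplicated in `stops`.
# indicator adds a '_merge' column marking stops with no match.
comb = stops.merge(details, on='stop_id', how='left',
                   validate='one_to_many', indicator=True)

n_unmatched = (comb['_merge'] == 'left_only').sum()  # stops with no details
rows_per_stop = comb.groupby('stop_id').size()       # detail rows per stop
```

Counting stops should then use `comb['stop_id'].nunique()` rather than the number of rows, since rows count detail entries.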

In [23]:
list(df_stops)

['stop_id',
 'stop_cause',
 'service_area',
 'subject_race',
 'subject_sex',
 'subject_age',
 'stop_date',
 'stop_time',
 'sd_resident',
 'arrested',
 'searched',
 'obtained_consent',
 'contraband_found',
 'property_seized']