# MIDS W207 Fall 2017 Final Project
## Exploratory Data Analysis
Laura Williams, Kim Vignola, Cyprian Gascoigne  
SF Crime Classification

This notebook contains the Exploratory Data Analysis that will inform data setup and modeling decisions. Recommendations based on this EDA include:
* adding buckets for month, holidays, and dayparts
* assessing inclusion of day of month, especially the 1st of the month 
* reassigning cases with latitude > 38 to a latitude based on district
* addressing the year 2015 as we only have data through May
* expanding the definition of holidays to include holiday weeks/weekends and eves

In [96]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import zipfile

In [3]:
# Unzip raw data into a subdirectory; the context manager closes the archive automatically
with zipfile.ZipFile("raw_data.zip", "r") as unzip_files:
    unzip_files.extractall("raw_data")

In [4]:
# Read CSV files into pandas dataframes
train = pd.read_csv("raw_data/train.csv")
test = pd.read_csv("raw_data/test.csv")
weather = pd.read_csv("raw_data/SF_county.csv")

In [7]:
# Most of the fields in the training dataset need to be transformed to numerical values.
print(train.dtypes)

Dates          object
Category       object
Descript       object
DayOfWeek      object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object
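
One possible way to turn the object columns into numbers is to integer-code the categorical fields. The sketch below uses `pd.factorize` on a few made-up rows; the sample values and the `_code` column suffix are illustrative assumptions, not the encoding this project settled on.

```python
import pandas as pd

# Toy rows standing in for train.csv (values are illustrative only)
sample = pd.DataFrame({
    "DayOfWeek": ["Friday", "Sunday", "Friday"],
    "PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN"],
})

# Integer-code each categorical column; sort=True assigns codes in sorted label order
for col in ["DayOfWeek", "PdDistrict"]:
    codes, labels = pd.factorize(sample[col], sort=True)
    sample[col + "_code"] = codes

print(sample)
```

Integer codes imply an ordering that unordered fields such as district do not have, so one-hot encoding may suit some models better.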


In [8]:
# The dataset contains erroneous coordinates: the maximum latitude (Y) is 90 degrees, which is the North Pole.
train.describe()

                   X           Y
count       878049.0    878049.0
mean     -122.422616    37.77102
std         0.030354    0.456893
min      -122.513642   37.707879
25%      -122.432952   37.752427
50%       -122.41642   37.775421
75%      -122.406959   37.784369
max           -120.5        90.0


In [26]:
# The number of erroneous Y values is small, but the magnitude of difference is large and could create noise 
# in the data. To fix this we will reassess Lat/Long for these cases based on Districts.
print(len(train.loc[train['Y'] > 40]))

67
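
The reassignment by district could be sketched as follows: mark the out-of-range coordinates as missing, then fill them with the median of the valid points in the same district. The toy rows and the choice of the median are assumptions for illustration; a district centroid or another estimate could be used instead.

```python
import numpy as np
import pandas as pd

# Toy rows; the last one uses the out-of-range maximum values from the summary above
sample = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "SOUTHERN", "SOUTHERN"],
    "X": [-122.40, -122.42, -120.5],
    "Y": [37.78, 37.76, 90.0],
})

# Mark out-of-range coordinates as missing (the 40-degree cutoff mirrors the check above)
bad = sample["Y"] > 40
sample.loc[bad, ["X", "Y"]] = np.nan

# Fill each missing value with the median of the valid points in the same district
for col in ["X", "Y"]:
    district_median = sample.groupby("PdDistrict")[col].transform("median")
    sample[col] = sample[col].fillna(district_median)

print(sample)
```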


In [33]:
# The Southern, Mission and Northern districts have the highest incident counts.
train["PdDistrict"].value_counts()

SOUTHERN      157182
MISSION       119908
NORTHERN      105296
BAYVIEW        89431
CENTRAL        85460
TENDERLOIN     81809
INGLESIDE      78845
TARAVAL        65596
PARK           49313
RICHMOND       45209
Name: PdDistrict, dtype: int64

In [71]:
# Larceny/Theft is the largest category, and "NON-CRIMINAL" activity ranks third among crime categories.
train["Category"].value_counts()

LARCENY/THEFT                  174900
OTHER OFFENSES                 126182
NON-CRIMINAL                    92304
ASSAULT                         76876
DRUG/NARCOTIC                   53971
VEHICLE THEFT                   53781
VANDALISM                       44725
WARRANTS                        42214
BURGLARY                        36755
SUSPICIOUS OCC                  31414
MISSING PERSON                  25989
ROBBERY                         23000
FRAUD                           16679
FORGERY/COUNTERFEITING          10609
SECONDARY CODES                  9985
WEAPON LAWS                      8555
PROSTITUTION                     7484
TRESPASS                         7326
STOLEN PROPERTY                  4540
SEX OFFENSES FORCIBLE            4388
DISORDERLY CONDUCT               4320
DRUNKENNESS                      4280
RECOVERED VEHICLE                3138
KIDNAPPING                       2341
DRIVING UNDER THE INFLUENCE      2268
RUNAWAY                          1946
LIQUOR LAWS 

In [97]:
# "Other offenses" is a broad category, driven by driving/traffic violations and evasion of legal requirements
# (e.g., Probation Violation, Parole Violation, Violation of Restraining Order). It also includes some
# unusual descriptions such as "Danger of Leading Immoral Life." Given such breadth, this category may prove
# difficult to classify.

print(((train['Descript'].loc[train['Category'] == "OTHER OFFENSES"]).value_counts()).head(20))

DRIVERS LICENSE, SUSPENDED OR REVOKED                    26839
TRAFFIC VIOLATION                                        16471
RESISTING ARREST                                          8983
MISCELLANEOUS INVESTIGATION                               8389
PROBATION VIOLATION                                       8016
LOST/STOLEN LICENSE PLATE                                 6424
VIOLATION OF RESTRAINING ORDER                            5816
PAROLE VIOLATION                                          5119
TRAFFIC VIOLATION ARREST                                  5051
CONSPIRACY                                                3114
OBSCENE PHONE CALLS(S)                                    2492
FALSE PERSONATION TO RECEIVE MONEY OR PROPERTY            2339
VIOLATION OF MUNICIPAL CODE                               2308
HARASSING PHONE CALLS                                     2194
POSSESSION OF BURGLARY TOOLS                              2085
VIOLATION OF MUNICIPAL POLICE CODE                     

In [19]:
# Crime counts differ somewhat by day of week: they peak on Friday, are lowest on Sunday and rise from Monday through Wednesday.
train["DayOfWeek"].value_counts()

Friday       133734
Wednesday    129211
Saturday     126810
Thursday     125038
Tuesday      124965
Monday       121584
Sunday       116707
Name: DayOfWeek, dtype: int64

In [86]:
# The most frequent timestamps fall on New Year's Day and on the first day of a month, almost all at 00:01,
# which may indicate default timestamps rather than true spikes in crime.
train["Dates"].value_counts().head(20)

2011-01-01 00:01:00    185
2006-01-01 00:01:00    136
2012-01-01 00:01:00     94
2006-01-01 12:00:00     63
2007-06-01 00:01:00     61
2006-06-01 00:01:00     58
2010-06-01 00:01:00     56
2010-08-01 00:01:00     55
2008-04-01 00:01:00     53
2013-11-01 00:01:00     52
2010-11-01 00:01:00     51
2008-11-01 00:01:00     51
2006-07-01 00:01:00     51
2013-05-01 00:01:00     51
2011-06-01 00:01:00     50
2005-06-01 00:01:00     50
2005-07-01 00:01:00     49
2008-06-01 00:01:00     48
2009-09-01 00:01:00     46
2012-06-01 00:01:00     46
Name: Dates, dtype: int64

Create new buckets for month, day, year, hour, holidays and dayparts to further explore this data:

In [50]:
from datetime import datetime, timedelta, date
import holidays

# Extract month, year, hour and day of month from both datasets.
# Parsing each column once with pd.to_datetime avoids repeated strptime calls per row.
for df in (train, test):
    parsed = pd.to_datetime(df["Dates"], format="%Y-%m-%d %H:%M:%S")
    df["month"] = parsed.dt.month
    df["year"] = parsed.dt.year
    df["hour"] = parsed.dt.hour
    df["day"] = parsed.dt.day

# Flag US holidays; the lookup is done directly on the date strings
US_Holidays = holidays.UnitedStates()
train["holidays"] = train["Dates"].map(lambda x: x in US_Holidays)
test["holidays"] = test["Dates"].map(lambda x: x in US_Holidays)

In [51]:
# create a dictionary for dayparts
time_periods = {6:"early_morning", 7:"early_morning", 8:"early_morning", 
               9:"late_morning", 10:"late_morning", 11:"late_morning",
              12:"early_afternoon", 13:"early_afternoon", 14:"early_afternoon",
              15:"late_afternoon", 16:"late_afternoon", 17:"late_afternoon",
              18:"early_evening",  19:"early_evening",  20:"early_evening",
              21:"late_evening", 22:"late_evening", 23:"late_evening",
              0:"late_night", 1:"late_night", 2:"late_night",
              3:"late_night", 4:"late_night", 5:"late_night"}

# map time periods
train["dayparts"] = train["hour"].map(time_periods)
test["dayparts"] = test["hour"].map(time_periods)

In [57]:
# There is no clear pattern in crime trends by year, although the most recent year appears to be an outlier.
train["year"].value_counts()

2013    75606
2014    74766
2003    73902
2004    73422
2012    71731
2005    70779
2008    70174
2006    69909
2009    69000
2007    68015
2011    66619
2010    66542
2015    27584
Name: year, dtype: int64

In [58]:
# There is a pattern of increasing Larceny/Theft from 2012-2014, but again a steep drop in 2015.
print((train['year'].loc[train['Category'] == "LARCENY/THEFT"]).value_counts())

2014    18901
2013    18152
2012    15639
2006    13798
2011    13084
2003    12990
2008    12800
2007    12760
2009    12538
2005    12402
2010    12214
2004    12111
2015     7511
Name: year, dtype: int64


In [73]:
# Further exploration reveals that we only have data through May of 2015.
print((train['month'].loc[train['year'] == 2015]).value_counts())

3    6851
4    6609
2    6008
1    5866
5    2250
Name: month, dtype: int64
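
One way to address the partial year is to compare years on incidents per observed month rather than on raw totals. The sketch below reuses totals from the outputs above; the assumption that 2013 and 2014 each cover 12 months is not verified here.

```python
import pandas as pd

# Yearly totals copied from the outputs above
year_counts = pd.Series({2013: 75606, 2014: 74766, 2015: 27584})

# Months with data per year; 12 for 2013 and 2014 is an assumption, 5 for 2015 comes from the output above
months_observed = pd.Series({2013: 12, 2014: 12, 2015: 5})

# Average incidents per observed month puts the partial year on a comparable scale
per_month = year_counts / months_observed
print(per_month.round(1))
```

On the full frame, `train.groupby("year")["month"].nunique()` would give the observed months directly. May 2015 is itself incomplete, so this still understates the 2015 rate.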


In [88]:
# Crimes are most frequent on the first day of the month.
# In isolation (without month information), raw day-of-month counts can mislead because not every month has a
# 29th, 30th or 31st; the 31st, for example, occurs in only seven months of the year.
train["day"].value_counts()

1     32167
22    30589
8     30339
21    30038
19    30012
20    29963
4     29905
18    29793
7     29685
5     29557
23    29547
9     29502
6     29482
17    29031
3     28691
13    28580
10    28395
15    28224
12    28223
16    28146
24    27987
11    27952
14    27670
27    27577
2     27471
28    27269
29    27108
25    26932
26    26870
30    26589
31    14755
Name: day, dtype: int64
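
To adjust for months of different lengths, the raw counts can be divided by how often each day of the month occurs on the calendar. This is a rough sketch: the date span below (2003-01-01 through 2015-05-31) is an assumed approximation of the training period, and it treats every calendar day as equally covered by the data.

```python
import pandas as pd

# Count how often each day of the month occurs in the assumed calendar span
calendar = pd.date_range("2003-01-01", "2015-05-31", freq="D")
day_occurrences = pd.Series(calendar.day).value_counts()

# A few raw counts copied from the output above
raw_counts = pd.Series({1: 32167, 30: 26589, 31: 14755})

# Incidents per calendar occurrence of each day of the month
per_occurrence = raw_counts / day_occurrences.reindex(raw_counts.index)
print(per_occurrence.round(1))
```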

In [91]:
# Larceny/Theft is the #1 crime type overall, yet the 1st is not among its 15 busiest days,
# so other categories are driving the spike on the 1st day of the month.
print(((train['day'].loc[train['Category'] == "LARCENY/THEFT"]).value_counts()).head(15))

# Therefore, day of the month may help explain other types of offenses.
print(((train['Category'].loc[train['day'] == 1]).value_counts()).head(15))

19    6267
22    6259
21    6111
23    6101
18    6019
20    5991
8     5884
13    5856
17    5823
24    5812
5     5805
7     5768
27    5747
9     5732
29    5692
Name: day, dtype: int64
LARCENY/THEFT             5515
OTHER OFFENSES            4794
NON-CRIMINAL              3716
ASSAULT                   3017
DRUG/NARCOTIC             1821
VEHICLE THEFT             1658
VANDALISM                 1571
SUSPICIOUS OCC            1526
WARRANTS                  1210
BURGLARY                  1193
MISSING PERSON             998
FRAUD                      917
FORGERY/COUNTERFEITING     748
ROBBERY                    742
SECONDARY CODES            446
Name: Category, dtype: int64


In [92]:
# There is no clear pattern for crimes by month of the year (but again this could correlate with specific crime types).
# Note that the counts for January through May include the partial year 2015.
train["month"].value_counts()

10    80274
5     79644
4     78096
3     76320
1     73536
11    72975
9     71982
6     70892
2     70813
7     69971
8     68540
12    65006
Name: month, dtype: int64

In [94]:
# Looking at specific hours, crimes are somewhat more likely to occur after work and during lunchtime.
(train["hour"].value_counts().head(10))  

18    55104
17    53553
12    51934
16    50137
19    49475
15    48058
22    45741
0     44865
20    44694
14    44424
Name: hour, dtype: int64

In [93]:
# Aggregated into buckets, crime skews slightly toward daytime/early evening hours.
train["dayparts"].value_counts()

late_afternoon     151748
early_evening      149273
early_afternoon    139503
late_evening       131862
late_night         125848
late_morning       111734
early_morning       68081
Name: dayparts, dtype: int64
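
The buckets in `time_periods` are not all the same width: `late_night` spans six hours while the others span three. Dividing each count by its bucket width gives a per-hour view, sketched below with the counts copied from the output above.

```python
import pandas as pd

# Daypart counts copied from the output above
daypart_counts = pd.Series({
    "late_afternoon": 151748, "early_evening": 149273, "early_afternoon": 139503,
    "late_evening": 131862, "late_night": 125848, "late_morning": 111734,
    "early_morning": 68081,
})

# Hours covered by each bucket in the time_periods dictionary
hours_per_daypart = pd.Series(3, index=daypart_counts.index)
hours_per_daypart["late_night"] = 6

# Average incidents per hour within each bucket, highest first
per_hour = (daypart_counts / hours_per_daypart).sort_values(ascending=False)
print(per_hour.round(0))
```

On a per-hour basis `late_night` falls to the bottom of the ranking, just below `early_morning`.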

In [85]:
# Overall, 3.2% of all crimes occur on holidays, slightly higher than the 2.7% expected if crime were spread
# evenly across roughly 10 holidays in a 365-day year.
train["holidays"].value_counts()

False    850015
True      28034
Name: holidays, dtype: int64
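
The percentages in the comment can be reproduced from the counts above. The figure of 10 holidays per year is the notebook's own approximation.

```python
# Counts copied from the value_counts output above
holiday_incidents = 28034
non_holiday_incidents = 850015

# Share of incidents that fall on a flagged holiday
share = holiday_incidents / (holiday_incidents + non_holiday_incidents)

# Share expected if incidents were spread evenly over ~10 holidays in a 365-day year
baseline = 10 / 365

print(f"holiday share: {share:.1%}, even-spread baseline: {baseline:.1%}")
```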

In [98]:
# Crime is more prevalent during winter holidays (excluding Christmas Day), but this flag does not account for
# Christmas Eve, New Year's Eve and the weekends/weeks surrounding national holidays.
print(((train['month'].loc[train['holidays'] == True]).value_counts()).head(30))

11    4790
1     4638
2     4477
5     3875
9     3862
12    3286
10    2388
7      718
Name: month, dtype: int64
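
Widening the holiday definition to include eves could look like the sketch below. The helper `expand_holidays` and its parameters are hypothetical names introduced for illustration, and the two input dates are example values only.

```python
from datetime import date, timedelta

def expand_holidays(holiday_dates, days_before=1, days_after=0):
    """Return the holidays plus the given number of days before and after each one."""
    expanded = set()
    for day in holiday_dates:
        for offset in range(-days_before, days_after + 1):
            expanded.add(day + timedelta(days=offset))
    return expanded

# Hypothetical input: two holidays from one winter season
base = {date(2014, 12, 25), date(2015, 1, 1)}
with_eves = expand_holidays(base)
print(sorted(with_eves))
```

Raising `days_before` and `days_after` would extend the same idea to holiday weekends or weeks, at the cost of a less specific flag.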
