### Import our libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

### Read our csv files into a dataframe

In [None]:
rawCallData = pd.read_csv('Data/Seattle_Police_Department_911_Incident_Response.csv')

In [46]:
rawWeatherData = pd.read_csv('Data/Seattle_Weather.csv')

rawEventData = pd.read_csv('Data/Special_Events_Permits.csv')

### Let's take a look at our data frames to see what we've got

In [3]:
rawCallData.head()

Unnamed: 0,CAD CDW ID,CAD Event Number,General Offense Number,Event Clearance Code,Event Clearance Description,Event Clearance SubGroup,Event Clearance Group,Event Clearance Date,Hundred Block Location,District/Sector,Zone/Beat,Census Tract,Longitude,Latitude,Incident Location,Initial Type Description,Initial Type Subgroup,Initial Type Group,At Scene Time
0,15736,10000246357,2010246357,242.0,FIGHT DISTURBANCE,DISTURBANCES,DISTURBANCES,07/17/2010 08:49:00 PM,3XX BLOCK OF PINE ST,M,M2,8100.2001,-122.338147,47.610975,"(47.610975163, -122.338146748)",,,,
1,15737,10000246471,2010246471,65.0,THEFT - MISCELLANEOUS,THEFT,OTHER PROPERTY,07/17/2010 08:50:00 PM,36XX BLOCK OF DISCOVERY PARK BLVD,Q,Q1,5700.1012,-122.404613,47.658325,"(47.658324899, -122.404612874)",,,,
2,15738,10000246255,2010246255,250.0,"MISCHIEF, NUISANCE COMPLAINTS","NUISANCE, MISCHIEF COMPLAINTS","NUISANCE, MISCHIEF",07/17/2010 08:55:00 PM,21XX BLOCK OF 3RD AVE,M,M2,7200.2025,-122.342843,47.613551,"(47.613551471, -122.342843234)",,,,
3,15739,10000246473,2010246473,460.0,TRAFFIC (MOVING) VIOLATION,TRAFFIC RELATED CALLS,TRAFFIC RELATED CALLS,07/17/2010 09:00:00 PM,7XX BLOCK OF ROY ST,D,D1,7200.1002,-122.341847,47.625401,"(47.625401388, -122.341846999)",,,,
4,15740,10000246330,2010246330,250.0,"MISCHIEF, NUISANCE COMPLAINTS","NUISANCE, MISCHIEF COMPLAINTS","NUISANCE, MISCHIEF",07/17/2010 09:00:00 PM,9XX BLOCK OF ALOHA ST,D,D1,6700.1009,-122.339709,47.627425,"(47.627424837, -122.339708605)",,,,


In [4]:
rawWeatherData.head()

Unnamed: 0,dt,dt_iso,city_id,city_name,lat,lon,temp,temp_min,temp_max,pressure,...,rain_today,snow_1h,snow_3h,snow_24h,snow_today,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1349096400,2012-10-01 13:00:00 +0000 UTC,5809844,,,,281.8,278.15,287.59,1027,...,,,,,,1,800,Clear,sky is clear,01n
1,1349186400,2012-10-02 14:00:00 +0000 UTC,5809844,,,,281.62,278.15,286.48,1046,...,,,,,,66,800,Clear,sky is clear,02d
2,1349190000,2012-10-02 15:00:00 +0000 UTC,5809844,,,,282.71,279.82,289.82,1026,...,,,,,,1,800,Clear,sky is clear,01d
3,1349193600,2012-10-02 16:00:00 +0000 UTC,5809844,,,,285.05,281.48,293.15,1026,...,,,,,,1,800,Clear,sky is clear,01d
4,1349197200,2012-10-02 17:00:00 +0000 UTC,5809844,,,,287.97,282.59,296.48,1027,...,,,,,,1,800,Clear,sky is clear,01d


### Let's start by cleaning up those column names in rawCallData

In [5]:
rawCallData.columns = rawCallData.columns.str.lower().str.replace(" ", "_").str.replace("/", "_") 
# Remove the white space and slashes in our column names
rawCallData.columns # Check our work

Index(['cad_cdw_id', 'cad_event_number', 'general_offense_number',
       'event_clearance_code', 'event_clearance_description',
       'event_clearance_subgroup', 'event_clearance_group',
       'event_clearance_date', 'hundred_block_location', 'district_sector',
       'zone_beat', 'census_tract', 'longitude', 'latitude',
       'incident_location', 'initial_type_description',
       'initial_type_subgroup', 'initial_type_group', 'at_scene_time'],
      dtype='object')
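
The same cleanup can be tried on a toy frame before touching the real data. This is only a sketch: `demo` is a made-up, empty DataFrame whose headers imitate the 911 export, and the single regular expression is one possible alternative to chaining two `str.replace` calls.

```python
import pandas as pd

# Toy frame with headers shaped like the ones in the 911 export (no rows needed)
demo = pd.DataFrame(columns=["CAD CDW ID", "District/Sector", "Zone/Beat", "At Scene Time"])

# Lowercase, then turn any run of spaces or slashes into a single underscore
demo.columns = demo.columns.str.lower().str.replace(r"[ /]+", "_", regex=True)

print(list(demo.columns))
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for that argument has changed over time.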

### That's better. Now let's examine the NaN values, starting with rawCallData

In [6]:
print(rawCallData.isnull().sum())

cad_cdw_id                           0
cad_event_number                     0
general_offense_number               0
event_clearance_code             10797
event_clearance_description      10798
event_clearance_subgroup         10798
event_clearance_group            10798
event_clearance_date             10951
hundred_block_location            3487
district_sector                   1162
zone_beat                            1
census_tract                      2792
longitude                            1
latitude                             1
incident_location                    1
initial_type_description        577813
initial_type_subgroup           577813
initial_type_group              577813
at_scene_time                  1029344
dtype: int64
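
Counts per column do not tell us whether the same rows are missing several fields at once. A rough way to check is to compare the null masks directly. The sketch below uses a made-up four-row frame (`demo`) with three of our column names; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows that imitate the missingness patterns we suspect
demo = pd.DataFrame({
    "event_clearance_code": [242.0, np.nan, 65.0, np.nan],
    "event_clearance_description": ["FIGHT DISTURBANCE", None, "THEFT - MISCELLANEOUS", None],
    "initial_type_description": [None, "TRESPASS", None, None],
})

nulls = demo.isnull()  # True wherever a value is missing

# Rows where both clearance columns are missing together
both_clearance = nulls[["event_clearance_code", "event_clearance_description"]].all(axis=1)

# Cross-tabulate missingness of the two description columns
overlap = pd.crosstab(nulls["event_clearance_description"], nulls["initial_type_description"])

print(both_clearance.sum())
print(overlap)
```

If the crosstab cell for (True, True) is small compared to the others, the two description columns mostly cover for each other.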


### We have a significant number of records missing some information
The missing values seem to be clustered. Roughly 10,800 records are missing the event clearance fields, 577,813 records are missing the initial type fields, and in a third, larger cluster most of the records (1,029,344) are missing the at scene time. The event clearance and initial type columns all seem to describe the same thing: what the call was about. The at scene time and event clearance date similarly overlap. The bad news is that most of the missing data pertains to what we care about, which is what happened and when. The good news is that these columns seem to provide redundant information, so we can use one to impute the other. The subgroup and group columns add nothing we need beyond the descriptions, so they can be dropped. Let's also see if there are any records that provide no relevant information.

In [7]:
rawCallData.drop(['event_clearance_subgroup', 'event_clearance_group', 'initial_type_subgroup', 
                  'initial_type_group'], axis=1,inplace=True)

In [8]:
mask = (rawCallData.event_clearance_description.isnull()) & (rawCallData.initial_type_description.isnull())
noEvent = rawCallData[mask] # Our mask selects records that have a null value in both description columns
print(noEvent.shape) # Check the size of our haul

(932, 15)


### We have 932 records with no event descriptor, so we will have to remove them

In [9]:
print(rawCallData.shape) # Check original datafile shape
rawCallData = rawCallData[~mask] # Remove by selecting the inverse of our mask as subset
print(rawCallData.shape) # Verify our subtraction

(1445066, 15)
(1444134, 15)
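
Once the rows with no description at all are gone, every remaining row has at least one of the two descriptions, which is what makes the "use one to impute the other" idea possible. A minimal sketch of that idea on made-up rows follows; the combined `description` column is hypothetical and is not created anywhere else in this notebook.

```python
import pandas as pd

# Made-up rows: one with only a clearance description, one with only an
# initial type description, and one with neither
demo = pd.DataFrame({
    "event_clearance_description": ["FIGHT DISTURBANCE", None, None],
    "initial_type_description": [None, "TRESPASS", None],
})

# Drop rows where both descriptions are missing
both_missing = (demo["event_clearance_description"].isnull()
                & demo["initial_type_description"].isnull())
demo = demo[~both_missing].copy()

# Hypothetical combined column: prefer the clearance description,
# fall back to the initial type description
demo["description"] = demo["event_clearance_description"].combine_first(
    demo["initial_type_description"]
)

print(demo["description"].tolist())
```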


In [10]:
mask2 = (rawCallData.at_scene_time.isnull()) & (rawCallData.event_clearance_date.isnull())
noTime = rawCallData[mask2] # Our mask selects records that have a null value in both time columns
print(noTime.shape) # Check the size of our haul

(9, 15)


### We have 9 records with no time, so we will have to remove them

In [11]:
print(rawCallData.shape) # Check original datafile shape
rawCallData = rawCallData[~mask2] # Remove by selecting the inverse of our mask as subset
print(rawCallData.shape) # Verify our subtraction

(1444134, 15)
(1444125, 15)


### Looking again at the time values
We can only use the period where our datasets' date ranges overlap, so let's switch gears and make sure that our dataframes cover the same time period. Continuing to clean up the null values in rawCallData could be a waste if those records don't overlap with our weather data. We will start by cleaning up the assorted time columns and getting one formatted datetime column.

In [12]:
rawCallData['formatted_time'] = pd.to_datetime(rawCallData.event_clearance_date, errors='coerce', 
                                               infer_datetime_format=True)

In [13]:
rawCallData.formatted_time.head()

0   2010-07-17 20:49:00
1   2010-07-17 20:50:00
2   2010-07-17 20:55:00
3   2010-07-17 21:00:00
4   2010-07-17 21:00:00
Name: formatted_time, dtype: datetime64[ns]
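
To see what `errors='coerce'` does, here is a small sketch on made-up strings. It passes an explicit `format`, which assumes every row uses the `MM/DD/YYYY HH:MM:SS AM/PM` layout seen in the head of the table; that assumption has not been verified for the full file. Recent pandas releases deprecate `infer_datetime_format`, so an explicit format is one possible replacement.

```python
import pandas as pd

# Two strings in the layout seen above, plus one deliberately bad value
raw = pd.Series(["07/17/2010 08:49:00 PM", "07/17/2010 09:00:00 PM", "not a date"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

print(parsed)
print(parsed.isnull().sum())  # how many values failed to parse
```

Counting the NaT values after parsing is a cheap way to learn how many rows the coercion silently discarded.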

In [47]:
rawWeatherData['formatted_time'] = pd.to_datetime(rawWeatherData.dt, unit='s')

In [48]:
rawWeatherData.formatted_time.head()

0   2012-10-01 13:00:00
1   2012-10-02 14:00:00
2   2012-10-02 15:00:00
3   2012-10-02 16:00:00
4   2012-10-02 17:00:00
Name: formatted_time, dtype: datetime64[ns]

### We can see this matches the UTC time, so we need to convert to Pacific time.
This is a three step process: first we mark our naive times as UTC with tz_localize, then we convert them to America/Los_Angeles with tz_convert, and finally we drop the zone information again with tz_localize(None). Calling tz_convert directly on the Series would act on the index rather than the values, so each step goes through the .dt accessor.

In [49]:
rawWeatherData.formatted_time = rawWeatherData.formatted_time.dt.tz_localize('UTC') # Naive -> UTC aware
rawWeatherData.formatted_time = rawWeatherData.formatted_time.dt.tz_convert('America/Los_Angeles') # UTC -> Pacific
rawWeatherData.formatted_time = rawWeatherData.formatted_time.dt.tz_localize(None) # Drop the zone info again

rawWeatherData.formatted_time.head()

0   2012-10-01 06:00:00
1   2012-10-02 07:00:00
2   2012-10-02 08:00:00
3   2012-10-02 09:00:00
4   2012-10-02 10:00:00
Name: formatted_time, dtype: datetime64[ns]
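
A minimal sketch of the same three steps on a few epoch values, using the `.dt` accessor throughout. The first two values come from the weather table above; the third (midnight UTC on 2013-01-01) is added here only to show that the `America/Los_Angeles` zone switches between PDT and PST on its own.

```python
import pandas as pd

# Epoch seconds: two from the weather data, one winter value added for illustration
utc_naive = pd.Series(pd.to_datetime([1349096400, 1349186400, 1356998400], unit="s"))

local_naive = (
    utc_naive.dt.tz_localize("UTC")            # naive -> UTC aware
    .dt.tz_convert("America/Los_Angeles")      # UTC -> Pacific (PDT or PST as appropriate)
    .dt.tz_localize(None)                      # drop the zone, keep local wall time
)

print(local_naive)
```

The October rows shift by seven hours and the January row by eight, so a fixed offset would not have been a safe shortcut.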

In [50]:
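# Keep only calls made after the weather data begins
# (13:00 is the first weather record's UTC timestamp, which is 06:00 Pacific)
mask3 = (rawCallData.formatted_time > '2012-10-01 13:00:00')
rawCallData = rawCallData[mask3]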

In [51]:
rawCallData.shape

(879393, 16)
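
Rather than typing a cutoff by hand, the overlap window could also be computed from the two frames. This is a sketch under assumptions: `calls` and `weather` are tiny made-up frames, and the last weather timestamp is invented, since we have not looked at where the real weather data ends.

```python
import pandas as pd

# Made-up stand-ins for the two datasets, both in local naive time
calls = pd.DataFrame({"formatted_time": pd.to_datetime([
    "2010-07-17 20:49:00", "2012-10-01 05:00:00",
    "2012-10-02 08:30:00", "2016-01-24 11:54:55",
])})
weather = pd.DataFrame({"formatted_time": pd.to_datetime([
    "2012-10-01 06:00:00", "2012-10-02 07:00:00", "2015-12-31 23:00:00",
])})

# The shared window runs from the later start to the earlier end
start = max(calls.formatted_time.min(), weather.formatted_time.min())
end = min(calls.formatted_time.max(), weather.formatted_time.max())

# Keep only the calls that fall inside the window (both ends inclusive)
calls_overlap = calls[calls.formatted_time.between(start, end)]

print(start, end, len(calls_overlap))
```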

In [52]:
rawCallData.isnull().sum()

cad_cdw_id                          0
cad_event_number                    0
general_offense_number              0
event_clearance_code                0
event_clearance_description         0
event_clearance_date                0
hundred_block_location              1
district_sector                   937
zone_beat                           0
census_tract                     1615
longitude                           0
latitude                            0
incident_location                   0
initial_type_description        72663
at_scene_time                  516351
formatted_time                      0
dtype: int64

# This is an important learning note:
# _Always start with a plan_
To be honest, when cleaning this data I just started with the 911 call data, looking for ways to clean out the NaN values. I'm leaving the original plan in this notebook, which was to use one column to impute the other, without regard to the bigger picture. This involved some neat data cleaning tricks, but in the end it would have been completely wasted work. Without the corresponding weather data, those records are worthless to us. Luckily, I stopped and thought it out after hitting a few roadblocks, and realized that the work might not be necessary. It turns out that had I stopped and made a plan in the first place, I could have saved quite a bit of time.
## _Always make a plan and then tackle your data, don't just start coding_
### Duly noted, on we go

#### We can drop the remaining columns with null values and redundant data

In [58]:
rawCallData.drop(['event_clearance_date', 'hundred_block_location', 'district_sector', 'census_tract',
                 'incident_location', 'initial_type_description', 'at_scene_time'], inplace=True, axis=1)
rawCallData.shape

(879393, 9)

In [59]:
rawCallData.head()

Unnamed: 0,cad_cdw_id,cad_event_number,general_offense_number,event_clearance_code,event_clearance_description,zone_beat,longitude,latitude,formatted_time
49,1658027,16000028163,201628163,245.0,"DISTURBANCE, OTHER",N2,-122.34777,47.731678,2016-01-24 11:54:55
70,1658028,16000028161,201628161,280.0,SUSPICIOUS PERSON,S1,-122.280685,47.523026,2016-01-24 11:57:35
105,1658029,16000028159,201628159,65.0,THEFT - MISCELLANEOUS,M1,-122.342,47.609535,2016-01-24 11:54:28
190,1658030,16000028134,201628134,200.0,ALACAD - COMMERCIAL BURGLARY (FALSE),U1,-122.31302,47.668995,2016-01-24 11:53:22
255,1658031,16000028114,201628114,161.0,TRESPASS,D2,-122.34467,47.64158,2016-01-24 11:59:45


### Now we can check whether the event numbers and ID numbers offer us anything other than redundant information

In [61]:
print(rawCallData.cad_cdw_id.nunique())
print(rawCallData.cad_event_number.nunique())
print(rawCallData.general_offense_number.nunique())

879367
878480
878480
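
Matching unique counts suggest, but do not prove, that two columns carry the same information. One way to check is to count how many distinct offense numbers each event number maps to, and to pull out the rows that repeat an ID. The frame below is a made-up sketch: the first three rows reuse values from the table above, and the fourth is a deliberate duplicate.

```python
import pandas as pd

# Three rows taken from the head above, plus one invented duplicate of the third
demo = pd.DataFrame({
    "cad_cdw_id": [1658027, 1658028, 1658029, 1658029],
    "cad_event_number": [16000028163, 16000028161, 16000028159, 16000028159],
    "general_offense_number": [201628163, 201628161, 201628159, 201628159],
})

# Distinct offense numbers seen for each event number
per_event = demo.groupby("cad_event_number")["general_offense_number"].nunique()
event_determines_offense = bool((per_event == 1).all())

# Rows that repeat an earlier cad_cdw_id
repeats = demo[demo.duplicated(subset="cad_cdw_id", keep="first")]

print(event_determines_offense, len(repeats))
```

If every event number maps to exactly one offense number, and the reverse check holds as well, one of the two columns can be dropped without losing information.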
