## Let's Fetch Some Data

Download the data set for this course from [this link](http://bit.ly/da-crimes-sample) and save it as `data/crimes_chicago.csv`, the path the cells below read from. The data set covers criminal activity in Chicago from January 1, 2015 to October 24, 2015.

In [37]:
import pandas as pd

In [38]:
crimes = pd.read_csv("data/crimes_chicago.csv")

Let's look at the first five rows of the data set using the `head` method.

In [39]:
crimes.head(5)

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10288746 | HY476724 | 10/24/2015 11:59:00 PM | 016XX E HAYES DR | 0610 | BURGLARY | FORCIBLE ENTRY | PARK PROPERTY | False | False | ... | 5 | 42 | 05 | 1188497 | 1863537 | 2015 | 10/31/2015 03:56:20 PM | 41.780617 | -87.584477 | (41.780617401, -87.584477038) |
| 1 | 10288063 | HY475407 | 10/24/2015 11:55:00 PM | 003XX W 51ST ST | 051A | ASSAULT | AGGRAVATED: HANDGUN | ALLEY | False | False | ... | 3 | 37 | 04A | 1175009 | 1871144 | 2015 | 10/31/2015 03:56:20 PM | 41.801803 | -87.633699 | (41.801803486, -87.633699142) |
| 2 | 10287811 | HY476065 | 10/24/2015 11:50:00 PM | 117XX S MARSHFIELD AVE | 0820 | THEFT | $500 AND UNDER | GROCERY FOOD STORE | False | False | ... | 34 | 75 | 06 | 1167518 | 1826850 | 2015 | 10/31/2015 03:56:20 PM | 41.680418 | -87.662438 | (41.680418426, -87.662437948) |
| 3 | 10287226 | HY475363 | 10/24/2015 11:50:00 PM | 083XX S ELLIS AVE | 0460 | BATTERY | SIMPLE | SIDEWALK | False | False | ... | 8 | 44 | 08B | 1184335 | 1850002 | 2015 | 10/31/2015 03:56:20 PM | 41.743574 | -87.600158 | (41.743574495, -87.600158418) |
| 4 | 10287210 | HY475369 | 10/24/2015 11:50:00 PM | 039XX S CALIFORNIA AVE | 0454 | BATTERY | AGG PO HANDS NO/MIN INJURY | JAIL / LOCK-UP FACILITY | True | False | ... | 14 | 58 | 08B | 1158335 | 1878579 | 2015 | 10/31/2015 03:56:20 PM | 41.822562 | -87.694647 | (41.822562231, -87.694646782) |


How many data points do we have?

In [40]:
crimes.shape

(211926, 22)
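
`shape` is a `(rows, columns)` tuple, so the output above means 211,926 records with 22 columns each. A minimal sketch of unpacking it, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Toy frame standing in for `crimes`; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "primary_type": ["THEFT", "BATTERY", "ASSAULT"],
})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(n_rows, "rows,", n_cols, "columns")
```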

Let's examine which columns are available to us in the data set.

In [41]:
crimes.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

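Let's rename the columns to lowercase names with underscores, matching Python's variable naming convention. This also lets us access a column as an attribute, such as `crimes.block`.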

In [42]:
crimes.columns = [header.lower().replace(" ", "_") for header in crimes.columns]

In [43]:
crimes.columns

Index(['id', 'case_number', 'date', 'block', 'iucr', 'primary_type',
       'description', 'location_description', 'arrest', 'domestic', 'beat',
       'district', 'ward', 'community_area', 'fbi_code', 'x_coordinate',
       'y_coordinate', 'year', 'updated_on', 'latitude', 'longitude',
       'location'],
      dtype='object')
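
The list comprehension above could also be written with the vectorized string methods on the column index. This is only an alternative sketch on a made-up `toy` frame, not a change to the approach used in this notebook:

```python
import pandas as pd

# Empty toy frame with a few of the original header names
toy = pd.DataFrame(columns=["Case Number", "Primary Type", "FBI Code"])

# Lowercase every header, then replace spaces with underscores (literal match, no regex)
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)
print(list(toy.columns))
```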

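## Cleaning Things Up

Each `block` value masks the last two digits of the street number with `X` (for example, `016XX E HAYES DR`). To turn these into addresses that a geocoder can look up, we replace each `X` in the street number with `0` and title-case the result.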

In [44]:
def clean_block(block):
    crime_block_parts = block.split(" ")
    crime_block_parts[0] = crime_block_parts[0].replace("X", "0")
    return " ".join(crime_block_parts).title()

In [45]:
crimes["block"] = crimes.block.apply(clean_block)
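
To see what `clean_block` does, here it is applied to two block values taken from the `head` output. The function is repeated so the snippet runs on its own; the `samples` list is just for illustration.

```python
def clean_block(block):
    # Only the first token (the street number) has its X's replaced
    crime_block_parts = block.split(" ")
    crime_block_parts[0] = crime_block_parts[0].replace("X", "0")
    return " ".join(crime_block_parts).title()

samples = ["016XX E HAYES DR", "003XX W 51ST ST"]
cleaned = [clean_block(block) for block in samples]
print(cleaned)
# ['01600 E Hayes Dr', '00300 W 51St St']
```

Note that `str.title` capitalizes the first letter after any non-letter, which is why `51ST` becomes `51St`.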

## What's Missing?

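Let's find out which columns have missing values. `count` returns the number of non-null entries in a column, so any total below 211926 means that column has gaps.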

In [10]:
for column in crimes.columns:
    print(column, ":", crimes[column].count())

id : 211926
case_number : 211926
date : 211926
block : 211926
iucr : 211926
primary_type : 211926
description : 211926
location_description : 211842
arrest : 211926
domestic : 211926
beat : 211926
district : 211926
ward : 211924
community_area : 211926
fbi_code : 211926
x_coordinate : 207174
y_coordinate : 207174
year : 211926
updated_on : 211926
latitude : 207174
longitude : 207174
location : 207174
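
Reading the counts above against the 211,926 rows: `location_description` is missing 84 values, `ward` is missing 2, and the five coordinate columns are each missing 4,752. The same information can be obtained directly as missing counts. The sketch below does this on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are invented for illustration.
toy = pd.DataFrame({
    "ward": [5.0, np.nan, 34.0],
    "latitude": [41.78, np.nan, np.nan],
    "arrest": [False, False, True],
})

# isnull() marks missing cells; summing the booleans counts them per column
missing = toy.isnull().sum()

# Keep only the columns that actually have missing values
missing = missing[missing > 0]
print(missing)
```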


## Geocoding Data Samples

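To fill in the missing coordinates, we can geocode each record's block address with the Google Maps Geocoding API. This requires an API key, which you can create in Google's Developer Console.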

In [69]:
API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key and never publish it

def GEOCODE_URL(addr):
    return ("https://maps.googleapis.com/maps/api/geocode/json?address="
            + addr.replace(" ", "+") + "&key=" + API_KEY)
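
The URL builder above only replaces spaces, so other reserved characters in an address (such as the comma in `Chicago, IL`) go through unencoded. A minimal sketch of a more defensive version using the standard library's `urlencode` follows. The names `GEOCODE_ENDPOINT` and `build_geocode_url` are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

# Hypothetical names for this sketch; the endpoint is the one used above.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(addr, api_key):
    """Return the request URL with percent-encoded query parameters."""
    query = urlencode({"address": addr, "key": api_key})
    return GEOCODE_ENDPOINT + "?" + query

url = build_geocode_url("01600 E Hayes Dr Chicago, IL", "YOUR_API_KEY")
print(url)
# https://maps.googleapis.com/maps/api/geocode/json?address=01600+E+Hayes+Dr+Chicago%2C+IL&key=YOUR_API_KEY
```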

In [91]:
import urllib.request
import json

def geocode_addr(addr):
    response = urllib.request.urlopen(GEOCODE_URL(addr))
    data = json.loads(response.read().decode("utf-8"))
    if data['results'] and data['results'][0]:
        loc = data['results'][0]['geometry']['location']
        return loc["lat"], loc["lng"]
    return None, None
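
The parsing step in `geocode_addr` can be tried without any network call. The sketch below pulls that logic into a hypothetical helper, `extract_lat_lng`, and feeds it two hand-written payloads that mimic only the fields the function reads (`results`, `geometry`, `location`). The coordinates are made up.

```python
import json

def extract_lat_lng(data):
    """Return (lat, lng) from a decoded geocoding response, or (None, None)."""
    results = data.get("results") or []
    if not results:
        return None, None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Hand-written payloads for illustration, not real API output
found = json.loads(
    '{"status": "OK", "results": '
    '[{"geometry": {"location": {"lat": 41.78, "lng": -87.58}}}]}'
)
not_found = json.loads('{"status": "ZERO_RESULTS", "results": []}')

print(extract_lat_lng(found))      # (41.78, -87.58)
print(extract_lat_lng(not_found))  # (None, None)
```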

In [92]:
def apply_geocode(crime):
    # Keep coordinates that are already present
    if pd.notnull(crime["latitude"]) and pd.notnull(crime["longitude"]):
        return crime["latitude"], crime["longitude"]
    print("Processing missing lat, lng data...")
    # The cleaned address was written back to the "block" column earlier
    addr = crime["block"] + " Chicago, IL"
    return geocode_addr(addr)

In [94]:
crimes["latitude"], crimes["longitude"] = zip(*crimes.apply(apply_geocode, axis = 1))

Processing missing lat, lng data...
Processing missing lat, lng 

URLError: <urlopen error [Errno 60] Operation timed out>
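
The run above stops at the first network timeout, which discards every result fetched so far. One way to make the loop more tolerant is to retry a failed lookup a few times and fall back to `(None, None)`. The sketch below is an assumption about how that could look, not the notebook's method. `geocode_with_retry` and `flaky_geocode` are hypothetical names, and the stub stands in for the real `geocode_addr` so the example runs offline.

```python
import time
from urllib.error import URLError

def geocode_with_retry(geocode, addr, retries=3, delay=1.0):
    """Call geocode(addr), retrying on URLError; return (None, None) if all attempts fail."""
    for attempt in range(retries):
        try:
            return geocode(addr)
        except URLError:
            # Wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))
    return None, None

# Stub geocoder that times out once and then succeeds (made-up coordinates)
calls = {"n": 0}

def flaky_geocode(addr):
    calls["n"] += 1
    if calls["n"] == 1:
        raise URLError("timed out")
    return 41.78, -87.58

result = geocode_with_retry(flaky_geocode, "01600 E Hayes Dr Chicago, IL", delay=0)
print(result)  # (41.78, -87.58)
```

It may also help to geocode only the rows selected by `crimes["latitude"].isnull()` instead of applying the function to all 211,926 rows.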

## Removing Unnecessary Columns

The `location` column contains the latitude and longitude in `(latitude, longitude)` form. We already have this data in the `latitude` and `longitude` columns, so we can remove it. We also drop the `x_coordinate` and `y_coordinate` columns.

In [17]:
crimes.drop("location", axis=1, inplace=True)

In [19]:
crimes = crimes.drop(["x_coordinate", "y_coordinate"], axis = 1)
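
Before dropping a column as redundant, it can be worth confirming that it really duplicates the others. A minimal sketch of such a check on a made-up `toy` frame follows. The values are copied from the first two rows of the `head` output, and the `1e-6` tolerance is an assumption chosen to match the six decimals shown there.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "latitude": [41.780617, 41.801803],
    "longitude": [-87.584477, -87.633699],
    "location": ["(41.780617401, -87.584477038)", "(41.801803486, -87.633699142)"],
})

# Strip the parentheses and split "(lat, lng)" into two float columns (0 and 1)
parsed = toy["location"].str.strip("()").str.split(", ", expand=True).astype(float)

# The column is redundant if both parts agree with latitude/longitude within tolerance
redundant = bool(
    np.allclose(parsed[0], toy["latitude"], rtol=0, atol=1e-6)
    and np.allclose(parsed[1], toy["longitude"], rtol=0, atol=1e-6)
)
print(redundant)

if redundant:
    toy = toy.drop(columns=["location"])
```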

## Augment the Data Set

Let's augment our data set with a column that marks whether each event occurred on a weekday or a weekend. The column holds 0 for a weekday (Monday through Friday) and 1 for a weekend day (Saturday or Sunday).

In [46]:
crimes["date"] = pd.to_datetime(crimes["date"])
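
Without an explicit `format`, `to_datetime` has to infer the layout of the strings. Passing the format that matches the `Date` values shown earlier (for example `10/24/2015 11:59:00 PM`) makes the parsing explicit and is typically faster on large frames. A small sketch on a made-up `raw` series:

```python
import pandas as pd

# Two strings in the same layout as the Date column; the second is invented
raw = pd.Series(["10/24/2015 11:59:00 PM", "01/01/2015 12:05:00 AM"])

# %I is the 12-hour clock and %p the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dt.hour.tolist())  # [23, 0]
```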

In [61]:
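def calculate_date_status(date):
    # dayofweek runs from 0 (Monday) to 6 (Sunday)
    if date.dayofweek <= 4:
        return 0  # weekday
    elif date.dayofweek <= 6:
        return 1  # weekend (Saturday or Sunday)
    else:
        return None  # not reachable for a valid date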

In [62]:
crimes["date_status"] = crimes.date.apply(calculate_date_status)

In [63]:
crimes["date_status"]

0         1
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         1
10        1
11        1
12        1
13        1
14        1
15        1
16        1
17        1
18        1
19        1
20        1
21        1
22        1
23        1
24        1
25        1
26        1
27        1
28        1
29        1
         ..
211896    0
211897    0
211898    0
211899    0
211900    0
211901    0
211902    0
211903    0
211904    0
211905    0
211906    0
211907    0
211908    0
211909    0
211910    0
211911    0
211912    0
211913    0
211914    0
211915    0
211916    0
211917    0
211918    0
211919    0
211920    0
211921    0
211922    0
211923    0
211924    0
211925    0
Name: date_status, dtype: int64
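
The same flag can be computed without a row-by-row `apply` by comparing `dt.dayofweek` for the whole column at once. This is an alternative sketch, not the notebook's method. The `dates` series is made up and spans Friday, October 23 through Monday, October 26, 2015.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2015-10-23", "2015-10-24", "2015-10-25", "2015-10-26"]
))

# dayofweek is 5 for Saturday and 6 for Sunday; the boolean result becomes 0/1
date_status = (dates.dt.dayofweek >= 5).astype(int)
print(date_status.tolist())  # [0, 1, 1, 0]
```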