In [8]:
import csv
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import DBSCAN, KMeans

In [19]:
def clusters_to_csv(labels, types, coords):
    '''
    Helper function to print scikit-learn cluster assignments as abbreviated
    CSV rows in the format cluster_id,incident_type,lat,lng
    '''
    for k in set(labels):
        # Positions of every point that was assigned to cluster k
        class_members = [index[0] for index in np.argwhere(labels == k)]
        for index in class_members:
            print('%s,%s,%s' % (int(k), types[index], '{0},{1}'.format(*coords[index])))

## Preparing the data

After we import our CSV of crime data, we need to do a couple of things to get it ready for clustering: extract the coordinate pairs that we want to cluster, and pull together some simple labels so we know which incident each point refers to.

In [2]:
with open('data/columbia_crime.csv', 'r') as f:
    data = list(csv.DictReader(f))

In [15]:
# This part just splits out the latitude and longitude coordinate fields for each incident, which we need for mapping.
coords = [(float(d['lat']), float(d['lng'])) for d in data if len(d['lat']) > 0]
print(coords[:10])

[(38.9379, -92.3343), (38.9515, -92.3265), (38.93819, -92.35111), (38.96017, -92.32459), (38.99221, -92.31791), (38.9526, -92.3265), (38.9505, -92.3276), (38.95723, -92.34396), (38.9499, -92.3276), (38.96531, -92.34368)]


In [16]:
# And this creates a matching list of incident types, skipping the same rows that have no coordinates
types = [d['ExtNatureDisplayName'] for d in data if len(d['lat']) > 0]
print(types[:10])

['TRAFFIC', 'DWI', 'CHECK SUBJECT', 'TRAFFIC STOP', 'TRESPASSING', 'TRAFFIC STOP', 'LEAVING THE SCENE ACCIDENT', 'DISTURBANCE', 'DWI', 'LAW ALARM']
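The helper function looks up `types[index]` and `coords[index]` by the same position, so the two lists have to stay the same length and in the same order. One way to make sure of that is to filter the rows once and build both lists from the filtered result. This is only a sketch: the `rows` sample below is made up to stand in for the real CSV, and the names `located`, `sample_coords` and `sample_types` are illustrative.

```python
# Hypothetical sample rows standing in for csv.DictReader output;
# the field names match the ones used above.
rows = [
    {'lat': '38.9379', 'lng': '-92.3343', 'ExtNatureDisplayName': 'TRAFFIC'},
    {'lat': '', 'lng': '', 'ExtNatureDisplayName': 'DWI'},  # no location recorded
    {'lat': '38.9515', 'lng': '-92.3265', 'ExtNatureDisplayName': 'CHECK SUBJECT'},
]

# Filter once, then build both lists from the same filtered rows
located = [d for d in rows if len(d['lat']) > 0]
sample_coords = [(float(d['lat']), float(d['lng'])) for d in located]
sample_types = [d['ExtNatureDisplayName'] for d in located]

# Each label now lines up with its coordinate pair
assert len(sample_coords) == len(sample_types)
print(list(zip(sample_types, sample_coords)))
```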


## K-means clustering

Here we'll review the idea of k-means clustering you discussed last week and see how it applies to our crime data. We'll start with three clusters.
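As a refresher, k-means alternates between two steps: assign every point to its nearest center, then move each center to the mean of the points assigned to it. The toy version below is a sketch of that loop in plain NumPy, not what scikit-learn runs internally. The function name `simple_kmeans`, the fixed iteration count, and the choice to start from the first `k` points are all simplifying assumptions (scikit-learn uses a smarter `k-means++` initialization by default).

```python
import numpy as np

def simple_kmeans(points, k, iterations=10):
    """Toy k-means for illustration; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    # Simplifying assumption: use the first k points as starting centers
    centers = points[:k].copy()
    for _ in range(iterations):
        # Assignment step: distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Four made-up points forming two obvious groups
toy_labels, toy_centers = simple_kmeans([(0, 0), (10, 10), (0, 1), (10, 11)], 2)
print(toy_labels)   # [0 1 0 1]
print(toy_centers)
```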

In [17]:
number_of_clusters = 3
kmeans = KMeans(n_clusters=number_of_clusters)
kmeans.fit(coords)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
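Once a model is fitted, its `cluster_centers_` and `labels_` attributes give a quick summary before printing every row. The sketch below fits a separate model on just the ten coordinates printed earlier, so the variable names (`first_ten`, `model`, `sizes`) and the fixed `random_state` are illustrative choices rather than part of the notebook's own run.

```python
import numpy as np
from sklearn.cluster import KMeans

# The ten coordinate pairs printed earlier in the notebook
first_ten = [(38.9379, -92.3343), (38.9515, -92.3265), (38.93819, -92.35111),
             (38.96017, -92.32459), (38.99221, -92.31791), (38.9526, -92.3265),
             (38.9505, -92.3276), (38.95723, -92.34396), (38.9499, -92.3276),
             (38.96531, -92.34368)]

# random_state is fixed here only so repeated runs give the same clusters
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(first_ten)

# One (lat, lng) center per cluster
print(model.cluster_centers_)

# How many points landed in each cluster
sizes = np.bincount(model.labels_, minlength=3)
print(sizes)
```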

In [18]:
clusters_to_csv(kmeans.labels_, types, coords)

0,TRAFFIC,38.9379,-92.3343
0,DWI,38.9515,-92.3265
0,CHECK SUBJECT,38.93819,-92.35111
0,TRAFFIC STOP,38.96017,-92.32459
0,TRESPASSING,38.99221,-92.31791
0,TRAFFIC STOP,38.9526,-92.3265
0,LEAVING THE SCENE ACCIDENT,38.9505,-92.3276
0,DISTURBANCE,38.95723,-92.34396
0,DWI,38.9499,-92.3276
0,LAW ALARM,38.96531,-92.34368
0,LARCENY MV,38.96297,-92.32342
0,TRAFFIC,38.9381,-92.3593
0,TRAFFIC STOP,38.96474,-92.33143
0,CHECK SUBJECT,38.94987,-92.32676
0,SUSPICIOUS VEHICLE,38.9602,-92.32905
0,ASSIST CITIZEN,38.9633,-92.3387
0,911 CHECKS,38.954167,-92.33414
0,DISTURBANCE,38.9633,-92.3398
0,SUSPICIOUS INCIDENT,38.9585,-92.33271
0,LARCENY,39.00124,-92.31707
0,SUSPICIOUS PERSON,38.96419,-92.37769
0,DISTURBANCE,38.98455,-92.36872
0,911 CHECKS,38.993,-92.3174
0,VANDALISM,39.0061,-92.31822
0,ASSAULT,38.98646,-92.32027
0,911 CHECKS,38.95544,-92.36135
0,LAW ALARM,38.96772,-92.31883
0,MISCHIEF,38.96164,-92.32737
0,CHECK SUBJECT,38.96429,-92.32096
0,TRAFFIC,38.9564,-92.3598
0,TRESPASSING,38.99595,-92.31625
...

The data comes out in the format **cluster_id,incident_type,lat,lng**. If we save it to a CSV file, we can load it into Google's simple map viewer tool to see how it looks.
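Rather than copying the printed rows by hand, the same columns can be written straight to a file. The following is a minimal sketch, not part of the original notebook: the function name `write_clusters_csv` and the header row are assumptions, and it is written for Python 3 (under Python 2, open the file with mode `'wb'` and drop the `newline` argument).

```python
import csv

def write_clusters_csv(path, labels, types, coords):
    """Write one cluster_id,incident_type,lat,lng row per point."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        # Header row (an assumption; it makes the columns easier to identify on import)
        writer.writerow(['cluster_id', 'incident_type', 'lat', 'lng'])
        for label, kind, (lat, lng) in zip(labels, types, coords):
            writer.writerow([int(label), kind, lat, lng])
```

With the fitted model from above, a call would look like `write_clusters_csv('clusters.csv', kmeans.labels_, types, coords)`, where `clusters.csv` is a placeholder filename. Unlike the printed output, the rows keep their original order instead of being grouped by cluster.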

As you can [see](https://www.google.com/maps/d/u/1/edit?authuser=1&mid=z9S6reOYqCIE.kvcm3XzvvgEA), segmenting the data into only three clusters doesn't give us anything useful. Let's try a bigger number.