# Spatial clusters
## Lecture objectives
1. Adapt the idea of k-means cluster analysis to spatial clusters
2. Provide more practice with k-means clustering and data wrangling

The clustering principles from the previous lecture also apply to identifying clusters in space. The only difference is that we cluster on the `x` and `y` coordinates rather than on non-spatial variables such as voting on a particular proposition.

Let's look at [clusters of marijuana dispensaries](https://data.lacity.org/Administration-Finance/Medical-Marijuana-Collective-Map/rv73-2zf4) in LA.

In [None]:
import pandas as pd
import requests
import json

url = 'https://data.lacity.org/resource/bunu-zsyc.json'
r = requests.get(url)
mdf = pd.DataFrame(json.loads(r.content))
mdf.head()

Note that the lat/lon information is stored as a dictionary in the `location_1` column.

In [None]:
mdf.location_1.iloc[0]

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Extract the lat/lon information from the <strong>location_1</strong> column and place it in two new columns
</div>

In [None]:
mdf['lat'] = mdf.location_1.apply(lambda x: float(x['coordinates'][1]))
mdf['lon'] = mdf.location_1.apply(lambda x: float(x['coordinates'][0]))

Which rows are causing this problem? Let's take a look.

In [None]:
mdf.location_1.head(20)

OK, we need to drop the rows with null values in this column.

In [None]:
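# drop rows with no location; passing a list to subset works across pandas versions
mdf.dropna(subset=['location_1'], inplace=True)
# coordinates are stored as [lon, lat]
mdf['lat'] = mdf.location_1.apply(lambda x: float(x['coordinates'][1]))
mdf['lon'] = mdf.location_1.apply(lambda x: float(x['coordinates'][0]))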
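
An alternative to dropping rows first is a small helper that tolerates missing values. The sketch below is only illustrative: `extract_lat_lon` is a hypothetical name, the sample dictionary is made up, and it assumes (as in the cell above) that the coordinates are stored in `[lon, lat]` order.

```python
def extract_lat_lon(location):
    """Return (lat, lon) from a point dictionary, or (None, None) if it is missing."""
    # Missing values show up as NaN (a float), not as a dictionary
    if not isinstance(location, dict) or 'coordinates' not in location:
        return None, None
    coords = location['coordinates']
    # Assumed order, as in the notebook: index 0 is lon, index 1 is lat
    return float(coords[1]), float(coords[0])

# Made-up example values, not taken from the dataset
sample = {'type': 'Point', 'coordinates': [-118.25, 34.05]}
print(extract_lat_lon(sample))
print(extract_lat_lon(float('nan')))
```

In the notebook, this could be applied with `mdf.location_1.apply(extract_lat_lon)`, which returns a tuple per row. Rows with missing locations would then still need to be filtered out before clustering.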

Let's have a look at the columns we just created. 

In [None]:
mdf[['lat','lon']].describe()

It looks like there are some errors in the input data. Some rows have a latitude and/or longitude of zero, and others have improbable values.

Let's crudely limit the dataset to those in Los Angeles. I eyeballed the lat/lon limits from a map.

Then we can convert to a GeoDataFrame.

In [None]:
mdf = mdf[(mdf.lat>=33.6) & (mdf.lat<=34.8)]
mdf = mdf[(mdf.lon>=-119.0) & (mdf.lon<=-117.6)]

import geopandas as gpd

mgdf = gpd.GeoDataFrame(mdf, 
            geometry = gpd.points_from_xy(mdf.lon, mdf.lat), 
            crs='EPSG:4326')
mgdf.plot()

Now let's cluster. Again, we'd want to experiment with `K`, but let's identify 20 clusters to start.

Technically, we should reproject before we cluster because a degree of latitude is not the same distance as a degree of longitude. But we can overlook that for simplicity's sake.
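
To get a feel for how much this matters in LA, here is a rough back-of-the-envelope sketch. It assumes a spherical Earth, so the 111.32 km figure and the latitude of 34.05 are approximations chosen for illustration, and `km_per_degree_lon` is a hypothetical helper.

```python
import math

# Approximate length of one degree of latitude (spherical-Earth assumption)
KM_PER_DEGREE_LAT = 111.32

def km_per_degree_lon(lat_deg):
    """Approximate east-west length of one degree of longitude at a given latitude."""
    return KM_PER_DEGREE_LAT * math.cos(math.radians(lat_deg))

la_lat = 34.05  # rough latitude of central Los Angeles (assumed)
ratio = km_per_degree_lon(la_lat) / KM_PER_DEGREE_LAT

print(f"One degree of longitude at {la_lat} N is about {km_per_degree_lon(la_lat):.1f} km")
print(f"That is about {ratio:.2f} of the length of one degree of latitude")
```

So clustering on raw degrees stretches east-west distances relative to north-south ones. Reprojecting to a projected coordinate system with `to_crs()` and clustering on the resulting `x` and `y` values would avoid this.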

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=20, random_state=1).fit(mgdf[['lat','lon']])
mgdf['cluster_id'] = kmeans.labels_
mgdf.head()

And map them! Again, this is almost identical code to what we used before. But let's ditch the legend because there are lots of clusters and the number is not as relevant.

In [None]:
import matplotlib.pyplot as plt
import contextily as ctx

fig, ax = plt.subplots(figsize=(5,5))
mgdf.to_crs('EPSG:3857').plot('cluster_id', categorical=True, legend=False, 
                                   ax=ax, alpha=0.4)

ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
ax.set_title('Dispensary clusters', fontsize=10)                           
#ax.set_ylim([3.98e6, 4.14e6])
ax.set_xticks([])
ax.set_yticks([])

We could also map the centroid of each cluster. But I'll leave that for you as an exercise.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Map the centroid of each cluster. <em>Hint</em>: The raw lat/lon values are found in <strong>kmeans.cluster_centers_</strong>.
</div>
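
One thing to watch for in this exercise: because the model was fit on `mgdf[['lat','lon']]`, each row of `kmeans.cluster_centers_` is in `[lat, lon]` order, while point geometries expect `x` (lon) first. A minimal sketch of the reordering step, using made-up centroid values as a stand-in for `kmeans.cluster_centers_.tolist()` and a hypothetical helper `centers_to_xy`:

```python
# Made-up stand-in for kmeans.cluster_centers_.tolist(); each row is [lat, lon]
centers = [[34.05, -118.25], [34.20, -118.45]]

def centers_to_xy(centers):
    """Split [lat, lon] rows into separate x (lon) and y (lat) lists."""
    xs = [lon for lat, lon in centers]
    ys = [lat for lat, lon in centers]
    return xs, ys

xs, ys = centers_to_xy(centers)
print(xs)
print(ys)
```

From here, one possible approach is to pass `xs` and `ys` to `gpd.points_from_xy()` with `crs='EPSG:4326'`, just as we did for the dispensaries, then reproject and plot on the same `ax`.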

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
    <li>Spatial clusters work in a very similar way to non-spatial clusters.</li>
    <li>They can identify different groupings and/or hotspots.</li>

</ul>
</div>