## Module 7 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class. We'll do an extension of the random forests classifier, looking at a continuous variable. Then, we'll do some cluster analysis.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

We'll look at some k-means cluster analysis. The question: are there particular patterns of cruising for parking? You can see [my version of the analysis here](https://findingspress.org/article/28061-the-shape-of-cruising), joint with Robert Hampshire and Rachel Weinberger.

The data file needed to replicate the analysis is in your data folder. There is one row for each cruising trip (derived from the final portion of a GPS trace, once a driver is assumed to have started looking for parking).

You can load the data as follows.

In [3]:
import pandas as pd
cruisingDf = pd.read_csv('../classes/data/cruising_shapes.csv')

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Replicate the analysis in Millard-Ball, Hampshire and Weinberger (2021). First, add each cluster label to the dataframe.</div>

In the paper, we used 5 clusters and the following columns: `'matchdist', 'frc_repeat', 'n_crossings', 'convexhull_ratio', 'frc_right', 'frc_left', 'frc_uturn', 'frc_straight', 'n_turns'`

* `matchdist` is path length (technically, the map-matched distance)
* `frc_repeat` is the fraction of repeated blocks (a driver drives on them more than once while cruising for parking)
* `n_crossings` is the number of times that the driver crosses over their path
* `convexhull_ratio` is a measure of the compactness of the search area
* `frc_right`, `frc_left`, `frc_uturn` and `frc_straight` are the fraction of times that the driver turns right or left, makes a U-turn, or continues straight at an intersection
* `n_turns` is the number of turns in the cruising trace

You'll need to:
* standardize the variables
* drop Null values
* run the k-means algorithm
* add the cluster labels back to your dataframe

How many observations do you get in each cluster?
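
To make the first step concrete, here is a minimal sketch of what standardizing does, using a small made-up dataframe (the values in `toy` are invented for illustration and are not from the cruising data). It assumes `StandardScaler` with its default settings, which rescales each column to mean 0 and standard deviation 1 using the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# made-up values, only to illustrate the calculation
toy = pd.DataFrame({'matchdist': [200.0, 800.0, 1500.0, 3000.0],
                    'frc_repeat': [0.0, 0.1, 0.25, 0.5]})

# by hand: subtract the column mean, divide by the column standard deviation
# (ddof=0 matches what StandardScaler uses; the pandas default is ddof=1)
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

# the same thing with scikit-learn
with_sklearn = pd.DataFrame(StandardScaler().fit_transform(toy),
                            columns=toy.columns, index=toy.index)

# check that the two approaches agree
print(np.allclose(by_hand, with_sklearn))
```

Without this step, a variable with a large numeric range (path length, for example) would dominate the distance calculations in k-means.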

In [2]:
from sklearn.cluster import KMeans
from sklearn import preprocessing

# your code here
cols = ['matchdist', 'frc_repeat', 'n_crossings', 'convexhull_ratio', 'frc_right', 'frc_left', 'frc_uturn', 'frc_straight', 'n_turns']
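
# standardize each variable to mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler().fit(cruisingDf[cols])
df_scaled = pd.DataFrame(scaler.transform(cruisingDf[cols]), 
                         columns=cols, index=cruisingDf.index)

# drop rows with missing values, because KMeans cannot handle NaNs
df_scaled = df_scaled.dropna()

kmeans = KMeans(n_clusters=5, random_state=1).fit(df_scaled)

# add the labels to both dataframes
# (rows that were dropped above get NaN in cruisingDf, because pandas aligns on the index)
df_scaled['cluster_id'] = kmeans.labels_
cruisingDf['cluster_id'] = df_scaled['cluster_id']

# number of observations in each cluster
df_scaled.groupby('cluster_id').size()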

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a radar plot of your clusters.</div>

Here's the function to create the radar chart, which we used in the video lecture.

It takes two arguments: the `kmeans` object created by `KMeans`, and your standardized dataframe.

In [None]:
# code from https://matplotlib.org/stable/gallery/specialty_plots/radar_chart.html

import numpy as np

import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon
from matplotlib.path import Path
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.spines import Spine
from matplotlib.transforms import Affine2D


def radar_factory(num_vars, frame='circle'):
    """
    Create a radar chart with `num_vars` axes.

    This function creates a RadarAxes projection and registers it.

    Parameters
    ----------
    num_vars : int
        Number of variables for radar chart.
    frame : {'circle', 'polygon'}
        Shape of frame surrounding axes.

    """
    # calculate evenly-spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)

    class RadarAxes(PolarAxes):

        name = 'radar'
        # use 1 line segment to connect specified points
        RESOLUTION = 1

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # rotate plot such that the first axis is at the top
            self.set_theta_zero_location('N')

        def fill(self, *args, closed=True, **kwargs):
            """Override fill so that line is closed by default"""
            return super().fill(closed=closed, *args, **kwargs)

        def plot(self, *args, **kwargs):
            """Override plot so that line is closed by default"""
            lines = super().plot(*args, **kwargs)
            for line in lines:
                self._close_line(line)

        def _close_line(self, line):
            x, y = line.get_data()
            # FIXME: markers at x[0], y[0] get doubled-up
            if x[0] != x[-1]:
                x = np.append(x, x[0])
                y = np.append(y, y[0])
                line.set_data(x, y)

        def set_varlabels(self, labels):
            self.set_thetagrids(np.degrees(theta), labels)

        def _gen_axes_patch(self):
            # The Axes patch must be centered at (0.5, 0.5) and of radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("Unknown value for 'frame': %s" % frame)

        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0) but we want a polygon of radius 0.5 centered at (0.5,
                # 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)
                return {'polar': spine}
            else:
                raise ValueError("Unknown value for 'frame': %s" % frame)

    register_projection(RadarAxes)
    return theta

def radar_plot(kmeans, df_scaled):
    N  = kmeans.cluster_centers_.shape[1]  # number of columns / variables
    k = kmeans.n_clusters
    theta = radar_factory(N, frame='polygon')
    data = kmeans.cluster_centers_.T
    spoke_labels = [col for col in df_scaled.columns if col!='cluster_id']
    fig, ax = plt.subplots(figsize=(9, 9),
                                subplot_kw=dict(projection='radar'))
    fig.subplots_adjust(wspace=0.25, hspace=0.20, top=0.85, bottom=0.05)

    ax.plot(theta, data) #, color=color)
    ax.set_varlabels(spoke_labels)

    # add a legend, positioned relative to the axes
    labels = ['Cluster {}'.format(kk) for kk in range(k)]
    ax.legend(labels, loc=(0.9, .95),
                                labelspacing=0.1, fontsize='small')
    
# your code here
radar_plot(kmeans, df_scaled)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Experiment with different values of k (number of clusters), and using different columns.</div>

In [None]:
# your code here
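
One way to approach this, sketched here on synthetic data so that it runs on its own: loop over several values of k and compare the cluster sizes and the within-cluster sum of squares (`inertia_`). The points from `make_blobs` and their three centers are made up. In the exercise, you would fit on your standardized cruising dataframe instead (without the `cluster_id` column).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the standardized data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=[(-5, 0), (0, 5), (5, 0)],
                  cluster_std=0.5, random_state=1)

inertias = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = km.inertia_
    # number of observations in each cluster, ordered by cluster id
    sizes = pd.Series(km.labels_).value_counts().sort_index().tolist()
    print(k, round(km.inertia_, 1), sizes)
```

Inertia tends to fall as k rises, so look for the point where the improvement levels off rather than for the minimum. Whether the clusters are interpretable (for example, on the radar plot) matters at least as much.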

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> The dataframe comes with x and y coordinates. Use them to identify any spatial clusters of cruising trips in San Francisco.</div>

In [None]:
# your code here
cols = ['x','y']

# let's not standardize, because we are working with latitude and longitude
# so the variables are already on (roughly) the same scale
#scaler = preprocessing.StandardScaler().fit(cruisingDf[cols])
#df_scaled = pd.DataFrame(scaler.transform(cruisingDf[cols]), 
#                         columns=cols, index=cruisingDf.index)

# restrict to San Francisco (otherwise we'll end up with a cluster in Ann Arbor)
sfDf = cruisingDf.loc[cruisingDf.sf==True, cols].dropna()

# you can rerun this several times (through the map below) with different values of k
kmeans = KMeans(n_clusters=20, random_state=1).fit(sfDf)

sfDf['cluster_id'] = kmeans.labels_
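
The comment above says that longitude and latitude are only roughly on the same scale. As a rough check, this sketch compares the ground distance covered by one degree of each. It assumes a spherical Earth, and the constants (111.32 km per degree of latitude, and 37.77 degrees as the latitude of San Francisco) are approximate values that do not come from the data.

```python
import math

KM_PER_DEG_LAT = 111.32   # approximate, assuming a spherical Earth
SF_LAT = 37.77            # approximate latitude of San Francisco

# a degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(SF_LAT))
ratio = km_per_deg_lon / KM_PER_DEG_LAT

print(round(km_per_deg_lon, 1), round(ratio, 2))
```

A degree of longitude covers about four-fifths of the ground distance of a degree of latitude here, so clustering on raw degrees weights east-west separation somewhat more heavily than north-south separation. That is usually tolerable for a quick look within one city. For more precise work, one option is to project the points to a coordinate system in meters before clustering.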

In [None]:
# these are the long/lat of our cluster centers
kmeans.cluster_centers_

In [None]:
# create a geoDataFrame with just the geometry
import geopandas as gpd
centers = gpd.GeoDataFrame(geometry=
            gpd.points_from_xy(kmeans.cluster_centers_[:,0], 
                               kmeans.cluster_centers_[:,1]),
                          crs ='EPSG:4326')

# we could have done this in two steps, which might be easier to read
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['lon','lat'])
centers = gpd.GeoDataFrame(geometry=
            gpd.points_from_xy(centers['lon'], 
                               centers['lat']),
                          crs ='EPSG:4326')

In [None]:
# check it looks ok
centers.head()

In [None]:
# create a geodataframe of the cruising points (so we can plot them as well)
cruisingGdf = gpd.GeoDataFrame(sfDf, 
                geometry=gpd.points_from_xy(sfDf.x, sfDf.y,
                    crs='EPSG:4326'))

In [None]:
import matplotlib.pyplot as plt
import contextily as ctx
fig, ax= plt.subplots(figsize=(10,10))

# plot the cluster centers with a large marker
centers.plot(ax=ax, markersize=100)

# plot the individual points with a small marker, and a different color for each cluster_id
cruisingGdf.plot(ax=ax, markersize=0.1, column='cluster_id')
ctx.add_basemap(ax=ax, crs='EPSG:4326', alpha=0.2)

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>More practice with standardizing data.</li>
  <li>How to run a k-means cluster analysis and interpret the clusters.</li>
</ul>
</div>