## Homework 4: Cluster analysis

In this assignment, we'll do cluster analysis of transit agencies. Our goal: identify whether there are groups of "more similar" agencies. This type of analysis might help us identify a peer group for a particular agency, against which it can be benchmarked.

We'll use the [2023 National Transit Database](https://www.transit.dot.gov/ntd), which compiles the data that each transit agency must report to the Federal Transit Administration.

The relevant spreadsheets are in your repository. 

Please help me grade by observing the following:
 
* Do not rename this notebook (that messes up the autograder)
* Do not include large sections of output (that makes it hard to find your code). For example, use `df.head()` to show the first few rows, rather than printing an entire dataframe. The same goes for printing long strings.
* Follow the same guidelines for ChatGPT / LLM usage as in previous assignments

Load in `2023 Agency Information_0.xlsx` to a `pandas` DataFrame called `agency_info`.

You can use the `pd.read_excel()` function, which works in the same way as `pd.read_csv()`.

In [None]:
agency_info = 999 # replace with your code

### BEGIN SOLUTION
import pandas as pd
fn = '2023 Agency Information_0.xlsx'
agency_info = pd.read_excel(fn)
### END SOLUTION

In [None]:
# Autograder tests - do not edit

print(len(agency_info))
print(agency_info.columns)
print(len(agency_info.columns))

assert len(agency_info)==2899
assert 'NTD ID' in agency_info.columns
assert len(agency_info.columns)==43

Two NTD IDs are duplicated, and it's not clear why. You will need to drop the duplicates to avoid double counting in subsequent steps. I suggest you do it like this:

In [None]:
agency_info = agency_info.drop_duplicates(subset='NTD ID') # will keep only the first, where there are duplicate ids
assert agency_info['NTD ID'].is_unique # check it worked
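
To see what `drop_duplicates` does before applying it to the real data, here is a minimal sketch on a toy table. The values and the `toy` and `deduped` names are made up for illustration; only the `NTD ID` column name comes from the assignment.

```python
import pandas as pd

# Toy table with one repeated ID (hypothetical values, not NTD data)
toy = pd.DataFrame({
    "NTD ID": [1, 2, 2, 3],
    "Agency Name": ["A", "B", "B (duplicate)", "C"],
})

# Count the rows whose ID already appeared earlier in the table
n_dupes = toy["NTD ID"].duplicated().sum()

# keep='first' is the default, so the first row for each ID survives
deduped = toy.drop_duplicates(subset="NTD ID")

print(n_dupes)
print(deduped)
```

Checking `duplicated().sum()` first tells you how many rows you are about to lose, which is a useful sanity check before dropping anything.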

Load in the `service_bymode_2023.csv` file to a dataframe called `service` in the same way.

Most of the columns should be self explanatory, but you may need to refer to the [data dictionary](https://www.transit.dot.gov/ntd/data-product/2023-ntd-database-file-dictionary) or [glossary](https://www.transit.dot.gov/ntd/national-transit-database-ntd-glossary).


In [None]:
service = 999 # replace with your code

### BEGIN SOLUTION
fn = 'service_bymode_2023.csv'
service = pd.read_csv(fn)
### END SOLUTION

In [None]:
# Autograder tests - do not edit

print(len(service))
print(len(service.columns))

assert len(service)==3681
assert '_5_digit_ntd_id' in service.columns
assert len(service.columns)==46

You have probably noticed that there are many more rows in the `service` dataframe. If you look at the first few rows, you can see that each agency has different rows for:
* different modes (e.g. `MB` is motorbus)
* different types of service (directly operated is `DO` and contracted / purchased is `PT`)

If we do a join with `agency_info`, we'll end up with a 1:many join. That's not as useful if we want to cluster transit agencies.
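
The row multiplication in a 1:many join can be sketched with two toy tables. Everything here (the `agencies` and `svc` frames, the `City` and `trips` columns, the values) is hypothetical; only the ID column names mirror the assignment.

```python
import pandas as pd

# Hypothetical agency table: one row per agency
agencies = pd.DataFrame({"NTD ID": [1, 2], "City": ["X", "Y"]})

# Hypothetical service table: several rows per agency
svc = pd.DataFrame({
    "_5_digit_ntd_id": [1, 1, 1, 2],
    "mode": ["MB", "MB", "CB", "MB"],
    "trips": [100, 50, 20, 10],
})

# Each agency row is repeated once per matching service row
joined = agencies.merge(svc, left_on="NTD ID", right_on="_5_digit_ntd_id")

print(len(agencies), len(joined))
```

Agency 1 appears three times in `joined`, so any per-agency statistic computed on it would count that agency three times.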

So let's aggregate the `service` data first. Create a new dataframe, `service_agg`, that:
1. Keeps only the rows for Buses/Trolleybuses/Commuter Buses/Bus Rapid Transit (`mode` is `MB`, `TB`, `CB` or `RB`, so we are comparing like with like). Hint: the `isin()` method, the pandas counterpart of Python's `in` operator, is useful here.
2. Groups by the agency (NTD ID, called `_5_digit_ntd_id`) and sums these columns:

* Unlinked Passenger Trips (`sum_unlinked_passenger_trips_upt`)
* Passenger Miles (`sum_passenger_miles`)
* Revenue Miles (`sum_actual_vehicles_passenger_car_revenue_miles`), i.e., the distance traveled while in revenue service (vehicles/cars here refers to buses/train cars, not automobiles)
* Deadhead Miles (`sum_actual_vehicles_passenger_deadhead_miles`)
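
The filter-then-aggregate pattern described above can be sketched on a toy table. The `trips` and `miles` columns and all values are invented for illustration; the mode codes and the `_5_digit_ntd_id` column name follow the assignment.

```python
import pandas as pd

# Hypothetical service rows: one per agency and mode
svc = pd.DataFrame({
    "_5_digit_ntd_id": [1, 1, 1, 2, 2],
    "mode": ["MB", "CB", "LR", "MB", "DR"],
    "trips": [100, 20, 500, 10, 5],
    "miles": [1000, 400, 900, 80, 60],
})

# Boolean mask: True where the mode is one of the bus modes
bus_modes = ["MB", "TB", "CB", "RB"]
is_bus = svc["mode"].isin(bus_modes)

# Keep the bus rows, then sum the numeric columns within each agency
agg = svc[is_bus].groupby("_5_digit_ntd_id")[["trips", "miles"]].sum()

print(agg)
```

The grouping column becomes the index of the result, which is convenient for an index-based join later on.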



In [None]:
service_agg = 999 # replace with your code

### BEGIN SOLUTION
bus_modes = ['MB', 'TB', 'CB', 'RB']
sum_cols = ['sum_unlinked_passenger_trips_upt', 'sum_passenger_miles',
            'sum_actual_vehicles_passenger_car_revenue_miles',
            'sum_actual_vehicles_passenger_deadhead_miles']
# filter into a new variable so that `service` itself is not overwritten
bus_service = service[service['mode'].isin(bus_modes)]
service_agg = bus_service.groupby('_5_digit_ntd_id')[sum_cols].sum()
### END SOLUTION

In [None]:
# Autograder tests - do not edit

print(len(service_agg))
print(service_agg.sum_unlinked_passenger_trips_upt.sum())

assert(len(service_agg)==1204)
assert service_agg.index.name=='_5_digit_ntd_id'
assert service_agg.sum_unlinked_passenger_trips_upt.sum() == 3475162210

Now, join your `service_agg` dataframe to your `agency_info` dataframe. Call the new dataframe `transit`. 

You should note that `agency_info` has more rows than `service_agg`, because some small agencies aren't required to report service information. Drop those: you can do either an inner join, or a left join with `service_agg` as the left table.
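
The difference between the join types can be sketched with two toy tables indexed by agency ID. The `info` and `agg` frames and their values are hypothetical stand-ins for `agency_info` and `service_agg`.

```python
import pandas as pd

# Hypothetical agency table with three agencies
info = pd.DataFrame(
    {"City": ["X", "Y", "Z"]},
    index=pd.Index([1, 2, 3], name="NTD ID"),
)

# Hypothetical aggregated service table: agency 3 reported nothing
agg = pd.DataFrame(
    {"trips": [120, 10]},
    index=pd.Index([1, 2], name="_5_digit_ntd_id"),
)

# DataFrame.join matches on the index and defaults to how='left'
left = agg.join(info)                # keeps only the agencies in agg
inner = info.join(agg, how="inner")  # keeps agencies present in both
wrong = info.join(agg)               # keeps agency 3, with NaN trips

print(len(left), len(inner), len(wrong))
```

Joining in the other direction without `how="inner"` keeps the non-reporting agencies and fills their service columns with `NaN`, which is what the assignment asks you to avoid.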

In [None]:
transit = 999  # replace with your code
### BEGIN SOLUTION
transit = service_agg.join(agency_info.set_index('NTD ID'))
### END SOLUTION

In [None]:
# Autograder tests - do not edit

print(len(transit))
print(transit.sum_unlinked_passenger_trips_upt.sum())
print(transit.Population.sum())

assert len(transit)==1204
assert transit.sum_unlinked_passenger_trips_upt.sum() == 3475162210
assert transit.Population.sum() == 2109627865
#why isn't this the same sum as above?

The final data preparation step is to standardize the variables. Some of them are strings (use `transit.info()` to take a look). But let's standardize the numeric ones that we might want to use to cluster.

Create a data frame, `df_to_cluster`, with the following standardized variables: 
* sum_unlinked_passenger_trips_upt
* sum_passenger_miles
* sum_actual_vehicles_passenger_car_revenue_miles 
* sum_actual_vehicles_passenger_deadhead_miles 
* Population
* Density
* Total VOMS (vehicles operated in maximum service)

(See Lecture 14 on neural networks for how to standardize.)

It should still be indexed by NTD ID (`_5_digit_ntd_id`).
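
As a reminder of what standardization does, here is a minimal sketch that computes z-scores by hand in pandas on invented values. It is not the lecture's code. It uses `ddof=0` (the population standard deviation), which is also what `StandardScaler` uses, so the two approaches should agree.

```python
import pandas as pd

# Hypothetical values, indexed by agency ID as in the assignment
toy = pd.DataFrame(
    {"Population": [100.0, 200.0, 300.0], "Density": [1.0, 2.0, 6.0]},
    index=pd.Index([10, 20, 30], name="_5_digit_ntd_id"),
)

# z-score: subtract each column's mean, divide by its standard deviation
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z.round(3))
```

Because the arithmetic is done on the dataframe itself, the index and column names carry over. `StandardScaler.transform` returns a plain NumPy array, so with that approach the index and columns have to be reattached.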

In [None]:
from sklearn import preprocessing

# your code here
df_to_cluster = 999 

### BEGIN SOLUTION
cols = ['sum_unlinked_passenger_trips_upt','sum_passenger_miles', 
        'sum_actual_vehicles_passenger_car_revenue_miles',
        'sum_actual_vehicles_passenger_deadhead_miles', 
        'Population','Density', 'Total VOMS']
scaler = preprocessing.StandardScaler().fit(transit[cols])

# convert to DataFrame and specify the column names and index
df_to_cluster = pd.DataFrame(scaler.transform(transit[cols]), 
                         columns=cols, index=transit.index)

### END SOLUTION

In [None]:
# Autograder tests - do not edit
print(len(df_to_cluster))
print(df_to_cluster.Population.mean())
print(df_to_cluster.sum_unlinked_passenger_trips_upt.mean())
print(len(df_to_cluster.columns))

assert len(df_to_cluster)==1204
assert df_to_cluster.Population.mean().round(5)==0
assert df_to_cluster.sum_unlinked_passenger_trips_upt.mean().round(5)==0
assert len(df_to_cluster.columns) == 7

Let's start with 5 clusters. Use the `KMeans` algorithm to assign each observation to a cluster.

Add the cluster number (id) to a new column in your original `df_to_cluster` dataframe. Call the new column `cluster_id`.

*Hint:* Drop the null values before trying to cluster, because `KMeans` cannot handle missing values. See Lecture 15 (clustering) for an example.
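
The overall pattern (drop missing values, fit, attach labels) can be sketched on toy data with two obvious groups. The data, the choice of two clusters, and the `n_init` and `random_state` settings are assumptions made for this illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical points: two tight groups plus one row with a missing value
toy = pd.DataFrame(
    {"x": [0.0, 0.1, 0.2, 5.0, 5.1, np.nan],
     "y": [0.0, 0.2, 0.1, 5.0, 5.2, 1.0]},
    index=list("abcdef"),
)

# Remove rows with missing values; KMeans raises an error on NaN input
toy = toy.dropna()

# random_state fixes the random initialization so reruns give the same labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# labels_ is in the same row order as the data that was fitted
toy["cluster_id"] = km.labels_

print(toy)
```

The cluster numbers themselves are arbitrary: what matters is which rows share a label, not whether a group is called 0 or 1.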

In [None]:
from sklearn.cluster import KMeans

# your code here

### BEGIN SOLUTION
df_to_cluster.dropna(inplace=True)
kmeans = KMeans(n_clusters=5).fit(df_to_cluster)
df_to_cluster['cluster_id'] = kmeans.labels_
df_to_cluster.groupby('cluster_id').size()
### END SOLUTION

In [None]:
# Autograder tests - do not edit
print(df_to_cluster.groupby('cluster_id').size())

assert len(df_to_cluster.groupby('cluster_id').size())==5
assert df_to_cluster.groupby('cluster_id').size().min()==1
cmax = df_to_cluster.groupby('cluster_id').size().max()
assert cmax>700 and cmax<710

You should have one cluster that contains a single transit agency, and another that contains just two agencies.

What's the name of the agency that's in the cluster of one? (*Hint*: you can get its id from `df_to_cluster`, and its name from `transit`, your original dataframe.)
Comment on whether this seems reasonable.
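
One way to go from a cluster size to an agency name is sketched below on toy data. The `clusters` and `names` frames stand in for `df_to_cluster` and `transit`, and the `Agency Name` column is an assumption, so check `transit.columns` for the actual field.

```python
import pandas as pd

# Hypothetical cluster assignments, indexed by agency ID
clusters = pd.DataFrame(
    {"cluster_id": [0, 0, 1, 0, 2, 2]},
    index=pd.Index([10, 20, 30, 40, 50, 60], name="_5_digit_ntd_id"),
)

# Hypothetical lookup table sharing the same index
names = pd.DataFrame(
    {"Agency Name": ["A", "B", "C", "D", "E", "F"]},
    index=clusters.index,
)

# Find the cluster(s) with exactly one member
sizes = clusters["cluster_id"].value_counts()
singleton_clusters = sizes[sizes == 1].index

# Get the IDs in those clusters, then look up their names by index
in_singleton = clusters["cluster_id"].isin(singleton_clusters)
singleton_ids = clusters.index[in_singleton]
singleton_names = names.loc[singleton_ids, "Agency Name"]

print(singleton_names)
```

This works because both tables share the same index, so `.loc` can look rows up by agency ID directly.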

Your answer here.

Create a radar plot that shows how your clusters relate to your 7 variables. Here's the function to create a radar plot that we used in the lecture.

*Hint*: You'll need to drop the `cluster_id` column first.

In [None]:
# code from https://matplotlib.org/stable/gallery/specialty_plots/radar_chart.html

import numpy as np

import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon
from matplotlib.path import Path
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.spines import Spine
from matplotlib.transforms import Affine2D


def radar_factory(num_vars, frame='circle'):
    """
    Create a radar chart with `num_vars` axes.

    This function creates a RadarAxes projection and registers it.

    Parameters
    ----------
    num_vars : int
        Number of variables for radar chart.
    frame : {'circle', 'polygon'}
        Shape of frame surrounding axes.

    """
    # calculate evenly-spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)

    class RadarAxes(PolarAxes):

        name = 'radar'
        # use 1 line segment to connect specified points
        RESOLUTION = 1

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # rotate plot such that the first axis is at the top
            self.set_theta_zero_location('N')

        def fill(self, *args, closed=True, **kwargs):
            """Override fill so that line is closed by default"""
            return super().fill(closed=closed, *args, **kwargs)

        def plot(self, *args, **kwargs):
            """Override plot so that line is closed by default"""
            lines = super().plot(*args, **kwargs)
            for line in lines:
                self._close_line(line)

        def _close_line(self, line):
            x, y = line.get_data()
            # FIXME: markers at x[0], y[0] get doubled-up
            if x[0] != x[-1]:
                x = np.append(x, x[0])
                y = np.append(y, y[0])
                line.set_data(x, y)

        def set_varlabels(self, labels):
            self.set_thetagrids(np.degrees(theta), labels)

        def _gen_axes_patch(self):
            # The Axes patch must be centered at (0.5, 0.5) and of radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("Unknown value for 'frame': %s" % frame)

        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0) but we want a polygon of radius 0.5 centered at (0.5,
                # 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)
                return {'polar': spine}
            else:
                raise ValueError("Unknown value for 'frame': %s" % frame)

    register_projection(RadarAxes)
    return theta

def radar_plot(kmeans, df_scaled):
    N  = kmeans.cluster_centers_.shape[1]  # number of columns / variables
    k = kmeans.n_clusters
    theta = radar_factory(N, frame='polygon')
    data = kmeans.cluster_centers_.T
    spoke_labels = [col for col in df_scaled.columns if col!='cluster_id']
    fig, ax = plt.subplots(figsize=(9, 9),
                                subplot_kw=dict(projection='radar'))
    fig.subplots_adjust(wspace=0.25, hspace=0.20, top=0.85, bottom=0.05)

    ax.plot(theta, data) #, color=color)
    ax.set_varlabels(spoke_labels)

    # place the legend near the top-right corner of the plot
    labels = ['Cluster {}'.format(kk) for kk in range(k)]
    ax.legend(labels, loc=(0.9, .95),
                                labelspacing=0.1, fontsize='small')

In [None]:
# your code here

### BEGIN SOLUTION
radar_plot(kmeans, df_to_cluster.drop(columns='cluster_id'))
### END SOLUTION


Comment in a few bullet points or sentences. How would you interpret and name each cluster?

Your answer here.

# Challenge Problem

Remember, you need to do at least two of these challenge problems this quarter.

This challenge problem is open ended for you to take in a direction that you are most interested in. Here are some suggestions (do 1 or 2 of these):
* Create some dummy variables and use them to cluster. For example, the reporter type and reporting module might be useful.
* Analyze how your clusters vary by state (a field in your `transit` dataframe). For example, you might do a stacked bar chart of the number of transit agencies in each cluster by state. (Google "pandas stacked bar".)
* Explore different numbers of clusters
* Map your clusters. Your dataframe doesn't have geographic coordinates, but you could join the `Zip Code` field to [this handy dataset](https://hudgis-hud.opendata.arcgis.com/datasets/d032efff520b4bf0aa620a54a477c70e_0/about) that gives the centroids of each zip code.
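
For the by-state suggestion, one possible starting point is a cross-tabulation of state against cluster. The sketch below uses invented rows, and the `State` column name is an assumption about the `transit` dataframe.

```python
import pandas as pd

# Hypothetical agencies with a state and an assigned cluster
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "WA", "WA"],
    "cluster_id": [0, 0, 1, 0, 2],
})

# Rows are states, columns are clusters, cells are agency counts
counts = pd.crosstab(toy["State"], toy["cluster_id"])

print(counts)

# With matplotlib installed, this draws one stacked bar per state:
# counts.plot(kind="bar", stacked=True)
```

Combinations that never occur show up as zeros rather than missing values, so the table can be plotted without further cleaning.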

Write some brief interpretation in a markdown cell.

In [None]:
# your code here