# Transit data and GTFS

## Lecture objectives

1. Learn how to access and use transit data in GTFS format
2. More practice in joins and plotting

The General Transit Feed Specification (GTFS) is a common format for sharing transit data on schedules, fares, and so on. Here we'll use the `partridge` library to parse GTFS data files.
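
To give a rough sense of the raw format: a GTFS feed is a set of comma-separated text files such as `stops.txt`, `trips.txt`, and `stop_times.txt`. The sketch below parses a made-up three-row excerpt of a `stop_times.txt` file with pandas. The column names follow the GTFS specification, but the IDs and times are invented for illustration and do not come from any real feed.

```python
import io
import pandas as pd

# Made-up excerpt of a stop_times.txt file
# (real GTFS column names, invented values)
raw = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:05:00,08:05:00,S2,2
T2,08:30:00,08:30:00,S1,1
"""

# read the ID columns as strings so that any leading zeros would be preserved
stop_times = pd.read_csv(io.StringIO(raw), dtype={'trip_id': str, 'stop_id': str})
print(stop_times)
```

Each row is one scheduled visit of a trip to a stop, which is why this table is the natural starting point for counting service at each stop.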

I included the data for [Santa Monica Big Blue Bus](http://transitfeeds.com/p/city-of-santa-monica/260) in the git repository. For other agencies, see this [useful compilation](https://github.com/andredarcie/awesome-gtfs).

[The partridge documentation](https://github.com/remix/partridge) gives some useful examples and code snippets. 

First, we find the busiest date in the feed.

In [None]:
import partridge as ptg
path = 'data/gtfs' # this is the subfolder within your GitHub repository

date, service_ids = ptg.read_busiest_date(path)

And then on this date, load the feed. Again, this code snippet is from the `partridge` docs.

In [None]:
view = {'trips.txt': {'service_id': service_ids}}

feed = ptg.load_geo_feed(path, view)

Now we have an object called `feed`. Let's explore it.


In [None]:
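# type "feed." and press Tab to see the available tables;
# dir() lists the same attribute names programmatically
[name for name in dir(feed) if not name.startswith('_')]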

In [None]:
feed.routes.head()

In [None]:
feed.stop_times.head()

In [None]:
feed.stops.head()

The `stops` table has a `geometry` column. Is it a `GeoDataFrame`?

In [None]:
type(feed.stops)

Yes! So we can map it.

In [None]:
feed.stops.plot()

Let's compute and map a simple measure of transit accessibility (number of trips per day) at the stop level.

<div class="alert alert-block alert-info">
<strong>Thought exercise:</strong> How might you go about this?
</div>

We want a count of the number of trips at each stop. We saw above that `stop_times` has a `stop_id` column. So let's do the following:
* Aggregate `stop_times` by `stop_id` to generate counts
* Join this to the stops data (which has the geometry)
* Map the results

In [None]:
# we could count any (non-Null) column. I chose trip_id
# we could also use size() rather than count()
freqs = feed.stop_times.groupby('stop_id').trip_id.count()
freqs
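
The difference between `count()` and `size()` mentioned in the comments only matters when the counted column has missing values. Here is a minimal sketch on a hypothetical toy table (the `toy` frame and its values are made up, not taken from the feed):

```python
import pandas as pd

# Hypothetical toy table: one row for stop S1 has a missing trip_id
toy = pd.DataFrame({
    'stop_id': ['S1', 'S1', 'S2'],
    'trip_id': ['T1', None, 'T2'],
})

grouped = toy.groupby('stop_id')
counts = grouped.trip_id.count()  # non-null trip_id values per stop
sizes = grouped.size()            # all rows per stop, nulls included

print(counts)
print(sizes)
```

For S1, `count()` gives 1 and `size()` gives 2. If `trip_id` is never null in `stop_times`, the two approaches give the same result.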

In [None]:
# join on the index
stops = feed.stops.set_index('stop_id').join(freqs)
stops.head()

In [None]:
# our joined column was called trip_id, so n_trips probably makes more sense
stops.rename(columns={'trip_id': 'n_trips'}, inplace=True)
stops.head()
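
One thing to watch for with this kind of join: `join()` is a left join by default, so any stop that does not appear in the aggregated counts ends up with `NaN` rather than 0. Whether that happens here depends on the feed. The sketch below uses hypothetical toy tables (all names prefixed `toy_` are made up) to show the effect and one possible way to handle it.

```python
import pandas as pd

# Hypothetical toy tables standing in for feed.stops and the freqs series
toy_stops = pd.DataFrame({
    'stop_id': ['S1', 'S2', 'S3'],
    'stop_name': ['Main St', 'Ocean Ave', 'Pier'],
})
toy_freqs = pd.Series(
    [3, 1],
    index=pd.Index(['S1', 'S2'], name='stop_id'),
    name='n_trips',
)

# left join on the index: S3 has no count, so its n_trips is NaN
toy_joined = toy_stops.set_index('stop_id').join(toy_freqs)
n_missing = toy_joined.n_trips.isna().sum()

# treat "no trips found" as zero trips, and restore the integer dtype
toy_joined['n_trips'] = toy_joined.n_trips.fillna(0).astype(int)
print(toy_joined)
```

Filling with 0 is a judgment call: it is reasonable if a missing count really means no service on that date, and less so if it signals a mismatch between the two tables.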

In [None]:
# map the results
import matplotlib.pyplot as plt
import contextily as ctx

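fig, ax = plt.subplots(figsize=(10,10))
# reproject to Web Mercator (EPSG:3857) to match the basemap tiles,
# and size each marker by its number of trips
stops.to_crs('EPSG:3857').plot(markersize='n_trips', ax=ax)
ctx.add_basemap(ax, zoom=12, alpha=0.5)
# hide the axis ticks
ax.set_xticks([])
ax.set_yticks([])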

The markers are a little large. Let's create a new column with a scaled version of the marker size, and plot that instead.

In [None]:
stops['n_trips_scaled'] = stops.n_trips / 3

# same code as before, except for plotting n_trips_scaled instead of n_trips
fig, ax = plt.subplots(figsize=(10,10))
stops.to_crs('EPSG:3857').plot(markersize='n_trips_scaled', ax=ax)
ctx.add_basemap(ax, zoom=12, alpha=0.5)
ax.set_xticks([])
ax.set_yticks([])

These accessibility measures are at the stop level, but you can imagine aggregating the combined frequencies to census tracts, and/or calculating the combined frequency within (say) 0.25 miles of a destination.
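
As a rough sketch of the second idea, the code below sums trip counts over all stops within 0.25 miles of a destination. It is an illustration under stated assumptions, not part of the lecture's workflow: the stops, coordinates, trip counts, and destination are all made up, and distance is computed with the haversine (great-circle) formula on latitude and longitude rather than with a geopandas buffer.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    earth_radius_miles = 3958.8
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_miles * np.arcsin(np.sqrt(a))

# Hypothetical stops with made-up coordinates and trip counts
toy_stops = pd.DataFrame({
    'stop_id': ['S1', 'S2', 'S3'],
    'lat': [34.0100, 34.0120, 34.0500],
    'lon': [-118.4900, -118.4910, -118.4500],
    'n_trips': [100, 50, 80],
})

# Hypothetical destination
dest_lat, dest_lon = 34.0110, -118.4905

# distance from every stop to the destination, then keep the nearby stops
toy_stops['dist_miles'] = haversine_miles(toy_stops.lat, toy_stops.lon, dest_lat, dest_lon)
nearby = toy_stops[toy_stops.dist_miles <= 0.25]

# combined frequency: total trips per day across the nearby stops
combined_freq = nearby.n_trips.sum()
print(nearby[['stop_id', 'dist_miles', 'n_trips']])
print(combined_freq)
```

In this toy data, S1 and S2 fall inside the quarter-mile radius and S3 does not. Note that summing stop-level counts can count the same trip more than once when it serves several nearby stops.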

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>GTFS is the standard format for transit data</li>
  <li>GTFS is cumbersome to work with in raw form, but partridge makes it simpler</li>
</ul>
</div>