Skip to content
This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

API usage tutorial

Laurent Grégoire edited this page Feb 12, 2016 · 10 revisions

The DAO facade

All interaction with the library is done through a DAO facade (data access object).

To create a DAO object, you should give an optional database to operate on. If no database is provided, the data will be loaded to memory (so will not be saved to disk of course).

from gtfslib.dao import Dao
# If db.sqlite exists, use it. Otherwise create a new one.
dao = Dao("db.sqlite")

If a simple filename is given, SQLite is assumed. In order to use other databases (PostgreSQL), use something like "postgresql://gtfs@localhost/gtfs" as parameter.

Loading GTFS data

To load a GTFS into the database, normalizing it (conversion of calendars, trips, stop times, frequencies...):

dao.load_gtfs("mygtfs.zip")

In order to load multiple GTFS at the same time, you need to provide a unique ID (here 'sncf'):

dao.load_gtfs("sncf.gtfs.zip", feed_id="sncf")

To delete an entire feed and all attached objects (safe to use when the feed does not exists), or to reload again on top of previous data:

dao.delete_feed("sncf")

Basic data access

Once the data is loaded, you can access objects to work on, for example a single object via it's ID (here route ID 'R1' from the default feed):

route = dao.route('R1')
print(route)

Or a list of all objects (here a list of all stops of all feeds):

for stop in dao.stops():
	print(stop.stop_name)

For each following types: Feed, Agency, Route, Stop, Calendar, Trip, Shape, FareAttribute; there are two accessor functions:

  • to get a unique object given it's ID (example: agency(id, feed_id=''))
  • to get a list of objects, with optional filters (example: routes())

For the following 4 types: ShapePoint, StopTime, CalendarDate and FareRule; only list of objects are implemented (the primary key of those objects are composed by several fields, there is probably no point in fetching them from their primary key).

On top of that, you can also access hops, which are consecutive pairs of stop times from a same trip. See relevant section for more information.

Filtering data

Each object list can be filtered, returning only objects corresponding to some criteria, for example:

# Will return all stops from all feeds containing "gare" in their name.
gares = dao.stops(fltr=Stop.stop_name.ilike("%gare%"))

Some methods also accepts filters on joined tables. Table joints are automatically added when necessary.

# Will return all trips for routes of type BUS (3)
trips = dao.trips(fltr=Route.type == Route.TYPE_BUS)

You can also combine expressions with the & (logical and) and | (logical or) operators:

from gtfslib.utils import gtfstime
stop42 = dao.stop('S42')
# Will return all departures from stop42 between 14:00 and 16:00
departures = dao.stoptimes(fltr=(StopTime.stop == stop42) \
                              & (StopTime.departure_time >= gtfstime(14, 0)) \
                              & (StopTime.departure_time <= gtfstime(16, 0)))

Another syntax is using the or_ and and_ functions (they both accept variable list of parameters or list / generators as arguments):

from sqlalchemy.sql.expression import and_, or_
# Will return all stations with name in the following list
STOP_NAMES = [ 'Ouarzazate', 'Tataouine' ]
stops = dao.stops(fltr=and_(or_(Stop.stop_name == name for name in STOP_NAMES),
                            Stop.location_type == Stop.TYPE_STATION))

You can also use the & and | logical operator with filters from different tables:

# Will return all wheelchair-accessible departures from stop42, any day
stoptimes = dao.stoptimes(fltr=(StopTime.stop == stop42) &
                               (Trip.wheelchair_accessible == Trip.WHEELCHAIR_YES))

Please see SQLAlchemy common operator functions for more information on the available operators and functions.

Filtering dates

By default, all calendar filters will apply on calendar_dates directly, not calendar, as there is not data on calendars to filter on.

For filtering on date, please use the following syntax:

import datetime
# You can use standard python date object
some_date = datetime.date(year, month, day)
calendars = dao.calendars(fltr=func.date(CalendarDate.date) == some_date)
# You can also use a CalendarDate:
other_date = CalendarDate.ymd(year, month, day)
calendars = dao.calendars(fltr=func.date(CalendarDate.date) == other_date.as_date())

Linked objects and pre-fetching

Linked objects are "transparently" accessible via fields (for example: route.trips). If they are not pre-loaded during the initial query, they will be lazily loaded at the time of first-access.

for route in dao.routes(fltr=Route.route_type == Route.TYPE_BUS):
	# The following will issue a SELECT per route:
	print(len(route.trips))

You can say which data to pre-fetch. The same query, here pre-fetching route trips (a total of TWO selects only):

for route in dao.routes(..., prefetch_trips=True):
	# Trips are pre-loaded
	print(len(route.trips))

Batching

For processing a large quantity of data, you can batch them (available only for stops and trips). The following will transparently issue a new SELECT every 1000 trips:

for trip in dao.trips(batch_size=1000):
	... do something with trip ...

Internally, batching is implemented by 2 selects: first loading all the IDs with filtering, then loading a batch of elements at a time (1k by default) from an ID list.

Complex queries

If the standard query function are not enough, you can always directly use the SQLAlchemy ORM layer by getting the associated session with the session() DAO method. You can then perform more complex custom queries using the full power of the SQLAlchemy syntax.

Here an example that count the number of trips for each date on a given date interval:

session = dao.session()
from_date = CalendarDate.ymd(2016, 1, 22)
to_date = CalendarDate.ymd(2016, 2, 22)
for date, trip_count in dao.session() \
           .query(CalendarDate.date, func.count(Trip.trip_id)) \
           .join(Calendar).join(Trip) \
           .filter((func.date(CalendarDate.date) >= from_date.date) & (func.date(CalendarDate.date) <= to_date.date)) \
           .group_by(CalendarDate.date) \
           .order_by(CalendarDate.date) \
           .all():
    print("%s : %d trips" % (date, trip_count))

This is equivalent, but a bit clearer, to the following SQL query:

SELECT calendar_dates.date AS calendar_dates_date, 
       COUNT(trips.trip_id) AS count_1
FROM calendar_dates 
  JOIN calendar ON calendar.feed_id = calendar_dates.feed_id AND calendar.service_id = calendar_dates.service_id
  JOIN trips ON calendar.feed_id = trips.feed_id AND calendar.service_id = trips.service_id
WHERE date(calendar_dates.date) >= '2016-01-22' 
  AND date(calendar_dates.date) <= '2016-02-22'
GROUP BY calendar_dates.date
ORDER BY calendar_dates.date;

Of course one can always use python to do the grouping (group by) and summation (count()). Below the exact same example, with only standard DAO access function. It first select all calendar dates in the given range, then for each it aggregates the trips count of the date calendar.

from _collections import defaultdict
date_count = defaultdict(lambda: 0)
for date in dao.calendar_dates(fltr=(func.date(CalendarDate.date) >= from_date.date) &
                                    (func.date(CalendarDate.date) <= to_date.date), prefetch_trips=True):
    date_count[date.date] += len(date.calendar.trips) 
for date, count in date_count.items():
    print("%s : %d trips" % (date, count))

Please see the SQLAlchemy query tutorial for more information on querying.

Debugging SQL queries

To debug the SQL statements that are performed in the back, set sql_logging=True in the DAO constructor:

dao = Dao(..., sql_logging=True)

It's a bit verbose, but it can be helpful to trace down performance issues (notably when objects are being loaded and by how many queries). See the chapter on eager loading (pre-fetching) for more information.