UTM shuttle times scraper #32

qasim · 2016-04-12T03:27:27Z

https://m.utm.utoronto.ca/shuttleByDate.php?year=2016&month=04&day=10

UTM has a mobile website for their UTM <-> UTSG shuttle. This would fall under transit / transportation. This scraper should scrape the current month's shuttle times (from the first day of the current month through the last). The URL takes the date as query parameters, so this is an easy scrape of ~30 page requests.
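A rough sketch of how the per-day URLs for a month could be generated. The function name `month_urls` is made up for illustration, and zero-padding the `day` parameter is an assumption (the example URL only shows a zero-padded `month`):

```python
import calendar

BASE_URL = "https://m.utm.utoronto.ca/shuttleByDate.php"


def month_urls(year, month):
    """Return one shuttleByDate URL for every day of the given month."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    _, days_in_month = calendar.monthrange(year, month)
    return [
        # Assumption: the site accepts zero-padded day values like it does months
        f"{BASE_URL}?year={year}&month={month:02d}&day={day:02d}"
        for day in range(1, days_in_month + 1)
    ]
```

For April 2016 this yields 30 URLs, one of which is the example URL above.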

As for UTSG, UTSC, and other UTM transportation, transit is solely TTC and GO. They already have their own open data APIs, so we will leave it at that!

qasim · 2016-04-13T02:39:33Z

Proposed schema (per shuttle stop id):

{
  "id": String,
  "name": String,
  "dates": [{
    "start": String,
    "end": String
  }]
}

arkon · 2016-04-13T03:31:02Z

I don't really understand what the start and end fields in dates mean?

qasim · 2016-04-13T04:09:59Z

Woops, start and end don't apply here, do they. ;P

So there is another aspect to this that I hadn't looked at or seen before. Based on other transit APIs I've looked at, they organize data with routes as the top level object and then routes own stops which contain times.

So the following is a schema with a route as the top-level object:

{
  "name": String,
  "stops": [{
    "location": String,
    "building_id": String,
    "times": [String]
  }]
}

An example with fictional data for the St. George route:

{
  "name": "St. George Route",
  "stops": [
    {
      "location": "Instructional Centre Layby",
      "building_id": "334",
      "times": [
        "2016-04-13T05:55:00-04:00",
        "2016-04-13T07:55:00-04:00",
        "2016-04-14T05:55:00-04:00",
        "2016-04-14T07:55:00-04:00"
      ]
    },
    {
      "location": "Hart House",
      "building_id": "002",
      "times": [
        "2016-04-13T08:55:00-04:00",
        "2016-04-13T10:55:00-04:00",
        "2016-04-14T08:55:00-04:00",
        "2016-04-14T10:55:00-04:00"
      ]
    }
  ]
}

qasim · 2016-04-13T04:12:21Z

Note: the dates are formatted in the ISO 8601 standard, offset for the Eastern timezone. It balances human readability with a compact form, and of course remains machine readable. I think this is the standard the whole project should take, but if you have an argument for something better then we can discuss that.
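For reference, a minimal sketch of producing and parsing that format with the Python standard library. The helper name `to_iso` is hypothetical, and the fixed -04:00 offset is an assumption that only matches the April examples above (daylight time); a real scraper would likely want a proper Eastern timezone so winter dates get -05:00:

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed -04:00 offset (Eastern Daylight Time), as in the examples
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))


def to_iso(year, month, day, hour, minute):
    """Format a local departure time as an ISO 8601 string with offset."""
    moment = datetime(year, month, day, hour, minute, tzinfo=EASTERN_DAYLIGHT)
    return moment.isoformat()


# Parsing the string back gives a timezone-aware datetime (Python 3.7+)
parsed = datetime.fromisoformat("2016-04-13T05:55:00-04:00")
```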

arkon · 2016-04-13T12:18:57Z

Would we have a gigantic list of all times for the month per stop, or would we try to split it up so it's 1 file per day?

qasim · 2016-04-13T14:52:28Z

Once per day seems appropriate since there would be a /lot/ of times otherwise. I wish the shuttle times were a little more predictable, but on random days it likes to change slightly. :/

If we do days, then the top level would be days:

{
  "date": "2016-04-13",
  "routes": [
    ...
  ]
}
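A sketch of how route objects in the earlier route-level shape could be regrouped into one document per day. The function name `split_by_day` is made up, and it assumes every time is an ISO 8601 string whose first ten characters are the local date:

```python
from collections import defaultdict


def split_by_day(routes):
    """Regroup route -> stops -> times into a list of per-day documents."""
    days = defaultdict(dict)  # date string -> {route name: route dict}
    for route in routes:
        for stop in route["stops"]:
            for time in stop["times"]:
                day = time[:10]  # "YYYY-MM-DD" prefix of the timestamp
                day_route = days[day].setdefault(
                    route["name"], {"name": route["name"], "stops": []}
                )
                stops = day_route["stops"]
                # Start a new stop entry the first time this stop shows up today
                if not stops or stops[-1]["location"] != stop["location"]:
                    stops.append({
                        "location": stop["location"],
                        "building_id": stop["building_id"],
                        "times": [],
                    })
                stops[-1]["times"].append(time)
    return [
        {"date": day, "routes": list(by_name.values())}
        for day, by_name in sorted(days.items())
    ]
```

Each returned document could then be written to its own file, named by its date.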

arkon · 2016-04-13T17:54:18Z

Yeah it's usually schedules that are consistent for Monday - Thursday, then a few are missing for Friday, and Saturday/Sunday have way less. Then there's the special schedules for exam periods, reading weeks, etc.

arkon · 2016-04-14T02:11:53Z

So it seems like the route ids aren't the same across the days, so we'll need to use the names as the identifiers. Unless you have a better idea, @qasim ?

(I'll probably take a shot at implementing this scraper.)

qasim · 2016-04-14T02:18:44Z

@arkon that works. The convention so far has been for ids to be all-caps alphanumeric. So you could remove the spaces/special characters and upper() the name, so ids look like this maybe?

STGEORGE
SHERIDAN
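Something like this would produce those ids from the route names. It is only a sketch: the helper name `route_id` is hypothetical, and dropping a trailing "Route" is an assumption made so that "St. George Route" comes out as STGEORGE rather than STGEORGEROUTE:

```python
import re


def route_id(name):
    """Derive an all-caps alphanumeric id from a route name."""
    # Assumption: a trailing "Route" is not part of the id
    base = re.sub(r"\s+route\s*$", "", name, flags=re.IGNORECASE)
    # Drop spaces and special characters, then uppercase what is left
    return re.sub(r"[^A-Za-z0-9]", "", base).upper()
```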

arkon · 2016-04-14T03:19:18Z

@qasim Yeah that would probably work. It should be something like:

{
  "date": "2016-04-13",
  "routes": [
    {
      "id": "STGEORGE",
      "name": "St. George Route",
      "stops": [
        {
          "location": "Instructional Centre Layby",
          "building_id": "334",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Hart House",
          "building_id": "002",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    },
    {
      "id": "SHERIDAN",
      "name": "Sheridan Route",
      "stops": [
        {
          "location": "Deerfield Hall North Layby",
          "building_id": "340",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Sheridan",
          "building_id": "",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    }
  ]
}

Note that there's no building_id for Sheridan College.

qasim · 2016-04-14T04:52:55Z

Looks good. Eventually I want the project to reference other scrapers' IDs as much as possible; there are a few cases where we don't right now. There's no infrastructure for that yet, though (matching building names to IDs in other scrapers). I guess for this one you'll have a manual mapping somewhere of the known stops to building IDs?
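A minimal sketch of what such a manual mapping might look like. The names `STOP_BUILDING_IDS` and `building_id_for` are hypothetical, and the ids are simply copied from the example JSON earlier in this thread, so they may not match the real building data:

```python
# Assumption: ids copied from the examples in this thread, not verified
STOP_BUILDING_IDS = {
    "Instructional Centre Layby": "334",
    "Hart House": "002",
    "Deerfield Hall North Layby": "340",
}


def building_id_for(location):
    """Look up a stop's building id, or "" when the stop has none."""
    # Stops with no U of T building (e.g. Sheridan) fall back to ""
    return STOP_BUILDING_IDS.get(location, "")
```

Returning an empty string for unknown stops matches the Sheridan entry in the example above.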

arkon · 2016-04-14T11:30:12Z

Yeah, I guess the manual mapping would work. How are you doing it elsewhere right now?

qasim · 2016-04-14T14:36:35Z

If it's a map.utoronto.ca layer, chances are there is a building_id attached to things. Otherwise, nothing yet.

qasim · 2016-04-15T22:03:16Z

This should be good to close after #41 (diff) is fixed.

qasim added the help wanted label Apr 12, 2016

qasim mentioned this issue Apr 12, 2016

Transportation API cobalt-uoft/cobalt#56

Closed

qasim assigned arkon Apr 14, 2016

arkon mentioned this issue Apr 15, 2016

UTM Shuttle Bus Schedule scraper #41

Merged

arkon mentioned this issue Apr 16, 2016

Fix time parsing in UTM Shuttle scraper #43

Merged

qasim closed this as completed Apr 16, 2016
