UTM shuttle times scraper #32

Closed
qasim opened this issue Apr 12, 2016 · 14 comments

@qasim
Member

qasim commented Apr 12, 2016

https://m.utm.utoronto.ca/shuttleByDate.php?year=2016&month=04&day=10

UTM has a mobile website for their UTM <-> UTSG shuttle, which falls under transit / transportation. This scraper should scrape the current month's shuttle times (from the first day of the current month through the last). The URL structure makes this an easy scrape of roughly 30 page requests.
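
A minimal sketch of how the per-day URLs for a month might be generated, assuming the query parameters shown in the URL above (`year`, `month`, `day`) with zero-padded month and day. The name `shuttle_urls` is hypothetical, and nothing is fetched here:

```python
import calendar

BASE_URL = "https://m.utm.utoronto.ca/shuttleByDate.php"


def shuttle_urls(year, month):
    """Return one URL per day of the given month, first day to last."""
    # monthrange returns (weekday of the first day, number of days in month)
    _, days_in_month = calendar.monthrange(year, month)
    return [
        f"{BASE_URL}?year={year}&month={month:02d}&day={day:02d}"
        for day in range(1, days_in_month + 1)
    ]
```

Calling `shuttle_urls(2016, 4)` would yield 30 URLs, one per page to request.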

As for UTSG, UTSC, and other UTM transportation, transit is solely TTC and GO. They already have their own open data APIs, so we will leave it at that!

@qasim
Member Author

qasim commented Apr 13, 2016

Proposed schema (per shuttle stop id):

{
  "id": String,
  "name": String,
  "dates": [{
    "start": String,
    "end": String
  }]
}

@arkon
Contributor

arkon commented Apr 13, 2016

I don't really understand what the start and end fields in dates mean?

@qasim
Member Author

qasim commented Apr 13, 2016

Whoops, start and end don't apply here, do they? ;P

So there is another aspect to this that I hadn't looked at or seen before. Based on other transit APIs I've looked at, they organize data with routes as the top level object and then routes own stops which contain times.

So the following is a schema with a route as the top-level object:

{
  "name": String,
  "stops": [{
    "location": String,
    "building_id": String,
    "times": [String]
  }]
}

An example with fictional data for St. George route:

{
  "name": "St. George Route",
  "stops": [
    {
      "location": "Instructional Centre Layby",
      "building_id": "334",
      "times": [
        "2016-04-13T05:55:00-04:00",
        "2016-04-13T07:55:00-04:00",
        "2016-04-14T05:55:00-04:00",
        "2016-04-14T07:55:00-04:00"
      ]
    },
    {
      "location": "Hart House",
      "building_id": "002",
      "times": [
        "2016-04-13T08:55:00-04:00",
        "2016-04-13T10:55:00-04:00",
        "2016-04-14T08:55:00-04:00",
        "2016-04-14T10:55:00-04:00"
      ]
    }
  ]
}

@qasim
Member Author

qasim commented Apr 13, 2016

Note: the dates are formatted in the ISO 8601 standard, with the UTC offset for the Eastern timezone. It balances human readability with a compact form, and of course remains machine readable. I think this is the standard the whole project should adopt, but if you have an argument for something better then we can discuss that.
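
As a rough illustration, Python's standard `datetime` module can produce strings in this shape. The sketch below assumes a fixed UTC-04:00 offset (Eastern Daylight Time), matching the example times in this thread; `to_iso` is a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-04:00 offset (Eastern Daylight Time), matching the
# examples in this thread. Dates outside daylight time would need UTC-05:00.
EDT = timezone(timedelta(hours=-4))


def to_iso(year, month, day, hour, minute):
    """Format a departure time as an ISO 8601 string with its UTC offset."""
    return datetime(year, month, day, hour, minute, tzinfo=EDT).isoformat()


print(to_iso(2016, 4, 13, 5, 55))  # prints 2016-04-13T05:55:00-04:00
```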

@arkon
Contributor

arkon commented Apr 13, 2016

Would we have a gigantic list of all times for the month per stop, or would we try to split it up so it's 1 file per day?

@qasim
Member Author

qasim commented Apr 13, 2016

One file per day seems appropriate since there would be a /lot/ of times otherwise. I wish the shuttle times were a little more predictable, but on random days the schedule changes slightly. :/

If we do days, then the top level would be days:

{
  "date": "2016-04-13",
  "routes": [
      ...
  ]
}
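
A minimal sketch of the one-file-per-day idea, assuming each day's routes have already been scraped into a list. The helper name `write_day` and the `<date>.json` file naming are assumptions, not something settled in this thread:

```python
import json
from pathlib import Path


def write_day(date, routes, out_dir):
    """Write one JSON file for a single day and return its path."""
    # Top-level object is the day; routes nest underneath it.
    doc = {"date": date, "routes": routes}
    path = Path(out_dir) / f"{date}.json"
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return path
```

Keeping the day at the top level means each file stays small even when a stop has many departures.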

@arkon
Contributor

arkon commented Apr 13, 2016

Yeah, the schedules are usually consistent for Monday to Thursday, then a few times are missing on Friday, and Saturday/Sunday have far fewer. Then there are the special schedules for exam periods, reading weeks, etc.

@arkon
Contributor

arkon commented Apr 14, 2016

So it seems like the route ids aren't the same across the days, so we'll need to use the names as the identifiers. Unless you have a better idea, @qasim ?

(I'll probably take a shot at implementing this scraper.)

@qasim
Member Author

qasim commented Apr 14, 2016

@arkon that works. The convention so far has been for the id to be all-caps alphanumeric. So you could remove the spaces/special characters and call upper(), so ids look like this maybe?

STGEORGE
SHERIDAN

@arkon
Contributor

arkon commented Apr 14, 2016

@qasim Yeah that would probably work. It should be something like:

{
  "date": "2016-04-13",
  "routes": [
    {
      "id": "STGEORGE",
      "name": "St. George Route",
      "stops": [
        {
          "location": "Instructional Centre Layby",
          "building_id": "334",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Hart House",
          "building_id": "002",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    },
    {
      "id": "SHERIDAN",
      "name": "Sheridan Route",
      "stops": [
        {
          "location": "Deerfield Hall North Layby",
          "building_id": "340",
          "times": [
            "2016-04-13T05:55:00-04:00",
            "2016-04-13T07:55:00-04:00"
          ]
        },
        {
          "location": "Sheridan",
          "building_id": "",
          "times": [
            "2016-04-13T08:55:00-04:00",
            "2016-04-13T10:55:00-04:00"
          ]
        }
      ]
    }
  ]
}

Note that there's no building_id for Sheridan College.

@qasim
Member Author

qasim commented Apr 14, 2016

Looks good. Eventually I want the project to reference other scrapers' IDs as much as possible; there are a few cases where we don't right now. There's no infrastructure for that yet, though (matching building names to IDs in other scrapers). I guess for this one you'll have a manual mapping somewhere of the known stops to building IDs?
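
A manual mapping could be as simple as a dictionary lookup. The sketch below uses the stop names and ids from the examples earlier in this thread, which may be illustrative rather than real; `STOP_BUILDING_IDS` and `building_id` are hypothetical names:

```python
# Hypothetical manual mapping from stop name to building id. The ids are the
# ones used in the examples in this thread and may not be real.
STOP_BUILDING_IDS = {
    "Instructional Centre Layby": "334",
    "Hart House": "002",
    "Deerfield Hall North Layby": "340",
}


def building_id(location):
    """Return the known building id for a stop, or "" when there is none."""
    return STOP_BUILDING_IDS.get(location, "")
```

Stops without a campus building, such as Sheridan, fall back to an empty string, matching the example output above.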

@arkon
Contributor

arkon commented Apr 14, 2016

Yeah, I guess the manual mapping would work. How are you doing it elsewhere right now?

@qasim
Member Author

qasim commented Apr 14, 2016

If it's a map.utoronto.ca layer, chances are there is a building_id attached to things. Otherwise, nothing yet.

@qasim
Member Author

qasim commented Apr 15, 2016

This should be good to close after #41 (diff) is fixed.
