Project: Bus stops database

Nikhil VJ edited this page Jun 12, 2018 · 23 revisions

<< back to DataMeet Pune home page

Aim of this project

To arrive at a database of bus stops used by PMPML bus service, which is unique (one entry per stop, no repetitions and having a UID) and geo-located (every stop has a latitude and longitude value).

Skills needed

Some or all of the following may be required:

  • GIS or R based computing
  • QGIS or similar GIS applications
  • OpenRefine
  • Formulae and manipulating tabular data in MS Excel / LibreOffice Calc / Google Spreadsheets, or loading the data to a database and running queries to do the same thing
  • quickly mapping any number of tabular data entries for spot checking

Project Status

Step 1 is complete (for English stop names, not marathi); step 2 onwards to do. Project is OPEN to join in.


Lead: Nikhil VJ, nikhil.js (at), +919665831250

1) Analysis and Aggregating initial data

1.1) Starting dataset:

Pune Open Data portal has released route mapping database in Jan 2017

BRT & Non BRT Route Details & Bus Stop LatLong.xls

(readers new to this: you will have to download and open up this file to understand what’s written next)

Note: there are two other files that have only one sheet’s worth of the same data, so leaving them out. But here are the links to them: en, mr

1.2) Combining the "376.." and "short.." sheets into one

All routes.csv

Watch out for : duplications

The all-routes listing has both brts and non-brts stops, coded as per route number and sequence, in a non-unique way. Totally 35k+ entries.


1.3) BRTS stops

We have a unique listing, with lat-long info.

See it loaded on this web map:

Download CSV: brts unique stops.csv

brts unique stops (click to see full image)

To do: Create a unique id column and assign a UID to each distinct stop.

1.4) Non-BRTS stops

We have a unique names list, but without lat-long info.

2) Steps

2.1) Clustering by stop names:

When we plot a Venn diagram on the stop names list in brts, non-brts and all-listings (ie, 3 circles), we get varying numbers with english names and marathi names. This indicates that there may be variations in spellings in the data.

Recommendation from Nikhil: We can use a tool called OpenRefine to find and normalize varying spellings. It finds all entries having slightly varying letters, and allows the user to change all entries to just one spelling. One way to go about this: adding new columns (en and mr) for normalized names. After a round through OpenRefine we see what we have and take a call if we should proceed with normalized names. I’ll volunteer to do this and post the resulting files and one or more screenshots of the process here.


  1. Combined all the data (brts stops, non-brts stops and all routes listing) into one spreadsheet (csv).
  2. Duplicated the stop name columns to “Name-en-normalized” and “Name-mr-normalized” (worked only on former for now)
  3. Loaded it on OpenRefine.
  4. Cleaned up the cells by trimming leading and trailing whitespaces. See screenshot
  5. Used Text facet > Clustering on the “Name-en-normalized” column
  6. Clustered together the entries with similar names, and through the Merge action, “normalized” them. ie, chose one name and assigned it to all the cells in the cluster. See screenshot.
  7. There were many modes of clustering available.. tried them all.
  8. Did this process with English names only for now.
  9. The output data file is here: bus-combined-en-normalized.csv (zip)
  10. You can run a diff check on the stop names and normalized columns to see which entries are changed.
  11. After normalizing, the venn diagram comes like this: See screenshot
  12. Analysis: Number of unique stop names has reduced from 2396 to 2320 post normalizing. Of this, 96.4% are referenced in BRT or nonBRT unique stop listings, which means they just have to be assigned a UID, and through direct text matching their entries in the full listing can be assigned the UID too. With some manual inspection and lat-long lookup, it may be possible to reconcile the outlier entries too. See Screenshot
  13. Loaded all the data into this google spreadsheet. We can work on assigning UID etc there. First worksheet has all the bus stop entries combined, with english names normalized (as described above). This is split back into the original lists in subsequent worksheets.
  14. This data can be used for next steps.

Note: From this point onwards, all mentions of "stop name", "names" refer to the Normalized stop names, not the original stop names.

2.2) Some corrections in the data

Ran a duplicates check on the codes in the routes list. Made some corrections and noted some anomalies.

all routes list : finding duplicates:

  10 | 23-U	23-U-20
   2 | 147A-D	147A-D-03
   2 | 2-D	2-D-00
   2 | 345-U	345-U-31
   2 | 58P-D	58P-D-00
  • fixed repeating 23-U-20 entries that were to be 23-U-20, 21...29
  • 147A-D-03 was repeating instead of being -03,-04, so fixed that and shifted ahead the numbers of the subsequent entries
  • similar fixes for 2-D-00, 345-U-31
  • noticed 348-D.. stops sequence will need some work, can be done in next project (routes db)
  • route 58P-D : both up and down sequences have been mixed up, can be done in next project

2.3) Assigning a UID to BRTS stops and non-BRTS stops lists

  • Since we already have a unique stops listing here, let's assign a UID to these.

  • Open the spreadsheet, go to BRTS worsheet.

  • If you're a member of this project, the team lead would have given you editor access to this google spreadsheet.

  • Create a column called "UID" if it is not already made.

  • Assign each stop a UID. See the recommendations below.
    Nikhil: Recommending a coding convention we had arrived at earlier : 6 chars, first 5 chars being an acronym of the stop such that a reader can make out which stop it is; and then last char being a serial number in case there are multiple stops at that location.
    example: for Swargate: SWRGT1, SWRGT2 etc.
    ... this said, feel free to explore and arrive at your own coding convention. I'll recommend to not have just random numbering, rather have some way of coding in the general name of the stop so that when someone is reading these codes they can reasonably identify them. It makes debugging a lot easier.

  • 5.8.17: Assigned UIDs for all the BRTS stops.

  • Do the same for the non-BRTS stops list too. But before doing it, we have to remove the entries from nonBRTS list that are already present in BRTS list. See diagram:
    Venn BRTS vs nonBRTS

  • 6.8.17 : Removed entries from nonBRTS stops list that were already in BRTS stops list. Used this formula:
    ..then sorted the whole list by the y/n column, and moved all the 'y' rows to another sheet.

----Following steps are yet to do----

  • To do: Assign UID to all nonBRTS stops.

2.4) Separating the routes data: BRTS side

  • Here's a Venn diagram of stop names, between all routes sheet and BRTS stops sheet:
  • Let's separate out the routes data as per this diagram. The entries in the intersection reference stops in the BRTS stops list. The remaining would reference stops in the non-BRTS stops list.
  • Pull UID from the BRTS stops sheet into the BRTS-side separated routes data. Do this using R or excel formulas (like MATCH, INDEX)
  • Pull the lat-longs also, and compare the distance between route entry location and referenced BRTS stop location. Find out any anomalies. In the absence of anomalies, consider the lat-long co-ords from BRTS stops sheet as the standard for that stop.

2.5) Separating the routes data: nonBRTS side

  • Do the same as described above with the nonBRTS stops, ie, intersect the remaining routes sheet with nonBRTS stops list and get nonBRTS-side separated routes data.
  • Pull in the UID.
  • Since the nonBRTS stops don't have lat-long, we have to go through the different lat-longs in the routes listing and pick the one that's best, or look at satellite view and find the stop's actual location.

2.6) Reconcile the outliers

ie, stops in routes list, BRTS and nonBRTS stops list that did not match each other. They would be few in number and from visual inspection some matches would be found. If there are still stops in routes data that don't have any match, and by seeing lat-long locations they don't seem to belong to any other stop, then create new entries for them in the nonBRTS sheet, and assign UID and lat-long.

2.7) Put Humpty Dumpty back together again

ie, get the routes back together: BRTS side, nonBRTS side, outliers. We will now have routes data where all rows have UID. This data will be used in the next project, routes database.

2.8) Structuring stops data to match stops.txt of static GTFS


Old doc: Project working log : Pmpml bus stops

Bigger picture

This task / project ties in to a long term process of improving PMPML through increased transparency and systemization. The global standard data format for public transit is (GTFS), which is used by Google Transit and most transit related apps. It critically needs a stop-centric database and routes laid out in a systemized way. We want to achieve this critical dataset in an open sourced manner, such that there is no private ownership or secrecy and the database is openly and freely available to all persons and groups who are working on improving PMPML.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.