# Fuzzy matching stations from 2022 without information

In exploring the issues with inconsistency in station names and station identification both across time and between the ride data and station information data, I determined that 2022 ride data matched the station information data best - unsurprisingly, as the latter dates from early 2023.

Some stations in the ride data did not match both name and station ID with the station information data, but this was typically down to differences in station name (spelling, punctuation, address vs. descriptive name). The location information associated with the ID was generally correct in these situations, meaning they can be used for spatial analysis.

Some station IDs in the ride data, however, did not appear in the station information dataset. This may be because the same station existed in 2023 but the ID had been changed, or because the station location and associated ID no longer existed in 2023.

### Load in packages and data

In [2]:
setwd("../")

In [3]:
library(tidyverse)
library(fuzzyjoin)
library(levitate)

“package ‘tidyverse’ was built under R version 4.2.3”
“package ‘ggplot2’ was built under R version 4.2.3”
“package ‘tibble’ was built under R version 4.2.3”
“package ‘tidyr’ was built under R version 4.2.3”
“package ‘readr’ was built under R version 4.2.3”
“package ‘purrr’ was built under R version 4.2.3”
“package ‘dplyr’ was built under R version 4.2.3”
“package ‘stringr’ was built under R version 4.2.3”
“package ‘forcats’ was built under R version 4.2.3”
“package ‘lubridate’ was built under R version 4.2.3”
── Attaching core tidyverse packages ────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts 

In [4]:
data_all_years <- read_csv("./Data/data_all_years.csv")

Rows: 16961551 Columns: 10
── Column specification ──────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Start.Station.Name, End.Station.Name, User.Type
dbl  (5): Trip.Id, Trip.Duration, Start.Station.Id, End.Station.Id, Bike.Id
dttm (2): Start.Time, End.Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [5]:
stations <- read_csv("./Data/stations/station_information.csv")

Rows: 661 Columns: 17
── Column specification ──────────────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, physical_configuration, address, obcn, post_code, cross_street
dbl (6): station_id, lat, lon, altitude, capacity, nearby_distance
lgl (5): is_charging_station, rental_methods, groups, _ride_code_support, is...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [6]:
intersections <- read_csv("./Data/centrelines/Centreline Intersection - 4326.csv")

Rows: 48955 Columns: 21
── Column specification ──────────────────────────────────────────────────────────────────
Delimiter: ","
chr   (7): INTERSECTION_DESC, CLASSIFICATION, CLASSIFICATION_DESC, ELEVATION...
dbl  (10): _id, INTERSECTION_ID, ELEVATION_ID, NUMBER_OF_ELEVATIONS, ELEVATI...
lgl   (2): ELEVATION, HEIGHT_RESTRICTION
dttm  (2): DATE_EFFECTIVE, DATE_EXPIRY

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [7]:
spec(intersections)

cols(
  `_id` = col_double(),
  INTERSECTION_ID = col_double(),
  DATE_EFFECTIVE = col_datetime(format = ""),
  DATE_EXPIRY = col_datetime(format = ""),
  ELEVATION_ID = col_double(),
  INTERSECTION_DESC = col_character(),
  CLASSIFICATION = col_character(),
  CLASSIFICATION_DESC = col_character(),
  NUMBER_OF_ELEVATIONS = col_double(),
  ELEVATION_FEATURE_CODE = col_double(),
  ELEVATION_FEATURE_CODE_DESC = col_character(),
  ELEVATION_LEVEL = col_double(),
  ELEVATION = col_logical(),
  ELEVATION_UNIT = col_character(),
  HEIGHT_RESTRICTION = col_logical(),
  HEIGHT_RESTRICTION_UNIT = col_character(),
  STATE = col_double(),
  TRANS_ID_CREATE = col_double(),
  TRANS_ID_EXPIRE = col_double(),
  OBJECTID = col_double(),
  geometry = col_character()
)

In [38]:
head(intersections)

_id,INTERSECTION_ID,DATE_EFFECTIVE,DATE_EXPIRY,ELEVATION_ID,INTERSECTION_DESC,CLASSIFICATION,CLASSIFICATION_DESC,NUMBER_OF_ELEVATIONS,ELEVATION_FEATURE_CODE,⋯,ELEVATION_LEVEL,ELEVATION,ELEVATION_UNIT,HEIGHT_RESTRICTION,HEIGHT_RESTRICTION_UNIT,STATE,TRANS_ID_CREATE,TRANS_ID_EXPIRE,OBJECTID,geometry
<dbl>,<dbl>,<dttm>,<dttm>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<lgl>,<chr>,<lgl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,13470264,2008-12-12 04:22:46,3000-01-01 05:00:00,13,Robindale Ave / Rimilton Ave,MNRSL,Minor-Single Level,1,501300,⋯,,,,,,8,200000.0,-1,1,"{'type': 'MultiPoint', 'coordinates': [[-79.5310702158097, 43.6072425849711]]}"
2,13470193,2008-12-12 04:22:46,3000-01-01 05:00:00,4718,Bellman Ave / Valermo Dr,MNRSL,Minor-Single Level,1,501300,⋯,,,,,,8,200000.0,-1,4,"{'type': 'MultiPoint', 'coordinates': [[-79.5313732423075, 43.609600012102]]}"
3,13470188,2008-12-12 04:22:46,3000-01-01 05:00:00,32728,Rimilton Ave / Valermo Dr,SEUSL,Pseudo Intersection-Single Level,1,509200,⋯,,,,,,8,200000.0,-1,5,"{'type': 'MultiPoint', 'coordinates': [[-79.5301175801351, 43.6098292200395]]}"
4,13470203,2008-12-12 04:22:46,3000-01-01 05:00:00,21669,Valermo Dr / Goa Crt,MNRSL,Minor-Single Level,1,501300,⋯,,,,,,8,200000.0,-1,7,"{'type': 'MultiPoint', 'coordinates': [[-79.5331747278101, 43.6091899836547]]}"
5,13470228,2008-12-12 04:22:46,3000-01-01 05:00:00,36820,Valermo Dr / Thirtieth St,MNRSL,Minor-Single Level,1,501300,⋯,,,,,,8,200000.0,-1,9,"{'type': 'MultiPoint', 'coordinates': [[-79.5355925204638, 43.6086391702897]]}"
6,13470242,2008-12-12 04:22:46,3000-01-01 05:00:00,2869,Valermo Dr / Delta St,MNRSL,Minor-Single Level,1,501300,⋯,,,,,,8,200000.0,-1,10,"{'type': 'MultiPoint', 'coordinates': [[-79.5380199282464, 43.6081028218457]]}"


### Identify station ID/name combinations in ride data only

Using an anti join, station ID/name combinations that appear in the ride data but not the station information data are identified.

In [8]:
station_id_name_notfound <- anti_join(data_all_years,
                                      stations,
                                      by = join_by(Start.Station.Id == station_id,
                                                   Start.Station.Name == name)) %>%
drop_na(Start.Station.Id, Start.Station.Name) %>%
filter(year(Start.Time) == 2022) %>%
distinct(Start.Station.Id) %>%
left_join(data_all_years) %>%
select(Start.Station.Id, Start.Station.Name) %>%
distinct() %>%
left_join(stations, by = join_by(Start.Station.Id == station_id)) %>%
select(Start.Station.Id, Start.Station.Name, name, address)

Joining with `by = join_by(Start.Station.Id)`


These combinations are then split into cases where the ID is found in the station information data and where they are not.

In [9]:
station_id_found <- station_id_name_notfound %>%
filter(Start.Station.Id %in% stations$station_id)

In [9]:
station_id_found

Start.Station.Id,Start.Station.Name,name,address
<dbl>,<chr>,<chr>,<chr>
7334,Simcoe St / Wellington St North,Simcoe St / Wellington St W North,Simcoe St / Wellington St 2
7171,Ontario Place Blvd / Lake Shore Blvd W (East),Ontario Place Blvd / Lake Shore Blvd W,Ontario Place Blvd / Lake Shore Blvd W
7171,Ontario Place Blvd / Lakeshore Blvd W,Ontario Place Blvd / Lake Shore Blvd W,Ontario Place Blvd / Lake Shore Blvd W
7171,Ontario Place Blvd / Remembrance Dr,Ontario Place Blvd / Lake Shore Blvd W,Ontario Place Blvd / Lake Shore Blvd W
7250,St. George St / Russell St - SMART,Ursula Franklin St / St. George St - SMART,Ursula Franklin St / St. George St
7323,457 King St. W. at Spadina,457 King St W,457 King St W.
7389,College Park- Gerrard Entrance,College Park - Gerrard Entrance,College Park- Gerrard Entrance
7398,York St / Harbour St,York St / Lake Shore Blvd W,York St / Lakeshore St W - South
7398,York St / Lakeshore St W - South,York St / Lake Shore Blvd W,York St / Lakeshore St W - South
7332,200 Bloor St. E.,200 Bloor St E,200 Bloor St. E.


A visual inspection suggests that every station whose ID, but not name, is found in the station information data corresponds to the correct location.

In [10]:
station_id_notfound <- station_id_name_notfound %>%
filter(!(Start.Station.Id %in% stations$station_id))

In [11]:
station_id_notfound

Start.Station.Id,Start.Station.Name,name,address
<dbl>,<chr>,<chr>,<chr>
7113,Parliament St / Aberdeen Ave,,
7282,Adelaide St W / Bay St - SMART,,
7013,Scott St / The Esplanade,,
7011,Wellington St W / Portland St,,
7275,Queen St W / James St,,
7491,D'Arcy St / Spadina Ave - SMART,,
7382,Simcoe St / Adelaide St W,,
7372,King St W / Portland St,,
7372,Adelaide St W / Portland St,,
7255,Stewart St / Bathurst St - SMART,,


### Check if stations exist in station information with different ID

To determine whether any stations were given new IDs between 2022 and April 2023, the simplest check is to see whether any station names for the unknown IDs exactly match names or addresses in the station information data.

In [11]:
station_id_changed <- station_id_notfound %>%
mutate(output = coalesce(stations$station_id[match(Start.Station.Name, stations$name)],
                         stations$station_id[match(Start.Station.Name, stations$address)]))

In [13]:
station_id_changed

Start.Station.Id,Start.Station.Name,name,address,output
<dbl>,<chr>,<chr>,<chr>,<dbl>
7113,Parliament St / Aberdeen Ave,,,
7282,Adelaide St W / Bay St - SMART,,,
7013,Scott St / The Esplanade,,,
7011,Wellington St W / Portland St,,,
7275,Queen St W / James St,,,
7491,D'Arcy St / Spadina Ave - SMART,,,
7382,Simcoe St / Adelaide St W,,,
7372,King St W / Portland St,,,7720.0
7372,Adelaide St W / Portland St,,,
7255,Stewart St / Bathurst St - SMART,,,
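
The lookup above works because `match()` returns `NA` when a name has no counterpart, so the address lookup only fills in where the name lookup failed. A minimal base R sketch of the same idea follows. The two station rows are taken from outputs in this notebook, the second ride name is invented purely to exercise the address fallback, and `ifelse()` stands in for `coalesce()`.

```r
# Toy station information table (rows copied from outputs in this notebook)
info <- data.frame(
  station_id = c(7720, 7469),
  name       = c("King St W / Portland St", "Wellington St W / York St"),
  address    = c("620 King Street West", "Wellington St W / York St")
)

# Ride-data names to look up; the second one is hypothetical
ride_names <- c("King St W / Portland St",
                "620 King Street West",
                "Queen St W / James St")

# match() gives the row of the first exact hit, or NA if there is none
by_name    <- info$station_id[match(ride_names, info$name)]
by_address <- info$station_id[match(ride_names, info$address)]

# Prefer the name match, fall back to the address match
new_id <- ifelse(is.na(by_name), by_address, by_name)
data.frame(ride_names, new_id)
```

Names with no exact counterpart in either column stay `NA`, which is what the table above shows for nearly every unknown ID.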


In [14]:
stations %>% subset(station_id == 7720)

station_id,name,physical_configuration,lat,lon,altitude,address,capacity,is_charging_station,rental_methods,groups,obcn,nearby_distance,_ride_code_support,post_code,is_valet_station,cross_street
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<lgl>,<lgl>,<lgl>,<chr>,<dbl>,<lgl>,<chr>,<lgl>,<chr>
7720,King St W / Portland St,SMARTMAPFRAME,43.6444,-79.40065,0,620 King Street West,16,False,,,,350,True,M5V 1M6,,


Just one station clearly exists under a new ID in the station information data: 7372, which in 2022 was listed at both Adelaide St W / Portland St and King St W / Portland St (very close geographically) but in 2023 appears only at the latter, under ID 7720. We can add 7372 to the station information data, using the information from 7720 to approximate its location.

In [12]:
station_7372 <- stations %>% subset(station_id == 7720)
station_7372$station_id <- 7372
stations_update_2022 <- stations %>% add_row(station_7372)
station_id_changed <- station_id_changed %>% filter(Start.Station.Id != 7372)

In [77]:
saveRDS(stations_update_2022, "./Data/stations_update_2022.rds")

While exact matches are rare, some stations have had IDs replaced while the names have changed subtly or are formatted differently in the station information data. To attempt to identify those, a fuzzy-matched join is worth trying.

In [13]:
stations_fuzz_name <- stringdist_join(station_id_changed,
                                      stations,
                                      mode = "left",
                                      by = c(Start.Station.Name = "name"),
                                      max_dist = 6)

In [14]:
stations_fuzz_name %>% select(Start.Station.Id,
                              Start.Station.Name,
                              station_id,
                              name.y,
                              address.y)

Start.Station.Id,Start.Station.Name,station_id,name.y,address.y
<dbl>,<chr>,<dbl>,<chr>,<chr>
7113,Parliament St / Aberdeen Ave,,,
7282,Adelaide St W / Bay St - SMART,,,
7013,Scott St / The Esplanade,7716.0,Church St / The Esplanade,75 The Esplanade
7011,Wellington St W / Portland St,7469.0,Wellington St W / York St,Wellington St W / York St
7275,Queen St W / James St,7542.0,Queen St W / John St,Queen St W / John St
7275,Queen St W / James St,7712.0,Queen St W / Shaw St,999 Queen Street West
7491,D'Arcy St / Spadina Ave - SMART,,,
7382,Simcoe St / Adelaide St W,,,
7255,Stewart St / Bathurst St - SMART,,,
7544,Foster Pl / Elizabeth St - SMART,,,


In [15]:
stations_fuzz_address <- stringdist_join(station_id_changed,
                                         stations,
                                         mode = "left",
                                         by = c(Start.Station.Name = "address"),
                                         max_dist = 6)

In [16]:
stations_fuzz_address %>% select(Start.Station.Id,
                                 Start.Station.Name,
                                 station_id,
                                 name.y,
                                 address.y)

Start.Station.Id,Start.Station.Name,station_id,name.y,address.y
<dbl>,<chr>,<dbl>,<chr>,<chr>
7113,Parliament St / Aberdeen Ave,,,
7282,Adelaide St W / Bay St - SMART,,,
7013,Scott St / The Esplanade,,,
7011,Wellington St W / Portland St,7469.0,Wellington St W / York St,Wellington St W / York St
7275,Queen St W / James St,7542.0,Queen St W / John St,Queen St W / John St
7491,D'Arcy St / Spadina Ave - SMART,,,
7382,Simcoe St / Adelaide St W,,,
7255,Stewart St / Bathurst St - SMART,,,
7544,Foster Pl / Elizabeth St - SMART,,,
7544,Foster Pl / Elizabeth St,,,


Unfortunately, none of the fuzzy matches on either the name or address field of the station information data looks particularly promising. Generally the match identifies one of the streets correctly but pairs it with a different cross street. This suggests that the locations of the unmatched IDs are not in the station information dataset and that a new strategy is needed.
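
To see why a plain edit-distance threshold picks up the wrong cross street, it helps to look at the distances directly. The sketch below uses base R's `adist()` (Levenshtein distance) as a stand-in for the distance used inside `stringdist_join()`, which is an assumption and may differ slightly depending on the method. The names are taken from the join output above.

```r
ride_name  <- "Queen St W / James St"
candidates <- c("Queen St W / John St", "Queen St W / Shaw St")

# Edit distance from the ride-data name to each candidate station name
d <- setNames(drop(adist(ride_name, candidates)), candidates)
d

# Candidates that would pass a max_dist = 6 threshold
within_threshold <- names(d)[d <= 6]
within_threshold
```

Both candidates pass the threshold even though neither is the right intersection, because swapping one short street name for another costs only a few edits.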

### Identify locations for stations that do not appear in station information

Most, though not all, stations are at street intersections. The City of Toronto provides a file of the locations of all intersections in the city in the Open Data Portal, which should be helpful for locating those stations which do occur at intersections. This data has already been loaded above. The first step is to try an inner join to look for direct matches between station locations and intersection names.

In [17]:
stations_intersection <- inner_join(station_id_changed,
                                    intersections,
                                    by = join_by(Start.Station.Name == INTERSECTION_DESC))

In [18]:
stations_intersection

Start.Station.Id,Start.Station.Name,name,address,output,_id,INTERSECTION_ID,DATE_EFFECTIVE,DATE_EXPIRY,ELEVATION_ID,⋯,ELEVATION_LEVEL,ELEVATION,ELEVATION_UNIT,HEIGHT_RESTRICTION,HEIGHT_RESTRICTION_UNIT,STATE,TRANS_ID_CREATE,TRANS_ID_EXPIRE,OBJECTID,geometry
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dttm>,<dttm>,<dbl>,⋯,<dbl>,<lgl>,<chr>,<lgl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
7275,Queen St W / James St,,,,43259,13466379,2024-04-12 15:42:23,3000-01-01 05:00:00,7152,⋯,,,,,,8,370649,-1,114450,"{'type': 'MultiPoint', 'coordinates': [[-79.3807162626028, 43.6521008898127]]}"
7544,Foster Pl / Elizabeth St,,,,40195,13466018,2023-08-17 12:53:58,3000-01-01 05:00:00,4542,⋯,,,,,,8,363175,-1,110618,"{'type': 'MultiPoint', 'coordinates': [[-79.3848220785899, 43.6545534061131]]}"
7653,Bloor St W / Indian Rd,,,,3501,13466046,2008-12-12 04:22:46,3000-01-01 05:00:00,12465,⋯,,,,,,8,200000,-1,5567,"{'type': 'MultiPoint', 'coordinates': [[-79.4570501089068, 43.6553527151862]]}"
7653,Bloor St W / Indian Rd,,,,38036,13466037,2022-03-28 12:58:19,3000-01-01 05:00:00,14404,⋯,,,,,,8,344050,-1,106326,"{'type': 'MultiPoint', 'coordinates': [[-79.4568442039919, 43.6553957541211]]}"
7049,Queen St W / Portland St,,,,42303,13467150,2024-01-31 12:56:38,3000-01-01 05:00:00,4235,⋯,,,,,,8,368160,-1,113245,"{'type': 'MultiPoint', 'coordinates': [[-79.4014174537963, 43.6477126767303]]}"
7578,Oak St / Parliament St,,,,22859,20235156,2008-12-12 04:22:46,3000-01-01 05:00:00,14181,⋯,,,,,,8,200000,-1,40384,"{'type': 'MultiPoint', 'coordinates': [[-79.3665329339236, 43.6607189905741]]}"
7472,Dundas St E / Victoria St,,,,9996,13465716,2008-12-12 04:22:46,3000-01-01 05:00:00,11374,⋯,,,,,,8,200000,-1,18404,"{'type': 'MultiPoint', 'coordinates': [[-79.3795762008715, 43.6562913281717]]}"
7407,University Ave / Queen St W,,,,22551,13466604,2008-12-12 04:22:46,3000-01-01 05:00:00,9968,⋯,,,,,,8,200000,-1,39861,"{'type': 'MultiPoint', 'coordinates': [[-79.386628635294, 43.6508528459786]]}"
7482,Danforth Ave / Sibley Ave,,,,9663,13459659,2008-12-12 04:22:46,3000-01-01 05:00:00,1800,⋯,,,,,,8,200000,-1,17645,"{'type': 'MultiPoint', 'coordinates': [[-79.2926856191101, 43.690223652078]]}"


This identifies 8 stations (one station maps to two intersections because a street is discontinuous across the two sides of the intersection), which is not particularly successful.
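
For any later spatial analysis, the coordinates of these matches still have to be pulled out of the text `geometry` column. A minimal base R sketch, assuming every value follows the single-point `MultiPoint` layout shown in the output above; the regular expression and the `lon`/`lat` column names are illustrative choices, not part of the source data.

```r
# Two geometry strings copied from the join output above
geometry <- c(
  "{'type': 'MultiPoint', 'coordinates': [[-79.3807162626028, 43.6521008898127]]}",
  "{'type': 'MultiPoint', 'coordinates': [[-79.3848220785899, 43.6545534061131]]}"
)

# Capture the two numbers inside the double square brackets
pattern <- "\\[\\[ *(-?[0-9.]+), *(-?[0-9.]+) *\\]\\]"
m <- regmatches(geometry, regexec(pattern, geometry))

# Element 1 of each match is the full match; 2 and 3 are the captured numbers
coords <- data.frame(
  lon = as.numeric(vapply(m, `[`, character(1), 2)),
  lat = as.numeric(vapply(m, `[`, character(1), 3))
)
coords
```

A geometry holding more than one point would need a different approach; only the first coordinate pair is captured here.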

The next approach is to try fuzzy matching the station names to the intersection names.

In [19]:
stations_fuzz_intersection <- stringdist_left_join(station_id_changed,
                                                   intersections,
                                                   by = c(Start.Station.Name = "INTERSECTION_DESC"),
                                                   max_dist = 2)

In [20]:
stations_fuzz_intersection %>% select(Start.Station.Id,
                                      Start.Station.Name,
                                      INTERSECTION_DESC)

Start.Station.Id,Start.Station.Name,INTERSECTION_DESC
<dbl>,<chr>,<chr>
7113,Parliament St / Aberdeen Ave,
7282,Adelaide St W / Bay St - SMART,
7013,Scott St / The Esplanade,
7011,Wellington St W / Portland St,
7275,Queen St W / James St,Queen St W / James St
7491,D'Arcy St / Spadina Ave - SMART,
7382,Simcoe St / Adelaide St W,
7255,Stewart St / Bathurst St - SMART,
7544,Foster Pl / Elizabeth St - SMART,
7544,Foster Pl / Elizabeth St,Foster Pl / Elizabeth St


Fuzzy matching returns almost identical matches to the simple inner join. This is initially surprising, because a file containing all intersections should match most of the stations in the ride data that we need locations for (a few are not at intersections).

Exploring the intersection data, it becomes clear that a major issue with the matching (fuzzy or otherwise) between the ride data and intersection data is that the order of how cross-streets are listed is not standardized. The same two streets intersecting, but listed in a different order, will not match directly and will not fuzzy match well using a simple fuzzy matcher.
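
A quick check of that claim, using base R's `adist()` as an approximation of the distance behind `stringdist_left_join()` (an assumption; the two are not guaranteed to agree exactly). The first name comes from the ride data and the second from the intersection file.

```r
a <- "Parliament St / Aberdeen Ave"   # ride data
b <- "Aberdeen Ave / Parliament St"   # intersection file, same streets reversed

# Character-level edit distance between the two orderings
d_reordered <- drop(adist(a, b))
d_reordered

# Far beyond the max_dist = 2 used in the fuzzy join above
d_reordered > 2
```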

What is needed is a tokenized fuzzy matching approach, where the names are broken up into tokens (words) and then matched as sets of tokens rather than as whole strings. The set of tokens for a given intersection is the same regardless of the order in which the cross streets are listed. In Python, this can be done with the popular fuzzywuzzy package. In R, there are multiple ports, but the one written natively in R is levitate. Unlike fuzzyjoin, levitate is built for string matching, not expressly for joining.
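
To make the idea concrete, here is a simplified token-set similarity written in base R. It is a sketch in the spirit of fuzzywuzzy's token set ratio, not levitate's implementation: the tokenizer, the helper names (`tokens`, `sim`, `token_set_sim`) and the Levenshtein-based score are all assumptions made for illustration.

```r
# Split a name into unique lowercase word tokens (separators and punctuation dropped)
tokens <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z0-9']+")[[1]]
  unique(toks[nzchar(toks)])
}

# Normalised Levenshtein similarity in [0, 1]
sim <- function(a, b) {
  if (nchar(a) == 0 && nchar(b) == 0) return(1)
  1 - drop(adist(a, b)) / max(nchar(a), nchar(b))
}

# Compare the shared tokens with each side's sorted token string and keep the best score
token_set_sim <- function(a, b) {
  ta <- tokens(a)
  tb <- tokens(b)
  common <- paste(sort(intersect(ta, tb)), collapse = " ")
  s1 <- trimws(paste(common, paste(sort(setdiff(ta, tb)), collapse = " ")))
  s2 <- trimws(paste(common, paste(sort(setdiff(tb, ta)), collapse = " ")))
  max(sim(common, s1), sim(common, s2), sim(s1, s2))
}

# Same streets in a different order: the token sets are identical
same_streets <- token_set_sim("Parliament St / Aberdeen Ave", "Aberdeen Ave / Parliament St")
# Different cross street: the score drops below 1
diff_streets <- token_set_sim("Queen St W / James St", "Queen St W / John St")
c(same_streets = same_streets, diff_streets = diff_streets)
```

Because the tokens are sorted before comparison, the order of the cross streets no longer affects the score.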

In [36]:
stations_levitate_intersection <- NULL
for (i in 1:length(station_id_changed$Start.Station.Name)) {
    temp_matches <- c()
    for (j in 1:length(intersections$INTERSECTION_DESC)) {
        temp_matches[j] = lev_token_set_ratio(station_id_changed$Start.Station.Name[i],
                                              intersections$INTERSECTION_DESC[j])
        }
    temp_row <- tibble_row(Start.Station.Id = station_id_changed$Start.Station.Id[i],
                           Start.Station.Name = station_id_changed$Start.Station.Name[i],
                           INTERSECTION_DESC = intersections$INTERSECTION_DESC[which.max(temp_matches)],
                           INTERSECTION_ID = intersections$INTERSECTION_ID[which.max(temp_matches)],
                           geometry = intersections$geometry[which.max(temp_matches)],
                           Score = max(temp_matches))
    stations_levitate_intersection <- bind_rows(stations_levitate_intersection, temp_row)
}

In [37]:
stations_levitate_intersection

Start.Station.Id,Start.Station.Name,INTERSECTION_DESC,INTERSECTION_ID,geometry,Score
<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>
7113,Parliament St / Aberdeen Ave,Aberdeen Ave / Parliament St,13464279,"{'type': 'MultiPoint', 'coordinates': [[-79.3682787943632, 43.6650446564633]]}",1.0
7282,Adelaide St W / Bay St - SMART,Bay St / Adelaide St W,13466743,"{'type': 'MultiPoint', 'coordinates': [[-79.3808101970891, 43.6498927202986]]}",1.0
7013,Scott St / The Esplanade,The Esplanade / Scott St,13467327,"{'type': 'MultiPoint', 'coordinates': [[-79.3752644227303, 43.6463225154789]]}",1.0
7011,Wellington St W / Portland St,Portland St / Wellington St W,30114865,"{'type': 'MultiPoint', 'coordinates': [[-79.3994793772098, 43.6430140298378]]}",1.0
7275,Queen St W / James St,Queen St W / James St,13466379,"{'type': 'MultiPoint', 'coordinates': [[-79.3807162626028, 43.6521008898127]]}",1.0
7491,D'Arcy St / Spadina Ave - SMART,Spadina Ave / D'Arcy St / Glen Baillie Pl,13466135,"{'type': 'MultiPoint', 'coordinates': [[-79.398427359226, 43.6539024276164]]}",0.7777778
7382,Simcoe St / Adelaide St W,Adelaide St W / Simcoe St,13466978,"{'type': 'MultiPoint', 'coordinates': [[-79.3866213342558, 43.6486382517816]]}",1.0
7255,Stewart St / Bathurst St - SMART,Bathurst St / Stewart St,13467820,"{'type': 'MultiPoint', 'coordinates': [[-79.4023967156926, 43.6431940739398]]}",1.0
7544,Foster Pl / Elizabeth St - SMART,Foster Pl,13465978,"{'type': 'MultiPoint', 'coordinates': [[-79.3841080708464, 43.6547101361621]]}",1.0
7544,Foster Pl / Elizabeth St,Foster Pl,13465978,"{'type': 'MultiPoint', 'coordinates': [[-79.3841080708464, 43.6547101361621]]}",1.0


In [38]:
saveRDS(stations_levitate_intersection, "./Data/stations_levitate_intersection.RDS")

Other than being quite computationally inefficient (~1 minute per station), this nested for loop approach using token set matching in levitate provides much better matching between the ride data stations and the intersection data than the simple fuzzy match in fuzzyjoin. There may well be a way to vectorize the process, at least for checking a given station in the ride data against all intersections (in place of the inner for loop). For larger comparisons, the zoomerjoin package might well be worth considering as an alternative, though it is not quite as straightforward to install (it is compiled using Rust) and would be challenging to install in the conda environment used for this project. I already had to build levitate as a conda package to use it here.
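
One possible shortcut, not used in this analysis, is to sidestep scoring for the simple reordering cases: build an order-independent key for each name and look it up with an ordinary exact `match()`. The sketch below is base R only; `street_key()` is a hypothetical helper, and stripping the ` - SMART` suffix is an assumption about how the ride-data names should be normalised.

```r
# Build a key that ignores the order of the cross streets
street_key <- function(x) {
  x <- sub(" - SMART$", "", x)                   # drop the suffix seen in the ride data
  parts <- strsplit(x, " / ", fixed = TRUE)      # split into individual street names
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = " / "),
         character(1))
}

ride_names <- c("Parliament St / Aberdeen Ave",
                "Adelaide St W / Bay St - SMART")
intersection_names <- c("Bay St / Adelaide St W",
                        "Aberdeen Ave / Parliament St",
                        "Valermo Dr / Delta St")

# Exact lookup on the normalised keys; NA means no reordered match exists
idx <- match(street_key(ride_names), street_key(intersection_names))
data.frame(ride = ride_names, intersection = intersection_names[idx])
```

Names that differ by more than ordering, such as three-way intersections or spelling variants, would come back as `NA` and would still need the fuzzy scoring.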

Manual inspection shows fairly high overall match quality excepting situations where the station in the ride data is not at an intersection. Generally, matches with a score above 0.8 are correct. In a few cases, where there is an 'intersection' that contains only 1 street in the name in the file, that is chosen as a better match. Interestingly, in most of these cases, the fuzzyjoin matching performed better. An ensemble of these approaches, where stations not matched correctly by fuzzyjoin are passed to levitate, seems to be the best option.
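
One possible guard against those single-street entries, which is not part of the analysis here, is to restrict the candidates to descriptions that actually contain the ` / ` separator before scoring. A base R sketch using three descriptions that appear in the outputs above:

```r
candidates <- c("Foster Pl",
                "Foster Pl / Elizabeth St",
                "Bay St / Adelaide St W")

# Keep only descriptions that name at least two streets
is_crossing <- grepl(" / ", candidates, fixed = TRUE)
crossings <- candidates[is_crossing]
crossings
```

This would stop `Foster Pl` from outranking `Foster Pl / Elizabeth St`, at the cost of excluding any station that really does sit on a single-street entry.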

In [21]:
stations_fuzzjoin_found <- stations_fuzz_intersection %>% filter(!is.na(INTERSECTION_DESC))

In [22]:
stations_fuzzjoin_found

Start.Station.Id,Start.Station.Name,name,address,output,_id,INTERSECTION_ID,DATE_EFFECTIVE,DATE_EXPIRY,ELEVATION_ID,⋯,ELEVATION_LEVEL,ELEVATION,ELEVATION_UNIT,HEIGHT_RESTRICTION,HEIGHT_RESTRICTION_UNIT,STATE,TRANS_ID_CREATE,TRANS_ID_EXPIRE,OBJECTID,geometry
<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dttm>,<dttm>,<dbl>,⋯,<dbl>,<lgl>,<chr>,<lgl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
7275,Queen St W / James St,,,,43259,13466379,2024-04-12 15:42:23,3000-01-01 05:00:00,7152,⋯,,,,,,8,370649,-1,114450,"{'type': 'MultiPoint', 'coordinates': [[-79.3807162626028, 43.6521008898127]]}"
7544,Foster Pl / Elizabeth St,,,,40195,13466018,2023-08-17 12:53:58,3000-01-01 05:00:00,4542,⋯,,,,,,8,363175,-1,110618,"{'type': 'MultiPoint', 'coordinates': [[-79.3848220785899, 43.6545534061131]]}"
7293,College St / McCaul St,,,,18163,13465273,2008-12-12 04:22:46,3000-01-01 05:00:00,29937,⋯,,,,,,8,200000,-1,32168,"{'type': 'MultiPoint', 'coordinates': [[-79.3934845254722, 43.6591999315297]]}"
7653,Bloor St W / Indian Rd,,,,3501,13466046,2008-12-12 04:22:46,3000-01-01 05:00:00,12465,⋯,,,,,,8,200000,-1,5567,"{'type': 'MultiPoint', 'coordinates': [[-79.4570501089068, 43.6553527151862]]}"
7653,Bloor St W / Indian Rd,,,,38036,13466037,2022-03-28 12:58:19,3000-01-01 05:00:00,14404,⋯,,,,,,8,344050,-1,106326,"{'type': 'MultiPoint', 'coordinates': [[-79.4568442039919, 43.6553957541211]]}"
7049,Queen St W / Portland St,,,,42303,13467150,2024-01-31 12:56:38,3000-01-01 05:00:00,4235,⋯,,,,,,8,368160,-1,113245,"{'type': 'MultiPoint', 'coordinates': [[-79.4014174537963, 43.6477126767303]]}"
7578,Oak St / Parliament St,,,,22859,20235156,2008-12-12 04:22:46,3000-01-01 05:00:00,14181,⋯,,,,,,8,200000,-1,40384,"{'type': 'MultiPoint', 'coordinates': [[-79.3665329339236, 43.6607189905741]]}"
7472,Dundas St E / Victoria St,,,,9996,13465716,2008-12-12 04:22:46,3000-01-01 05:00:00,11374,⋯,,,,,,8,200000,-1,18404,"{'type': 'MultiPoint', 'coordinates': [[-79.3795762008715, 43.6562913281717]]}"
7407,University Ave / Queen St W,,,,22551,13466604,2008-12-12 04:22:46,3000-01-01 05:00:00,9968,⋯,,,,,,8,200000,-1,39861,"{'type': 'MultiPoint', 'coordinates': [[-79.386628635294, 43.6508528459786]]}"
7482,Danforth Ave / Sibley Ave,,,,9663,13459659,2008-12-12 04:22:46,3000-01-01 05:00:00,1800,⋯,,,,,,8,200000,-1,17645,"{'type': 'MultiPoint', 'coordinates': [[-79.2926856191101, 43.690223652078]]}"


There is one duplicate entry in the successful list of found stations, as it maps to both parts of a discontinuous intersection. This can be dropped to create a single location (they are only metres apart).

In [26]:
stations_fuzzjoin_found <- stations_fuzzjoin_found %>% filter(!row_number() == 5)
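
Dropping by row position works for this table but breaks silently if the row order changes. A less fragile alternative, sketched here in base R on a cut-down copy of the table above, keeps the first row for each station ID. In the tidyverse, `distinct(Start.Station.Id, .keep_all = TRUE)` should have the same effect.

```r
# Cut-down copy of the matched table (IDs taken from the output above)
found <- data.frame(
  Start.Station.Id = c(7275, 7544, 7293, 7653, 7653, 7049),
  INTERSECTION_ID  = c(13466379, 13466018, 13465273, 13466046, 13466037, 13467150)
)

# Keep only the first row seen for each station ID
found_unique <- found[!duplicated(found$Start.Station.Id), ]
found_unique
```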

In [25]:
stations_fuzzjoin_failed <- stations_fuzz_intersection %>% filter(is.na(INTERSECTION_DESC))

In [57]:
stations_levitate_intersection_2 <- NULL
for (i in 1:length(stations_fuzzjoin_failed$Start.Station.Name)) {
    temp_matches <- c()
    for (j in 1:length(intersections$INTERSECTION_DESC)) {
        temp_matches[j] = lev_token_set_ratio(stations_fuzzjoin_failed$Start.Station.Name[i],
                                              intersections$INTERSECTION_DESC[j])
        }
    temp_row <- tibble_row(Start.Station.Id = stations_fuzzjoin_failed$Start.Station.Id[i],
                           Start.Station.Name = stations_fuzzjoin_failed$Start.Station.Name[i],
                           INTERSECTION_DESC = intersections$INTERSECTION_DESC[which.max(temp_matches)],
                           INTERSECTION_ID = intersections$INTERSECTION_ID[which.max(temp_matches)],
                           geometry = intersections$geometry[which.max(temp_matches)],
                           Score = max(temp_matches))
    # Accumulate onto the _2 table (binding onto the first-run table would keep only the last row)
    stations_levitate_intersection_2 <- bind_rows(stations_levitate_intersection_2, temp_row)
}

In [58]:
stations_levitate_intersection_2

Start.Station.Id,Start.Station.Name,INTERSECTION_DESC,INTERSECTION_ID,geometry,Score
<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>
7113,Parliament St / Aberdeen Ave,Aberdeen Ave / Parliament St,13464279,"{'type': 'MultiPoint', 'coordinates': [[-79.3682787943632, 43.6650446564633]]}",1.0
7282,Adelaide St W / Bay St - SMART,Bay St / Adelaide St W,13466743,"{'type': 'MultiPoint', 'coordinates': [[-79.3808101970891, 43.6498927202986]]}",1.0
7013,Scott St / The Esplanade,The Esplanade / Scott St,13467327,"{'type': 'MultiPoint', 'coordinates': [[-79.3752644227303, 43.6463225154789]]}",1.0
7011,Wellington St W / Portland St,Portland St / Wellington St W,30114865,"{'type': 'MultiPoint', 'coordinates': [[-79.3994793772098, 43.6430140298378]]}",1.0
7491,D'Arcy St / Spadina Ave - SMART,Spadina Ave / D'Arcy St / Glen Baillie Pl,13466135,"{'type': 'MultiPoint', 'coordinates': [[-79.398427359226, 43.6539024276164]]}",0.7777778
7382,Simcoe St / Adelaide St W,Adelaide St W / Simcoe St,13466978,"{'type': 'MultiPoint', 'coordinates': [[-79.3866213342558, 43.6486382517816]]}",1.0
7255,Stewart St / Bathurst St - SMART,Bathurst St / Stewart St,13467820,"{'type': 'MultiPoint', 'coordinates': [[-79.4023967156926, 43.6431940739398]]}",1.0
7544,Foster Pl / Elizabeth St - SMART,Foster Pl,13465978,"{'type': 'MultiPoint', 'coordinates': [[-79.3841080708464, 43.6547101361621]]}",1.0
7017,Widmer St / Adelaide St W,Adelaide St W / Widmer St,13467149,"{'type': 'MultiPoint', 'coordinates': [[-79.3915645414971, 43.6475822548383]]}",1.0
7017,Widmer St / Adelaide St,Adelaide St W / Widmer St,13467149,"{'type': 'MultiPoint', 'coordinates': [[-79.3915645414971, 43.6475822548383]]}",1.0


In [59]:
saveRDS(stations_levitate_intersection_2, "./Data/stations_levitate_intersection_2.RDS")

In [26]:
stations_levitate_intersection <- readRDS("./Data/stations_levitate_intersection.RDS")

In [27]:
stations_levitate_intersection_2 <- readRDS("./Data/stations_levitate_intersection_2.RDS")

NOTE: This is a temporary test of using levitate's inbuilt lev_score_multiple() function to replace the inner for loop. It may be slightly faster based on a test of a single station, but the code is cleaner in some respects and dirtier in others. I have not run it yet...

In [None]:
stations_levitate_intersection_3 <- NULL
for (i in seq_along(stations_fuzzjoin_failed$Start.Station.Name)) {
    # Score this station name against every intersection description;
    # the first element of the result is taken as the best match
    temp_matches <- lev_score_multiple(stations_fuzzjoin_failed$Start.Station.Name[i],
                                       intersections$INTERSECTION_DESC,
                                       .fn = lev_token_set_ratio)
    best_desc <- names(temp_matches)[1]
    # match() returns a single row index even if a description is duplicated
    best_idx <- match(best_desc, intersections$INTERSECTION_DESC)
    temp_row <- tibble_row(Start.Station.Id = stations_fuzzjoin_failed$Start.Station.Id[i],
                           Start.Station.Name = stations_fuzzjoin_failed$Start.Station.Name[i],
                           INTERSECTION_DESC = best_desc,
                           INTERSECTION_ID = intersections$INTERSECTION_ID[best_idx],
                           geometry = intersections$geometry[best_idx],
                           Score = temp_matches[[1]])
    # Append to the table being built (_3), not to the earlier result
    stations_levitate_intersection_3 <- bind_rows(stations_levitate_intersection_3, temp_row)
}
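Whichever scoring function is used, a loop like the one above grows the result table one row at a time. A common alternative is to build one small table per station and bind them all in a single step. The following is a minimal sketch on toy data, not code from this analysis: `best_match_toy()` is a hypothetical stand-in that uses base R's `adist()` edit distance rather than levitate's token set ratio, and the station and intersection names are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for the levitate scoring step: returns the candidate
# with the smallest base-R edit distance (adist), not a token set ratio
best_match_toy <- function(name, candidates) {
  d <- drop(adist(name, candidates))
  tibble(Start.Station.Name = name,
         INTERSECTION_DESC  = candidates[which.min(d)],
         Distance           = min(d))
}

# Illustrative inputs only
station_names <- c("King St E / Ontario St", "Bay St / Adelaide St")
candidates    <- c("King St E / Ontario St", "Bay St / Adelaide St W", "Foster Pl")

# Build one single-row table per station, then bind them together once
matches_toy <- lapply(station_names, best_match_toy, candidates = candidates) %>%
  bind_rows()

matches_toy
```

Binding once at the end avoids repeatedly copying the accumulated table, which matters more as the number of stations grows.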

Some stations in the levitate-based fuzzy match against the intersection data have duplicate entries for the same ID (often one 'regular' and one SMART entry), and one of those entries has already been matched correctly using fuzzyjoin. These can be dropped before considering the remaining matches. Note that this filtering could instead be done before the levitate search to reduce overall computation time.

In [28]:
stations_levitate_filtered <- stations_levitate_intersection_2 %>%
    filter(!(Start.Station.Id %in% stations_fuzzjoin_found$Start.Station.Id))
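Filtering before the levitate search, rather than after it, could look something like the following. This is a minimal sketch on toy data, not code that was run in this analysis: `failed_toy` and `found_toy` are hypothetical stand-ins for `stations_fuzzjoin_failed` and `stations_fuzzjoin_found`, and the 9001/9002 IDs and their names are made up.

```r
library(dplyr)

# Toy stand-ins for the real tables (9001 and 9002 are made-up IDs)
failed_toy <- tibble(
  Start.Station.Id   = c(7113, 9001, 7282),
  Start.Station.Name = c("Parliament St / Aberdeen Ave",
                         "Hypothetical St / Example Ave - SMART",
                         "Adelaide St W / Bay St - SMART")
)
found_toy <- tibble(Start.Station.Id = c(9001, 9002))

# Keep only the failed stations whose ID was not already matched by fuzzyjoin,
# so the expensive fuzzy search runs on fewer names
to_search <- failed_toy %>%
  anti_join(found_toy, by = "Start.Station.Id")

to_search
```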

At this point, matches must be classified as correct or incorrect manually, because no score cutoff cleanly separates them: in two cases, a score below 0.8 is associated with a correct match. The best approach is to split the results into separate tables of correct and incorrect matches; the latter can then be manually edited or discarded, depending on the time cost.

In [29]:
# Row positions in stations_levitate_filtered judged incorrect on manual inspection
incorrect_match_indices <- c(14, 16:18, 22, 24, 26, 27, 31:33)
stations_levitate_correct <- stations_levitate_filtered %>%
    filter(!(row_number() %in% incorrect_match_indices))
stations_levitate_incorrect <- stations_levitate_filtered %>%
    filter(row_number() %in% incorrect_match_indices)
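To illustrate why a fixed score cutoff is not enough, here is a minimal sketch on a toy review table. The scores for IDs 7113 and 7491 appear in this notebook's output and both are treated as correct matches; the row for ID 9001 and the `correct` label column are hypothetical, added only to show how a cutoff can fail in both directions.

```r
library(dplyr)

# Toy review table: 9001 and the 'correct' column are hypothetical
review_toy <- tibble(
  Start.Station.Id = c(7113, 7491, 9001),
  Score            = c(1.0, 0.7777778, 1.0),
  correct          = c(TRUE, TRUE, FALSE)
)

# What a naive cutoff of 0.8 would keep
kept_by_cutoff <- review_toy %>% filter(Score >= 0.8)

# Correct matches the cutoff would lose, and incorrect ones it would keep
lost_correct   <- review_toy %>% filter(correct, Score < 0.8)
kept_incorrect <- kept_by_cutoff %>% filter(!correct)

lost_correct
kept_incorrect
```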

In [30]:
stations_levitate_correct

Start.Station.Id,Start.Station.Name,INTERSECTION_DESC,INTERSECTION_ID,geometry,Score
<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>
7113,Parliament St / Aberdeen Ave,Aberdeen Ave / Parliament St,13464279,"{'type': 'MultiPoint', 'coordinates': [[-79.3682787943632, 43.6650446564633]]}",1.0
7282,Adelaide St W / Bay St - SMART,Bay St / Adelaide St W,13466743,"{'type': 'MultiPoint', 'coordinates': [[-79.3808101970891, 43.6498927202986]]}",1.0
7013,Scott St / The Esplanade,The Esplanade / Scott St,13467327,"{'type': 'MultiPoint', 'coordinates': [[-79.3752644227303, 43.6463225154789]]}",1.0
7011,Wellington St W / Portland St,Portland St / Wellington St W,30114865,"{'type': 'MultiPoint', 'coordinates': [[-79.3994793772098, 43.6430140298378]]}",1.0
7491,D'Arcy St / Spadina Ave - SMART,Spadina Ave / D'Arcy St / Glen Baillie Pl,13466135,"{'type': 'MultiPoint', 'coordinates': [[-79.398427359226, 43.6539024276164]]}",0.7777778
7382,Simcoe St / Adelaide St W,Adelaide St W / Simcoe St,13466978,"{'type': 'MultiPoint', 'coordinates': [[-79.3866213342558, 43.6486382517816]]}",1.0
7255,Stewart St / Bathurst St - SMART,Bathurst St / Stewart St,13467820,"{'type': 'MultiPoint', 'coordinates': [[-79.4023967156926, 43.6431940739398]]}",1.0
7017,Widmer St / Adelaide St W,Adelaide St W / Widmer St,13467149,"{'type': 'MultiPoint', 'coordinates': [[-79.3915645414971, 43.6475822548383]]}",1.0
7017,Widmer St / Adelaide St,Adelaide St W / Widmer St,13467149,"{'type': 'MultiPoint', 'coordinates': [[-79.3915645414971, 43.6475822548383]]}",1.0
7509,Ontario St / King St E,King St E / Ontario St,13466404,"{'type': 'MultiPoint', 'coordinates': [[-79.366059144905, 43.6517548282251]]}",1.0


At this stage, a few entries share the same ID, often where one is a SMART station and the other is not. In every case both entries map to the same location, so one of each pair can safely be dropped to avoid duplicated stations.

In [32]:
stations_levitate_correct <- stations_levitate_correct %>% filter(!row_number() %in% c(9, 13, 16, 20))
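Hard-coded row numbers are fragile if the upstream table changes. As an alternative sketch (not what was run here), dplyr's `distinct()` can keep the first entry for each station ID. The toy table reuses the two Widmer St entries for ID 7017 and the entry for ID 7509 from the output above, and assumes that whichever duplicate comes first is acceptable because both map to the same intersection.

```r
library(dplyr)

# Toy table with one duplicated station ID (7017)
dupes_toy <- tibble(
  Start.Station.Id   = c(7017, 7017, 7509),
  Start.Station.Name = c("Widmer St / Adelaide St W",
                         "Widmer St / Adelaide St",
                         "Ontario St / King St E"),
  INTERSECTION_ID    = c(13467149, 13467149, 13466404)
)

# Keep the first row for each station ID, retaining all other columns
deduped <- dupes_toy %>%
  distinct(Start.Station.Id, .keep_all = TRUE)

deduped
```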

In [34]:
saveRDS(stations_fuzzjoin_found, "./Data/stations_fuzzjoin_found.rds")
saveRDS(stations_levitate_correct, "./Data/stations_levitate_correct.rds")
saveRDS(stations_levitate_incorrect, "./Data/stations_levitate_incorrect.rds")

At this point, the various station information tables can be merged for further geospatial mapping. This will be performed in a new script (using a conda environment built for geospatial analysis instead of fuzzy joining).