

# Segmenting and Clustering Neighborhoods in Toronto


## Scraping neighbourhood data from Wikipedia

We scrape the postal codes, boroughs, and neighbourhood names from [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

  

1.  Import modules & scrape Wikipedia page:



In [1]:
# First set IPython kernel such that we see all the outputs:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Modules
from bs4 import BeautifulSoup
from requests import get
import pandas as pd

# Import Wikipedia page
wikiurl='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = get(wikiurl)
soup = BeautifulSoup(response.text, 'lxml')

# soup

2.  Extract the table:



In [2]:
table = soup.find("table", { "class" : "wikitable sortable" })

table_rows = table.find_all('tr')
l = []
for tr in table_rows:
    td = tr.find_all('td')
    # use strip() to get rid of \n present at the end of each entry of the last column
    # (the header row has no <td> cells, so it yields an empty list)
    row = [cell.text.strip() for cell in td]
    l.append(row)

3.  Scrape header:



In [3]:
th = table.find_all('th')    
header = [word.text.strip() for word in th]

header

['Postcode', 'Borough', 'Neighborhood']

4.  Create the dataframe and rename the `Postcode` column to `PostalCode`:



In [4]:
df=pd.DataFrame(l, columns=header)
df.rename(columns={'Postcode' : 'PostalCode'}, inplace=True)
df.head(5)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
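
The empty row at index 0 comes from the table's header `<tr>`, which holds `<th>` cells only, so `find_all('td')` returns an empty list for it. As a minimal sketch of an alternative, such rows can be filtered out before the dataframe is built. The names `scraped_rows`, `data_rows` and `toy_df` are illustrative and the rows are a toy stand-in for the scraped list `l`:

```python
import pandas as pd

# Toy stand-in for the scraped list `l`; the first entry mimics the header row,
# which contains no <td> cells and therefore produces an empty list.
scraped_rows = [
    [],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]
header = ["Postcode", "Borough", "Neighborhood"]

# Keep only the rows that actually contain cells
data_rows = [row for row in scraped_rows if row]

toy_df = pd.DataFrame(data_rows, columns=header)
print(toy_df)
```

With this filter in place the later `dropna` call would have nothing left to remove, but keeping it does no harm.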


5.  Drop the empty row and the rows whose borough is 'Not assigned':



In [5]:
df.dropna(axis=0, inplace=True)
df = df[df.Borough != 'Not assigned']
df.reset_index(drop=True, inplace=True)

df.head(3)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |


6.  Replace 'Not assigned' neighbourhoods with the borough name of the same row:



In [6]:
# axis=1 to apply to each row
df['Neighborhood'] = df.apply(lambda row: row['Borough'] if row['Neighborhood'] == "Not assigned" else row['Neighborhood'], axis=1)
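
The row-wise `apply` above works, but the same replacement can probably be written without a Python-level loop. The following is a minimal sketch using `Series.mask` on a toy dataframe; `toy_df` and its two rows are illustrative only, not the scraped data:

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is 'Not assigned', take the borough of that row instead
toy_df["Neighborhood"] = toy_df["Neighborhood"].mask(
    toy_df["Neighborhood"] == "Not assigned", toy_df["Borough"]
)
print(toy_df)
```

`mask` replaces values where the condition is true and leaves the others untouched, so rows with a real neighbourhood name are unaffected.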

7.  Group by `PostalCode` and join neighbourhood names:



In [7]:
# df=df.groupby(['PostalCode','Borough']).agg(lambda x:', '.join(set(x)))
df = df.groupby('PostalCode', as_index=False).agg({'Borough' : 'first', 'Neighborhood': ', '.join})
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
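
To see what the aggregation does, here is a small sketch on three toy rows built from the values shown above. The names `toy_df` and `grouped` are illustrative; the `agg` call is the same as in the cell:

```python
import pandas as pd

# Three rows before grouping: M1B appears twice, once per neighbourhood
toy_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postal code: keep the first borough, join the neighbourhood names
grouped = toy_df.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
print(grouped)
```

Taking the `first` borough assumes every postal code belongs to a single borough, which appears to hold for this table.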


8.  Print the shape (rows, columns) of the dataframe:



In [8]:
df.shape

(103, 3)

  


## Obtaining the coordinates of each neighborhood

1.  Fetch the coordinates with `geopy.geocoders`:



In [9]:
from geopy.geocoders import Nominatim
import time
import numpy as np
import sys

lat = []
lng = []
timeout = 1
errormsg = ''
perc = 0

geolocator = Nominatim(user_agent="toronto_address")

# Iterate over each row
# this method is faster than .iterrows() & .itertuples()
for row in zip(df['Neighborhood'], df['Borough'], df['PostalCode']):
    # track progress
    perc += 1
    progress = "{} / {} done".format(perc, df.shape[0])
    sys.stdout.write("\r" + progress)
    sys.stdout.flush()
    
    # initialise variable to none
    location = None
    
    # Try each neighborhood of the row in turn until one can be geocoded
    for word in row[0].split(","):
        location = geolocator.geocode('{}, {}'.format(word, row[1]))
        if (location != None):
            lat.append(location.latitude)
            lng.append(location.longitude)
            # timeout for server
            time.sleep(timeout)
            break
        else:
            # timeout for server
            time.sleep(timeout)
            # try one more time
            location = geolocator.geocode('{}, {}'.format(word, row[1]))
            if (location != None):
                lat.append(location.latitude)
                lng.append(location.longitude)
                break
            # record the address if this was the last neighborhood to try
            elif (word == row[0].split(",")[-1]):
                lat.append(np.nan)
                lng.append(np.nan)
                errormsg += '\nUnsuccessful: {}, {}, {}'.format(word, row[1], row[2])

print(errormsg)

103 / 103 done
Unsuccessful: Studio District, East Toronto, M4M
Unsuccessful: North Toronto West, Central Toronto, M4R
Unsuccessful: Harbourfront, Downtown Toronto, M5A
Unsuccessful: St. James Town, Downtown Toronto, M5C
Unsuccessful: Berczy Park, Downtown Toronto, M5E
Unsuccessful:  Toronto Dominion Centre, Downtown Toronto, M5K
Unsuccessful:  University of Toronto, Downtown Toronto, M5S
Unsuccessful:  Kensington Market, Downtown Toronto, M5T
Unsuccessful: Stn A PO Boxes 25 The Esplanade, Downtown Toronto, M5W
Unsuccessful: Humewood-Cedarvale, York, M6C
Unsuccessful: Caledonia-Fairbanks, York, M6E
Unsuccessful: Canada Post Gateway Processing Centre, Mississauga, M7R
Unsuccessful: Business Reply Mail Processing Centre 969 Eastern, East Toronto, M7Y


Most of the failed lookups have a borough of the form "... Toronto" (for example "Downtown Toronto" or "East Toronto"). Let us simplify such borough names to "Toronto" and retry.
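
The lookup loop above retries each address once after a pause. As a rough sketch, that pattern could be factored into a small helper. `geocode_with_retry` and `fake_geocode` are hypothetical names introduced here, and the fake geocoder only stands in for `geolocator.geocode` so that the example runs without network access:

```python
import time

def geocode_with_retry(geocode, query, retries=1, pause=1.0):
    """Call geocode(query), pausing and retrying while it returns None.

    `geocode` is any callable that returns a location object or None.
    """
    for attempt in range(retries + 1):
        location = geocode(query)
        if location is not None:
            return location
        if attempt < retries:
            time.sleep(pause)  # give the server a break before the next try
    return None

# Hypothetical stand-in for a geocoder: fails on the first call, succeeds on the second.
# The returned tuple is a placeholder, not real geocoder output.
calls = []
def fake_geocode(query):
    calls.append(query)
    return None if len(calls) == 1 else (43.0, -79.0)

result = geocode_with_retry(fake_geocode, "Woburn, Scarborough", retries=1, pause=0.0)
print(result, len(calls))
```

Unlike the cell above, this helper does not pause after a successful call, so a caller would still need its own delay between addresses. geopy also ships a `RateLimiter` wrapper in `geopy.extra.rate_limiter` that may cover the same need.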

2.  Append coordinates to dataframe:



In [10]:
df['Latitude'] = lat
df['Longitude'] = lng
df.head(3)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.80493 | -79.165837 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.790117 | -79.173334 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.755225 | -79.198229 |


3.  Retry the failed addresses with simplified borough names (e.g. "Downtown Toronto" becomes "Toronto"):



In [11]:
# Positions of the NaN coordinates, as flat lists of integers
check_lat = np.flatnonzero(np.isnan(lat)).tolist()
check_lng = np.flatnonzero(np.isnan(lng)).tolist()
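
Since the coordinates are already stored in the dataframe at this point, the same indices could also be read from it directly. A minimal sketch on toy data, where `toy_df` and `missing_idx` are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy data: two of the three rows have no latitude yet
toy_df = pd.DataFrame({
    "PostalCode": ["M1B", "M5W", "M6C"],
    "Latitude": [43.80493, np.nan, np.nan],
})

# Index labels of the rows whose latitude is still missing
missing_idx = toy_df.index[toy_df["Latitude"].isna()].tolist()
print(missing_idx)
```

This returns index labels rather than positions. The two coincide here because the index was reset to 0, 1, 2, ... earlier.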

In [12]:
geolocator = Nominatim(user_agent="check_address")

findstr = ['Toronto', 'York', 'Mississauga']
perc = 0
errormsg = ''

for i in check_lat:
    # track progress
    perc += 1
    progress = "{} / {} done".format(perc, len(check_lat))
    sys.stdout.write("\r" + progress)
    sys.stdout.flush()
    # Go through neighborhoods in a borough
    for j in df['Neighborhood'][i].split(","):
        time.sleep(1)
        for k in findstr:
            if k in df['Borough'][i]:
                location = geolocator.geocode('{}, {}'.format(j, k))
                if (location != None):
                    df.at[i,'Latitude'] = location.latitude
                    df.at[i,'Longitude'] = location.longitude
                else:
                    errormsg += '\nUnsuccessful: {}, {}'.format(j, k)

print(errormsg)

13 / 13 done
Unsuccessful: Stn A PO Boxes 25 The Esplanade, Toronto
Unsuccessful: Humewood-Cedarvale, York
Unsuccessful: Caledonia-Fairbanks, York
Unsuccessful: Canada Post Gateway Processing Centre, Mississauga
Unsuccessful: Business Reply Mail Processing Centre 969 Eastern, Toronto


In [13]:
df[np.isnan(df['Latitude'])]

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 69 | M5W | Downtown Toronto | Stn A PO Boxes 25 The Esplanade | NaN | NaN |
| 73 | M6C | York | Humewood-Cedarvale | NaN | NaN |
| 74 | M6E | York | Caledonia-Fairbanks | NaN | NaN |
| 86 | M7R | Mississauga | Canada Post Gateway Processing Centre | NaN | NaN |
| 87 | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern | NaN | NaN |
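
The borough simplification in the retry loop is a substring test against the keywords in `findstr`. The sketch below isolates that test to make its behaviour visible; `simplified_names` is a hypothetical helper, not part of the notebook:

```python
def simplified_names(borough, keywords=("Toronto", "York", "Mississauga")):
    """Return every keyword that occurs as a substring of `borough`."""
    return [k for k in keywords if k in borough]

for borough in ["Downtown Toronto", "East Toronto", "York", "Mississauga", "North York"]:
    print(borough, "->", simplified_names(borough))
```

Because this is a substring match, "North York" would also be reduced to "York", which is a different district. None of the 13 failed rows in this run has that borough, so the result here is not affected, but it is worth keeping in mind if the list of failures changes.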


4.  Look up the remaining addresses by hand (searching Google for a street address where needed):



In [14]:
index = 69
location = geolocator.geocode('The Esplanade, Toronto')
if (location != None):
    df.at[index,'Latitude'] = location.latitude
    df.at[index,'Longitude'] = location.longitude
    print(location)

The Esplanade, St. Lawrence, Toronto Centre, Old Toronto, Toronto, Golden Horseshoe, Ontario, M5A 4M8, Canada


In [15]:
index = 73
location = geolocator.geocode('Humewood, York, Canada')
if (location != None):
    df.at[index,'Latitude'] = location.latitude
    df.at[index,'Longitude'] = location.longitude
    print(location)

Humewood, Toronto—St. Paul's, York, Toronto, Golden Horseshoe, Ontario, M6C 2X4, Canada


In [16]:
index = 74
location = geolocator.geocode('Caledonia, York, Canada')
if (location != None):
    df.at[index,'Latitude'] = location.latitude
    df.at[index,'Longitude'] = location.longitude
    print(location)

Caledonia, St. Clair Avenue West, Earlscourt, Davenport, Old Toronto, Toronto, Golden Horseshoe, Ontario, M6C 1C6, Canada


In [17]:
index = 86
# Street address found via Google for: Canada Post Gateway Processing Centre, Mississauga
location = geolocator.geocode('4567 Dixie Rd, Mississauga, Canada')
if (location != None):
    df.at[index,'Latitude'] = location.latitude
    df.at[index,'Longitude'] = location.longitude
    print(location)

Dixie Road, Orchard Heights, Lakeview, Mississauga, Peel Region, Golden Horseshoe, Ontario, L5E 1V4, Canada


In [18]:
index = 87
# Street address found via Google for: Business Reply Mail Processing Centre 969 Eastern, East Toronto
location = geolocator.geocode('969 Eastern Ave, Toronto, Canada')
if (location != None):
    df.at[index,'Latitude'] = location.latitude
    df.at[index,'Longitude'] = location.longitude
    print(location)

969, Eastern Avenue, Tiny Town, East York, Toronto—Danforth, Old Toronto, Toronto, Golden Horseshoe, Ontario, M4L 1E2, Canada
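
The five cells above repeat the same steps with a different index and query each time. As a sketch under stated assumptions, they could be driven from a single dictionary. `apply_manual_lookups`, `manual_queries` and `fake_geocode` are hypothetical names; the fake geocoder returns placeholder coordinates and stands in for `geolocator.geocode` so the example runs offline:

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

def apply_manual_lookups(frame, queries, geocode):
    """Fill Latitude/Longitude for each row index in `queries`.

    Returns the list of indices that could not be geocoded.
    """
    unresolved = []
    for idx, query in queries.items():
        location = geocode(query)
        if location is None:
            unresolved.append(idx)
            continue
        frame.at[idx, "Latitude"] = location.latitude
        frame.at[idx, "Longitude"] = location.longitude
    return unresolved

# Toy frame with two of the missing rows
toy_df = pd.DataFrame(
    {"Latitude": [np.nan, np.nan], "Longitude": [np.nan, np.nan]},
    index=[69, 73],
)
manual_queries = {69: "The Esplanade, Toronto", 73: "Humewood, York, Canada"}

def fake_geocode(query):
    # Placeholder coordinates, not real geocoder output
    known = {"The Esplanade, Toronto": SimpleNamespace(latitude=1.0, longitude=2.0)}
    return known.get(query)

unresolved = apply_manual_lookups(toy_df, manual_queries, fake_geocode)
print(unresolved)
print(toy_df)
```

With the real geocoder, a pause between calls would still be needed to respect the server's rate limit.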


5.  Check that we have all the coordinates:



In [19]:
df[np.isnan(df['Latitude'])]

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|


6.  Export dataframe as csv (just in case):



In [20]:
# df.to_csv('./toronto_neighbourhoods.csv')
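
When exporting, note that `to_csv` writes the row index as an unnamed first column by default, and `read_csv` then shows it as `Unnamed: 0`. A small sketch of a round trip that avoids this, using an in-memory buffer instead of a file (`toy_df`, `buffer` and `restored` are illustrative names):

```python
import io

import pandas as pd

toy_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
})

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Read it back from the start of the buffer
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

The same `index=False` argument applies when writing to a path such as the one in the commented-out line above.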

7.  Show (part of) the dataframe:

In [21]:
df.head(20)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.80493 | -79.165837 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.790117 | -79.173334 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.755225 | -79.198229 |
| 3 | M1G | Scarborough | Woburn | 43.759824 | -79.225291 |
| 4 | M1H | Scarborough | Cedarbrae | 43.756467 | -79.226692 |
| 5 | M1J | Scarborough | Scarborough Village | 43.743742 | -79.211632 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.714167 | -79.271109 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.708823 | -79.295986 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.721939 | -79.236232 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.702112 | -79.260091 |
