### Toronto Neighborhood Clustering

For Coursera Capstone week 3 Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

### Data Wrangling and Cleaning

The data for clustering Toronto's neighborhoods is sourced from Wikipedia. The dataset is indexed by postal code and needs to be scrubbed of unassigned codes (e.g., a large company like Amazon may secure a unique postal code for high-volume shipping).

In [1]:
import pandas as pd
import requests

wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Download the page and parse its first HTML table into a dataframe
html_content = requests.get(wiki_url).content
df = pd.read_html(html_content)[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [2]:
# Rename columns
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
# Drop rows whose borough is 'Not assigned'
unassigned_borough_indices = df[df['Borough'] == 'Not assigned'].index
df.drop(unassigned_borough_indices, inplace=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [4]:
# Examine dataframe for 'Not assigned' neighborhoods
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
7,M7A,Queen's Park,Not assigned


In [5]:
# Name 'Not assigned' neighborhoods after borough name
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']

# Expect empty mask after conditional replacement
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood


In [6]:
# Combine neighborhoods that share a postal code into one comma-separated string.
# as_index=False already yields a fresh 0..n-1 index, so no reset_index is needed.
formatted_df = df.groupby(['PostalCode', 'Borough'], as_index=False).agg({'Neighborhood': ', '.join})
formatted_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


In [7]:
formatted_df.shape

(103, 3)
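
The grouping step above can be tried on a tiny frame. This is a minimal sketch rather than part of the assignment: the name `toy` is introduced here for illustration, and its three rows are copied from the tables shown earlier (M6A covers two neighborhoods).

```python
import pandas as pd

# Toy rows copied from the tables above; M6A spans two neighborhoods.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor"],
})

# One row per (PostalCode, Borough); neighborhood names are joined with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)
       .agg({"Neighborhood": ", ".join})
)
print(grouped)
```

The two M6A rows collapse into a single row, which is why the grouped frame has fewer rows than the cleaned one.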

### Combine Wikipedia data with geocoder data

Because the Wikipedia dataset does not include coordinates for each postal code, we hydrate it with latitude and longitude from a geocoder so that we can query Foursquare data.

**Instead of following the instructions strictly,** I've elected to move forward with the original dataset, which has one row per neighborhood. If the goal is to eventually compare neighborhoods, we should use each neighborhood's own coordinates instead of the coordinates of a block of postal codes.

In [42]:
df.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,210,210,210
unique,103,11,207
top,M8Y,Etobicoke,Queen's Park
freq,8,44,2


In [47]:
# We expected all Neighborhood values to be unique
# In fact, to do the same analysis as NYC, we do not need postal code at all
df.drop("PostalCode", axis=1, inplace=True)
df.drop_duplicates(inplace=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 2 to 285
Data columns (total 2 columns):
Borough         210 non-null object
Neighborhood    210 non-null object
dtypes: object(2)
memory usage: 14.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 208 entries, 2 to 285
Data columns (total 2 columns):
Borough         208 non-null object
Neighborhood    208 non-null object
dtypes: object(2)
memory usage: 4.9+ KB
None
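
Before dropping duplicates it can help to look at which rows repeat. The sketch below uses a hypothetical sample frame (`toy`, with made-up row combinations), not the real data, and only illustrates the `duplicated` / `drop_duplicates` pair.

```python
import pandas as pd

# Hypothetical sample: one (Borough, Neighborhood) pair appears twice.
toy = pd.DataFrame({
    "Borough": ["Downtown Toronto", "Queen's Park", "Queen's Park", "North York"],
    "Neighborhood": ["Harbourfront", "Queen's Park", "Queen's Park", "Parkwoods"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]
deduped = toy.drop_duplicates()

print(dupes)
print(len(toy), "->", len(deduped))
```

Inspecting `dupes` first shows whether the repeated rows are true duplicates or distinct places that happen to share a name.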


In [29]:
# Hydrate dataset with geospatial data
# import sys
# !conda install --yes --prefix {sys.prefix} -c conda-forge geopandas
# !conda install --yes --prefix {sys.prefix} -c conda-forge geopy

In [38]:
from geopy.geocoders import Nominatim

locator = Nominatim(user_agent="toronto_geocoder")

# test locator with sample neighborhood name
location = locator.geocode("Rouge, Toronto, Ontario")
print(location)

Rouge, Scarborough—Rouge Park, Scarborough, Toronto, Ontario, M1B 2K5, Canada


In [84]:
from geopy.extra.rate_limiter import RateLimiter

geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# use .loc for column addition to avoid SettingWithCopyWarning
df.loc[:, 'Location'] = df['Neighborhood'].apply(lambda neigh: geocode("{}, Toronto, Ontario".format(neigh)) if neigh else None)
df.loc[:, 'Point'] = df['Location'].apply(lambda loc: tuple(loc.point) if loc else None)

# Lowercase column names match the output below and the later df['latitude'] lookup
df.loc[:, 'latitude'] = df['Point'].apply(lambda t: t[0] if t else None)
df.loc[:, 'longitude'] = df['Point'].apply(lambda t: t[1] if t else None)

df.drop(["Location", "Point"], axis=1, inplace=True)
df

Unnamed: 0,Borough,Neighborhood,latitude,longitude
2,North York,Parkwoods,43.758800,-79.320197
3,North York,Victoria Village,43.732658,-79.311189
4,Downtown Toronto,Harbourfront,43.640080,-79.380150
5,North York,Lawrence Heights,43.722778,-79.450933
6,North York,Lawrence Manor,43.722079,-79.437507
...,...,...,...,...
281,Etobicoke,Kingsway Park South West,43.650352,-79.500009
282,Etobicoke,Mimico NW,43.616677,-79.496805
283,Etobicoke,The Queensway West,43.623618,-79.514764
284,Etobicoke,Royal York South West,43.648183,-79.511296
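
The geocoder returns `None` for names it cannot resolve, so the coordinate extraction has to tolerate missing results. The sketch below shows that handling in isolation and runs without network access. `FakeLocation` and `to_lat_lon` are hypothetical names introduced here; the stand-in mimics the `.latitude` and `.longitude` attributes that geopy's `Location` objects expose.

```python
from collections import namedtuple

# Hypothetical stand-in for a geocoder result; geopy Location objects
# expose .latitude and .longitude in the same way.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def to_lat_lon(location):
    """Return (latitude, longitude), or (None, None) when geocoding failed."""
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


# One successful lookup and one failed lookup.
results = [FakeLocation(43.7588, -79.320197), None]
coords = [to_lat_lon(r) for r in results]
print(coords)
```

Rows that come back as `(None, None)` end up as nulls in the latitude and longitude columns.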


In [89]:
# Examine our new columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208 entries, 2 to 285
Data columns (total 4 columns):
Borough         208 non-null object
Neighborhood    208 non-null object
latitude        198 non-null float64
longitude       198 non-null float64
dtypes: float64(2), object(2)
memory usage: 8.1+ KB


In [93]:
df[df['latitude'].isnull()]

Unnamed: 0,Borough,Neighborhood,latitude,longitude
34,York,Humewood-Cedarvale,,
48,York,Caledonia-Fairbanks,,
94,North York,CFB Toronto,,
131,York,Del Ray,,
173,Mississauga,Canada Post Gateway Processing Centre,,
220,Downtown Toronto,Railway Lands,,
224,Etobicoke,Humber Bay Shores,,
228,Etobicoke,Beaumond Heights,,
239,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,,
264,East Toronto,Business Reply Mail Processing Centre 969 Eastern,,


In [97]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Borough,Neighborhood,latitude,longitude
0,North York,Parkwoods,43.758800,-79.320197
1,North York,Victoria Village,43.732658,-79.311189
2,Downtown Toronto,Harbourfront,43.640080,-79.380150
3,North York,Lawrence Heights,43.722778,-79.450933
4,North York,Lawrence Manor,43.722079,-79.437507
...,...,...,...,...
193,Etobicoke,Kingsway Park South West,43.650352,-79.500009
194,Etobicoke,Mimico NW,43.616677,-79.496805
195,Etobicoke,The Queensway West,43.623618,-79.514764
196,Etobicoke,Royal York South West,43.648183,-79.511296


In [102]:
# Etobicoke, Scarborough, York, East York, North York, and the City of Toronto
df.Borough.value_counts()

Etobicoke           42
Scarborough         37
North York          37
Downtown Toronto    33
Central Toronto     17
West Toronto        13
East Toronto         6
East York            6
York                 6
Queen's Park         1
Name: Borough, dtype: int64

In [None]:
# filter out city-only neighborhoods
# https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/
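
The filtering step is left unimplemented in the cell above. One possible reading, and this is an assumption rather than something the notebook states, is to keep only boroughs whose name contains "Toronto". A minimal sketch on a sample frame (`toy` and `city_only` are names introduced here):

```python
import pandas as pd

# Sample rows using borough and neighborhood names that appear above.
toy = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Railway Lands", "Westmount"],
})

# Assumed criterion: the borough name contains the literal text "Toronto".
mask = toy["Borough"].str.contains("Toronto", regex=False)
city_only = toy[mask].reset_index(drop=True)
print(city_only)
```

If the intended filter is instead based on the city's official neighbourhood profiles, the mask would need to be built from that list rather than from the borough name.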