# Segmenting and Clustering Neighbourhoods in Toronto

Peer-graded assignment from the Applied Data Science specialization on Coursera.

This is Filipe Brandenburger's entry notebook.

In [1]:
import numpy as np
import pandas as pd

### 1. List of Boroughs and Neighbourhoods

In [2]:
toronto_hoods_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
toronto_hoods_df = toronto_hoods_df[toronto_hoods_df["Borough"] != "Not assigned"]

### 2. Attach coordinates to Rows

In [8]:
toronto_geo_df = pd.read_csv("https://cocl.us/Geospatial_data")
# Merge the two tables on their shared "Postal Code" column.
toronto_hoods_df = toronto_hoods_df.merge(toronto_geo_df, on="Postal Code")
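
As a minimal sketch of this merge step, the toy frames below reuse two rows from this notebook's output. The names `hoods`, `geo`, `merged` and `missing` are illustrative only. A left join with `validate` is one possible way to check that every postal code received coordinates; it is not necessarily what the original run did.

```python
import pandas as pd

# Toy stand-ins for the two tables (values copied from this notebook's output).
hoods = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
})
geo = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Latitude": [43.65426, 43.662301],
    "Longitude": [-79.360636, -79.389494],
})

# Name the key explicitly; validate raises if a postal code is duplicated.
merged = hoods.merge(geo, on="Postal Code", how="left", validate="one_to_one")

# A left join keeps every neighbourhood row, so missing coordinates appear as NaN.
missing = merged[merged["Latitude"].isna()]
print(merged)
print(len(missing))
```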

### 3. Use k-means clustering on the Neighbourhoods

Let's fetch the venues around each neighbourhood from the Foursquare API, then apply k-means clustering to group similar neighbourhoods.

In [10]:
import requests
from matplotlib import cm
from matplotlib import colors
from sklearn.cluster import KMeans
import folium

Set the credentials for the Foursquare API (the real values are set in a hidden cell below the example one):

In [11]:
CLIENT_ID = 'your-client-id'          # your Foursquare ID
CLIENT_SECRET = 'your-client-secret'  # your Foursquare Secret
VERSION = '20180605'                  # Foursquare API version

In [12]:
# The code was removed by Watson Studio for sharing.

Restrict the search to Toronto proper, keeping only the neighbourhoods with "Toronto" somewhere in the Borough name.

In [13]:
toronto_hoods_df = toronto_hoods_df[toronto_hoods_df["Borough"].str.contains("Toronto")]
toronto_hoods_df

,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


Add a function to query the Foursquare API and pick up to 100 popular places within a 500 m radius of the centre of each neighbourhood. This function is adapted and simplified from the Manhattan example.

In [16]:
def get_nearby_venues(src_df, radius=500, limit=100):
    venue_list = []
    for i, row in src_df.iterrows():
        name = row["Neighborhood"]
        lat = row["Latitude"]
        lng = row["Longitude"]
        print(name)
        # build the API request; requests URL-encodes the query parameters
        params = {
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "v": VERSION,
            "ll": "{},{}".format(lat, lng),
            "radius": radius,
            "limit": limit,
        }
        response = requests.get("https://api.foursquare.com/v2/venues/explore", params=params)
        # fail early on HTTP errors instead of on a confusing KeyError below
        response.raise_for_status()
        # keep only relevant information for each nearby venue
        for v in response.json()["response"]["groups"][0]["items"]:
            venue_list.append((
                name,
                lat,
                lng,
                v["venue"]["name"],
                v["venue"]["location"]["lat"],
                v["venue"]["location"]["lng"],
                v["venue"]["categories"][0]["name"]))
    return pd.DataFrame(venue_list, columns=[
        "Neighborhood",
        "Neighborhood Latitude",
        "Neighborhood Longitude",
        "Venue",
        "Venue Latitude",
        "Venue Longitude",
        "Venue Category"])

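The function above indexes straight into `groups[0]` and `categories[0]`, so it raises if a response has no groups or a venue has no category. A hedged sketch of a more defensive parser follows. `extract_venues` and the `sample` payload are hypothetical: the payload only mimics the response shape that the function above relies on, and the second venue is invented to exercise the empty-category case.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical payload shaped like the fields used in this notebook.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]}]}}

rows = extract_venues(sample)
print(rows)
```

Returning an empty list for an empty response keeps the calling loop simple, at the cost of silently skipping neighbourhoods with no venues.
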
In [18]:
toronto_venues_df = get_nearby_venues(toronto_hoods_df)

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [19]:
toronto_venues_df

,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.654260,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.654260,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.654260,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,"Regent Park, Harbourfront",43.654260,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,"Regent Park, Harbourfront",43.654260,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
5,"Regent Park, Harbourfront",43.654260,-79.360636,Impact Kitchen,43.656369,-79.356980,Restaurant
6,"Regent Park, Harbourfront",43.654260,-79.360636,Corktown Common,43.655618,-79.356211,Park
7,"Regent Park, Harbourfront",43.654260,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
8,"Regent Park, Harbourfront",43.654260,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
9,"Regent Park, Harbourfront",43.654260,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
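
The table does not show how far each venue lies from the neighbourhood centre. As a rough check on the 500 m radius, the sketch below uses the haversine formula on the first row of the table. `haversine_m` is a hypothetical helper, the Earth radius is an approximation, and Foursquare may measure distance differently.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Neighbourhood centre and venue coordinates from row 0 (Roselle Desserts).
distance = haversine_m(43.654260, -79.360636, 43.653447, -79.362017)
print(distance <= 500)
```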


Some venues have a category of "Neighborhood", which is problematic since that clashes with the column name. Let's rename them to "Neighborhood Venue".

In [25]:
toronto_venues_df.loc[toronto_venues_df["Venue Category"] == "Neighborhood", "Venue Category"] = "Neighborhood Venue"

Now let's turn the categories into columns and find the ratio of each category per neighbourhood.

In [26]:
toronto_onehot_df = pd.get_dummies(
    toronto_venues_df.set_index("Neighborhood")["Venue Category"]
).groupby(
    "Neighborhood"
).mean()
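
To make the effect of this step concrete, here is a minimal sketch on toy data. The neighbourhood names `A` and `B` and their venues are invented for illustration, and `dtype=float` is passed only to keep the arithmetic explicit. Each row of the result is the share of a neighbourhood's venues that fall in each category, so every row sums to 1.

```python
import pandas as pd

# Toy venue list: neighbourhood "A" has four venues, "B" has two.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bakery", "Park", "Park"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues.set_index("Neighborhood")["Venue Category"], dtype=float)

# Averaging the indicators per neighbourhood gives category frequencies.
ratios = onehot.groupby("Neighborhood").mean()
print(ratios)
```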

In [27]:
toronto_onehot_df

Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",0.0,0.066667,0.066667,0.066667,0.066667,0.133333,0.066667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015873,0.0,0.015873,0.0,0.015873,0.0,0.0,0.015873
Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Church and Wellesley,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.025316
"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0
Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We're ready to apply k-means clustering here. Let's cluster the neighbourhoods into 5 clusters, same as we did with Manhattan in the Lab.

In [40]:
kclusters = 5
# Run k-means clustering.
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_onehot_df)
# Check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 3, 0,
       0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0], dtype=int32)
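
The label array is easier to read as a count per cluster. A small sketch, with the labels above copied by hand into a plain list (the names `labels` and `sizes` are illustrative only):

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 3, 0,
          0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0]

# Count how many neighbourhoods landed in each cluster.
sizes = Counter(labels)
for cluster in sorted(sizes):
    print(cluster, sizes[cluster])
```

With these labels, cluster 0 holds 33 of the 39 neighbourhoods, cluster 3 holds three, and clusters 1, 2 and 4 hold one each.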

In [41]:
# Insert the labels back into the initial dataframe.
toronto_hoods_df["Cluster Labels"] = kmeans.labels_
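
The assignment above is positional, so it assumes `toronto_hoods_df` and `toronto_onehot_df` list the neighbourhoods in the same order and that none were dropped for lack of venues. Since `groupby` sorts its result by group key by default, mapping the labels by neighbourhood name may be safer. A minimal sketch with toy data (`hoods`, `onehot_index` and the label values are hypothetical stand-ins, not results from this notebook):

```python
import pandas as pd

# Neighbourhoods in their original (unsorted) order.
hoods = pd.DataFrame({"Neighborhood": ["Regent Park", "Christie", "Berczy Park"]})

# groupby() returns rows sorted by name, so this order differs from hoods.
onehot_index = pd.Index(["Berczy Park", "Christie", "Regent Park"], name="Neighborhood")
labels = [2, 0, 1]  # hypothetical labels, one per row of the one-hot frame

# Attach each label to its neighbourhood name, then look it up by name.
label_by_name = pd.Series(labels, index=onehot_index)
hoods["Cluster Labels"] = hoods["Neighborhood"].map(label_by_name)
print(hoods)
```

A neighbourhood missing from the one-hot frame would get `NaN` here rather than a shifted label, which makes the gap visible.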

Let's now plot the neighbourhoods on a map, coloured by cluster.

Let's centre the map on Harbourfront in Downtown Toronto.

In [45]:
latitude = toronto_hoods_df.loc[2, "Latitude"]
longitude = toronto_hoods_df.loc[2, "Longitude"]
clusters_map = folium.Map(location=[latitude, longitude], zoom_start=12)
# one colour per cluster, evenly spaced along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
for _, row in toronto_hoods_df.iterrows():
    cluster_number = row["Cluster Labels"]
    folium.CircleMarker(
        [row["Latitude"], row["Longitude"]],
        radius=5,
        popup=folium.Popup("{} Cluster {}".format(row["Neighborhood"], cluster_number)),
        color=rainbow[cluster_number],
        fill=True,
        fill_color=rainbow[cluster_number],
        fill_opacity=0.7
    ).add_to(clusters_map)

In [55]:
clusters_map

Interesting results!

Most of the city falls into a single cluster (cluster 0 holds 33 of the 39 neighbourhoods), at least for this small number of clusters.

The red dot (cluster 4) is the University of Toronto, which is unsurprisingly dominated by venues related to the University itself.

The lone light blue dot (cluster 1) is Davisville, where a large share of the venues are restaurants, coffee shops and dessert shops. It's unclear whether this neighbourhood is genuinely unusual among those around it, or whether this is an artifact of k-means picking it as a centroid, since it has no other neighbourhoods to its east.

The lone light green dot at the top (cluster 2) is Lawrence Park, which has only a park, a swim school and a bus line listed as venues, so it's likely to stand out.

Overall, this was a pretty interesting exercise!

In [56]:
toronto_venues_df[toronto_venues_df["Neighborhood"] == "Davisville"]

,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1074,Davisville,43.704324,-79.38879,Jules Cafe Patisserie,43.704138,-79.388413,Dessert Shop
1075,Davisville,43.704324,-79.38879,Thobors Boulangerie Patisserie Café,43.704514,-79.388616,Café
1076,Davisville,43.704324,-79.38879,Marigold Indian Bistro,43.702881,-79.388008,Indian Restaurant
1077,Davisville,43.704324,-79.38879,XO Gelato,43.705177,-79.388793,Dessert Shop
1078,Davisville,43.704324,-79.38879,Viva Napoli,43.705752,-79.389125,Pizza Place
1079,Davisville,43.704324,-79.38879,Zee Grill,43.704985,-79.388476,Seafood Restaurant
1080,Davisville,43.704324,-79.38879,Starbucks,43.705923,-79.389548,Coffee Shop
1081,Davisville,43.704324,-79.38879,Sakae Sushi,43.704944,-79.388704,Sushi Restaurant
1082,Davisville,43.704324,-79.38879,Florentia Ristorante,43.703594,-79.387985,Italian Restaurant
1083,Davisville,43.704324,-79.38879,Positano,43.704558,-79.388639,Italian Restaurant
