# Segmenting and Clustering Neighborhoods in Toronto

##### In this notebook, we will explore, segment, and cluster the neighborhoods in the city of Toronto based on the postalcode and borough information. 

We will obtain the neighbourhood information from this <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia page</a>, which we will scrape using the Beautiful Soup library.

In [75]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from geopy.geocoders import Nominatim 
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser") # Get Wikipedia page in HTML
neigh_table = soup.find("table") # Find the tаble we want

df_neigh = pd.read_html(str(neigh_table))[0] # Encode table as pandas dataframe

df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


We will now clean the data, getting rid of the rows that have "Not assigned" as their burough.

In [3]:
df_neigh.drop(df_neigh[df_neigh['Borough'] == "Not assigned"].index, inplace = True)
df_neigh.reset_index(drop = True, inplace = True)

df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
df_neigh.shape # Find shape of the dataframe

(103, 3)

We now have a table of neighbourhoods of Toronto with 103 entries that encodes the burough and neighbourhood information for each postal code.

Next, we want to find the latitude and longitude of each postal code. We will use Geocoder to do this.

In [6]:
# Initialize latitude and longitude columns with None
df_neigh["Latitude"] = [None]*103
df_neigh["Longitude"] = [None]*103
df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",,


test_coord = None

while(test_coord is None):
        g = geocoder.google('M3A, Toronto, Ontario')
        test_coord = g.latlng
        print(g)
        
print(test_coord)

In [7]:
g = geocoder.google('M3A, Toronto, Ontario')
print(g)

<[REQUEST_DENIED] Google - Geocode [empty]>


Since Geocoder is not working we will import the data from a csv file.

In [35]:
coordinates = pd.read_csv("Geospatial_Coordinates.csv")

coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [36]:
for post_code in df_neigh['Postal Code']:
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Latitude'] = coordinates.loc[coordinates['Postal Code'] == post_code, 'Latitude'].iloc[0]
    df_neigh.loc[df_neigh['Postal Code'] == post_code, 'Longitude'] = coordinates.loc[coordinates['Postal Code'] == post_code, 'Longitude'].iloc[0]
    
df_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895


### Clustering the neighbouthoods

First we reduce the dataframe to include only the boroughs in Toronto.

In [38]:
# List of unique boroughs in Toronto
toronto_hoods = np.unique(np.array([x for x in df_neigh['Borough'].values if 'Toronto' in x]))

# Boolean list of boroughs in the dataframe that are in Toronto
temp_bool = []

for x in df_neigh['Borough'].values:
    if x in toronto_hoods: temp_bool.append(True)
    else: temp_bool.append(False)

# New dataframe including only boroughs in Toronto
toronto_df = df_neigh[temp_bool]
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3789
15,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754
19,M4E,East Toronto,The Beaches,43.6764,-79.293


Let's create a map of Toronto that marks the neighbourhoods to check that everything looks fine.

In [41]:
# Obtain geographical coordinates of Toronto
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print("The coordinates of Toronto are: ", latitude, longitude)

The coordinates of Toronto are:  43.6534817 -79.3839347


In [51]:
# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

We now want to analyse the neighbourhoods. We will use Foursquare to obtain data on the top 10 venues in each neighbourhood and then cluster the neighbourhoods based on this information.

We define a function to get the venues.

In [53]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """This function returns the venues of the list of neighbouthoods names within radius of latitude and longitude, 
    where latitudes and longitudes are the coordinates of the neighbouthoods."""
    
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [54]:
# Run the venue function and encode the results in a dataframe
toronto_venues = getNearbyVenues(toronto_df['Neighbourhood'], toronto_df['Latitude'], toronto_df['Longitude'])
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


We will group venues by category so we can compare each neighbourhood.

In [61]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighbourhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighbourhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Group by neighbourhood and average to get the percentage of venues in each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018519,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.076923,0.076923,0.076923,0.153846,0.153846,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.016393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.016393


Finallly, we use k-means to cluster the neighbouthoods into 5 clusters.

In [73]:
# Set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# Add column of clusters into neighbouthoods dataframe
toronto_df.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_df.head()

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
4,3,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895
9,3,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3789
15,3,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754
19,3,M4E,East Toronto,The Beaches,43.6764,-79.293


Finally, we create a map with the clusters colour-coded.

In [80]:
# Create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighbourhood'], toronto_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Before we visualise the list of neighbourhoods in each cluster, we are going to create a dataframe that includes the top 10 venues in each neighbourhood so we can more easily compare the clusters. First we write a function that sorts the venues in descending order.

In [81]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [83]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neigh_venues_sorted = pd.DataFrame(columns=columns)
neigh_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neigh_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neigh_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
    
neigh_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,3,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Restaurant,Bakery,Cheese Shop,Farmers Market,Beer Bar,Concert Hall,Bistro
1,3,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Bakery,Grocery Store,Stadium,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue
2,3,"Business reply mail Processing Centre, South C...",Yoga Studio,Auto Workshop,Park,Pizza Place,Comic Shop,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park
3,3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Coffee Shop,Rental Car Location,Plane,Sculpture Garden,Harbor / Marina,Airport Lounge,Airport Food Court,Airport
4,3,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Burger Joint,Salad Place,Bubble Tea Shop,Portuguese Restaurant,Ramen Restaurant,Poke Place


## Clusters

### Cluster 1

In [91]:
neigh_venues_sorted.loc[neigh_venues_sorted['Cluster Labels'] == 0, neigh_venues_sorted.columns[[1] + list(range(5, neigh_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,Wine Bar,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
35,The Beaches,Wine Bar,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


### Cluster 2

In [92]:
neigh_venues_sorted.loc[neigh_venues_sorted['Cluster Labels'] == 1, neigh_venues_sorted.columns[[1] + list(range(5, neigh_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,"Moore Park, Summerhill East",Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


### Cluster 3

In [93]:
neigh_venues_sorted.loc[neigh_venues_sorted['Cluster Labels'] == 2, neigh_venues_sorted.columns[[1] + list(range(5, neigh_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,"Forest Hill North & West, Forest Hill Road Park",Jewelry Store,Wine Bar,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
26,Rosedale,Wine Bar,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


### Cluster 4

In [94]:
neigh_venues_sorted.loc[neigh_venues_sorted['Cluster Labels'] == 3, neigh_venues_sorted.columns[[1] + list(range(5, neigh_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Restaurant,Bakery,Cheese Shop,Farmers Market,Beer Bar,Concert Hall,Bistro
1,"Brockton, Parkdale Village, Exhibition Place",Bakery,Grocery Store,Stadium,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue
2,"Business reply mail Processing Centre, South C...",Pizza Place,Comic Shop,Recording Studio,Restaurant,Burrito Place,Brewery,Skate Park
3,"CN Tower, King and Spadina, Railway Lands, Har...",Rental Car Location,Plane,Sculpture Garden,Harbor / Marina,Airport Lounge,Airport Food Court,Airport
4,Central Bay Street,Café,Burger Joint,Salad Place,Bubble Tea Shop,Portuguese Restaurant,Ramen Restaurant,Poke Place
5,Christie,Nightclub,Athletics & Sports,Candy Store,Italian Restaurant,Restaurant,Baby Store,Coffee Shop
6,Church and Wellesley,Restaurant,Gay Bar,Café,Hotel,Yoga Studio,Mediterranean Restaurant,Men's Store
7,"Commerce Court, Victoria Hotel",Café,American Restaurant,Gym,Italian Restaurant,Deli / Bodega,Japanese Restaurant,Cocktail Bar
8,Davisville,Coffee Shop,Sushi Restaurant,Italian Restaurant,Gym,Café,Seafood Restaurant,Brewery
9,Davisville North,Electronics Store,Gym / Fitness Center,Breakfast Spot,Department Store,Food & Drink Shop,Sandwich Place,Discount Store


### Cluster 5

In [95]:
neigh_venues_sorted.loc[neigh_venues_sorted['Cluster Labels'] == 4, neigh_venues_sorted.columns[[1] + list(range(5, neigh_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Roselawn,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
