# Segmenting and Clustering Neighbourhoods in the city of Toronto, Canada
#### Applied Data Science Capstone - Week 3 assignment

## Part 1 - Obtaining the list of neighbourhoods
In this part we will use the BeautifulSoup package as well as the html-parsing capabilities of the pandas package to load the list of postcodes corresponding to the Toronto area from Wikipedia into a dataframe.

In [1]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

In [2]:
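#Finding the table in the Wikipedia page using BeautifulSoup
wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
wiki_soup = BeautifulSoup(wiki_page.text, 'html.parser') #name the parser explicitly to avoid a warning
html_table = wiki_soup.find(class_='wikitable sortable').prettify()

#Converting the html table to a pandas dataframe
#(the string is wrapped in StringIO because recent pandas versions deprecate passing literal HTML)
from io import StringIO
df_list = pd.read_html(StringIO(html_table))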
Toronto_hoods_df = df_list[0]
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
#dropping unassigned Boroughs
Toronto_hoods_df.drop(Toronto_hoods_df[Toronto_hoods_df['Borough']=='Not assigned'].index, inplace=True)
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
Toronto_hoods_df[Toronto_hoods_df['Neighbourhood']=='Not assigned'].index #No unassigned Neighbourhoods

Int64Index([], dtype='int64')

In [5]:
[g for _, g in Toronto_hoods_df.groupby('Postal Code') if len(g) > 1] #No duplicate postal codes

[]

In [6]:
#Clean up the index
Toronto_hoods_df.reset_index(drop=True, inplace=True)
Toronto_hoods_df.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
Toronto_hoods_df.shape

(103, 3)

We now have a clean dataframe with one line for each assigned postal code and a list of corresponding neighbourhoods.
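
The cleaning steps above can be sketched on a small toy frame. This is only an illustration: the four rows are copied from the table head shown earlier, and the names `toy`, `clean`, `has_duplicates` and `unassigned_hoods` are made up for this example. It also uses `duplicated()` as one possible alternative to the `groupby` check for repeated postal codes.

```python
import pandas as pd

# Toy data: four rows taken from the head of the scraped table
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village',
                      'Regent Park, Harbourfront'],
})

# Keep only rows with an assigned borough and rebuild the index
clean = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# True if any postal code appears more than once
has_duplicates = clean['Postal Code'].duplicated().any()

# Number of rows whose neighbourhood is still unassigned
unassigned_hoods = (clean['Neighbourhood'] == 'Not assigned').sum()

print(clean.shape, has_duplicates, unassigned_hoods)
```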

## Part 2 - Obtaining the coordinates of the neighbourhoods

We will use geopy as a client for the ArcGIS geocoding API to obtain the latitude and longitude of each postal code in our data frame. The ArcGIS API is easy to use (it does not need extra credentials for our purposes) and was able to locate all the postal codes on the list, unlike, for example, the OpenStreetMap Nominatim API. It does seem less stable, however, as the connection errors in the output below show.

Geopy also offers a very helpful wrapper class, `RateLimiter`, for batch processing. The wrapper catches errors raised while connecting to the API and retries the same query (up to `max_retries` times), waiting at least `min_delay_seconds` seconds between queries.
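
The retry idea can be illustrated with a small pure-Python sketch. It is not geopy's implementation: `with_retries`, `flaky_geocode` and the choice to return `None` once all attempts are used up are assumptions made for illustration only.

```python
import time

def with_retries(func, min_delay_seconds=0.0, max_retries=2, sleep=time.sleep):
    """Illustrative wrapper (hypothetical, not geopy's RateLimiter)."""
    last_call = [None]  # time of the previous call, shared between invocations

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            # keep at least min_delay_seconds between two consecutive calls
            if last_call[0] is not None:
                remaining = min_delay_seconds - (time.monotonic() - last_call[0])
                if remaining > 0:
                    sleep(remaining)
            last_call[0] = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                # swallow the error and try again while attempts remain
                continue
        return None  # assumption: signal failure with None after the last attempt

    return wrapper

attempts = []

def flaky_geocode(query):
    """Fake geocoder that fails twice before answering (no network involved)."""
    attempts.append(query)
    if len(attempts) < 3:
        raise ConnectionError('simulated timeout')
    return (43.66253, -79.39188)

geocode_demo = with_retries(flaky_geocode, min_delay_seconds=0, max_retries=5)
result = geocode_demo('M7A, Toronto, Ontario, Canada')
print(result, len(attempts))
```

Returning `None` on failure is what makes a later null check on the results column meaningful in a sketch like this one.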

In [8]:
from geopy.geocoders import ArcGIS #gets latitude and longitude from an address
from geopy.extra.rate_limiter import RateLimiter #helper class for batch processing

In [9]:
#initialise the geolocator and the rate limiter
geolocator = ArcGIS(user_agent='toronto_explorer')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=5, max_retries=5)

In [10]:
#get the location objects from the geolocator
Toronto_hoods_df['location'] = Toronto_hoods_df['Postal Code'].apply(
    lambda postal_code: geocode('{}, Toronto, Ontario, Canada'.format(postal_code)))

RateLimiter caught an error, retrying (0/5 tries). Called with (*('M7A, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "C:\Users\Christian\Anaconda3\lib\site-pac

RateLimiter caught an error, retrying (0/5 tries). Called with (*('M1B, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (2/5 tries). Called with (*('M1B, Toronto, Ontario, Canada',), **{}).

RateLimiter caught an error, retrying (4/5 tries). Called with (*('M1B, Toronto, Ontario, Canada',), **{}).
Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "C:\Users\Christian\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Christian\Anaconda3\l

RateLimiter caught an error, retrying (0/5 tries). Called with (*('M6B, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M3C, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M4G, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (2/5 tries). Called with (*('M4G, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M6L, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M9L, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (1/5 tries). Called with (*('M4M, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M6P, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (0/5 tries). Called with (*('M1X, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (2/5 tries). Called with (*('M1X, Toronto, Ontario, Canada',), **{}).
RateLimiter caught an error, retrying (1/5 tries). Called with (*('M8Z, Toronto, Ontario, Canada',), **{}).

In [11]:
#check for rows where the geolocation failed
failed_rows = Toronto_hoods_df[Toronto_hoods_df['location'].isnull()].index

In [12]:
#retry only the failed rows
Toronto_hoods_df.loc[failed_rows, 'location'] = Toronto_hoods_df.loc[failed_rows, 'Postal Code'].apply(
    lambda postal_code: geocode('{}, Toronto, Ontario, Canada'.format(postal_code)))

In [13]:
#make sure all rows have a location object
Toronto_hoods_df[Toronto_hoods_df['location'].isnull()] #should be empty, if not run two previous cells again

Unnamed: 0,Postal Code,Borough,Neighbourhood,location
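
Instead of re-running the two previous cells by hand, the "retry only the failed rows" step could be wrapped in a loop. The sketch below shows the idea with plain Python objects and a fake geocoder, so it runs without network access. `geocode_in_rounds`, `fake_geocode` and the placeholder coordinates are hypothetical names and values, not part of the notebook or of geopy.

```python
def geocode_in_rounds(postal_codes, geocode_func, max_rounds=3):
    """Retry only the codes that still have no location (hypothetical helper)."""
    locations = {code: None for code in postal_codes}
    for _ in range(max_rounds):
        failed = [code for code, loc in locations.items() if loc is None]
        if not failed:
            break  # every code has a location, nothing left to retry
        for code in failed:
            locations[code] = geocode_func(code)
    return locations

seen = set()

def fake_geocode(code):
    """Stand-in geocoder: fails the first time it sees 'M1B', then succeeds."""
    if code == 'M1B' and code not in seen:
        seen.add(code)
        return None
    return (0.0, 0.0)  # placeholder coordinates, not real data

locations = geocode_in_rounds(['M3A', 'M1B'], fake_geocode)
still_missing = [code for code, loc in locations.items() if loc is None]
print(still_missing)
```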


Now that we have the location objects for each neighbourhood, let's extract the latitude and longitude, and then drop the location column to format the table as required.

In [14]:
#extract latitude and longitude from location object
Toronto_hoods_df['Latitude'] = Toronto_hoods_df['location'].apply(
    lambda location: location.latitude)
Toronto_hoods_df['Longitude'] = Toronto_hoods_df['location'].apply(
    lambda location: location.longitude)

In [15]:
#drop the location objects
Toronto_hoods_df.drop('location', axis='columns', inplace=True)

In [16]:
Toronto_hoods_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


We have now added the latitude and longitude values for each of the neighbourhoods in our dataframe, and are ready for further processing.

## Part 3 - Explore and cluster the neighbourhoods of Toronto

In [17]:
#Library to create maps
import folium

#Class for the k-means clustering
from sklearn.cluster import KMeans

#To generate colors for the clusters
from matplotlib import cm
from matplotlib.colors import rgb2hex

In [18]:
#Find the center of our map
location = geocode('Toronto, Ontario')
latitude = location.latitude
longitude = location.longitude

In [19]:
#Create the map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

#Helper function that we can apply to our dataframe to draw markers for each neighbourhood
def add_marker(row, target_map):
    label = '{}, {}'.format(row['Neighbourhood'], row['Borough'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        popup=label,
        radius=5,
        fill=True,
        fill_opacity=0.7,
    ).add_to(target_map)
    
Toronto_hoods_df.apply(add_marker, axis=1, args=(map_toronto,))

map_toronto

In [20]:
#Limit ourselves to boroughs which have 'Toronto' in their name
Central_Toronto_df = Toronto_hoods_df[Toronto_hoods_df['Borough'].str.contains('Toronto')]

In [21]:
map_toronto_center = folium.Map(location=[latitude, longitude], zoom_start=12)

Central_Toronto_df.apply(add_marker, axis=1, args=(map_toronto_center,)) #reusing the helper function from above

map_toronto_center

In [22]:
#'Foursquare_identifier' is a text file with the client ID on the first line and the client secret on the second line
with open('Foursquare_identifier') as ID:
    #Read the identifiers, stripping the newline and any surrounding whitespace
    #(slicing with [:-1] would cut the last character if the file has no final newline)
    CLIENT_ID = ID.readline().strip()
    CLIENT_SECRET = ID.readline().strip()
VERSION = '20201027'
LIMIT = 100

In [23]:
#Function to get the venues close to a point, recycled from the lab session
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
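
As a sketch of an alternative way to assemble the request URL used in `getNearbyVenues`, the query string can be built from a dictionary with the standard library's `urlencode`, which also percent-encodes the values. The helper name `build_explore_url` and the placeholder credentials `MY_ID` and `MY_SECRET` are hypothetical. The endpoint and parameter names are the ones from the function above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as one value
        'radius': radius,
        'limit': limit,
    }
    # urlencode joins the pairs with '&' and escapes special characters
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials; the coordinates are those of M5A from the table above
url = build_explore_url('MY_ID', 'MY_SECRET', '20201027', 43.65512, -79.36264)
print(url)
```

Keeping the parameters in a dictionary makes it harder to mix up their order than with a long positional `format` call.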

In [24]:
Central_Toronto_venues = getNearbyVenues(Central_Toronto_df['Neighbourhood'],
                                         Central_Toronto_df['Latitude'],
                                         Central_Toronto_df['Longitude'])

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [25]:
print(Central_Toronto_venues.shape)
Central_Toronto_venues.head()

(1739, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65512,-79.36264,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


In [26]:
Central_Toronto_venues.groupby('Venue Category').count()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,1,1,1,1,1,1
Afghan Restaurant,1,1,1,1,1,1
American Restaurant,26,26,26,26,26,26
Antique Shop,2,2,2,2,2,2
Aquarium,2,2,2,2,2,2
...,...,...,...,...,...,...
Wine Bar,11,11,11,11,11,11
Wine Shop,1,1,1,1,1,1
Wings Joint,1,1,1,1,1,1
Women's Store,1,1,1,1,1,1


In [42]:
Central_Toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,60,60,60,60,60,60
"Brockton, Parkdale Village, Exhibition Place",85,85,85,85,85,85
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",100,100,100,100,100,100
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",77,77,77,77,77,77
Central Bay Street,76,76,76,76,76,76
Christie,11,11,11,11,11,11
Church and Wellesley,79,79,79,79,79,79
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,26,26,26,26,26,26
Davisville North,8,8,8,8,8,8


We found 1739 venues in 229 different categories. However, some neighbourhoods only have a few venues, which is not enough to properly cluster them. We will therefore exclude neighbourhoods with fewer than 4 venues.

In [50]:
min_venues = 4
temp = Central_Toronto_venues.groupby('Neighborhood').count()
hoods_to_exclude = temp[temp['Venue'] < min_venues].index
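
To see what this selection does, here is a minimal sketch on made-up data. The neighbourhood names `A`, `B`, `C` and the venue names are invented for illustration. Only `B`, with two venues, falls below the threshold.

```python
import pandas as pd

# Invented data: A has 5 venues, B has 2, C has 4
toy_venues = pd.DataFrame({
    'Neighborhood': ['A'] * 5 + ['B'] * 2 + ['C'] * 4,
    'Venue': ['venue {}'.format(i) for i in range(11)],
})

min_venues = 4
# Number of venues per neighbourhood
venue_counts = toy_venues.groupby('Neighborhood')['Venue'].count()
# Index of the neighbourhoods strictly below the threshold
to_exclude = venue_counts[venue_counts < min_venues].index

print(list(to_exclude))
```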

We will now find the frequency of each type of venue in each neighbourhood by creating one-hot dummy variables and then averaging them over each neighbourhood.

In [27]:
Central_Toronto_onehot = pd.get_dummies(Central_Toronto_venues['Venue Category'])
Central_Toronto_onehot.insert(0, 'Neighbourhood', Central_Toronto_venues['Neighborhood'])
print(Central_Toronto_onehot.shape)
Central_Toronto_onehot.head()

(1739, 230)


Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
Central_Toronto_freq = Central_Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
Central_Toronto_freq.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,0.0,...,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667
1,"Brockton, Parkdale Village, Exhibition Place",0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.023529,0.011765,...,0.0,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.03,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,...,0.012987,0.0,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.012987
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.013158,0.013158,0.0,0.0,...,0.0,0.0,0.0,0.013158,0.013158,0.013158,0.0,0.0,0.0,0.0
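
Why the mean of one-hot columns gives a frequency can be checked on a tiny made-up example. The neighbourhoods `A` and `B` and their venue categories are invented for illustration. Each one-hot column is 1 for venues of that category and 0 otherwise, so its mean within a neighbourhood is the share of venues of that category.

```python
import pandas as pd

# Invented data: A has 2 cafes, 1 park and 1 bakery; B has 2 parks
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bakery', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', toy['Neighborhood'])

# Averaging the indicators gives the share of each category per neighbourhood
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Each row of `freq` sums to 1, since every venue belongs to exactly one category here.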


We now drop the neighbourhoods with low venue counts selected above, then run the clustering and insert the cluster labels into the data frame.

In [57]:
Central_Toronto_freq.drop(
    Central_Toronto_freq[Central_Toronto_freq['Neighbourhood'].isin(hoods_to_exclude)].index,
    axis='index', 
    inplace=True,
)

In [64]:
kclusters = 3
kmeans = KMeans(n_clusters=kclusters, random_state=1234).fit(Central_Toronto_freq.drop('Neighbourhood', axis='columns'))
#Central_Toronto_freq.insert(1, 'Cluster Labels', kmeans.labels_) #for the first run, place the column correctly
Central_Toronto_freq['Cluster Labels'] = kmeans.labels_ #for subsequent runs, e.g. with a different kclusters
Central_Toronto_freq.head()

Unnamed: 0,Neighbourhood,Cluster Labels,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,1,0.0,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,...,0.0,0.016667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016667
1,"Brockton, Parkdale Village, Exhibition Place",1,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.023529,...,0.0,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765
2,"Business reply mail Processing Centre, South C...",1,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.012987,0.0,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.012987
4,Central Bay Street,1,0.0,0.0,0.0,0.0,0.0,0.013158,0.013158,0.0,...,0.0,0.0,0.0,0.013158,0.013158,0.013158,0.0,0.0,0.0,0.0


In [65]:
Central_Toronto_freq.groupby('Cluster Labels').count()

Unnamed: 0_level_0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,3,3,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
1,29,29,29,29,29,29,29,29,29,29,...,29,29,29,29,29,29,29,29,29,29
2,2,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2


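The clusters are very uneven in size: 29 of the 34 neighbourhoods fall into a single cluster, while the other two clusters contain only 3 and 2 neighbourhoods. This suggests that most neighbourhoods in central Toronto have a similar mix of venues.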
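As a quick check of how uneven the split is, the sizes can also be expressed as shares of all neighbourhoods. The following is a minimal sketch on toy labels that only reproduce the counts reported above (3, 29 and 2); the names `labels`, `cluster_sizes` and `cluster_shares` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy labels reproducing the cluster sizes reported above (3, 29 and 2 neighbourhoods).
labels = pd.Series([0] * 3 + [1] * 29 + [2] * 2, name='Cluster Labels')

# Number of neighbourhoods per cluster, ordered by cluster label.
cluster_sizes = labels.value_counts().sort_index()

# Share of all neighbourhoods that falls in each cluster.
cluster_shares = (cluster_sizes / cluster_sizes.sum()).round(3)

print(cluster_sizes)
print(cluster_shares)
```

On the real data, the same two lines could be applied to `Central_Toronto_freq['Cluster Labels']` instead of the toy series.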

In [66]:
#Adding back the geographical information
Central_Toronto_merged = pd.merge(Central_Toronto_df, Central_Toronto_freq, on='Neighbourhood')
Central_Toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0
3,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,1,0.0,0.0,0.024096,0.0,...,0.0,0.012048,0.0,0.0,0.0,0.012048,0.0,0.0,0.0,0.0
4,M4E,East Toronto,The Beaches,43.67709,-79.29547,2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
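#Base map, plus one colour per cluster taken from the plasma colormap
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)
colors = [rgb2hex(cm.plasma(i/kclusters)) for i in range(kclusters)]

#Draw one circle marker per neighbourhood, coloured by its cluster (popup shows label + 1)
def add_marker_cluster(row):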
    label = '{}, {}'.format(row['Neighbourhood'], row['Cluster Labels']+1)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        popup=label,
        radius=5,
        fill=True,
        fill_opacity=0.7,
        color=colors[row['Cluster Labels']],
        fill_color=colors[row['Cluster Labels']],
    ).add_to(map_clusters)
    
Central_Toronto_merged.apply(add_marker_cluster, axis=1)

map_clusters

The map shows that all the central neighbourhoods were grouped together in cluster 2. Interestingly, the Davisville and Davisville North neighbourhoods are also in this cluster, despite being farther from the centre. We will now look at the most common venues in each cluster.

In [68]:
#Mean venue frequency per cluster
Clusters_freq = Central_Toronto_freq.drop('Neighbourhood', axis=1).groupby('Cluster Labels').mean()

In [69]:
#Print the ten most frequent venue categories per cluster (clusters numbered from 1)
for i in range(kclusters):
    print('Cluster {} \n'.format(i+1))
    print(Clusters_freq.loc[i].sort_values(ascending=False).head(10))
    print('-------\n')


Cluster 1 

Park                                        0.150000
Light Rail Station                          0.111111
Coffee Shop                                 0.111111
Residential Building (Apartment / Condo)    0.083333
Sandwich Place                              0.083333
Convenience Store                           0.083333
Playground                                  0.066667
Shop & Service                              0.066667
Tennis Court                                0.066667
Bike Trail                                  0.066667
Name: 0, dtype: float64
-------

Cluster 2 

Coffee Shop            0.093148
Café                   0.043901
Restaurant             0.032120
Park                   0.031251
Italian Restaurant     0.027070
Hotel                  0.025587
Sandwich Place         0.024096
Pizza Place            0.022220
Bakery                 0.021962
Japanese Restaurant    0.020296
Name: 1, dtype: float64
-------

Cluster 3 

Café                  0.136364
Health Food Store

Cluster 2 has venues typical of a city's business centre: coffee shops, restaurants, and hotels. Cluster 1 appears to consist of residential areas, with residential buildings and convenience stores among its most common venues. It also seems well equipped for families with young children, as parks and playgrounds make the top ten. Cluster 3, on the other hand, has many health food stores and athletics and sports venues, which could indicate more affluent areas.
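To compare clusters side by side rather than reading separate printouts, the top venue categories can be collected into one small table. This is only a sketch under stated assumptions: the frequencies in `toy_cluster_freq` are made up for illustration, and `top_venues_per_cluster` is a hypothetical helper, not a function used earlier in this notebook.

```python
import pandas as pd

# Made-up mean venue frequencies for two clusters (illustrative values only).
toy_cluster_freq = pd.DataFrame(
    {
        'Coffee Shop': [0.10, 0.02],
        'Park': [0.03, 0.15],
        'Hotel': [0.05, 0.00],
        'Playground': [0.00, 0.07],
    },
    index=pd.Index([0, 1], name='Cluster Labels'),
)

def top_venues_per_cluster(cluster_freq, n=3):
    """Return one row per cluster listing its n most frequent venue categories."""
    rows = {}
    for label, freqs in cluster_freq.iterrows():
        # Clusters are numbered from 1 here, matching the printouts above.
        rows[label + 1] = freqs.sort_values(ascending=False).head(n).index.tolist()
    columns = ['Venue #{}'.format(rank + 1) for rank in range(n)]
    return pd.DataFrame.from_dict(rows, orient='index', columns=columns)

top_venues = top_venues_per_cluster(toy_cluster_freq, n=3)
print(top_venues)
```

Assuming `Clusters_freq` has the shape shown above (one row per cluster label, one column per venue category), it could be passed to the helper in place of the toy frame.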