# Clustering Toronto Neighbourhoods

### A. Gutmanas
### Feb 2020



---
## Part 1:
### Get, clean and load neighbourhood locations

A simple Google search and some critical review point to the City of Toronto website (https://www.toronto.ca) and its "Open Data" portal (https://open.toronto.ca), which contains the necessary data. Nevertheless, I will follow the instructions from the course.

1. Scrape the list of postcodes for Toronto from the Wikipedia page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and load it into a dataframe.
2. Drop rows where borough is _not assigned_.
3. Normalise the dataframe so that neighbourhoods with the same postal code are combined into a comma-separated list.

Let's start by importing the necessary libraries. Some of them will be needed later.

In [1]:
#!pip install folium    # uncomment if library not available
#!pip install shapely   # uncomment if library not available
#!pip install geocoder  # uncomment if library not available
#!pip install colour    # uncomment if library not available
#!pip install scikit-learn   # uncomment if library not available

In [2]:
# import libraries
import json
from shapely.geometry import shape, Point # will be needed later
import pandas as pd
import requests
import folium
import bs4
import geocoder
from colour import Color
from sklearn.cluster import KMeans

Get the raw HTML from the Wikipedia page

In [3]:
# get the raw data
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wiki_text = requests.get(wiki_url).text

Find the actual table on the page, get the column names (from _th_ tags) and the values (from _td_ tags), and load the resulting data into a dataframe. Of course, one could build the "Not assigned" checks and the merging of neighbourhoods into this parsing step, but it seems cleaner to keep them separate, even if it means we first load some rows and then drop them.

In [4]:
# find the table with Toronto postal codes
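# name the parser explicitly to avoid the "no parser was explicitly specified" warning
wiki_tables = bs4.BeautifulSoup(wiki_text, "html.parser").find_all("table", attrs={"class": "wikitable sortable"})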

cols = [x.get_text().strip() for x in wiki_tables[0].find_all("th")]
rows = wiki_tables[0].find_all("tr")
values = []
for row in rows[1:]:
    values.append([x.get_text().strip() for x in row.find_all("td")])

toronto_postcodes = pd.DataFrame(columns=cols, data=values)
toronto_postcodes.head()






Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Now, some cleanup. Drop "Not assigned" boroughs.

In [5]:
toronto_postcodes.drop(toronto_postcodes.loc[toronto_postcodes['Borough']=="Not assigned"].index, inplace=True)
toronto_postcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Check if there are any neighbourhoods that are "Not assigned" and copy the name of the corresponding borough instead. Check how many such cases there are.

In [6]:
indices = toronto_postcodes.loc[toronto_postcodes['Neighbourhood']=="Not assigned"].index
toronto_postcodes.loc[indices,"Neighbourhood"] = toronto_postcodes.loc[indices,"Borough"]

toronto_postcodes.loc[indices,:] 

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park


OK. Just that one case. 

_Actually, as someone who has lived in Toronto, I can say that Queen's Park doesn't exactly qualify as a borough, but it looks like the Ontario legislature and government wish to have a postal code area all to themselves!_

Now the fun bit: group the neighbourhoods by their postal code area and concatenate them into a comma-separated list.
Just out of curiosity, also check whether any postal code covers more than one borough.

In [7]:
tg = toronto_postcodes.groupby(["Postcode"])
postcodes = list(tg.groups.keys())
boroughs = []
neighbourhoods = []

for code in postcodes:    
    area = tg.get_group(code)
    if area["Borough"].nunique() != 1:
        print(f"Postal code {code} covers an area in {area['Borough'].nunique()} boroughs. Keeping only the first one")
    boroughs.append(area.iloc[0,1])
    neighbourhoods.append(pd.Series(area["Neighbourhood"].unique()).str.cat(sep=", "))
    
toronto_work_df = pd.DataFrame({
    "Postal_code": postcodes,
    "Borough": boroughs,
    "Neighbourhood": neighbourhoods
})
    
toronto_work_df.head()    

Unnamed: 0,Postal_code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


There is probably a better way of achieving this without looping over the individual groups of a dataframe, but we are not dealing with massive data, so I'll be lazy and leave it as is. (BTW, no postal code spread itself over more than one borough.)
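For the record, the loop-free version would probably look something like the sketch below. It is an alternative, not what the notebook actually ran. `sample` is a small made-up stand-in for `toronto_postcodes`, using a few rows that appear in the output above.

```python
import pandas as pd

# Hypothetical stand-in for toronto_postcodes (same column names, a few rows only)
sample = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Lawrence Heights", "Lawrence Manor"],
})

grouped = (
    sample.groupby("Postcode", as_index=False)
    .agg(
        # keep the first borough seen for each postal code
        Borough=("Borough", "first"),
        # join the distinct neighbourhood names into one comma-separated string
        Neighbourhood=("Neighbourhood", lambda s: ", ".join(s.unique())),
    )
    .rename(columns={"Postcode": "Postal_code"})
)
print(grouped)
```

Named aggregation keeps the whole operation in one expression, at the cost of losing the explicit warning about postal codes that span several boroughs.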

## End of part 1: 
check the size of the resulting dataframe

In [8]:
toronto_work_df.shape

(103, 3)

### An alternative way to get a list of Toronto's neighbourhoods (with geolocation data!)
Just for fun, I will also load the neighbourhood geodata from https://open.toronto.ca/dataset/neighbourhoods, which can be downloaded as CSV, GeoJSON and a few other formats. This is easier than scraping Wikipedia, although the list of neighbourhoods differs from the one on Wikipedia. The exact link for the CSV is https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=csv&projection=4326, and it is easy to load into a pandas dataframe. This dataset lacks "boroughs", which are the old municipalities that existed before the City of Toronto was amalgamated in 1998. This information could be useful at some point, and the geographic boundaries for these areas are available from https://open.toronto.ca/dataset/former-municipality-boundaries/. The GeoJSON file can be downloaded from https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/f82dbe76-928e-4cec-8147-a21882f575e2?format=geojson&projection=4326

In [9]:
# Download the CSV with Toronto neighbourhoods and load into a dataframe
toronto_raw = pd.read_csv("https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=csv&projection=4326")
toronto_raw.head()

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,X,Y,LONGITUDE,LATITUDE,OBJECTID,Shape__Area,Shape__Length,geometry
0,3221,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),,,-79.425515,43.676919,16491505,3217960.0,7515.779658,"{u'type': u'Polygon', u'coordinates': (((-79.4..."
1,3222,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),,,-79.40359,43.704689,16491521,3160334.0,7872.021074,"{u'type': u'Polygon', u'coordinates': (((-79.4..."
2,3223,25886834,25926664,49885,97,97,Yonge-St.Clair (97),Yonge-St.Clair (97),,,-79.397871,43.687859,16491537,2222464.0,8130.411276,"{u'type': u'Polygon', u'coordinates': (((-79.3..."
3,3224,25886593,25926665,49885,27,27,York University Heights (27),York University Heights (27),,,-79.488883,43.765736,16491553,25418210.0,25632.335242,"{u'type': u'Polygon', u'coordinates': (((-79.5..."
4,3225,25886688,25926666,49885,31,31,Yorkdale-Glen Park (31),Yorkdale-Glen Park (31),,,-79.457108,43.714672,16491569,11566690.0,13953.408098,"{u'type': u'Polygon', u'coordinates': (((-79.4..."


In [10]:
# create a new dataframe with relevant columns only 
toronto_base = toronto_raw[["AREA_SHORT_CODE", "LONGITUDE", "LATITUDE"]].copy()
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE
0,94,-79.425515,43.676919
1,100,-79.40359,43.704689
2,97,-79.397871,43.687859
3,27,-79.488883,43.765736
4,31,-79.457108,43.714672


In [11]:
# add cleaned up names of neighbourhoods
toronto_base["AREA_NAME"] = [x[:x.find('(')-1] for x in toronto_raw["AREA_NAME"]]                               
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE,AREA_NAME
0,94,-79.425515,43.676919,Wychwood
1,100,-79.40359,43.704689,Yonge-Eglinton
2,97,-79.397871,43.687859,Yonge-St.Clair
3,27,-79.488883,43.765736,York University Heights
4,31,-79.457108,43.714672,Yorkdale-Glen Park


In [12]:
# download GeoJSON with data for old municipalities (i.e., boroughs)
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/f82dbe76-928e-4cec-8147-a21882f575e2?format=geojson&projection=4326"
boroughs_geoJSON = requests.get(url).json()

For each neighbourhood create a "Point" object and loop over the boroughs GeoJSON to check which borough the point belongs to. 

In [13]:
boroughs = []
for index, neighbourhood in toronto_base.iterrows():
    # print(neighbourhood["AREA_NAME"])
    point = Point(neighbourhood["LONGITUDE"], neighbourhood["LATITUDE"])
    for feature in boroughs_geoJSON['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            # print(neighbourhood['AREA_NAME']," is in ",feature['properties']['AREA_NAME'])
            boroughs.append(feature['properties']['AREA_NAME'])
            break
            
toronto_base["BOROUGH"] = boroughs
toronto_base.sort_values(by=["BOROUGH", "AREA_SHORT_CODE"], inplace=True)
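One caveat with the cell above: if a neighbourhood centre fell outside every borough polygon, nothing would be appended and `boroughs` would end up shorter than the dataframe. The sketch below shows, in plain Python, the idea behind a point-in-polygon test (ray casting) together with a lookup that returns `None` when nothing matches. It is an illustration only, not how shapely implements `contains`, and it handles a single ring without holes. The two square "boroughs" are made up.

```python
def point_in_ring(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside the ring given as (lng, lat) pairs?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def find_area(lng, lat, areas):
    """Return the name of the first area containing the point, or None."""
    for name, ring in areas.items():
        if point_in_ring(lng, lat, ring):
            return name
    return None


# Two hypothetical square areas side by side
areas = {
    "WEST": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "EAST": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(find_area(0.5, 0.5, areas))
print(find_area(1.5, 0.5, areas))
print(find_area(3.0, 0.5, areas))
```

Returning `None` for unmatched points keeps the result list aligned with the dataframe rows, so any misses show up as missing values instead of a length error.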

In [14]:
toronto_base.head()

Unnamed: 0,AREA_SHORT_CODE,LONGITUDE,LATITUDE,AREA_NAME,BOROUGH
29,54,-79.312228,43.7068,O'Connor-Parkview,EAST YORK
57,55,-79.349984,43.707749,Thorncliffe Park,EAST YORK
9,56,-79.366072,43.703797,Leaside-Bennington,EAST YORK
91,57,-79.35563,43.688825,Broadview North,EAST YORK
32,58,-79.335488,43.696781,Old East York,EAST YORK


In [15]:
toronto_base.shape

(140, 5)

So, there are 103 postal code areas in Toronto, and 140 official neighbourhoods recognised by the City of Toronto. For the purposes of the project, it is probably immaterial which of the approaches is used. 

### End of alternative data download

---
## Part 2:
### Obtain latitude and longitude for the neighbourhoods

Following the instructions, use the geocoder library and try searching for each postcode (possibly in an infinite loop?)

Let's define a function to obtain coordinates for a Toronto postcode area

In [16]:
def get_lat_lng(postal_code, suffix="Toronto, Ontario", max_iter=10):
    """
    combine the postal code and the city/province/country in the suffix
    no more than max_iter attempts 
    return a tuple of (latitude, longitude), or None if nothing was found
    """
    result = None
    i = 0
    while result is None and i<max_iter:
        # google method failed to return anything even after a 1000 iterations. 
        # by trial and error found that arcgis does the job. 
        # 
        g = geocoder.arcgis(f'{postal_code}, {suffix}')
        result = g.json
        i += 1
    
    if result:
        return result['lat'], result['lng']
    else:
        return None
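The retry logic in the function above could be pulled out into a small generic helper. The sketch below is one way to do it, with an optional pause that grows after each failed attempt. `retry` and `flaky_lookup` are made-up names for illustration, and the fake lookup stands in for the geocoder call so the example runs offline.

```python
import time


def retry(func, max_iter=10, delay=0.0):
    """Call func() until it returns something other than None, at most max_iter times."""
    for attempt in range(max_iter):
        result = func()
        if result is not None:
            return result
        # wait a little longer after each failed attempt (no wait if delay is 0)
        time.sleep(delay * (attempt + 1))
    return None


# Hypothetical lookup that fails twice before succeeding
calls = {"n": 0}


def flaky_lookup():
    calls["n"] += 1
    return (43.7, -79.4) if calls["n"] >= 3 else None


coords = retry(flaky_lookup, max_iter=5)
print(coords, "after", calls["n"], "calls")
```

With a real geocoder, a non-zero `delay` is probably kinder to the service than hammering it in a tight loop.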

Now let's iterate over the postcodes and actually obtain the coordinates. Then add them to the dataframe.

In [17]:
latitudes = []
longitudes = []

for code in postcodes:
    ll = get_lat_lng(code)
    if ll is None:
        latitudes.append(None)
        longitudes.append(None)
        print("None for ", code)
    else:
        latitudes.append(ll[0])
        longitudes.append(ll[1])
        
    
toronto_work_df["Latitude"] = latitudes
toronto_work_df["Longitude"] = longitudes
toronto_work_df.head()

Unnamed: 0,Postal_code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944


A quick sanity check.

In [18]:
toronto_work_df.loc[toronto_work_df["Postal_code"] == "M4R"]

Unnamed: 0,Postal_code,Borough,Neighbourhood,Latitude,Longitude
46,M4R,Central Toronto,North Toronto West,43.714523,-79.40696


In [19]:
# Coordinates for Yonge and Eg (roughly central)
longitude = -79.403590 
latitude = 43.704689

Now let's create a map!

In [20]:
boroughs_geoJSON

{'type': 'FeatureCollection',
 'crs': {'type': 'name',
  'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}},
 'features': [{'type': 'Feature',
   'properties': {'_id': 139,
    'AREA_ID': 49884,
    'DATE_EFFECTIVE': None,
    'AREA_ATTR_ID': 49884,
    'PARENT_AREA_ID': 49886,
    'AREA_SHORT_CODE': 14,
    'AREA_LONG_CODE': 14,
    'AREA_NAME': 'YORK',
    'AREA_DESC': 'YORK',
    'X': None,
    'Y': None,
    'LONGITUDE': -79.4775659929,
    'LATITUDE': 43.685081164799996,
    'OBJECTID': 11093905,
    'Shape__Area': 45043586.53125,
    'Shape__Length': 53124.2847222816},
   'geometry': {'type': 'Polygon',
    'coordinates': [[[-79.4926212023891, 43.6474363515455],
      [-79.4924881713615, 43.6477167084201],
      [-79.4924187105467, 43.6478630840389],
      [-79.4922706961263, 43.6481016103021],
      [-79.49203629835151, 43.648385137201],
      [-79.4919704392671, 43.6484661358722],
      [-79.4917794179277, 43.6487136598091],
      [-79.4915357342044, 43.6490602125298],
  

In [21]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11, tiles='cartodbpositron')
# add boroughs (old municipalities) to map
colormap = {
    'YORK': '#b3e2cd',
    'SCARBOROUGH': '#fdcdac',
    'NORTH YORK': '#cbd5e8',
    'TORONTO': '#f4cae4',
    'ETOBICOKE': '#e6f5c9',
    'EAST YORK': '#fff2ae'
}
style_function = lambda x: {
    'fillColor': colormap[x['properties']['AREA_NAME']],
    'color': "#aaaaaa"
}
for feature in boroughs_geoJSON['features']:
    
    folium.GeoJson(
        feature,
        style_function=style_function,
        name='geojson'
    ).add_to(map_toronto)
    
    label = feature['properties']['AREA_NAME']
    label_color = Color(colormap[label])
    label_color.luminance *= 0.3
    lat = feature['properties']['LATITUDE']
    lng = feature['properties']['LONGITUDE']
    folium.Marker(
        location=[lat,lng],
        icon=folium.DivIcon(html=f"""<div style="color: {label_color.hex_l}; align: center">{label}</div>""")
    ).add_to(map_toronto)
    


for lat, lng, borough, neighborhood in zip(toronto_base['LATITUDE'], 
                                           toronto_base['LONGITUDE'], 
                                           toronto_base['BOROUGH'], 
                                           toronto_base['AREA_NAME']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#5ab4ac',
        fill=True,
        fill_color='#5ab4ac',
        fill_opacity=0.3,
        parse_html=False
    ).add_to(map_toronto)  
    

for lat, lng, borough, neighborhood, postcode in zip(toronto_work_df['Latitude'], 
                                                     toronto_work_df['Longitude'], 
                                                     toronto_work_df['Borough'], 
                                                     toronto_work_df['Neighbourhood'],
                                                     toronto_work_df['Postal_code']):
    label = '{} {}, {}'.format(postcode, neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#d8b365',
        fill=True,
        fill_color='#d8b365',
        fill_opacity=0.3,
        parse_html=False
    ).add_to(map_toronto)  
    

title_html = '<h3 align="center" style="font-size:20px"><b>Toronto boroughs, neighbourhoods (blue) and postal codes (brown)</b></h3>'
    
map_toronto.get_root().html.add_child(folium.Element(title_html))

map_toronto

The map above shows the "Postal code" neighbourhoods from Wikipedia in brown and the ones downloaded from the City of Toronto Open Data portal in blue. The shaded areas are the old municipalities, which were amalgamated into a single City of Toronto in 1998.

There is a curious concentration of brown dots (postal codes) in the city centre. This is not too surprising, really, as this is the area with many high-rises, and many businesses, so the postal code density is expected to be higher there.

---
### End of Part 2

## Part 3:
### Looking into venues with Foursquare, and clustering of neighbourhoods by popularity of venues.



Start by setting up the Foursquare credentials. I have regenerated my secret key after submitting the notebook, in case I forget to remove the credentials!

In [22]:
CLIENT_ID = '' 
CLIENT_SECRET = '' 
VERSION = '20200201' # Foursquare API version

In [23]:
base_url = "https://api.foursquare.com/v2"
venues_url = base_url + "/venues"

A quick exploratory look at an area I know well.

In [24]:
lat = toronto_work_df.at[46,"Latitude"]
lng = toronto_work_df.at[46,"Longitude"]

In [25]:
radius = 1000
LIMIT = 100
url = venues_url + '/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION,
    lat, 
    lng, 
    radius, 
    LIMIT)
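As an aside, the query string can also be assembled with `urllib.parse.urlencode`, which takes care of escaping. This is only a sketch of an alternative to the `format` call above. `build_explore_url` is a made-up helper name, and the credentials are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_explore_url(base, client_id, client_secret, version, lat, lng, radius, limit):
    """Build the /explore URL from named parameters instead of positional format slots."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{base}/explore?{urlencode(params)}"


# Placeholder credentials; coordinates are the M4R values shown earlier
example_url = build_explore_url(
    "https://api.foursquare.com/v2/venues", "ID", "SECRET", "20200201",
    43.714523, -79.40696, 1000, 100,
)
print(example_url)
```

Naming each parameter makes it harder to swap two arguments by accident, which is easy to do with seven positional `{}` slots.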

In [26]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e49dcfc29ce6a001c24b7cc'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Lawrence Park South',
  'headerFullLocation': 'Lawrence Park South, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 46,
  'suggestedBounds': {'ne': {'lat': 43.723522793000065,
    'lng': -79.3945315214085},
   'sw': {'lat': 43.70552277500004, 'lng': -79.41938847859144}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4f4d31aee4b0ef284ae397ea',
       'name': 'Himalayan Java',
       'location': {'address': '2552 Yonge St',
        'crossStreet': 'Briar Hill',
        'lat': 43.713486181375714,
        'lng': -79.39981137215881,
        'labele

In [27]:
venues = results['response']['groups'][0]['items']
venues

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4f4d31aee4b0ef284ae397ea',
   'name': 'Himalayan Java',
   'location': {'address': '2552 Yonge St',
    'crossStreet': 'Briar Hill',
    'lat': 43.713486181375714,
    'lng': -79.39981137215881,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.713486181375714,
      'lng': -79.39981137215881}],
    'distance': 586,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['2552 Yonge St (Briar Hill)',
     'Toronto ON',
     'Canada']},
   'categories': [{'id': '4bf58dd8d48988d16d941735',
     'name': 'Café',
     'pluralName': 'Cafés',
     'shortName': 'Café',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4f4d31aee4b0ef

Well, the function below is simply nicked from the course lab. It does the job of extracting the category of a venue.

In [28]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's flatten the JSON into a dataframe and then only keep a few columns (venue name, category, coordinates and distance to the centre of the postal code area).

In [29]:
nearby_venues = pd.json_normalize(venues)  # pd.io.json.json_normalize in pandas < 1.0
nearby_venues.columns

Index(['referralId', 'reasons.count', 'reasons.items', 'venue.id',
       'venue.name', 'venue.location.address', 'venue.location.crossStreet',
       'venue.location.lat', 'venue.location.lng',
       'venue.location.labeledLatLngs', 'venue.location.distance',
       'venue.location.cc', 'venue.location.city', 'venue.location.state',
       'venue.location.country', 'venue.location.formattedAddress',
       'venue.categories', 'venue.photos.count', 'venue.photos.groups',
       'venue.location.postalCode', 'venue.venuePage.id'],
      dtype='object')
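To make the flattening less mysterious, here is a tiny self-contained sketch of what `json_normalize` does to a nested record: nested dictionaries become dotted column names, while lists (such as the categories) are left as they are. The record is a cut-down copy of the first venue above. The sketch assumes pandas 1.0 or newer, where `pd.json_normalize` is available at the top level.

```python
import pandas as pd

# Cut-down version of one item from the Foursquare response shown above
items = [
    {
        "venue": {
            "name": "Himalayan Java",
            "location": {"lat": 43.713486, "lng": -79.399811, "distance": 586},
            "categories": [{"name": "Café"}],
        }
    },
]

flat = pd.json_normalize(items)
print(sorted(flat.columns))
```

The list under `venue.categories` is not expanded, which is why a helper such as `get_category_type` is still needed afterwards.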

In [30]:
filtered_columns = ['venue.name', 
                    'venue.categories', 
                    'venue.location.lat', 
                    'venue.location.lng', 
                    'venue.location.distance']

nearby_venues = nearby_venues[filtered_columns].copy()  # copy, so later column assignments don't trigger SettingWithCopyWarning
nearby_venues.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng,venue.location.distance
0,Himalayan Java,"[{'id': '4bf58dd8d48988d16d941735', 'name': 'C...",43.713486,-79.399811,586
1,De Mello Palheta Coffee Roasters,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",43.711791,-79.399403,679
2,Sheridan Nurseries,"[{'id': '4bf58dd8d48988d11b951735', 'name': 'F...",43.719005,-79.4005,720
3,Douce France,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",43.711554,-79.399394,692
4,Cibo Wine Bar,"[{'id': '4bf58dd8d48988d110941735', 'name': 'I...",43.711464,-79.39957,685


Now, replace the mess in the categories column with the actual category as defined by the above function

In [31]:
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng,venue.location.distance
0,Himalayan Java,Café,43.713486,-79.399811,586
1,De Mello Palheta Coffee Roasters,Coffee Shop,43.711791,-79.399403,679
2,Sheridan Nurseries,Flower Shop,43.719005,-79.4005,720
3,Douce France,Bakery,43.711554,-79.399394,692
4,Cibo Wine Bar,Italian Restaurant,43.711464,-79.39957,685


A final look at this one area before proceeding to the whole city.

In [32]:
nearby_venues.shape

(46, 5)

In [33]:
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

In [34]:
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng,distance
0,Himalayan Java,Café,43.713486,-79.399811,586
1,De Mello Palheta Coffee Roasters,Coffee Shop,43.711791,-79.399403,679
2,Sheridan Nurseries,Flower Shop,43.719005,-79.4005,720
3,Douce France,Bakery,43.711554,-79.399394,692
4,Cibo Wine Bar,Italian Restaurant,43.711464,-79.39957,685


Another function stolen from the course lab. Minor modifications to also include category ID and distance to the centre of the postal code area. 

In [35]:
def get_nearby_venues(areas, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for area, lat, lng in zip(areas, latitudes, longitudes):
        # create the API request URL
        url = venues_url+'/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            area, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['location']['distance'], 
            v['venue']['categories'][0]['name'],
            v['venue']['categories'][0]['id'],
        ) for v in results])
        
        print(f"{len(results)} venues within {radius} m of {area} center")

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
        'Neighborhood', 
        'Neighborhood Latitude', 
        'Neighborhood Longitude', 
        'Venue Name', 
        'Venue Latitude', 
        'Venue Longitude', 
        'Venue Distance to Neighbourhood Centre', 
        'Venue Category',
        'Venue Category ID',
    ]
    
    return nearby_venues
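One thing the function above takes for granted is that every venue has at least one category, since `v['venue']['categories'][0]` would raise an `IndexError` otherwise. A defensive variant might look like the sketch below. `primary_category` is a made-up helper name, and whether Foursquare ever returns an empty list here is an assumption on my part.

```python
def primary_category(venue):
    """Return (name, id) of the venue's first category, or (None, None) if there is none."""
    categories = venue.get("categories") or []
    if not categories:
        return None, None
    first = categories[0]
    return first.get("name"), first.get("id")


# One venue with a category (values from the response above) and one without
with_category = {
    "name": "Himalayan Java",
    "categories": [{"id": "4bf58dd8d48988d16d941735", "name": "Café"}],
}
without_category = {"name": "Mystery venue", "categories": []}

print(primary_category(with_category))
print(primary_category(without_category))
```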

Get the Foursquare data

In [36]:
toronto_venues_by_postal_code = get_nearby_venues(
    toronto_work_df["Postal_code"],
    toronto_work_df["Latitude"], 
    toronto_work_df["Longitude"]
)

8 venues within 1000 m of M1B center
4 venues within 1000 m of M1C center
19 venues within 1000 m of M1E center
18 venues within 1000 m of M1G center
24 venues within 1000 m of M1H center
11 venues within 1000 m of M1J center
22 venues within 1000 m of M1K center
22 venues within 1000 m of M1L center
16 venues within 1000 m of M1M center
11 venues within 1000 m of M1N center
22 venues within 1000 m of M1P center
31 venues within 1000 m of M1R center
48 venues within 1000 m of M1S center
37 venues within 1000 m of M1T center
22 venues within 1000 m of M1V center
26 venues within 1000 m of M1W center
0 venues within 1000 m of M1X center
19 venues within 1000 m of M2H center
84 venues within 1000 m of M2J center
7 venues within 1000 m of M2K center
4 venues within 1000 m of M2L center
58 venues within 1000 m of M2M center
100 venues within 1000 m of M2N center
25 venues within 1000 m of M2P center
16 venues within 1000 m of M2R center
26 venues within 1000 m of M3A center
29 venues within

In [37]:
toronto_venues_by_postal_code.shape

(5120, 9)

Over 5k venues, but of course many are duplicates, because I used a 1 km radius for the query and the search areas of neighbouring postal codes overlap. This is OK, really, because the idea is to classify neighbourhoods by what is accessible on foot from the centre of each neighbourhood. So 1 km sounds reasonable, even if not super accurate.
BTW, postal code M1X returned 0 venues. That's the northeastern corner of Scarborough. I definitely believe this result. 
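If one wanted to know how many distinct venues hide behind those rows, something like the sketch below would do. The rows are made up (the same café returned for two postal codes), and since the dataframe does not keep the Foursquare venue ID, the combination of name and coordinates is used as a stand-in key, which is an assumption.

```python
import pandas as pd

# Hypothetical rows: one venue returned for two overlapping search areas
rows = pd.DataFrame({
    "Neighborhood": ["M4R", "M4P", "M4R"],
    "Venue Name": ["Himalayan Java", "Himalayan Java", "Douce France"],
    "Venue Latitude": [43.713486, 43.713486, 43.711554],
    "Venue Longitude": [-79.399811, -79.399811, -79.399394],
})

# name + coordinates as an approximate venue identity
venue_key = ["Venue Name", "Venue Latitude", "Venue Longitude"]
n_rows = len(rows)
n_unique = len(rows.drop_duplicates(subset=venue_key))
print(f"{n_rows} rows, {n_unique} distinct venues")
```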

In [38]:
toronto_venues_by_postal_code.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Distance to Neighbourhood Centre,Venue Category,Venue Category ID
0,M1B,43.811525,-79.195517,Canadiana exhibit,43.817962,-79.193374,736,Zoo Exhibit,58daa1558bbb0b01f18ec1fd
1,M1B,43.811525,-79.195517,Wendy's,43.807448,-79.199056,535,Fast Food Restaurant,4bf58dd8d48988d16e941735
2,M1B,43.811525,-79.195517,Grizzly Bear Exhibit,43.817031,-79.193458,634,Zoo Exhibit,58daa1558bbb0b01f18ec1fd
3,M1B,43.811525,-79.195517,Ecopainting inc.,43.808417,-79.202392,651,Construction & Landscaping,5454144b498ec1f095bff2f2
4,M1B,43.811525,-79.195517,Upper Rouge Trail,43.809988,-79.186147,771,Trail,4bf58dd8d48988d159941735


In [39]:
len(toronto_venues_by_postal_code["Venue Category"].unique())

341

341 unique categories. Hmmm....
I think a closer look is needed.

In [40]:
toronto_venues_by_postal_code["Venue Category"].unique()

array(['Zoo Exhibit', 'Fast Food Restaurant',
       'Construction & Landscaping', 'Trail', 'Other Great Outdoors',
       'Hobby Shop', 'Italian Restaurant', 'Burger Joint',
       'Breakfast Spot', 'Park', 'Food & Drink Shop', 'Liquor Store',
       'Pizza Place', 'Grocery Store', 'Juice Bar', 'Pharmacy',
       'Discount Store', 'Sports Bar', 'Supermarket',
       'Gym / Fitness Center', 'Athletics & Sports', 'Gymnastics Gym',
       'Bus Station', 'Convenience Store', 'Restaurant', 'Coffee Shop',
       'Indian Restaurant', 'Vietnamese Restaurant', 'Department Store',
       'Chinese Restaurant', 'Thrift / Vintage Store', 'Sandwich Place',
       'Bakery', 'Caribbean Restaurant', 'Hakka Restaurant',
       'Music Store', 'Thai Restaurant', 'Bank', 'Gas Station',
       'Fried Chicken Joint', 'German Restaurant', 'Bus Line',
       'Ice Cream Shop', 'Big Box Store', 'Train Station',
       'Metro Station', 'Light Rail Station', 'Asian Restaurant',
       'Rental Car Location', 'Vege

OK. These categories are useful when looking at an individual venue, but what is the difference between a Coffee Shop and a Café? What about a Bistro? I think a more general approach is needed. Fortunately, Foursquare groups categories into a tree, so I will retrieve that whole tree of categories now.

In [41]:
categories_url = venues_url + "/categories?client_id={}&client_secret={}&v={}".format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION
)
categories_json = requests.get(categories_url).json()["response"]


In [42]:
categories_json["categories"]

[{'id': '4d4b7104d754a06370d81259',
  'name': 'Arts & Entertainment',
  'pluralName': 'Arts & Entertainment',
  'shortName': 'Arts & Entertainment',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
   'suffix': '.png'},
  'categories': [{'id': '56aa371be4b08b9a8d5734db',
    'name': 'Amphitheater',
    'pluralName': 'Amphitheaters',
    'shortName': 'Amphitheater',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
     'suffix': '.png'},
    'categories': []},
   {'id': '4fceea171983d5d06c3e9823',
    'name': 'Aquarium',
    'pluralName': 'Aquariums',
    'shortName': 'Aquarium',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/aquarium_',
     'suffix': '.png'},
    'categories': []},
   {'id': '4bf58dd8d48988d1e1931735',
    'name': 'Arcade',
    'pluralName': 'Arcades',
    'shortName': 'Arcade',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_ent

The tree of categories is quite a complicated one, so below is a function to obtain the root level category for each branch or leaf, and return a dictionary with that data.

In [43]:
def category_iterator(child_json, parent_name=None):
    """
    map each category to its top level parent. 
    Uses recursion to descend into deeper levels ("grandchildren", etc.)
    I.e., "Chinese Restaurant" should map to "Food"
    return categories_dict (3 parallel lists: parent_category, child_category, child_id)
    """
    categories_dict = {
        "parent_category": [],
        "child_category": [],
        "child_id": []
    }
    
    for p in child_json["categories"]:
        if parent_name:
            categories_dict["parent_category"].append(parent_name)
        else:
            categories_dict["parent_category"].append(p["name"])            
        categories_dict["child_category"].append(p["name"])
        categories_dict["child_id"].append(p["id"])
        if len(p["categories"])>0:
            if parent_name:
                recursive_dict = category_iterator(p, parent_name)
            else:
                recursive_dict = category_iterator(p, p["name"])
            categories_dict["parent_category"].extend(recursive_dict["parent_category"])
            categories_dict["child_category"].extend(recursive_dict["child_category"])
            categories_dict["child_id"].extend(recursive_dict["child_id"])
            
    return categories_dict
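The same idea can be written more compactly if all that is needed is a lookup from category ID to root category. The sketch below is a simplified variant, not a replacement for the function above. `root_of_each_category` is a made-up name, and `tiny_tree` is a made-up tree with fake IDs that only mimics the shape of the Foursquare response.

```python
def root_of_each_category(tree, root=None):
    """Return {category_id: name of its top-level ancestor} for the whole tree."""
    mapping = {}
    for node in tree.get("categories", []):
        # at the top level a category is its own root; deeper down, inherit the root
        top = root if root is not None else node["name"]
        mapping[node["id"]] = top
        mapping.update(root_of_each_category(node, top))
    return mapping


# Made-up tree with fake IDs, three levels deep
tiny_tree = {"categories": [
    {"id": "a", "name": "Food", "categories": [
        {"id": "b", "name": "Asian Restaurant", "categories": [
            {"id": "c", "name": "Chinese Restaurant", "categories": []},
        ]},
    ]},
    {"id": "d", "name": "Arts & Entertainment", "categories": []},
]}

roots = root_of_each_category(tiny_tree)
print(roots)
```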

In [44]:
categories_df = pd.DataFrame(category_iterator(categories_json))
categories_df.set_index("child_id", inplace=True)

In [45]:
categories_df.head()


child_id,parent_category,child_category
4d4b7104d754a06370d81259,Arts & Entertainment,Arts & Entertainment
56aa371be4b08b9a8d5734db,Arts & Entertainment,Amphitheater
4fceea171983d5d06c3e9823,Arts & Entertainment,Aquarium
4bf58dd8d48988d1e1931735,Arts & Entertainment,Arcade
4bf58dd8d48988d1e2931735,Arts & Entertainment,Art Gallery


I've put these category relationships into a dataframe, which lets me transfer the root categories to the venues dataframe I obtained above. The cell below constructs a new column for the venues dataframe.

In [46]:
parent_category = []
for category_id in toronto_venues_by_postal_code["Venue Category ID"]:
    parent_category.append(categories_df.at[category_id,"parent_category"])

toronto_venues_by_postal_code["Parent Category"] = parent_category
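The explicit loop above could probably be replaced by `Series.map`, as in the sketch below. The frames and IDs are made up for illustration. One difference worth knowing: `.at` raises a `KeyError` for an unknown category ID, whereas `map` quietly yields a missing value.

```python
import pandas as pd

# Hypothetical lookup table indexed by category ID, like categories_df
categories = pd.DataFrame(
    {"parent_category": ["Food", "Outdoors & Recreation"]},
    index=pd.Index(["id_food", "id_trail"], name="child_id"),
)

# Hypothetical venues, including one ID that is not in the lookup table
venues_sample = pd.DataFrame({"Venue Category ID": ["id_trail", "id_food", "id_unknown"]})

# look up each ID in the index of the parent_category Series
venues_sample["Parent Category"] = venues_sample["Venue Category ID"].map(
    categories["parent_category"]
)
print(venues_sample)
```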

In [47]:
toronto_venues_by_postal_code.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Distance to Neighbourhood Centre,Venue Category,Venue Category ID,Parent Category
0,M1B,43.811525,-79.195517,Canadiana exhibit,43.817962,-79.193374,736,Zoo Exhibit,58daa1558bbb0b01f18ec1fd,Arts & Entertainment
1,M1B,43.811525,-79.195517,Wendy's,43.807448,-79.199056,535,Fast Food Restaurant,4bf58dd8d48988d16e941735,Food
2,M1B,43.811525,-79.195517,Grizzly Bear Exhibit,43.817031,-79.193458,634,Zoo Exhibit,58daa1558bbb0b01f18ec1fd,Arts & Entertainment
3,M1B,43.811525,-79.195517,Ecopainting inc.,43.808417,-79.202392,651,Construction & Landscaping,5454144b498ec1f095bff2f2,Shop & Service
4,M1B,43.811525,-79.195517,Upper Rouge Trail,43.809988,-79.186147,771,Trail,4bf58dd8d48988d159941735,Outdoors & Recreation


OK. I think we are ready to proceed. Let's do the One-Hot encoding by the parent category and add the postal code to the new dataframe. (Same principle as in the course lab, but on a more manageable level, in my opinion).

In [48]:
toronto_one_hot_venues_by_postal_code = pd.get_dummies(toronto_venues_by_postal_code[['Parent Category']], prefix="", prefix_sep="")

In [49]:
toronto_one_hot_venues_by_postal_code["Postal_code"] = toronto_venues_by_postal_code["Neighborhood"]
fixed_columns = [toronto_one_hot_venues_by_postal_code.columns[-1]] + list(toronto_one_hot_venues_by_postal_code.columns)[:-1]
toronto_one_hot_venues_by_postal_code = toronto_one_hot_venues_by_postal_code[fixed_columns]
toronto_one_hot_venues_by_postal_code.head()

Unnamed: 0,Postal_code,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,M1B,1,0,0,0,0,0,0,0,0
1,M1B,0,0,1,0,0,0,0,0,0
2,M1B,1,0,0,0,0,0,0,0,0
3,M1B,0,0,0,0,0,0,0,1,0
4,M1B,0,0,0,0,1,0,0,0,0


Let's now see how many venues of each type we have.

In [50]:
totals = toronto_one_hot_venues_by_postal_code[list(toronto_one_hot_venues_by_postal_code.columns)[1:]].sum(axis=0)
totals

Arts & Entertainment            219
College & University             11
Food                           2837
Nightlife Spot                  302
Outdoors & Recreation           502
Professional & Other Places      32
Residence                         1
Shop & Service                 1044
Travel & Transport              172
dtype: int64

Clearly, food dominates our lives! But I think we can drop the Residence, College & University and Professional & Other Places columns: they have too little data to be useful. Plus, if anyone is curious, I have looked at those lists, and they are woefully incomplete. Maybe this is because I limited the result set to 100 venues for each postal code area, but I think, more realistically, Foursquare is geared towards food, entertainment and shopping.

So, I will group the encoded dataframe by postal code, sum each column within each group, and drop the three columns with fewer than 100 venues. To account for the different scales of the categories, I will also divide each value by the corresponding total calculated above.

In [51]:
toronto_venues_counts_by_postal_code = toronto_one_hot_venues_by_postal_code.groupby("Postal_code").sum().div(totals)
toronto_venues_counts_by_postal_code.drop(["Residence", "College & University", "Professional & Other Places"], axis=1, inplace=True)
toronto_venues_counts_by_postal_code.head()


Postal_code,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Shop & Service,Travel & Transport
M1B,0.013699,0.000352,0.0,0.003984,0.001916,0.0
M1C,0.0,0.001057,0.0,0.001992,0.0,0.0
M1E,0.0,0.00141,0.003311,0.00996,0.007663,0.005814
M1G,0.0,0.003877,0.0,0.003984,0.004789,0.0
M1H,0.0,0.00564,0.0,0.003984,0.004789,0.005814
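
To make the normalisation concrete, here is a toy version of the same pipeline (one-hot encode, sum per postal code, divide by the city-wide total of each category). The five venues and the shortened category names are invented for illustration; only the sequence of pandas calls mirrors the cells above.

```python
import pandas as pd

# Five made-up venues in two postal codes
venues_toy = pd.DataFrame(
    {
        "Postal_code": ["M1B", "M1B", "M1C", "M1C", "M1C"],
        "Parent Category": ["Food", "Arts", "Food", "Food", "Shop"],
    }
)

# One column per category, one row per venue
one_hot_toy = pd.get_dummies(venues_toy["Parent Category"])
one_hot_toy.insert(0, "Postal_code", venues_toy["Postal_code"])

# City-wide number of venues in each category
totals_toy = one_hot_toy.drop(columns="Postal_code").sum(axis=0)

# Venues per postal code, expressed as a share of the city-wide total
shares_toy = one_hot_toy.groupby("Postal_code").sum().div(totals_toy)
print(shares_toy)
```

After this step every column sums to 1, so each value reads as "the fraction of all venues of this type that sit in this postal code". That puts a rare category on the same footing as Food, at the price of making categories with few venues noisier.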


Prepare the data for KMeans (already imported at the top of the notebook)

In [52]:
X = toronto_venues_counts_by_postal_code.values
X

array([[0.01369863, 0.00035249, 0.        , 0.00398406, 0.00191571,
        0.        ],
       [0.        , 0.00105746, 0.        , 0.00199203, 0.        ,
        0.        ],
       [0.        , 0.00140994, 0.00331126, 0.00996016, 0.00766284,
        0.00581395],
       [0.        , 0.00387734, 0.        , 0.00398406, 0.00478927,
        0.        ],
       [0.        , 0.00563976, 0.        , 0.00398406, 0.00478927,
        0.00581395],
       [0.        , 0.00317237, 0.        , 0.        , 0.00095785,
        0.00581395],
       [0.        , 0.00317237, 0.        , 0.        , 0.00670498,
        0.03488372],
       [0.        , 0.00352485, 0.00331126, 0.00398406, 0.00287356,
        0.03488372],
       [0.        , 0.0024674 , 0.        , 0.00398406, 0.00670498,
        0.        ],
       [0.00456621, 0.00035249, 0.        , 0.01394422, 0.        ,
        0.00581395],
       [0.        , 0.00528728, 0.        , 0.00398406, 0.00478927,
        0.        ],
       [0.        , 0

One of the main issues with KMeans is that the user has to set the expected number of clusters. Given that we only have 6 categories left, it is probably wisest not to have too many clusters. Let's iterate over this parameter and collect a new vector of labels for each iteration.

In [53]:
cluster_sizes = [2, 3, 4, 5]
labels = {}
for k in cluster_sizes:
    k_means = KMeans(init = "k-means++", n_clusters = k, n_init = 12)
    k_means.fit(X)
    labels[k] = k_means.labels_
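
I did not do this in the notebook, but one common way to make the choice of `k` less subjective is to compare a quality score, such as the silhouette score, across the candidate values. The sketch below runs on synthetic data (three well-separated blobs, not the Toronto matrix `X`), and it passes `random_state` so that the label numbering is repeatable between runs; both are assumptions for the sake of illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated 2-D blobs of 30 points each
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

scores = {}
for k in [2, 3, 4, 5]:
    # random_state fixes the initialisation, so cluster IDs do not shuffle between runs
    model = KMeans(init="k-means++", n_clusters=k, n_init=12, random_state=0)
    labels_k = model.fit_predict(X_demo)
    # Mean silhouette: close to 1 for compact, well-separated clusters
    scores[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real venue data the scores are unlikely to be this clear-cut, so I would treat them as one input alongside the cluster means rather than as a verdict.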

Now, let's populate the dataframe with the labels from each iteration and have a look at the resulting clusters. Grouping the dataframe by label and calculating the mean will help here.

In [55]:
for k in cluster_sizes:
    toronto_venues_counts_by_postal_code["Cluster ID"] = labels[k]
    means_df = toronto_venues_counts_by_postal_code.groupby("Cluster ID").mean() * 100
    print(f"Number of clusters: {k}")
    print(means_df.to_string())

Number of clusters: 2
            Arts & Entertainment      Food  Nightlife Spot  Outdoors & Recreation  Shop & Service  Travel & Transport
Cluster ID                                                                                                           
0                       2.886497  2.039378        2.802744               1.472681        1.457307            1.764950
1                       0.259163  0.579695        0.290854               0.794121        0.799938            0.683532
Number of clusters: 3
            Arts & Entertainment      Food  Nightlife Spot  Outdoors & Recreation  Shop & Service  Travel & Transport
Cluster ID                                                                                                           
0                       0.196698  0.428947        0.173204               0.677291        0.615974            0.652952
1                       1.674277  1.943890        2.330145               1.453445        1.851852            0.710594
2           

So there is an interesting pattern emerging already with just 2 clusters. There are lots of neighbourhoods where there is just a smattering of stuff - some shopping, some food, some entertainment, but nothing outstanding. Basically - a "Boringville", where people live their lives in highrises or townhouses. And then there are these other neighbourhoods, that have the rest of it. Increasing the number of clusters adds flavour to those other neighbourhoods: e.g., entertainment and nightlife are in different areas from shopping. 

In [56]:
# Drop postal codes that returned no venues (e.g. "M1X"), so that the rows line up with the cluster labels.
missing_codes = pd.Index(toronto_work_df["Postal_code"]).difference(toronto_venues_counts_by_postal_code.index)
if len(missing_codes) > 0:
    toronto_work_df = toronto_work_df[~toronto_work_df["Postal_code"].isin(missing_codes)].copy()

For the final map, I chose 5 clusters, although it is debatable if my definition of "Transit city" is skewed by the amount of data in the initial result set.  
Add the labels to the dataframe with coordinates for each postal code.

In [57]:
toronto_work_df["Cluster ID"] = labels[5]
toronto_work_df.head()

Unnamed: 0,Postal_code,Borough,Neighbourhood,Latitude,Longitude,Cluster ID
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517,0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725,0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193,0
3,M1G,Scarborough,Woburn,43.768369,-79.21759,0
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944,0
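
Note that the assignment above is positional: it relies on the two dataframes having the same postal codes in the same order. A more defensive variant, sketched here on made-up miniature data, attaches the labels to the postal-code index first and then joins by postal code. The names `counts_demo`, `work_demo` and `demo_labels` are hypothetical stand-ins for the real objects.

```python
import pandas as pd

# Stand-in for toronto_venues_counts_by_postal_code (indexed by postal code)
counts_demo = pd.DataFrame(
    {"Food": [0.2, 0.5, 0.3]},
    index=pd.Index(["M1B", "M1C", "M1E"], name="Postal_code"),
)
demo_labels = [0, 1, 0]  # one cluster label per row of counts_demo

# Stand-in for toronto_work_df; "M1X" has no venues and therefore no label
work_demo = pd.DataFrame({"Postal_code": ["M1B", "M1C", "M1E", "M1X"]})

# Tie each label to its postal code, then look labels up by postal code
label_by_code = pd.Series(demo_labels, index=counts_demo.index)
work_demo["Cluster ID"] = work_demo["Postal_code"].map(label_by_code)

# Postal codes without venues end up as NaN and can be dropped explicitly
clustered_demo = work_demo.dropna(subset=["Cluster ID"]).astype({"Cluster ID": int})
print(clustered_demo)
```

With this approach a missing or reordered postal code shows up as a `NaN` rather than silently shifting every label by one row.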


Check the means for the six categories in each cluster. 

In [58]:
toronto_venues_counts_by_postal_code.groupby("Cluster ID").mean() * 100

Cluster ID,Arts & Entertainment,Food,Nightlife Spot,Outdoors & Recreation,Shop & Service,Travel & Transport
0,0.220438,0.411435,0.154145,0.662866,0.584621,0.410986
1,1.927955,1.973916,4.157469,1.460823,1.628352,0.516796
2,4.200913,1.963342,2.582781,1.673307,0.996169,3.837209
3,0.0,0.574047,0.331126,0.796813,0.875753,2.657807
4,1.547438,1.928876,1.416483,1.449757,1.963602,0.807494


Come up with creative names for each cluster. The arts and entertainment venues concentrate in cluster 2, while nightlife is more active in cluster 1. Cluster 0 is clearly a residential area - covering the basics of other activities, but just about. Cluster 3 seems to have an unusual difference from cluster 0 in the fact that all sorts of transit solutions are more prevalent (bus stops, stations, car rentals), but it doesn't have any fun going for it (e.g., 0 for entertainment), so it can't really be lumped in with cluster 2, also high on bus stops. This leaves cluster 4, which I called "Shoptown", but it clearly is more than that - also relatively rich in food venues, nightlife, etc.

In [64]:
# NB! These labels need to be rearranged depending on how the clustering algorithm assigns labels. 
# Probably better to write a function to do so based on the actual data, but I must submit this notebook today!
cluster_labels = [
    "Boringville", 
    "Nightlife", 
    "Entertainment district", 
    "Transit city",
    "Shoptown", 
]
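
For what it's worth, here is a rough sketch of the kind of function mentioned in the comment above. It applies one simple rule, which is my assumption rather than a tested method: the cluster with the lowest overall mean becomes "Boringville", and every other cluster is named after its dominant category. The means are copied from the table above; the nicknames for Food and Outdoors & Recreation are hypothetical, since no cluster was given those names here.

```python
import pandas as pd

# Cluster means (x100), copied from the table of means for the 5-cluster run
means_demo = pd.DataFrame(
    {
        "Arts & Entertainment": [0.220438, 1.927955, 4.200913, 0.0, 1.547438],
        "Food": [0.411435, 1.973916, 1.963342, 0.574047, 1.928876],
        "Nightlife Spot": [0.154145, 4.157469, 2.582781, 0.331126, 1.416483],
        "Outdoors & Recreation": [0.662866, 1.460823, 1.673307, 0.796813, 1.449757],
        "Shop & Service": [0.584621, 1.628352, 0.996169, 0.875753, 1.963602],
        "Travel & Transport": [0.410986, 0.516796, 3.837209, 2.657807, 0.807494],
    }
)
means_demo.index.name = "Cluster ID"

# Nickname for each dominant category ("Foodie central" and "Parkland" are made up)
nicknames = {
    "Arts & Entertainment": "Entertainment district",
    "Food": "Foodie central",
    "Nightlife Spot": "Nightlife",
    "Outdoors & Recreation": "Parkland",
    "Shop & Service": "Shoptown",
    "Travel & Transport": "Transit city",
}


def name_clusters(means_df, nicknames, quiet_name="Boringville"):
    """Return one name per cluster, ordered by cluster ID."""
    # The cluster with the least of everything is the quiet residential one
    quiet_cluster = means_df.sum(axis=1).idxmin()
    # Every other cluster is named after the category with the highest mean
    names = means_df.idxmax(axis=1).map(nicknames)
    names.loc[quiet_cluster] = quiet_name
    return names.sort_index().tolist()


auto_labels = name_clusters(means_demo, nicknames)
print(auto_labels)
```

The rule is crude: two clusters with the same dominant category would get the same name, so the result still needs a sanity check against the table of means.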

Redo the map of Toronto with labels coloured by cluster.

In [65]:
cluster_map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11, tiles='cartodbpositron')
# add boroughs (old municipalities) to map
colormap = {
    'YORK': '#b3e2cd',
    'SCARBOROUGH': '#fdcdac',
    'NORTH YORK': '#cbd5e8',
    'TORONTO': '#f4cae4',
    'ETOBICOKE': '#e6f5c9',
    'EAST YORK': '#fff2ae'
}

cluster_colors = [
    '#7fc97f',
    '#beaed4',
    '#fdc086',
    '#ffff99',
    '#386cb0'
]
style_function = lambda x: {
    'fillColor': colormap[x['properties']['AREA_NAME']],
    'color': "#aaaaaa"
}
for feature in boroughs_geoJSON['features']:
    
    folium.GeoJson(
        feature,
        style_function=style_function,
        name='geojson'
    ).add_to(cluster_map_toronto)
    
    label = feature['properties']['AREA_NAME']
    label_color = Color(colormap[label])
    label_color.luminance *= 0.3
    lat = feature['properties']['LATITUDE']
    lng = feature['properties']['LONGITUDE']
    folium.Marker(
        location=[lat,lng],
        icon=folium.DivIcon(html=f"""<div style="color: {label_color.hex_l}; text-align: center">{label}</div>""")
    ).add_to(cluster_map_toronto)
        
# add a coloured circle for each postal code area, with colour determined by the cluster it belongs to
# add the "creative" cluster name to each label.
for lat, lng, borough, neighborhood, postcode, cluster_id in zip(toronto_work_df['Latitude'], 
                                                     toronto_work_df['Longitude'], 
                                                     toronto_work_df['Borough'], 
                                                     toronto_work_df['Neighbourhood'],
                                                     toronto_work_df['Postal_code'],
                                                     toronto_work_df['Cluster ID']
                                                                ):
    label = '{} ({}) {}'.format(postcode, cluster_labels[cluster_id], neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=cluster_colors[cluster_id],
        fill=True,
        fill_color=cluster_colors[cluster_id],
        fill_opacity=0.3,
        parse_html=False
    ).add_to(cluster_map_toronto)  
    

title_html = '<h3 align="center" style="font-size:20px"><b>Toronto boroughs and clusters of postal codes</b></h3>'
    
cluster_map_toronto.get_root().html.add_child(folium.Element(title_html))

cluster_map_toronto

And here it is, another map of the City of Toronto. Boringville comprises essentially all the residential areas in the outer boroughs, while all the fun is happening in the original city of Toronto (all the entertainment and nightlife, and most of the shopping).

## End of Part 3