# Toronto Neighborhoods

## Scraping Neighborhood Data

This notebook scrapes the neighborhood data from Wikipedia.

The scraping is done using the BeautifulSoup library.

After checking the page, I assume the following:
 - There is only one table with the class _wikitable_ on the page.
 - The _Not assigned_ value is always written with the same casing.
 - There is only one row for each postal code.
 
 

> **Note** that the wiki page has been changed since the assignment was created. Now there is only one row for each postal code, and the different neighborhoods under the same postal code are separated by /.
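Since several neighborhoods can now share one table cell, it can help to see how such a cell breaks apart. The following is a minimal sketch, not part of the original notebook; `split_neighborhoods` is a hypothetical helper name, and the sample cell is taken from the table shown further down.

```python
def split_neighborhoods(cell):
    """Split an 'A / B / C' table cell into a list of neighborhood names."""
    return [part.strip() for part in cell.split('/') if part.strip()]


names = split_neighborhoods('Regent Park / Harbourfront')
print(names)     # the individual neighborhood names
print(names[0])  # the first name, which is what the geocoding step uses later
```

A cell with a single neighborhood simply yields a one-element list.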


In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
WIKI_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

rows = []

soup = BeautifulSoup(requests.get(WIKI_URL).text, 'html.parser')
for row in soup.select_one('table.wikitable').find_all('tr'):
    cols = [col.get_text().strip() for col in row.find_all('td')]
    # Skip header rows and postal codes without a borough
    if len(cols) < 3 or cols[1] == 'Not assigned':
        continue
    postal_code, borough, neighborhood = cols[:3]
    # Fall back to the borough name when no neighborhood is assigned
    if neighborhood == 'Not assigned':
        neighborhood = borough
    rows.append({'Postal code': postal_code, 'Borough': borough, 'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
hoods = pd.DataFrame(rows, columns=['Postal code', 'Borough', 'Neighborhood'])


In [3]:
hoods.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [4]:
hoods.shape

(103, 3)

In [5]:
!pip install geocoder
import geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [6]:
MAX_RETRIES = 10
def get_lat_lon(hood):
    
    retries = MAX_RETRIES
    location = None
    while retries > 0 and location is None:
        location = geocoder.osm('{}, Toronto, Ontario'.format(hood)).latlng
        retries -= 1
        
    if location is None:        
        return { 'Latitude': None, 'Longitude': None }

    return { 'Latitude': location[0], 'Longitude': location[1] }

> **Note:** I ended up using the OSM provider instead of Google, and the neighborhood names instead of the postal codes. With Google and postal codes, almost no results were returned even after retrying every data point 50 times. With OSM and neighborhood names only a few locations are missing, and those rows are removed from the data set.
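The retry logic can be separated from the geocoding call itself, which also makes it easy to add a pause between attempts (public geocoding services often rate-limit clients). This is only a sketch under assumptions: `retry`, `fetch` and `delay` are hypothetical names, and the lookup is replaced by a stand-in that needs no network access.

```python
import time


def retry(fetch, max_retries=10, delay=0.0):
    """Call fetch() until it returns a non-None value or the attempts run out."""
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:
            return result
        if delay and attempt < max_retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None


# Hypothetical stand-in for a geocoder lookup: fails twice, then succeeds.
answers = iter([None, None, [43.7588, -79.320197]])
location = retry(lambda: next(answers), max_retries=10)
print(location)
```

In the notebook's setting, `fetch` would wrap the `geocoder.osm(...).latlng` call from `get_lat_lon`.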

In [7]:
# Geocode the first name listed for each postal code.
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
lat_lon_df = pd.DataFrame(
    [get_lat_lon(hood.split('/')[0].strip()) for hood in hoods['Neighborhood']],
    columns=['Latitude', 'Longitude'],
)

hoods_with_loc = pd.concat([hoods, lat_lon_df], axis=1, sort=False)

original_size = hoods_with_loc.shape[0]

hoods_with_loc = hoods_with_loc.dropna()

print('Removed {} rows due to missing location data'.format(original_size - hoods_with_loc.shape[0]))

Removed 8 rows due to missing location data


In [8]:
hoods_with_loc.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.660706,-79.360457
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.722079,-79.437507
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.659659,-79.39034
5,M9A,Etobicoke,Islington Avenue,43.622575,-79.514215
6,M1B,Scarborough,Malvern / Rouge,43.809196,-79.221701
7,M3B,North York,Don Mills,43.775347,-79.345944
8,M4B,East York,Parkview Hill / Woodbine Gardens,43.653482,-79.383935
10,M6B,North York,Glencairn,43.708712,-79.440685


In [9]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    altair-4.1.0               |             py_1         614 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ------------------------------------------------------------
                       

## Creating a map of the neighborhoods

In [10]:
import html

toronto_coordinates = [ 43.717899, -79.6582408 ]
toronto_map = folium.Map(location=toronto_coordinates, zoom_start=12)

for lat, lon, neighborhood, postal_code, borough in zip(hoods_with_loc.Latitude, hoods_with_loc.Longitude, hoods_with_loc.Neighborhood, hoods_with_loc['Postal code'], hoods_with_loc.Borough):
    folium.features.Marker(
        [lat, lon],
        popup=folium.Popup('<h3>{}</h3><h4>{}, {}</h4>'.format(html.escape(neighborhood), postal_code, borough)),
    ).add_to(toronto_map)

toronto_map

In [11]:
# The code was removed by Watson Studio for sharing.

## Gathering the venues for each neighborhood
> **Note:** For easier usage I decided to use a Foursquare Python library that is recommended on [Foursquare](https://developer.foursquare.com/docs/places-api/libraries/).

In [12]:
!pip install foursquare
import foursquare

client = foursquare.Foursquare(client_id=FOURSQUARE_CLIENT_ID, client_secret=FOURSQUARE_CLIENT_SECRET)

Collecting foursquare
  Downloading https://files.pythonhosted.org/packages/16/c7/d51ecf7e06a75741a61ff752e5e010db8794ec0af01da98f42db7ab64ffe/foursquare-1%212020.1.30-py3-none-any.whl
Installing collected packages: foursquare
Successfully installed foursquare-1!2020.1.30


In [13]:
import requests
def get_venues(lat, lon, hood):
    url = 'https://api.foursquare.com/v2/venues/explore?ll={},{}&radius=500&time=any&day=any&v=20200411&client_id={}&client_secret={}'.format(lat, lon, FOURSQUARE_CLIENT_ID, FOURSQUARE_CLIENT_SECRET)
    response = requests.get(url).json()
    #response = client.venues.explore(params={'ll': '{},{}'.format(lat, lon), 'radius': radius, 'time': 'any', 'day': 'any'})
    columns = ['Neighborhood', 'Name', 'Category', 'Latitude', 'Longitude', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Icon']
    rows = []
    try:
        for v in response['response']['groups'][0]['items']:
            category = v['venue']['categories'][0]
            rows.append({
                'Neighborhood': hood,
                'Name': v['venue']['name'],
                'Category': category['pluralName'],
                'Latitude': v['venue']['location']['lat'],
                'Longitude': v['venue']['location']['lng'],
                'Neighborhood Latitude': lat,
                'Neighborhood Longitude': lon,
                'Icon': "{}64{}".format(category['icon']['prefix'], category['icon']['suffix'])
            })
    except (KeyError, IndexError):
        # Unexpected payload (for example an API error): print it for inspection
        print(response)
    # Build the frame from the collected rows (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=columns)
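`get_venues` reads the venue list from `response['response']['groups'][0]['items']`, so any payload without that structure ends up in the exception handler. As an alternative, the lookup can be written so that a missing key yields an empty list. This is a rough sketch only: `extract_items` and `first_category` are hypothetical helpers, and both sample payloads are made up for illustration rather than real API responses.

```python
def extract_items(response):
    """Return the venue items of an explore-style response, or [] if absent."""
    groups = response.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


def first_category(item):
    """Return the plural name of the first category, or None if there is none."""
    categories = item.get('venue', {}).get('categories', [])
    return categories[0]['pluralName'] if categories else None


# Made-up payloads: one with venues, one that looks like an error response.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'LCBO', 'categories': [{'pluralName': 'Liquor Stores'}]}},
    {'venue': {'name': 'Unlabelled', 'categories': []}},
]}]}}
error = {'meta': {'code': 429}}

print([first_category(item) for item in extract_items(sample)])
print(extract_items(error))
```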

In [14]:
venues_with_hoods = pd.DataFrame(columns=['Neighborhood', 'Name', 'Category', 'Latitude', 'Longitude', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Icon'])

for lat, lon, hood in zip(hoods_with_loc.Latitude, hoods_with_loc.Longitude, hoods_with_loc.Neighborhood):
    venues_with_hoods = pd.concat([venues_with_hoods, get_venues(lat, lon, hood)], axis=0)

venues_with_hoods.reset_index(inplace=True)
venues_with_hoods.head()

Unnamed: 0,Neighborhood,Name,Category,Latitude,Longitude,Neighborhood Latitude,Neighborhood Longitude,Icon
0,Parkwoods,Allwyn's Bakery,Caribbean Restaurants,43.75984,-79.324719,43.7588,-79.320197,https://ss3.4sqi.net/img/categories_v2/food/ca...
1,Parkwoods,LCBO,Liquor Stores,43.757774,-79.314257,43.7588,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/f...
2,Parkwoods,Petro-Canada,Gas Stations,43.75795,-79.315187,43.7588,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/g...
3,Parkwoods,Shoppers Drug Mart,Pharmacies,43.760857,-79.324961,43.7588,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/p...
4,Parkwoods,Pizza Pizza,Pizza Places,43.760231,-79.325666,43.7588,-79.320197,https://ss3.4sqi.net/img/categories_v2/food/pi...


## Exploring venues data

In [15]:
print('There are {} unique categories.'.format(venues_with_hoods.Category.unique().size))

There are 238 unique categories.


In [16]:
venues_with_hoods.describe(include='all')

Unnamed: 0,Neighborhood,Name,Category,Latitude,Longitude,Neighborhood Latitude,Neighborhood Longitude,Icon
count,1702,1702,1702,1702.0,1702.0,1702.0,1702.0,1702
unique,89,1083,238,,,,,181
top,Don Mills,Tim Hortons,Coffee Shops,,,,,https://ss3.4sqi.net/img/categories_v2/food/co...
freq,60,54,132,,,,,132
mean,,,,43.692586,-79.387777,43.692425,-79.387956,
std,,,,0.052136,0.069741,0.051773,0.069848,
min,,,,43.598181,-79.581093,43.600763,-79.576516,
25%,,,,43.650142,-79.420599,43.650099,-79.419526,
50%,,,,43.669165,-79.3906,43.670338,-79.390504,
75%,,,,43.744727,-79.361496,43.744039,-79.360457,


Checking how many venues each category has:

In [17]:
venues_with_hoods.groupby('Category').count().sort_values(by='Neighborhood')

Category,Neighborhood,Name,Latitude,Longitude,Neighborhood Latitude,Neighborhood Longitude,Icon
Accessories Stores,1,1,1,1,1,1,1
Lingerie Stores,1,1,1,1,1,1,1
Laundry Services,1,1,1,1,1,1,1
Laundromats,1,1,1,1,1,1,1
Laser Tag Places,1,1,1,1,1,1,1
Lakes,1,1,1,1,1,1,1
Kids Stores,1,1,1,1,1,1,1
Indie Theaters,1,1,1,1,1,1,1
Indie Movie Theaters,1,1,1,1,1,1,1
Indian Chinese Restaurants,1,1,1,1,1,1,1


In [39]:
cat_list = venues_with_hoods.groupby('Category').count().sort_values(by='Neighborhood')
unique_cat_list = cat_list[cat_list.Neighborhood < 3]
print('There are {} categories that appear fewer than 3 times in the dataset'.format(len(unique_cat_list)))

There are 38 categories that appear fewer than 3 times in the dataset


Many categories appear only once or twice in the whole dataset. Remove those, since they won't help us find similar neighborhoods.

In [100]:
for cat in unique_cat_list.index:
    venues_with_hoods = venues_with_hoods[venues_with_hoods.Category != cat]
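The same filter can be expressed without a loop by counting the categories once and keeping only the rows whose category is frequent enough. The sketch below uses a small made-up frame; `MIN_COUNT` and the toy data are assumptions for illustration, not values from the notebook's dataset.

```python
import pandas as pd

# Toy data (made up): one row per venue.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Category': ['Coffee Shops', 'Lakes', 'Coffee Shops', 'Banks', 'Coffee Shops', 'Banks'],
})

MIN_COUNT = 3  # keep categories that occur at least this many times

counts = venues['Category'].value_counts()
frequent = counts[counts >= MIN_COUNT].index
filtered = venues[venues['Category'].isin(frequent)]
print(filtered)
```

Here only the three coffee shop rows survive, because the other categories occur fewer than three times.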

Check the categories again:

In [101]:
venues_with_hoods.groupby('Category').count().sort_values(by='Neighborhood')

Category,level_0,index,Neighborhood,Name,Latitude,Longitude,Neighborhood Latitude,Neighborhood Longitude,Icon
Creperies,3,3,3,3,3,3,3,3,3
Steakhouses,3,3,3,3,3,3,3,3,3
Comfort Food Restaurants,3,3,3,3,3,3,3,3,3
Business Services,3,3,3,3,3,3,3,3,3
Bus Stops,3,3,3,3,3,3,3,3,3
Falafel Restaurants,3,3,3,3,3,3,3,3,3
Construction & Landscaping,3,3,3,3,3,3,3,3,3
New American Restaurants,3,3,3,3,3,3,3,3,3
Miscellaneous Shops,3,3,3,3,3,3,3,3,3
Snack Places,3,3,3,3,3,3,3,3,3


Checking how many venues each district has:

In [102]:
category_count = venues_with_hoods.groupby('Neighborhood').count()[['Category']]
category_count

Neighborhood,Category
Agincourt,11
Alderwood / Long Branch,8
Bathurst Manor / Wilson Heights / Downsview North,4
Bayview Village,12
Bedford Park / Lawrence Manor East,2
Berczy Park,26
Birch Cliff / Cliffside West,4
Brockton / Parkdale Village / Exhibition Place,18
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst Quay / South Niagara / Island airport,20
Cedarbrae,24


Some neighborhoods have fewer than 10 venues. Remove those, because the lack of data means we cannot classify them accurately.

In [107]:
remove_indexes = []
for index, venue in venues_with_hoods.iterrows():
    if category_count.loc[venue.Neighborhood].Category < 10:
        remove_indexes.append(index)
print("Removing", len(remove_indexes), "venues from", venues_with_hoods.shape[0])
venues_with_hoods_filtered = venues_with_hoods.drop(remove_indexes)
venues_with_hoods_filtered

Removing 134 venues from 1551


Unnamed: 0,level_0,index,Neighborhood,Name,Category,Latitude,Longitude,Neighborhood Latitude,Neighborhood Longitude,Icon
0,0,0,Parkwoods,Allwyn's Bakery,Caribbean Restaurants,43.759840,-79.324719,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/food/ca...
1,1,1,Parkwoods,LCBO,Liquor Stores,43.757774,-79.314257,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/f...
2,2,2,Parkwoods,Petro-Canada,Gas Stations,43.757950,-79.315187,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/g...
3,3,3,Parkwoods,Shoppers Drug Mart,Pharmacies,43.760857,-79.324961,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/p...
4,4,4,Parkwoods,Pizza Pizza,Pizza Places,43.760231,-79.325666,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/food/pi...
5,5,5,Parkwoods,TD Canada Trust,Banks,43.757569,-79.314976,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/f...
6,6,6,Parkwoods,Family Food Fair Convenience,Convenience Stores,43.760620,-79.324459,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/c...
7,7,8,Parkwoods,Parkwoods Village Centre,Shopping Malls,43.760735,-79.324873,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/m...
8,8,9,Parkwoods,Dollarama,Discount Stores,43.760341,-79.325519,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/shops/d...
9,9,11,Parkwoods,La Notre,Coffee Shops,43.760704,-79.325396,43.758800,-79.320197,https://ss3.4sqi.net/img/categories_v2/food/co...


In [108]:
hoods_aggregated = pd.get_dummies(venues_with_hoods_filtered[['Neighborhood', 'Category']], columns=['Category'], prefix='', prefix_sep='').groupby('Neighborhood').mean().reset_index()
hoods_aggregated.head()

Unnamed: 0,Neighborhood,American Restaurants,Art Galleries,Arts & Crafts Stores,Asian Restaurants,Athletics & Sports,BBQ Joints,Bakeries,Banks,Bars,...,Tibetan Restaurants,Toy / Game Stores,Trails,Train Stations,Vegetarian / Vegan Restaurants,Video Game Stores,Vietnamese Restaurants,Wine Bars,Women's Stores,Yoga Studios
0,Agincourt,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.038462,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,...,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0
3,Brockton / Parkdale Village / Exhibition Place,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.166667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0
4,CN Tower / King and Spadina / Railway Lands / ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.05
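Each value in this table is the share of a neighborhood's venues that fall into a category: one-hot encoding turns every venue into a row of 0s and 1s, and the group mean divides the count by the number of venues. Agincourt has 11 venues, so a category with a single venue gets 1/11 ≈ 0.0909. A plain-Python sketch of that arithmetic follows; the coffee shop filler rows are made up to reach 11 venues.

```python
from collections import Counter

# Categories of 11 venues in one neighborhood (the filler entries are made up).
categories = ['Asian Restaurants', 'Train Stations', 'Vietnamese Restaurants'] + ['Coffee Shops'] * 8

counts = Counter(categories)
total = len(categories)

# Share of venues per category, equivalent to the mean of the one-hot columns.
shares = {category: n / total for category, n in counts.items()}
print(round(shares['Asian Restaurants'], 6))
```

Because the shares of one neighborhood always sum to 1, neighborhoods with very different venue counts stay comparable.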


Prepare the data for clustering and run KMeans

In [109]:
from sklearn.cluster import KMeans

In [110]:
hoods_clustering = hoods_aggregated.drop('Neighborhood', axis=1)
hoods_clustering.head()

Unnamed: 0,American Restaurants,Art Galleries,Arts & Crafts Stores,Asian Restaurants,Athletics & Sports,BBQ Joints,Bakeries,Banks,Bars,Beer Bars,...,Tibetan Restaurants,Toy / Game Stores,Trails,Train Stations,Vegetarian / Vegan Restaurants,Video Game Stores,Vietnamese Restaurants,Wine Bars,Women's Stores,Yoga Studios
0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.038462,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.038462,...,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.166667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.05


In [111]:
K = 5
clusters = KMeans(n_clusters=K, random_state=20200411).fit(hoods_clustering)
clusters.labels_

array([4, 1, 0, 4, 4, 1, 4, 2, 0, 2, 2, 3, 3, 0, 4, 0, 1, 1, 4, 4, 1, 0,
       3, 0, 2, 0, 1, 4, 2, 4, 4, 0, 4, 1, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1,
       1, 2, 1, 0, 0, 1, 1, 1, 1, 1, 3, 3], dtype=int32)
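Before looking at the clusters in detail, it can be useful to check how many neighborhoods fall into each one. The sketch below is not part of the original notebook; it copies the label array printed above into a plain list and tallies it.

```python
from collections import Counter

# Labels copied from the KMeans output above.
labels = [4, 1, 0, 4, 4, 1, 4, 2, 0, 2, 2, 3, 3, 0, 4, 0, 1, 1, 4, 4, 1, 0,
          3, 0, 2, 0, 1, 4, 2, 4, 4, 0, 4, 1, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1,
          1, 2, 1, 0, 0, 1, 1, 1, 1, 1, 3, 3]

sizes = Counter(labels)
for cluster in sorted(sizes):
    print('Cluster', cluster, 'has', sizes[cluster], 'neighborhoods')
```

Inside the notebook the same tally could be taken directly from `clusters.labels_`.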

Let's check the clusters

In [112]:
hoods_clustered = hoods_aggregated.copy()
hoods_clustered['Cluster'] = pd.DataFrame(clusters.labels_)
hoods_clustered['Longitude'] = hoods_with_loc['Longitude']
hoods_clustered['Latitude'] = hoods_with_loc['Latitude']
hoods_clustered = hoods_clustered.dropna()
hoods_clustered


Unnamed: 0,Neighborhood,American Restaurants,Art Galleries,Arts & Crafts Stores,Asian Restaurants,Athletics & Sports,BBQ Joints,Bakeries,Banks,Bars,...,Train Stations,Vegetarian / Vegan Restaurants,Video Game Stores,Vietnamese Restaurants,Wine Bars,Women's Stores,Yoga Studios,Cluster,Longitude,Latitude
0,Agincourt,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,...,0.090909,0.0,0.0,0.090909,0.0,0.0,0.0,4,-79.320197,43.7588
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,-79.311189,43.732658
2,Berczy Park,0.0,0.038462,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,...,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0,-79.360457,43.660706
3,Brockton / Parkdale Village / Exhibition Place,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.166667,...,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,4,-79.437507,43.722079
4,CN Tower / King and Spadina / Railway Lands / ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.05,0.0,0.0,0.0,0.0,0.0,0.05,4,-79.39034,43.659659
5,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,1,-79.514215,43.622575
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,-79.221701,43.809196
7,Church and Wellesley,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,-79.345944,43.775347
8,Commerce Court / Victoria Hotel,0.0,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,-79.383935,43.653482
10,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.035714,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,-79.440685,43.708712
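One thing to watch in the cell above: assigning a column taken from another DataFrame aligns on the row index, not on the neighborhood name. `hoods_aggregated` is sorted by name and re-indexed, so the coordinates may end up next to a different neighborhood (Agincourt is listed here with the coordinates that Parkwoods had earlier). The sketch below illustrates the difference on made-up toy frames and shows a join on the name instead; it assumes neighborhood names are unique in the coordinates table.

```python
import pandas as pd

# Toy frames (values made up): the two tables list the names in different order.
coords = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Agincourt'],
    'Latitude': [1.0, 2.0],
    'Longitude': [-1.0, -2.0],
})
features = pd.DataFrame({'Neighborhood': ['Agincourt', 'Parkwoods'], 'Cluster': [4, 1]})

# Index-based assignment: row 0 of features receives row 0 of coords.
by_index = features.copy()
by_index['Latitude'] = coords['Latitude']

# Key-based join: each row receives the coordinates of its own neighborhood.
by_name = features.merge(coords, on='Neighborhood', how='left')

print(by_index)
print(by_name)
```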


Let's gather the top 10 categories for each neighborhood and tally them per cluster

In [113]:
def get_top_categories(hood, n=10):
    # Drop the non-category fields, then return the names of the n largest shares
    return hood.drop(['Neighborhood', 'Cluster', 'Longitude', 'Latitude']).sort_values(ascending=False).head(n).index.values


In [118]:
from collections import Counter
for cluster in hoods_clustered['Cluster'].unique():
    top_categories = []
    for i, hood in hoods_clustered[hoods_clustered['Cluster'] == cluster].iterrows():
        top_categories.extend(get_top_categories(hood))
    print('Top categories for Cluster', cluster)
    print(Counter(top_categories).most_common(10))
        
        

Top categories for Cluster 4
[('Coffee Shops', 8), ('Cafés', 5), ('Korean Restaurants', 4), ('Vietnamese Restaurants', 4), ('Pizza Places', 4), ('Restaurants', 4), ('Japanese Restaurants', 4), ('Chinese Restaurants', 3), ('Bars', 3), ('Parks', 3)]
Top categories for Cluster 1
[('Coffee Shops', 10), ('Pizza Places', 8), ('Sandwich Places', 7), ('Grocery Stores', 7), ('Fast Food Restaurants', 5), ('Discount Stores', 5), ('Burger Joints', 5), ('Pharmacies', 5), ('Fried Chicken Joints', 5), ('Banks', 4)]
Top categories for Cluster 0
[('Cafés', 9), ('Restaurants', 6), ('Coffee Shops', 6), ('Japanese Restaurants', 6), ('Bakeries', 5), ('Italian Restaurants', 4), ('Bookstores', 4), ('Hotels', 3), ('Bars', 3), ('Mexican Restaurants', 3)]
Top categories for Cluster 2
[('Sushi Restaurants', 5), ('Italian Restaurants', 5), ('Indian Restaurants', 3), ('Coffee Shops', 3), ('Pubs', 3), ('Gastropubs', 2), ('Pizza Places', 2), ('Pharmacies', 2), ('Cafés', 2), ('Dessert Shops', 2)]
Top categories for C

The above lists give us some idea about the clusters.
For example, Cluster 4's top categories include Asian restaurants, and Cluster 1's top categories are mostly fast food joints.

Let's show the clusters on a map

In [115]:
import matplotlib.cm as cm
import matplotlib.colors as colors

cluster_map = folium.Map(toronto_coordinates, zoom_start=12)

# set color scheme for the clusters
x = np.arange(K)
ys = [i + x + (i*x)**2 for i in range(K)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]



for hood, lat, lon, cluster in zip(hoods_clustered['Neighborhood'], hoods_clustered['Latitude'], hoods_clustered['Longitude'], hoods_clustered['Cluster'], ):
    label = folium.Popup(hood + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=1).add_to(cluster_map)

    

cluster_map

In the above map we can see how the 5 clusters are positioned inside Toronto.
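The map only needs one distinct hex color per cluster. The notebook derives these from matplotlib's rainbow colormap; as an alternative, the sketch below swaps in the standard library's `colorsys` module and spaces the hues evenly. `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(k):
    """Return k hex colors with evenly spaced hues."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(5)
print(palette)
```

The resulting list can be indexed by cluster label in the same way as `rainbow[cluster]`.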