# Peer-graded Assignment: Segmenting and Clustering Neighbourhoods in Toronto
---

# Part 1

Use the Notebook to build the code to scrape data from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. Then process, clean, and transform the data, and build a dataframe of the postal code of each neighbourhood along with the borough name and neighbourhood name.


First, we need to read the data from the Wikipedia page using `pandas.read_html()`.

In [1]:
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', match = 'Neighbourhood')
# the page contains several tables, so we match on the text 'Neighbourhood' to select the one we want

print(type(tables)) # read_html returns a list of dataframes
print('Number of tables collected: {}'.format(len(tables)))  # with the text matching, the list holds a single table

<class 'list'>
Number of tables collected: 1


In [2]:
# check the table we got

raw_data = tables[0]
raw_data.info()
raw_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Postal Code    180 non-null    object
 1   Borough        180 non-null    object
 2   Neighbourhood  180 non-null    object
dtypes: object(3)
memory usage: 4.3+ KB


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The dataframe has 3 columns (Postal Code, Borough, and Neighbourhood) and 180 rows, so we have collected all the data we need from the page. We will only process the cells that have an assigned borough and ignore cells with a borough that is 'Not assigned'.

In [3]:
# drop all rows with a Borough that is 'Not assigned'
df1 = raw_data[raw_data['Borough'] != 'Not assigned']
df1.shape

(103, 3)

More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, we notice that M5A has two neighbourhoods, Harbourfront and Regent Park. These two are combined into one row with the neighbourhoods separated with a comma.
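
The Wikipedia table already lists all neighbourhoods of a postal code in a single cell, so no extra step is needed here. If a source instead gave one neighbourhood per row, a grouping step along the following lines could combine them. This is a minimal sketch on a hand-made toy table (the one-row-per-neighbourhood layout is an assumption for illustration, not the scraped data):

```python
import pandas as pd

# Toy input with one neighbourhood per row (illustrative layout, not the scraped table)
rows = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Join the neighbourhoods that share a postal code and borough into one comma-separated cell
combined = (
    rows.groupby(['Postal Code', 'Borough'], as_index=False, sort=False)['Neighbourhood']
        .agg(', '.join)
)
print(combined)
```

With `sort=False` the groups keep the order in which the postal codes first appear.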

If a cell has a borough but a 'Not assigned' neighbourhood, then the neighbourhood will be the same as the borough.

In [4]:
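# iterrows() yields copies, so assigning to `row` would not change df1; use a vectorised replacement instead
unassigned = df1['Neighbourhood'] == 'Not assigned'
df1 = df1.assign(Neighbourhood = df1['Neighbourhood'].mask(unassigned, df1['Borough']))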

In [5]:
# Now we reset the index of our dataframe
filtered_data = df1.reset_index(drop = True)
filtered_data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [6]:
filtered_data.shape

(103, 3)

---

# Part 2

We now have a dataframe of the postal code of each neighbourhood along with the borough name and neighbourhood name. To use the Foursquare location data, we also need the latitude and longitude coordinates of each group of neighbourhoods (each Postal Code).

In this part, we will combine the geospatial data with our previous data to create a dataframe that has the following columns: 'Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'.

We will first load the csv file that has the geographical coordinates of each postal code from http://cocl.us/Geospatial_data.

In [7]:
geospatialData = pd.read_csv('https://cocl.us/Geospatial_data')
geospatialData.info()
geospatialData.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Postal Code  103 non-null    object 
 1   Latitude     103 non-null    float64
 2   Longitude    103 non-null    float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, we merge the Borough and Neighbourhood data with the geographical coordinates by matching on the Postal Code.

In [8]:
df_Toronto = pd.merge(filtered_data, geospatialData, on = 'Postal Code')
df_Toronto.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
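
`pd.merge` performs an inner join by default, so a postal code that is missing from either table would be dropped silently. A hedged sketch of how one might check for that on toy data, using the `indicator` and `validate` options of `pd.merge` (the code 'M0X' and its borough are made up for the example):

```python
import pandas as pd

# Toy stand-ins for filtered_data and geospatialData; 'M0X' is a made-up code with no coordinates
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Nowhere']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every row of `left`; `_merge` records where each row was found,
# and validate='one_to_one' raises if a postal code is duplicated on either side
checked = pd.merge(left, right, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.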


---

# Part 3

Now, we will explore and cluster the neighbourhoods in Toronto, replicating the analysis we did on the New York City data.

Use the geopy library to get the latitude and longitude values of Toronto.

In [9]:
#!pip install geopy # uncomment this line if needed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [10]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Create a map of Toronto with neighbourhoods superimposed on top. Since some neighbourhoods are grouped together under the same Postal Code and geographical coordinates, we add one marker per Postal Code, so the label of a marker may list several neighbourhoods.

In [11]:
!pip install folium
import folium # map rendering library



In [12]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, borough, neighbourhood, postalcode in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Borough'], df_Toronto['Neighbourhood'], df_Toronto['Postal Code']):
    label = '({}), {} {}'.format(neighbourhood, borough, postalcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
                        [lat, lng],
                        radius = 5,
                        popup = label,
                        color = 'blue',
                        fill = True,
                        fill_color = '#3186cc',
                        fill_opacity = 0.7,
                        parse_html = False).add_to(map_Toronto)  
    
map_Toronto

Next, we are going to start utilizing the Foursquare API to explore the neighbourhoods and segment them.

In [13]:
import requests # library to handle requests

In [14]:
# The code was removed by Watson Studio for sharing.
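
The removed cell presumably defined the names `CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, and `LIMIT` that the request URL below relies on. The original values are not shown, so the following is only a sketch of one way to define them without putting credentials in the notebook. The environment variable names and the values of `VERSION` and `LIMIT` are assumptions:

```python
import os

# Assumption: credentials are supplied through environment variables with these hypothetical names
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

VERSION = '20180605'  # assumed API version date (YYYYMMDD); the original value is not shown
LIMIT = 100           # assumed maximum number of venues per request; the original value is not shown
```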

Here, we define a function to repeat the process of retrieving venues for each Postal Code (group of neighbourhoods) in Toronto.

In [15]:
# neighbourhoods with the same Postal Code are grouped together and handled as one. 
def getNearbyVenues(codes, latitudes, longitudes, radius = 500):
    
    venues_list=[]
    for code, lat, lng in zip(codes, latitudes, longitudes):
        print(code)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            code, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
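
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so it relies on every response containing a group and every venue having at least one category. A hedged sketch of a more tolerant parsing step is shown below. The helper name `extract_venues` is hypothetical, and the sample payload is hand-written in the shape the function above expects, not real API output:

```python
def extract_venues(payload, code, lat, lng):
    """Return one tuple per venue, tolerating missing groups or categories."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((code, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written payload (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.7519, 'lng': -79.3321},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.7520, 'lng': -79.3331},
               'categories': []}},
]}]}}

print(extract_venues(sample, 'M3A', 43.753259, -79.329656))
print(extract_venues({'response': {}}, 'M9A', 43.667856, -79.532242))  # prints []
```

Separating parsing from the HTTP call also makes the logic testable without network access or credentials.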

Now we will run the above function on each Postal Code (group of neighbourhoods) and create a new dataframe called Toronto_venues.

In [16]:
Toronto_venues = getNearbyVenues(codes = df_Toronto['Postal Code'],
                                 latitudes = df_Toronto['Latitude'],
                                 longitudes = df_Toronto['Longitude']
                                )

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M5K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z


Check the size of the resulting dataframe.

In [17]:
print(Toronto_venues.shape)
Toronto_venues.head()

(2141, 7)


Unnamed: 0,Postal Code,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,M4A,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


Check how many venues were returned for each group of neighbourhoods (each Postal Code).

In [18]:
Toronto_venues.groupby('Postal Code').count()

Postal Code,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,1,1,1,1,1,1
M1C,2,2,2,2,2,2
M1E,8,8,8,8,8,8
M1G,4,4,4,4,4,4
M1H,8,8,8,8,8,8
...,...,...,...,...,...,...
M9N,2,2,2,2,2,2
M9P,6,6,6,6,6,6
M9R,4,4,4,4,4,4
M9V,10,10,10,10,10,10


In [19]:
Toronto_venues.groupby('Postal Code').count().shape

(100, 6)

Only 100 of the 103 Postal Codes returned any venues.
The following Postal Codes (groups of neighbourhoods) are missing:

    5 M9A Etobicoke Islington Avenue, Humber Valley Village 43.667856 -79.532242
    95 M1X Scarborough Upper Rouge 43.836125 -79.205636
    52 M2M North York Willowdale, Newtonbrook 43.789053 -79.408493

The response for these Postal Codes reads "There aren't a lot of results near you. Try something more general, reset your filters, or expand the search area". Since there is no venue data for these neighbourhoods, we will ignore them and go ahead with the 100 groups collected.
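
Postal codes without venues can be identified by comparing the codes that went into the request loop with the codes that appear in the venue table. A minimal sketch on toy data follows. In the notebook the two series would be `df_Toronto['Postal Code']` and `Toronto_venues['Postal Code']`, and the short lists here are illustrative only:

```python
import pandas as pd

# Toy stand-ins: all requested codes, and the codes that appear in the venue table
all_codes = pd.Series(['M9A', 'M1B', 'M1X', 'M3A'], name='Postal Code')
codes_with_venues = pd.Series(['M1B', 'M3A', 'M3A'], name='Postal Code')

# Keep the requested codes that never show up among the returned venues
missing = all_codes[~all_codes.isin(codes_with_venues)].tolist()
print(missing)
```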

Let's find out how many unique categories appear among the returned venues.

In [20]:
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


### Analyze Each Group of Neighbourhoods (Postal Code)

In [21]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add Postal Code column back to dataframe
Toronto_onehot['Postal Code'] = Toronto_venues['Postal Code'] 

# move Postal Code column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Postal Code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# examine the new dataframe size
Toronto_onehot.shape

(2141, 274)

Next, let's group the rows by Postal Code and take the mean of each one-hot column, which gives the frequency of occurrence of each category.

In [23]:
Toronto_grouped = Toronto_onehot.groupby('Postal Code').mean().reset_index()
Toronto_grouped.head()

Unnamed: 0,Postal Code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
Toronto_grouped.shape

(100, 274)
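
To see why the mean of the one-hot columns is a frequency, here is a small sketch on toy data (the six venues are made up for illustration). Each row of the grouped result is the share of a postal code's venues that fall into each category, so the rows sum to 1:

```python
import pandas as pd

# Toy venue table: two venues in M3A, four in M4A
venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M4A', 'M4A', 'M4A', 'M4A'],
    'Venue Category': ['Park', 'Food & Drink Shop',
                       'Coffee Shop', 'Coffee Shop', 'Hockey Arena', 'Park'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# The mean of a 0/1 column within a group is the fraction of venues in that category
grouped = onehot.groupby('Postal Code').mean()
print(grouped)
```

In this toy table, 'Coffee Shop' accounts for 2 of the 4 venues in M4A, so its frequency there is 0.5.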

In [None]:
df_Toronto[['Postal Code', 'Neighbourhood']]

In [None]:

Toronto_grouped_neig = pd.merge(Toronto_grouped, df_Toronto[['Postal Code', 'Neighbourhood']], on = 'Postal Code')

In [None]:
Toronto_grouped_neig

In [None]:
Toronto_grouped_neig.shape

Now, we define a function that sorts the venue categories of a row by frequency in descending order and returns the most common ones.

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    return row_categories_sorted.index.values[0:num_top_venues]
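
As an illustration, the function can be applied to a single hand-made row laid out like a row of `Toronto_grouped` (the frequencies below are made up):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading 'Postal Code' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: postal code first, then one frequency per category
row = pd.Series({'Postal Code': 'M4A', 'Coffee Shop': 0.5, 'Park': 0.3,
                 'Hockey Arena': 0.2, 'Bar': 0.0})

top = return_most_common_venues(row, 2)
print(list(top))  # ['Coffee Shop', 'Park']
```

Categories with a frequency of 0 still fill the remaining slots when a postal code has fewer distinct categories than `num_top_venues`, so the lower ranks for sparsely covered postal codes carry no information.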

Let's create a new dataframe and display the top 10 venues for each Postal Code (group of neighbourhoods).

In [31]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no suffix listed beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Postal Code'] = Toronto_grouped['Postal Code']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,M1C,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Breakfast Spot,Restaurant,Rental Car Location,Electronics Store,Medical Center,Intersection,Bank,Mexican Restaurant,Yoga Studio,Doner Restaurant
3,M1G,Coffee Shop,Korean BBQ Restaurant,Mexican Restaurant,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
4,M1H,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant


In [32]:
neighbourhoods_venues_sorted.shape

(100, 11)
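
The column labels above come from a three-element suffix list with a 'th' fallback, which is correct for the 10 columns used here but would produce labels such as '21th' for larger values of `num_top_venues`. A hypothetical helper, `ordinal`, sketches a general alternative (it is not part of the original notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 10th through 20th always take 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Postal Code'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(columns)
print(ordinal(11), ordinal(21), ordinal(22))  # 11th 21st 22nd
```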

### Clustering Neighbourhoods (grouped by Postal Code)

Run k-means to cluster the neighbourhoods into 8 clusters.

In [28]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 8

Toronto_grouped_clustering = Toronto_grouped.drop(columns = 'Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(Toronto_grouped_clustering)

# check the first few cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
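
The notebook fixes the number of clusters at 8 without comparing alternatives. One common way to sanity-check such a choice is to compare the k-means inertia (within-cluster sum of squares) across several values of k and look for the point where it stops dropping sharply. The sketch below does this on synthetic data only. In the notebook, `X` would be `Toronto_grouped_clustering`, and the range of k is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the frequency matrix: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])

# Fit k-means for several k and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On this synthetic data the inertia falls steeply up to k = 3 and changes little afterwards. Real venue data rarely shows such a clean elbow, so the result is a guide rather than a rule.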

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each Postal Code.

In [33]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = neighbourhoods_venues_sorted

# merge to add latitude/longitude for each row
Toronto_merged = Toronto_merged.join(df_Toronto.set_index('Postal Code'), on='Postal Code')

Toronto_merged.head()

Unnamed: 0,Cluster Labels,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Borough,Neighbourhood,Latitude,Longitude
0,3,M1B,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,0,M1C,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,1,M1E,Breakfast Spot,Restaurant,Rental Car Location,Electronics Store,Medical Center,Intersection,Bank,Mexican Restaurant,Yoga Studio,Doner Restaurant,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,1,M1G,Coffee Shop,Korean BBQ Restaurant,Mexican Restaurant,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Scarborough,Woburn,43.770992,-79.216917
4,1,M1H,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant,Scarborough,Cedarbrae,43.773136,-79.239476


In [34]:
Toronto_merged.shape

(100, 16)

In [41]:
# rearrange the columns
columns_order = list(Toronto_merged.columns[:2]) + list(Toronto_merged.columns[12:16]) + list(Toronto_merged.columns[2:12])
resultsToronto = Toronto_merged[columns_order]
type(resultsToronto)

pandas.core.frame.DataFrame

In [42]:
resultsToronto

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,3,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,0,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,1,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Breakfast Spot,Restaurant,Rental Car Location,Electronics Store,Medical Center,Intersection,Bank,Mexican Restaurant,Yoga Studio,Doner Restaurant
3,1,M1G,Scarborough,Woburn,43.770992,-79.216917,Coffee Shop,Korean BBQ Restaurant,Mexican Restaurant,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
4,1,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,4,M9N,York,Weston,43.706876,-79.518188,Park,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
96,1,M9P,Etobicoke,Westmount,43.696319,-79.532242,Intersection,Discount Store,Pizza Place,Sandwich Place,Chinese Restaurant,Coffee Shop,Donut Shop,Distribution Center,Dog Run,Doner Restaurant
97,1,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724,Park,Pizza Place,Sandwich Place,Bus Line,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
98,1,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437,Pizza Place,Grocery Store,Pharmacy,Fried Chicken Joint,Sandwich Place,Liquor Store,Beer Store,Fast Food Restaurant,Gluten-free Restaurant,Department Store


Finally, let's visualize the resulting clusters.

In [43]:
import folium
import numpy as np
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
latitude = 43.6534817
longitude = -79.3839347
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(resultsToronto['Latitude'], resultsToronto['Longitude'], resultsToronto['Neighbourhood'], resultsToronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's take a look at the most populated cluster (label=1) and see which venue categories distinguish it.

In [45]:
resultsToronto.loc[resultsToronto['Cluster Labels'] == 1, 
                   resultsToronto.columns[[3] + list(range(6, 16))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Guildwood, Morningside, West Hill",Breakfast Spot,Restaurant,Rental Car Location,Electronics Store,Medical Center,Intersection,Bank,Mexican Restaurant,Yoga Studio,Doner Restaurant
3,Woburn,Coffee Shop,Korean BBQ Restaurant,Mexican Restaurant,Yoga Studio,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
4,Cedarbrae,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant
5,Scarborough Village,Smoke Shop,Playground,Jewelry Store,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
6,"Kennedy Park, Ionview, East Birchmount Park",Department Store,Hobby Shop,Coffee Shop,Train Station,Event Space,Ethiopian Restaurant,Escape Room,Falafel Restaurant,Electronics Store,Dim Sum Restaurant
...,...,...,...,...,...,...,...,...,...,...,...
92,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",Pharmacy,Liquor Store,Pizza Place,Convenience Store,Beer Store,Coffee Shop,Café,Shopping Plaza,Drugstore,Distribution Center
93,Humber Summit,Pizza Place,Furniture / Home Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
96,Westmount,Intersection,Discount Store,Pizza Place,Sandwich Place,Chinese Restaurant,Coffee Shop,Donut Shop,Distribution Center,Dog Run,Doner Restaurant
97,"Kingsview Village, St. Phillips, Martin Grove ...",Park,Pizza Place,Sandwich Place,Bus Line,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
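
Which cluster holds the most postal codes can be checked by counting the labels. The sketch below uses only the first ten labels printed earlier. In the notebook, the full column `resultsToronto['Cluster Labels']` would be used instead:

```python
import pandas as pd

# First ten cluster labels from the k-means output shown earlier
labels = pd.Series([3, 0, 1, 1, 1, 1, 1, 1, 1, 1], name='Cluster Labels')

# Number of postal codes per cluster, largest first
counts = labels.value_counts()
print(counts)
print('Largest cluster:', counts.idxmax())
```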
