# Segmenting and clustering neighborhoods in Toronto

This notebook is **part 3** of the course's third week's assignment. It builds on parts 1 and 2 of the assignment by loading their results from files and redoing some of the steps.


# Assignment, Part 3

Assignment description:

Continuing from part 2...

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.  Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

## Step 0 - Import libraries


In [1]:
import pandas as pd
import numpy as np

import requests

import os.path

# Comment / uncomment next line as needed.
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Comment / uncomment next line as needed.
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries imported.")

Libraries imported.


## Step 1 - Load previous part's data

Load the Toronto Postal Code + neighborhood data from part 1 and geo coordinates data from part 2 into a dataframe called *toronto_geo_df*.

In [2]:
toronto_data_filename = "toronto_postal_cleaned.csv"
toronto_df = pd.read_csv(toronto_data_filename)

print("\n\nRead", toronto_df.shape[0], "rows of data into toronto_df")

geo_df_filename = "Geospatial_Coordinates.csv"
geo_df = pd.read_csv(geo_df_filename)

print("Read", geo_df.shape[0], "rows of data into geo_df")

# Fix geo_df data, column name must be changed from 'Postal Code' to 'PostalCode'
geo_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

# Merge the two dataframes together
toronto_geo_df = pd.merge(toronto_df, geo_df, on='PostalCode')
print("Merged", toronto_geo_df.shape[0], "rows of data into toronto_geo_df\n\n")
toronto_geo_df.head()




Read 103 rows of data into toronto_df
Read 103 rows of data into geo_df
Merged 103 rows of data into toronto_geo_df




Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
3,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
4,M3B,North York,Don Mills North,43.745906,-79.352188
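
The rename-then-merge step can be checked on a toy example. The sketch below uses made-up stand-ins for `toronto_df` and `geo_df` (two rows only, not the course files). It also passes `validate="one_to_one"`, an optional `pd.merge` argument that this notebook does not use, as one possible way to fail early if a postal code appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for toronto_df and geo_df; the rows are illustrative only.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                     "Neighborhood": ["Parkwoods", "Victoria Village"]})
right = pd.DataFrame({"Postal Code": ["M4A", "M3A"],
                      "Latitude": [43.725882, 43.753259],
                      "Longitude": [-79.315572, -79.329656]})

# Align the key column name, then join on it.
right = right.rename(columns={"Postal Code": "PostalCode"})
merged = pd.merge(left, right, on="PostalCode", validate="one_to_one")

print(merged)
```

Because the default join is an inner join, comparing the row counts before and after the merge (as the cell above does with its `print` calls) shows whether any postal code failed to match.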


## Step 2 - Load Toronto venues data from FourSquare 

NOTE: The venue data is read from a file **if it exists**. Whenever venues are actually downloaded from FourSquare, they are saved to a file, so the first run of this notebook downloads (and saves) the data and later runs reuse it. To force a fresh download, delete the file named by the variable FOURSQUARE_DATA_FILENAME below.
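
The caching idea described in the note can be sketched as a small helper. The names `load_or_fetch` and `fake_download` are hypothetical and only for illustration; the notebook itself does the same thing inline with `os.path.isfile`.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Return the cached CSV at `path` if it exists, else call `fetch()` and cache it."""
    if os.path.isfile(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)  # index=False avoids an extra unnamed column on reload
    return df


calls = []


def fake_download():
    # Stand-in for the real FourSquare download; records each invocation.
    calls.append(1)
    return pd.DataFrame({"Venue": ["KFC"],
                         "Venue Category": ["Fast Food Restaurant"]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "venues.csv")
    first = load_or_fetch(cache_path, fake_download)   # downloads and writes the file
    second = load_or_fetch(cache_path, fake_download)  # reads the file instead

print("download calls:", len(calls))
```

The second call never reaches `fake_download`, which is the behavior the note describes: data is downloaded once and reused until the file is deleted.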

In [24]:
#Needed to call FourSquare service:
CLIENT_ID = '<YOUR ID HERE>' # your Foursquare ID
CLIENT_SECRET = '<YOUR SECRET HERE>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

radius = 500
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

# We will use already downloaded data if it is found.
# If a fresh download is preferred, remove this file from the filesystem.
FOURSQUARE_DATA_FILENAME = "Toronto FS Venue data.csv"

Your credentials:
CLIENT_ID: <YOUR ID HERE>
CLIENT_SECRET: <YOUR SECRET HERE>


In [4]:
# Helper function from 2nd lab of week 3
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

print("getNearbyVenues() defined")

getNearbyVenues() defined
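
As an alternative to formatting the query string by hand, the URL could be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which escapes the values. This is only a sketch: `build_explore_url` is a hypothetical helper name, and the credentials passed in are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL used above from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode joins the pairs with '&' and percent-escapes the values.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url("ID", "SECRET", "20180605", 43.753259, -79.329656, 500, 100)
print(url)
```

Note that the comma in the `ll` value is percent-encoded as `%2C` in the resulting URL.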


In [5]:

# get the venues, either from a file or download from FourSquare
toronto_venues = None

# We use downloaded data if we have it
if os.path.isfile(FOURSQUARE_DATA_FILENAME):

    print("Reading Toronto Venues data from a file (saved after previous download)")
    toronto_venues = pd.read_csv(FOURSQUARE_DATA_FILENAME)

else:

    print("Downloading Toronto Venues data from FourSquare (online)")
    
    # returns a dataframe
    toronto_venues = getNearbyVenues(names = toronto_geo_df['Neighborhood'],
                                     latitudes = toronto_geo_df['Latitude'],
                                     longitudes=toronto_geo_df['Longitude']
                                     )

    # save data to file
    toronto_venues.to_csv(FOURSQUARE_DATA_FILENAME, index=False)
    print("FourSquare Toronto Venues data written to file", FOURSQUARE_DATA_FILENAME)

print("FourSquare Toronto Venues data loaded.")


Reading Toronto Venues data from a file (saved after previous download)
FourSquare Toronto Venues data loaded.


In [6]:
# Show us some of what we got...
print(toronto_venues.shape)
toronto_venues.head()

(2250, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [7]:
toronto_venues[["Neighborhood", "Venue"]].groupby('Neighborhood').count()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
"Adelaide, King, Richmond",100
Agincourt,5
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",2
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9
"Alderwood, Long Branch",10
"Bathurst Manor, Downsview North, Wilson Heights",17
"Bathurst Quay, CN Tower, Harbourfront West, Island airport, King and Spadina, Railway Lands, South Niagara",14
Bayview Village,4
"Bedford Park, Lawrence Manor East",25
Berczy Park,56


In [8]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 275 unique categories.


## Step 3 - Analyze each neighborhood

Next we want to know the top 10 most common types of venues for each neighborhood.

To get this, first create a column for each venue category found in the neighborhoods (one-hot encoding). Then aggregate the columns by neighborhood. Finally, find the 10 most common categories for each neighborhood.

In [9]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# sort columns so that neighborhood is the first again
other_cols = [c for c in toronto_onehot.columns if c != 'Neighborhood']
fixed_columns = ['Neighborhood'] + other_cols
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2250, 275)


Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now *toronto_onehot* contains each venue as its own row.

Next, collapse each neighborhood's venues into one row by grouping on the neighborhood and taking the mean of every category column, so that the values in each row sum to one. For example, if a neighborhood has 2 airports, 2 train stations and a video store, the airport and train station columns each get the value 0.4 and the video store column gets 0.2.
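
The airport example can be reproduced on a toy dataframe. The neighborhood name "Example" and its five venues are made up for illustration; the one-hot encoding and `groupby(...).mean()` steps mirror the ones used in this notebook.

```python
import pandas as pd

# One made-up neighborhood with 2 airports, 2 train stations and 1 video store.
venues = pd.DataFrame({
    "Neighborhood": ["Example"] * 5,
    "Venue Category": ["Airport", "Airport", "Train Station",
                       "Train Station", "Video Store"],
})

# One row per venue, one 0/1 column per category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each share is the category count divided by the number of venues in the neighborhood (2/5, 2/5 and 1/5 here), which is why the row sums to one.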

In [10]:

toronto_grouped_no_borough = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print("toronto grouped shape (no borough-column)", toronto_grouped_no_borough.shape)

# Merge the Borough column into the data so we can use it to filter the data later
toronto_grouped = pd.merge(toronto_df[['Borough', 'Neighborhood']], toronto_grouped_no_borough, on="Neighborhood")
print("toronto grouped shape (including borough-column)", toronto_grouped.shape)

toronto_grouped.head()

toronto grouped shape (no borough-column) (100, 275)
toronto grouped shape (including borough-column) (100, 276)


Unnamed: 0,Borough,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,North York,Parkwoods,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,North York,Victoria Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Queen's Park,Queen's Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.022727
3,North York,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,North York,Glencairn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# If you are interested in exploring and understanding the venue data, then
# change the following variable value to True.  This variable value does not affect
# any data processing, only showing it.

SHOW_DATA_FOR_UNDERSTANDING = False

if SHOW_DATA_FOR_UNDERSTANDING:
    print("Below is some data for inspection and exploration, to help in understanding it")
else:
    print("Not showing data unless it has been modified.")


Not showing data unless it has been modified.


In [12]:
if SHOW_DATA_FOR_UNDERSTANDING:
    toronto_grouped.loc[0].T.reset_index()

In [13]:
num_top_venues = 5

if SHOW_DATA_FOR_UNDERSTANDING:
    for hood in toronto_grouped['Neighborhood']:
        print("----"+hood+"----")
        temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        print('\n')

In [14]:
#
# Helper function to focus attention on each neighborhood's most common venues
#
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [15]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough', 'Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Parkwoods,Fast Food Restaurant,Food & Drink Shop,Park,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,North York,Victoria Village,Intersection,Portuguese Restaurant,Coffee Shop,Pizza Place,Hockey Arena,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
2,Queen's Park,Queen's Park,Coffee Shop,Gym,Sushi Restaurant,Japanese Restaurant,Diner,Chinese Restaurant,Smoothie Shop,Seafood Restaurant,Sandwich Place,Bubble Tea Shop
3,North York,Don Mills North,Baseball Field,Pool,Gym / Fitness Center,Japanese Restaurant,Café,Caribbean Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop
4,North York,Glencairn,Japanese Restaurant,Pub,Bakery,Metro Station,Yoga Studio,Dim Sum Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop
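
The column names above are built with a `try`/`except` around a three-item suffix list, which is enough for the ten columns used here. If more columns were ever needed, a more general suffix rule could look like the sketch below. The function name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) are irregular.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```

For the values 1 to 10 this produces the same column names as the loop in the cell above.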


## Step 4 - Filter neighborhoods to analyze via clustering

We must define any filtering, if wanted, before we do the clustering.

In [16]:

# Turn filters on or off with True / False values. For now there is only one filter.
FILTER_TORONTO = True


filter = None
if FILTER_TORONTO:
    print("\n\nFiltering for Toronto boroughs whose name contains word 'Toronto'")
    filter = toronto_grouped['Borough'].str.contains("Toronto")
else:
    # Effectively no filter, but fill it so that it will pass all data through.
    filter = pd.Series(data = [True for n in toronto_grouped['Neighborhood']])

filter_passed_through = int(filter.sum())

# the all() method is kind of 'and' operation for the whole series value, it returns True only
# if all of the values in the series are True.  Thus it means there is no filtering.
if filter.all():
    print("No data filtering defined.\n")
else:
    print("Filtering in use, proceeding to clustering with", filter_passed_through, "cases out of", toronto_grouped.shape[0], "possible cases.\n")




Filtering for Toronto boroughs whose name contains word 'Toronto'
Filtering in use, proceeding to clustering with 38 cases out of 100 possible cases.
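
The boolean mask produced by `str.contains` can be inspected on a small series. The borough values below are examples only, and the `na=False` argument is an addition that this notebook does not use: it treats a missing borough as "no match" rather than propagating a missing value into the mask.

```python
import pandas as pd

# Example borough names, including one missing value.
boroughs = pd.Series(["North York", "Downtown Toronto", "East Toronto", None])

# True where the borough name contains the word 'Toronto'.
mask = boroughs.str.contains("Toronto", na=False)

print(mask.tolist())
print("rows passing the filter:", int(mask.sum()))
```

Summing a boolean mask counts its `True` entries, which gives the number of rows that pass the filter.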



## Step 5 - Cluster neighborhoods

In [17]:
# set number of clusters
# only even slightly meaningful results seem to come with 3-7 clusters...
kclusters = 3

# the borough and neighborhood names do no good when clustering, so don't include them
toronto_grouped_clustering = toronto_grouped[filter].drop(['Borough', 'Neighborhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print("\n\nClustered", len(kmeans.labels_), "neighborhoods into", kclusters, "clusters.\n")

# Show some distributions after clustering
k_df = pd.DataFrame(kmeans.labels_)
k_df.columns = ['ClusterLabel']
k_df["count"] = np.ones(len(kmeans.labels_))
k_df.groupby("ClusterLabel").count()



Clustered 38 neighborhoods into 3 clusters.



Unnamed: 0_level_0,count
ClusterLabel,Unnamed: 1_level_1
0,34
1,3
2,1
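
The number of clusters above was chosen by trial. One common way to compare candidate values of `kclusters` is the silhouette score from scikit-learn. The sketch below runs on synthetic points (three well-separated blobs generated with NumPy), not on the Toronto data, so it only illustrates the procedure and says nothing about which value suits this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs around assumed centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("k with the highest silhouette score:", best_k)
```

With very unbalanced results such as the 34 / 3 / 1 split above, a score like this is at best one input among several; inspecting the clusters directly, as the next cells do, remains necessary.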


In [18]:

# We applied filtering for the clustering data, thus we need to apply filtering to the
# resulting data
f_neighborhoods_venues_sorted = neighborhoods_venues_sorted[filter].reset_index()

# add clustering labels
f_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# Soon we will merge data together, before that drop Borough as redundant column from this dataframe
f_neighborhoods_venues_sorted = f_neighborhoods_venues_sorted.drop('Borough', axis=1)

# merge to add latitude/longitude for each neighborhood
toronto_merged = pd.merge(toronto_geo_df, f_neighborhoods_venues_sorted, on='Neighborhood')
toronto_merged = toronto_merged.drop('index', axis=1)
print("\n\nlength of merged data (toronto_merged)", toronto_merged.shape[0], "\n")

toronto_merged.head(10) # check the last columns!



length of merged data (toronto_merged) 38 



Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Restaurant,Hotel,Café,Clothing Store,Bakery,Italian Restaurant,Cosmetics Shop,Cocktail Bar,Park
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Astrologer,Coffee Shop,Grocery Store,Pub,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Restaurant,Cocktail Bar,Italian Restaurant,Café,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Pub
3,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,0,Coffee Shop,Café,Italian Restaurant,Bar,Burger Joint,Indian Restaurant,Chinese Restaurant,Thai Restaurant,Sandwich Place,Ice Cream Shop
4,M6G,Downtown Toronto,Christie,43.669542,-79.422564,0,Grocery Store,Café,Park,Nightclub,Diner,Italian Restaurant,Baby Store,Restaurant,Athletics & Sports,Coffee Shop
5,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Brewery,Bar,Stationery Store,Fish Market,Juice Bar
6,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Bus Line,Dim Sum Restaurant,Park,Swim School,Donut Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Yoga Studio
7,M5N,Central Toronto,Roselawn,43.711695,-79.416936,2,Garden,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
8,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Gym,Park,Burger Joint,Sandwich Place,Dog Run,Breakfast Spot,Food & Drink Shop,Hotel,Yoga Studio,Doner Restaurant
9,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Furniture / Home Store,Fast Food Restaurant,Diner,Dessert Shop,Mexican Restaurant,Cosmetics Shop,Park


#### Clustering insights

Let's glance at some data from each cluster.

- For each cluster, we look at its neighborhoods and the top three venue types of those neighborhoods.
- For each venue type, we calculate the percentage of the cluster's neighborhoods that have that venue type in their top three.
- We show only the venue types that appear in the top three of at least 10% of the cluster's neighborhoods.
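
The three steps above can be sketched compactly with `collections.Counter`. The function name `top_venue_shares` and the three example rows are made up for illustration; the notebook's own helper, defined in the next cell, does the same counting with a plain dictionary.

```python
from collections import Counter


def top_venue_shares(rows, min_percent=10):
    """rows: one list of top venue types per neighborhood in a cluster.

    Returns {venue type: percentage of neighborhoods listing it},
    keeping only venue types at or above min_percent.
    """
    cluster_size = len(rows)
    counts = Counter(venue for row in rows for venue in row)
    return {venue: 100 * n / cluster_size
            for venue, n in counts.items()
            if 100 * n / cluster_size >= min_percent}


# A made-up cluster of three neighborhoods with their top three venue types.
cluster = [
    ["Park", "Playground", "Trail"],
    ["Park", "Trail", "Tennis Court"],
    ["Park", "Playground", "Jewelry Store"],
]

shares = top_venue_shares(cluster)
for venue, percent in shares.items():
    print("{0:3.0f}% -- {1}".format(percent, venue))
```

Because every neighborhood contributes three venue types, the percentages within a cluster add up to 300%, not 100%.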


In [19]:
# First define a helper function.  It takes in a dataframe that has all the rows (neighborhoods)
# of one cluster, but only the columns for the 1st, 2nd and 3rd most common venue type.  It then
# prints out the most common venue types among that cluster's top 3 venue types.
def print_top_venues_from_df(data_df):
    cluster_size = data_df.shape[0]
    top_venues_list = []
    for row in data_df.values:
        top_venues_list.extend(row)

    top_venues_d = {}
    for venue in top_venues_list:
        if venue in top_venues_d:
            top_venues_d[venue] = top_venues_d[venue] + 1
        else:
            top_venues_d[venue] = 1

    for venue in top_venues_d:
        occurrence = 100 * top_venues_d[venue] / cluster_size
        if occurrence >= 10:
            print("{0:2.0f}% -- {1}".format(occurrence, venue))

print("\n\nprint_top_venues_from_df() defined\n")



print_top_venues_from_df() defined



In [20]:
# For each cluster, see which venues made it into the top 3 for each neighborhood
# percentage tells us how many of the cluster's neighborhoods had that type of venue in
# its top three venues.

print("\n\nRESULTS: Cluster venue types")

for cluster_id in range(kclusters):

    one_cluster = toronto_merged[toronto_merged['Cluster Labels'] == cluster_id]
    top3_of_cluster = one_cluster.iloc[:,6:9]

    print("\nCluster {0} ({1} neighborhoods), venues in top 3:".format(cluster_id, top3_of_cluster.shape[0]))
    print_top_venues_from_df(top3_of_cluster)

print("\n")



RESULTS: Cluster venue types

Cluster 0 (34 neighborhoods), venues in top 3:
62% -- Coffee Shop
21% -- Restaurant
15% -- Hotel
47% -- Café
18% -- Park

Cluster 1 (3 neighborhoods), venues in top 3:
100% -- Park
67% -- Playground
67% -- Trail
33% -- Tennis Court
33% -- Jewelry Store

Cluster 2 (1 neighborhoods), venues in top 3:
100% -- Garden
100% -- Drugstore
100% -- Dim Sum Restaurant




As a summary of the 3 clusters for neighborhoods whose borough name contains 'Toronto':

- cluster 0 is by far the biggest, and it consists mainly of coffee shops, cafes, restaurants and hotels.
- cluster 1 is quite small and contains activity areas such as parks, playgrounds and trails.
- cluster 2 is an odd one: it has a garden and no cafe in its top three, so it could arguably belong in cluster 1.

## Step 6 - Show data on map

In [21]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('\n\nThe geographical coordinates of Toronto are {}, {}.\n'.format(latitude, longitude))



The geographical coordinates of Toronto are 43.653963, -79.387207.



In [22]:
# drop rows with NaN cluster labels, if there are any
toronto_merged = toronto_merged[toronto_merged['Cluster Labels'].notnull()]

In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

#### Insights about the clusters from the map

Cluster 0 (red on the map), which is mostly characterized by cafes, restaurants and hotels, is dense in downtown Toronto but also spreads widely across the Toronto area.  Clusters 1 and 2 (blue and green on the map) lie further from downtown and the shoreline, a bit more inland, where there is more room for recreational areas.
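
The colors named above follow from the `rainbow[cluster-1]` lookup in the map cell: for cluster 0 the index is -1, which in Python selects the last list entry. The sketch below shows that mapping with placeholder color names. They stand in for the three hex values sampled from `cm.rainbow`, which runs from violet at its start to red at its end; the exact shades are an approximation.

```python
kclusters = 3

# Placeholder names standing in for the three hex colors sampled from cm.rainbow,
# ordered from the start of the colormap (violet) to its end (red).
palette = ["violet", "green", "red"]

# Same lookup as the map cell: cluster 0 uses index -1, i.e. the last entry.
assigned = {cluster: palette[cluster - 1] for cluster in range(kclusters)}

for cluster, color in assigned.items():
    print("cluster", cluster, "->", color)
```

Indexing with `rainbow[cluster]` instead would shift every cluster to a different color, so the description above would need to change with it.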

Thank you for reading all the way down here!