# Segmenting and Clustering Neighborhoods in Toronto

## I. Data munging

In [15]:
# Data / analytic
import pandas as pd
import numpy as np
import lxml.html as LH

# Scraping / IO
import requests
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

# ML
from sklearn.cluster import KMeans

# Cartography
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# Visualization
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt


The purpose of this section is to collect information about the Toronto boroughs so that they can be clustered according to the locations found in them.  
This section consists of 3 steps:
- fetch the list of boroughs and neighbourhoods, including the postal codes, from Wikipedia
- clean the data
- append the coordinates of each postal code, using either the Geocoder package or a CSV file that lists them

### I.1 Fetch raw data

In [2]:
# scrape data from the Wikipedia page and save the result as a dataframe called 'raw_data'
url_wiki = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

raw_data = pd.read_html(url_wiki, header=0)[0]
raw_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
print("raw data shape:", raw_data.shape)

raw data shape: (288, 3)


### I.2 Clean raw data

#### Remove rows when Borough is not assigned  
  
Remove the rows whose 'Borough' value is missing (i.e. marked 'Not assigned') and save the result in a new dataframe called 'cleaned_data'.

In [4]:
cleaned_data = raw_data.copy(deep=True)
cleaned_data["Borough"] = cleaned_data["Borough"].replace({"Not assigned": np.nan})
cleaned_data = cleaned_data.dropna(subset=["Borough"])  # only a missing Borough should drop a row
cleaned_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


#### Replace Neighbourhood by Borough when Neighbourhood is not assigned

In [5]:
# detect the rows which have unassigned neighbourhoods
mask = (cleaned_data["Neighbourhood"]=="Not assigned")
print("Rows with neighbourhood is not assigned:\n")
print(cleaned_data[mask])

# replace unassigned neighbourhoods by the borough
cleaned_data.loc[mask, "Neighbourhood"] = cleaned_data["Borough"]
print("\nAfter replacement:\n")
print(cleaned_data[mask])

Rows with neighbourhood is not assigned:

  Postcode       Borough Neighbourhood
8      M7A  Queen's Park  Not assigned

After replacement:

  Postcode       Borough Neighbourhood
8      M7A  Queen's Park  Queen's Park


#### Regroup Neighbourhoods that have the same postal code  
  
More than one neighbourhood can exist in one postal code area.  
Rows that share the same postal code are combined into one row, with the neighbourhoods separated by commas.

In [6]:
cleaned_data = cleaned_data.groupby(by=['Postcode', 'Borough'])["Neighbourhood"].apply(', '.join).reset_index(drop=False)

print("shape of the dataframe after cleaning:", cleaned_data.shape)
cleaned_data.head()

shape of the dataframe after cleaning: (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### I.3 Append locations

In order to use the Foursquare location data, we need the latitude and longitude of each neighbourhood.  
We read the CSV file that contains the latitude and longitude of each postcode and save it as a temporary dataframe called 'tmp'.  
This dataframe is then merged with the dataframe 'cleaned_data'.

In [7]:
url_loc = "https://cocl.us/Geospatial_data"

In [8]:
tmp = pd.read_csv(url_loc, sep=",", encoding="utf-8")
tmp.columns = ["Postcode", "Latitude", "Longitude"]
cleaned_data = cleaned_data.merge(tmp, how="left", on="Postcode")
cleaned_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [9]:
print("data shape:", cleaned_data.shape)

data shape: (103, 5)
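
A left merge keeps every row of the left dataframe but leaves `NaN` coordinates for any postcode that is absent from the CSV file. The following is a minimal sketch of how such gaps could be detected. The frames `left` and `right` and the postcode `M9Z` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for cleaned_data and the coordinates CSV (illustrative values only)
left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"],
                     "Borough": ["Scarborough", "Scarborough", "Hypothetical"]})
right = pd.DataFrame({"Postcode": ["M1B", "M1C"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

# indicator=True adds a "_merge" column telling where each row came from
merged = left.merge(right, how="left", on="Postcode", indicator=True)

# Rows flagged "left_only" found no match, so their coordinates are NaN
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", unmatched)
```

An empty list would suggest that every postcode received coordinates.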


## II. Quick overview

The purpose of this section is to take a quick look at the data.  
We display a few statistics computed on the boroughs and plot each location on a map using the Folium library.

### II.1 Few stats

In [10]:
# List all the distinct boroughs
boroughs = cleaned_data.Borough.unique()
print("There are {0} distinct boroughs:\n".format(boroughs.size))
print(boroughs)

There are 11 distinct boroughs:

['Scarborough' 'North York' 'East York' 'East Toronto' 'Central Toronto'
 'Downtown Toronto' 'York' 'West Toronto' "Queen's Park" 'Mississauga'
 'Etobicoke']


In [11]:
# count the number of neighbourhoods per borough
tmp = pd.DataFrame( cleaned_data["Borough"] )
tmp["Neighbourhood Count"] = cleaned_data.Neighbourhood.str.split(",").str.len()
tmp.groupby("Borough").sum()

Unnamed: 0_level_0,Neighbourhood Count
Borough,Unnamed: 1_level_1
Central Toronto,17
Downtown Toronto,37
East Toronto,7
East York,6
Etobicoke,45
Mississauga,1
North York,38
Queen's Park,1
Scarborough,37
West Toronto,13


### II.2 Create a map of Toronto with neighborhoods superimposed on top.

There are 2 steps to draw a map of Toronto with the positions of the boroughs:  
- Get the location of the city of Toronto  
- Create a folium map and plot all the borough locations with the appropriate labels

In [12]:
# This function returns the latitude & longitude of an address x.
# If x is empty, it returns the location of the city of Toronto.
def getLocation(x=""):
    
    address = 'Toronto, Ontario'
    if x:
        address = '{}, Toronto, Ontario'.format(x)

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        raise ValueError('Could not geocode address: {}'.format(address))
    return location.latitude, location.longitude

In [13]:
latitude, longitude = getLocation()
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [16]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(cleaned_data['Latitude'], cleaned_data['Longitude'], cleaned_data['Borough'], cleaned_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

## III. Filter data - prepare the subset to analyze

In this section, we continue the data preparation.  
We focus only on the boroughs whose name contains the word 'Toronto'.  
First, we filter the dataframe 'cleaned_data' to keep only the **'*Toronto*'** boroughs, which are stored in a new dataframe called 'toronto_data'.  
Then we plot the results (i.e. the locations stored in 'toronto_data') on a map.

### III.1 Filter data

In [17]:
toronto_data = cleaned_data[cleaned_data['Borough'].str.contains('Toronto')].reset_index(drop=True)
print("Filtered dataframe:\n")
toronto_data.head()

Filtered dataframe:



Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [18]:
print("Toronto data shape:", toronto_data.shape)

Toronto data shape: (38, 5)


### III.2 Draw Toronto map

Plot each location stored in toronto_data on the Toronto folium map.

In [19]:
# get the geographical coordinates of Toronto borough.
latitude, longitude = getLocation("Toronto")
print('The geographical coordinates of Toronto borough are {}, {}.'.format(latitude, longitude))

# create map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

The geographical coordinates of Toronto borough are 43.653963, -79.387207.


## IV. Data enriching (Foursquare)

In this section:
- We will use the Foursquare API to append venue information to the data (toronto_data).  
- We will extract the category of each venue.
- Then we will generate features based on these categories. These features will be used to cluster the neighborhoods (k-means).  

The final result is a new dataframe containing a set of features for each neighborhood.

*Define Foursquare Credentials and Version*

In [1]:
# REMOVED BY AUTHOR

### IV.1 Explore Neighborhoods in Toronto

*Get the top 100 venues that are in **all neighborhoods** within a radius of 500 meters*

In [21]:
# define the parameters
LIMIT = 100  # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# Function that repeats the same process for all the neighborhoods in Toronto: it collects the nearby venues (name, location, category) of each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
   
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

List up to 100 venues for each neighborhood and save the results in a dataframe called 'tmp', which is later copied to 'toronto_venues'.  

In [22]:
# Enrich toronto data with foursquare API
tmp = getNearbyVenues(names=toronto_data['Neighbourhood'],
                      latitudes=toronto_data['Latitude'],
                      longitudes=toronto_data['Longitude'])

print("toronto venues shape:", tmp.shape)
tmp.head()

toronto venues shape: (1717, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In the next steps, we will experiment with toronto_venues.   
To avoid regenerating the data with the function getNearbyVenues (which sometimes fails with 'Groups' errors),  
we keep the raw result in 'tmp' as a backup and work on a copy called 'toronto_venues'.

In [68]:
toronto_venues = tmp.copy(True)
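
The 'Groups' errors mentioned above presumably occur when a response does not contain the `groups` key that `getNearbyVenues` indexes directly. The following is a minimal sketch of a more tolerant extraction step that works on an already decoded JSON dictionary. The helper names `extract_items` and `venue_category` are hypothetical, and both payloads are made up for illustration. Only the nesting `response -> groups -> items -> venue -> categories` is taken from the function above.

```python
def extract_items(payload):
    """Return the list of venue items from a decoded response, or [] if absent.

    Hypothetical helper: it only assumes the nesting used in getNearbyVenues,
    i.e. payload["response"]["groups"][0]["items"].
    """
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])


def venue_category(item, default="Unknown"):
    """Return the first category name of a venue item, or a default label."""
    categories = item.get("venue", {}).get("categories") or []
    return categories[0]["name"] if categories else default


# Made-up payloads for illustration
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Pantheon", "categories": [{"name": "Greek Restaurant"}]}},
    {"venue": {"name": "No category", "categories": []}},
]}]}}
failed = {"meta": {"code": 429}, "response": {}}

ok_categories = [venue_category(item) for item in extract_items(ok)]
failed_items = extract_items(failed)
print(ok_categories)
print(failed_items)
```

With this approach a failed request contributes no rows instead of interrupting the whole loop, at the cost of silently skipping that neighborhood.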

Display the number of venues found for each neighborhood

In [69]:
pd.DataFrame(toronto_venues.groupby('Neighborhood')["Venue"].count())

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
"Adelaide, King, Richmond",100
Berczy Park,57
"Brockton, Exhibition Place, Parkdale Village",21
Business Reply Mail Processing Centre 969 Eastern,20
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17
"Cabbagetown, St. James Town",43
Central Bay Street,87
"Chinatown, Grange Park, Kensington Market",100
Christie,15
Church and Wellesley,90


In [70]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 unique categories.


There are 235 unique categories (features) for only 38 unique neighborhoods (rows).  
That is too many features for so few rows. Furthermore, we observed that several venue categories can be related to a common parent category.  
So we will normalize the categories: most of them will be gathered under a parent category to reduce the number of features.  
Building the mapping requires exploring the values in toronto_venues and sometimes searching the web for further information (see the function below).

In [71]:
# Ordered rules (keywords, parent category): the first rule with a matching keyword wins
CATEGORY_RULES = [
    (["Pharmacy", "Health", "Medical"], "Health"),
    (["Plane", "Airport", "Rental", "Bus", "Moving", "Metro", "Train", "Rail", "Boat", "Travel"], "Transport"),
    (["Nightclub", "Strip", "Bodega", "Roof Deck", "Brewery", "Bar", "Pub", "Bistro"], "Night"),
    (["Wings", "Poke", "Soup", "Snack", "Noodle", "BBQ", "Salad", "Cheese", "Lounge", "Food", "Taco", "Creperie", "Burrito", "Fried", "Diner", "Burger", "Gastro", "Restaurant", "Steakhouse", "Sandwich", "Pizza", "Ice Cream", "Tea Room", "Café", "Coffee", "Breakfast"], "Food"),
    (["Plaza", "Outdoors", "Dog Run", "Park", "Lake", "Garden", "Beach", "Scenic"], "Nature-Tourism"),
    (["Neighborhood", "Aquarium", "Marina", "Gift Shop", "Poutine Place", "Fountain", "Monument", "Historic", "Building", "Church"], "City-Tourism"),
    (["Sporting", "College Rec", "Trail", "Tennis", "Sport", "Skating", "Swim", "Playground", "Gym", "Yoga", "Stadium", "Martial", "Spa", "Tanning"], "Sport-Wellness"),
    (["Butcher", "Store", "Shop", "Bookstore", "Market", "Supermarket", "Bakery", "Grocery", "Boutique"], "Stores"),
    (["Museum", "Art", "Theater", "Opera", "Recording", "Dance", "Jazz Club", "Jazz", "Music", "Concert"], "Culture"),
    (["Gaming", "Entertainment", "Speakeasy"], "Entertainment"),
    (["Hotel", "Hostel"], "Hotel"),
    (["Post", "Bank", "Office", "Workshop"], "Services"),
]

def normalizeCat(cell):
    # return the parent category of the first rule whose keyword is a substring of the cell
    for keywords, parent in CATEGORY_RULES:
        if any(keyword in cell for keyword in keywords):
            return parent
    # categories that match no rule are kept unchanged
    return cell

In [140]:
toronto_venues['Venue Category'] = toronto_venues['Venue Category'].map(normalizeCat)
print('After the normalization, there are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues['Venue Category'].value_counts()

After the normalization, there are 13 unique categories.


Food              921
Stores            286
Night             165
Sport-Wellness     87
Culture            70
Nature-Tourism     55
Hotel              41
City-Tourism       32
Transport          27
Services           12
Entertainment       9
Health              9
Intersection        3
Name: Venue Category, dtype: int64
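
Because `normalizeCat` tests keywords as substrings and returns at the first match, the order of the rules decides how ambiguous labels are classified. The following is a minimal sketch of that effect with a reduced two-rule table. The names `RULES` and `normalize` are illustrative and are not the notebook's own.

```python
# Reduced rule table for illustration: (keywords, parent category), checked in order
RULES = [
    (["Restaurant", "Pizza"], "Food"),
    (["Spa", "Gym"], "Sport-Wellness"),
]


def normalize(label, rules=RULES):
    """Return the parent category of the first rule with a keyword found in label."""
    for keywords, parent in rules:
        if any(keyword in label for keyword in keywords):
            return parent
    return label  # unmatched labels are kept unchanged


# "Spa" is a substring of "Spanish", so only the rule order keeps this label in "Food"
in_order = normalize("Spanish Restaurant")
reversed_order = normalize("Spanish Restaurant", rules=RULES[::-1])
unmatched = normalize("Intersection")
print(in_order, reversed_order, unmatched)
```

This is also why a label such as 'Intersection' survives the normalization: no keyword matches it, so it is returned as is.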

### IV.3 Analyze Each Neighborhood

#### Create features

*a. Transform a column of qualitative variable to multiple binary variables*  

The goal is to represent the observations in a vector space by using the one-hot encoding technique.  
The result will be saved as a new dataframe called 'toronto_onehot'.

In [73]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print("toronto_onehot shape:", toronto_onehot.shape, "\n")
toronto_onehot.head()

toronto_onehot shape: (1717, 14) 



Unnamed: 0,Neighborhood,City-Tourism,Culture,Entertainment,Food,Health,Hotel,Intersection,Nature-Tourism,Night,Services,Sport-Wellness,Stores,Transport
0,The Beaches,0,0,0,0,0,0,0,0,0,0,1,0,0
1,The Beaches,0,0,0,0,1,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,1,0,0,0,0
3,The Beaches,1,0,0,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,1,0,0,0,0,0,0,0,0,0


*b. Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category*  

On the one hand, we have a dataframe that contains binary variables for each venue.  
On the other hand, we want a dataframe containing a set of features for each neighborhood, not for each venue.  
So we have to group the rows by neighborhood.

In [74]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print("toronto grouped shape:", toronto_grouped.shape, "\n")
toronto_grouped.head()

toronto grouped shape: (38, 14) 



Unnamed: 0,Neighborhood,City-Tourism,Culture,Entertainment,Food,Health,Hotel,Intersection,Nature-Tourism,Night,Services,Sport-Wellness,Stores,Transport
0,"Adelaide, King, Richmond",0.02,0.08,0.01,0.6,0.0,0.03,0.0,0.01,0.07,0.0,0.04,0.13,0.01
1,Berczy Park,0.017544,0.070175,0.0,0.473684,0.0,0.017544,0.0,0.035088,0.122807,0.0,0.035088,0.22807,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.047619,0.0,0.52381,0.0,0.0,0.047619,0.0,0.047619,0.0,0.142857,0.190476,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.05,0.0,0.2,0.0,0.0,0.0,0.2,0.05,0.05,0.1,0.2,0.15
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.058824,0.0,0.0,0.058824,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.058824,0.705882


*c. Print each neighborhood along with the top 10 most common venues and save the results in a dataFrame*

The objective is to compare the categories of the top 10 most common venues and the predicted groups to check the consistency of clustering.

In [75]:
# This function returns the categories of the num_top_venues most common venues of a row, i.e. of a neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [130]:
num_top_venues = 10

# for column labelling
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Food,Stores,Culture,Night,Sport-Wellness,Hotel,City-Tourism,Transport,Nature-Tourism,Entertainment
1,Berczy Park,Food,Stores,Night,Culture,Sport-Wellness,Nature-Tourism,Hotel,City-Tourism,Transport,Services
2,"Brockton, Exhibition Place, Parkdale Village",Food,Stores,Sport-Wellness,Night,Intersection,Culture,Transport,Services,Nature-Tourism,Hotel
3,Business Reply Mail Processing Centre 969 Eastern,Stores,Nature-Tourism,Food,Transport,Sport-Wellness,Services,Night,Culture,Intersection,Hotel
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Transport,Stores,Night,Nature-Tourism,Food,City-Tourism,Sport-Wellness,Services,Intersection,Hotel


## V. Cluster Neighborhoods

### V.1 Cluster Toronto data with k-means

*Run k-means to cluster the neighborhoods into 5 clusters.*  
  
We hope to obtain groups of neighborhoods that are representative of the most common types of places found in them.

In [131]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print("groups:", np.unique(kmeans.labels_))
print("\nFirst 10 groups:", kmeans.labels_[0:10])

groups: [0 1 2 3 4]

First 10 groups: [1 1 1 3 0 1 1 1 1 1]


We will append the predicted cluster, together with the categories of the top 10 most common venues, to the data in order to check the consistency of the clustering.

In [132]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# work on a renamed copy so that toronto_data itself is left unchanged
toronto_merged = toronto_data.rename(columns={"Neighbourhood": "Neighborhood"})

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
print("toronto merged columns:\n", list(toronto_merged.columns))

toronto_merged.head() # check the last columns!

toronto merged columns:
 ['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude', 'Cluster Labels', '1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue', '4th Most Common Venue', '5th Most Common Venue', '6th Most Common Venue', '7th Most Common Venue', '8th Most Common Venue', '9th Most Common Venue', '10th Most Common Venue']


Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Sport-Wellness,Night,Health,City-Tourism,Transport,Stores,Services,Nature-Tourism,Intersection,Hotel
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Food,Stores,Night,Sport-Wellness,Transport,Services,Nature-Tourism,Intersection,Hotel,Health
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,1,Food,Stores,Night,Nature-Tourism,Transport,Sport-Wellness,Intersection,Culture,Services,Hotel
3,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Food,Stores,Sport-Wellness,Night,Services,Nature-Tourism,City-Tourism,Transport,Intersection,Hotel
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Transport,Sport-Wellness,Nature-Tourism,Stores,Services,Night,Intersection,Hotel,Health,Food


In [133]:
# Distribution of the clusters
toronto_merged["Cluster Labels"].value_counts().sort_index()

0     1
1    27
2     4
3     5
4     1
Name: Cluster Labels, dtype: int64
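
The counts above can be expressed as shares to make the imbalance between clusters explicit. This short sketch only reuses the five numbers printed above. The variable names are illustrative.

```python
import pandas as pd

# Cluster sizes copied from the value_counts output above
sizes = pd.Series({0: 1, 1: 27, 2: 4, 3: 5, 4: 1}, name="size")

# Share of the neighborhoods that falls in each cluster
shares = (sizes / sizes.sum()).round(3)
print(shares)
```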

### V.2 Visualization

In [134]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### V.3 Examine Clusters

#### Cluster 1

In [135]:
# Display the rows whose cluster label is 0, keeping only the Borough, Cluster Labels and most-common-venue columns
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Downtown Toronto,0,Transport,Stores,Night,Nature-Tourism,Food,City-Tourism,Sport-Wellness,Services,Intersection,Hotel


#### Cluster 2

In [136]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,1,Food,Stores,Night,Sport-Wellness,Transport,Services,Nature-Tourism,Intersection,Hotel,Health
2,East Toronto,1,Food,Stores,Night,Nature-Tourism,Transport,Sport-Wellness,Intersection,Culture,Services,Hotel
3,East Toronto,1,Food,Stores,Sport-Wellness,Night,Services,Nature-Tourism,City-Tourism,Transport,Intersection,Hotel
7,Central Toronto,1,Food,Stores,Night,Sport-Wellness,Nature-Tourism,Health,Transport,Services,Intersection,Hotel
9,Central Toronto,1,Food,Stores,Night,Sport-Wellness,Transport,Services,Nature-Tourism,Intersection,Hotel,Health
11,Downtown Toronto,1,Food,Stores,Night,Sport-Wellness,Nature-Tourism,Services,Health,Entertainment,City-Tourism,Transport
12,Downtown Toronto,1,Food,Night,Stores,Sport-Wellness,Nature-Tourism,Hotel,Culture,Transport,Intersection,Health
13,Downtown Toronto,1,Food,Stores,Sport-Wellness,Culture,Night,Nature-Tourism,Services,Hotel,Health,City-Tourism
14,Downtown Toronto,1,Food,Stores,Sport-Wellness,Night,Culture,Nature-Tourism,Services,Hotel,City-Tourism,Transport
15,Downtown Toronto,1,Food,Stores,Night,Hotel,Culture,Nature-Tourism,City-Tourism,Sport-Wellness,Services,Entertainment


#### Cluster 3

In [137]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,2,Sport-Wellness,Night,Health,City-Tourism,Transport,Stores,Services,Nature-Tourism,Intersection,Hotel
4,Central Toronto,2,Transport,Sport-Wellness,Nature-Tourism,Stores,Services,Night,Intersection,Hotel,Health,Food
8,Central Toronto,2,Sport-Wellness,Food,Transport,Stores,Services,Night,Nature-Tourism,Intersection,Hotel,Health
10,Downtown Toronto,2,Sport-Wellness,Nature-Tourism,City-Tourism,Transport,Stores,Services,Night,Intersection,Hotel,Health


#### Cluster 4

In [138]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Central Toronto,3,Food,Stores,Sport-Wellness,Nature-Tourism,Hotel,Transport,Services,Night,Intersection,Health
6,Central Toronto,3,Stores,Food,Sport-Wellness,Night,Nature-Tourism,Transport,Services,Intersection,Hotel,Health
23,Central Toronto,3,Stores,Sport-Wellness,Nature-Tourism,Food,Transport,Services,Night,Intersection,Hotel,Health
31,West Toronto,3,Stores,Food,Night,Health,Sport-Wellness,Services,Nature-Tourism,Culture,Transport,Intersection
37,East Toronto,3,Stores,Nature-Tourism,Food,Transport,Sport-Wellness,Services,Night,Culture,Intersection,Hotel


#### Cluster 5

In [139]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,4,Nature-Tourism,Culture,Transport,Stores,Sport-Wellness,Services,Night,Intersection,Hotel,Health


We can observe that the predicted groups are unbalanced: one group (cluster 2, label 1) contains about 71% of the rows (27 of 38).  
Maybe the features are not well chosen or processed, maybe some features are missing, or maybe the normalization mapping is poorly tuned.  
To put this in perspective, cluster 2 contains neighborhoods whose 1st most common venue seems to be the same, Food (i.e. restaurants, lounges, etc.).  
Cluster 3 contains neighborhoods whose 1st most common venue is mostly Sport-Wellness.  
Cluster 4: mostly Stores  
Cluster 5: Nature-Tourism  
So the clustering seems reasonable with respect to the 1st most common venues.
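
One way to probe whether the number of clusters is a reasonable choice is to compare silhouette scores for several values of k. This is offered as a sketch of a possible follow-up, not as something the notebook did. The synthetic matrix `X` stands in for `toronto_grouped_clustering`, and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs in 2-D
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(20, 2) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # the silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On the real features, a flat or low silhouette curve would suggest that the venue categories do not separate the neighborhoods well, whatever the value of k.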