<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

### 1. Read data from html and format it properly

In [1]:
# Import necessary packages
import bs4
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

First, we read the raw data from the saved HTML page. 

In [2]:
with open('List of postal codes of Canada_ M - Wikipedia.htm') as wiki: 
    soup = BeautifulSoup(wiki, 'html.parser')  # name the parser explicitly so results do not depend on what is installed

head = soup.find('div', class_="mw-parser-output").table.thead.tr

Then we extract the values we want into a dataframe. 

In [3]:
# Get the column names
cols = []
for t in head.find_all('th'):
    col = t.text.strip()
    cols.append(col)

In [4]:
# Get the table body
body = soup.find('div', class_="mw-parser-output").table.tbody
data_raw = body.find_all('tr')

# Make a dataframe to store all data
df_1 = pd.DataFrame(columns = cols)

# Loop through all rows
for data in data_raw: 
    row = [] # make an empty list to store data of one row
    
    # Store the data
    for cell in data.find_all('td'):
        cell = cell.text.strip()
        row.append(cell)
    
    # Append this row to our dataframe if it is assigned a borough
    if 'Not assigned' not in row[1]:
        df_1.loc[len(df_1), :] = row

# Check the dataframe
df_1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


### 2. Data Wrangling

First, we take care of the missing values in the 'Neighbourhood' column. 

In [5]:
# Replace every "Not assigned" neighbourhood with the borough name from the same row
i = 0
for row in df_1.itertuples(): 
    
    if row.Neighbourhood == "Not assigned":
        df_1.loc[i, 'Neighbourhood'] = df_1.loc[i, 'Borough']
    
    i+=1
    
df_1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
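
The row-by-row loop above can also be written as one vectorized step. This is only a minimal sketch on a three-row toy frame (the rows are copied from the table before the fix; the name `toy` is mine, not part of the notebook):

```python
import pandas as pd

# Toy frame: three rows as they looked before the fix
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Lawrence Manor', 'Not assigned', 'Islington Avenue'],
})

# Wherever Neighbourhood is "Not assigned", take the Borough value from the same row
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough'])

print(toy)
```

`Series.mask` replaces values where the condition is true and aligns the replacement on the index, so no manual row counter is needed.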


Then we merge the neighbourhoods that share a postcode into a single cell, separated by commas. 

In [6]:
# Create a new dataframe to store wrangled data
df = pd.DataFrame(columns = cols)

# Group dataset by postcode, loop through each group to get the correct data in correct format
for name, group in df_1.groupby('Postcode', as_index = False):
    
    # Get the corresponding data in correct format
    post = name
    bor = group['Borough'].tolist()[0]
    neigh = ', '.join(group['Neighbourhood'].tolist())
    row = [post, bor, neigh] # Combine all data into a list
    df.loc[len(df), :] = row # Add list to the dataframe

The cleaned data looks like this: 

In [7]:
# Display dataframe
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
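
The same merge can probably be expressed with a single `groupby(...).agg(...)` call instead of building the dataframe row by row. A minimal sketch on toy rows taken from the tables above (`toy` and `merged` are illustrative names):

```python
import pandas as pd

# Toy rows: two neighbourhoods share M1B, one belongs to M1G
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Keep the first borough per postcode and join the neighbourhood names with ", "
merged = (toy.groupby('Postcode', as_index=False)
             .agg({'Borough': 'first', 'Neighbourhood': ', '.join}))

print(merged)
```

Taking the first borough assumes every postcode maps to exactly one borough, which is also what the loop above assumes.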


### Shape of the cleaned dataset

In [8]:
df.shape

(103, 3)

### 3. Adding coordinates to the dataset

Here I took a shortcut and used the geospatial dataset that IBM provides in the assignment description. 

In [9]:
geo = pd.read_csv("http://cocl.us/Geospatial_data")
geo.rename(columns = {'Postal Code' : 'Postcode'}, inplace = True)

In [10]:
df = pd.merge(df, geo, on = 'Postcode')
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
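
One thing to keep in mind: `pd.merge` does an inner join by default, so a postcode without coordinates would silently disappear. Here is a small sketch of how that could be checked, using two real rows from the table above plus one made-up postcode (`M9Z` and the borough `Nowhere` are hypothetical, for illustration only):

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up postcode
                     'Borough': ['Scarborough', 'Scarborough', 'Nowhere']})
geo = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                    'Latitude': [43.806686, 43.784535],
                    'Longitude': [-79.194353, -79.160497]})

# A left join keeps every postcode; indicator=True adds a '_merge' column,
# and validate raises if a postcode appears more than once on either side
checked = pd.merge(left, geo, on='Postcode', how='left',
                   validate='one_to_one', indicator=True)

# Postcodes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```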


### 4. Segmenting and Clustering neighborhoods

**In this part, I will cluster neighborhoods using types of venues (coffee shop, gym, etc.) that are most common in each neighborhood, using K-Means Clustering.**    

I will use Foursquare data to compute how frequently each venue type occurs in each neighborhood, and select the 5 most common venue types per neighborhood.   
Neighborhoods whose venue types occur with similar frequencies will be put into the same cluster. 

To save time (and spare my laptop), I'm using a subset of the above dataset: the boroughs whose names contain the word "Toronto". 

In [117]:
# Get the smaller dataset
df_tor = df[df['Borough'].str.contains('Toronto')] 
df_tor.head(10)
#df_tor.shape

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


In [12]:
# Download and import necessary packages
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

We start by generating a map of Toronto!  

In [13]:
# Get the coordinates of Toronto 
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent = "tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [14]:
# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 11)

# Add markers to the map
for lat, lng, borough, neigh in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Borough'], df_tor['Neighbourhood']):
    label = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 4,
        popup = label,
        color = 'gray',
        fill = True,
        fill_color = '#ffffff').add_to(map_toronto)  
    
map_toronto

Now we use Foursquare to get information on each neighbourhood's venues.  

In [118]:
# Set up Foursquare credentials 
CLIENT_ID = 'DU1RZNIYZKMR4ND3KK4F3FASRDB1GJGKM0TYVLLLF0NFQKZU' # Foursquare ID
CLIENT_SECRET = '30BO30FW3YDGZM0BQO0VV0VR0WDPNYESACKI4UFCQD1IAPX4' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: DU1RZNIYZKMR4ND3KK4F3FASRDB1GJGKM0TYVLLLF0NFQKZU
CLIENT_SECRET:30BO30FW3YDGZM0BQO0VV0VR0WDPNYESACKI4UFCQD1IAPX4
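
A common alternative to hard-coding and printing credentials is to read them from environment variables. This is a sketch only: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are my own choices, not something Foursquare or the lab prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping (the process environment by default)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # hypothetical variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

# Demonstration with a plain dict standing in for the environment
CLIENT_ID, CLIENT_SECRET = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'my-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-secret',
})
print('Credentials loaded for client', CLIENT_ID)
```

In a real run the function would be called without arguments, so the secret never appears in the notebook or its output.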


Then of course, we borrow the 'get_category_type' function from the Foursquare lab and use it to get the venues' types. 

In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the explore endpoint nests the categories under 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

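We define a function that gets up to 100 venues within `radius` meters of each neighbourhood's coordinates (10,000 m by default; the Foursquare `radius` parameter is in meters, not miles). 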

In [2]:
def getNearbyVenues(names, latitudes, longitudes, radius = 10000):
    LIMIT = 100
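    # The placeholders must match the order of the arguments passed to format() below:
    # client id, client secret, version, latitude, longitude, radius, limit
    urlformat = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'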
    
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = urlformat.format(CLIENT_ID, 
                               CLIENT_SECRET, 
                               VERSION, 
                               lat, 
                               lng, 
                               radius, 
                               LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
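
Positional `{}` placeholders are easy to get out of order. One way to avoid that is to build the query string from a dictionary, as in the sketch below. It makes no network call; `build_explore_url` is a hypothetical helper, `'MY_ID'` and `'MY_SECRET'` are dummy values, and the default radius simply mirrors the function above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=10000, limit=100):
    """Build the venues/explore request URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as one value
        'radius': radius,                # in meters
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Dummy credentials, coordinates of The Beaches from the table above
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.676357, -79.293031)
print(url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`, which is the standard encoding for a comma in a query string. `requests.get(base_url, params=params)` would do the same encoding for you.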

Apply the function to our dataset. (This takes a long time, even with the small subset.)

In [3]:
toronto_venues = getNearbyVenues(names=df_tor['Neighbourhood'],
                                 latitudes=df_tor['Latitude'],
                                 longitudes=df_tor['Longitude']
                                )

Let's have a quick look at the number of venues in each neighborhood. 

In [123]:
# Count the venues in each neighborhood and sort the result in ascending order
num = pd.DataFrame(toronto_venues.groupby('Neighborhood').count()['Venue Category']).sort_values(by = 'Venue Category').reset_index()
num

Unnamed: 0,Neighborhood,Venue Category
0,Roselawn,2
1,Lawrence Park,3
2,The Beaches,4
3,"Moore Park, Summerhill East",4
4,"Forest Hill North, Forest Hill West",4
5,Rosedale,5
6,Davisville North,8
7,Christie,15
8,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",15
9,"Parkdale, Roncesvalles",15


Apparently, some neighborhoods have hardly any venues. If we cluster neighborhoods on their **_5 most frequent venue types_**, neighborhoods with **_fewer than 5 venues_** may not cluster well.   
**For the following analysis, I will drop neighborhoods with fewer than 10 venues.**  

In [147]:
neigh_to_drop = num[num['Venue Category'] < 10]['Neighborhood'].tolist() # names of the neighborhoods to drop

toronto_venues = toronto_venues[~toronto_venues['Neighborhood'].isin(neigh_to_drop)] # drop them in one step

toronto_venues.head() # check new dataframe

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
7,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
8,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.67753,-79.352295,Ice Cream Shop
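
The drop step can also be done without the intermediate list of names, by attaching each neighborhood's venue count to its rows. A minimal sketch on toy data (the neighborhoods `A` and `B` and the threshold of 2 are made up; the notebook uses 10):

```python
import pandas as pd

# Toy venues: neighborhood A has 3 venues, neighborhood B has only 1
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Park', 'Pub', 'Park'],
})

min_venues = 2  # illustrative threshold; the analysis above uses 10

# transform('count') returns, for every row, the venue count of that row's neighborhood
counts = toy.groupby('Neighborhood')['Venue Category'].transform('count')
kept = toy[counts >= min_venues]

print(kept)
```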


In [148]:
# Again, count the venues in each neighborhood and sort the result in ascending order
num = pd.DataFrame(toronto_venues.groupby('Neighborhood').count()['Venue Category']).sort_values(by = 'Venue Category').reset_index()
num

Unnamed: 0,Neighborhood,Venue Category
0,Christie,15
1,"Parkdale, Roncesvalles",15
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",15
3,"CN Tower, Bathurst Quay, Island airport, Harbo...",17
4,"Dovercourt Village, Dufferin",17
5,North Toronto West,18
6,"The Beaches West, India Bazaar",18
7,Business Reply Mail Processing Centre 969 Eastern,19
8,"Brockton, Exhibition Place, Parkdale Village",21
9,"High Park, The Junction South",23


#### Now it's time to count the number of venues of each type in each neighborhood. 

In [149]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues['Venue Category'], prefix="", prefix_sep="")

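# Move the last column to the front (in the original lab this brings 'Neighborhood' forward;
# here the frame only holds venue categories, so it just reorders them)
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
# Drop the venue category that is itself called 'Neighborhood', so it cannot clash
# with the neighborhood-name column added in the next cell
toronto_onehot = toronto_onehot.drop('Neighborhood', axis=1)  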

In [150]:
# Put Neighbourhood names back to the one-hot-coded dataframe
names = pd.DataFrame(toronto_venues['Neighborhood'])
toronto = pd.concat([names, toronto_onehot], axis=1)
toronto.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [151]:
# Get the frequency of occurrence of each venue type in each neighborhood
toronto_grouped = toronto.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
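
Why does the mean give a frequency? Each one-hot column holds 1 for venues of that category and 0 otherwise, so its mean within a neighborhood is the share of that neighborhood's venues in the category. A small worked sketch with made-up neighborhoods `A` and `B`, using `pd.crosstab` as a shortcut for the one-hot-then-mean steps:

```python
import pandas as pd

# Toy venues: A has 4 venues (2 coffee shops), B has 2 venues (both parks)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Pub', 'Park', 'Park'],
})

# Count venues per (neighborhood, category) and divide each row by its total
freq = pd.crosstab(toy['Neighborhood'], toy['Venue Category'], normalize='index')

print(freq)
```

For neighborhood `A` this gives 2/4 = 0.5 for coffee shops and 1/4 = 0.25 each for parks and pubs; every row sums to 1.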


We can have a look at the top 5 most common venues for each neighborhood. 

In [185]:
# Define a function that returns the most common venue types of a neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [195]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # 4th and beyond have no entry in indicators
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,Bar
1,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Beer Bar
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Breakfast Spot,Café,Convenience Store,Burrito Place
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Park,Pizza Place
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boutique,Sculpture Garden
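
For a single neighborhood, picking the top venue types boils down to sorting one row of frequencies. A sketch with made-up frequencies, using `Series.nlargest` instead of a full sort:

```python
import pandas as pd

# Made-up frequency row for one neighborhood (venue type -> share of venues)
row = pd.Series({'Coffee Shop': 0.5, 'Park': 0.25, 'Pub': 0.15, 'Gym': 0.10})

# Names of the 3 most frequent venue types, most frequent first
top3 = row.nlargest(3).index.tolist()

print(top3)
```

Note that a neighborhood with fewer distinct venue types than `num_top_venues` will still get every slot filled, with types whose frequency is 0, so the lower ranks are not always meaningful.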


#### With the cleaned dataset, we can apply _K-Means_ to cluster the neighborhoods into 5 clusters.

In [196]:
# Set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# Run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 2, 0, 0, 0, 3, 0], dtype=int32)
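
A quick way to see how balanced the clusters are is to count the labels. The sketch below uses only the ten labels printed above rather than the full `kmeans.labels_` array, so the counts are for illustration:

```python
import numpy as np

# The first ten labels from the output above
labels = np.array([0, 0, 0, 0, 2, 0, 0, 0, 3, 0])

# Number of neighborhoods per cluster; minlength keeps empty clusters in the result
sizes = np.bincount(labels, minlength=5)

print(sizes)
```

Eight of these ten neighborhoods land in cluster 0, which already hints at the imbalance discussed further down.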

In [197]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,Bar
1,0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Seafood Restaurant,Beer Bar
2,0,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Breakfast Spot,Café,Convenience Store,Burrito Place
3,0,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Park,Pizza Place
4,2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boutique,Sculpture Garden


In [198]:
# The 'neighborhood' column is spelled differently across the datasets,
# so rename it on a copy (renaming the df_tor slice in place triggers a SettingWithCopyWarning)
toronto_merged = df_tor.rename(columns = {'Neighbourhood':'Neighborhood'})

# Merge the cluster labels and top venues into the Toronto data, which carries latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood').dropna(axis = 0)
# The join leaves NaN for the neighborhoods dropped earlier, which turns the labels into floats; cast them back
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Furniture / Home Store
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Park,Pet Store,Sushi Restaurant,Gym,Pizza Place
43,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Gastropub,Italian Restaurant,Bakery
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Mexican Restaurant,Diner
47,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Sandwich Place,Dessert Shop,Pizza Place,Sushi Restaurant,Café


#### Let's visualize the clustered neighborhoods on the map! 

In [199]:
# Create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**As the map shows, most neighborhoods fall into a single cluster, which tells us little beyond the fact that most Toronto neighborhoods have a similar mix of venues.**   
Let's take a closer look at each cluster to see what is going on. 

In [200]:
# Cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
41,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Furniture / Home Store
42,East Toronto,0,Park,Pet Store,Sushi Restaurant,Gym,Pizza Place
43,East Toronto,0,Café,Coffee Shop,Gastropub,Italian Restaurant,Bakery
46,Central Toronto,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Mexican Restaurant,Diner
47,Central Toronto,0,Sandwich Place,Dessert Shop,Pizza Place,Sushi Restaurant,Café
51,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Pub
52,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant
53,Downtown Toronto,0,Coffee Shop,Park,Pub,Café,Bakery
54,Downtown Toronto,0,Coffee Shop,Clothing Store,Cosmetics Shop,Fast Food Restaurant,Middle Eastern Restaurant
55,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Restaurant,Café,Hotel


So, apparently, Toronto people love coffee. 

In [201]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
83,West Toronto,1,Breakfast Spot,Gift Shop,Movie Theater,Restaurant,Eastern European Restaurant


Umm, except for this place, which has no coffee shop among its top 5.

In [202]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
68,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Boutique,Sculpture Garden


Ok that makes sense. This is the airport. 

In [203]:
# Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
75,Downtown Toronto,3,Grocery Store,Café,Park,Restaurant,Nightclub


In [204]:
# Cluster 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
49,Central Toronto,4,Pub,Coffee Shop,Sushi Restaurant,American Restaurant,Supermarket


### 5. Summary

I then tried re-running the model with more clusters. It helped, but not by much. 
I think that clustering neighborhoods by their **_most frequent venue types_** may not be the best choice for Toronto, or anywhere else: some venue types (coffee shops, for example) are naturally far more common than others (say, gyms), because they require less space and money to open. 

I think clustering on **_top-rated venues_**, or some other feature, would work better. I will be sure to try that next time.   

Thanks for reading through! 