# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### Author: Altaaf Khan

In this assignment, we will explore, segment and cluster the neighborhoods in the city of Toronto. 

A Wikipedia page lists the postal codes, boroughs and neighborhoods we need to explore and cluster the neighborhoods in Toronto. We will scrape that page, clean the data, and read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we will replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

In [1]:
# Import various modules
import pandas as pd
import numpy as np
import requests

from pandas import json_normalize  # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Folium for generating maps
import folium # plotting library

# k-means from clustering stage
from sklearn.cluster import KMeans

## 1. Scrape the Wikipedia page
Link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
wiki_page='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_source = requests.get(wiki_page).text

soup = BeautifulSoup(wiki_source, 'lxml')

table = soup.find("table")
table_rows = table.tbody.find_all("tr")

pc = []

for tr in table_rows:
    td = tr.find_all("td")
    row = [cell.text.strip() for cell in td]  # use a separate name so the outer loop variable tr is not shadowed
    
    # Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if row != [] and row[1] != "Not assigned":
        # If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
        if "Not assigned" in row[2]: 
            row[2] = row[1]
        pc.append(row)

# Dataframe of three columns: PostalCode, Borough, and Neighborhood
df_pc = pd.DataFrame(pc, columns = ["PostalCode", "Borough", "Neighborhood"])
df_pc.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
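
The two cleaning rules applied inside the scraping loop can be isolated in a small function, which makes them easy to check without fetching the page. This is a minimal sketch: `clean_rows` and the sample rows are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
def clean_rows(rows):
    """Apply the notebook's two cleaning rules to [postal_code, borough, neighborhood] rows."""
    cleaned = []
    for postal_code, borough, neighborhood in rows:
        # Rule 1: skip postal codes with no assigned borough.
        if borough == "Not assigned":
            continue
        # Rule 2: an unassigned neighborhood takes the borough's name.
        if "Not assigned" in neighborhood:
            neighborhood = borough
        cleaned.append([postal_code, borough, neighborhood])
    return cleaned


# Hypothetical sample rows, made up for illustration (not scraped data).
sample_rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M9Z", "Example Borough", "Not assigned"],
]
cleaned_rows = clean_rows(sample_rows)
print(cleaned_rows)
```

The first sample row is dropped, the second is kept unchanged, and the third has its neighborhood replaced by its borough.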


In [3]:
# Group all neighbourhoods with the same postal code
df_pc = df_pc.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
df_pc.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
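
To see what the `groupby(...).apply(", ".join)` step does, here is a toy sketch with one postal code listed on two rows. The dataframe `toy` is a made-up stand-in for `df_pc`; only the grouping call mirrors the cell above.

```python
import pandas as pd

# Hypothetical rows: one postal code listed twice with different neighborhoods.
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Rouge Hill"],
})

# Join the neighborhood names of each (PostalCode, Borough) group into one string.
toy_grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(toy_grouped)
```

The two `M1B` rows collapse into a single row whose neighborhood is `"Malvern, Rouge"`. If the source table already has one row per postal code, the step leaves the rows unchanged apart from sorting them by the group keys.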


In [4]:
# Number of rows and columns in the dataframe
print("Shape: ", df_pc.shape)

Shape:  (103, 3)


##  2. Get the latitude and the longitude coordinates of each neighborhood

Because the geocoder package can be unreliable, we will instead use the geographical coordinates of each postal code provided in a CSV file at this link: http://cocl.us/Geospatial_data

In [5]:
# Read coordinates data
df_geo = pd.read_csv("http://cocl.us/Geospatial_data")
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [6]:
# Merge the two dataframes together
df_toronto = pd.merge(df_pc, df_geo, how='left', left_on = 'PostalCode', right_on = 'Postal Code')

# Drop duplicate Postal Code column
df_toronto.drop("Postal Code", axis=1, inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
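
A left merge silently leaves `NaN` coordinates for any postal code missing from the CSV file. One way to check for that, sketched here on made-up data, is to use the `indicator` and `validate` parameters of `pd.merge`. The frames `codes` and `coords` and the code `M9Z` are hypothetical.

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"]})  # "M9Z" is a made-up code
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = pd.merge(
    codes, coords,
    how="left",
    left_on="PostalCode", right_on="Postal Code",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="one_to_one",   # raises if a postal code is duplicated on either side
)

# Postal codes that found no coordinates in the right-hand frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.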


## 3. Exploring and clustering Toronto neighbourhoods

In this section we replicate the analysis we did for the New York City data, this time using the Toronto neighbourhood data.

###  3.1 Get latitude and longitude for Toronto

In [7]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### 3.2 Visualise Toronto neighbourhoods

In [8]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### 3.3 Select boroughs whose name contains *Toronto*

The assignment suggests looking at a subset of boroughs. I have chosen to work only with boroughs whose name contains *Toronto*.

In [9]:
df_toronto_b = df_toronto[df_toronto['Borough'].str.contains("Toronto")].reset_index(drop=True)
df_toronto_b.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


### 3.4 Visualise neighbourhoods that belong to boroughs whose name contains *Toronto*

In [10]:
# create map of neighbourhoods in boroughs whose name contains Toronto, using latitude and longitude values
map_toronto_b = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_toronto_b['Latitude'], df_toronto_b['Longitude'], df_toronto_b['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_b)  
    
map_toronto_b

###  3.5 Use Foursquare to explore the first neighbourhood in the Toronto borough dataframe

In [11]:
# Foursquare credentials. Please note I have removed these for privacy purposes

# your Foursquare ID
CLIENT_ID = 'TQ1BNVVJWH4WBOY44ABFPNYCHKRGJOTQPOEKXBLABNQRQQKY' 

# your Foursquare Secret
CLIENT_SECRET = 'LQUMB4JPNBKXNSJRMSK1PPTOTZPEKY42JEII4E1BUSCLI3RR' 

# Foursquare API version
VERSION = '20180605' 
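
One way to keep API keys out of a shared notebook is to read them from environment variables rather than typing them into a cell. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; these names and the helper `load_foursquare_credentials` are hypothetical and not something Foursquare prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

In the notebook, calling `load_foursquare_credentials()` with no argument would read from the real environment and fail loudly if a variable is not set.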

#### 3.5.1 Get the neighbourhood's name 

In [12]:
df_toronto_b.loc[0, 'Neighborhood']

'The Beaches'

#### 3.5.2 Get the neighborhood's latitude and longitude values.

In [13]:
neighborhood_latitude = df_toronto_b.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_toronto_b.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_toronto_b.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


#### 3.5.3 Now, let's get the top 100 venues that are in The Beaches within a radius of a kilometer (1000 meters)

In [14]:
# limit of number of venues returned by Foursquare API
LIMIT = 100 

# define radius of 1000m (1km)
radius = 1000 

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# display URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=TQ1BNVVJWH4WBOY44ABFPNYCHKRGJOTQPOEKXBLABNQRQQKY&client_secret=LQUMB4JPNBKXNSJRMSK1PPTOTZPEKY42JEII4E1BUSCLI3RR&v=20180605&ll=43.67635739999999,-79.2930312&radius=1000&limit=100'
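
As an alternative to formatting the query string by hand, the parameters can be collected in a dict and encoded with the standard library. This is a sketch under the assumption that the endpoint accepts the same parameters as the URL above; `build_explore_url` is a hypothetical helper and the credentials passed to it are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL used above, URL-encoding each parameter."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are those of The Beaches from the dataframe.
example_url = build_explore_url("ID", "SECRET", "20180605", 43.676357, -79.293031, 1000, 100)
print(example_url)
```

`urlencode` escapes the comma in `ll` as `%2C`, which is the standard encoding of the same value. `requests.get` also accepts a `params` dict and performs this encoding itself, so the dict can be passed there directly.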

#### 3.5.4 Send the GET request and examine the results

In [15]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5efe919ae116f90c6c044440'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 77,
  'suggestedBounds': {'ne': {'lat': 43.685357409000005,
    'lng': -79.28061062898105},
   'sw': {'lat': 43.66735739099998, 'lng': -79.30545177101895}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'lab

#### 3.5.5 Clean the json and structure it into a pandas dataframe

In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [17]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,Tori's Bakeshop,Vegetarian / Vegan Restaurant,43.672114,-79.290331
2,Beaches Bake Shop,Bakery,43.680363,-79.289692
3,The Beech Tree,Gastropub,43.680493,-79.288846
4,Ed's Real Scoop,Ice Cream Shop,43.67263,-79.287993


In [18]:
# number of venues returned by Foursquare
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

77 venues were returned by Foursquare.


### 3.6 Now let's explore all neighbourhoods in boroughs whose name contains *Toronto*: Central Toronto, Downtown Toronto, East Toronto and West Toronto

#### 3.6.1 Function to repeat the same process for all the neighbourhoods in the selected Toronto boroughs

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [20]:
toronto_venues = getNearbyVenues(names=df_toronto_b['Neighborhood'],
                                   latitudes=df_toronto_b['Latitude'],
                                   longitudes=df_toronto_b['Longitude']
                                  )


The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West,  Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High
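
The list comprehension in `getNearbyVenues` indexes `categories[0]`, which would raise an `IndexError` for a venue returned without a category. The sketch below shows one way to flatten the `items` list while tolerating that case. `extract_venue_rows` and the fake payload are hypothetical; the payload only imitates the shape of the response shown in 3.5.4.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'items' into tuples, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when the venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical payload shaped like the response shown earlier; values are made up.
fake_items = [
    {"venue": {"name": "Venue A", "location": {"lat": 43.0, "lng": -79.0},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.1, "lng": -79.1},
               "categories": []}},
]
rows = extract_venue_rows("Example Hood", 43.05, -79.05, fake_items)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the flattening logic without calling the API.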

#### 3.6.2 Check the size of the resulting dataframe, how many venues returned for each neighbourhood and the unique number of categories

In [21]:
# size of dataframe
print(toronto_venues.shape)
toronto_venues.head()

(3196, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant
2,The Beaches,43.676357,-79.293031,Beaches Bake Shop,43.680363,-79.289692,Bakery
3,The Beaches,43.676357,-79.293031,The Beech Tree,43.680493,-79.288846,Gastropub
4,The Beaches,43.676357,-79.293031,Ed's Real Scoop,43.67263,-79.287993,Ice Cream Shop


In [22]:
# number of venues by neighbourhood
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,100,100,100,100,100,100
"Brockton, Parkdale Village, Exhibition Place",100,100,100,100,100,100
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",47,47,47,47,47,47
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",14,14,14,14,14,14
Central Bay Street,100,100,100,100,100,100
Christie,100,100,100,100,100,100
Church and Wellesley,100,100,100,100,100,100
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,100,100,100,100,100,100
Davisville North,100,100,100,100,100,100


In [23]:
# unique categories of all returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 279 unique categories.


### 3.7 Now let's analyse each neighbourhood in boroughs whose name contains *Toronto*: Central Toronto, Downtown Toronto, East Toronto and West Toronto

#### 3.7.1 Neighbourhoods and categories

In [24]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Zoo,Accessories Store,Airport,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Art Gallery,Art Museum,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# examine the new dataframe size
toronto_onehot.shape

(3196, 279)

#### 3.7.2 Group neighbourhood rows by taking the mean of the frequency of occurrence of each category

In [26]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Zoo,Accessories Store,Airport,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Art Gallery,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.01,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.03,...,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02


In [27]:
# examine the new dataframe size 
toronto_grouped.shape

(39, 279)
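
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues in each category. The toy sketch below, on made-up data, shows the arithmetic. It also uses `DataFrame.insert` to add the `Neighborhood` column, which raises if a column with that name already exists. That may matter here: the one-hot output in 3.7.1 starts with `Zoo` rather than `Neighborhood`, which suggests (as an assumption, not something the notebook confirms) that a venue category literally named "Neighborhood" collided with the added column.

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood A, one in neighbourhood B.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Bakery", "Bakery", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)

# insert() raises ValueError if a "Neighborhood" column is already present.
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Mean of the indicators per neighbourhood = relative frequency of each category.
toy_freq = onehot.groupby("Neighborhood").mean()
print(toy_freq)
```

Neighbourhood A has two bakeries out of three venues, so its `Bakery` frequency is 2/3 and its `Park` frequency is 1/3.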

#### 3.7.3 Print each neighbourhood with top 3 venues

In [28]:
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
         venue  freq
0  Coffee Shop  0.12
1         Café  0.06
2        Hotel  0.05


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0         Café  0.07
1   Restaurant  0.06
2  Coffee Shop  0.06


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
         venue  freq
0         Park  0.09
1      Brewery  0.06
2  Pizza Place  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0             Café  0.14
1  Harbor / Marina  0.14
2      Coffee Shop  0.14


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.09
1     Sushi Restaurant  0.04
2  Japanese Restaurant  0.03


----Christie----
               venue  freq
0  Korean Restaurant  0.13
1               Café  0.07
2        Coffee Shop  0.07


----Church and Wellesley----
              venue  freq
0       Coffee Shop  0.11
1  Sus

#### 3.7.4 Put the results into a pandas dataframe

In [29]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
# create a new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Japanese Restaurant,Hotel,Restaurant,Park,Bakery,Gastropub,Creperie,Breakfast Spot
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Restaurant,Bar,Bakery,Furniture / Home Store,Gift Shop,Tibetan Restaurant,Italian Restaurant,Arts & Crafts Store
2,"Business reply mail Processing Centre, South C...",Park,Pizza Place,Coffee Shop,Brewery,Fast Food Restaurant,Sushi Restaurant,Bakery,Italian Restaurant,Electronics Store,Diner
3,"CN Tower, King and Spadina, Railway Lands, Har...",Harbor / Marina,Café,Coffee Shop,Park,Scenic Lookout,Sculpture Garden,Dog Run,Track,Dance Studio,Garden
4,Central Bay Street,Coffee Shop,Sushi Restaurant,Art Gallery,Café,Japanese Restaurant,Ramen Restaurant,Park,Burrito Place,Plaza,Pizza Place
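
The column names above rely on a three-element suffix list with a fallback to `th`, which is enough for ten columns. If a larger `num_top_venues` were ever used, a general ordinal helper would avoid labels such as "21th". The function `ordinal` below is a hypothetical helper, sketched as one possible approach.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take "th".
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


column_names = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(column_names[1], "|", column_names[-1])
```

For the ten columns used here the result matches the headers shown in the table above.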


###  3.8 Clustering Neighbourhoods

#### 3.8.1 Run k-means to cluster the neighbourhood into 5 clusters

In [31]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 3, 4, 0, 0, 0, 2, 3, 0], dtype=int32)
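
A quick way to see how evenly k-means split the neighbourhoods is to count the labels. The sketch below uses only the ten labels printed above as sample input; in the notebook the full `kmeans.labels_` array would be passed instead.

```python
import numpy as np

# The first ten labels printed above; in the notebook, pass kmeans.labels_ instead.
labels = np.array([2, 2, 3, 4, 0, 0, 0, 2, 3, 0])

# bincount returns how many times each integer label occurs.
cluster_sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(cluster_sizes):
    print(f"cluster {cluster}: {size} neighbourhoods")
```

Clusters that contain a single neighbourhood, as two of the clusters examined in section 3.9 do, can be a hint that a different number of clusters is worth trying.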

#### 3.8.2 Create a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood

In [32]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto_b

# merge df_toronto_b with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# replace NaN Cluster Label values with 0 and keep the labels as integers so they can index the colour list
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0).astype(int)

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Pub,Pizza Place,Bakery,Japanese Restaurant,Beach,Breakfast Spot,Tea Room,Caribbean Restaurant,Bar
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Café,Coffee Shop,Pub,Bank,Fast Food Restaurant,Italian Restaurant,Ramen Restaurant,Bakery,Pizza Place
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Indian Restaurant,Café,Coffee Shop,Beach,Grocery Store,Fast Food Restaurant,Sandwich Place,Bakery,Gym,Restaurant
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Coffee Shop,Bar,Café,Brewery,Vietnamese Restaurant,American Restaurant,Diner,Bakery,Italian Restaurant,French Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1,Park,Bookstore,College Quad,Gym / Fitness Center,College Gym,Coffee Shop,Café,Trail,Yoga Studio,Doner Restaurant


#### 3.8.3 Visualise the clusters

In [33]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(
        toronto_merged['Latitude'], 
        toronto_merged['Longitude'], 
        toronto_merged['Neighborhood'], 
        toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 3.9 Examine Clusters

#### 3.9.1 Cluster 1

In [34]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Coffee Shop,Pub,Pizza Place,Bakery,Japanese Restaurant,Beach,Breakfast Spot,Tea Room,Caribbean Restaurant,Bar
1,East Toronto,0,Greek Restaurant,Café,Coffee Shop,Pub,Bank,Fast Food Restaurant,Italian Restaurant,Ramen Restaurant,Bakery,Pizza Place
2,East Toronto,0,Indian Restaurant,Café,Coffee Shop,Beach,Grocery Store,Fast Food Restaurant,Sandwich Place,Bakery,Gym,Restaurant
3,East Toronto,0,Coffee Shop,Bar,Café,Brewery,Vietnamese Restaurant,American Restaurant,Diner,Bakery,Italian Restaurant,French Restaurant
5,Central Toronto,0,Coffee Shop,Italian Restaurant,Fast Food Restaurant,Restaurant,Café,Pizza Place,Sushi Restaurant,Dessert Shop,Gym,Bookstore
6,Central Toronto,0,Coffee Shop,Italian Restaurant,Sporting Goods Shop,Skating Rink,Café,Restaurant,Mexican Restaurant,Diner,Park,Yoga Studio
11,Downtown Toronto,0,Gastropub,Park,Café,Diner,Japanese Restaurant,Performing Arts Venue,Dance Studio,Pub,Italian Restaurant,Taiwanese Restaurant
12,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Men's Store,Diner,Park,Japanese Restaurant,Ice Cream Shop,Burger Joint,Thai Restaurant,Clothing Store
14,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Gastropub,Italian Restaurant,Ramen Restaurant,Café,Restaurant,Poke Place,Plaza,Pizza Place
17,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Art Gallery,Café,Japanese Restaurant,Ramen Restaurant,Park,Burrito Place,Plaza,Pizza Place


#### 3.9.2 Cluster 2

In [35]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,1,Park,Bookstore,College Quad,Gym / Fitness Center,College Gym,Coffee Shop,Café,Trail,Yoga Studio,Doner Restaurant


#### 3.9.3 Cluster 3

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Downtown Toronto,2,Coffee Shop,Café,Restaurant,Japanese Restaurant,Gastropub,Italian Restaurant,Clothing Store,Plaza,Bookstore,Cosmetics Shop
16,Downtown Toronto,2,Coffee Shop,Café,Japanese Restaurant,Hotel,Restaurant,Park,Bakery,Gastropub,Creperie,Breakfast Spot
18,Downtown Toronto,2,Coffee Shop,Café,Hotel,Theater,Tea Room,Gym,Restaurant,Art Gallery,Japanese Restaurant,Sushi Restaurant
19,Downtown Toronto,2,Coffee Shop,Café,Hotel,Park,Brewery,Gym,Japanese Restaurant,Theater,Restaurant,Plaza
20,Downtown Toronto,2,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Italian Restaurant,Concert Hall,Theater,Park,Vegetarian / Vegan Restaurant
21,Downtown Toronto,2,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Concert Hall,Gastropub,Gym,Cosmetics Shop,Seafood Restaurant
28,Downtown Toronto,2,Coffee Shop,Café,Restaurant,Japanese Restaurant,Hotel,Gastropub,Bakery,Park,Seafood Restaurant,Breakfast Spot
29,Downtown Toronto,2,Coffee Shop,Café,Hotel,Restaurant,Concert Hall,Theater,Seafood Restaurant,Thai Restaurant,Park,Gym
32,West Toronto,2,Café,Bar,Restaurant,Vegetarian / Vegan Restaurant,Bakery,Pizza Place,Cocktail Bar,Asian Restaurant,Italian Restaurant,Japanese Restaurant
33,West Toronto,2,Café,Coffee Shop,Restaurant,Bar,Bakery,Furniture / Home Store,Gift Shop,Tibetan Restaurant,Italian Restaurant,Arts & Crafts Store


#### 3.9.4 Cluster 4

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Central Toronto,3,Italian Restaurant,Coffee Shop,Sushi Restaurant,Café,Pizza Place,Indian Restaurant,Restaurant,Gym,Dessert Shop,Fast Food Restaurant
8,Central Toronto,3,Coffee Shop,Grocery Store,Italian Restaurant,Park,Gym,Thai Restaurant,Bagel Shop,Pub,Restaurant,Sandwich Place
9,Central Toronto,3,Coffee Shop,Sushi Restaurant,Thai Restaurant,Park,Italian Restaurant,Sandwich Place,Grocery Store,Liquor Store,Bagel Shop,Pub
10,Downtown Toronto,3,Coffee Shop,Grocery Store,Park,Convenience Store,Athletics & Sports,Filipino Restaurant,Breakfast Spot,Candy Store,Bistro,Sandwich Place
13,Downtown Toronto,3,Coffee Shop,Diner,Park,Theater,Café,Pub,Italian Restaurant,Restaurant,Breakfast Spot,Bakery
22,Central Toronto,3,Sushi Restaurant,Pharmacy,Coffee Shop,Spa,Café,Bank,Italian Restaurant,Bagel Shop,Fruit & Vegetable Store,Pet Store
23,Central Toronto,3,Park,Coffee Shop,Café,Bank,Sushi Restaurant,Japanese Restaurant,Italian Restaurant,Gym / Fitness Center,Trail,Burger Joint
35,West Toronto,3,Breakfast Spot,Sushi Restaurant,Café,Bar,Thai Restaurant,Coffee Shop,Pizza Place,Grocery Store,Bank,Bakery
36,West Toronto,3,Coffee Shop,Café,Pizza Place,Bakery,Italian Restaurant,Pub,Restaurant,Falafel Restaurant,Spa,Bank
38,East Toronto,3,Park,Pizza Place,Coffee Shop,Brewery,Fast Food Restaurant,Sushi Restaurant,Bakery,Italian Restaurant,Electronics Store,Diner


#### 3.9.5 Cluster 5

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Downtown Toronto,4,Harbor / Marina,Café,Coffee Shop,Park,Scenic Lookout,Sculpture Garden,Dog Run,Track,Dance Studio,Garden
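
The five tables above can also be condensed into one summary row per cluster. The sketch below counts the neighbourhoods in each cluster and picks the most frequent "1st Most Common Venue". `toy_merged` is a made-up stand-in for `toronto_merged` with only the two columns needed.

```python
import pandas as pd

# Hypothetical stand-in for toronto_merged with only the columns needed here.
toy_merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Greek Restaurant",
                              "Park", "Harbor / Marina", "Harbor / Marina"],
})

# Named aggregation: one output column per keyword argument.
summary = (
    toy_merged.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(
        n_neighbourhoods="count",
        top_venue=lambda s: s.value_counts().idxmax(),
    )
)
print(summary)
```

When two venue types are tied within a cluster, `idxmax` returns only one of them, so the summary is a rough guide rather than a full description of the cluster.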


This notebook was created by Altaaf Khan