## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

### Importing libraries needed to scrape data from Wikipedia

In [1]:
import requests
import lxml.html as lh
import pandas as pd
import numpy as np

### *Scraping* data from Wikipedia and converting it into a dataframe

Data to be scraped with Pandas.

In [2]:
# URL with data
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Read every HTML table on the page into a list of dataframes
df = pd.read_html(url, header = 0)

# Keep the first table, which holds the postal codes
df=df[0]
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


### *Cleaning* dataframe

Drop rows in which Borough is "Not assigned."

In [3]:
df1=df[~df['Borough'].str.contains('Not assigned')]
df1

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Combine Neighborhoods with same Postal Code.

In [4]:
df2 = df1.groupby(["Postal Code", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
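
As a hedged illustration of this grouping step, the sketch below uses a small made-up frame (toy rows, not the scraped table) in which one postal code appears on two rows. Grouping on `Postal Code` and `Borough` and joining the `Neighborhood` strings collapses them into a single comma-separated row.

```python
import pandas as pd

# Toy data (assumption for illustration): M5A appears on two rows
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Join the Neighborhood strings of each (Postal Code, Borough) group
combined = toy.groupby(["Postal Code", "Borough"], as_index=False)["Neighborhood"].agg(", ".join)
print(combined)
```

Groups come back sorted by the grouping keys, so the row order of the result can differ from the input.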

Where Neighborhood is "Not assigned", set it to the Borough name.

In [5]:
# Assigning to the row yielded by iterrows() does not update df2, so use a mask with .loc
unassigned = df2["Neighborhood"] == "Not assigned"
df2.loc[unassigned, "Neighborhood"] = df2.loc[unassigned, "Borough"]
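
When filling values like this, note that assigning to the `row` objects yielded by `iterrows()` generally does not write back to the dataframe, whereas a boolean mask with `.loc` does. A minimal sketch on made-up rows (the data is an assumption for illustration):

```python
import pandas as pd

# Toy rows (assumption for illustration), one with an unassigned Neighborhood
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Rows whose Neighborhood is the placeholder string (or missing entirely)
mask = toy["Neighborhood"].isna() | (toy["Neighborhood"] == "Not assigned")

# .loc with a mask writes to the dataframe itself
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```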

Verify dataframe obtained is the same as that shown in assignment.

In [6]:
# postal codes shown in the assignment
test_codes = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# collect the matching rows in the order listed above
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
test_df = pd.concat([df2[df2["Postal Code"]==postcode] for postcode in test_codes], ignore_index=True)

test_df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
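
An alternative sketch for this kind of spot check, assuming each postal code occurs only once in the frame: index by `Postal Code` and select with `.loc`, which returns the rows in the order requested. The frame below is toy data.

```python
import pandas as pd

# Toy frame (assumption for illustration)
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M4G", "M5G"],
    "Borough": ["Scarborough", "East York", "Downtown Toronto"],
    "Neighborhood": ["Malvern, Rouge", "Leaside", "Central Bay Street"],
})

wanted = ["M5G", "M1B"]  # order in which the rows should appear

# set_index + .loc keeps the order of `wanted`; reset_index restores the column
subset = toy.set_index("Postal Code").loc[wanted].reset_index()
print(subset)
```

Unlike a loop of filters, `.loc` raises a `KeyError` if a requested code is absent, which can be useful as a check.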


Cleaned dataframe shape.

In [7]:
print('The shape of the cleaned dataframe is: ',df2.shape)

The shape of the cleaned dataframe is:  (103, 3)


### Import data coordinates from website

Load the coordinates from the provided CSV file instead of making API calls, which were not working properly.

In [8]:
geo_coord = pd.read_csv('https://cocl.us/Geospatial_data')
geo_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


### Merge the cleaned Wikipedia and Geospatial Coordinate tables

In [9]:
df3 = df2.merge(geo_coord, on="Postal Code", how="left")
df3.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
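
A left merge keeps every row of the left table even when the lookup table has no match, so it is worth checking for missing coordinates afterwards. A minimal sketch with toy tables; the postal code `M9Z` is a made-up value that is deliberately absent from the lookup:

```python
import pandas as pd

# Toy tables (assumption for illustration)
areas = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every row of `areas`; unmatched codes get NaN coordinates
merged = areas.merge(coords, on="Postal Code", how="left")

# Postal codes that found no coordinates in the lookup table
missing = merged.loc[merged["Latitude"].isna(), "Postal Code"].tolist()
print(missing)
```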


Verify dataframe obtained is the same as that shown in assignment.

In [10]:
# postal codes shown in the assignment
test_list2 = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# collect the matching rows in the order listed above
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
test_df2 = pd.concat([df3[df3["Postal Code"]==postcode] for postcode in test_list2], ignore_index=True)

test_df2

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442


### Import libraries needed for visualization and clustering

In [11]:
!pip install folium
!pip install geopy

import json
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from geopy.geocoders import Nominatim
import geopy.distance
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

print("Libraries imported.")

Libraries imported.


### Visualize map of Toronto with Borough and Neighborhood information 

Retrieve Toronto's geographical location.

In [12]:
address = 'Toronto'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Toronto coordinates are latitude {}, longitude {}.'.format(latitude, longitude))

Toronto coordinates are latitude 43.6534817, longitude -79.3839347.


Produce a map of Toronto with Borough and Neighborhoods pinpointed.

In [13]:
# Map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### Filter Neighborhoods within 5 km of Toronto's city centre and request top venue data via Foursquare

Find Neighborhoods within 5 km of Toronto's city centre.

In [14]:
neighborhoods_within5k_Toronto = []

for x in range(len(df3)):
    # geodesic distance in km between the city centre and this postal code's coordinates
    distance_citycenter = geopy.distance.geodesic((latitude, longitude), (df3['Latitude'][x], df3['Longitude'][x])).km
    if distance_citycenter <= 5.0:
        neighborhoods_within5k_Toronto.append(df3['Neighborhood'][x])
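
For intuition about what the distance filter computes, here is a dependency-free sketch using the haversine (great-circle) formula on a spherical Earth. This is a swapped-in approximation, not what `geopy.distance.geodesic` does (that uses an ellipsoidal model), so results can differ slightly. The mean Earth radius of 6371 km and the function name `haversine_km` are assumptions of this sketch; the coordinates are taken from the outputs above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of the spherical model

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

centre = (43.6534817, -79.3839347)   # Toronto, as printed by the geocoding cell
m4k = (43.679557, -79.352188)        # The Danforth West, Riverdale
m1b = (43.806686, -79.194353)        # Malvern, Rouge

for name, point in [("M4K", m4k), ("M1B", m1b)]:
    d = haversine_km(*centre, *point)
    print(name, round(d, 2), "km", "within 5 km" if d <= 5.0 else "outside 5 km")
```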

Produce a new dataframe with filtered Neighborhoods.

In [15]:
df4 = pd.DataFrame(neighborhoods_within5k_Toronto, columns = ['Neighborhood']) 
df5 = pd.merge(df3, df4, how='inner')
df5

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
1,M4M,East Toronto,Studio District,43.659526,-79.340923
2,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
3,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049
4,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
5,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
6,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
7,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
8,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
9,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418


Visualize map of Toronto with Neighborhoods within 5 km of the city's centre.

In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['Borough'], df5['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### Request information about venues in Neighborhoods close to Toronto's city centre

Foursquare authentication info.

In [33]:
# define Foursquare Credentials and Version
CLIENT_ID = 'Client ID' # your Foursquare ID
CLIENT_SECRET = 'Client secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: Client ID
CLIENT_SECRET:Client secret


Foursquare API call for top 100 venues in a radius of 500 m from every Neighborhood close to the city centre.

In [18]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['Postal Code'], df5['Borough'], df5['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
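
A hedged sketch of a slightly more defensive version of this request step: the query string is passed as a `params` dict so that `requests` handles the encoding, and the venue fields are read from the same response layout indexed above. The helper names `build_params` and `parse_items` and the sample payload are made up for illustration; the live call is left commented out because it needs valid credentials.

```python
def build_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Query parameters for the venues/explore endpoint used above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }

def parse_items(payload):
    """Extract (name, lat, lng, category) tuples from an explore-style response."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Venues without a category are kept, with None as the category name
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

# Live call (needs real credentials), shown for context only:
# import requests
# resp = requests.get("https://api.foursquare.com/v2/venues/explore",
#                     params=build_params(CLIENT_ID, CLIENT_SECRET, VERSION,
#                                         43.679557, -79.352188, 500, 100),
#                     timeout=10)
# resp.raise_for_status()
# rows = parse_items(resp.json())

# Made-up payload with the same nesting as the response indexed above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Pantheon",
               "location": {"lat": 43.677621, "lng": -79.351434},
               "categories": [{"name": "Greek Restaurant"}]}},
]}]}}
print(parse_items(sample))
```

Returning an empty list for an unexpected payload keeps one failed request from aborting the whole loop, at the cost of silently skipping that location.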

Extract information from API call and cast it into a Pandas dataframe.

In [19]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Postal Code', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(1407, 9)


Unnamed: 0,Postal Code,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
2,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
3,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
4,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.67753,-79.352295,Ice Cream Shop


Check how many venues were retrieved per Neighborhood. The maximum is set at 100, but fewer may be returned.

In [20]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Postal Code,Borough,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Berczy Park,57,57,57,57,57,57,57,57
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23,23,23
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15,15,15
Central Bay Street,64,64,64,64,64,64,64,64
Christie,17,17,17,17,17,17,17,17
Church and Wellesley,77,77,77,77,77,77,77,77
"Commerce Court, Victoria Hotel",100,100,100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100,100,100


Check the number of unique venue categories; these categories are the features the k-means algorithm will use to cluster Neighborhoods.

In [21]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())-1))

There are 222 unique categories.


### Analyse the Neighborhoods close to Toronto's city centre

One-hot encode the data retrieved from Foursquare in preparation for running the clustering algorithm.

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = venues_df['Neighborhood']
toronto_onehot['Postal Code'] = venues_df['Postal Code'] 
toronto_onehot['Borough'] = venues_df['Borough'] 

# move neighborhood column to the first column
cols = list(toronto_onehot.columns.values) # Make a list of all of the columns in the df
cols.pop(cols.index('Neighborhood')) #Remove Neighborhood from list
cols.pop(cols.index('Postal Code')) #Remove Postal Code from list
cols.pop(cols.index('Borough')) #Remove Borough from list
toronto_onehot = toronto_onehot[['Postal Code','Borough','Neighborhood']+cols] #Create new dataframe with columns in the order you want

print(toronto_onehot.shape)
toronto_onehot.head()

(1407, 225)


Unnamed: 0,Postal Code,Borough,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4K,East Toronto,"The Danforth West, Riverdale",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group rows by Neighborhood and then compute the mean frequency of occurrence for each venue category.

In [23]:
toronto_onehot_grouped = toronto_onehot.groupby(["Postal Code", "Borough", "Neighborhood"]).mean().reset_index()

print(toronto_onehot_grouped.shape)
toronto_onehot_grouped

(26, 225)


Unnamed: 0,Postal Code,Borough,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M4K,East Toronto,"The Danforth West, Riverdale",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
1,M4M,East Toronto,Studio District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.02439
2,M4T,Central Toronto,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0
4,M4W,Downtown Toronto,Rosedale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4X,Downtown Toronto,"St. James Town, Cabbagetown",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4Y,Downtown Toronto,Church and Wellesley,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,...,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,0.025974
7,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277
8,M5B,Downtown Toronto,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0
9,M5C,Downtown Toronto,St. James Town,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0
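
To see why the mean of one-hot columns gives a frequency, here is a minimal sketch on a toy venue list (the neighborhoods `A` and `B` and their venues are made up). Each venue contributes a 1 in exactly one category column, so the per-group mean is the share of that group's venues in each category.

```python
import pandas as pd

# Toy venue list (assumption for illustration)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Coffee Shop", "Coffee Shop", "Park", "Bar", "Park", "Park"],
})

# One indicator column per category, cast to int so the values print as 0/1
onehot = pd.get_dummies(toy["VenueCategory"]).astype(int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of 0/1 indicators per group is the share of venues in each category
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, which makes neighborhoods with very different venue counts comparable.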


Display the 5 most frequent venues per Neighborhood to understand the dataset better. 

In [24]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['Postal Code', 'Borough', 'Neighborhood']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_onehot_grouped['Postal Code']
neighborhoods_venues_sorted['Borough'] = toronto_onehot_grouped['Borough']
neighborhoods_venues_sorted['Neighborhood'] = toronto_onehot_grouped['Neighborhood']

for ind in np.arange(toronto_onehot_grouped.shape[0]):
    row_categories = toronto_onehot_grouped.iloc[ind, :].iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    neighborhoods_venues_sorted.iloc[ind, 3:] = row_categories_sorted.index.values[0:num_top_venues]

# neighborhoods_venues_sorted.sort_values(freqColumns, inplace=True)
print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted

(26, 8)


Unnamed: 0,Postal Code,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4K,East Toronto,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,Coffee Shop,Restaurant,Ice Cream Shop
1,M4M,East Toronto,Studio District,Café,Coffee Shop,Gastropub,Bakery,Brewery
2,M4T,Central Toronto,"Moore Park, Summerhill East",Park,Playground,Summer Camp,Restaurant,College Rec Center
3,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",Pub,Coffee Shop,Bagel Shop,Supermarket,Bank
4,M4W,Downtown Toronto,Rosedale,Park,Playground,Trail,Cuban Restaurant,Donut Shop
5,M4X,Downtown Toronto,"St. James Town, Cabbagetown",Coffee Shop,Chinese Restaurant,Restaurant,Pub,Italian Restaurant
6,M4Y,Downtown Toronto,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant
7,M5A,Downtown Toronto,"Regent Park, Harbourfront",Coffee Shop,Pub,Park,Bakery,Theater
8,M5B,Downtown Toronto,"Garden District, Ryerson",Clothing Store,Coffee Shop,Restaurant,Japanese Restaurant,Café
9,M5C,Downtown Toronto,St. James Town,Coffee Shop,Café,Cocktail Bar,Gastropub,American Restaurant
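
The ranking step can be sketched in isolation as follows. The frequency table, the helper name `ordinal`, and the choice of two top venues are assumptions for illustration; the toy values avoid ties, since categories with equal frequency may come out in either order.

```python
import pandas as pd

# Toy frequency table (assumption for illustration): rows are neighborhoods
freq = pd.DataFrame(
    {"Bar": [0.2, 0.0], "Coffee Shop": [0.5, 0.1], "Park": [0.3, 0.9]},
    index=["A", "B"],
)

def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... (sufficient for small n up to 20)."""
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(n, "th")
    return "{}{}".format(n, suffix)

num_top = 2
columns = ["{} Most Common Venue".format(ordinal(i + 1)) for i in range(num_top)]

# For each row, sort categories by frequency and keep the first num_top names
top = pd.DataFrame(
    [row.sort_values(ascending=False).index[:num_top].tolist() for _, row in freq.iterrows()],
    index=freq.index,
    columns=columns,
)
print(top)
```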


**Run the clustering algorithm, setting the number of clusters to 5.**

In [25]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_onehot_grouped.drop(columns=["Postal Code", "Borough", "Neighborhood"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 0, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       3, 1, 1, 1])
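
To make the clustering step less of a black box, here is a minimal NumPy sketch of the basic k-means iteration (Lloyd's algorithm): assign each row to its nearest centroid, then move each centroid to the mean of its rows. It is an illustration only and not scikit-learn's implementation, which adds k-means++ initialisation and other refinements. The function name `lloyd_kmeans`, the toy data, and the choice of the first `k` rows as starting centroids are assumptions of this sketch.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20):
    """Basic k-means: returns (labels, centroids). Starts from the first k rows."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Squared distance from every row to every centroid, shape (n_rows, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):            # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                       # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy frequency vectors: rows 0 and 2 lean one way, rows 1 and 3 the other
X = np.array([
    [0.9, 0.1],
    [0.1, 0.9],
    [0.8, 0.2],
    [0.2, 0.8],
])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; only the grouping of rows does, which is why the labels are interpreted afterwards by looking at each cluster's most common venues.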

Take the dataframe with the 5 most frequent venues and append the Cluster Labels obtained using the kmeans algorithm.

In [26]:
# create a new dataframe that includes the cluster label as well as the top 5 venues for each neighborhood
toronto_merged = df5.copy()

# add clustering labels (assumes df5 rows are in the same order as toronto_onehot_grouped)
toronto_merged["Cluster Labels"] = kmeans.labels_

# join the top-venue columns onto the neighborhood coordinates by Postal Code
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.drop(columns=["Borough", "Neighborhood"]).set_index("Postal Code"), on="Postal Code")

# sort the results by Cluster Labels
toronto_merged.sort_values(["Cluster Labels"], inplace=True)

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!

(26, 11)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,0,Park,Playground,Summer Camp,Restaurant,College Rec Center
0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Italian Restaurant,Coffee Shop,Restaurant,Ice Cream Shop
23,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,1,Bar,Restaurant,Coffee Shop,Asian Restaurant,Men's Store
21,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.38228,1,Coffee Shop,Café,Gym,Restaurant,Japanese Restaurant
20,M5W,Downtown Toronto,Stn A PO Boxes,43.646435,-79.374846,1,Coffee Shop,Café,Japanese Restaurant,Cocktail Bar,Italian Restaurant


**Visualize the clustered Neighborhoods in Toronto's map.**

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, post, bor, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postal Code'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup('{} ({}): {} - Cluster {}'.format(bor, post, poi, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Show the Neighborhoods per Cluster Label to gain further insights.

In [28]:
print('Cluster 0')
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1]+[2] + list(range(5, toronto_merged.shape[1]))]]

Cluster 0


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Central Toronto,"Moore Park, Summerhill East",0,Park,Playground,Summer Camp,Restaurant,College Rec Center


In [29]:
print('Cluster 1')
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + [2] + list(range(5, toronto_merged.shape[1]))]]

Cluster 1


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,East Toronto,"The Danforth West, Riverdale",1,Greek Restaurant,Italian Restaurant,Coffee Shop,Restaurant,Ice Cream Shop
23,West Toronto,"Little Portugal, Trinity",1,Bar,Restaurant,Coffee Shop,Asian Restaurant,Men's Store
21,Downtown Toronto,"First Canadian Place, Underground city",1,Coffee Shop,Café,Gym,Restaurant,Japanese Restaurant
20,Downtown Toronto,Stn A PO Boxes,1,Coffee Shop,Café,Japanese Restaurant,Cocktail Bar,Italian Restaurant
18,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",1,Café,Mexican Restaurant,Vietnamese Restaurant,Coffee Shop,Bakery
17,Downtown Toronto,"University of Toronto, Harbord",1,Café,Japanese Restaurant,Restaurant,Bar,Italian Restaurant
16,Central Toronto,"The Annex, North Midtown, Yorkville",1,Café,Sandwich Place,Coffee Shop,BBQ Joint,Burger Joint
15,Downtown Toronto,"Commerce Court, Victoria Hotel",1,Coffee Shop,Café,Restaurant,Hotel,Gym
14,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",1,Coffee Shop,Hotel,Café,Restaurant,Salad Place
13,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",1,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant


In [30]:
print('Cluster 2')
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + [2] + list(range(5, toronto_merged.shape[1]))]]

Cluster 2


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
19,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",2,Airport Service,Airport Lounge,Boat or Ferry,Sculpture Garden,Plane


In [31]:
print('Cluster 3')
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + [2] + list(range(5, toronto_merged.shape[1]))]]

Cluster 3


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
22,Downtown Toronto,Christie,3,Grocery Store,Café,Park,Candy Store,Diner


In [32]:
print('Cluster 4')
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + [2] + list(range(5, toronto_merged.shape[1]))]]

Cluster 4


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Downtown Toronto,Rosedale,4,Park,Playground,Trail,Cuban Restaurant,Donut Shop


## Observations from the clusters

+ **Cluster 0:** A single Neighborhood in Central Toronto where parks, playgrounds and summer camps are the most common venues.
+ **Cluster 1:** The largest cluster, with Neighborhoods from Central, East, West and Downtown Toronto, where restaurants, coffee shops and bars are the most common establishments. The ambiance is probably what one would associate with vibrant city life.
+ **Cluster 2:** A single Neighborhood group in Downtown Toronto, on an island offshore with an airport close by, so venues associated with airport services are the most common.
+ **Cluster 3:** A single Neighborhood in Downtown Toronto, close to the edge of the 5 km radius, where establishments associated with a more suburban lifestyle start to appear.
+ **Cluster 4:** A single Neighborhood in Downtown Toronto, where recreational areas and establishments associated with a more suburban lifestyle can be found.