In [119]:
# import needed libraries
import pandas as pd
import numpy as np
import requests
import lxml.html as lh

In this project we will be analyzing neighborhoods in Toronto. First, we will need the data! Fortunately, Wikipedia has a list of postal codes that we can use.  
Let's load it.

In [120]:
# Set the URL we want to scrape
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [121]:
# Sanity check: the first 12 rows should all have the same number of cells
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [122]:
# Let's parse the first row as the header
tr_elements = doc.xpath('//tr')

#Create empty list
col=[]
i=0

#For each cell in the header row, store its text and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print( '%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postcode"
2:"Borough"
3:"Neighbourhood
"


In [123]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Leave the first column (the postal code) as text
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [124]:
# Just to be sure, let's check the length of each column --> they should be the same
[len(C) for (title,C) in col]

[287, 287, 287]
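
The row-collection logic above can be tried on a small inline table without touching the network. The sketch below uses Python's standard-library `html.parser` instead of `lxml`; the `TableRows` class and the sample HTML are made up for illustration and are not the method used in this notebook.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # strip() also removes trailing newlines inside a cell
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical sample in the same shape as the Wikipedia table
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
widths = {len(row) for row in parser.rows}  # a single width means a regular table
```

Stripping each cell while parsing means the trailing newline never reaches the column name.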

In [125]:
# Now we are ready to create our pandas dataframe
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood\n
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


Great, we have the data! However, notice the 'Not assigned' values and the trailing \n in the Neighbourhood column.  
We have to fix both, so let's wrangle the data a bit.

In [126]:
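# Let's fix the \n issue first; replace the newline character with an empty string and rename the column
# regex=False treats '\n' as a literal newline, whatever the pandas default for `regex` is
df['Neighbourhood\n'] = df['Neighbourhood\n'].str.replace('\n', '', regex=False)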
df.rename(columns={"Postcode": "PostalCode","Neighbourhood\n": "Neighborhood"}, inplace=True)

In [127]:
# Now let's handle the 'Not assigned' values
df = df[df.Borough != 'Not assigned']

# We removed all the 'Not assigned' values from the Borough col
# Now, let's turn our attention to concatenating the neighborhoods that share a postal code

# first, group by postal code
df_grouped = df.groupby(['PostalCode'])['Neighborhood'].apply(', '.join).reset_index()
df = df_grouped.merge(df, on = 'PostalCode') # join the grouped df with the regular one
df.drop_duplicates(subset= "PostalCode", inplace=True) # remove duplicate postal codes from joined df

# second, let's remove the unnecessary column and rename our headers
df = df.drop(columns = 'Neighborhood_y')
df.rename(columns={"Neighborhood_x": "Neighborhood"}, inplace=True)

# Now let's handle the 'Not assigned' values in the Neighborhood column
df.loc[(df.Neighborhood == 'Not assigned'),'Neighborhood'] = df.loc[(df.Neighborhood == 'Not assigned'),'Borough']

# Last, rearrange column order 
cols = ['PostalCode', 'Borough', 'Neighborhood']
df = df[cols]

# Now, let's get the shape of our dataframe and print out the first 12 rows
print('The shape of our dataframe is: ',df.shape)
df.head(12)

The shape of our dataframe is:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
2,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
5,M1E,Scarborough,"Guildwood, Morningside, West Hill"
8,M1G,Scarborough,Woburn
9,M1H,Scarborough,Cedarbrae
10,M1J,Scarborough,Scarborough Village
11,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
14,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
17,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
20,M1N,Scarborough,"Birch Cliff, Cliffside West"
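
The group, merge, and de-duplicate sequence above can probably be condensed into a single `groupby`. The sketch below assumes that each postal code belongs to exactly one borough, and it runs on a small made-up frame (`toy`) instead of the scraped data.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative)
toy = pd.DataFrame({
    "PostalCode":   ["M1B", "M1B", "M5A", "M7A"],
    "Borough":      ["Scarborough", "Scarborough", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Rouge", "Malvern", "Harbourfront", "Not assigned"],
})

# One row per postal code: keep the borough, join the neighborhood names
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

# Fall back to the borough name where no neighborhood is assigned
missing = combined["Neighborhood"] == "Not assigned"
combined.loc[missing, "Neighborhood"] = combined.loc[missing, "Borough"]

print(combined)
```

Grouping on both keys keeps `Borough` in the result, so no merge or duplicate removal is needed afterwards.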


Our dataframe looks good! Let's now add the coordinates.

In [128]:
# We have our coordinates stored in a .csv; let's store the data in a pandas df
coord = pd.read_csv('Geospatial_Coordinates.csv')

# We can merge our two dfs based on postal code
coord.rename(columns = {'Postal Code': 'PostalCode'}, inplace = True)
df = df.merge(coord, on = 'PostalCode') # add latitude and longitude to each postal code

# Let's see our df
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
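
An inner merge silently drops postal codes that have no match in the coordinates file. One way to check for that, sketched here on made-up frames, is to merge with `how='left'` and `indicator=True` and list the unmatched rows; `validate='one_to_one'` additionally raises an error if a postal code appears more than once on either side.

```python
import pandas as pd

hoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],   # "M9Z" is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

checked = hoods.merge(coords, on="PostalCode", how="left",
                      indicator=True, validate="one_to_one")

# Rows flagged "left_only" found no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```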


We are done with the data wrangling for now, so let the fun begin!  
Let's explore the city of Toronto a little bit. We will use the Foursquare API to get venue data that we can map and cluster.

In [129]:
# First, let's load additional libraries
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0+ also exposes this as pd.json_normalize)

Now that we have all the necessary libraries, we can locate Toronto's coordinates.

In [130]:
address = 'Toronto'
# let's use geolocator to find the coordinates of Toronto
geolocator = Nominatim(user_agent="ca_exp")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
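
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` fails with an `AttributeError`. A small guard, sketched below, makes that case explicit. The helper `coordinates_or_fail` and the offline stand-ins `FakeLocation` and `fake_geocode` are hypothetical names introduced only for this example.

```python
def coordinates_or_fail(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing is found.

    `geocode` is any callable shaped like Nominatim(...).geocode.
    """
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


# Stand-ins so the sketch runs offline (not part of geopy)
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(address):
    known = {"Toronto": FakeLocation(43.653963, -79.387207)}
    return known.get(address)   # None for unknown addresses

toronto = coordinates_or_fail(fake_geocode, "Toronto")
print(toronto)
```

In the notebook, the call would presumably be `coordinates_or_fail(geolocator.geocode, address)`.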


Let's create the map of Toronto now, with a marker for each postal code area labeled by neighborhood and borough.

In [131]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}; {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to use the Foursquare API to explore venues.

In [132]:
CLIENT_ID = 'SYGGTJ5AS5QEBE0VA5XIOVWRN0MROFX5LI5C5UGSKHNTS3DA' # your Foursquare ID
CLIENT_SECRET = 'HMKAHE0SZNZXRGRJKSHAPCLV52CRLDPBNCRJBBPFUAMODJ0S' # your Foursquare Secret
VERSION = '20191216' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: SYGGTJ5AS5QEBE0VA5XIOVWRN0MROFX5LI5C5UGSKHNTS3DA
CLIENT_SECRET:HMKAHE0SZNZXRGRJKSHAPCLV52CRLDPBNCRJBBPFUAMODJ0S
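
Credentials that are hard-coded and printed in a notebook end up in every shared copy of it. One common alternative, sketched here, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    The variable names used here are hypothetical; pick whatever suits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# A plain dict stands in for os.environ so the sketch runs anywhere
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_SECRET:", mask(client_secret))
```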


Let's start by exploring a single borough, Downtown Toronto.

In [133]:
dt_data = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


In [134]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="ca_exp")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


Let's map all the neighborhoods in Downtown Toronto.

In [135]:
# create map of Downtown Toronto using latitude and longitude values
map_dt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(dt_data['Latitude'], dt_data['Longitude'], dt_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt)  
    
map_dt

Now, let's explore 'Church and Wellesley' in more depth using the Foursquare API.

In [136]:
# First, let's filter the data
neighborhood_latitude = dt_data.loc[2, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dt_data.loc[2, 'Longitude'] # neighborhood longitude value

neighborhood_name = dt_data.loc[2, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Church and Wellesley are 43.6658599, -79.38315990000001.


Let's get the top 100 venues within a radius of 500 meters of Church and Wellesley.

In [137]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

# Store the results
results = requests.get(url).json()
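
The long format string above can also be assembled from a dictionary of query parameters, which keeps each value next to its name and handles URL encoding. The sketch below uses the standard library's `urlencode`; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the venues/explore URL from a dict of query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, not real ones
demo_url = build_explore_url("ID", "SECRET", "20191216", 43.66586, -79.38316)
print(demo_url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`, which servers normally decode back to a comma. `requests.get(base_url, params=params)` accepts the same dictionary directly.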


In the Foursquare API response, all the relevant information is under the 'items' key. We will create a function that extracts the category name from the JSON, then continue the analysis.


In [138]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's clean the JSON and store it in a pandas dataframe.

In [139]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Storm Crow Manor,Theme Restaurant,43.66684,-79.381593
1,DanceLifeX Centre,Dance Studio,43.666956,-79.385297
2,Smith,Breakfast Spot,43.666927,-79.381421
3,The Alley,Bubble Tea Shop,43.665922,-79.385567
4,Sansotei Ramen 三草亭,Ramen Restaurant,43.666735,-79.385353


In [140]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

83 venues were returned by Foursquare.


Now that we have venue data for one neighborhood, let's scale up and fetch venue data for all neighborhoods in Downtown Toronto.

In [141]:
# The following function will get venues data from all neighborhoods:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
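
The list comprehension in `getNearbyVenues` indexes `categories[0]`, so a venue returned without any category would raise an `IndexError`. A more tolerant flattening step might look like the sketch below. The helper `venue_rows` and the hand-written `sample_items` are hypothetical and only mimic the shape of the `items` list used above.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hand-written items in the expected shape (not real API output)
sample_items = [
    {"venue": {"name": "Rosedale Park",
               "location": {"lat": 43.682328, "lng": -79.378934},
               "categories": [{"name": "Playground"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.68, "lng": -79.37},
               "categories": []}},
]

rows = venue_rows("Rosedale", 43.679563, -79.377529, sample_items)
```

Venues without a category get `None` in the last position, which pandas stores as a missing value.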

In [142]:
# Let's apply the function to our dataframe

dt_venues = getNearbyVenues(names=dt_data['Neighborhood'],
                                   latitudes=dt_data['Latitude'],
                                   longitudes=dt_data['Longitude']
                                  )



Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park


Let's check the size of the resulting dataframe.

In [143]:
print(dt_venues.shape)
dt_venues.head()

(1279, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"Cabbagetown, St. James Town",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


In [144]:
# Number of venues returned per Neighborhood
dt_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",15,15,15,15,15,15
"Cabbagetown, St. James Town",43,43,43,43,43,43
Central Bay Street,84,84,84,84,84,84
"Chinatown, Grange Park, Kensington Market",94,94,94,94,94,94
Christie,17,17,17,17,17,17
Church and Wellesley,83,83,83,83,83,83
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"Design Exchange, Toronto Dominion Centre",100,100,100,100,100,100


In [145]:
# Number of unique venue categories in the borough
print('There are {} unique categories.'.format(len(dt_venues['Venue Category'].unique())))

There are 205 unique categories.


Now that we have done some descriptive analysis, we can start clustering the neighborhoods.  
First, we need to one-hot encode the venue categories; then we can cluster the neighborhoods with the k-means method.

In [146]:
# one hot encoding
dt_onehot = pd.get_dummies(dt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dt_onehot['Neighborhood'] = dt_venues['Neighborhood'] 
dt_onehot = dt_onehot[ ['Neighborhood'] + [ col for col in dt_onehot.columns if col != 'Neighborhood' ] ]
dt_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Rosedale,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,"Cabbagetown, St. James Town",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's group the data by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each venue category.

In [147]:
dt_grouped = dt_onehot.groupby('Neighborhood').mean().reset_index()
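
To see why the mean of the one-hot columns is a frequency, it may help to run the same two steps on a tiny made-up table. The frame `toy_venues` below is illustrative only.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighborhood":   ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Café", "Trail", "Café", "Café"],
})

# One indicator column per category
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Mean of the indicators = share of each category within the neighborhood
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable.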


Let's print each neighborhood along with its top 5 venue categories.

In [148]:
num_top_venues = 5

for hood in dt_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dt_grouped[dt_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2  Thai Restaurant  0.04
3       Steakhouse  0.04
4      Salad Place  0.03


----Berczy Park----
         venue  freq
0  Coffee Shop  0.09
1       Bakery  0.04
2   Steakhouse  0.04
3         Café  0.04
4  Cheese Shop  0.04


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                venue  freq
0      Airport Lounge  0.13
1     Airport Service  0.13
2    Airport Terminal  0.13
3             Airport  0.07
4  Airport Food Court  0.07


----Cabbagetown, St. James Town----
         venue  freq
0  Coffee Shop  0.09
1   Restaurant  0.07
2         Café  0.05
3  Pizza Place  0.05
4       Bakery  0.05


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.14
1      Sandwich Place  0.05
2  Italian Restaurant  0.05
3        Burger Joint  0.04
4      Ice Cream Shop  0.04


----Chinatown, Grange

Let's put this into a pandas dataframe.

In [149]:
# function to sort venue categories in descending order of frequency
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dt_grouped['Neighborhood']

for ind in np.arange(dt_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,Bakery,Restaurant,Burger Joint,Salad Place,Sushi Restaurant,Asian Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Cheese Shop,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Steakhouse,Café,French Restaurant
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Bar,Boutique,Boat or Ferry,Rental Car Location,Sculpture Garden,Airport
3,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Pub,Flower Shop,Italian Restaurant,Café,Bakery,Pizza Place,Park,Breakfast Spot
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Burger Joint,Ice Cream Shop,Café,Japanese Restaurant,Fried Chicken Joint,Salad Place,Spa
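
The suffix logic in the cell above is fine for ten columns, but it would label a 21st column "21th". If more columns are ever needed, a small helper along these lines could build the names instead. The function `ordinal` is a hypothetical addition, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"   # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```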


Let's run the k-means clustering.

In [150]:
# set number of clusters
kclusters = 5

dt_grouped_clustering = dt_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 4, 0, 0, 3, 2, 0, 0, 0])
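
The choice of five clusters is not justified in this notebook. One rough way to compare candidate values of k is to look at the within-cluster sum of squares (`inertia_`) for each, as sketched below on synthetic points. The data here is made up and has nothing to do with the venue frequencies; with fewer than twenty neighborhoods, any such comparison should be read with caution.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three centers (illustrative, not the venue data)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.randn(20, 2) for c in centers])

# Within-cluster sum of squares for a range of candidate k values
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The inertia drops sharply up to the true number of groups and flattens afterwards; that bend is the usual informal guide for picking k.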

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [151]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_merged = dt_data

# join the sorted venues (with cluster labels) onto dt_data, which holds latitude/longitude for each neighborhood
dt_merged = dt_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

dt_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1.0,Park,Playground,Trail,Deli / Bodega,Electronics Store,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,0.0,Coffee Shop,Restaurant,Pub,Flower Shop,Italian Restaurant,Café,Bakery,Pizza Place,Park,Breakfast Spot
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0.0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Gastropub,Men's Store,Mediterranean Restaurant,Hotel,Gym
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Pub,Café,Restaurant,Mexican Restaurant,Breakfast Spot,Theater,Ice Cream Shop
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Middle Eastern Restaurant,Fast Food Restaurant,Bubble Tea Shop,Diner,Bookstore,Pizza Place
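
The cluster labels above display as 0.0 and 1.0 instead of integers. In pandas this typically happens when a join leaves at least one row without a match, because the resulting NaN forces the column to float. Whether that is what happened here cannot be confirmed from the output shown. The sketch below reproduces the effect on made-up frames and shows one way to clean it up.

```python
import pandas as pd

base = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
labels = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster Labels": [0, 1]})

# "C" has no label, so the joined column holds NaN and becomes float
joined = base.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(joined["Cluster Labels"].dtype)

# Drop unlabeled rows, then restore an integer dtype
cleaned = joined.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
print(cleaned)
```

Integer labels are also safer if they are later used to index a list of marker colors on a map.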


Now let's examine our clusters.

In [165]:
dt_merged.loc[dt_merged['Cluster Labels'] == 0, dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,0.0,Coffee Shop,Restaurant,Pub,Flower Shop,Italian Restaurant,Café,Bakery,Pizza Place,Park,Breakfast Spot
2,Downtown Toronto,0.0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Gastropub,Men's Store,Mediterranean Restaurant,Hotel,Gym
3,Downtown Toronto,0.0,Coffee Shop,Park,Bakery,Pub,Café,Restaurant,Mexican Restaurant,Breakfast Spot,Theater,Ice Cream Shop
4,Downtown Toronto,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Middle Eastern Restaurant,Fast Food Restaurant,Bubble Tea Shop,Diner,Bookstore,Pizza Place
5,Downtown Toronto,0.0,Coffee Shop,Café,Restaurant,Hotel,Clothing Store,Italian Restaurant,American Restaurant,Breakfast Spot,Diner,Beer Bar
6,Downtown Toronto,0.0,Coffee Shop,Cocktail Bar,Cheese Shop,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Steakhouse,Café,French Restaurant
7,Downtown Toronto,0.0,Coffee Shop,Italian Restaurant,Sandwich Place,Burger Joint,Ice Cream Shop,Café,Japanese Restaurant,Fried Chicken Joint,Salad Place,Spa
8,Downtown Toronto,0.0,Coffee Shop,Café,Steakhouse,Thai Restaurant,Bakery,Restaurant,Burger Joint,Salad Place,Sushi Restaurant,Asian Restaurant
9,Downtown Toronto,0.0,Coffee Shop,Aquarium,Café,Hotel,Restaurant,Fried Chicken Joint,Brewery,Italian Restaurant,Scenic Lookout,Baseball Stadium
10,Downtown Toronto,0.0,Coffee Shop,Café,Hotel,Restaurant,Bar,Deli / Bodega,Gastropub,Seafood Restaurant,Bakery,American Restaurant


Looks like our first cluster is characterized by having many coffee shops and restaurants.

In [158]:
dt_merged.loc[dt_merged['Cluster Labels'] == 1, dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1.0,Park,Playground,Trail,Deli / Bodega,Electronics Store,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store


The second cluster has parks and playgrounds as its most common venues.

In [159]:
dt_merged.loc[dt_merged['Cluster Labels'] == 2, dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Downtown Toronto,2.0,Grocery Store,Café,Park,Baby Store,Coffee Shop,Convenience Store,Nightclub,Diner,Athletics & Sports,Restaurant


The third cluster looks like a more residential area, with grocery stores as the most common venue category.

In [160]:
dt_merged.loc[dt_merged['Cluster Labels'] == 3, dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Downtown Toronto,3.0,Café,Bookstore,Theater,Japanese Restaurant,Bar,Bakery,Restaurant,Sandwich Place,Italian Restaurant,Sushi Restaurant
13,Downtown Toronto,3.0,Café,Bar,Mexican Restaurant,Vietnamese Restaurant,Coffee Shop,Bakery,Chinese Restaurant,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Donut Shop


The fourth cluster resembles the first one, except that cafés, which usually sell food alongside coffee, rank ahead of coffee shops.  
One possible reading is that these neighborhoods have fewer residents or fewer office workers during the day, so there is less demand for quick coffee stops; the venue data alone cannot confirm this.

In [161]:
dt_merged.loc[dt_merged['Cluster Labels'] == 4, dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,4.0,Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Bar,Boutique,Boat or Ferry,Rental Car Location,Sculpture Garden,Airport


Our last cluster clearly shows a neighborhood that contains an airport.