# Segmenting and Clustering Neighborhoods in Toronto

## 1. Create a data frame of Toronto neighborhoods

Import the required libraries.

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

Load the Wikipedia page of Canadian postal codes. Then use the BeautifulSoup library to scrape the page and wrangle the data so that the individual table elements can be read into a data frame.

In [2]:
# load the wikipage of Canada postal codes
page=requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup=BeautifulSoup(page.content,'html.parser')
# Find the table containing postal codes within the HTML code
wikitable=soup.find('table', class_='wikitable sortable') 
tag_elements=wikitable.find_all('td')

First, read all the data from the table on the Wikipedia page into the data frame.

In [3]:
#Create a data frame of the desired format
df_can=pd.DataFrame(columns=['PostalCode','Borough','Neighborhood'])
n=int(len(tag_elements)/3)
for i in range(n):
    df_can.loc[i,'PostalCode']=tag_elements[3*i].get_text()
    df_can.loc[i,'Borough']=tag_elements[3*i+1].get_text()
    df_can.loc[i,'Neighborhood']=tag_elements[3*i+2].get_text().replace('\n','')   
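
The loop above relies on the cells arriving as one flat list, three per table row. A minimal sketch of that chunking on a plain Python list follows; `chunk_rows` is a hypothetical helper and the sample cell strings are invented for illustration, they are not taken from the scraped page.

```python
def chunk_rows(cells, width=3):
    """Group a flat list of cell strings into rows of `width` cells.

    Trailing cells that do not fill a whole row are dropped, which is
    the same bound as int(len(cells) / 3) in the loop above.
    """
    n_rows = len(cells) // width
    return [
        [cell.strip() for cell in cells[width * i: width * (i + 1)]]
        for i in range(n_rows)
    ]


# Invented sample cells, shaped like the text of scraped <td> elements.
sample_cells = [
    "M1A\n", "Not assigned\n", "Not assigned\n",
    "M3A\n", "North York\n", "Parkwoods\n",
]
rows = chunk_rows(sample_cells)
print(rows)
```

A list of rows like this can be passed to `pd.DataFrame(rows, columns=[...])` in one call, which avoids growing the data frame one cell at a time.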

Then remove those rows where Borough is not assigned.

In [4]:
#Ignore cells where Borough is not assigned
df_can=df_can[df_can['Borough']!='Not assigned'].reset_index(drop=True)

Merge neighborhoods that share a postal code into a single row.

In [5]:
#Merge several neighborhoods into one row within the same postal code
df_grp=df_can.groupby(['PostalCode','Borough'],as_index=False)['Neighborhood'].agg(','.join)
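
A minimal sketch of this merge on invented rows, using `groupby(...).agg(','.join)`, which in recent pandas versions returns the key columns and the joined column directly. The toy data below is an assumption for illustration only.

```python
import pandas as pd

# Toy rows (invented) with two neighborhoods sharing one postal code.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per (PostalCode, Borough); neighborhood names joined with commas.
merged = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
       .agg(",".join)
)
print(merged)
```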

Insert the Borough name into Neighborhood if Neighborhood is not assigned.

In [6]:
#Insert Borough name into Neighborhood if Neighborhood is 'Not assigned'
df_grp.loc[df_grp['Neighborhood']=='Not assigned','Neighborhood']=df_grp.loc[df_grp['Neighborhood']=='Not assigned','Borough']
df_grp

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
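
The 'Not assigned' replacement above uses two boolean-mask lookups. As a sketch of an equivalent formulation, `Series.where` keeps a value where a condition holds and takes a fallback elsewhere. The two rows below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# Keep the neighborhood where it is assigned; otherwise fall back to the borough.
toy["Neighborhood"] = toy["Neighborhood"].where(
    toy["Neighborhood"] != "Not assigned", toy["Borough"]
)
print(toy)
```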


Print the number of rows in the data frame.

In [7]:
print('Number of rows in the data frame: ',df_grp.shape[0])

Number of rows in the data frame:  103


## 2. Add latitude and longitude coordinates of each neighborhood

Read in the csv file containing the latitude and longitude values for the postal codes.

In [8]:
df_geo=pd.read_csv("Geospatial_Coordinates.csv")
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Rename the key column so that it has the same name in both data frames, which the join requires.
Then join the neighborhood data frame created earlier with this new data frame of location coordinates.
The join is made on the PostalCode column.

In [9]:
df_geo.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_grp=df_grp.join(df_geo.set_index('PostalCode'),on='PostalCode')
df_grp

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
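
A minimal sketch of the same key-based join written with `DataFrame.merge`, assuming each postal code appears at most once on each side. The `validate` argument makes pandas raise an error if that assumption does not hold. The two rows reuse coordinates shown in the output above; the frame names are hypothetical.

```python
import pandas as pd

hoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
# Deliberately in a different row order, to show the join matches on the key.
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

joined = hoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",              # keep every neighborhood row
    validate="one_to_one",   # fail loudly on duplicate postal codes
)
print(joined)
```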


## 3. Explore and cluster the neighborhoods in Toronto

In [10]:
df_toronto=df_grp[df_grp['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


In [11]:
df_toronto.shape

(38, 5)

Get latitude and longitude values of Toronto.

In [12]:
from geopy.geocoders import Nominatim
address='Toronto,ON'
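geolocator=Nominatim(user_agent="toronto_explorer") # recent geopy versions require a user_agent; this value is an arbitrary placeholder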
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print('The geographical coordinates of Toronto are {}, {}. '.format(latitude,longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207. 


Visualise Toronto and the neighborhoods in it.

In [13]:
import folium
map_toronto=folium.Map(location=[latitude,longitude],zoom_start=11)
for lat, lng, label in zip(df_toronto['Latitude'],df_toronto['Longitude'],df_toronto['Neighborhood']):
    label=folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

Use the Foursquare API to explore the neighborhoods and segment them.

The cells that define the Foursquare API credentials have been removed, because the credentials are secret.

Copy the getNearbyVenues function from the New York lab.

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
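
The list comprehension inside the function flattens each nested item of the response into one tuple per venue. A sketch of just that step on a hand-written payload follows; `flatten_items` is a hypothetical helper, and the sample item is invented so that its nesting mirrors only the keys the function above reads. It is not a real API response.

```python
def flatten_items(name, lat, lng, items):
    """Turn a list of response items into flat (neighborhood, venue) tuples."""
    return [
        (
            name,
            lat,
            lng,
            item["venue"]["name"],
            item["venue"]["location"]["lat"],
            item["venue"]["location"]["lng"],
            item["venue"]["categories"][0]["name"],  # first category only
        )
        for item in items
    ]


# Invented payload with the same nesting as the keys accessed above.
sample_items = [
    {
        "venue": {
            "name": "Starbucks",
            "location": {"lat": 43.678798, "lng": -79.298045},
            "categories": [{"name": "Coffee Shop"}],
        }
    },
]
rows = flatten_items("The Beaches", 43.676357, -79.293031, sample_items)
print(rows)
```

Note that indexing `categories[0]` raises an `IndexError` for an item whose category list is empty, so a production version would probably want to guard against that case.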

Create a new data frame called toronto_venues using the getNearbyVenues function.
This retrieves at most 100 venues within a radius of 500 meters of each neighborhood in Toronto.

In [18]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude'],
                                   radius=radius
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvall

Let's check the size of the resulting data frame.

In [19]:
print(toronto_venues.shape)
toronto_venues.head()

(1709, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
1,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
2,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Let's check how many venues were returned for each neighborhood.

In [20]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"Brockton,Exhibition Place,Parkdale Village",20,20,20,20,20,20
Business Reply Mail Processing Centre 969 Eastern,17,17,17,17,17,17
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",13,13,13,13,13,13
"Cabbagetown,St. James Town",43,43,43,43,43,43
Central Bay Street,84,84,84,84,84,84
"Chinatown,Grange Park,Kensington Market",100,100,100,100,100,100
Christie,16,16,16,16,16,16
Church and Wellesley,89,89,89,89,89,89


Let's also check how many unique categories appear across all the results.

In [21]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 240 unique categories.


Create a new data frame in which each category has its own column, using one-hot encoding.

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#Let's check the shape of dataframe
toronto_onehot.shape

(1709, 240)
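
One detail worth checking here: the venue table above contains a venue whose category is itself called 'Neighborhood', and the one-hot frame has 240 columns for 240 unique categories. That suggests the assignment to `toronto_onehot['Neighborhood']` replaced the dummy column of the same name instead of adding a new column. The sketch below reproduces the effect on invented rows and shows one possible way around it; the column name 'Neighborhood Name' is a hypothetical choice.

```python
import pandas as pd

# Invented rows; one venue has the category 'Neighborhood'.
venues = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches", "Christie"],
    "Venue Category": ["Pub", "Neighborhood", "Bar"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
columns_before = list(onehot.columns)

# Assigning to an existing column name overwrites that column in place:
# the column count and the column order do not change.
onehot["Neighborhood"] = venues["Neighborhood"]
columns_after = list(onehot.columns)
print(columns_before, columns_after)

# Possible workaround: store the label under a name no category uses.
safe = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
safe.insert(0, "Neighborhood Name", venues["Neighborhood"])
print(list(safe.columns))
```

In the overwritten version the 'Neighborhood' venue category is lost, and the label column is no longer the last column, so reordering by `columns[-1]` moves a different column to the front.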

Next, group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category.

In [24]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide,King,Richmond",0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0,0.0,0.076923,0.076923,0.076923,0.153846,0.153846,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
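
The mean of a 0/1 column within a group is the share of that group's venues that fall in the category, so each row of the grouped frame sums to 1. A small sketch on invented data:

```python
import pandas as pd

# Invented one-hot rows: four venues in neighborhood A, two in B.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Coffee Shop":  [1, 1, 0, 0, 0, 1],
    "Park":         [0, 0, 1, 0, 1, 0],
    "Pub":          [0, 0, 0, 1, 0, 0],
})

grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)

# Each neighborhood's category frequencies add up to 1.
row_sums = grouped.drop(columns="Neighborhood").sum(axis=1)
print(row_sums.tolist())
```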


In [25]:
#Let's check the shape of dataframe
toronto_grouped.shape

(38, 240)

Copy the function return_most_common_venues() from the New York lab. It sorts the venue categories in descending order of frequency and is used in the next step.

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)   
    return row_categories_sorted.index.values[0:num_top_venues]
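
To see what the function returns, here is a usage sketch on a single invented row. The first entry of the row is the neighborhood name, which `row.iloc[1:]` skips before sorting.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # drop the leading 'Neighborhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Invented frequencies for one neighborhood.
row = pd.Series({"Neighborhood": "A", "Coffee Shop": 0.5, "Park": 0.3, "Pub": 0.2})
top2 = return_most_common_venues(row, 2)
print(list(top2))
```

Because the function skips exactly one leading entry, it assumes the neighborhood name is the first column of the frame it is applied to.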

Now let's create a new dataframe and display the top 10 venues for each neighborhood.

In [27]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix beyond the third entry
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Bar,Café,Thai Restaurant,Steakhouse,Sushi Restaurant,Restaurant,Burger Joint,Bakery,American Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Café,Farmers Market,Bakery,Steakhouse,Pub,Seafood Restaurant,Cheese Shop
2,"Brockton,Exhibition Place,Parkdale Village",Breakfast Spot,Coffee Shop,Café,Gym,Furniture / Home Store,Falafel Restaurant,Bar,Stadium,Burrito Place,Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Garden,Garden Center,Auto Workshop,Recording Studio,Skate Park,Brewery,Fast Food Restaurant,Spa,Burrito Place
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Terminal,Airport Service,Airport Lounge,Airport Gate,Sculpture Garden,Plane,Harbor / Marina,Airport Food Court,Airport,Boat or Ferry
5,"Cabbagetown,St. James Town",Coffee Shop,Restaurant,Café,Pizza Place,Italian Restaurant,Pub,Bakery,Pharmacy,Playground,Deli / Bodega
6,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Bubble Tea Shop,Burger Joint,Bar,Café,Thai Restaurant,Salad Place,Spa
7,"Chinatown,Grange Park,Kensington Market",Café,Vietnamese Restaurant,Bar,Vegetarian / Vegan Restaurant,Coffee Shop,Bakery,Chinese Restaurant,Mexican Restaurant,Dessert Shop,Dim Sum Restaurant
8,Christie,Café,Grocery Store,Park,Restaurant,Italian Restaurant,Baby Store,Diner,Nightclub,Coffee Shop,Convenience Store
9,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Burger Joint,Pub,Men's Store,Mediterranean Restaurant,Gym
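
The column labels above come from a three-element suffix list with a fallback to 'th', which is correct for 1 to 10. As a sketch, a hypothetical `ordinal` helper that also handles larger values such as 11, 12, 13 and 21 could look like this:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```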


Next, cluster the neighborhoods. This is done with k-means clustering, by clustering the neighborhoods into 5 clusters.

In [28]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [29]:
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       4, 0, 3, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
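
For intuition about what the clustering step does, here is a bare-bones sketch of the assign-and-update iteration behind k-means (Lloyd's algorithm), written with NumPy only. It is not scikit-learn's implementation: it starts from fixed, hand-picked centroids and runs a fixed number of iterations, and `lloyd_kmeans` and the toy frequency vectors are invented for illustration.

```python
import numpy as np


def lloyd_kmeans(points, init_centroids, n_iter=10):
    """Plain Lloyd iteration: assign points to the nearest centroid, then re-average."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, n_clusters).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Invented category-frequency vectors for four neighborhoods, two categories.
freqs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels, centroids = lloyd_kmeans(freqs, init_centroids=freqs[[0, 2]])
print(labels, centroids)
```

The returned labels are in the same order as the input rows, which matters when they are attached to another data frame later.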

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [30]:
toronto_merged = df_toronto.copy() # copy so that df_toronto itself is not modified
# add clustering labels (kmeans.labels_ follows the row order of toronto_grouped)
toronto_merged['Cluster Labels'] = kmeans.labels_
# join the top-10 venue columns onto the Toronto data by neighborhood name
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head() 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Pub,Health Food Store,Coffee Shop,Wings Joint,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Indian Restaurant,Bakery,Sports Bar,Japanese Restaurant,Spa,Juice Bar
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,0,Sandwich Place,Coffee Shop,Ice Cream Shop,Liquor Store,Light Rail Station,Burger Joint,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Steakhouse
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Italian Restaurant,Gastropub,American Restaurant,Bakery,Brewery,Bank,Bar,Fish Market
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Bus Line,Park,Swim School,Wings Joint,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
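
`kmeans.labels_` has one entry per row of the frame passed to `fit`, so the labels follow the row order of `toronto_grouped`, which `groupby` sorts by neighborhood name. `df_toronto` is ordered by postal code, so assigning the labels by position pairs them with the right rows only if the two orders happen to coincide. A sketch of attaching labels by neighborhood name instead, on invented two-row frames with hypothetical names:

```python
import pandas as pd

# by_code is in postal-code order, by_name in alphabetical order (illustrative data).
by_code = pd.DataFrame({
    "PostalCode": ["M4E", "M5G"],
    "Neighborhood": ["The Beaches", "Central Bay Street"],
})
by_name = pd.DataFrame({"Neighborhood": ["Central Bay Street", "The Beaches"]})

# One label per row of by_name, as a clustering step fitted on it would return.
labels = [1, 0]
by_name = by_name.assign(**{"Cluster Labels": labels})

# Matching on the key keeps each label with its own neighborhood.
merged = by_code.merge(by_name, on="Neighborhood", how="left")
print(merged)
```

With a positional assignment, 'The Beaches' would have received label 1 in this toy case; the key-based merge gives it label 0.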


Let's visualize the resulting clusters.

In [31]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)      
map_clusters

Next, each cluster is examined to determine the discriminating venue categories that distinguish each cluster. 

### Cluster 1

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Pub,Health Food Store,Coffee Shop,Wings Joint,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
1,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Indian Restaurant,Bakery,Sports Bar,Japanese Restaurant,Spa,Juice Bar
2,East Toronto,0,Sandwich Place,Coffee Shop,Ice Cream Shop,Liquor Store,Light Rail Station,Burger Joint,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Steakhouse
3,East Toronto,0,Café,Coffee Shop,Italian Restaurant,Gastropub,American Restaurant,Bakery,Brewery,Bank,Bar,Fish Market
4,Central Toronto,0,Bus Line,Park,Swim School,Wings Joint,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
5,Central Toronto,0,Pizza Place,Asian Restaurant,Hotel,Food & Drink Shop,Dance Studio,Park,Clothing Store,Sandwich Place,Burger Joint,Breakfast Spot
6,Central Toronto,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Gift Shop,Fast Food Restaurant,Diner,Metro Station,Mexican Restaurant,Dessert Shop,Park
7,Central Toronto,0,Sandwich Place,Pizza Place,Dessert Shop,Coffee Shop,Italian Restaurant,Café,Sushi Restaurant,Restaurant,Gourmet Shop,Indian Restaurant
8,Central Toronto,0,Restaurant,Gym,Playground,Tennis Court,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
9,Central Toronto,0,Coffee Shop,Pub,Medical Center,American Restaurant,Sushi Restaurant,Bagel Shop,Fried Chicken Joint,Sports Bar,Supermarket,Convenience Store


### Cluster 2

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Downtown Toronto,1,Coffee Shop,Italian Restaurant,Sandwich Place,Bubble Tea Shop,Burger Joint,Bar,Café,Thai Restaurant,Salad Place,Spa
27,Downtown Toronto,1,Airport Terminal,Airport Service,Airport Lounge,Airport Gate,Sculpture Garden,Plane,Harbor / Marina,Airport Food Court,Airport,Boat or Ferry


### Cluster 3

In [34]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,Downtown Toronto,2,Coffee Shop,Restaurant,Café,Seafood Restaurant,Pub,Fast Food Restaurant,Cocktail Bar,Hotel,Creperie,Park


### Cluster 4

In [35]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
24,Central Toronto,3,Coffee Shop,Sandwich Place,Café,Pizza Place,American Restaurant,Liquor Store,Burger Joint,Jewish Restaurant,BBQ Joint,Pub


### Cluster 5

In [36]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,4,Garden,Filipino Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


## 4. Analyse the clusters and name them

Select the cluster label and the three most common venue categories for each neighborhood. These are the values that are counted per cluster in the next step.

In [111]:
toronto_categories=toronto_merged.loc[:,toronto_merged.columns[[1]+list(range(5,9))]]
toronto_categories.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,East Toronto,0,Pub,Health Food Store,Coffee Shop
1,East Toronto,0,Greek Restaurant,Coffee Shop,Ice Cream Shop
2,East Toronto,0,Sandwich Place,Coffee Shop,Ice Cream Shop
3,East Toronto,0,Café,Coffee Shop,Italian Restaurant
4,Central Toronto,0,Bus Line,Park,Swim School


Create new cluster labels

In [116]:
frequency=[]
category=[]
toronto_merged['Cluster Category']=None # object column, filled with a category name per cluster

In [117]:
for i in range(kclusters):
    counts=toronto_categories.loc[toronto_categories['Cluster Labels']==i,'1st Most Common Venue':'3rd Most Common Venue'].apply(pd.Series.value_counts)
    # Calculate a total column, giving the total count for each category (per cluster)
    counts['Total']=counts.sum(axis=1).astype("int")
    # The maximum number of occurrences of a single category in this cluster
    frequency.append(counts['Total'].max()) # add this to frequency list
    # The most common category for this cluster
    category.append(counts['Total'].idxmax()) # add this to category list
    print( "In cluster ", i+1, "the most common category is: ", category[i], "with frequency count: ", frequency[i])
    toronto_merged.loc[toronto_merged['Cluster Labels']==i, 'Cluster Category']=category[i]

In cluster  1 the most common category is:  Coffee Shop with frequency count:  20
In cluster  2 the most common category is:  Airport Lounge with frequency count:  1
In cluster  3 the most common category is:  Café with frequency count:  1
In cluster  4 the most common category is:  Café with frequency count:  1
In cluster  5 the most common category is:  Farmers Market with frequency count:  1
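
A frequency count of 1 means that every category in that cluster is tied, and `idxmax` returns the first of the tied labels, so the reported category for such clusters is not really a "most common" one. The counting step itself can be sketched more compactly by stacking the three columns into one Series before counting; the rows below are invented for illustration.

```python
import pandas as pd

# Invented top-3 venue categories for three neighborhoods in one cluster.
top3 = pd.DataFrame({
    "1st Most Common Venue": ["Pub", "Greek Restaurant", "Café"],
    "2nd Most Common Venue": ["Coffee Shop", "Coffee Shop", "Coffee Shop"],
    "3rd Most Common Venue": ["Café", "Ice Cream Shop", "Pub"],
})

# stack() flattens the three columns into one Series of category names,
# so a single value_counts() gives the total per category.
totals = top3.stack().value_counts()
most_common = totals.idxmax()
print(most_common, int(totals.max()))
```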


Cluster 1 can be named after its most frequent category, and cluster 2 after the category that sets it apart from the other clusters.
* Cluster 1: Coffee Shop
* Cluster 2: Airport

In clusters 3, 4 and 5 no single category has a high frequency.
For this reason, let's name each of them after its highest-ranked venue category that is not already used for another cluster.

That means the '2nd Most Common Venue' for clusters 3 and 4, and the '1st Most Common Venue' for cluster 5.
* Cluster 3: Restaurant
* Cluster 4: Sandwich Place
* Cluster 5: Garden