# __Neighbourhood segmentation and clustering in Toronto__   

# 1. Get a Dataframe from a table
## _1.1. Creating a Dataframe from Toronto Post Codes table in Wikipedia_
__We use the BeautifulSoup library__

First, install (if needed) and import the dependencies:

In [1]:
#!pip install beautifulsoup4
#!pip install lxml
#!pip show beautifulsoup4

In [2]:
from bs4 import BeautifulSoup
import requests
import lxml
#import html5lib
import pandas as pd
pd.set_option('display.precision', 8)  # full option name; the bare 'precision' key is ambiguous in newer pandas

__The site we want to parse is [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)__  

In [3]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
# Getting webpage to 'requests' object
raw_page = requests.get(wikipedia_link)

In [5]:
# Initialization of 'BeautifulSoup' object with parser 'lxml'
# and passing the text from 'requests' object as input
soup = BeautifulSoup(raw_page.text,'lxml')

In [6]:
toronto = soup.table
#print(toronto.prettify())

The structure of the table is as follows:

< tr >    
< td >  
  Postcode  
 < /td >  
 < td >  
  Borough  
 < /td >  
 < td >  
  Neighbourhood  
 < /td >  
< /tr >  

Now we can create a separate list for each column and populate the lists in a loop:

In [7]:
postcode = []
borough = []
neighbourhood = []

for row in toronto.find_all('tr'):
    cells = row.find_all('td')
    # skip rows that lack three <td> cells, such as the first (header) row
    if len(cells) < 3:
        continue
    postcode.append(cells[0].text)
    borough.append(cells[1].text)
    neighbourhood.append(cells[2].text)

In [8]:
#Check the equality of arrays' lengths

print('postcode: ',len(postcode))
print('borough: ',len(borough))
print('neighbourhood: ',len(neighbourhood))

postcode:  288
borough:  288
neighbourhood:  288
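
The row-by-row extraction above can be tried offline on a tiny table. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and the class name `TableRows` and the inline `sample_html` are assumptions made for the example, not part of the notebook.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished three-cell rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # strip the trailing newline that the last cell of a row carries
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            # keep only rows with exactly three <td> cells (the header has none)
            if len(self._row) == 3:
                self.rows.append(tuple(self._row))
            self._row = None


# hypothetical miniature of the Wikipedia table
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
sample_rows = parser.rows
print(sample_rows)
```

Checking the cell count explicitly, rather than catching every exception, keeps the three lists the same length even if a row is malformed.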


Now we can create and preview Dataframe:

In [9]:
df = pd.DataFrame({'Postcode':postcode,'Borough':borough,'Neighbourhood':neighbourhood})
df['Neighbourhood'] = df['Neighbourhood'].str.replace('\n', '', regex=False)  # drop trailing newlines
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## _1.2. Cleaning the table_

__a) Ignore cells with a borough that is Not assigned:__

In [10]:
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


__b) Some postcodes have several neighbourhoods:__

In [11]:
print(len(df['Postcode'].unique()))
print(len(df['Neighbourhood'].unique()))

103
209


Let's combine them:

In [12]:
df = df.groupby('Postcode').agg(
    {'Borough': lambda x: list(x)[0],
     'Neighbourhood': lambda x:', '.join(map(str, list(x)))}).reset_index()

In [13]:
print(len(df['Postcode'].unique()))
print(len(df['Neighbourhood'].unique()))
df.head()

103
103


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
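
As a small illustration of the combining step, the sketch below applies the same kind of `groupby` aggregation to three rows taken from the table in 1.2 a). The frame name `toy` is an assumption for the example, and `'first'` with `', '.join` is just one equivalent way to write the two lambdas.

```python
import pandas as pd

# three rows copied from the cleaned table: M5A appears twice
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# one row per postcode: keep the first borough, join the neighbourhood names
combined = (toy.groupby('Postcode', as_index=False)
               .agg({'Borough': 'first', 'Neighbourhood': ', '.join}))
print(combined)
```

After the aggregation each postcode occurs once, which is why the two unique counts printed in the notebook become equal.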


__c) Let's find 'Not assigned' neighbourhoods and assign them the borough name:__

In [14]:
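# where the neighbourhood is 'Not assigned', take the borough name from the same row instead
df['Neighbourhood'] = df['Neighbourhood'].where(df['Neighbourhood'] != 'Not assigned', df['Borough'])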

In [15]:
df.shape

(103, 3)

# 2. Getting coordinates from address (geocoding)

In [16]:
!pip install geocoder
#!pip install geopy
#from geopy.geocoders import Nominatim # import geocoder
from map_api import mapquest #import credentials from config file



In [17]:
import geocoder
import numpy as np

__Add and initialize two new columns__

In [18]:
df['Latitude'] = np.nan   # real missing values instead of the string 'NA'
df['Longitude'] = np.nan

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",,
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",,
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


__Extracting locations via Mapquest API__

In [19]:
mq = mapquest()
#mq.secret

In [20]:
for row in df.index:
    postcode = df.loc[row, 'Postcode']
    location = geocoder.mapquest('{}, Toronto, Ontario'.format(postcode), key=mq.key, maxRows=5)

    # assign with a single .loc call; chained df.loc[row]['Latitude'] may write to a copy
    if location.ok:
        df.loc[row, 'Latitude'] = np.mean(np.asarray(location.lat))
        df.loc[row, 'Longitude'] = np.mean(np.asarray(location.lng))

In [21]:
location = geocoder.mapquest('{}, Toronto, Ontario'.format('M1H'),key=mq.key,maxRows=5)
location.lat

43.651893

In [22]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.81302,-79.2432
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.79388,-79.12455
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76812,-79.19745
3,M1G,Scarborough,Woburn,43.651893,-79.381713
4,M1H,Scarborough,Cedarbrae,43.651893,-79.381713
5,M1J,Scarborough,Scarborough Village,43.651893,-79.381713
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.7435,-79.26414
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.651893,-79.381713
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.71709,-79.24936
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.651893,-79.381713
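
Several rows in the output above share exactly the same coordinates (43.651893, -79.381713), the same pair the notebook obtains later for 'Toronto, Ontario' itself. A quick way to flag such rows might look like the sketch below; the reading that the geocoder fell back to a city-level match is an assumption, and the frame `geocoded` holds just four rows copied from the output.

```python
import numpy as np
import pandas as pd

# coordinates the notebook gets for 'Toronto, Ontario'
city_lat, city_lng = 43.651893, -79.381713

# four rows copied from the geocoded table above
geocoded = pd.DataFrame({
    'Postcode':  ['M1B', 'M1G', 'M1H', 'M1K'],
    'Latitude':  [43.81302, 43.651893, 43.651893, 43.7435],
    'Longitude': [-79.2432, -79.381713, -79.381713, -79.26414],
})

# a row is suspicious when both coordinates match the city-level pair
is_fallback = (np.isclose(geocoded['Latitude'], city_lat)
               & np.isclose(geocoded['Longitude'], city_lng))
fallback_codes = geocoded.loc[is_fallback, 'Postcode'].tolist()
print(fallback_codes)
```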


__Getting coordinates from .csv file__  
Several postcodes above received identical coordinates, so the Mapquest results are not precise enough here. Let's use a ready-made file of coordinates instead.

In [23]:
geo = pd.read_csv('Geospatial_Coordinates.csv')
geo.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.8066863,-79.1943534
1,M1C,43.7845351,-79.1604971
2,M1E,43.7635726,-79.1887115
3,M1G,43.7709921,-79.2169174
4,M1H,43.773136,-79.2394761


In [24]:
df1 = df[['Postcode','Borough','Neighbourhood']].join(geo.set_index('Postal Code'), on='Postcode',)
df1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.8066863,-79.1943534
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7845351,-79.1604971
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7635726,-79.1887115
3,M1G,Scarborough,Woburn,43.7709921,-79.2169174
4,M1H,Scarborough,Cedarbrae,43.773136,-79.2394761
5,M1J,Scarborough,Scarborough Village,43.7447342,-79.2394761
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.7279292,-79.2620294
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.7111117,-79.2845772
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.2394761
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.2648481
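
When joining on postcodes it can be useful to confirm that every postcode actually found coordinates. The sketch below shows one possible check with `DataFrame.merge` and `indicator=True`; the frames `codes` and `geo_toy` are toy data, and the postcode 'M9Z' is a made-up value added only to produce an unmatched row.

```python
import pandas as pd

codes = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})  # 'M9Z' is hypothetical
geo_toy = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.8066863, 43.7845351],
    'Longitude': [-79.1943534, -79.1604971],
})

# a left merge keeps every postcode; '_merge' records where each row was found
merged = codes.merge(geo_toy, how='left',
                     left_on='Postcode', right_on='Postal Code',
                     indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that no row ends up with missing latitude or longitude.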


# 3. Neighbourhood analysis in Toronto

## _3.1. Visualizing neighbourhoods on map_ 

In [25]:
# Find Toronto location
location = geocoder.mapquest('Toronto, Ontario',key=mq.key)
print(location.lat,location.lng)

43.651893 -79.381713


In [26]:
import folium

__A map of Toronto with neighbourhoods on it:__  

In [27]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[location.lat, location.lng], zoom_start=10,width='50%')

# add markers to map
for lat, lng, borough, neighborhood in zip(df1['Latitude'], df1['Longitude'], df1['Borough'], df1['Neighbourhood']):
    label = '{} ({})'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

__Let's work with only boroughs that contain the word Toronto__

In [28]:
toronto_data = df1[df1['Borough'].str.match('.*Toronto.*')].reset_index(drop=True)
print(toronto_data.shape)
toronto_data.head()

(38, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.6763574,-79.2930312
1,M4K,East Toronto,"The Danforth West, Riverdale",43.6795571,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.6689985,-79.3155716
3,M4M,East Toronto,Studio District,43.6595255,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.7280205,-79.3887901


__The map below shows neighbourhoods only within "...Toronto" boroughs:__  

In [29]:
map_toronto = folium.Map(location=[location.lat, location.lng], zoom_start=12,width='50%')

for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{} ({})'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## _3.2. Obtaining neighbourhood profiles with Foursquare data_  

In [30]:
#get API credentials
from map_api import foursquare
fs = foursquare()

__We define a function returning the top N venues that are within a radius of R meters of each neighbourhood.__  
The function takes neighbourhood names and coordinates and returns a dataframe.

In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius_=500,limit_=100):
    '''Return a DataFrame with nearby venues for the given neighbourhoods.'''
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            fs.id, 
            fs.secret, 
            fs.version, 
            lat, 
            lng, 
            radius_,   # use the function arguments, not the global variables
            limit_)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]) # use only 1st category if more than 1

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list]) 
    # nested list comprehension
    
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
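
The request URL in the function is assembled with a long format string. A possible alternative, shown here only as a sketch, is to build the query string from a dictionary with `urllib.parse.urlencode`. The helper name `build_explore_url`, the placeholder credentials and the version string '20180605' are assumptions for illustration; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Return the explore URL for one point (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),   # latitude,longitude as one value
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value, so the comma in 'll' becomes %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# placeholder credentials, not real ones
url = build_explore_url('ID', 'SECRET', '20180605',
                        43.6763574, -79.2930312, radius=300)
print(url)
```

Keeping the parameters in a dictionary makes it harder to pass the wrong variable for `radius` or `limit`.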

__We apply the above-defined function to toronto_data dataframe__  

We get a table of up to 100 venues within 500 m of each neighbourhood.  
A smaller radius such as 300 m might suit Toronto better, since 500 m seems rather far, but the run below uses 500 m.

In [32]:
radius = 500 # radius to explore
limit = 100
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'],
                                   radius_=radius,
                                   limit_=limit
                                  )
print(toronto_venues.shape)
toronto_venues.head()

(1702, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.6763574,-79.2930312,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
1,The Beaches,43.6763574,-79.2930312,Grover Pub and Grub,43.67918143,-79.29721536,Pub
2,The Beaches,43.6763574,-79.2930312,Starbucks,43.67879837,-79.29804498,Coffee Shop
3,The Beaches,43.6763574,-79.2930312,Upper Beaches,43.68056321,-79.29286887,Neighborhood
4,"The Danforth West, Riverdale",43.6795571,-79.352188,Pantheon,43.67762124,-79.3514339,Greek Restaurant


Let's count venues in neighbourhoods:

In [33]:
count = toronto_venues.groupby('Neighbourhood').count().reset_index()
count = count[['Neighbourhood','Neighbourhood Latitude']]
count.columns=['Neighbourhood','Venue Count']
count

Unnamed: 0,Neighbourhood,Venue Count
0,"Adelaide, King, Richmond",100
1,Berczy Park,56
2,"Brockton, Exhibition Place, Parkdale Village",22
3,Business Reply Mail Processing Centre 969 Eastern,17
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",15
5,"Cabbagetown, St. James Town",48
6,Central Bay Street,86
7,"Chinatown, Grange Park, Kensington Market",100
8,Christie,16
9,Church and Wellesley,89


In [34]:
pd.set_option("display.max_rows",60)

In [35]:
print('There are {} unique categories in "..Toronto" boroughs within {} meters of the neighbourhood centres.'.format(len(toronto_venues['Venue Category'].unique()), radius))

There are 233 unique categories in "..Toronto" boroughs within 500 meters of the neighbourhood centres.


## _Get a profile for each neighbourhood_  

__We can find the 5 most frequent venue categories in each neighbourhood and treat them as its fingerprint__  

Let's start with one-hot encoding of venue categories  

In [36]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


__Compute the mean frequency of each venue category in every neighbourhood:__

In [37]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01785714,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04545455
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05882353
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.06666667,0.06666667,0.06666667,0.13333333,0.2,0.13333333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
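
To make the meaning of these numbers concrete, here is a toy version of the same two steps (one-hot encoding, then the mean per neighbourhood). The neighbourhood names 'A' and 'B' and the six venues are invented for the example.

```python
import pandas as pd

# six invented venues in two invented neighbourhoods
venues = pd.DataFrame({
    'Neighbourhood':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Pub', 'Park', 'Park', 'Park'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# the mean of a 0/1 column is the share of venues in that category
profile = onehot.groupby('Neighbourhood').mean()
print(profile)
```

Each row of `profile` sums to 1, so a value such as 0.5 for 'Coffee Shop' in 'A' reads as "half of the venues found there are coffee shops".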


__This function sorts the venues in descending order and returns the list of top-n venues:__

In [38]:
def get_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [39]:
num_top_venues = 5
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Frequent Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no special suffix beyond 'rd'
        columns.append('{}th Most Frequent Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

# populate this dataframe row by row with the most frequent venue categories
for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = get_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
0,"Adelaide, King, Richmond",Coffee Shop,Thai Restaurant,Café,Steakhouse,American Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Steakhouse
2,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Coffee Shop,Yoga Studio,Bar
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Garden Center,Brewery,Farmers Market
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boat or Ferry,Sculpture Garden
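
The column labels above rely on a three-item suffix list, which is enough for five columns. If more columns were needed, a general ordinal helper could be used instead; the function name `ordinal` below is hypothetical and the sketch is not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top = 5
labels = ['{} Most Frequent Venue'.format(ordinal(i + 1)) for i in range(num_top)]
print(labels)
```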


## _3.3. Clustering neighbourhoods in Toronto_  

## __Create clusters__ 

In [40]:
from sklearn.cluster import KMeans

__Prepare the data for clustering: drop names of neighbourhoods__

In [41]:
toronto_for_clustering = toronto_grouped.drop('Neighbourhood', axis=1)
toronto_for_clustering.head()

Unnamed: 0,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01785714,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04545455
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05882353
4,0.0,0.0,0.06666667,0.06666667,0.06666667,0.13333333,0.2,0.13333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


__Find clusters and get labels__

In [42]:
k = 3 #number of clusters for k-means

clusters = KMeans(n_clusters=k, random_state=0)
clusters.fit(toronto_for_clustering)
clusters.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
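
A quick look at the cluster sizes helps when judging the result. The sketch below re-enters the label array printed above by hand (written compactly) and counts the members of each cluster with `numpy.bincount`.

```python
import numpy as np

# the 38 labels printed above: 22 ones, then 0 1 1 1 1 0 2, then 9 ones
labels = np.array([1] * 22 + [0, 1, 1, 1, 1, 0, 2] + [1] * 9)

# sizes[i] is the number of neighbourhoods with cluster label i
sizes = np.bincount(labels)
print(sizes.tolist())
```

With these labels one cluster holds 35 of the 38 neighbourhoods, so the split is very uneven.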

__Add labels of clusters to the table with top-5 venues__

In [43]:
#neighbourhoods_venues_sorted.drop('Cluster Labels',axis=1,inplace=True)
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', clusters.labels_)

In [44]:
neighbourhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighbourhood,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
0,1,"Adelaide, King, Richmond",Coffee Shop,Thai Restaurant,Café,Steakhouse,American Restaurant
1,1,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Cheese Shop,Steakhouse
2,1,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Coffee Shop,Yoga Studio,Bar
3,1,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Garden Center,Brewery,Farmers Market
4,1,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boat or Ferry,Sculpture Garden


__Create merged table__  
with boroughs, neighbourhoods, coordinates, cluster labels and 5 most frequent types of venues  

In [45]:
toronto_merged = toronto_data[['Borough','Neighbourhood','Latitude','Longitude']]
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
0,East Toronto,The Beaches,43.6763574,-79.2930312,1,Health Food Store,Coffee Shop,Pub,Neighborhood,Dessert Shop
1,East Toronto,"The Danforth West, Riverdale",43.6795571,-79.352188,1,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore
2,East Toronto,"The Beaches West, India Bazaar",43.6689985,-79.3155716,1,Pub,Fast Food Restaurant,Steakhouse,Ice Cream Shop,Burrito Place
3,East Toronto,Studio District,43.6595255,-79.340923,1,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant
4,Central Toronto,Lawrence Park,43.7280205,-79.3887901,0,Bus Line,Park,Swim School,Yoga Studio,Falafel Restaurant


__Show clusters on map__

In [46]:
# create map
cluster_map = folium.Map(location=[location.lat, location.lng], zoom_start=12,width='50%')

# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors

x = np.arange(k) # k is the number of clusters
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster ' + str(cluster) +')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],  # same colour for border and fill
        fill_opacity=0.5).add_to(cluster_map)
       
cluster_map

## _3.4. Explore clusters one by one_  

### __Cluster 1__

In [47]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
4,Lawrence Park,Bus Line,Park,Swim School,Yoga Studio,Falafel Restaurant
10,Rosedale,Park,Playground,Trail,Yoga Studio,Dessert Shop


___Let's call it 'Parks and Health'___

### __Cluster 2__

In [48]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
0,The Beaches,Health Food Store,Coffee Shop,Pub,Neighborhood,Dessert Shop
1,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore
2,"The Beaches West, India Bazaar",Pub,Fast Food Restaurant,Steakhouse,Ice Cream Shop,Burrito Place
3,Studio District,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant
5,Davisville North,Gym,Food & Drink Shop,Park,Breakfast Spot,Clothing Store
6,North Toronto West,Clothing Store,Coffee Shop,Sporting Goods Shop,Yoga Studio,Bagel Shop
7,Davisville,Sandwich Place,Dessert Shop,Café,Sushi Restaurant,Coffee Shop
8,"Moore Park, Summerhill East",Restaurant,Gym,Playground,Department Store,Event Space
9,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Pub,Fried Chicken Joint,American Restaurant,Sushi Restaurant
11,"Cabbagetown, St. James Town",Coffee Shop,Park,Restaurant,Café,Bakery


___Best description is 'Coffee and Café'___  

### __Cluster 3__

In [49]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue
22,Roselawn,Garden,Yoga Studio,Falafel Restaurant,Event Space,Ethiopian Restaurant


___'Garden'___

## _3.5. Now we can rename clusters in merged table..._  

In [50]:
toronto_merged['Cluster names'] = toronto_merged['Cluster Labels'].replace(to_replace={0:'Parks and Health',1:'Coffee and Café',2:'Garden'})
toronto_merged.head(2)

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Frequent Venue,2nd Most Frequent Venue,3rd Most Frequent Venue,4th Most Frequent Venue,5th Most Frequent Venue,Cluster names
0,East Toronto,The Beaches,43.6763574,-79.2930312,1,Health Food Store,Coffee Shop,Pub,Neighborhood,Dessert Shop,Coffee and Café
1,East Toronto,"The Danforth West, Riverdale",43.6795571,-79.352188,1,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Bookstore,Coffee and Café


### _and update the map_  

In [51]:
# create map
cluster_map = folium.Map(location=[location.lat, location.lng], zoom_start=12,width='50%')

# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors

x = np.arange(k) # k is the number of clusters
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, name in zip(
    toronto_merged['Latitude'],
    toronto_merged['Longitude'],
    toronto_merged['Neighbourhood'],
    toronto_merged['Cluster Labels'],
    toronto_merged['Cluster names']):
    label = folium.Popup(str(poi) + ' (Cluster: ' + str(name) +')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],  # same colour for border and fill
        fill_opacity=0.5).add_to(cluster_map)
       
cluster_map

__Note:__ this is not a very good clustering: 35 of the 38 neighbourhoods fall into a single cluster. The notebook is only a practice exercise.  