# Explore and Cluster the Neighborhoods in Toronto

Now that the data is in a structured format, we can replicate the analysis we did on the New York City dataset to explore and cluster the neighborhoods of Toronto.

## 1. Assignment Requirements

You can decide to work only with boroughs that contain the word Toronto and then replicate the analysis we did on the New York City data. It is up to you.

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.

## 2. Reading Dataframe and Importing Libraries

First, let's import the libraries we will use, then read and check the dataframe produced in the previous step.

In [1]:
import requests
import folium
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import json_normalize  # the pandas.io.json import path is deprecated
from sklearn.cluster import KMeans

df = pd.read_csv('toronto_geo_df.csv', index_col=0)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75244,-79.329271
1,M4A,North York,Victoria Village,43.730421,-79.31332
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723125,-79.451589
4,M7A,Queen's Park,Queen's Park,43.661102,-79.391035


## 3. Choosing Boroughs for Analysis

It might be useful first to have a look at all the boroughs in our Toronto dataframe and check how many postal code areas are in each borough.

In [2]:
boroughs_df = df[['PostalCode', 'Borough']].groupby(by='Borough', sort=False).count().reset_index()

boroughs_df.rename(columns={'PostalCode': 'PostalCodeCount'}, inplace=True)
boroughs_df['PostalCodePercentage'] = 100*round(boroughs_df['PostalCodeCount']/boroughs_df['PostalCodeCount'].sum(), 3)

boroughs_df

Unnamed: 0,Borough,PostalCodeCount,PostalCodePercentage
0,North York,24,23.3
1,Downtown Toronto,18,17.5
2,Queen's Park,1,1.0
3,Etobicoke,12,11.7
4,Scarborough,17,16.5
5,East York,5,4.9
6,York,5,4.9
7,East Toronto,5,4.9
8,West Toronto,6,5.8
9,Central Toronto,9,8.7


Let's work only with boroughs that contain the word `'Toronto'`. That way we cover around 37% of all postal code areas in our initial dataframe. We can extract the rows for those boroughs using the `str.contains()` method.

In [3]:
# select the matching rows and reset the index in one step (avoids modifying a slice in place)
toronto_df = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657363,-79.37818
2,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481
3,M4E,East Toronto,The Beaches,43.676845,-79.295225
4,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675
5,M5G,Downtown Toronto,Central Bay Street,43.656091,-79.38493
6,M6G,Downtown Toronto,Christie,43.668781,-79.42071
7,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.649515,-79.382503
8,M6H,West Toronto,"Dufferin, Dovercourt Village",43.665087,-79.438705
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.62347,-79.391507
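
As a small illustration of the filter above, the sketch below applies `str.contains()` to a toy dataframe. Most rows are copied from the sample output; the row with postal code `M1X` and a missing borough is made up for the example, and the names `sample_df` and `toronto_only` are illustrative. Passing `na=False` is an optional safeguard, assumed here, that treats missing borough names as non-matches.

```python
import pandas as pd

# toy data: a few rows from the sample output plus one hypothetical row with a missing borough
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M7A', 'M4E', 'M1X'],
    'Borough': ['North York', 'Downtown Toronto', "Queen's Park", 'East Toronto', None],
})

# keep rows whose borough name contains 'Toronto'; na=False counts missing values as non-matches
toronto_only = sample_df[
    sample_df['Borough'].str.contains('Toronto', na=False)
].reset_index(drop=True)

print(toronto_only)
```

Chaining `reset_index(drop=True)` onto the selection gives a fresh dataframe with a clean 0-based index.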


## 4. Visualizing Chosen Neighborhoods on the Map

Now let's visualize the neighborhoods within the chosen boroughs on a map using the `folium` library. Each marker will display the neighborhoods of its postal code area both on click and on hover. To do that we pass the `popup` and `tooltip` parameters to `CircleMarker()`.

In [4]:
# create map of Toronto using latitude and longitude values
location_toronto = [43.678503, -79.383558]
map_toronto = folium.Map(location=location_toronto, zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood']):

    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        tooltip=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
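
The map center in `location_toronto` is hard-coded. As an alternative sketch, the center could be derived from the data by averaging the coordinates. The example below uses only the first five rows shown in the `toronto_df` output, so the result is illustrative; in the notebook, something like `toronto_df[['Latitude', 'Longitude']].mean().tolist()` should give the equivalent over all rows.

```python
from statistics import mean

# coordinates of the first five rows shown in the toronto_df output
coords = [
    (43.65512, -79.36264),    # M5A
    (43.657363, -79.37818),   # M5B
    (43.65121, -79.375481),   # M5C
    (43.676845, -79.295225),  # M4E
    (43.64516, -79.373675),   # M5E
]

# average latitude and longitude give a rough center for the map
center = [mean(lat for lat, _ in coords), mean(lng for _, lng in coords)]
print(center)
```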

## 5. Exploring Neighborhoods with Foursquare Data

Next, we will utilize the Foursquare API to explore the neighborhoods and segment them.

### 5.1. Making Foursquare API Request

We will send an **explore** request to the Foursquare API for each neighborhood with `radius=500` and `limit=100`. Then we will store the relevant information about each returned venue (its name and category) in a pandas dataframe.

In [5]:
# open txt file with foursquare credentials and read it into CLIENT_ID and CLIENT_SECRET
with open('foursquare.txt', 'r') as file:
    CLIENT_ID = file.readline().replace('\n', '')
    CLIENT_SECRET = file.readline().replace('\n', '')

# define version, radius and limit
VERSION = '20190402'
RADIUS = 500
LIMIT = 100

First, let's check what a Foursquare API request returns for one postal code (M5A), setting `limit=2` so that the request returns just two venues. After examining the structure of the response, we can work out how to extract the names and categories of all venues in all neighborhoods.

In [6]:
limit = 2
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    toronto_df.loc[0, 'Latitude'],
    toronto_df.loc[0, 'Longitude'],
    RADIUS,
    limit)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20190402&ll=43.65512000000007,-79.36263979699999&radius=500&limit=2'
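
The query string above is assembled with `str.format()`. A minimal alternative sketch, assuming the same endpoint and parameter names, builds it from a dictionary with the standard library's `urlencode`. The helper name `build_explore_url` and the credential values are placeholders. With `requests`, passing the same dictionary as `requests.get(base_url, params=params)` should have the same effect.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build an explore URL from named parameters (illustrative helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters such as the comma in 'll'
    return BASE_URL + '?' + urlencode(params)

# placeholder credentials, not real keys
url = build_explore_url('MY_ID', 'MY_SECRET', '20190402', 43.65512, -79.36264, 500, 2)

# parse the query back to check that the parameters survive the round trip
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['limit'])
```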

In [7]:
result = requests.get(url).json()
result

{'meta': {'code': 200, 'requestId': '5ca4a9226a60712cef19ed87'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 21,
  'suggestedBounds': {'ne': {'lat': 43.65962000450007,
    'lng': -79.35643170823275},
   'sw': {'lat': 43.650619995500065, 'lng': -79.36884788576722}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label'

### 5.2. Exploring Request's Returned Object

We can see that calling `json()` on the response object returns a complex structure of nested dictionaries and lists. We can navigate it by checking the `keys()` of the parent dictionary and then working further down, checking the keys of nested dictionaries or the items of nested lists that look relevant.

In [8]:
result.keys()

dict_keys(['meta', 'response'])

In [9]:
type(result['response'])

dict

In [10]:
result['response'].keys()



In [11]:
type(result['response']['groups'])

list

In [12]:
for i in result['response']['groups']:
    print(i)
    print(type(i))
    print()

{'type': 'Recommended Places', 'name': 'recommended', 'items': [{'reasons': {'count': 0, 'items': [{'summary': 'This spot is popular', 'type': 'general', 'reasonName': 'globalInteractionReason'}]}, 'venue': {'id': '54ea41ad498e9a11e9e13308', 'name': 'Roselle Desserts', 'location': {'address': '362 King St E', 'crossStreet': 'Trinity St', 'lat': 43.653446723052674, 'lng': -79.3620167174383, 'labeledLatLngs': [{'label': 'display', 'lat': 43.653446723052674, 'lng': -79.3620167174383}], 'distance': 192, 'postalCode': 'M5A 1K9', 'cc': 'CA', 'city': 'Toronto', 'state': 'ON', 'country': 'Canada', 'formattedAddress': ['362 King St E (Trinity St)', 'Toronto ON M5A 1K9', 'Canada']}, 'categories': [{'id': '4bf58dd8d48988d16a941735', 'name': 'Bakery', 'pluralName': 'Bakeries', 'shortName': 'Bakery', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_', 'suffix': '.png'}, 'primary': True}], 'photos': {'count': 0, 'groups': []}}, 'referralId': 'e-0-54ea41ad498e9a11e9e13308-0'}, {

In [13]:
result['response']['groups'][0].keys()

dict_keys(['type', 'name', 'items'])

In [14]:
type(result['response']['groups'][0]['items'])

list

In [15]:
for i in result['response']['groups'][0]['items']:
    print(i)
    print(type(i))
    print()

{'reasons': {'count': 0, 'items': [{'summary': 'This spot is popular', 'type': 'general', 'reasonName': 'globalInteractionReason'}]}, 'venue': {'id': '54ea41ad498e9a11e9e13308', 'name': 'Roselle Desserts', 'location': {'address': '362 King St E', 'crossStreet': 'Trinity St', 'lat': 43.653446723052674, 'lng': -79.3620167174383, 'labeledLatLngs': [{'label': 'display', 'lat': 43.653446723052674, 'lng': -79.3620167174383}], 'distance': 192, 'postalCode': 'M5A 1K9', 'cc': 'CA', 'city': 'Toronto', 'state': 'ON', 'country': 'Canada', 'formattedAddress': ['362 King St E (Trinity St)', 'Toronto ON M5A 1K9', 'Canada']}, 'categories': [{'id': '4bf58dd8d48988d16a941735', 'name': 'Bakery', 'pluralName': 'Bakeries', 'shortName': 'Bakery', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_', 'suffix': '.png'}, 'primary': True}], 'photos': {'count': 0, 'groups': []}}, 'referralId': 'e-0-54ea41ad498e9a11e9e13308-0'}
<class 'dict'>

{'reasons': {'count': 0, 'items': [{'summary': 'Th

Great, now we know that the list of dictionaries with venue details can be found in `result['response']['groups'][0]['items']`. Let's assign it to `venues` and explore further.

In [16]:
venues = result['response']['groups'][0]['items']
type(venues[0])

dict

In [17]:
venues[0].keys()

dict_keys(['reasons', 'venue', 'referralId'])

In [18]:
type(venues[0]['venue'])

dict

In [19]:
venues[0]['venue'].keys()

dict_keys(['id', 'name', 'location', 'categories', 'photos'])

In [20]:
venues[0]['venue']['name']

'Roselle Desserts'

In [21]:
venues[0]['venue']['categories']

[{'id': '4bf58dd8d48988d16a941735',
  'name': 'Bakery',
  'pluralName': 'Bakeries',
  'shortName': 'Bakery',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
   'suffix': '.png'},
  'primary': True}]

In [22]:
venues[0]['venue']['categories'][0].keys()

dict_keys(['id', 'name', 'pluralName', 'shortName', 'icon', 'primary'])

In [23]:
venues[0]['venue']['categories'][0]['name']

'Bakery'
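
The path found above (`response` → `groups[0]` → `items` → `venue` → `name` and `categories[0]['name']`) can be wrapped in a small helper. This is a sketch, not part of the original notebook: the function name `extract_venues` is illustrative, the sample dictionary is a trimmed-down stand-in for the real response, and the second venue is hypothetical, added to show how an empty `categories` list could be handled.

```python
def extract_venues(result):
    """Return (venue name, first category name) pairs from an explore response.

    Missing groups give an empty list; a venue without categories gets None.
    """
    groups = result.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    pairs = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        pairs.append((venue['name'], category))
    return pairs

# trimmed-down response modelled on the output shown above
sample = {
    'meta': {'code': 200},
    'response': {'groups': [{'type': 'Recommended Places', 'name': 'recommended', 'items': [
        {'venue': {'name': 'Roselle Desserts', 'categories': [{'name': 'Bakery'}]}},
        {'venue': {'name': 'Uncategorised Place', 'categories': []}},  # hypothetical venue
    ]}]},
}

print(extract_venues(sample))
```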

### 5.3. Populating Venues Dataframe

OK, we now know where to find venue names and categories in the JSON response returned by a Foursquare API request. We can iterate through all our neighborhoods, make a request for each one and populate a pandas dataframe with the results.

In [24]:
toronto_df.columns

Index(['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'], dtype='object')

In [25]:
# initialize an empty list of all venues 
venues_list = []

for code, name, lat, lng in zip(
    toronto_df['PostalCode'],
    toronto_df['Neighborhood'],
    toronto_df['Latitude'],
    toronto_df['Longitude']):
    
    # make Foursquare API request
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lng,
        RADIUS,
        LIMIT)
    
    results = requests.get(url).json()['response']['groups'][0]['items']
    
    # append relevant information for all venues (separated in tuples) in the neighborhood we are iterating through
    venues_list.append([(
        code,
        name,
        venue['venue']['name'],
        venue['venue']['categories'][0]['name']) for venue in results])
    

In [26]:
# creating venues dataframe

venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
venues_df.columns = ['PostalCode', 'Neighborhood', 'VenueName', 'VenueCategory']

venues_df.head()

Unnamed: 0,PostalCode,Neighborhood,VenueName,VenueCategory
0,M5A,"Harbourfront, Regent Park",Roselle Desserts,Bakery
1,M5A,"Harbourfront, Regent Park",Tandem Coffee,Coffee Shop
2,M5A,"Harbourfront, Regent Park",Figs Breakfast & Lunch,Breakfast Spot
3,M5A,"Harbourfront, Regent Park",Cocina Economica,Mexican Restaurant
4,M5A,"Harbourfront, Regent Park",Morning Glory Cafe,Breakfast Spot


OK, we managed to create our pandas dataframe and populate it with the relevant information. Let's check how many venues the Foursquare API requests returned in total and how many venues we have per postal code for further analysis.

In [27]:
venues_count_df = venues_df[['PostalCode', 'VenueName']].groupby('PostalCode', sort=False).count().reset_index()
venues_count_df.rename(columns={'VenueName': 'VenueCount'}, inplace=True)
venues_count_df

Unnamed: 0,PostalCode,VenueCount
0,M5A,21
1,M5B,100
2,M5C,100
3,M4E,4
4,M5E,64
5,M5G,96
6,M6G,9
7,M5H,100
8,M6H,20
9,M5J,7


In [28]:
print('We have {} venues across {} postal code areas.'.format(len(venues_df), len(venues_count_df)))
print('There are {} unique venue categories among those venues.'.format(len(venues_df['VenueCategory'].unique())))

We have 1754 venues across 37 postal code areas.
There are 222 unique venue categories among those venues.


At this point we can notice that:
- some postal code areas have a disproportionately small number of returned venues (fewer than 10).
- no venues were returned for one postal code area: `toronto_df` has 38 postal code areas, but only 37 appear in `venues_df` (we need to remember this when we manipulate the dataframes later in the analysis).
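
One possible way to identify such areas explicitly is to count venues per postal code and align the counts to the full list of areas, so that areas with no venues show up with a count of 0. The sketch below uses toy stand-ins for `toronto_df` and `venues_df`; the postal code `M9Z` and the venue `Some Pub` are made up for illustration.

```python
import pandas as pd

# toy stand-ins for toronto_df and venues_df; 'M9Z' is a hypothetical area with no venues
areas = pd.DataFrame({'PostalCode': ['M5A', 'M4E', 'M9Z']})
venues = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4E'],
    'VenueName': ['Roselle Desserts', 'Tandem Coffee', 'Some Pub'],
})

# count venues per postal code, then align to every area (missing areas get 0)
counts = venues.groupby('PostalCode').size().reindex(areas['PostalCode'], fill_value=0)

missing = counts[counts == 0].index.tolist()  # areas with no venues at all
sparse = counts[counts < 10].index.tolist()   # areas with fewer than 10 venues
print(missing, sparse)
```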

## 6. Clustering Neighborhoods with KMeans

Now that we have the list of venues and their categories, we can one-hot encode the categorical values of `VenueCategory`, explore the top venue types for each neighborhood and run the KMeans clustering algorithm.

### 6.1. One Hot Encoding

In [29]:
# one hot encoding
venues_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix='', prefix_sep='')

# add postal code column back to dataframe (pass a Series rather than a one-column dataframe)
venues_onehot.insert(0, 'PostalCode', venues_df['PostalCode'])

venues_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Train Station,Transportation Service,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
venues_onehot.shape

(1754, 223)

### 6.2. Grouping by Postal Code and Taking Average Frequency

In [31]:
venues_grouped = venues_onehot.groupby(by='PostalCode', sort=False).mean().reset_index()
venues_grouped.head()

Unnamed: 0,PostalCode,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Train Station,Transportation Service,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M5A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M5B,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0
2,M5C,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M5E,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,...,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
venues_grouped.shape

(37, 223)
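
To see why averaging one-hot columns gives category frequencies, here is a toy sketch with placeholder postal codes `A1A` and `B2B`. Each value in the grouped result is the share of an area's venues that fall into a category, so every row sums to 1. The example also compares the result with `pd.crosstab(..., normalize='index')`, which should produce the same table in one step.

```python
import numpy as np
import pandas as pd

# toy venues: 'A1A' and 'B2B' are placeholder postal codes
toy = pd.DataFrame({
    'PostalCode': ['A1A', 'A1A', 'A1A', 'A1A', 'B2B', 'B2B'],
    'VenueCategory': ['Café', 'Café', 'Park', 'Pub', 'Park', 'Park'],
})

# route 1: one-hot encode, then average the indicator columns per postal code
onehot = pd.get_dummies(toy['VenueCategory'], dtype=float)
onehot.insert(0, 'PostalCode', toy['PostalCode'])
freq = onehot.groupby('PostalCode').mean()

# route 2: row-normalised contingency table
freq_crosstab = pd.crosstab(toy['PostalCode'], toy['VenueCategory'], normalize='index')

print(freq)
print(np.allclose(freq.to_numpy(), freq_crosstab.to_numpy()))
```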

### 6.3. Creating Dataframe with Top Venues by Neighborhood

In [33]:
# create an empty pandas dataframe

columns = ['PostalCode', '1stTopVenueType', '2ndTopVenueType', '3dTopVenueType']
venues_categ_sorted = pd.DataFrame(columns=columns)
venues_categ_sorted

Unnamed: 0,PostalCode,1stTopVenueType,2ndTopVenueType,3dTopVenueType


In [34]:
columns = ['PostalCode', '1stTopFreq', '2ndTopTopFreq', '3dTopTopFreq']
venues_categ_sorted_check = pd.DataFrame(columns=columns)
venues_categ_sorted_check

Unnamed: 0,PostalCode,1stTopFreq,2ndTopTopFreq,3dTopTopFreq


In [35]:
# number of top venue types we will analyse
num_top = 3

# iterate through index, row pairs of venues_grouped dataframe
for i, row in venues_grouped.iterrows():
    
    # populate postal code column
    venues_categ_sorted.loc[i, ['PostalCode']] = row.iloc[0]
    # sorting other columns with frequencies and extracting index values (venue types names) of top 3 frequencies 
    venues_categ_sorted.loc[i, ['1stTopVenueType', '2ndTopVenueType', '3dTopVenueType']] = row.iloc[1:].sort_values(ascending=False).index.array[:num_top]
    
    venues_categ_sorted_check.loc[i, ['PostalCode']] = row.iloc[0]
    venues_categ_sorted_check.loc[i, ['1stTopFreq', '2ndTopTopFreq', '3dTopTopFreq']] = row.iloc[1:].sort_values(ascending=False).array[:num_top]

venues_categ_sorted_check

Unnamed: 0,PostalCode,1stTopFreq,2ndTopTopFreq,3dTopTopFreq
0,M5A,0.238095,0.0952381,0.0952381
1,M5B,0.09,0.07,0.04
2,M5C,0.07,0.07,0.05
3,M4E,0.25,0.25,0.25
4,M5E,0.09375,0.046875,0.046875
5,M5G,0.104167,0.0729167,0.0416667
6,M6G,0.333333,0.222222,0.111111
7,M5H,0.07,0.06,0.06
8,M6H,0.1,0.05,0.05
9,M5J,0.428571,0.142857,0.142857


In [36]:
# blank out venue types whose frequency is 0 (areas with fewer than three distinct categories)
mask = venues_categ_sorted_check == 0
venues_categ_sorted.mask(mask.to_numpy(), inplace=True)
venues_categ_sorted

Unnamed: 0,PostalCode,1stTopVenueType,2ndTopVenueType,3dTopVenueType
0,M5A,Coffee Shop,Breakfast Spot,Restaurant
1,M5B,Coffee Shop,Clothing Store,Café
2,M5C,Restaurant,Coffee Shop,Hotel
3,M4E,Pub,Health Food Store,Neighborhood
4,M5E,Coffee Shop,Cocktail Bar,Restaurant
5,M5G,Coffee Shop,Clothing Store,Cosmetics Shop
6,M6G,Café,Grocery Store,Italian Restaurant
7,M5H,Hotel,Coffee Shop,Café
8,M6H,Park,Grocery Store,Smoke Shop
9,M5J,Harbor / Marina,Park,American Restaurant
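
The two steps above (picking the top three categories, then blanking those with zero frequency) could also be combined into one helper. The following is a sketch under assumed names (`top_categories` and the toy `freq` table are illustrative), using `Series.nlargest` on the non-zero frequencies only.

```python
import pandas as pd

def top_categories(freq_row, n=3):
    """Return the n most frequent categories of one row, padding with None."""
    nonzero = freq_row[freq_row > 0]          # ignore categories that never occur
    top = nonzero.nlargest(n).index.tolist()  # category names, highest frequency first
    return top + [None] * (n - len(top))

# toy frequency table: rows are placeholder postal codes, columns are venue categories
freq = pd.DataFrame(
    {
        'Bus Line': [0.0, 0.5],
        'Coffee Shop': [0.6, 0.0],
        'Park': [0.3, 0.0],
        'Pub': [0.1, 0.0],
        'Swim School': [0.0, 0.5],
    },
    index=['A1A', 'B2B'],
)

columns = ['1stTopVenueType', '2ndTopVenueType', '3dTopVenueType']
top = pd.DataFrame.from_dict(
    {code: top_categories(row) for code, row in freq.iterrows()},
    orient='index',
    columns=columns,
)
print(top)
```

Areas with fewer than three distinct categories end up with an empty third (or second) column rather than an arbitrary zero-frequency category.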


### 6.4. Clustering and Visualizing on the Map

In [37]:
num_clusters = 5

# extract training set for model 
X = venues_grouped.iloc[:, 1:]

# run kmeans clustering
model = KMeans(n_clusters=num_clusters, random_state=0) 
model.fit(X)

# check cluster labels generated for each row in the dataframe
model.labels_, model.labels_.size

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 3,
        0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 37)
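
Almost all areas land in cluster 0, which raises the question of whether `num_clusters = 5` is a good choice. One common sanity check, not used in the original analysis, is to compare silhouette scores for several values of k. The sketch below runs on synthetic two-group data (`X_toy`) rather than on the real frequency matrix `X`, so it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for X: two well-separated groups of 20 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

# fit KMeans for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# the k with the highest score is the best separated under this criterion
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Applied to the real `X`, the same loop would show whether a smaller or larger number of clusters separates the areas more cleanly.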

In [38]:
# add clustering labels to venues_categ_sorted dataframe

venues_categ_sorted.insert(0, 'ClusterLabels', model.labels_)
venues_categ_sorted.head()

Unnamed: 0,ClusterLabels,PostalCode,1stTopVenueType,2ndTopVenueType,3dTopVenueType
0,0,M5A,Coffee Shop,Breakfast Spot,Restaurant
1,0,M5B,Coffee Shop,Clothing Store,Café
2,0,M5C,Restaurant,Coffee Shop,Hotel
3,0,M4E,Pub,Health Food Store,Neighborhood
4,0,M5E,Coffee Shop,Cocktail Bar,Restaurant


In [39]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657363,-79.37818
2,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481
3,M4E,East Toronto,The Beaches,43.676845,-79.295225
4,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675


In [41]:
# join with toronto_df dataframe to get all initial data (coordinates and neighborhood names)

# note that we use a right join (keeping only the postal codes present in venues_categ_sorted)
# because the Foursquare API request returned no venues for one postal code area in toronto_df

toronto_venues_clusters = toronto_df.join(
    venues_categ_sorted.set_index('PostalCode'),
    on='PostalCode',
    how='right').reset_index(drop=True)

# save the resulting dataframe to csv
toronto_venues_clusters.to_csv('toronto_venues_clusters.csv')

toronto_venues_clusters.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264,0,Coffee Shop,Breakfast Spot,Restaurant
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657363,-79.37818,0,Coffee Shop,Clothing Store,Café
2,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481,0,Restaurant,Coffee Shop,Hotel
3,M4E,East Toronto,The Beaches,43.676845,-79.295225,0,Pub,Health Food Store,Neighborhood
4,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675,0,Coffee Shop,Cocktail Bar,Restaurant


In [42]:
# create map
map_clusters = folium.Map(location=location_toronto, zoom_start=12)

# set color scheme for the clusters
x = np.arange(num_clusters)
colors_array = cm.rainbow(np.linspace(0, 1, len(x)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lng, hood, cluster in zip(
    toronto_venues_clusters['Latitude'],
    toronto_venues_clusters['Longitude'],
    toronto_venues_clusters['Neighborhood'],
    toronto_venues_clusters['ClusterLabels']):
    
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        popup=str(hood) + ' Cluster ' + str(cluster),
        tooltip=str(hood) + ' Cluster ' + str(cluster),
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 7. Examining Clusters

Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

In [43]:
# let's check how many neighborhoods are in each cluster

count_df = toronto_venues_clusters[['Neighborhood', 'ClusterLabels']].groupby('ClusterLabels').count().reset_index()
count_df.rename(columns={'Neighborhood': 'NumNeighborhoods'}, inplace=True)
count_df['Percentage'] = round(100*count_df['NumNeighborhoods']/count_df['NumNeighborhoods'].sum(), 1)

count_df

Unnamed: 0,ClusterLabels,NumNeighborhoods,Percentage
0,0,32,86.5
1,1,2,5.4
2,2,1,2.7
3,3,1,2.7
4,4,1,2.7


**Some initial observations:**
- The postal code areas we chose (in boroughs whose names contain the word 'Toronto') *are dominated by a single cluster, cluster 0*, which holds 32 of the 37 areas (86.5%).
- *The other 4 clusters together make up about 13.5%* of all postal code areas analysed.
- *Those 4 clusters lie closer to the periphery* of central Toronto than most neighborhoods in cluster 0.
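
The shares quoted in these observations can be checked directly from the cluster sizes in the count table. This is plain arithmetic, shown here as a short sketch:

```python
# cluster sizes copied from the count table above
cluster_sizes = {0: 32, 1: 2, 2: 1, 3: 1, 4: 1}

total = sum(cluster_sizes.values())

# share of cluster 0 and of all other clusters combined, in percent
share_cluster0 = round(100 * cluster_sizes[0] / total, 1)
share_others = round(100 * (total - cluster_sizes[0]) / total, 1)

print(total, share_cluster0, share_others)
```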

### Cluster 0

It seems that the dominant venue types in cluster 0 are Coffee Shops, Cafés and various kinds of Restaurants.

In [44]:
cluster0 = toronto_venues_clusters.loc[toronto_venues_clusters['ClusterLabels'] == 0].iloc[:, -4:]
cluster0

Unnamed: 0,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
0,0,Coffee Shop,Breakfast Spot,Restaurant
1,0,Coffee Shop,Clothing Store,Café
2,0,Restaurant,Coffee Shop,Hotel
3,0,Pub,Health Food Store,Neighborhood
4,0,Coffee Shop,Cocktail Bar,Restaurant
5,0,Coffee Shop,Clothing Store,Cosmetics Shop
6,0,Café,Grocery Store,Italian Restaurant
7,0,Hotel,Coffee Shop,Café
8,0,Park,Grocery Store,Smoke Shop
9,0,Harbor / Marina,Park,American Restaurant


In [45]:
cluster0['1stTopVenueType'].value_counts()

Coffee Shop        16
Café                4
Park                3
Bar                 1
Bus Line            1
Restaurant          1
Pub                 1
Dessert Shop        1
Playground          1
Harbor / Marina     1
Diner               1
Hotel               1
Name: 1stTopVenueType, dtype: int64

In [46]:
cluster0['2ndTopVenueType'].value_counts()

Café                           4
Coffee Shop                    3
Grocery Store                  2
Hotel                          2
Park                           2
Clothing Store                 2
Pizza Place                    2
Restaurant                     2
Steakhouse                     2
Japanese Restaurant            1
Sandwich Place                 1
Furniture / Home Store         1
Health Food Store              1
Eastern European Restaurant    1
Discount Store                 1
Cocktail Bar                   1
Breakfast Spot                 1
Italian Restaurant             1
Bar                            1
Light Rail Station             1
Name: 2ndTopVenueType, dtype: int64

In [47]:
cluster0['3dTopVenueType'].value_counts()

Café                             5
Hotel                            3
Italian Restaurant               3
Restaurant                       3
Coffee Shop                      2
Bakery                           2
Thai Restaurant                  2
American Restaurant              1
Bank                             1
Cosmetics Shop                   1
Sushi Restaurant                 1
Cocktail Bar                     1
Park                             1
Smoke Shop                       1
Vegetarian / Vegan Restaurant    1
Gay Bar                          1
Burger Joint                     1
Medical Center                   1
Neighborhood                     1
Name: 3dTopVenueType, dtype: int64

### Cluster 1

Cluster 1 combines service-type venues (Home Service, Convenience Store) with Parks.

In [48]:
toronto_venues_clusters.loc[toronto_venues_clusters['ClusterLabels'] == 1].iloc[:, -4:]

Unnamed: 0,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
19,1,Home Service,Park,
20,1,Convenience Store,Park,


### Cluster 2

In cluster 2 the dominant venue type is Bus Line.

In [49]:
toronto_venues_clusters.loc[toronto_venues_clusters['ClusterLabels'] == 2].iloc[:, -4:]

Unnamed: 0,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
17,2,Bus Line,Swim School,


### Cluster 3

Cluster 3 seems to be a Park and Recreation type of area.

In [50]:
toronto_venues_clusters.loc[toronto_venues_clusters['ClusterLabels'] == 3].iloc[:, -4:]

Unnamed: 0,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
21,3,Playground,Garden,Park


### Cluster 4

Cluster 4 is quite similar to cluster 3 (Playground is the top venue type in both), but it also has Convenience Store and Gym venues.

In [51]:
toronto_venues_clusters.loc[toronto_venues_clusters['ClusterLabels'] == 4].iloc[:, -4:]

Unnamed: 0,ClusterLabels,1stTopVenueType,2ndTopVenueType,3dTopVenueType
27,4,Playground,Convenience Store,Gym
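
The cluster descriptions above are based on each area's top three venue types. A complementary sketch, not part of the original analysis, averages the category frequencies within each cluster and lists the highest ones, which can help when naming clusters that contain many areas. The toy `freq` table, the placeholder postal codes and the `labels` series below are illustrative. In the notebook, the corresponding inputs would be `venues_grouped` (without the `PostalCode` column) and `model.labels_`.

```python
import pandas as pd

# toy frequency table (rows: placeholder postal code areas, columns: venue categories)
freq = pd.DataFrame(
    {
        'Coffee Shop': [0.5, 0.4, 0.0],
        'Park': [0.1, 0.0, 0.7],
        'Playground': [0.0, 0.1, 0.3],
        'Restaurant': [0.4, 0.5, 0.0],
    },
    index=['A1A', 'B2B', 'C3C'],
)

# hypothetical cluster labels, one per area
labels = pd.Series([0, 0, 1], index=freq.index, name='ClusterLabels')

# average category frequency within each cluster
profile = freq.groupby(labels).mean()

# the two highest-frequency categories per cluster
for cluster, row in profile.iterrows():
    print(cluster, row.nlargest(2).index.tolist())
```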
