# Segmenting and Clustering Neighborhoods in Toronto

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download data and create table</a>

2. <a href="#item2">Get the latitude and the longitude coordinates of each neighborhood</a>

3. <a href="#item3">Explore and cluster the neighborhoods in Toronto</a>
</font>
</div>

<a id='item1'></a>

## 1. Download data and create table

This notebook was created for the week 3 assignment of the Applied Data Science Capstone course. Let's import all the libraries needed to download the data and create the table.

In [1]:
import requests
import json
from pandas import json_normalize  # public location since pandas 1.0 (older versions: pandas.io.json)
from bs4 import BeautifulSoup

In [2]:
#We use the BeautifulSoup library to scrape the web
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
results = requests.get(url).text
soup = BeautifulSoup(results, "html.parser")  # name the parser explicitly to avoid a warning

In [3]:
table = soup.find("table")
#print(table.prettify())

In [4]:
#From table object identify the column names
columns = table.tr.text.split()
columns

['Postcode', 'Borough', 'Neighbourhood']

In [5]:
# From table get all the rows 
rows = []
for row in table.find_all('tr')[1:]:
    rows.append(row.text.split('\n')[1:-1])
rows[0:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
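The row extraction above splits each row's text on newlines, so it depends on how the page's HTML happens to be laid out. A minimal sketch of a per-cell alternative follows. It runs on a small inline HTML snippet written for illustration (not the Wikipedia page), and the helper name `parse_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page is not fetched here.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_rows(html):
    """Return the header cells and the data rows of the first table in `html`."""
    table = BeautifulSoup(html, "html.parser").find("table")
    all_rows = table.find_all("tr")
    # Header cells are <th>, data cells are <td>; strip=True removes stray whitespace.
    header = [cell.get_text(strip=True) for cell in all_rows[0].find_all("th")]
    body = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in all_rows[1:]
    ]
    return header, body

header, body = parse_rows(sample_html)
print(header)
print(body)
```

Reading one cell at a time keeps a value intact even if it contains spaces or line breaks.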

In [6]:
#Create the dataframe
import pandas as pd
df = pd.DataFrame(rows, columns= columns)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [7]:
#We want to drop the rows with a borough that is Not assigned
indexes = df[df['Borough'] == 'Not assigned'].index.values
df.drop(index = indexes, inplace =  True)
df.reset_index(drop = True, inplace = True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [8]:
#Next, we look for rows with Not assigned neighbourhood and assign them the name of the borough
df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
5,M7A,Queen's Park,Not assigned


In [9]:
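# Assign through .loc with a boolean mask; chained indexing (df.iloc[5][...] = ...) may write to a copy
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']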
df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [10]:
#Finally, we group the neighbourhoods with the same postal code
table_n = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x))
table_n =pd.DataFrame(table_n)
table_n.reset_index(inplace = True)
table_n.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [11]:
table_n.shape

(103, 3)
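The three cleaning steps in this section (drop unassigned boroughs, fill unassigned neighbourhoods with the borough name, join neighbourhoods that share a postcode) can also be written as one function. The sketch below runs on a handful of toy rows rather than the scraped table, and the name `clean_postcodes` is hypothetical.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values chosen for illustration).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M6A", "North York", "Lawrence Heights"],
        ["M6A", "North York", "Lawrence Manor"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

def clean_postcodes(frame):
    """Drop unassigned boroughs, fill missing neighbourhoods, group by postcode."""
    out = frame[frame["Borough"] != "Not assigned"].copy()
    missing = out["Neighbourhood"] == "Not assigned"
    out.loc[missing, "Neighbourhood"] = out.loc[missing, "Borough"]
    # Join the neighbourhood names of each (Postcode, Borough) pair into one string.
    grouped = out.groupby(["Postcode", "Borough"])["Neighbourhood"].agg(", ".join)
    return grouped.reset_index()

cleaned = clean_postcodes(raw)
print(cleaned)
```

Working on a copy leaves the scraped frame untouched, so the cell can be re-run without scraping again.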

<a id='item2'></a>

## 2. Get the latitude and the longitude coordinates of each neighborhood

In [12]:
#Download the data of the coordinates
!pip install wget
import wget
filename = wget.download("https://cocl.us/Geospatial_data")
filename



'Geospatial_Coordinates (2).csv'

In [13]:
#Convert data into a dataframe 
lat_long = pd.read_csv(filename)
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
#Rename columns to merge the two dataframes
lat_long.columns = ['Postcode', 'Latitude', 'Longitude']
table_n = pd.merge(table_n, lat_long, on = ['Postcode'])
table_n.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
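An inner merge silently drops any postcode that has no match in the coordinates file. One way to check for this, sketched on toy frames, is a left merge with pandas' `indicator` and `validate` options. The postcode `M9Z` is made up to show the unmatched case.

```python
import pandas as pd

# Toy frames; "M9Z" is a made-up postcode with no coordinates.
postcodes = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"],
                          "Borough": ["Scarborough", "Scarborough", "Example"]})
coords = pd.DataFrame({"Postcode": ["M1B", "M1C"],
                       "Latitude": [43.806686, 43.784535],
                       "Longitude": [-79.194353, -79.160497]})

# how="left" keeps every postcode; indicator=True adds a "_merge" column that
# records whether a coordinate row was found; validate raises if a key repeats.
checked = pd.merge(postcodes, coords, on="Postcode", how="left",
                   validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postcode received coordinates.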


<a id='item3'></a>

## 3. Explore and cluster the neighborhoods in Toronto

Get the latitude and longitude coordinates of Toronto

In [15]:
!pip install folium
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import folium



In [16]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto city are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto city are 43.653963, -79.387207.


### Create a map of Toronto with neighbourhoods superimposed

In [17]:
tor_map =  folium.Map(location=[latitude, longitude], zoom_start = 11)
for lat, lng, borough, neigh in zip(table_n['Latitude'], table_n['Longitude'], table_n['Borough'], table_n['Neighbourhood']):
    label = '{}, {}'.format(neigh, borough)
    label =  folium.Popup(label, parse_html= True)
    folium.CircleMarker([lat, lng], 
                        radius = 4, 
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.5).add_to(tor_map)
tor_map


Let's define the Foursquare credentials needed to explore the neighbourhoods and segment them.

In [18]:
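# @hidden_cell
import os
# Read the credentials from the environment instead of hard-coding them
# (the variable names below are placeholders; use whatever names you export).
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20191118' # Foursquare API version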

### Get the top venues in the first neighbourhood within a radius of 500 m

In [19]:
neighborhood_latitude = table_n.loc[0,'Latitude']
neighborhood_longitude =  table_n.loc[0, 'Longitude']
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
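The request URL above is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which also escapes any special characters. The helper name `build_explore_url` and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, not real ones.
example_url = build_explore_url("MY_ID", "MY_SECRET", "20191118",
                                43.806686, -79.194353, 500, 100)
print(example_url)
```

The comma in `ll` is percent-encoded as `%2C`, which is the standard URL encoding for a comma.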


In [20]:
results_v = requests.get(url).json()
results_v

{'meta': {'code': 200, 'requestId': '5dd577a40be7b4001bb15d68'},
 'response': {'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

In [21]:
venues = results_v['response']['groups'][0]['items']

In [22]:
venues

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4bb6b9446edc76b0d771311c',
   'name': "Wendy's",
   'location': {'crossStreet': 'Morningside & Sheppard',
    'lat': 43.80744841934756,
    'lng': -79.19905558052072,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.80744841934756,
      'lng': -79.19905558052072}],
    'distance': 387,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['Toronto ON', 'Canada']},
   'categories': [{'id': '4bf58dd8d48988d16e941735',
     'name': 'Fast Food Restaurant',
     'pluralName': 'Fast Food Restaurants',
     'shortName': 'Fast Food',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/fastfood_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4bb6b9446edc76b0d771311c-0'}]

In [23]:
from pandas import json_normalize  # public location since pandas 1.0 (older versions: pandas.io.json)
nearby_venues = json_normalize(venues)
nearby_venues

Unnamed: 0,reasons.count,reasons.items,referralId,venue.categories,venue.id,venue.location.cc,venue.location.city,venue.location.country,venue.location.crossStreet,venue.location.distance,venue.location.formattedAddress,venue.location.labeledLatLngs,venue.location.lat,venue.location.lng,venue.location.state,venue.name,venue.photos.count,venue.photos.groups
0,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-4bb6b9446edc76b0d771311c-0,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",4bb6b9446edc76b0d771311c,CA,Toronto,Canada,Morningside & Sheppard,387,"[Toronto ON, Canada]","[{'label': 'display', 'lat': 43.80744841934756...",43.807448,-79.199056,ON,Wendy's,0,[]
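The dotted column names in the output above come from flattening nested dictionaries: each level of nesting is joined with a dot, while lists are left as they are. A small sketch on a trimmed-down item, assuming pandas 1.0 or later (where `pd.json_normalize` is available):

```python
import pandas as pd

# A trimmed-down item in the shape of the API response shown above.
sample_items = [
    {
        "venue": {
            "name": "Wendy's",
            "location": {"lat": 43.80744841934756, "lng": -79.19905558052072},
            "categories": [{"name": "Fast Food Restaurant"}],
        }
    }
]

# Nested dicts become dotted columns; the categories list stays a list in one cell.
flat = pd.json_normalize(sample_items)
print(sorted(flat.columns))
```

This is why `venue.categories` still holds a list of dictionaries and needs a separate extraction step.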


In [24]:
#Extract meaningful columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']

nearby_venues = nearby_venues.loc[:,filtered_columns]
nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Wendy's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",43.807448,-79.199056


In [25]:
# Define a function to extract the category name from a row
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [26]:
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


### Apply the same process to all neighbourhoods

In [27]:
LIMIT = 50
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
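The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A hedged sketch of the same flattening step with that case handled follows. The helper `flatten_items` is hypothetical, and the second sample venue is invented to show the empty-category case.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Illustrative input: the second venue is made up and has no category.
items = [
    {"venue": {"name": "Wendy's",
               "location": {"lat": 43.807448, "lng": -79.199056},
               "categories": [{"name": "Fast Food Restaurant"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.8, "lng": -79.2},
               "categories": []}},
]
flat_rows = flatten_items("Rouge, Malvern", 43.806686, -79.194353, items)
print(flat_rows)
```

Rows with a `None` category can then be dropped or labelled before the one-hot encoding step.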

In [28]:
toronto_venues = getNearbyVenues(table_n['Neighbourhood'], table_n['Latitude'], table_n['Longitude'], radius=500)

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [29]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [30]:
toronto_venues.shape

(1696, 7)

In [31]:
# Number of venues returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",50,50,50,50,50,50
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9,9,9,9,9,9
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Downsview North, Wilson Heights",19,19,19,19,19,19
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Berczy Park,50,50,50,50,50,50
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [32]:
#Unique categories
len(toronto_venues['Venue Category'].unique())

250

In [33]:
#Convert categorical variables into indicator variables
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add the neighborhood as 'Neighborhood2' (the one-hot columns already include a venue category called 'Neighborhood')
toronto_onehot['Neighborhood2'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood2,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
# Group the rows by neighborhood and take the mean, i.e. the frequency of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood2').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood2,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
toronto_grouped.shape

(100, 251)
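Taking the mean of one-hot columns per neighborhood gives the share of venues in each category, which is why every row of `toronto_grouped` sums to 1. A toy sketch with made-up neighborhoods `A` and `B`:

```python
import pandas as pd

# Toy venue list; neighborhood names and categories are illustrative.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Bar", "Park", "Park"],
})

# One indicator column per category, then the per-neighborhood mean of each column.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies)
```

Neighborhood `A` has four venues, two of them coffee shops, so its `Coffee Shop` frequency is 0.5.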

In [36]:
toronto_grouped.iloc[0, :].iloc[1: ].sort_values(ascending=False)

American Restaurant           0.06
Steakhouse                    0.06
Café                          0.06
Asian Restaurant              0.06
Pizza Place                   0.04
Hotel                         0.04
Coffee Shop                   0.04
Sushi Restaurant              0.04
Gastropub                     0.04
Breakfast Spot                0.02
Lounge                        0.02
Plaza                         0.02
Burger Joint                  0.02
Speakeasy                     0.02
Juice Bar                     0.02
Deli / Bodega                 0.02
Smoke Shop                    0.02
Brazilian Restaurant          0.02
Salad Place                   0.02
Jazz Club                     0.02
Monument / Landmark           0.02
Bar                           0.02
Japanese Restaurant           0.02
Seafood Restaurant            0.02
Record Shop                   0.02
Concert Hall                  0.02
Burrito Place                 0.02
Greek Restaurant              0.02
Colombian Restaurant

#### Let's create a dataframe with the top 10 most common venues for each neighborhood

In [37]:
#Define a function to sort the venues 
def most_common_venues(row, num_top_venues):
    row_cat = row.iloc[1:]
    sorted_row_cat = row_cat.sort_values(ascending=False)
    
    return sorted_row_cat.index.values[0:num_top_venues]  

In [38]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

columns

['Neighborhood',
 '1st Most Common Venue',
 '2nd Most Common Venue',
 '3rd Most Common Venue',
 '4th Most Common Venue',
 '5th Most Common Venue',
 '6th Most Common Venue',
 '7th Most Common Venue',
 '8th Most Common Venue',
 '9th Most Common Venue',
 '10th Most Common Venue']

In [39]:
#Create dataframe
toronto_venues_sorted =pd.DataFrame(columns = columns)
toronto_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood2']
toronto_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",,,,,,,,,,
1,Agincourt,,,,,,,,,,
2,"Agincourt North, L'Amoreaux East, Milliken, St...",,,,,,,,,,
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",,,,,,,,,,
4,"Alderwood, Long Branch",,,,,,,,,,


In [40]:
for i in np.arange(toronto_grouped.shape[0]):
    toronto_venues_sorted.iloc[i, 1:] = most_common_venues(toronto_grouped.iloc[i, :], num_top_venues)

toronto_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",American Restaurant,Steakhouse,Café,Asian Restaurant,Pizza Place,Hotel,Coffee Shop,Sushi Restaurant,Gastropub,Breakfast Spot
1,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Yoga Studio,Dessert Shop,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Playground,Coffee Shop,Park,Deli / Bodega,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fried Chicken Joint,Sandwich Place,Discount Store,Pizza Place,Fast Food Restaurant,Beer Store,Japanese Restaurant,Pharmacy,Garden Center
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Coffee Shop,Pub,Sandwich Place,Skating Rink,Gym,Gastropub,Creperie,Dog Run


In [41]:
toronto_venues_sorted.shape

(100, 11)
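A neighborhood with only a few venues cannot have ten categories with non-zero frequency, so some of its "most common" slots are filled by categories whose frequency is 0. As a sketch of one way to make this visible, the hypothetical helper below returns only the categories that actually occur, on a toy row laid out like a row of `toronto_grouped`.

```python
import pandas as pd

# Toy frequency row: name first, then one value per category (values are illustrative).
row = pd.Series({"Neighborhood2": "Example", "Bar": 0.2, "Coffee Shop": 0.5,
                 "Park": 0.3, "Pharmacy": 0.0})

def top_categories(row, n):
    """Return up to n category names with the highest non-zero frequency in `row`."""
    frequencies = row.iloc[1:].astype(float)
    present = frequencies[frequencies > 0]
    return present.nlargest(n).index.tolist()

print(top_categories(row, 2))
print(top_categories(row, 10))
```

Asking for ten categories here returns only three, because `Pharmacy` never occurs in the toy row.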

#### Cluster neighborhoods

In [42]:
from sklearn.cluster import KMeans
num_clusters = 3
data = toronto_grouped.drop('Neighborhood2', axis = 1)

kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(data)

kmeans.labels_[0:10]

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
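The notebook fixes `num_clusters = 3` without comparing alternatives. One common way to sanity-check such a choice, which is not part of the original analysis, is to compare silhouette scores for several values of k. The sketch below uses synthetic 2-D points rather than the Toronto data and assumes scikit-learn's `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups (not the Toronto data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same loop could be run on `data`, although sparse frequency vectors may not show as clear a peak as this synthetic example.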

In [44]:
len(toronto_venues_sorted.columns)
#toronto_venues_sorted.drop('Cluster Labels', axis= 1, inplace =True)

11

In [45]:
#Add cluster labels column to dataframe
toronto_venues_sorted.insert(len(toronto_venues_sorted.columns), 'Cluster Labels', kmeans.labels_)
toronto_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,"Adelaide, King, Richmond",American Restaurant,Steakhouse,Café,Asian Restaurant,Pizza Place,Hotel,Coffee Shop,Sushi Restaurant,Gastropub,Breakfast Spot,0
1,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Yoga Studio,Dessert Shop,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Playground,Coffee Shop,Park,Deli / Bodega,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,1
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Fried Chicken Joint,Sandwich Place,Discount Store,Pizza Place,Fast Food Restaurant,Beer Store,Japanese Restaurant,Pharmacy,Garden Center,0
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Coffee Shop,Pub,Sandwich Place,Skating Rink,Gym,Gastropub,Creperie,Dog Run,0


In [46]:
#Merge dataframes
table_n.columns = ['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']
toronto_merged = pd.merge(table_n.iloc[:,1:], toronto_venues_sorted, on='Neighborhood')
toronto_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,Scarborough,"Rouge, Malvern",43.806686,-79.194353,Fast Food Restaurant,Department Store,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,0
1,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,History Museum,Bar,Yoga Studio,Department Store,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,0
2,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Pizza Place,Mexican Restaurant,Electronics Store,Medical Center,Breakfast Spot,Intersection,Rental Car Location,Dim Sum Restaurant,Diner,Discount Store,0
3,Scarborough,Woburn,43.770992,-79.216917,Coffee Shop,Korean Restaurant,Yoga Studio,Department Store,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,0
4,Scarborough,Cedarbrae,43.773136,-79.239476,Fried Chicken Joint,Bakery,Hakka Restaurant,Bank,Thai Restaurant,Caribbean Restaurant,Athletics & Sports,Discount Store,Dim Sum Restaurant,Diner,0


In [47]:
toronto_merged['Cluster Labels'].value_counts()

0    81
1    16
2     4
Name: Cluster Labels, dtype: int64
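The counts above add up to 101, while `toronto_venues_sorted` has 100 rows, and the summary for cluster 0 further down lists Queen's Park with a frequency of 2. A plausible explanation is that this name appears under more than one postcode, so merging on the name yields one row per postcode. The sketch below shows how such repeated keys could be detected. The frames are toy data and the second postcode is invented.

```python
import pandas as pd

# Toy frames: the same name appears under two postcodes ("M0X" is made up).
left = pd.DataFrame({"Postcode": ["M7A", "M0X", "M1B"],
                     "Neighborhood": ["Queen's Park", "Queen's Park", "Rouge, Malvern"]})
right = pd.DataFrame({"Neighborhood": ["Queen's Park", "Rouge, Malvern"],
                      "Cluster Labels": [0, 0]})

# Names that occur more than once on the left side of the merge.
is_repeated = left["Neighborhood"].duplicated(keep=False)
repeated = left.loc[is_repeated, "Neighborhood"].unique().tolist()

# validate="many_to_one" raises if the right-hand key is not unique.
merged = pd.merge(left, right, on="Neighborhood", validate="many_to_one")
print(repeated, len(merged), right["Neighborhood"].nunique())
```

Merging on `Postcode` instead of the name would avoid the ambiguity, provided the postcode is carried through the venue tables.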

In [48]:
#Create a map to visualize the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [49]:
latitude, longitude

(43.653963, -79.387207)

#### Analyze each cluster to assign a name

In [50]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]))]].describe(include = 'all')

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
count,81,81,81,81,81,81,81,81,81,81,81,81.0
unique,80,36,47,51,51,51,46,48,48,48,46,
top,Queen's Park,Coffee Shop,Coffee Shop,Coffee Shop,Italian Restaurant,Department Store,Café,Coffee Shop,Diner,Diner,Drugstore,
freq,2,18,11,6,6,4,5,5,5,7,7,
mean,,,,,,,,,,,,0.0
std,,,,,,,,,,,,0.0
min,,,,,,,,,,,,0.0
25%,,,,,,,,,,,,0.0
50%,,,,,,,,,,,,0.0
75%,,,,,,,,,,,,0.0


In [51]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]))]].describe(include = 'all')

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
count,16,16,16,16,16,16,16,16,16,16,16,16.0
unique,16,10,9,12,12,8,6,6,6,6,6,
top,East Toronto,Park,Park,Park,Yoga Studio,Department Store,Dim Sum Restaurant,Diner,Dumpling Restaurant,Drugstore,Donut Shop,
freq,1,5,7,3,4,4,4,4,4,4,6,
mean,,,,,,,,,,,,1.0
std,,,,,,,,,,,,0.0
min,,,,,,,,,,,,1.0
25%,,,,,,,,,,,,1.0
50%,,,,,,,,,,,,1.0
75%,,,,,,,,,,,,1.0


In [52]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(4, toronto_merged.shape[1]))]].describe(include = 'all')

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
count,4,4,4,4,4,4,4,4,4,4,4,4.0
unique,4,3,3,3,4,3,2,2,2,2,3,
top,Roselawn,Baseball Field,Home Service,Yoga Studio,Yoga Studio,Ethiopian Restaurant,Dim Sum Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,
freq,1,2,2,2,1,2,2,2,2,2,2,
mean,,,,,,,,,,,,2.0
std,,,,,,,,,,,,0.0
min,,,,,,,,,,,,2.0
25%,,,,,,,,,,,,2.0
50%,,,,,,,,,,,,2.0
75%,,,,,,,,,,,,2.0


For clusters 0 and 1, the most frequent entries among the top three most common venues are Coffee Shop and Park, respectively. Cluster 2 has no predominant venue, so we might name the clusters as follows:
- Cluster 0: Coffee shops and food venues
- Cluster 1: Parks and recreational activities
- Cluster 2: Miscellaneous
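The naming above was read off the `describe` summaries by eye. As a rough sketch, the same information can be tabulated by taking, for each cluster, the most frequent value of a venue column and its share of the cluster. The helper `dominant_venue` is hypothetical and the table below is toy data, not `toronto_merged`.

```python
import pandas as pd

# Toy table in the layout of toronto_merged (values are illustrative).
toy_merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Pizza Place",
                              "Park", "Park", "Baseball Field"],
})

def dominant_venue(frame, column="1st Most Common Venue"):
    """For each cluster, return the most frequent value of `column` and its share."""
    rows = []
    for cluster, group in frame.groupby("Cluster Labels"):
        counts = group[column].value_counts()
        rows.append({"Cluster Labels": cluster,
                     "Top Venue": counts.index[0],
                     "Share": counts.iloc[0] / len(group)})
    return pd.DataFrame(rows)

summary = dominant_venue(toy_merged)
print(summary)
```

A low share for the top venue, as in cluster 2 of the notebook, suggests that no single venue type characterizes the cluster.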