# Segmenting and Clustering Neighborhoods in Toronto

### Author: Carlos Paiva

This notebook completes the week 3 assignment of the Applied Data Science Capstone Course, which is part of the process to obtain the IBM Data Science Professional Certification.

This notebook is divided into 3 parts (or tasks):
1. Create an initial pandas dataframe containing the Postal Code, Borough and Neighborhood of each area of Toronto by scraping the relevant information from a Wikipedia page given by Coursera
2. Add the coordinates (latitude and longitude) of each neighborhood to the dataframe so that their locations can be mapped
3. Create clusters (via unsupervised machine learning) that group neighborhoods according to the types of venues found around each area, and visualize these clusters on a map of Toronto with the help of the Folium visualization package

### 1. Creating dataframe for all areas in Toronto (Postal Code, Borough, and Neighborhood name)

In [1]:
# Importing necessary packages:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
# Defining the Wikipedia web address:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
# Asking for response from web address:
webpage_response = requests.get(url)
webpage = webpage_response.content

In [4]:
# Changing answer into soup object with Beautiful Soup:
soup = BeautifulSoup(webpage, 'html.parser')
# print(soup)

In [5]:
# Arranging results from Beautiful Soup into a tabular dictionary format:
table_contents=[]
table=soup.find('table')
for row in table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# Printing first 5 rows to see results:
print(table_contents[:5])

[{'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'}, {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'}, {'PostalCode': 'M5A', 'Borough': 'Downtown Toronto', 'Neighborhood': 'Regent Park, Harbourfront'}, {'PostalCode': 'M6A', 'Borough': 'North York', 'Neighborhood': 'Lawrence Manor, Lawrence Heights'}, {'PostalCode': 'M7A', 'Borough': "Queen's Park", 'Neighborhood': 'Ontario Provincial Government'}]
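
The chained string operations in the loop above can be hard to follow. As a minimal sketch, the same borough/neighborhood split could be written as a small helper. `parse_span_text` is a hypothetical name that does not appear in the original notebook, and, unlike the loop above, it returns an empty neighborhood instead of raising an error when the text contains no parenthesis.

```python
def parse_span_text(span_text):
    """Split text such as 'Borough(Name A / Name B)' into (borough, neighborhoods)."""
    # Everything before the first '(' is the borough; the rest lists the neighborhoods.
    borough, _, rest = span_text.partition('(')
    # Drop the closing parenthesis and turn ' /' separators into commas.
    neighborhood = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhood


print(parse_span_text('Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_span_text('North York(Parkwoods)'))
```

Keeping the parsing in one function makes it possible to check the string handling on a few sample cells before running it over the whole table.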


In [6]:
# Creating pandas dataframe from the scraped dictionaries
df = pd.DataFrame(table_contents)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [7]:
# Exploring the obtained dataframe:
print('Number of unique Postal Codes: ', df.PostalCode.nunique(), '\n')
print('Number of unique Boroughs: ', df.Borough.nunique(), '\n')
print('Number of unique Neighborhoods: ', df.Neighborhood.nunique(), '\n')
print('List of unique Postal Codes: ', df.PostalCode.unique(), '\n')
print('List of unique Boroughs: ', df.Borough.unique(), '\n')
# print('List of unique Neighborhoods: ', df.Neighborhood.unique())

Number of unique Postal Codes:  103 

Number of unique Boroughs:  15 

Number of unique Neighborhoods:  103 

List of unique Postal Codes:  ['M3A' 'M4A' 'M5A' 'M6A' 'M7A' 'M9A' 'M1B' 'M3B' 'M4B' 'M5B' 'M6B' 'M9B'
 'M1C' 'M3C' 'M4C' 'M5C' 'M6C' 'M9C' 'M1E' 'M4E' 'M5E' 'M6E' 'M1G' 'M4G'
 'M5G' 'M6G' 'M1H' 'M2H' 'M3H' 'M4H' 'M5H' 'M6H' 'M1J' 'M2J' 'M3J' 'M4J'
 'M5J' 'M6J' 'M1K' 'M2K' 'M3K' 'M4K' 'M5K' 'M6K' 'M1L' 'M2L' 'M3L' 'M4L'
 'M5L' 'M6L' 'M9L' 'M1M' 'M2M' 'M3M' 'M4M' 'M5M' 'M6M' 'M9M' 'M1N' 'M2N'
 'M3N' 'M4N' 'M5N' 'M6N' 'M9N' 'M1P' 'M2P' 'M4P' 'M5P' 'M6P' 'M9P' 'M1R'
 'M2R' 'M4R' 'M5R' 'M6R' 'M7R' 'M9R' 'M1S' 'M4S' 'M5S' 'M6S' 'M1T' 'M4T'
 'M5T' 'M1V' 'M4V' 'M5V' 'M8V' 'M9V' 'M1W' 'M4W' 'M5W' 'M8W' 'M9W' 'M1X'
 'M4X' 'M5X' 'M8X' 'M4Y' 'M7Y' 'M8Y' 'M8Z'] 

List of unique Boroughs:  ['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'East YorkEast Toronto'
 'Central Toronto' 'MississaugaCanada Post Gateway Proce

In [8]:
# Renaming some boroughs for clarity:
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
print(df.Borough.nunique())
print(df.Borough.unique())

15
['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'East York/East Toronto'
 'Central Toronto' 'Mississauga' 'Downtown Toronto Stn A'
 'Etobicoke Northwest' 'East Toronto Business']


In [9]:
# Exploring the final dataframe:
print(df.shape)

(103, 3)


**Task 1 Final Result: Postal Code, Borough and Neighborhood dataframe**

In [10]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### 2. Adding the coordinates of each neighborhood to the original dataframe

To add the coordinates of each neighborhood, the project relies on the geospatial dataset provided by Coursera. The first approach was to obtain the geographical coordinates of the neighborhoods with the Geocoder package, but the package proved to be quite unreliable.

In [11]:
# Importing the csv file containing the coordinates:
coordenates = pd.read_csv('Geospatial_Coordinates.csv')
print(coordenates.shape)
coordenates.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
# Merging the Latitude and Longitude data with the original dataframe:
toronto_df = pd.merge(df, coordenates, left_on='PostalCode', right_on='Postal Code')
toronto_df.drop('Postal Code', axis=1, inplace=True)

In [13]:
# Checking for missing values during merge:
toronto_df.isnull().sum()

PostalCode      0
Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64

The merge did not produce any missing values. All neighborhoods have been assigned coordinates (latitude and longitude).
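
One caveat: `pd.merge` performs an inner join by default, so a postal code without a match would be dropped rather than filled with `NaN`, and a null check alone would not reveal it. A hedged sketch of a stricter check, using a left join with `indicator=True` on made-up data (the frames below and the postal code `M9Z` are hypothetical, chosen only to force a mismatch):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; 'M9Z' has no coordinates on purpose.
demo_areas = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Demo Borough'],
})
demo_coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every area; the '_merge' column records where each row was found.
checked = pd.merge(demo_areas, demo_coords, how='left',
                   left_on='PostalCode', right_on='Postal Code', indicator=True)

# Rows marked 'left_only' are areas for which no coordinates were found.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

Comparing the row counts before and after the merge is a simpler alternative when only the number of dropped rows matters.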

**Task 2 Final Result: Postal Code, Borough, Neighborhood, Latitude, and Longitude dataframe**

In [14]:
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto Business,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


### 3. Exploring and clustering the neighborhoods in Toronto

#### 3.1 Visualizing all identified neighborhoods in the Toronto map

The first step of the analysis is to visualize all the neighborhoods on a map (before classification). For this, the Nominatim geocoder (from geopy) and the Folium library will be used:

In [15]:
# Importing libraries:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

In [16]:
# Obtaining the coordinates of Toronto to center the visualization:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_lat = location.latitude
toronto_long = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_lat, toronto_long))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [17]:
# Creating map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_lat, toronto_long], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### 3.2 Obtaining data for all venues available in each neighborhood

The next step is to use the Foursquare API to obtain the venues (Venue Name, Venue Latitude, Venue Longitude and Venue Category) located within the area of influence of each neighborhood. This information is then used to segment the neighborhoods.

In [18]:
# Defining the Foursquare credentials:
CLIENT_ID = 'S1IC2GEN1LHJWHQ0YD44BW344JDVCL0UBHVJXDCJ4DCIRQHJ' # your Foursquare ID
CLIENT_SECRET = 'ARS0DCNZVEH4EYDYTZNWHNN0TJ4EPK2GTVKH1L2JNO31ULVW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: S1IC2GEN1LHJWHQ0YD44BW344JDVCL0UBHVJXDCJ4DCIRQHJ
CLIENT_SECRET:ARS0DCNZVEH4EYDYTZNWHNN0TJ4EPK2GTVKH1L2JNO31ULVW


A function is created to collect up to 100 venues within a radius of 500 meters of each neighborhood of Toronto. This function calls the Foursquare API and is based on the one used for segmenting the neighborhoods in New York in the previous exercise of the same course.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=500): # Walking distance radius
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
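
The function above indexes directly into the JSON response, so a response without groups or a venue without categories would raise an exception. The following is a minimal sketch of a more defensive parsing step, assuming the same response shape that the function above relies on. `extract_venue_rows` and the sample payload are hypothetical and are not part of the original notebook or of the Foursquare client.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore-style payload into venue tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Some venues may carry no category; record None instead of failing.
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up payload shaped like the fields accessed in getNearbyVenues.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Demo Park', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]}]}}

demo_rows = extract_venue_rows(sample_payload, 'Demo Area', 43.7, -79.3)
print(demo_rows)
```

Separating the parsing from the HTTP request also makes it possible to test the parsing without spending API calls.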

Now, the created function will be used to obtain the data of all venues per neighborhood included in `toronto_df` (Toronto dataframe).

In [20]:
# Using the created function:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East
The Danforth

In [21]:
# Checking the obtained venues dataframe:
print(toronto_venues.shape)
toronto_venues.head()

(2123, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Parkwoods,43.753259,-79.329656,GreenWin pool,43.756232,-79.333842,Pool
4,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena


In [22]:
# Checking the number of venues per neighborhood:
venues_count = pd.DataFrame(toronto_venues.groupby('Neighborhood').Venue.count().reset_index())
venues_count.columns = ['Neighborhood', 'Venue Count']
venues_count.sort_values(by='Venue Count', ascending=False, axis=0, inplace=True)
print('Total number of neighborhoods with at least 1 venue found within 500 m: ', len(venues_count), '\n')
venues_count

Total number of neighborhoods with at least 1 venue found within 500 m:  100 



Unnamed: 0,Neighborhood,Venue Count
34,"First Canadian Place, Underground city",100
17,"Commerce Court, Victoria Hotel",100
40,"Harbourfront East, Union Station, Toronto Islands",100
36,"Garden District, Ryerson",100
87,"Toronto Dominion Centre, Design Exchange",100
...,...,...
54,"Malvern, Rouge",1
96,"Willowdale, Newtonbrook",1
62,"Old Mill South, King's Mill Park, Sunnylea, Hu...",1
74,Scarborough Village,1


**Note: for 3 neighborhoods, no venues were found within 500 m. Since 3 neighborhoods represent less than 3% of the total, and 500 m represents a reasonable walking distance, the radius will be kept as it is.**
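
To see which neighborhoods returned no venues, the neighborhood names can be compared against those present in the venues dataframe. A minimal sketch on made-up data (the frames and names below are hypothetical stand-ins for `toronto_df` and `toronto_venues`):

```python
import pandas as pd

# Hypothetical stand-ins: 'Beta' has no venue rows.
demo_areas = pd.DataFrame({'Neighborhood': ['Alpha', 'Beta', 'Gamma']})
demo_venues = pd.DataFrame({
    'Neighborhood': ['Alpha', 'Alpha', 'Gamma'],
    'Venue': ['Demo Cafe', 'Demo Park', 'Demo Gym'],
})

# Keep the areas whose name never appears in the venues table.
has_venues = demo_areas['Neighborhood'].isin(demo_venues['Neighborhood'])
missing = demo_areas.loc[~has_venues, 'Neighborhood'].tolist()
print(missing)
```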

In [23]:
# Exploring how many different venue categories are in the obtained dataframe:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


The next step is to perform one-hot encoding, which creates one binary (0 or 1) column per venue category, with one row per venue.

In [24]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood Name'] = toronto_venues.Neighborhood 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2123, 274)


Unnamed: 0,Neighborhood Name,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, the one-hot dataframe will be grouped per neighborhood (using the mean of each type of venue) to find the frequency of each type of venue per neighborhood.

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighborhood Name').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood Name,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
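
The reason the mean of one-hot columns equals a frequency can be seen on a tiny example: each venue contributes a single 1 to its category column, so averaging over a neighborhood gives the share of its venues in each category. A minimal sketch with made-up data (the names below are hypothetical):

```python
import pandas as pd

# Hypothetical venues: area 'A' has 4 venues, area 'B' has 2.
demo_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Pool', 'Cafe', 'Cafe'],
})

# One row per venue, one 0/1 column per category.
onehot = pd.get_dummies(demo_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', demo_venues['Neighborhood'])

# The mean of each 0/1 column is the fraction of venues in that category.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In this example, half of the venues in area `A` are parks, so its `Park` frequency is 0.5, and each row of `freq` sums to 1.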


A new dataframe will be created for displaying the most common venues (top 8) for each neighborhood in different columns.

In [26]:
# Defining a function for sorting the venues in descending order of frequency:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
# Creating the dataframe:
num_top_venues = 8
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood Name']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(100, 9)


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Agincourt,Skating Rink,Lounge,Breakfast Spot,Latin American Restaurant,Metro Station,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Gym,Sandwich Place,Pub,Coffee Shop,Middle Eastern Restaurant,Molecular Gastronomy Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Intersection,Shopping Mall,Park,Middle Eastern Restaurant,Mobile Phone Shop,Sandwich Place
3,Bayview Village,Japanese Restaurant,Bank,Chinese Restaurant,Café,Moroccan Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Coffee Shop,Toy / Game Store,Butcher,Café,Liquor Store,Restaurant
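
Note that a neighborhood with fewer than 8 venue categories still gets 8 entries: the remaining slots are filled with categories whose frequency is zero. Parkwoods, for example, has only 4 venues in the venues dataframe. As a hedged sketch, a variant of the function could skip zero-frequency categories and pad with `None` instead. `top_venues_nonzero` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd


def top_venues_nonzero(row, num_top_venues):
    """Return the most frequent categories, ignoring those with zero frequency."""
    # The first element is the neighborhood name; the rest are frequencies.
    freqs = row.iloc[1:].astype(float)
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    top = list(freqs.index[:num_top_venues])
    # Pad with None so every row has the same number of entries.
    return top + [None] * (num_top_venues - len(top))


# Made-up row: only two categories are actually present.
demo_row = pd.Series({'Neighborhood Name': 'Demo', 'Bank': 0.0,
                      'Park': 0.75, 'Pool': 0.25, 'Zoo': 0.0})
print(top_venues_nonzero(demo_row, 3))  # ['Park', 'Pool', None]
```

Whether padding with `None` is preferable depends on the later steps: it keeps the tables honest, but any code that counts categories then has to ignore the empty slots.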


#### 3.3 Classifying the neighborhoods with K-Means Clustering:

K-Means clustering is used to group the neighborhoods according to the frequency of each venue type present in the neighborhood. The number of clusters is explored iteratively by inspecting the map generated below.

Update: after several iterations, k=25 proved to be a good number of clusters, providing at least 4 balanced groups of neighborhoods.
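
The iteration described above was done visually. As a complementary and purely illustrative sketch, the cluster sizes for several values of k can also be tabulated in a loop. The helper `summarize_k` and the random feature matrix are assumptions made for this example, and explicit settings such as `n_init=10` are a choice of the sketch, not of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans


def summarize_k(features, k_values, random_state=0):
    """Fit k-means for each k and report inertia and cluster-size statistics."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(features)
        sizes = np.bincount(model.labels_, minlength=k)
        rows.append({
            'k': k,
            'inertia': float(model.inertia_),
            'largest': int(sizes.max()),             # size of the biggest cluster
            'singletons': int((sizes == 1).sum()),   # clusters with a single member
        })
    return rows


# Random stand-in for the neighborhood-by-category frequency matrix.
rng = np.random.default_rng(0)
demo_features = rng.random((40, 6))

summary = summarize_k(demo_features, k_values=[2, 4, 8])
for row in summary:
    print(row)
```

A large number of single-member clusters, as in the counts shown later in this notebook, is one sign that most of the k groups are isolating outliers rather than describing shared profiles.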

In [28]:
# Importing k-means from scikit-learn:
from sklearn.cluster import KMeans

In [29]:
# Setting number of clusters:
num_of_clusters = 25
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood Name')

# Running k-means clustering:
kmeans = KMeans(n_clusters = num_of_clusters, random_state=0).fit(toronto_grouped_clustering)

# Checking cluster labels generated for each row in the dataframe:
kmeans.labels_

array([ 1, 21, 21, 13, 21,  1,  1,  1,  1,  8, 21,  1, 21,  1,  1, 21, 18,
        1, 21, 21, 21,  1,  1, 16, 17, 21, 21, 21, 21,  1,  1,  1, 21,  1,
        1, 21,  1,  4, 21,  1,  1,  1,  1, 20, 23, 21, 21, 22,  1,  9,  1,
        9,  1,  1,  6,  8, 21, 15, 21, 21,  1,  1,  5,  1,  1, 21, 21,  1,
        1,  8, 10, 14,  1, 19,  0, 21,  1, 21, 21,  1,  1, 21, 21,  3,  1,
       11, 21,  1,  1, 24, 12, 21,  3, 21,  1, 21,  2,  7, 21,  3],
      dtype=int32)

Creating a new dataframe that combines the cluster as well as the top 8 venues for each neighborhood.

In [30]:
# Adding clustering labels:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# Merging the datasets:
toronto_merged = pd.merge(toronto_df, neighborhoods_venues_sorted, left_on='Neighborhood', right_on='Neighborhood')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,21,Food & Drink Shop,Park,Fast Food Restaurant,Pool,Metro Station,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
1,M4A,North York,Victoria Village,43.725882,-79.315572,24,Coffee Shop,Portuguese Restaurant,Hockey Arena,French Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Bakery,Park,Pub,Theater,Café,Breakfast Spot,Bank
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Furniture / Home Store,Clothing Store,Accessories Store,Coffee Shop,Event Space,Gift Shop,Athletics & Sports,Boutique
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Café,College Cafeteria,Beer Bar,Spa,Smoothie Shop,Burrito Place


In [31]:
# Exploring number of neighborhoods per cluster (ideally, the project should find at least 3-4 representative groups)
toronto_merged.value_counts('Cluster Labels')

Cluster Labels
1     39
21    33
3      3
8      3
9      2
0      1
15     1
23     1
22     1
20     1
19     1
18     1
17     1
16     1
12     1
14     1
13     1
11     1
10     1
7      1
6      1
5      1
4      1
2      1
24     1
dtype: int64

As mentioned before, with k=25 the project could find 4 different and balanced groups (Clusters 9, 2, 0 and 7). The rest of the analysis will focus on these 4 clusters, to identify the types of venues that characterize each of these.
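
The clusters to analyze can also be picked programmatically from the label counts instead of being typed by hand. A minimal sketch on made-up labels (the series below is hypothetical and much smaller than the real `Cluster Labels` column):

```python
import pandas as pd

# Hypothetical cluster labels for 11 neighborhoods.
demo_labels = pd.Series([1, 1, 1, 0, 0, 2, 3, 3, 3, 3, 4], name='Cluster Labels')

# Count members per cluster and keep the labels of the 2 largest clusters.
sizes = demo_labels.value_counts()
largest = sizes.nlargest(2).index.tolist()
print(largest)  # [3, 1]

# Boolean mask selecting only the rows that belong to those clusters.
in_largest = demo_labels.isin(largest)
print(int(in_largest.sum()))  # 7
```

Selecting by size keeps the later per-cluster analysis aligned with the labels of the current run, which can change between runs or when k changes.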

In [32]:
toronto_merged.shape

(100, 14)

**Note: 3 neighborhoods are not considered anymore (they did not have any venues within 500 m distance).**

In [33]:
# Importing Matplotlib plotting modules:
import matplotlib.cm as cm
import matplotlib.colors as colors

#### Map of Toronto with all clustered neighborhoods (100 in total):

In [34]:
# Creating cluster map:
map_clusters = folium.Map(location=[toronto_lat, toronto_long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_of_clusters)
ys = [i + x + (i*x)**2 for i in range(num_of_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
      
        radius=5,
        popup=label,
        color=rainbow[int(cluster - 1)],
        fill=True,
        fill_color=rainbow[int(cluster - 1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Defining a dataframe that includes only the 4 main clusters (9, 2, 0 and 7):

In [35]:
# Defining dataframes for each cluster:
cluster_9 = toronto_merged[toronto_merged['Cluster Labels'] == 9]
cluster_2 = toronto_merged[toronto_merged['Cluster Labels'] == 2]
cluster_0 = toronto_merged[toronto_merged['Cluster Labels'] == 0]
cluster_7 = toronto_merged[toronto_merged['Cluster Labels'] == 7]
clusters = [cluster_9, cluster_2, cluster_0, cluster_7]
only_4_clusters = pd.concat(clusters)
only_4_clusters

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
59,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,9,Park,Swim School,Bus Line,Accessories Store,Mexican Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop
75,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724,9,Park,Sandwich Place,Mobile Phone Shop,Bus Line,Accessories Store,Mexican Restaurant,Modern European Restaurant,Miscellaneous Shop
50,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493,2,Park,Accessories Store,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
31,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0,Playground,Accessories Store,Metro Station,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
21,M1G,Scarborough,Woburn,43.770992,-79.216917,7,Coffee Shop,Korean BBQ Restaurant,Accessories Store,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Miscellaneous Shop


#### Map of Toronto only including neighborhoods for the main 4 clusters (75 neighborhoods only):

In [36]:
# Creating cluster map:
map_clusters_4 = folium.Map(location=[toronto_lat, toronto_long], zoom_start=10.5)

# set color scheme for the clusters
x = np.arange(num_of_clusters)
ys = [i + x + (i*x)**2 for i in range(num_of_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(only_4_clusters['Latitude'], only_4_clusters['Longitude'], only_4_clusters['Neighborhood'], only_4_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
      
        radius=5,
        popup=label,
        color=rainbow[int(cluster - 1)],
        fill=True,
        fill_color=rainbow[int(cluster - 1)],
        fill_opacity=0.7).add_to(map_clusters_4)
       
map_clusters_4

Now that the location of each neighborhood of the main 4 clusters is known, the project will evaluate each cluster to characterize it.

**Task 3 Final Result: Segmentation of each cluster and identification of the ideal neighbours for each cluster**

#### Cluster 9

In [38]:
cluster_9.drop(columns=['PostalCode','Latitude','Longitude'], axis=1).head()

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
2,Downtown Toronto,"Regent Park, Harbourfront",9,Coffee Shop,Park,Pub,Bakery,Café,Theater,Breakfast Spot,Restaurant
3,North York,"Lawrence Manor, Lawrence Heights",9,Clothing Store,Furniture / Home Store,Accessories Store,Vietnamese Restaurant,Coffee Shop,Event Space,Boutique,Women's Store
4,Queen's Park,Ontario Provincial Government,9,Coffee Shop,Sushi Restaurant,Yoga Studio,Music Venue,Bank,Bar,Mexican Restaurant,College Cafeteria
8,Downtown Toronto,"Garden District, Ryerson",9,Coffee Shop,Clothing Store,Cosmetics Shop,Hotel,Café,Bubble Tea Shop,Japanese Restaurant,Theater
12,North York,Don Mills South,9,Coffee Shop,Restaurant,Gym,Asian Restaurant,Bubble Tea Shop,Supermarket,Bike Shop,Beer Store


In [39]:
# Exploring count of venue types for Cluster 9 - 1st Most Common Venue
cluster_9_1st = pd.DataFrame(cluster_9.value_counts('1st Most Common Venue')).reset_index()
cluster_9_1st.columns = ['1st Most Common Venue','Count']
cluster_9_1st.head()

Unnamed: 0,1st Most Common Venue,Count
0,Coffee Shop,21
1,Breakfast Spot,2
2,Clothing Store,2
3,Pizza Place,2
4,Airport Service,1


In [40]:
# Exploring count of venue types for Cluster 9 - 2nd Most Common Venue
cluster_9_2nd = pd.DataFrame(cluster_9.value_counts('2nd Most Common Venue')).reset_index()
cluster_9_2nd.columns = ['2nd Most Common Venue','Count']
cluster_9_2nd.head()

Unnamed: 0,2nd Most Common Venue,Count
0,Café,5
1,Coffee Shop,5
2,Sandwich Place,3
3,Sushi Restaurant,2
4,Clothing Store,2


In [41]:
# Exploring count of venue types for Cluster 9 - 3rd Most Common Venue
cluster_9_3rd = pd.DataFrame(cluster_9.value_counts('3rd Most Common Venue')).reset_index()
cluster_9_3rd.columns = ['3rd Most Common Venue','Count']
cluster_9_3rd.head()

Unnamed: 0,3rd Most Common Venue,Count
0,Pub,3
1,Café,3
2,Coffee Shop,3
3,Yoga Studio,2
4,Hotel,2


From the 3 tables above, it can be seen that this first cluster focuses on cafés and coffee shops, breakfast spots, and sandwich places. These neighborhoods would be ideal for a person who is always on the go and prefers quick or light meals over sit-down restaurants (although restaurants are also available nearby).

**Preferred neighborhoods for: young people always on the run.**
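
The three separate tables can also be combined into a single count of how often each category appears among the top 3 venues of a cluster. A minimal sketch with made-up rows (the dataframe below is a hypothetical stand-in for `cluster_9`):

```python
import pandas as pd

rank_cols = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue']

# Hypothetical cluster with three neighborhoods.
demo_cluster = pd.DataFrame({
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Coffee Shop'],
    '2nd Most Common Venue': ['Cafe', 'Sandwich Place', 'Pizza Place'],
    '3rd Most Common Venue': ['Pub', 'Cafe', 'Pub'],
})

# Stack the three rank columns into one long series and count the categories.
combined = pd.concat([demo_cluster[col] for col in rank_cols]).value_counts()
print(combined)
```

A single combined count makes it easier to compare clusters side by side, at the cost of losing the distinction between first, second and third place.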

#### Cluster 2

In [42]:
cluster_2.drop(columns=['PostalCode','Latitude','Longitude'], axis=1).head()

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
6,North York,Don Mills North,2,Caribbean Restaurant,Gym,Baseball Field,Café,Japanese Restaurant,Metro Station,Mexican Restaurant,Men's Store
13,East York,Woodbine Heights,2,Skating Rink,Beer Store,Park,Bus Stop,Athletics & Sports,Curling Ice,Dance Studio,Monument / Landmark
16,Etobicoke,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",2,Pharmacy,Liquor Store,Shopping Plaza,Coffee Shop,Beer Store,Park,Pizza Place,Pet Store
24,Downtown Toronto,Christie,2,Grocery Store,Café,Park,Candy Store,Restaurant,Italian Restaurant,Baby Store,Nightclub
25,Scarborough,Cedarbrae,2,Lounge,Athletics & Sports,Fried Chicken Joint,Caribbean Restaurant,Gas Station,Thai Restaurant,Bank,Bakery


In [43]:
# Exploring count of venue types for Cluster 2 - 1st Most Common Venue
cluster_2_1st = pd.DataFrame(cluster_2.value_counts('1st Most Common Venue')).reset_index()
cluster_2_1st.columns = ['1st Most Common Venue','Count']
cluster_2_1st.head()

Unnamed: 0,1st Most Common Venue,Count
0,Café,4
1,Bakery,2
2,Bar,2
3,Pharmacy,2
4,Bus Line,1


In [44]:
# Exploring count of venue types for Cluster 2 - 2nd Most Common Venue
cluster_2_2nd = pd.DataFrame(cluster_2.value_counts('2nd Most Common Venue')).reset_index()
cluster_2_2nd.columns = ['2nd Most Common Venue','Count']
cluster_2_2nd.head()

Unnamed: 0,2nd Most Common Venue,Count
0,Gym,2
1,Bakery,2
2,Athletics & Sports,1
3,General Entertainment,1
4,Shopping Mall,1


In [45]:
# Exploring count of venue types for Cluster 2 - 3rd Most Common Venue
cluster_2_3rd = pd.DataFrame(cluster_2.value_counts('3rd Most Common Venue')).reset_index()
cluster_2_3rd.columns = ['3rd Most Common Venue','Count']
cluster_2_3rd.head()


This cluster shows a wider variety of venues, yet the presence of gyms and athletics and sports venues suggests that these neighborhoods might suit people interested in fitness and exercise. There are also some pharmacies around, which may appeal to people with the same profile.

**Preferred neighborhoods for: people interested in fitness, sports, and exercise.**
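One way to put a number on the fitness reading is to measure what share of a cluster's neighborhoods list a fitness-related venue among their top ranks. This is a minimal sketch under stated assumptions: the rows are copied from the Cluster 2 preview above (first three rank columns only), and the `fitness_venues` list is a hypothetical choice of categories, not something defined in the notebook.

```python
import pandas as pd

# Three rows copied from the Cluster 2 preview (first three rank columns only)
toy_cluster = pd.DataFrame({
    'Neighborhood': ['Don Mills North', 'Woodbine Heights', 'Cedarbrae'],
    '1st Most Common Venue': ['Caribbean Restaurant', 'Skating Rink', 'Lounge'],
    '2nd Most Common Venue': ['Gym', 'Beer Store', 'Athletics & Sports'],
    '3rd Most Common Venue': ['Baseball Field', 'Park', 'Fried Chicken Joint'],
})

# Assumption: which categories count as "fitness" is a judgment call
fitness_venues = ['Gym', 'Athletics & Sports', 'Yoga Studio']

# Select every rank column, then flag rows where any of them is a fitness venue
venue_cols = [c for c in toy_cluster.columns if c.endswith('Most Common Venue')]
has_fitness = toy_cluster[venue_cols].isin(fitness_venues).any(axis=1)

# The mean of a boolean Series is the fraction of True values
fitness_share = has_fitness.mean()
print(has_fitness.tolist(), round(fitness_share, 2))
```

Running the same lines on the full `cluster_2` frame, and on the other clusters for comparison, would show whether fitness venues are really more common here than elsewhere.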

#### Cluster 0

In [46]:
# Previewing Cluster 0 without the postal code and coordinate columns
cluster_0.drop(columns=['PostalCode','Latitude','Longitude']).head()

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
1,North York,Victoria Village,0,Pizza Place,Intersection,Hockey Arena,Portuguese Restaurant,Coffee Shop,Mobile Phone Shop,Motel,Moroccan Restaurant
7,East York,"Parkview Hill, Woodbine Gardens",0,Pizza Place,Flea Market,Athletics & Sports,Café,Breakfast Spot,Bank,Gastropub,Intersection
17,Scarborough,"Guildwood, Morningside, West Hill",0,Breakfast Spot,Rental Car Location,Donut Shop,Intersection,Restaurant,Electronics Store,Medical Center,Bank
27,North York,"Bathurst Manor, Wilson Heights, Downsview North",0,Bank,Coffee Shop,Diner,Bridal Shop,Middle Eastern Restaurant,Shopping Mall,Mobile Phone Shop,Sandwich Place
28,East York,Thorncliffe Park,0,Indian Restaurant,Yoga Studio,Park,Sandwich Place,Supermarket,Restaurant,Gas Station,Burger Joint


In [47]:
# Exploring count of venue types for Cluster 0 - 1st Most Common Venue
cluster_0_1st = pd.DataFrame(cluster_0.value_counts('1st Most Common Venue')).reset_index()
cluster_0_1st.columns = ['1st Most Common Venue','Count']
cluster_0_1st.head()

Unnamed: 0,1st Most Common Venue,Count
0,Pizza Place,5
1,Bank,1
2,Breakfast Spot,1
3,Grocery Store,1
4,Indian Restaurant,1


In [48]:
# Exploring count of venue types for Cluster 0 - 2nd Most Common Venue
cluster_0_2nd = pd.DataFrame(cluster_0.value_counts('2nd Most Common Venue')).reset_index()
cluster_0_2nd.columns = ['2nd Most Common Venue','Count']
cluster_0_2nd.head()

Unnamed: 0,2nd Most Common Venue,Count
0,Coffee Shop,3
1,Fast Food Restaurant,1
2,Flea Market,1
3,Grocery Store,1
4,Intersection,1


In [49]:
# Exploring count of venue types for Cluster 0 - 3rd Most Common Venue
cluster_0_3rd = pd.DataFrame(cluster_0.value_counts('3rd Most Common Venue')).reset_index()
cluster_0_3rd.columns = ['3rd Most Common Venue','Count']
cluster_0_3rd.head()

Unnamed: 0,3rd Most Common Venue,Count
0,Athletics & Sports,1
1,Convenience Store,1
2,Diner,1
3,Donut Shop,1
4,Gas Station,1


This cluster also shows a wide variety of venues. Pizza places are abundant in these neighborhoods, yet there are also other food options such as Indian restaurants, breakfast spots, fast food restaurants, and grocery stores.

**Preferred neighborhoods for: people who love restaurants and different types of food.**
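To check a statement such as "pizza places are abundant" against more than one rank column at a time, the first three columns can be pooled into a single tally. The sketch below is an assumption, not the notebook's own method: `toy_cluster` is made-up data, and with the notebook's frames the same steps would be applied to `cluster_0`.

```python
import pandas as pd

rank_cols = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue']

# Made-up data using the same column names as the cluster dataframes
toy_cluster = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    '1st Most Common Venue': ['Pizza Place', 'Pizza Place', 'Bank'],
    '2nd Most Common Venue': ['Coffee Shop', 'Intersection', 'Coffee Shop'],
    '3rd Most Common Venue': ['Diner', 'Coffee Shop', 'Pizza Place'],
})

# Reshape to long format: one row per (neighborhood, rank) pair
long_form = toy_cluster.melt(
    id_vars='Neighborhood', value_vars=rank_cols,
    var_name='Rank', value_name='Venue',
)

# Tally venue types across all three ranks at once
pooled = long_form['Venue'].value_counts()
print(pooled)
```

Pooling treats the three ranks as equally important, which is a simplification; weighting the 1st rank more heavily would be another reasonable choice.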

#### Cluster 7

In [50]:
# Previewing Cluster 7 without the postal code and coordinate columns
cluster_7.drop(columns=['PostalCode','Latitude','Longitude']).head()

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
20,York,Caledonia-Fairbanks,7,Park,Women's Store,Pool,Accessories Store,Mexican Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant
66,Central Toronto,Forest Hill North & West,7,Sushi Restaurant,Park,Jewelry Store,Trail,Accessories Store,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop
83,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",7,Intersection,Playground,Park,Middle Eastern Restaurant,Moroccan Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant
89,Downtown Toronto,Rosedale,7,Park,Playground,Trail,Accessories Store,Middle Eastern Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant


In [51]:
# Exploring count of venue types for Cluster 7 - 1st Most Common Venue
cluster_7_1st = pd.DataFrame(cluster_7.value_counts('1st Most Common Venue')).reset_index()
cluster_7_1st.columns = ['1st Most Common Venue','Count']
cluster_7_1st.head()

Unnamed: 0,1st Most Common Venue,Count
0,Park,2
1,Intersection,1
2,Sushi Restaurant,1


In [52]:
# Exploring count of venue types for Cluster 7 - 2nd Most Common Venue
cluster_7_2nd = pd.DataFrame(cluster_7.value_counts('2nd Most Common Venue')).reset_index()
cluster_7_2nd.columns = ['2nd Most Common Venue','Count']
cluster_7_2nd.head()

Unnamed: 0,2nd Most Common Venue,Count
0,Playground,2
1,Park,1
2,Women's Store,1


In [53]:
# Exploring count of venue types for Cluster 7 - 3rd Most Common Venue
cluster_7_3rd = pd.DataFrame(cluster_7.value_counts('3rd Most Common Venue')).reset_index()
cluster_7_3rd.columns = ['3rd Most Common Venue','Count']
cluster_7_3rd.head()

Unnamed: 0,3rd Most Common Venue,Count
0,Jewelry Store,1
1,Park,1
2,Pool,1
3,Trail,1


This cluster shows several open spaces like parks, playgrounds, pools and trails. These neighborhoods look ideal for people with children, families in general or people who like exercising.

**Preferred neighborhoods for: people with children/families.**
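The profiles assigned to the clusters in this section can also be attached to the data as a column, which may make them easier to reuse in map popups or summaries. This is a hedged sketch: `cluster_profiles` is a hypothetical dictionary whose wording paraphrases the bold summary lines above, and `sample` holds three neighborhoods copied from the cluster previews.

```python
import pandas as pd

# Profiles paraphrased from the summary lines in this section, keyed by cluster label
cluster_profiles = {
    9: 'young people always on the run',
    2: 'people interested in fitness, sports, and exercise',
    0: 'people who love restaurants and different types of food',
    7: 'people with children/families',
}

# Three neighborhoods and their labels, copied from the cluster previews
sample = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Christie', 'Victoria Village'],
    'Cluster Labels': [7, 2, 0],
})

# Map each cluster label to its profile; labels missing from the dict become NaN
sample['Profile'] = sample['Cluster Labels'].map(cluster_profiles)
print(sample)
```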