# Segmenting and Clustering Neighborhoods in Toronto
Coursera / Applied Data Science Capstone / Peer-Graded Assignment / Week 3  
By Ginanjar Saputra, 08 February 2021

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1">Import libraries</a></span></li><li><span><a href="#Part-1" data-toc-modified-id="Part-1-2">Part 1</a></span><ul class="toc-item"><li><span><a href="#Scrape-a-Wikipedia-page" data-toc-modified-id="Scrape-a-Wikipedia-page-2.1">Scrape a Wikipedia page</a></span></li><li><span><a href="#Remove-boroughs-that-are-'Not-assigned'" data-toc-modified-id="Remove-boroughs-that-are-'Not-assigned'-2.2">Remove boroughs that are 'Not assigned'</a></span></li><li><span><a href="#Group-boroughs-and-neighborhoods-by-postal-codes" data-toc-modified-id="Group-boroughs-and-neighborhoods-by-postal-codes-2.3">Group boroughs and neighborhoods by postal codes</a></span></li><li><span><a href="#Rename-neighborhoods-that-are-'Not-assigned'" data-toc-modified-id="Rename-neighborhoods-that-are-'Not-assigned'-2.4">Rename neighborhoods that are 'Not assigned'</a></span></li><li><span><a href="#Show-dataframe-shape" data-toc-modified-id="Show-dataframe-shape-2.5">Show dataframe shape</a></span></li></ul></li><li><span><a href="#Part-2" data-toc-modified-id="Part-2-3">Part 2</a></span><ul class="toc-item"><li><span><a href="#Retrieve-geographical-coordinates-using-Geocoder" data-toc-modified-id="Retrieve-geographical-coordinates-using-Geocoder-3.1">Retrieve geographical coordinates using Geocoder</a></span></li></ul></li><li><span><a href="#Part-3" data-toc-modified-id="Part-3-4">Part 3</a></span><ul class="toc-item"><li><span><a href="#Visualization:-Toronto-neighborhoods" data-toc-modified-id="Visualization:-Toronto-neighborhoods-4.1">Visualization: Toronto neighborhoods</a></span></li><li><span><a href="#Select-boroughs-of-interest" data-toc-modified-id="Select-boroughs-of-interest-4.2">Select boroughs of interest</a></span></li><li><span><a href="#Explore-a-particular-neighborhood" data-toc-modified-id="Explore-a-particular-neighborhood-4.3">Explore a particular 
neighborhood</a></span></li><li><span><a href="#Explore-venues-in-all-neighborhoods" data-toc-modified-id="Explore-venues-in-all-neighborhoods-4.4">Explore venues in all neighborhoods</a></span></li><li><span><a href="#Analyze-each-neighborhood" data-toc-modified-id="Analyze-each-neighborhood-4.5">Analyze each neighborhood</a></span></li><li><span><a href="#Cluster-neighborhoods-using-K-means-algorithm" data-toc-modified-id="Cluster-neighborhoods-using-K-means-algorithm-4.6">Cluster neighborhoods using <em>K-means</em> algorithm</a></span></li><li><span><a href="#Examine-clusters" data-toc-modified-id="Examine-clusters-4.7">Examine clusters</a></span><ul class="toc-item"><li><span><a href="#Cluster-0" data-toc-modified-id="Cluster-0-4.7.1">Cluster 0</a></span></li><li><span><a href="#Cluster-1" data-toc-modified-id="Cluster-1-4.7.2">Cluster 1</a></span></li><li><span><a href="#Cluster-2" data-toc-modified-id="Cluster-2-4.7.3">Cluster 2</a></span></li><li><span><a href="#Cluster-3" data-toc-modified-id="Cluster-3-4.7.4">Cluster 3</a></span></li><li><span><a href="#Cluster-4" data-toc-modified-id="Cluster-4-4.7.5">Cluster 4</a></span></li></ul></li></ul></li></ul></div>

## Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
# !pip install geocoder
import geocoder
# !pip install geopy
from geopy.geocoders import Nominatim 
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

## Part 1

### Scrape a Wikipedia page

In [2]:
# send HTTP request to Wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url).text

# parse the page, locate HTML tags with relevant information
soup = BeautifulSoup(response, 'lxml')
table = soup.find('table')
fields = table.find_all('td')

In [3]:
# empty lists to store data
postalcode = []
borough = []
neighborhood = []

# loop through all <td> tags to extract data from the table
for td in range(0, len(fields), 3):
    postalcode.append(fields[td].text.strip())
    borough.append(fields[td+1].text.strip())
    neighborhood.append(fields[td+2].text.strip())

# create a dataframe
df = pd.DataFrame({
    'PostalCode':postalcode,
    'Borough':borough,
    'Neighborhood':neighborhood
})
print(f"Dataframe contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

Dataframe contains 180 rows and 3 columns.


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
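
The loop in the cell above assumes exactly three `<td>` cells per table row, so a single missing or extra cell would shift every later record. A minimal sketch of a row-wise alternative is shown here. The inline HTML is made up for illustration and is not the real Wikipedia markup, and `parse_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td></tr>
</table>
"""

def parse_rows(html, n_cols=3):
    """Return one tuple per table row that has exactly n_cols <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table').find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == n_cols:  # skips the header row and malformed rows
            rows.append(tuple(cells))
    return rows

records = parse_rows(sample_html)
```

Reading row by row keeps each record aligned with its own `<tr>`, at the cost of silently dropping rows that do not match the expected width.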


### Remove boroughs that are 'Not assigned'

In [4]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
print(f"Dataframe now contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

Dataframe now contains 103 rows and 3 columns.


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


### Group boroughs and neighborhoods by postal codes

On the Wikipedia source page, neighborhoods sharing a postal code had already been grouped together. Had they not been, the following code would perform the grouping.

In [5]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
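
Because the source page was already grouped, the cell above does not visibly change anything. To illustrate what the grouping does on ungrouped input, here is a small sketch on made-up rows laid out one neighborhood per row (the `raw` frame is hypothetical, not scraped data):

```python
import pandas as pd

# Made-up rows in a "one neighborhood per row" layout
raw = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern', 'Rouge', 'Woburn'],
})

# Join the neighborhoods of each (postal code, borough) pair with commas
grouped = (raw.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())
```

The two `M1B` rows collapse into a single row whose `Neighborhood` value is the comma-separated list, while `M1G` passes through unchanged.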


### Rename neighborhoods that are 'Not assigned'

The code below does the renaming required by the assignment instructions. In the source data, however, every 'Not assigned' neighborhood also has a 'Not assigned' borough, so the corresponding rows were already dropped in the earlier step that removed 'Not assigned' boroughs.

In [6]:
# 'Not assigned' neighborhoods are given the name of their corresponding borough
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned',
                              df['Borough'],
                              df['Neighborhood']
                             )

### Show dataframe shape

In [7]:
print(f"Cleansed dataframe contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

Cleansed dataframe contains 103 rows and 3 columns.


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [8]:
df.to_csv('Toronto-Pt1.csv', index=False)

---

## Part 2

### Retrieve geographical coordinates using Geocoder

The instructions recommend the Geocoder library for getting the latitude and longitude of each postal code. Unfortunately, calls made to Google were unsuccessful, so I used another provider (ArcGIS) instead, which returned slightly different coordinates.

In [9]:
# a function that retrieves lat/long coordinates
def get_coords(postcode):
    return geocoder.arcgis(f'{postcode}, Toronto, Ontario').latlng

In [10]:
# get coordinates of all postal codes
postcodes = df['PostalCode'].tolist()
coords = [get_coords(code) for code in postcodes] # apply function
print("Coordinates are obtained.")

Coordinates are obtained.
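
Geocoding providers can occasionally return no result for a query, which would make the coordinate unpacking in the next cell fail. The following is a minimal retry sketch, assuming a failed lookup yields `None` or an empty list. `get_coords_with_retry` and its `lookup` argument are hypothetical names; `lookup` stands in for something like `lambda q: geocoder.arcgis(q).latlng`, and the flaky function below only simulates failures.

```python
import time

def get_coords_with_retry(postcode, lookup, max_tries=3, delay=0.0):
    """Call lookup(query) until it returns a [lat, lng] pair or tries run out."""
    query = f'{postcode}, Toronto, Ontario'
    for _ in range(max_tries):
        latlng = lookup(query)
        if latlng:  # treats None and [] as a failed lookup
            return latlng
        time.sleep(delay)  # brief pause before the next attempt
    return None  # the caller decides how to handle a postal code with no result

# Fake lookup that fails twice, then succeeds (for demonstration only)
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return [43.81139, -79.19662] if calls['n'] >= 3 else None

result = get_coords_with_retry('M1B', flaky_lookup)
```

Passing the lookup in as an argument keeps the retry logic testable without network access.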


In [11]:
# a new dataframe to store coordinates
coords_arcgis = df.copy()

# add columns for latitudes and longitudes
coords_arcgis['Latitude'] = [coord[0] for coord in coords]
coords_arcgis['Longitude'] = [coord[1] for coord in coords]
coords_arcgis.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.81139 | -79.19662 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.78574 | -79.15875 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.76575 | -79.1747 |
| 3 | M1G | Scarborough | Woburn | 43.76812 | -79.21761 |
| 4 | M1H | Scarborough | Cedarbrae | 43.76944 | -79.23892 |


For comparison, the following are the coordinates from the CSV file provided in the instructions, i.e. the expected result had the calls to Google succeeded. The remainder of the analysis uses these coordinates.

In [12]:
# load the coordinates file
coords_google = pd.read_csv('Geospatial_Coordinates.csv',
                            header=0,
                            names=['PostalCode', 'Latitude', 'Longitude']
                           )

# merge with the original dataframe
toronto = df.merge(coords_google, on='PostalCode', how='inner')
toronto.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
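
To put a number on how far the ArcGIS result is from the provided CSV, one option is the haversine (great-circle) distance between the two points. This is a sketch under the usual spherical-Earth assumption (radius 6371 km), applied to the two M1B coordinate pairs shown in the outputs above. `haversine_km` is a hypothetical helper, not part of the assignment.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# M1B: ArcGIS result vs. the coordinates from the provided CSV
gap_km = haversine_km(43.81139, -79.19662, 43.806686, -79.194353)
```

For M1B the gap comes out well under one kilometer, which matters mainly because the venue search later uses a 500-meter radius around each point.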


In [13]:
toronto.to_csv('Toronto-Pt2.csv', index=False)

---

## Part 3

### Visualization: Toronto neighborhoods

In [14]:
# get the coordinates of Toronto using Nominatim geocoder
geolocator = Nominatim(user_agent='tor_explorer')
location = geolocator.geocode('Toronto, Ontario')
latitude = location.latitude
longitude = location.longitude
print(f"The geographical coordinates of Toronto are {latitude}, {longitude}.")

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [15]:
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, nbhd in zip(toronto['Latitude'],
                                   toronto['Longitude'],
                                   toronto['Borough'],
                                   toronto['Neighborhood']):
    label = f"{nbhd}, {borough}"
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color='purple',
                        fill=True,
                        fill_color='pink',
                        fill_opacity=0.7,
                        parse_html=False
                       ).add_to(map_tor)

map_tor

### Select boroughs of interest
Boroughs with the word 'Toronto' in their names will be explored further.

In [16]:
tor_boi = toronto[toronto['Borough'].str.contains('Toronto')]
tor_boi.reset_index(drop=True, inplace=True)
print(tor_boi['Borough'].unique().tolist())
tor_boi.head()

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']


|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [17]:
map_tor_boi = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, nbhd in zip(tor_boi['Latitude'],
                                   tor_boi['Longitude'],
                                   tor_boi['Borough'],
                                   tor_boi['Neighborhood']):
    label = folium.Popup(f"{nbhd}, {borough}", parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color='purple',
                        fill=True,
                        fill_color='pink',
                        fill_opacity=0.7,
                        parse_html=False
                       ).add_to(map_tor_boi)
map_tor_boi

### Explore a particular neighborhood

In a neighborhood of interest, up to 100 venues within a 500-meter radius will be explored.

In [18]:
# select a neighborhood and its corresponding coordinates
noi = 'Studio District'
noi_idx = tor_boi[tor_boi['Neighborhood'] == noi].index[0]
noi_lat = tor_boi.loc[noi_idx, 'Latitude']
noi_lng = tor_boi.loc[noi_idx, 'Longitude']
print(f"Latitude and longitude values of {noi} are {noi_lat}, {noi_lng}.")

Latitude and longitude values of Studio District are 43.6595255, -79.340923.


In [40]:
# define Foursquare credentials and parameters
client_id = 'not_sharing_this'
client_secret = 'not_sharing_this_too'
version = '20180605'
limit = 100
radius = 500

In [20]:
# specify url
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    client_id, 
    client_secret, 
    version, 
    noi_lat, 
    noi_lng, 
    radius, 
    limit)

# make an HTTP request and
# store the response in a variable 'results'
results = requests.get(url).json()
print("Request successful.")

Request successful.
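
The request URL above is assembled with `str.format`, which leaves escaping of special characters to the caller. As an alternative sketch, the query string can be built from a dictionary with the standard library. `build_explore_url` is a hypothetical helper and the credentials are placeholders, so nothing is sent over the network here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180605', radius=500, limit=100):
    """Return the explore URL with all query parameters percent-encoded."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }
    return f'{BASE_URL}?{urlencode(params)}'

# Placeholder credentials, for illustration only
demo_url = build_explore_url(43.6595255, -79.340923, 'DEMO_ID', 'DEMO_SECRET')

# Decode the query string again to check what the server would receive
query = parse_qs(urlsplit(demo_url).query)
```

With the `requests` library, passing the same dictionary as `requests.get(BASE_URL, params=params)` encodes the parameters in a comparable way.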


Extract data from the JSON, structure it into a pandas dataframe, and do some cleansing. Each venue is also assigned its category name.

In [21]:
# a function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [22]:
# get all the venues, flatten JSON into a dataframe
venues_json = results['response']['groups'][0]['items'] 
venues = pd.json_normalize(venues_json)

# filter relevant columns
cols_filtered = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues = venues.loc[:, cols_filtered]

# apply categorization of venues
venues['venue.categories'] = venues.apply(get_category_type, axis=1)

# rename columns
venues.columns = ['Name', 'Category', 'Latitude', 'Longitude']

print(f"{venues.shape[0]} venues were returned by Foursquare.")
venues.head(10)

36 venues were returned by Foursquare.


|   | Name | Category | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Ed's Real Scoop | Ice Cream Shop | 43.660656 | -79.342019 |
| 1 | Queen Books | Bookstore | 43.660651 | -79.342267 |
| 2 | Hooked | Fish Market | 43.660407 | -79.343257 |
| 3 | Te Aro | Coffee Shop | 43.661373 | -79.338577 |
| 4 | The Bone House | Pet Store | 43.660894 | -79.341097 |
| 5 | Mercury Espresso Bar | Coffee Shop | 43.660806 | -79.341241 |
| 6 | Reliable Halibut and Chips | Seafood Restaurant | 43.660874 | -79.340938 |
| 7 | WAYLABAR | Gay Bar | 43.661234 | -79.339597 |
| 8 | Purple Penguin Cafe | Café | 43.660501 | -79.342565 |
| 9 | Leslieville | Neighborhood | 43.66207 | -79.337856 |


In [23]:
noi_top_venues = venues['Category'].value_counts()
print(f"The top venues in {noi} are:\n")
noi_top_venues.loc[noi_top_venues > 1]

The top venues in Studio District are:



Coffee Shop            3
Brewery                2
Café                   2
Bakery                 2
Gastropub              2
American Restaurant    2
Name: Category, dtype: int64

### Explore venues in all neighborhoods

Create a function to get nearby venues from all neighborhoods in the boroughs of interest (East/West/Central/Downtown Toronto).

In [24]:
def get_venues(names, lats, lngs, radius=500, limit=100):
    venues_list = []
    for name, lat, lng in zip(names, lats, lngs):
        # specify the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id,
            client_secret,
            version,
            lat,
            lng,
            radius,
            limit)
        # make the request, store the response
        results = requests.get(url).json()['response']['groups'][0]['items']
        # extract relevant information from each venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # populate the dataframe with venues list
    venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venues.columns = ['Neighborhood',
                      'Nbhd Latitude',
                      'Nbhd Longitude',
                      'Venue',
                      'Venue Latitude',
                      'Venue Longitude',
                      'Venue Category']
    return venues

In [25]:
# run the above function on each neighborhood,
# store the result as a new dataframe
tor_venues = get_venues(
    tor_boi['Neighborhood'],
    tor_boi['Latitude'],
    tor_boi['Longitude']
)
tor_venues.head()

|   | Neighborhood | Nbhd Latitude | Nbhd Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | The Beaches | 43.676357 | -79.293031 | Glen Manor Ravine | 43.676821 | -79.293942 | Trail |
| 1 | The Beaches | 43.676357 | -79.293031 | The Big Carrot Natural Food Market | 43.678879 | -79.297734 | Health Food Store |
| 2 | The Beaches | 43.676357 | -79.293031 | Grover Pub and Grub | 43.679181 | -79.297215 | Pub |
| 3 | The Beaches | 43.676357 | -79.293031 | Upper Beaches | 43.680563 | -79.292869 | Neighborhood |
| 4 | The Danforth West, Riverdale | 43.679557 | -79.352188 | MenEssentials | 43.67782 | -79.351265 | Cosmetics Shop |


In [26]:
print(f"In all {tor_venues['Neighborhood'].nunique()} Toronto neighborhoods,",
      f"there's a total of {tor_venues.shape[0]} venues",
      f"across {tor_venues['Venue Category'].nunique()} different categories."
     )
tor_venues[['Neighborhood', 'Venue']].groupby('Neighborhood').count().reset_index()

In all 39 Toronto neighborhoods, there's a total of 1585 venues across 229 different categories.


|   | Neighborhood | Venue |
|---|---|---|
| 0 | Berczy Park | 54 |
| 1 | Brockton, Parkdale Village, Exhibition Place | 23 |
| 2 | Business reply mail Processing Centre, South C... | 16 |
| 3 | CN Tower, King and Spadina, Railway Lands, Har... | 15 |
| 4 | Central Bay Street | 61 |
| 5 | Christie | 15 |
| 6 | Church and Wellesley | 78 |
| 7 | Commerce Court, Victoria Hotel | 100 |
| 8 | Davisville | 34 |
| 9 | Davisville North | 8 |


### Analyze each neighborhood

Venue categories are converted into numerical variables through one-hot encoding. Rows are then grouped by neighborhood, taking the mean of each category column, i.e. the relative frequency of each venue category in that neighborhood.

In [27]:
# one-hot encoding for each venue category
tor_onehot = pd.get_dummies(tor_venues['Venue Category'], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
tor_onehot['Neighborhood'] = tor_venues['Neighborhood'] 
tor_onehot = tor_onehot.groupby('Neighborhood').mean().reset_index()

print(f"Dataframe size: {tor_onehot.shape[0]} rows, {tor_onehot.shape[1]} columns")
tor_onehot.head()

Dataframe size: 39 rows, 229 columns


Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,American Restaurant,Antique Shop,Aquarium,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.066667,0.066667,0.066667,0.133333,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016393,0.0,0.0,0.016393,0.0,0.016393
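
One detail worth checking in the encoding step: the Foursquare results include a venue category that is itself called "Neighborhood" (for example the venues "Leslieville" and "Upper Beaches"). Assigning `tor_onehot['Neighborhood']` afterwards appears to overwrite that dummy column, which would explain why 229 categories yield 229 columns rather than 230. The sketch below reproduces the situation on made-up data and renames the clashing category first. The new column name `'Neighborhood (category)'` is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up venues; note the category literally named 'Neighborhood'
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Park', 'Café', 'Neighborhood'],
})

onehot = pd.get_dummies(toy['Venue Category'], dtype=float)

# Rename the clashing category so the label column cannot overwrite it
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean of each dummy column per neighborhood = relative category frequency
freq = onehot.groupby('Neighborhood').mean().reset_index()
```

In this toy frame, neighborhood `B` keeps a frequency of 0.5 for the renamed category instead of losing it.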


Display the top 10 most common venues within each neighborhood.

In [28]:
# a function to sort venues in a descending order of frequency
def top_venues(row, num_venues):
    row_cats = row.iloc[1:]
    row_cats_sorted = row_cats.sort_values(ascending=False)
    return row_cats_sorted.index.values[0:num_venues]

In [29]:
num_venues = 10 # number of top venues
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
cols = ['Neighborhood']
for i in np.arange(num_venues):
    try:
        cols.append(f"{i+1}{indicators[i]} Most Common Venue")
    except IndexError:
        cols.append(f"{i+1}th Most Common Venue")

# create a dataframe of 10 most common venues by neighborhood
tor_common = pd.DataFrame(columns=cols)
tor_common['Neighborhood'] = tor_onehot['Neighborhood']

for i in np.arange(tor_onehot.shape[0]):
    tor_common.iloc[i, 1:] = top_venues(tor_onehot.iloc[i, :], num_venues)

tor_common.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Cheese Shop,Bakery,Farmers Market,Beer Bar,Restaurant,Seafood Restaurant,Museum,Café
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Nightclub,Breakfast Spot,Gym,Bakery,Pet Store,Performing Arts Venue,Restaurant,Climbing Gym
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Spa,Recording Studio,Farmers Market,Pizza Place,Burrito Place,Skate Park,Garden,Brewery,Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Boutique,Airport,Bar,Coffee Shop,Rental Car Location,Boat or Ferry,Sculpture Garden,Harbor / Marina
4,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Sandwich Place,Bubble Tea Shop,Burger Joint,Salad Place,Donut Shop,Business Service,Ramen Restaurant
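
The column names above rely on a three-element suffix list plus a fallback to "th", which is fine for ten columns but would produce labels such as "21th" for larger values of `num_venues`. A small sketch of a general ordinal helper follows. `ordinal` is a hypothetical function, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

num_venues = 10
cols = ['Neighborhood'] + [f'{ordinal(i)} Most Common Venue'
                           for i in range(1, num_venues + 1)]
```

For `num_venues = 10` this yields the same eleven column names as the cell above.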


### Cluster neighborhoods using _K-means_ algorithm

Apply the k-means clustering algorithm to segment and cluster Toronto neighborhoods based on a set of features (the venue category frequencies).

In [30]:
Ks = 5 # number of clusters
X = tor_onehot.drop(columns='Neighborhood') # select features

model = KMeans(n_clusters=Ks, random_state=0).fit(X)
model.labels_ # cluster labels generated for each row

array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 3, 0, 0, 1, 1, 0, 1, 1, 2, 0,
       1, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
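
The number of clusters is fixed at 5 here, and the labels show that three of the five clusters contain only one or two neighborhoods. One common heuristic for comparing values of k is the silhouette score. The sketch below applies it to synthetic data (three well-separated blobs standing in for the feature matrix `X`), so it illustrates the technique and is not a result for the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest score is the candidate this heuristic suggests
best_k = max(scores, key=scores.get)
```

Replacing `X_demo` with `X` would give a comparable score per k for the neighborhoods, though with only 39 rows and sparse features the scores should be read with caution.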

Create a new dataframe that includes the 10 most common venues, as well as the cluster labels.

In [31]:
tor_common.insert(0, 'Cluster Label', model.labels_)
tor_merged = tor_boi.copy()
tor_merged = tor_merged.join(tor_common.set_index('Neighborhood'), on='Neighborhood')
tor_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Pub,Trail,Adult Boutique,Music Venue,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Yoga Studio,Pizza Place,Brewery,Bubble Tea Shop
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,1,Pizza Place,Park,Fast Food Restaurant,Italian Restaurant,Pub,Restaurant,Movie Theater,Sandwich Place,Brewery,Pet Store
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Coffee Shop,Brewery,Gastropub,Bakery,American Restaurant,Café,Cheese Shop,Stationery Store,Bookstore,Middle Eastern Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1,Park,Swim School,Bus Line,Adult Boutique,Music Venue,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant


Display visualization of neighborhood clusters.

In [32]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(Ks)
ys = [i + x + (i*x)**2 for i in range(Ks)]
colors_array = cm.gist_rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lng, nbhd, cluster in zip(tor_merged['Latitude'],
                                   tor_merged['Longitude'],
                                   tor_merged['Neighborhood'],
                                   tor_merged['Cluster Label']):
    label = folium.Popup(f"Cluster {cluster}: {nbhd}", parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color=rainbow[cluster],
                        fill=True,
                        fill_color=rainbow[cluster],
                        fill_opacity=0.5
                       ).add_to(map_clusters)
map_clusters

### Examine clusters

Determine the venue categories that distinguish each cluster.

#### Cluster 0

In [33]:
cluster0 = tor_merged.loc[tor_merged['Cluster Label'] == 0,
                          tor_merged.columns[[2] + list(range(6, tor_merged.shape[1]))]
                         ].reset_index(drop=True)
cluster0

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Yoga Studio,Pizza Place,Brewery,Bubble Tea Shop
1,Studio District,Coffee Shop,Brewery,Gastropub,Bakery,American Restaurant,Café,Cheese Shop,Stationery Store,Bookstore,Middle Eastern Restaurant
2,"North Toronto West, Lawrence Park",Clothing Store,Coffee Shop,Yoga Studio,Mexican Restaurant,Cosmetics Shop,Spa,Sporting Goods Shop,Seafood Restaurant,Café,Salon / Barbershop
3,Davisville,Pizza Place,Dessert Shop,Sandwich Place,Gym,Italian Restaurant,Coffee Shop,Café,Thai Restaurant,Sushi Restaurant,Park
4,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,Liquor Store,Restaurant,Bank,Vietnamese Restaurant,Pizza Place,American Restaurant,Sushi Restaurant,Pub,Light Rail Station
5,"St. James Town, Cabbagetown",Bakery,Coffee Shop,Pub,Pizza Place,Convenience Store,Restaurant,Italian Restaurant,Café,Caribbean Restaurant,Liquor Store
6,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Dance Studio,Fast Food Restaurant,Hotel,Mediterranean Restaurant
7,"Regent Park, Harbourfront",Coffee Shop,Pub,Park,Bakery,Café,Theater,Breakfast Spot,Event Space,Electronics Store,Farmers Market
8,"Garden District, Ryerson",Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Japanese Restaurant,Café,Bubble Tea Shop,Middle Eastern Restaurant,Pizza Place,Movie Theater
9,St. James Town,Coffee Shop,Café,Cocktail Bar,Gastropub,American Restaurant,Creperie,Clothing Store,Seafood Restaurant,Restaurant,Gym


In [34]:
# get all the venue categories in the cluster
cl0_venues_lists = cluster0.iloc[:, 1:].values.tolist()

# flatten the resulting nested list
cl0_venues = [venue for sublist in cl0_venues_lists for venue in sublist]

# count unique values
pd.Series(cl0_venues).value_counts()

Coffee Shop           25
Café                  22
Restaurant            14
Italian Restaurant    11
Gym                    8
                      ..
Candy Store            1
Falafel Restaurant     1
Deli / Bodega          1
Dance Studio           1
Event Space            1
Length: 85, dtype: int64

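Caffeine seems to be quite popular in Cluster 0. Coffee Shop appears 25 times and Café 22 times across the neighborhoods' top-10 lists. Note that these figures count top-10 appearances (at most one per neighborhood), not individual venues.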

#### Cluster 1

In [35]:
cluster1 = tor_merged.loc[tor_merged['Cluster Label'] == 1,
                          tor_merged.columns[[2] + list(range(6, tor_merged.shape[1]))]
                         ].reset_index(drop=True)
cluster1

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,Health Food Store,Pub,Trail,Adult Boutique,Music Venue,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop
1,"India Bazaar, The Beaches West",Pizza Place,Park,Fast Food Restaurant,Italian Restaurant,Pub,Restaurant,Movie Theater,Sandwich Place,Brewery,Pet Store
2,Lawrence Park,Park,Swim School,Bus Line,Adult Boutique,Music Venue,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant
3,Davisville North,Gym,Hotel,Park,Department Store,Sandwich Place,Breakfast Spot,Food & Drink Shop,Pizza Place,Hotel Bar,Nightclub
4,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Boutique,Airport,Bar,Coffee Shop,Rental Car Location,Boat or Ferry,Sculpture Garden,Harbor / Marina
5,"Dufferin, Dovercourt Village",Bakery,Pharmacy,Coffee Shop,Middle Eastern Restaurant,Liquor Store,Café,Supermarket,Furniture / Home Store,Music Venue,Bar
6,"Little Portugal, Trinity",Bar,Asian Restaurant,Café,Vegetarian / Vegan Restaurant,Men's Store,Restaurant,Yoga Studio,Gift Shop,Korean Restaurant,Beer Store
7,"High Park, The Junction South",Thai Restaurant,Mexican Restaurant,Café,Bar,Speakeasy,Cajun / Creole Restaurant,Fast Food Restaurant,Fried Chicken Joint,Bookstore,Furniture / Home Store
8,"Parkdale, Roncesvalles",Breakfast Spot,Gift Shop,Movie Theater,Eastern European Restaurant,Bar,Dog Run,Dessert Shop,Italian Restaurant,Restaurant,Bookstore
9,"Business reply mail Processing Centre, South C...",Light Rail Station,Spa,Recording Studio,Farmers Market,Pizza Place,Burrito Place,Skate Park,Garden,Brewery,Restaurant


In [36]:
# get all the venue categories in the cluster
cl1_venues_lists = cluster1.iloc[:, 1:].values.tolist()

# flatten the resulting nested list
cl1_venues = [venue for sublist in cl1_venues_lists for venue in sublist]

# count unique values
pd.Series(cl1_venues).value_counts()

Bar                          5
Restaurant                   4
Middle Eastern Restaurant    3
Café                         3
Park                         3
                            ..
Food & Drink Shop            1
Airport                      1
Supermarket                  1
Airport Lounge               1
Recording Studio             1
Length: 66, dtype: int64

The most frequent top-10 category in Cluster 1 is Bar, which appears in 5 of the 10 neighborhoods, so the typical venue here serves alcoholic beverages rather than the good ol' cup of joe.

#### Cluster 2

In [37]:
cluster2 = tor_merged.loc[tor_merged['Cluster Label'] == 2,
                          tor_merged.columns[[2] + list(range(6, tor_merged.shape[1]))]
                         ].reset_index(drop=True)

cl2_venues_lists = cluster2.iloc[:, 1:].values.tolist()
cl2_venues = [venue for sublist in cl2_venues_lists for venue in sublist]
pd.Series(cl2_venues).value_counts()

Mexican Restaurant           1
Men's Store                  1
Martial Arts School          1
Middle Eastern Restaurant    1
Restaurant                   1
Tennis Court                 1
Mediterranean Restaurant     1
Trail                        1
Adult Boutique               1
Museum                       1
dtype: int64

#### Cluster 3

In [38]:
cluster3 = tor_merged.loc[tor_merged['Cluster Label'] == 3,
                          tor_merged.columns[[2] + list(range(6, tor_merged.shape[1]))]
                         ].reset_index(drop=True)

cl3_venues_lists = cluster3.iloc[:, 1:].values.tolist()
cl3_venues = [venue for sublist in cl3_venues_lists for venue in sublist]
pd.Series(cl3_venues).value_counts()

Middle Eastern Restaurant    2
Men's Store                  2
Museum                       2
Trail                        2
Mexican Restaurant           2
Park                         2
Mediterranean Restaurant     2
Sushi Restaurant             1
Jewelry Store                1
Martial Arts School          1
Playground                   1
Adult Boutique               1
Miscellaneous Shop           1
dtype: int64

#### Cluster 4

In [39]:
cluster4 = tor_merged.loc[tor_merged['Cluster Label'] == 4,
                          tor_merged.columns[[2] + list(range(6, tor_merged.shape[1]))]
                         ].reset_index(drop=True)

cl4_venues_lists = cluster4.iloc[:, 1:].values.tolist()
cl4_venues = [venue for sublist in cl4_venues_lists for venue in sublist]
pd.Series(cl4_venues).value_counts()

Middle Eastern Restaurant     1
Men's Store                   1
Mexican Restaurant            1
New American Restaurant       1
Music Venue                   1
Garden                        1
Adult Boutique                1
Mediterranean Restaurant      1
Miscellaneous Shop            1
Modern European Restaurant    1
dtype: int64
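
The five cluster cells above repeat the same three steps: select the rows of one cluster, flatten the top-venue columns, and count the categories. As a sketch, they could be folded into one helper. `top_category_counts` is a hypothetical name, and the `demo` frame is made-up data that only mimics the column naming of `tor_merged`.

```python
import pandas as pd

def top_category_counts(merged, label, label_col='Cluster Label',
                        venue_suffix='Most Common Venue'):
    """Count how often each category appears in the top-N columns of one cluster."""
    venue_cols = [c for c in merged.columns if c.endswith(venue_suffix)]
    in_cluster = merged.loc[merged[label_col] == label, venue_cols]
    # Flatten the selected cells into one long Series and count the values
    return pd.Series(in_cluster.to_numpy().ravel()).value_counts()

# Tiny made-up frame with the same column naming as tor_merged
demo = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Cluster Label': [0, 0, 1],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Bar'],
    '2nd Most Common Venue': ['Café', 'Park', 'Café'],
})

counts0 = top_category_counts(demo, 0)
```

Selecting columns by name suffix rather than by position avoids depending on the exact column order of the merged dataframe.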

---

Thank you for reviewing my assignment!