# Segmenting and Clustering Neighborhoods in Toronto
Created by Brandon Bellanti | Last updated on 2021-03-26

---

## Load Libraries

In [1]:
import pandas as pd, numpy as np
from bs4 import BeautifulSoup
import requests
import re

## Fetch Toronto neighborhood data from Wikipedia

**[Toronto Postal Codes](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)** on Wikipedia

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
session = requests.Session()
r = session.get(wiki_url)
r.raise_for_status()  # stop here with an HTTPError if the request failed
html = r.text
print('Request successful')

Request successful


In [3]:
# print partial HTML string to verify
html[0:500]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of postal codes of Canada: M - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8db48afd-5513-4ef4-835'

In [4]:
# create a Beautiful Soup object
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning

## Parse HTML table data

In [5]:
# search for all tables in the HTML and select the first one
tables = soup.find_all('table')
table = tables[0]

In [29]:
# create df to store neighborhood data
neighborhoods_df = pd.DataFrame(columns=['PostalCode','Borough','Neighborhoods'])

Each cell text in the table contains a postal code, a borough name, and one or more neighborhood names. The postal code is the first three characters, so I slice the string to return that value. The neighborhoods and boroughs are consistently formatted as `Borough(Neighborhood)`, or as `Borough(Neighborhood / Neighborhood / Neighborhood)` if there are multiple neighborhoods in the same borough. For example:
    
`M3ANorth York(Parkwoods)`

It's rare, but there are some postal codes that encompass multiple boroughs, such as:
    
`M3CNorth York(Don Mills)South(Flemingdon Park)`

I split the rest of the string (from the fourth character to the end) on the closing parenthesis ( ")" ) first, then iterate through the resulting list in case there are multiple boroughs. Then, for each neighborhood string, I replace any slashes ( "/" ) with commas so the neighborhoods are comma-delimited.

According to the project instructions, if a borough does not have an assigned neighborhood name, the neighborhood name should be the same as the borough.
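As a minimal sketch of the parsing rules described above, the hypothetical helper `parse_cell` below (not part of the original notebook) applies the same slice, split, and replace steps to a single cell string. It assumes the string carries no newlines, as in the examples shown earlier. The `M5A` input is reconstructed from the stated format, and the `M0X` input is made up to illustrate the borough-only fallback.

```python
def parse_cell(text):
    """Split one table-cell string into (postal code, borough, neighborhoods) tuples."""
    postal_code = text[:3]   # the postal code is the first three characters
    rest = text[3:]          # everything after it holds boroughs and neighborhoods
    rows = []
    for part in rest.split(')'):          # one piece per Borough(Neighborhoods) group
        if part in ('', 'Not assigned'):  # skip empty pieces and unassigned codes
            continue
        borough, _, neighborhoods = part.partition('(')
        if not neighborhoods:             # no neighborhood listed: reuse the borough name
            neighborhoods = borough
        rows.append((postal_code, borough, neighborhoods.replace(' / ', ', ')))
    return rows


single = parse_cell('M3ANorth York(Parkwoods)')
multiple = parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)')
two_boroughs = parse_cell('M3CNorth York(Don Mills)South(Flemingdon Park)')
borough_only = parse_cell('M0XExample Borough')  # made-up input

print(single)
print(multiple)
print(two_boroughs)
print(borough_only)
```

Returning a list of tuples per cell keeps the multi-borough case explicit: one input string can yield more than one row.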

In [30]:
rows = []  # collect one dict per (postal code, borough) pair
for data in table.find_all('td'):
    text = data.text
    postal_code = text[:4].strip('\n')
    divs = text[4:].rstrip('\n').split(')')
    for div in divs:
        if div not in ['', 'Not assigned']:
            borough_town = div.split('(')
            borough = borough_town[0]
            # fall back to the borough name when no neighborhood is listed
            neighborhoods = borough_town[1] if len(borough_town) > 1 else borough
            neighborhoods = neighborhoods.replace(' / ', ', ')
            rows.append(dict(PostalCode=postal_code,
                             Borough=borough,
                             Neighborhoods=neighborhoods))

# build the dataframe in one step; DataFrame.append was removed in pandas 2.0
neighborhoods_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhoods'])

neighborhoods_df.head()

| | PostalCode | Borough | Neighborhoods |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [41]:
neighborhoods_df.shape

(112, 3)

Note that the `.shape` attribute above counts rows, so it does not account for the few postal codes that are repeated. These duplicates arise because some postal codes encompass multiple boroughs, as mentioned earlier. The cell below shows the number of unique postal codes.

In [45]:
neighborhoods_df.drop_duplicates('PostalCode').shape # or neighborhoods_df['PostalCode'].nunique()

(103, 3)
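To see which postal codes account for the gap between the two row counts, one option is to list every row whose code occurs more than once. The sketch below uses a four-row toy frame built from rows that appear in this notebook's output, not the full table.

```python
import pandas as pd

# toy rows taken from this notebook's output tables
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M3B', 'M3B', 'M4B'],
    'Borough': ['North York', 'North York', 'North', 'East York'],
    'Neighborhoods': ['Parkwoods', 'Don Mills', 'North',
                      'Parkview Hill, Woodbine Gardens'],
})

# keep=False flags every row whose postal code occurs more than once
repeated = sample[sample['PostalCode'].duplicated(keep=False)]
n_unique = sample['PostalCode'].nunique()

print(repeated)
print(n_unique)
```

Running the same two lines on `neighborhoods_df` would show the repeated codes behind the difference between the two shapes reported above.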

## Join geographical data

In [32]:
# load spatial data
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [46]:
# merge neighborhoods and spatial dataframes by the postal codes
toronto_df = neighborhoods_df.merge(geo_df, left_on='PostalCode',right_on='Postal Code',how='outer').drop(columns='Postal Code')
toronto_df.head(30)

| | PostalCode | Borough | Neighborhoods | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M3B | North | North | 43.745906 | -79.352188 |
| 9 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |


In [34]:
toronto_df.shape

(112, 5)
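The merged frame has the same 112 rows as `neighborhoods_df`, which suggests every postal code found a match. An outer join can also add or keep unmatched rows without any warning, so a quick check is worthwhile. The sketch below shows one way to do that with the `indicator` option of `DataFrame.merge`, using small toy frames. The code `M0X` is made up, and the coordinates are copied from the tables above.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],   # 'M0X' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],  # 'M1B' has no row in the toy neighborhoods frame
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

merged = neighborhoods.merge(
    coordinates,
    left_on='PostalCode',
    right_on='Postal Code',
    how='outer',
    indicator=True,   # adds a '_merge' column: 'both', 'left_only' or 'right_only'
)

# rows that exist on only one side of the join
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['PostalCode', 'Postal Code', '_merge']])
```

If `unmatched` comes back empty on the real data, the outer join behaved like an inner join and no coordinates are missing.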

# Segmenting and clustering neighborhoods in Toronto

In [47]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [48]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [78]:
# filter toronto dataframe to include only boroughs that contain the word Toronto
toronto_data = toronto_df.loc[toronto_df['Borough'].str.contains('Toronto')].copy()
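
Because the merge above is an outer join, `Borough` could in principle hold missing values. `str.contains` returns `NaN` for those rows, which can make the boolean filter fail. A minimal sketch of a more defensive filter follows, assuming a toy frame with one made-up row (`M0X`) that lacks a borough.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M3A', 'M4E', 'M0X'],   # 'M0X' is a made-up code
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto', np.nan],
})

# na=False treats a missing borough as "does not contain 'Toronto'"
mask = sample['Borough'].str.contains('Toronto', na=False)
toronto_only = sample.loc[mask].copy()
print(toronto_only)
```

With the real data, every row appears to have a borough, so the extra argument would change nothing. It only guards against unmatched rows.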

In [54]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhoods']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [55]:
CLIENT_ID = 'REDACTED' # your Foursquare ID
CLIENT_SECRET = 'REDACTED' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: REDACTED
CLIENT_SECRET: REDACTED


In [57]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe of Foursquare venues found within `radius` of each neighborhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep the list of returned items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
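
The function above reads a nested path in the JSON response (`response` → `groups[0]` → `items` → `venue`). The sketch below walks the same path offline so the flattening step can be inspected without calling the API. The payload is hand-written and includes only the keys the function reads. The helper `venue_rows` and both example venues are hypothetical. Whether real responses ever contain a venue with an empty `categories` list is an assumption, and the helper tolerates that case rather than raising an `IndexError`.

```python
import pandas as pd

# hand-written payload that mimics only the keys the function reads
payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Cafe',
                   'location': {'lat': 43.6535, 'lng': -79.3618},
                   'categories': [{'name': 'Coffee Shop'}]}},
        {'venue': {'name': 'Example Plaza',
                   'location': {'lat': 43.6541, 'lng': -79.3601},
                   'categories': []}},   # a venue with no category
    ]}]}
}


def venue_rows(name, lat, lng, payload):
    """Flatten one explore-style payload into rows, tolerating missing categories."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
venues = pd.DataFrame(venue_rows('Regent Park', 43.65426, -79.360636, payload),
                      columns=columns)
print(venues)
```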

In [88]:
# explode dataframe to search each neighborhood
toronto_data['Neighborhoods_list'] = toronto_data['Neighborhoods'].apply(lambda x: x.split(', '))
toronto_exploded = toronto_data.explode('Neighborhoods_list').reset_index(drop=True).rename(columns={'Neighborhoods_list':'Neighborhood'})
toronto_exploded.head(10)

| | PostalCode | Borough | Neighborhoods | Latitude | Longitude | Neighborhood |
|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | Regent Park |
| 1 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | Harbourfront |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 | Garden District |
| 3 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 | Ryerson |
| 4 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 | St. James Town |
| 5 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | The Beaches |
| 6 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 | Berczy Park |
| 7 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 | Central Bay Street |
| 8 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 | Christie |
| 9 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 | Richmond |


In [89]:
toronto_venues = getNearbyVenues(names=toronto_exploded['Neighborhood'],
                                   latitudes=toronto_exploded['Latitude'],
                                   longitudes=toronto_exploded['Longitude']
                                  )

Regent Park
Harbourfront
Garden District
Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond
Adelaide
King
Dufferin
Dovercourt Village
The Danforth  East
Harbourfront East
Union Station
Toronto Islands
Little Portugal
Trinity
The Danforth West
Riverdale
Toronto Dominion Centre
Design Exchange
Brockton
Parkdale Village
Exhibition Place
India Bazaar
The Beaches West
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
University of Toronto
Harbord
Runnymede
Swansea
Moore Park
Summerhill East
Kensington Market
Chinatown
Grange Park
Summerhill West
Rathnelly
South Hill
Forest Hill SE
Deer Park
CN Tower
King and Spadina
Railway Lands
Harbourfront West
Bathurst Quay
South Niagara
Island airport
Rosedale
Enclave of M5E
St. James Town
Cabbagetown
First Canadian Place
Underground city
Church a

In [95]:
print(f"""There were {toronto_venues.shape[0]} venues in Toronto returned!""")
toronto_venues.head()

There were 3062 venues in Toronto returned!


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Regent Park | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Regent Park | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | Regent Park | 43.65426 | -79.360636 | Cooper Koo Family YMCA | 43.653249 | -79.358008 | Distribution Center |
| 3 | Regent Park | 43.65426 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
| 4 | Regent Park | 43.65426 | -79.360636 | Impact Kitchen | 43.656369 | -79.35698 | Restaurant |


In [None]:
toronto_venue