# Segmenting and Clustering Neighborhoods in Toronto

## Data Retrieval

The first part of this project scrapes Wikipedia's list of Toronto postal codes to collect the borough and neighborhoods for each code. We'll use the BeautifulSoup package (`bs4`) to do just that.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')

In [3]:
# Grab the first table on the page with the 'wikitable sortable' class.
table = soup.find('table', class_='wikitable sortable')

# Set up some containers
postal_codes = []
boroughs = []
neighborhoods = []

# Iterate through the table rows, keeping only rows with exactly three data cells
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        postal_codes.append(cells[0].find(string=True).rstrip('\n'))
        boroughs.append(cells[1].find(string=True).rstrip('\n'))

        # Replace ' /' with ',' at this stage so that several neighborhoods
        # sharing one postal code end up comma-separated.
        neighborhoods.append(cells[2].find(string=True).rstrip('\n').replace(' /', ','))

# The dataframe will consist of three columns: Postal Code, Borough, and Neighborhood
toronto = pd.DataFrame(postal_codes, columns=['Postal Code'])
toronto['Borough'] = boroughs
toronto['Neighborhood'] = neighborhoods

toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
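
The row-and-cell logic above can be tried offline on a small snippet. This is only a sketch: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without a network connection, and `SAMPLE_HTML` is a hypothetical fragment modeled on the shape of the Wikipedia table, not a copy of it. The class name `TableRowParser` is likewise made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical sample, shaped like the scraped table.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park / Harbourfront
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)

records = []
for cells in parser.rows:
    if len(cells) == 3:  # same guard as the notebook cell
        code, borough, hood = cells
        records.append((code, borough, hood.replace(' /', ',')))

print(records)
```

The header row is skipped for the same reason as in the notebook: it holds `<th>` cells, so the three-cell check on `<td>` never matches.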


In [4]:
# Only keep the rows that have an assigned borough. Drop rows with
# a borough that is 'Not assigned'.
assigned = toronto[toronto['Borough'] != 'Not assigned'].reset_index(drop=True)

# If a row has a borough but a 'Not assigned' neighborhood, then
# the neighborhood should be the same as the borough.
#
# After spending some time looking at the dataframe and the original Wikipedia
# table, I discovered there are no cases where we have a borough but no neighborhood.

assigned.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
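
Although no row needed it here, the fallback rule described in the comments (a missing neighborhood takes the borough's name) is simple to write down. The sketch below works on plain dictionaries rather than the dataframe so that it runs on its own. The helper name `fill_missing_neighborhood` and the second sample row are hypothetical; the scraped table contained no such case.

```python
def fill_missing_neighborhood(record):
    """Return a copy of record whose missing neighborhood falls back to its borough."""
    neighborhood = record['Neighborhood'].strip()
    if not neighborhood or neighborhood == 'Not assigned':
        return {**record, 'Neighborhood': record['Borough']}
    return dict(record)


rows = [
    {'Postal Code': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    # Hypothetical row for illustration only.
    {'Postal Code': 'M0X', 'Borough': 'Example Borough', 'Neighborhood': 'Not assigned'},
]

cleaned = [fill_missing_neighborhood(row) for row in rows]
for row in cleaned:
    print(row['Postal Code'], row['Neighborhood'])
```

On the dataframe itself, the same idea could probably be expressed with `Series.where`, keeping the neighborhood where it is assigned and taking the borough otherwise.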


In [5]:
# Report the dimensions of the cleaned dataframe as (rows, columns)
# using the .shape attribute.

assigned.shape

(103, 3)

## Geocoding

The next part of our project requires merging our neighborhood dataframe with latitudes and longitudes. Rather than mess with the Geocoding API (which has proven in my day job to be a nightmare), I will import the data directly from the course's geospatial CSV: http://cocl.us/Geospatial_data

In [6]:
geodata = pd.read_csv('../data/geospacial.csv')
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
merged = assigned.set_index('Postal Code').join(geodata.set_index('Postal Code'))
merged.reset_index(inplace=True)
merged.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
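
`DataFrame.join` performs a left join by default, so a postal code missing from the CSV would not raise an error; it would simply end up with empty coordinates. The sketch below mimics that behavior on plain Python data to show how unmatched codes could be surfaced. The function name `left_join` and the `M9Z` row are hypothetical; the other values are taken from the tables above.

```python
assigned_rows = [
    {'Postal Code': 'M3A', 'Borough': 'North York'},
    {'Postal Code': 'M1B', 'Borough': 'Scarborough'},
    {'Postal Code': 'M9Z', 'Borough': 'Example Borough'},  # hypothetical, not in the CSV
]

coordinates = {
    'M3A': (43.753259, -79.329656),
    'M1B': (43.806686, -79.194353),
}


def left_join(rows, coords):
    """Attach coordinates to each row and report postal codes with no match."""
    merged, unmatched = [], []
    for row in rows:
        lat_lng = coords.get(row['Postal Code'])
        if lat_lng is None:
            unmatched.append(row['Postal Code'])
            lat_lng = (None, None)
        merged.append({**row, 'Latitude': lat_lng[0], 'Longitude': lat_lng[1]})
    return merged, unmatched


merged_rows, unmatched_codes = left_join(assigned_rows, coordinates)
print(unmatched_codes)
```

An empty `unmatched_codes` list would confirm that every neighborhood can be placed on the map.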


## Exploration

First we're going to map the neighborhoods of Toronto based on the above dataset, similar to the way we used Folium to visualize New York in the exercises.

In [8]:
import folium
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [9]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(merged['Latitude'], merged['Longitude'], merged['Borough'], merged['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

From the looks of things, the neighborhoods sit more densely together the closer we get to downtown Toronto. Let's filter our merged dataset to select _only_ neighborhoods whose borough name contains 'Toronto' and visualize how they're clustered together.

In [10]:
toronto_neighborhoods = merged[merged['Borough'].str.contains('Toronto')]

map_dt_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_toronto)  
    
map_dt_toronto

Seeing the east, west, downtown, and central boroughs side by side, I want to color code things a bit more ...

In [11]:
colored_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

def point_color(borough):
    if borough.startswith('East'):
        return 'red'
    elif borough.startswith('West'):
        return 'green'
    elif borough.startswith('Central'):
        return 'orange'
    else:
        return 'blue'

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=point_color(borough),
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(colored_toronto)  
    
colored_toronto

We clearly see _some_ overlap between downtown Toronto and the nearby boroughs. This is more interesting than problematic, and we can still see clear clustering between the neighborhoods within each borough.
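
The prefix-to-color rule behind this map can also be written as a lookup table, which makes it easier to extend. This is a sketch of an alternative to the `if`/`elif` chain, not a change in behavior. The borough names other than 'Downtown Toronto' and 'East York' are assumed from the east/west/central split described above.

```python
BOROUGH_COLORS = {
    'East': 'red',
    'West': 'green',
    'Central': 'orange',
}
DEFAULT_COLOR = 'blue'  # Downtown Toronto and anything unmatched


def point_color(borough):
    """Return the marker color for a borough based on its name prefix."""
    for prefix, color in BOROUGH_COLORS.items():
        if borough.startswith(prefix):
            return color
    return DEFAULT_COLOR


# Assumed borough names, except 'Downtown Toronto', which appears in the data above.
sample = ['East Toronto', 'West Toronto', 'Central Toronto', 'Downtown Toronto']
colors = {borough: point_color(borough) for borough in sample}
print(colors)
```

One caveat of prefix matching: 'East York' would also come out red. That does not matter here because the map only uses boroughs whose name contains 'Toronto', but it would if the function were applied to the full `merged` dataframe.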

To explore this a bit more, let's see how many cafes are in town. We'll use the Foursquare API to run a search query for "coffee", fetch the 50 matching venues closest to the lat/long of Toronto itself, and visualize how they're distributed across the boroughs.

In [12]:
import requests

# This block allows us to import .py files from the parent directory of our /toronto location so we can keep our
# Foursquare credentials secret ;-)
import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 

from creds import CLIENT_ID, CLIENT_SECRET
VERSION = '20180604'

In [13]:
search_query = 'coffee'
limit = 50
radius = 20 * 1000  # in meters; Toronto's maximum width is 43 km, so let's use a 20 km radius
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, limit)

results = requests.get(url).json()
venues = results['response']['venues']

cafes = pd.json_normalize(venues)
cafes.head()

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.neighborhood,location.city,location.state,location.country,location.formattedAddress,venuePage.id
0,59f784dd28122f14f9d5d63d,HotBlack Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1588307200,False,245 Queen Street West,at St Patrick St,43.650364,-79.388669,"[{'label': 'display', 'lat': 43.65036434800487...",515,M5V 1Z4,CA,Entertainment District,Toronto,ON,Canada,"[245 Queen Street West (at St Patrick St), Tor...",463001529.0
1,4b44fc77f964a520cc0026e3,Timothy's World Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1588307200,False,427 University Avenue,,43.654053,-79.38809,"[{'label': 'display', 'lat': 43.65405317976302...",340,,CA,,Toronto,ON,Canada,"[427 University Avenue, Toronto ON, Canada]",
2,4b0aaa8ef964a520272623e3,Timothy's World Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1588307200,False,"483 Bay St, Bell Trinity Square",Bell Trinity Square,43.653436,-79.382314,"[{'label': 'display', 'lat': 43.653436, 'lng':...",130,M5G 2C9,CA,,Toronto,ON,Canada,"[483 Bay St, Bell Trinity Square (Bell Trinity...",
3,4baa9f6cf964a520817a3ae3,Timothy's World Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1588307200,False,401 Bay St.,at Richmond St. W,43.652135,-79.381172,"[{'label': 'display', 'lat': 43.65213455850074...",268,M5H 2Y4,CA,,Toronto,ON,Canada,"[401 Bay St. (at Richmond St. W), Toronto ON M...",
4,53e8acc4498ee294fb100183,Timothy's World Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1588307200,False,425 University Ave,Dundas,43.65427,-79.387448,"[{'label': 'display', 'lat': 43.65427, 'lng': ...",296,M5G 1T6,CA,,Toronto,ON,Canada,"[425 University Ave (Dundas), Toronto ON M5G 1...",
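
The request URL above is assembled by hand with `str.format`, which works for a one-word query but would not escape spaces or other special characters. A minimal sketch of a safer construction using the standard library's `urlencode` follows. The function name `build_search_url`, the placeholder credentials, and the two-word query are assumptions for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, lat, lng, query,
                     radius_m, limit, version='20180604'):
    """Build the venue search URL with every parameter percent-encoded."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius_m,
        'limit': limit,
    }
    return '{}?{}'.format(BASE_URL, urlencode(params))


# Placeholder credentials and a hypothetical two-word query.
url = build_search_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                       43.6534817, -79.3839347, 'coffee shop', 20 * 1000, 50)

# Decode the query string again to confirm nothing was mangled.
decoded = parse_qs(urlsplit(url).query)
print(decoded['query'], decoded['ll'])
```

Since the notebook already imports `requests`, passing the same dictionary through the `params` argument of `requests.get` would achieve the same encoding without building the string manually.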


In [14]:
cafes_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, name in zip(cafes['location.lat'], cafes['location.lng'], cafes['name']):
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(cafes_toronto)  
    
cafes_toronto

Because our account is limited to 50 venues per request, and the search returns the venues closest to the city center, this isn't hugely informative for all of Toronto. Let's instead zoom in further and visualize both the cafes _and_ the neighborhoods exclusive to "Downtown Toronto."

In [15]:
downtown = merged[merged['Borough'] == 'Downtown Toronto']

map_dt_cafes = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, borough, neighborhood in zip(downtown['Latitude'], downtown['Longitude'], downtown['Borough'], downtown['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_dt_cafes)  
    
for lat, lng, name in zip(cafes['location.lat'], cafes['location.lng'], cafes['name']):
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_cafes)  

map_dt_cafes

Overall, within the 50 venues returned, this suggests that the more centrally located neighborhoods have better access to coffee, and those far from the downtown center might be slightly less caffeinated.
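
To put a number on that impression, one option is to count the cafes within a fixed distance of each neighborhood's coordinates. The sketch below uses the haversine formula with two neighborhoods and the five cafe locations shown in the tables above. The 1.5 km cutoff and the helper names are assumptions chosen for illustration, and five venues are far too few to support a real conclusion.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def cafes_within(center, venues, radius_km):
    """Count the venues no farther than radius_km from center."""
    return sum(1 for lat, lng in venues
               if haversine_km(center[0], center[1], lat, lng) <= radius_km)


# Coordinates copied from the merged dataframe output above.
neighborhoods = {
    'Garden District, Ryerson': (43.657162, -79.378937),
    'Malvern, Rouge': (43.806686, -79.194353),
}

# Coordinates of the first five cafes returned by the search above.
cafes_sample = [
    (43.650364, -79.388669),
    (43.654053, -79.38809),
    (43.653436, -79.382314),
    (43.652135, -79.381172),
    (43.65427, -79.387448),
]

RADIUS_KM = 1.5  # assumed cutoff for "nearby"
counts = {name: cafes_within(center, cafes_sample, RADIUS_KM)
          for name, center in neighborhoods.items()}
print(counts)
```

Applied to all 50 venues and every neighborhood, the same count would turn the visual impression from the map into a per-neighborhood figure. It would still be skewed toward the center, because the search only returns the venues nearest to it.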