# Segmenting and Clustering Neighborhoods in Toronto

## 1. Processing the data

Let's retrieve Toronto's neighborhoods from Wikipedia and convert them into a pandas DataFrame. This can be easily done with pandas' read_html function, though you'll need to install several dependencies for it to work (see below). 

In [1]:
# If necessary, uncomment the following line to install the dependencies on your system
#!pip3 install numpy pandas lxml html5lib beautifulsoup4 geopy geocoder googlemaps requests scikit-learn matplotlib folium

import pandas as pd

# This is the URL containing the table with Toronto's neighborhoods.
# Looking at the page source, we see that the table with the desired info
# has the class attribute 'wikitable', which we'll use to identify it
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Note read_html returns a list. Just pick the first element
df = pd.read_html(url, attrs={'class': 'wikitable'})[0]

# Check the parsing has been done correctly
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Now let's prepare the data with the following steps:

- Ignore cells with a borough that is 'Not assigned'.
- If more than one neighborhood exists in one postal code area, combine them into one row with the neighborhoods separated by commas.
- If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.

In [3]:
# Drop rows with 'Not assigned' borough
not_assigned = df[df['Borough'] == 'Not assigned'].index
df.drop(not_assigned, inplace=True)

# Group rows with same postal code
df = df.groupby('Postcode', as_index=False).aggregate(lambda x: pd.Series(x.unique()).str.cat(sep=', '))

# If 'Neighborhood' is not assigned, set it to same value as 'Borough'
df.loc[df['Neighborhood'] == 'Not assigned','Neighborhood'] = df['Borough']
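
As a minimal sketch of the three steps above, the same logic can be checked on a handful of made-up rows. The `toy` DataFrame is hypothetical and only mimics the structure of the Wikipedia table:

```python
import pandas as pd

# Hypothetical rows with the same columns as the Wikipedia table
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Step 1: drop rows without a borough
toy = toy[toy['Borough'] != 'Not assigned']

# Step 2: one row per postal code, unique values joined with a comma
toy = toy.groupby('Postcode', as_index=False).agg(lambda s: ', '.join(s.unique()))

# Step 3: fall back to the borough when the neighborhood is missing
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy)
```

In this toy data the two M5A rows collapse into one, and the M7A row takes its borough name as its neighborhood.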

Display the shape of the final result and the first rows.

In [4]:
print(df.shape)
df.head()

(103, 3)


Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


## 2. Get Geocode data for neighborhoods

To make requests to the Foursquare API and add markers to the map, we need the latitude and longitude of each neighborhood. There are several ways to get them. The assignment instructions recommend the [geocoder](https://geocoder.readthedocs.io/) package, but I couldn't get any result even after 100 attempts, so I discarded it. I then tried [geopy](https://geopy.readthedocs.io/en/stable/), which returned coordinates for some postal codes but nothing for many others. In the end I used the [Google Geocoding API](https://github.com/googlemaps/google-maps-services-python), for which I was lucky enough to have a trial period.

Just for illustration purposes, this is the code for geocoder, which unfortunately didn't work for me (you don't need to run the cell).

In [5]:
import geocoder

def get_lat_lng1(postal_code, max_attempts=100):
    lat_lng_coords = None
    attempts = 0

    # loop until we get the coordinates or run out of attempts
    while lat_lng_coords is None and attempts < max_attempts:
        attempts += 1
        loc = '{}, Toronto, Ontario'.format(postal_code)
        print('#{}: Getting latitude and longitude for {}'.format(attempts, loc))
        g = geocoder.google(loc)
        lat_lng_coords = g.latlng

    return lat_lng_coords
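
Retrying immediately, as the loop above does, may simply hit the same failure again. A common alternative (not what this notebook used) is to wait a little longer after each failed attempt. Here is a minimal sketch of that idea. `retry_lookup`, `fake_lookup` and their parameters are hypothetical names introduced only for illustration:

```python
import time

def retry_lookup(lookup, query, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # Wait longer after each failure: base_delay, 2 * base_delay, 4 * base_delay, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None

# Fake lookup that fails twice before answering, used only to exercise the helper
calls = []

def fake_lookup(query):
    calls.append(query)
    return (43.65, -79.38) if len(calls) >= 3 else None

# Pass a no-op sleep so the example runs instantly
coords = retry_lookup(fake_lookup, 'M5A, Toronto, Ontario', sleep=lambda seconds: None)
print(coords, len(calls))
```

Any geocoding call that returns `None` on failure could be passed in as `lookup`.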

And this is the code for geopy, which only offered data for some postal codes (this cell doesn't have to be run either).

In [None]:
from geopy.geocoders import Nominatim
import time

geolocator = Nominatim(user_agent="toronto_explorer")

def get_lat_lng(postal_code):
    # The API fails when making too many requests so wait a bit in between
    time.sleep(2)
    loc = '{}, Toronto, Ontario'.format(postal_code)
    data = geolocator.geocode(loc)
    if data is None:
        print("No data found for {}".format(loc))
        return None
    else:
        print("{}: {}, {}".format(loc, data.latitude, data.longitude))
        return (data.latitude, data.longitude)

Finally, this is the code for Google Geocoding. It requires an API key, which is stored in a JSON file that is not committed to the repository. To make the code easy to reproduce on other systems, I've already stored the results in a JSON file.

In [7]:
import json
import googlemaps

def read_json(file_name):
    with open(file_name) as json_file:
        return json.load(json_file)

def write_json(file_name, data):
    with open(file_name, 'w') as json_file:
        json.dump(data, json_file)

credentials = read_json('credentials.json')

gmaps = googlemaps.Client(key=credentials["GOOGLE_API_KEY"])

def get_lat_lng(postal_code):
    loc = '{}, Toronto, Ontario'.format(postal_code)
    results = gmaps.geocode(loc)
    location_data = results[0]["geometry"]["location"]
    return (location_data["lat"], location_data["lng"])

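# Instead of calling the API, let's load the values already fetched.
# The results are indexed by postal code, so the API call would build a dict:
# ll = {postcode: get_lat_lng(postcode) for postcode in df['Postcode'].values}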

ll = read_json('geocode_ll.json')
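
For reference, here is a minimal sketch of how such a cache could be written and read back. It assumes the file holds a dictionary keyed by postal code, which is how `ll` is indexed in the next cell. The temporary path is only for illustration:

```python
import json
import os
import tempfile

# Two coordinate pairs taken from the table shown later in this notebook
coords = {
    'M1B': (43.806686, -79.194353),
    'M1C': (43.784535, -79.160497),
}

# Write the cache to a throwaway location (the notebook uses 'geocode_ll.json')
cache_path = os.path.join(tempfile.mkdtemp(), 'geocode_ll.json')
with open(cache_path, 'w') as json_file:
    json.dump(coords, json_file)

# Read it back: JSON has no tuples, so each pair comes back as a list
with open(cache_path) as json_file:
    cached = json.load(json_file)

print(cached['M1B'])
```

Since each value comes back as a two-element list, `cached[postcode][0]` is the latitude and `cached[postcode][1]` the longitude.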


Now we can use the data to add the 'Latitude' and 'Longitude' columns to our DataFrame.

In [8]:
df['Latitude'] = [ll[postcode][0] for postcode in df['Postcode'].values]
df['Longitude'] = [ll[postcode][1] for postcode in df['Postcode'].values]
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
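
The list comprehensions above raise a `KeyError` if a postal code is missing from the cache. An alternative sketch, assuming the same dictionary layout, turns the cache into a small DataFrame and joins it, so that missing codes show up as `NaN` instead. `ll_toy`, `toy` and the postal code 'M9Z' are made up for illustration:

```python
import pandas as pd

# Cached coordinates keyed by postal code, as loaded from the JSON file
ll_toy = {
    'M1B': [43.806686, -79.194353],
    'M1C': [43.784535, -79.160497],
}

# 'M9Z' is a made-up code with no cached entry
toy = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})

# One row per postal code, with the code as the index
coords = pd.DataFrame.from_dict(ll_toy, orient='index', columns=['Latitude', 'Longitude'])

# Left join on the postal code: unmatched rows get NaN coordinates
toy = toy.join(coords, on='Postcode')
print(toy)
```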


## 3. Explore the data

Now we can use the location data to explore the venues in each neighborhood with the [Foursquare API](https://developer.foursquare.com/). We'll start by defining a function that builds the appropriate URL and puts the results into a pandas DataFrame. By default we'll search only for venues serving food.

In [10]:
CLIENT_ID = credentials["FOURSQUARE_CLIENT_ID"]
CLIENT_SECRET = credentials["FOURSQUARE_CLIENT_SECRET"]
VERSION = '20180605'

import requests

def explore_nearby_venues(names, latitudes, longitudes, section="food", radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&section={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            section, 
            radius, 
            limit)
            
        # make the GET request
        response = requests.get(url).json()["response"]["groups"][0]["items"]
        # write_json('tmp.json', response)

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in response])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
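
As a side note, the query string can also be assembled from a dictionary instead of a long format string, which makes each parameter easier to spot. This is a minimal sketch using only the standard library. `build_explore_url` is a hypothetical helper, and 'MY_ID' and 'MY_SECRET' are placeholders rather than real credentials:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180605', section='food', radius=500, limit=100):
    # Same parameters as the function above, collected in one place
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'section': section,
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters (the comma in 'll' becomes %2C)
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url(43.676357, -79.293031, 'MY_ID', 'MY_SECRET')

# Parse the URL back to check that the parameters survived the round trip
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['section'])
```

With `requests`, the same dictionary can be passed directly as `requests.get(base_url, params=params)`.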

Before calling this function, let's restrict the data to a fraction of the boroughs (those containing the word 'Toronto') to avoid making too many calls to the API.

In [11]:
toronto_data = df.loc[df['Borough'].str.contains("Toronto")].reset_index(drop=True)

toronto_venues = explore_nearby_venues(
    names=toronto_data['Neighborhood'],
    latitudes=toronto_data['Latitude'],
    longitudes=toronto_data['Longitude'],
)

toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Domino's Pizza,43.679058,-79.297382,Pizza Place
1,The Beaches,43.676357,-79.293031,Fearless Meat,43.680337,-79.290289,Burger Joint
2,The Beaches,43.676357,-79.293031,Seaspray Restaurant,43.678888,-79.298167,Asian Restaurant
3,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
4,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant


Now we want to check which categories are most popular in each neighborhood. However, we cannot apply numerical methods like `.mean()` to a categorical variable. To overcome this we'll use one-hot encoding: create one column per unique category and set it to one in the rows matching that category and to zero in all the others. This is easy to do with pandas' `get_dummies` function.

In [12]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,...,Steakhouse,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
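
To see the encoding on a scale that fits on screen, here is a minimal sketch with three made-up venues. The neighborhood names 'A' and 'B' and the `venues` DataFrame are hypothetical:

```python
import pandas as pd

# Three made-up venues in two made-up neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Pizza Place', 'Burger Joint', 'Pizza Place'],
})

# One column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Put the neighborhood back as the first column
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

print(onehot)
```

Each row has exactly one 1, in the column of that venue's category.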


Now we can calculate the mean of each category by neighborhood to have a better understanding of their popularity (the higher the mean, the higher the proportion of restaurants of that category in the neighborhood).

In [13]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Arepa Restaurant,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,...,Steakhouse,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.042105,0.0,0.0,0.063158,0.0,0.010526,0.031579,0.0,...,0.042105,0.031579,0.0,0.0,0.0,0.042105,0.0,0.031579,0.0,0.0
1,Berczy Park,0.0,0.019608,0.0,0.0,0.019608,0.019608,0.019608,0.078431,0.019608,...,0.058824,0.019608,0.0,0.0,0.019608,0.019608,0.0,0.039216,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0
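
The reason the mean works as a proportion is that each one-hot column only holds zeros and ones, so its average within a neighborhood is the share of venues in that category. A minimal sketch with made-up data (the neighborhoods 'A' and 'B' are hypothetical):

```python
import pandas as pd

# Made-up one-hot rows: 'A' has four venues, 'B' has two
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Bakery':       [1, 0, 0, 0, 0, 1],
    'Pizza Place':  [0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column = fraction of rows where it is 1
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

In this toy data one of the four venues in 'A' is a bakery, so its frequency is 0.25, and the frequencies in each row add up to 1.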


However, there are now too many columns in the DataFrame and it's difficult to see the big picture. Let's extract the 3 most popular venue categories per neighborhood and display them separately. We'll only show neighborhoods where the 3 most popular categories together account for more than half of the venues, to make sure they really stand out above the rest.

In [28]:
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    temp = temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues)
    if temp['freq'].sum() > 0.5:
        print("----"+hood+"----")
        print(temp)
        print('\n')

----Brockton, Exhibition Place, Parkdale Village----
            venue  freq
0            Café  0.27
1          Bakery  0.13
2  Breakfast Spot  0.13


----Business Reply Mail Processing Centre 969 Eastern----
                  venue  freq
0  Fast Food Restaurant   0.4
1           Pizza Place   0.2
2            Restaurant   0.2


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                      venue  freq
0       American Restaurant   0.5
1          Tapas Restaurant   0.5
2  Mediterranean Restaurant   0.0


----Christie----
                 venue  freq
0                 Café  0.38
1  American Restaurant  0.12
2  Japanese Restaurant  0.12


----Davisville North----
              venue  freq
0       Pizza Place  0.25
1  Asian Restaurant  0.25
2    Sandwich Place  0.25


----Dovercourt Village, Dufferin----
                   venue  freq
0            Pizza Place  0.25
1                 Bakery  0.25
2  Portuguese Restau
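
The same filter can be expressed without an explicit loop. The sketch below is an alternative to the cell above, not the code that produced this output, and unlike that cell it does not round the frequencies before adding them. The `freq` table and its row labels are made up:

```python
import pandas as pd

categories = ['Bakery', 'Café', 'Diner', 'Pizza Place',
              'Steakhouse', 'Sushi Restaurant', 'Taco Place', 'Thai Restaurant']

# Made-up frequencies: one neighborhood dominated by a few categories,
# one where venues are spread evenly over all eight
freq = pd.DataFrame(
    [[0.5, 0.25, 0.125, 0.125, 0.0, 0.0, 0.0, 0.0],
     [0.125] * 8],
    index=['Concentrated', 'Spread out'],
    columns=categories,
)

# Sum of the 3 largest frequencies in each row
top_share = freq.apply(lambda row: row.nlargest(3).sum(), axis=1)

# Keep only the neighborhoods where the top 3 cover more than half
standouts = top_share[top_share > 0.5].index.tolist()
print(top_share)
print(standouts)
```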

Seeing the most common venues of each neighborhood at a glance is quite convenient. Let's create a DataFrame with this information.

In [20]:
import numpy as np

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    suffix = 'th'
    if ind == 0: suffix = 'st'
    elif ind == 1: suffix = 'nd'
    elif ind == 2: suffix = 'rd'
    columns.append('{}{} Most Common Venue'.format(ind+1, suffix))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Café,Restaurant,Sandwich Place,Asian Restaurant,Salad Place,American Restaurant,Thai Restaurant,Burger Joint,Steakhouse,Deli / Bodega
1,Berczy Park,Italian Restaurant,Bakery,Sandwich Place,Steakhouse,Diner,Moroccan Restaurant,Café,French Restaurant,Bistro,Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Café,Breakfast Spot,Bakery,Vietnamese Restaurant,Pizza Place,Restaurant,Italian Restaurant,Japanese Restaurant,Burrito Place,Sandwich Place
3,Business Reply Mail Processing Centre 969 Eastern,Fast Food Restaurant,Pizza Place,Restaurant,Burrito Place,Empanada Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",American Restaurant,Tapas Restaurant,Wings Joint,Falafel Restaurant,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant
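
The suffix logic in the cell above is enough for 10 columns, but it would label the 21st column as '21th'. If `num_top_venues` ever grows, a more general helper could look like this minimal sketch (`ordinal` is a hypothetical name, not part of the original notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(10)]
print(columns[1], columns[2], columns[3], columns[10])
```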


Now let's group the neighborhoods into clusters based on their most frequent venue categories. For that we'll use the simple and effective k-means algorithm, which is already implemented in the scikit-learn package.

In [21]:
from sklearn.cluster import KMeans

# We set the number of clusters manually
# This can be changed to refine the results
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge the sorted venues (with cluster labels) into toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()


Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,1.0,Pizza Place,Burger Joint,Asian Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1.0,Greek Restaurant,Sushi Restaurant,Italian Restaurant,Pizza Place,Restaurant,Breakfast Spot,Fried Chicken Joint,Caribbean Restaurant,Café,Japanese Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,1.0,Pizza Place,Sandwich Place,Burger Joint,Fish & Chips Shop,Sushi Restaurant,Steakhouse,Italian Restaurant,Fast Food Restaurant,Burrito Place,Deli / Bodega
3,M4M,East Toronto,Studio District,43.659526,-79.340923,1.0,Café,Bakery,Gastropub,American Restaurant,Comfort Food Restaurant,Pizza Place,Italian Restaurant,Sandwich Place,Middle Eastern Restaurant,Diner
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2.0,Dim Sum Restaurant,Wings Joint,Cuban Restaurant,Diner,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Empanada Restaurant,Ethiopian Restaurant
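
To get a feel for what k-means does with frequency vectors like these, here is a minimal sketch on made-up data with only two categories. The array `X` is hypothetical, and `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequency rows with two categories each:
# the first two rows lean towards category 1, the last two towards category 2
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Rows with similar profiles share a label; the label values themselves are arbitrary ids
print(labels)
```

The labels only identify groups, so cluster 1 is not "more" of anything than cluster 0.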


It's much easier to visualize the clusters using a map. Thanks to Folium we can finish the notebook with a nice display of Toronto neighborhoods clustered by restaurant category.

In [22]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[43.6231216, -79.4108748], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    # ignore neighborhoods that lack venue info
    if not np.isnan(cluster):
        cluster = int(cluster)
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        # labels start at 0, so they index the color list directly
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
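
The cluster labels in the merged table are floats (1.0, 2.0) because neighborhoods without venue data get no label from the join, and pandas stores the gap as `NaN`. The loop above skips those rows one by one. An alternative sketch is to filter them up front and convert the labels back to integers. The `data` and `labels` DataFrames are made up for illustration:

```python
import pandas as pd

# Three made-up neighborhoods, only two of which received a cluster label
data = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Cluster Labels': [0, 1]}, index=['A', 'C'])

# Left join: 'B' has no match, so its label is NaN and the column becomes float
merged = data.join(labels, on='Neighborhood')

# Drop the unlabeled rows, then restore integer labels
clustered = merged.dropna(subset=['Cluster Labels']).copy()
clustered['Cluster Labels'] = clustered['Cluster Labels'].astype(int)

print(merged)
print(clustered)
```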