# Segmentation and Clustering of Toronto Neighborhoods
This notebook segments Toronto neighborhoods. Eric Canton wrote it as part of completing the IBM Data Science Capstone. 
1. First, we get geographic data from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">this Wikipedia table</a> using the MediaWiki Action API and load it into a DataFrame.
2. We then use the Foursquare API to gather some info about local coffee shops (excluding big names like Starbucks, Dunkin', and Tim Hortons) by postal code. 
3. Finally, we combine these data to cluster neighborhoods by similarity, based on the density of local coffee shops versus all coffee shops.

In [1]:
# Essentials
import pandas as pd
import numpy as np

# JSON and REST API
import requests # Python library for HTTP/API requests.
from bs4 import BeautifulSoup # This package is awesome!
import json

# Clustering and map generation
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

## 1: Get the table JSON from Wikipedia and parse it into a DataFrame
Wikipedia has a <a href="https://www.mediawiki.org/wiki/API:Main_page">really helpful description of their API</a> which is pretty easy to understand.  

One does not need to authenticate to use their API for page requests. The main reference I used was the <a href="https://www.mediawiki.org/wiki/API:Parsing_wikitext">API: Parsing wikitext page</a>.  

__Note!__ Some things about the "List of postal codes..." table have changed since the Coursera assignment was written. For example, postal code M5A no longer appears twice, and there are no entries that have a borough but no neighbourhood. This didn't really simplify anything, and I'm confident my code would produce the desired output with the old table version. I'm concerned that I'll be evaluated based on these non-existent features being processed correctly in my Notebook, though.

In [2]:
# Build the url and request parameters
URL = "https://en.wikipedia.org/w/api.php"

# prop=text returns the rendered HTML of the page, which we parse with BeautifulSoup below.
# (prop=wikitext would instead return the simpler format used when editing Wikipedia source.)
PARAMS = {
    "action" : "parse",
    "page"   : "List_of_postal_codes_of_Canada:_M",
    "prop"   : "text",
    "format" : "json"
}

M = requests.get(url=URL, params=PARAMS).json()
page = BeautifulSoup(M['parse']['text']['*'], "html.parser")
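
To see what this request looks like without touching the network, the sketch below builds the query string with the standard library's `urllib.parse.urlencode`. The helper name `build_query_url` is made up for illustration, and the result should only be read as roughly what `requests` assembles from `params`.

```python
from urllib.parse import urlencode

URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "parse",
    "page": "List_of_postal_codes_of_Canada:_M",
    "prop": "text",      # ask for the rendered HTML of the page
    "format": "json",
}

def build_query_url(base, params):
    """Return base plus the percent-encoded query string (hypothetical helper)."""
    return base + "?" + urlencode(params)

query_url = build_query_url(URL, PARAMS)
print(query_url)
```

Note that the colon in the page title is percent-encoded, which is why the parameters are passed as a dictionary rather than pasted into the URL by hand.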

In [3]:
table = page.find_all('td')

In [4]:
pc = [] # postal code
br = [] # borough
nb = [] # neighborhood

# Loop through the table we scraped, taking 3 elements at a time.
# (The table has 3 columns.)
for j in range(int((len(table)-2)/3)):
    # There's another table at the bottom of the Wikipedia page not relevant for us. 
    # The extra stuff doesn't have an "M" as the first character, so we just throw those away.
    if table[3*j].text.find("M") != 0:
        continue # Skip this one
    
    # Skip any rows whose Borough is 'Not assigned'.
    if table[3*j+1].text == 'Not assigned':
        continue
        
    
    pc.append(table[3*j].text)
    br.append(table[3*j+1].text)
    
    # There are newlines at the end of the names. Trim these off before appending.
    nbhd = table[3*j+2].text
    nb.append(nbhd.split("\n")[0]) # Keep the text before the first "\n"; unlike slicing at find("\n"), this is safe when no newline is present.
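
The index arithmetic `3*j`, `3*j+1`, `3*j+2` can also be expressed by first grouping the flat list of cells into rows of three. The sketch below does this on plain strings standing in for the `.text` of each `<td>`; the sample cells and the helper name `rows_of_three` are illustrative assumptions, not scraped data.

```python
def rows_of_three(cells):
    """Group a flat list into consecutive 3-tuples, dropping any leftover cells."""
    it = iter(cells)
    return list(zip(it, it, it))

# Illustrative stand-ins for the text of each <td> element.
cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "Canadian postal codes", "NL", "NS\n",   # junk from an unrelated table
]

rows = []
for code, borough, nbhd in rows_of_three(cells):
    if not code.startswith("M"):      # skip cells from other tables
        continue
    if borough == "Not assigned":     # skip unassigned postal codes
        continue
    rows.append((code, borough, nbhd.strip()))

print(rows)
```

Unpacking each row into named variables keeps the two skip conditions readable, at the cost of one extra helper.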

Now we're going to start building our DataFrame. The first thing we need to do is collate the neighborhoods by postcode, which we accomplish by first making a dictionary whose keys come from the <code>pc</code> list. We'll then loop over this dictionary to make a _new_ dictionary that we can cast into the desired DataFrame. More comments in the code detail when these are happening. 

In [5]:
# A given element of this dictionary will be ['Neighbourhood(s)', 'Borough'] pairs. 
# We're going to use this to group neighborhoods by their postal code/borough. 
nbhds_brs_by_pc = dict.fromkeys(set(pc), 0) 

# Something I discovered while debugging this step: if we initialize the dictionary 
# this way with [] as the default value (instead of 0), every key in the dict
# actually points to **the same** list object. Thus, if you add a neighborhood to 
# nbhds_brs_by_pc['M5A'], it is added to every postcode simultaneously, and we end up 
# with one huge entry containing every neighborhood.
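
The shared-default behavior described in the comment above can be reproduced in a few lines. This is a minimal sketch using two postal codes as keys; the dict comprehension shown as a workaround is one common option, not the approach this notebook takes.

```python
codes = ["M1B", "M5A"]

# Pitfall: dict.fromkeys stores the *same* list object under every key.
shared = dict.fromkeys(codes, [])
shared["M1B"].append("Rouge")
print(shared["M5A"])      # the append is visible under M5A as well

# Workaround: a dict comprehension evaluates [] once per key.
separate = {code: [] for code in codes}
separate["M1B"].append("Rouge")
print(separate["M5A"])    # M5A keeps its own, still empty, list
```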

In [6]:
# zip creates an iterator spitting out things of the form ('M3A', 'Parkwoods', 'North York')
# This makes for some ugly multi-indices, so for ease of reference/grading:
#    post_nbhd_br[0] is the postal code
#    post_nbhd_br[1] is the neighborhood name
#    post_nbhd_br[2] is the borough name
for post_nbhd_br in zip(pc, nb, br): 
    postal_code = post_nbhd_br[0]
    nbhd_name = post_nbhd_br[1]
    br_name = post_nbhd_br[2]
    if nbhds_brs_by_pc[postal_code] == 0: # This postcode still has 0 for its entry, i.e. first time we encountered it.
        nbhds_brs_by_pc[postal_code] = [nbhd_name, br_name] # Turn it into an array with the format we want.
    else: # Not our first time at this postal code
        nbhds_brs_by_pc[postal_code][0] += ", " + nbhd_name # Makes a comma-separated string.

In [7]:
# Check that the output is what we expect.
# We should see ['Rouge, Malvern', 'Scarborough'], meaning postcode M1B corresponds to the Rouge and Malvern nbhds of Scarborough.
nbhds_brs_by_pc['M1B']

['Rouge, Malvern', 'Scarborough']
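
For comparison, the same grouping can be sketched with `collections.defaultdict`, which removes the need for the `0` sentinel. This is an alternative and not the code used above; the three sample rows are a small stand-in for the full `pc`, `nb`, and `br` lists.

```python
from collections import defaultdict

# Small stand-in for the scraped lists.
pc = ["M3A", "M1B", "M1B"]
nb = ["Parkwoods", "Rouge", "Malvern"]
br = ["North York", "Scarborough", "Scarborough"]

names_by_pc = defaultdict(list)   # postal code -> list of neighborhood names
borough_by_pc = {}                # postal code -> borough name

for postal_code, nbhd_name, br_name in zip(pc, nb, br):
    names_by_pc[postal_code].append(nbhd_name)
    borough_by_pc[postal_code] = br_name

# Same [neighborhoods, borough] layout as nbhds_brs_by_pc.
grouped = {
    code: [", ".join(names), borough_by_pc[code]]
    for code, names in names_by_pc.items()
}
print(grouped["M1B"])
```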

### Create the DataFrame
To close out this section, we're going to make the DataFrame from our parsed rows. My preferred method for this is to first create a dictionary, which is then easy to cast to a DataFrame. 

In [8]:
table_dict = {'PostalCode':[], 'Borough':[], 'Neighbourhood':[]}

In [9]:
for k in nbhds_brs_by_pc.keys():
    table_dict['PostalCode'].append(k)
    table_dict['Neighbourhood'].append(nbhds_brs_by_pc[k][0])
    table_dict['Borough'].append(nbhds_brs_by_pc[k][1])

In [10]:
table_df = pd.DataFrame.from_dict(table_dict)

In [11]:
table_df.head(12)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3K | North York | CFB Toronto, Downsview East |
| 1 | M1J | Scarborough | Scarborough Village |
| 2 | M4N | Central Toronto | Lawrence Park |
| 3 | M2M | North York | Newtonbrook, Willowdale |
| 4 | M4S | Central Toronto | Davisville |
| 5 | M1G | Scarborough | Woburn |
| 6 | M5P | Central Toronto | Forest Hill North, Forest Hill West |
| 7 | M1P | Scarborough | Dorset Park, Scarborough Town Centre, Wexford ... |
| 8 | M5B | Downtown Toronto | Ryerson, Garden District |
| 9 | M5S | Downtown Toronto | Harbord, University of Toronto |


In [12]:
table_df.shape

(103, 3)

## 2: Getting GPS coordinates for the postal codes
The assignment tells us to use either the geocoder library or a provided CSV of coordinates. The geocoder library did not work well then and doesn't seem to work at all now, so I used the CSV. I'll note that it would have been pretty easy to use the Nominatim part of geopy for this!

In [13]:
coords = pd.read_csv("Geospatial_Coordinates.csv")

In [14]:
coords.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [15]:
coords.columns = ["PostalCode", "Latitude", "Longitude"]

Now we join the two DataFrames by first setting the index of each to "PostalCode". This nicely adds the columns with GPS coordinates to the right of our existing columns!

In [16]:
table_df = table_df.set_index("PostalCode").join(other=coords.set_index("PostalCode"))
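
Because `join` defaults to a left join on the index, any postal code missing from the coordinates file would silently get `NaN` coordinates. The sketch below shows the column order after the join and one way to check for such gaps. The toy frames are assumptions for illustration, and "M9Z" is a made-up postal code included only to trigger the missing case.

```python
import pandas as pd

# Toy stand-ins for table_df and coords; "M9Z" is a made-up code.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

joined = left.set_index("PostalCode").join(coords.set_index("PostalCode"))

# The joined columns are appended after the existing ones.
print(list(joined.columns))

# Postal codes with no match in coords end up with NaN coordinates.
missing = joined[joined["Latitude"].isna()].index.tolist()
print(missing)
```

If `missing` is non-empty on the real data, those postal codes would need coordinates from another source before any map or API call uses them.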

In [17]:
# Sanity check at this point...
table_df.head()

| PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| M3K | North York | CFB Toronto, Downsview East | 43.737473 | -79.464763 |
| M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| M2M | North York | Newtonbrook, Willowdale | 43.789053 | -79.408493 |
| M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |


## 3: Some analysis of the data
Let's find the "coffee havens" in Toronto. We'll restrict to postal codes that have "Toronto" somewhere in the borough name, then get the 50 top coffee places within 500 meters of the GPS coordinates. We will ignore the big coffee chains Starbucks, Dunkin', and Tim Hortons when searching for the postal codes having the most coffee places, though we do remember the total number of coffee places returned to find the places in Toronto where the big chains are more common.
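
The counting rule described above (drop the big chains, but remember the total) can be sketched as a small pure-Python helper. The function name `summarize` and the mixed sample list are assumptions for illustration; the actual loop in this notebook works on Foursquare response items rather than bare names.

```python
CHAINS = {"Starbucks", "Dunkin'", "Tim Hortons"}

def summarize(names, chains=CHAINS):
    """Return (local count, total count, local share) for a list of venue names."""
    local = [n for n in names if n not in chains]
    total = len(names)
    # Guard against postal codes where no coffee places are returned.
    share = len(local) / total if total else 0.0
    return len(local), total, share

sample = ["Jules Cafe Patisserie", "Starbucks", "Second Cup", "Tim Hortons"]
print(summarize(sample))
print(summarize([]))
```

Matching on exact names is simple but brittle: a venue listed under a slightly different spelling would be counted as local.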

In [18]:
import os
# Read credentials from the environment instead of hard-coding them (the variable names are an assumed convention).
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret

In [19]:
URL = "https://api.foursquare.com/v2/venues/explore?"
PAR = dict(
    limit = '50', # maximum currently allowed by Foursquare
    client_id = CLIENT_ID,
    client_secret = CLIENT_SECRET,
    v = '20190729',
    radius = '500',
    section='coffee',
    ll = '' # to be filled in during our loop over postal codes
)

In [20]:
venues_list = []
postcode_qty = []

for P in table_df.index: # remember! we set the index to PostalCode
    if (table_df.loc[P].Borough).find("Toronto") == -1:
        #print("{} has borough {}, which doesn't contain the word Toronto. Skipping!".format(P, table_df.loc[P].Borough))
        continue
    PAR['ll'] = '{},{}'.format(table_df.loc[P]['Latitude'], table_df.loc[P]['Longitude'])
    
    try:
        print("Getting coffee spots near {}...".format(P))
        results = requests.get(url=URL, params=PAR).json()['response']['groups'][0]['items']
        good_coffee = []
        for res in results:
            if res['venue']['name'] in ["Starbucks", "Dunkin'", "Tim Hortons"]: # List to exclude found by exploring on Foursquare website a bit.
                continue
            good_coffee.append((res['venue']['name'], res['venue']['location']['lat'], res['venue']['location']['lng']))
        # Share of local shops; 0 when Foursquare returns no coffee places at all.
        percent = len(good_coffee)/len(results) if results else 0
        postcode_qty.append((P, len(good_coffee), len(results), percent))
        venues_list += good_coffee
        
        #print("I found {} smaller coffee spots near {}, {} total.".format(len(good_coffee), P, len(results)))

        
    except Exception as inst:
        print("Error when getting venues near {}...".format(P))
        print(type(inst), inst)

Getting coffee spots near M4N...
Getting coffee spots near M4S...
Getting coffee spots near M5P...
Getting coffee spots near M5B...
Getting coffee spots near M5S...
Getting coffee spots near M7A...
Getting coffee spots near M4X...
Getting coffee spots near M4M...
Getting coffee spots near M6R...
Getting coffee spots near M7Y...
Getting coffee spots near M5K...
Getting coffee spots near M5J...
Getting coffee spots near M6J...
Getting coffee spots near M5T...
Getting coffee spots near M6K...
Getting coffee spots near M5E...
Getting coffee spots near M4R...
Getting coffee spots near M5H...
Getting coffee spots near M5A...
Getting coffee spots near M5X...
Getting coffee spots near M6S...
Getting coffee spots near M6G...
Getting coffee spots near M5L...
Getting coffee spots near M4E...
Getting coffee spots near M4L...
Getting coffee spots near M5W...
Getting coffee spots near M5R...
Getting coffee spots near M5N...
Getting coffee spots near M4P...
Getting coffee spots near M5G...
Getting co

In [21]:
# Turn this into a DataFrame indexed by postal code, which we'll do k-means clustering on, with k=3.
postcode_df = pd.DataFrame(postcode_qty, columns =['PostalCode', 'Small', 'Total', 'Percent']).set_index('PostalCode') 
postcode_df.head()

| PostalCode | Small | Total | Percent |
|---|---|---|---|
| M4N | 0 | 0 | 0.0 |
| M4S | 7 | 8 | 0.875 |
| M5P | 0 | 0 | 0.0 |
| M5B | 31 | 50 | 0.62 |
| M5S | 19 | 21 | 0.904762 |


## Clustering and analysis
Now we would like to cluster the postal codes by similarity of their coffee shop offerings, with _k_=3 clusters. I chose 3 because I expect the clusters to correspond to something like 
1. "Zero/very few coffee shops", 
2. "More coffee shops, but not many local shops", and 
3. "More coffee shops, mostly/all local."  

We're going to work with un-normalized data, because we want to be able to distinguish between a place like M4P which has 100% local coffee, but only 3 shops, and M5S, which has 90.5% local coffee and 21 total places. That is, we want to tell the difference between ">90% local, but only 3 places" and ">90% local, but many places to choose from".
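
To make the effect of this choice concrete: k-means groups points by squared Euclidean distance, so each feature's contribution depends on its scale. The sketch below breaks down the distance between M6R and M5S (rows taken from this notebook's output tables) by feature. It is an illustration of the scale issue only and does not run the clustering itself.

```python
# (Small, Total, Percent) rows as they appear in this notebook's output tables.
m6r = {"Small": 3, "Total": 3, "Percent": 1.0}
m5s = {"Small": 19, "Total": 21, "Percent": 0.904762}

# Per-feature contribution to the squared Euclidean distance between the rows.
contrib = {name: (m5s[name] - m6r[name]) ** 2 for name in m6r}
total = sum(contrib.values())

for name, value in contrib.items():
    print("{:8s} {:10.4f} {:10.4%}".format(name, value, value / total))
```

On raw features the two count columns account for nearly all of the distance, which is the behavior we want here: two postal codes that are both over 90% local are still kept apart when their shop counts differ.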

In [22]:
# We'll try with k=3.
kmeans = KMeans(n_clusters=3, random_state=4)

In [23]:
clusts = kmeans.fit(postcode_df)

In [24]:
clust_labels = clusts.labels_

In [25]:
# Let's add a column to compare our clusters
postcode_df['Cluster'] = clust_labels

In [26]:
postcode_df.head(20)

| PostalCode | Small | Total | Percent | Cluster |
|---|---|---|---|---|
| M4N | 0 | 0 | 0.0 | 0 |
| M4S | 7 | 8 | 0.875 | 0 |
| M5P | 0 | 0 | 0.0 | 0 |
| M5B | 31 | 50 | 0.62 | 1 |
| M5S | 19 | 21 | 0.904762 | 2 |
| M7A | 10 | 23 | 0.434783 | 2 |
| M4X | 6 | 8 | 0.75 | 0 |
| M4M | 12 | 13 | 0.923077 | 2 |
| M6R | 3 | 3 | 1.0 | 0 |
| M7Y | 1 | 1 | 1.0 | 0 |


In [44]:
venues_list[:5]

[('Jules Cafe Patisserie', 43.70413799694304, -79.38841260442167),
 ('Thobors Boulangerie Patisserie Café',
  43.704513877453266,
  -79.38861602551758),
 ('Second Cup', 43.704344001380505, -79.38865888961692),
 ('Meow Cat Cafe', 43.702927, -79.38819),
 ("Timothy's World Coffee", 43.706027101909015, -79.38936863344793)]

In [51]:
colours = cm.Spectral(np.linspace(0, 1, 3)) # Make our colormap with 3 colors.

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.6529, -79.3849], zoom_start=12)

# Generate Folium markers. First for loop is over unique postal codes. Second loop is over venues_list.
# We make a big 500-meter circle for each postal code, filled with its cluster's color.
for P in postcode_df.index:
    clu = postcode_df.loc[P]['Cluster']
    col = str(colors.rgb2hex(colours[int(clu)][:3])) # Just get the (r, g, b) part of the color, convert that into a #hex colorcode string for fill_color below.
    
    lat = table_df.loc[P].Latitude
    lng = table_df.loc[P].Longitude
    
    text = "{}\n{}\n local: {} total: {}".format(P, table_df.loc[P].Borough, postcode_df.loc[P]['Small'], postcode_df.loc[P]['Total'])
    label = folium.Popup(text, parse_html=False)
    folium.Circle( # I'm using Circle instead of CircleMarker because this way radius is in meters, not pixels. Thus, the 500-meter circles should show the shops!
        [lat, lng],
        radius=500,
        popup=label,
        color='green',
        fill=True,
        fill_color=col,
        fill_opacity=0.7
    ).add_to(map_toronto)
    
for ven in set(venues_list):
    text = ven[0]
    lat = ven[1]
    lng = ven[2]
    label = folium.Popup(text, parse_html=False)
    folium.Circle(
        [lat, lng],
        radius=10,
        popup=label,
        color='black',
        fill=True
    ).add_to(map_toronto)
map_toronto

From this map, we can see that the clusters follow geography closely, with most of the coffee concentrated in Downtown Toronto. The clustering appears to be driven by the total number of shops nearby more than by the percentage of shops that are local. This was more or less what we expected, and it is a good example of why one must normalize features in order to treat them on an equal footing. However, the clusters don't really correspond to my guess that the 3 categories would be
1. "Zero/very few coffee shops", 
2. "More coffee shops, but not many local shops", and 
3. "More coffee shops, mostly/all local."  

Nevertheless, I think the outcome of this clustering is satisfactory for seeing that no matter the coffee you're craving, the real coffee haven in Toronto is downtown. 