# Segmentation and Clustering of Toronto Neighborhoods
This notebook segments and clusters Toronto neighborhoods. Eric Canton wrote it as part of the IBM Data Science Capstone.
1. First, we get geographic data from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">this Wikipedia table</a> using the MediaWiki API that Wikipedia exposes, and load it into a DataFrame.
2. We then use the Foursquare API to gather information about local coffee shops (excluding big names like Starbucks, Dunkin, and Caribou) and gyms by postal code.
3. Finally, we combine these data to cluster neighborhoods by similarity.

In [1]:
# Essentials
import pandas as pd
import numpy as np

# JSON and REST API
import requests # Python library for HTTP/API requests.
from bs4 import BeautifulSoup # This package is awesome!
import json
from pandas import json_normalize # Transform JSON into DataFrames (top-level import since pandas 1.0; the pandas.io.json path is deprecated)

# Clustering and map generation
import folium
from sklearn.cluster import KMeans

## 1: Get the table JSON from Wikipedia and parse it into a DataFrame
Wikipedia has a <a href="https://www.mediawiki.org/wiki/API:Main_page">really helpful description of their API</a> which is pretty easy to understand.  

One does not need to authenticate to use their API for page requests. The main reference I used was the <a href="https://www.mediawiki.org/wiki/API:Parsing_wikitext">API: Parsing wikitext page</a>.  

__Note!__ Some things about the "List of postal codes..." table have changed since the Coursera assignment was written. For example, postal code M5A no longer appears twice, and there are no entries that have a borough but no neighbourhood. This didn't really simplify anything, and I'm confident my code would produce the desired output with the old table version. I'm concerned that I'll be evaluated based on these non-existent features being processed correctly in my Notebook, though.

In [8]:
# Build the url and request parameters
URL = "https://en.wikipedia.org/w/api.php"

# prop=text returns the rendered HTML of the page, which we parse with BeautifulSoup below.
# (prop=wikitext would instead return the raw markup used when editing Wikipedia source.)
PARAMS = {
    "action" : "parse",
    "page"   : "List_of_postal_codes_of_Canada:_M",
    "prop"   : "text",
    "format" : "json"
}

M = requests.get(url=URL, params=PARAMS).json()
page = BeautifulSoup(M['parse']['text']['*'], "html.parser")
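
The request above assumes the API call succeeds. As a minimal sketch, the helper below checks for the `error` object that the MediaWiki API returns on failure before reaching into the payload. The name `extract_parsed_html` and the toy payloads are hypothetical and not part of the original notebook.

```python
def extract_parsed_html(payload: dict) -> str:
    """Return the rendered HTML from an action=parse response, or raise on an API error."""
    if "error" in payload:
        err = payload["error"]
        raise RuntimeError(f"MediaWiki API error {err.get('code')}: {err.get('info')}")
    # With the default JSON format, the HTML sits under parse -> text -> "*".
    return payload["parse"]["text"]["*"]


# Toy payloads standing in for requests.get(...).json()
ok_payload = {"parse": {"title": "Example", "text": {"*": "<table><tr><td>M1A</td></tr></table>"}}}
bad_payload = {"error": {"code": "missingtitle", "info": "The page you specified doesn't exist."}}

html = extract_parsed_html(ok_payload)
```

In the notebook this would be used as `BeautifulSoup(extract_parsed_html(M), "html.parser")`.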

In [18]:
table = page.find_all('td')

In [82]:
pc = [] # postal code
br = [] # borough
nb = [] # neighborhood

# Loop through the table we scraped, taking 3 elements at a time.
# (The table has 3 columns.)
for j in range(int((len(table)-2)/3)):
    # There's another table at the bottom of the Wikipedia page not relevant for us. 
    # The extra stuff doesn't have an "M" as the first character, so we just throw those away.
    if not table[3*j].text.startswith("M"):
        continue # Skip this one
    
    # Skip any rows whose Borough is 'Not assigned'.
    if table[3*j+1].text == 'Not assigned':
        continue
        
    
    pc.append(table[3*j].text)
    br.append(table[3*j+1].text)
    
    # The neighborhood cell ends with a newline. Keep only the text before it.
    nbhd = table[3*j+2].text
    nb.append(nbhd.split("\n")[0]) # split() is safe even if no "\n" is present; find() would return -1 and drop the last character.
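
The loop above walks a flat list of cells three at a time. Here is a sketch of the same idea on plain strings, so it can be checked without the network: the cell texts are chunked into rows of three and then filtered. The helper names and the sample cells are made up for illustration. The fallback that copies the borough into an empty neighbourhood is an assumption about how the older table version mentioned in the note above would be handled.

```python
def chunk_rows(cells, width=3):
    """Group a flat list of cell texts into consecutive rows of `width` cells."""
    usable = len(cells) - len(cells) % width  # ignore a trailing partial row
    return [cells[i:i + width] for i in range(0, usable, width)]


def clean_rows(rows):
    """Keep only Toronto ('M...') rows that have an assigned borough."""
    cleaned = []
    for postal_code, borough, neighbourhood in rows:
        postal_code = postal_code.strip()
        borough = borough.strip()
        neighbourhood = neighbourhood.strip()
        if not postal_code.startswith("M"):
            continue  # cells from an unrelated table
        if borough == "Not assigned":
            continue
        if neighbourhood in ("", "Not assigned"):
            neighbourhood = borough  # assumed fallback for the older table layout
        cleaned.append((postal_code, borough, neighbourhood))
    return cleaned


# Made-up cell texts in the same flat order that find_all('td') would produce.
sample_cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "M7A", "Queen's Park", "Not assigned\n",
    "NL", "NS", "PE",
]
rows = clean_rows(chunk_rows(sample_cells))
```

Working on whole rows rather than on `3*j` offsets keeps the three fields of a row together, which makes the filtering rules easier to read.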

Now we're going to start building our DataFrame. The first thing we need to do is collate the neighborhoods by postcode, which we accomplish by first making a dictionary whose keys come from the <code>pc</code> list. We'll then loop over this dictionary to make a _new_ dictionary that we can cast into the desired DataFrame. More comments in the code detail when these are happening. 

In [133]:
# Each value in this dictionary will become a [neighbourhood(s), borough] list.
# We use it to group neighborhoods by their postal code/borough.
nbhds_brs_by_pc = dict.fromkeys(set(pc), 0) 

# Something I discovered while debugging this step: if we initialize the dictionary 
# this way with [] as the default value (instead of 0), every key in the dict
# points to **the same** list object. Adding a neighborhood to nbhds_brs_by_pc['M5A'] 
# then adds it to every postcode simultaneously, and we end up with one huge string of 
# every neighborhood, comma-separated.
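
The shared-list behaviour described in the comment above is easy to reproduce: `dict.fromkeys` stores the same default object under every key. The sketch below uses made-up keys, and also shows one way around the problem, a dict comprehension that builds a fresh list per key.

```python
keys = ["M1B", "M5A"]

# dict.fromkeys reuses the single list object for every key.
shared = dict.fromkeys(keys, [])
shared["M1B"].append("Rouge")
same_object = shared["M1B"] is shared["M5A"]  # both keys see the appended value

# A comprehension evaluates [] once per key, so each key gets its own list.
separate = {k: [] for k in keys}
separate["M1B"].append("Rouge")
```

Using an immutable placeholder such as `0`, as the notebook does, avoids the problem for the same reason: the placeholder is replaced rather than mutated.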

In [134]:
# zip yields tuples of the form ('M3A', 'Parkwoods', 'North York'):
# (postal code, neighborhood name, borough name). Unpack them directly for readability.
for postal_code, nbhd_name, br_name in zip(pc, nb, br): 
    if nbhds_brs_by_pc[postal_code] == 0: # Still the placeholder 0, i.e. first time we encounter this postcode.
        nbhds_brs_by_pc[postal_code] = [nbhd_name, br_name] # Replace it with a list in the format we want.
    else: # Not our first time at this postal code
        nbhds_brs_by_pc[postal_code][0] += ", " + nbhd_name # Extends the comma-separated string.

In [126]:
# Earlier draft of the loop in the previous cell, from when the dictionary defaulted to ""
# and stored only the neighborhood names. It is superseded by the previous cell.
# Do not run it after that cell: += on a [neighbourhoods, borough] list would extend
# the list character by character and corrupt it.
for post_nbhd_br in zip(pc, nb, br): 
    postal_code = post_nbhd_br[0]
    nbhd_name = post_nbhd_br[1]
    br_name = post_nbhd_br[2]
    if len(nbhds_brs_by_pc[postal_code]) == 0: # This postcode still has "" for its entry, i.e. first time we encountered it.
        nbhds_brs_by_pc[postal_code] = nbhd_name
        #nbhds_brs_by_pc[postal_code][1] = br_name # Only need to add this once. It shouldn't appear in the else below!
    else:
        nbhds_brs_by_pc[postal_code] += ", " + nbhd_name # Makes a comma-separated string.

In [135]:
# Check that the output is what we expect.
# Indeed, postcode M1B corresponds to the Rouge and Malvern nbhds of Scarborough!
nbhds_brs_by_pc['M1B']

['Rouge, Malvern', 'Scarborough']
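
For comparison, the same collation can be sketched with a pandas `groupby`, which avoids the hand-built dictionary. The three rows below are a small stand-in for the scraped `pc`, `br`, and `nb` lists, not the real data.

```python
import pandas as pd

rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "Parkwoods"],
})

# One output row per (postal code, borough); neighbourhood names are joined with ", ".
# sort=False keeps the groups in order of first appearance.
grouped = (
    rows.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)
```

Grouping on both `PostalCode` and `Borough` keeps the borough as a column in the result, so no second dictionary is needed before building the DataFrame.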

### Create the DataFrame
To close out this section, we're going to make the DataFrame from our parsed rows. My preferred method for this is to first create a dictionary, which is then easy to cast to a DataFrame. 

In [93]:
table_dict = {'PostalCode':[], 'Borough':[], 'Neighbourhood':[]}

In [138]:
for k in nbhds_brs_by_pc.keys():
    table_dict['PostalCode'].append(k)
    table_dict['Neighbourhood'].append(nbhds_brs_by_pc[k][0])
    table_dict['Borough'].append(nbhds_brs_by_pc[k][1])

In [140]:
table_df = pd.DataFrame.from_dict(table_dict)

In [141]:
table_df.head(12)

,PostalCode,Borough,Neighbourhood
0,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights"
1,M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen..."
2,M9M,North York,"Emery, Humberlea"
3,M6K,West Toronto,"Brockton, Exhibition Place, Parkdale Village"
4,M6B,North York,Glencairn
5,M4C,East York,Woodbine Heights
6,M4M,East Toronto,Studio District
7,M9B,Etobicoke,"Cloverdale, Islington, Martin Grove, Princess ..."
8,M1J,Scarborough,Scarborough Village
9,M3K,North York,"CFB Toronto, Downsview East"


In [142]:
table_df.shape

(103, 3)

## 2: Getting GPS coordinates for the postal codes
The assignment tells us to use either the geocoder library or the linked CSV of coordinates. The geocoder library did not work well back then and doesn't seem to work at all now, so I used the CSV. That said, it would have been pretty easy to use the Nominatim geocoder from geopy for this!
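
As a rough sketch of the geopy route mentioned above (not what this notebook actually does): geopy's `Nominatim` geocoder exposes a `geocode` method that returns an object with `latitude` and `longitude`, or `None` when nothing is found. The helper names, the query format, and the offline stand-in are assumptions for illustration. Whether Nominatim resolves three-character postal code prefixes reliably has not been checked here.

```python
from types import SimpleNamespace  # only needed for the offline stand-in below


def make_nominatim_geocode(user_agent="toronto-neighbourhoods-example"):
    """Return geopy's Nominatim geocode method (requires geopy and network access)."""
    from geopy.geocoders import Nominatim  # imported lazily; geopy is optional here
    return Nominatim(user_agent=user_agent).geocode


def lookup_coords(postal_code, geocode):
    """Return (latitude, longitude) for a postal code, or None if nothing is found."""
    location = geocode(f"{postal_code}, Toronto, Ontario, Canada")  # assumed query format
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Offline stand-in so the sketch runs without network access.
def fake_geocode(query):
    if query.startswith("M1B"):
        return SimpleNamespace(latitude=43.806686, longitude=-79.194353)
    return None


coords_m1b = lookup_coords("M1B", fake_geocode)
```

With geopy installed, `lookup_coords("M1B", make_nominatim_geocode())` would query the live service instead. The public Nominatim service rate-limits requests, so a loop over all postal codes should pause between calls.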

In [144]:
coords = pd.read_csv("Geospatial_Coordinates.csv")

In [145]:
coords.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [147]:
coords.columns = ["PostalCode", "Latitude", "Longitude"]

Now we join the two DataFrames by first setting the index of each to "PostalCode". This nicely adds the columns with GPS coordinates to the right of our existing columns!

In [152]:
table_df = table_df.set_index("PostalCode").join(other=coords.set_index("PostalCode"))

In [154]:
# Sanity check at this point...
table_df.head()

PostalCode,Borough,Neighbourhood,Latitude,Longitude
M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259
M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen...",43.628841,-79.520999
M9M,North York,"Emery, Humberlea",43.724766,-79.532242
M6K,West Toronto,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191
M6B,North York,Glencairn,43.709577,-79.445073
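
A sketch of what the join does, on a toy example: `DataFrame.join` defaults to a left join on the index, so every postal code in the left frame is kept, and any code missing from the coordinates file gets `NaN`. The code `M9Z` is made up to show an unmatched row. The other values are copied from the outputs above.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M3H", "M9Z"],
    "Borough": ["Scarborough", "North York", "Etobicoke"],
}).set_index("PostalCode")

right = pd.DataFrame({
    "PostalCode": ["M1B", "M3H"],
    "Latitude": [43.806686, 43.754328],
    "Longitude": [-79.194353, -79.442259],
}).set_index("PostalCode")

# join() defaults to a left join on the index: every row of `left` is kept.
joined = left.join(right)

# Postal codes with no match get NaN coordinates; list them explicitly.
missing = joined[joined["Latitude"].isna()].index.tolist()
```

Running the same `isna()` check on `table_df` after the real join would confirm that every postal code found its coordinates.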


## 3: Some analysis of the data
There are probably lots of interesting things to be done with this data. Honestly, though, I'm too excited for weeks 4 and 5, where I already have an idea for a data science project relevant to my life, so here I'm just going to repeat the analysis we did for Manhattan.