# Segmentation and Clustering of Toronto Neighborhoods
This notebook segments and clusters Toronto neighborhoods. Eric Canton wrote it as part of the IBM Data Science Capstone.
1. First, we get geographic data from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">this Wikipedia table</a> using the MediaWiki Action API and load them into a DataFrame.
2. We then use the Foursquare API to gather information about local coffee shops (excluding big chains like Starbucks, Dunkin, and Caribou) and gyms by postal code.
3. Finally, we combine these data to cluster neighborhoods by similarity.

In [3]:
# Essentials
import pandas as pd
import numpy as np

# JSON and REST API
import requests # Python library for HTTP/API requests.
import json
from pandas import json_normalize # Transform JSON into DataFrames (the pandas.io.json import path is deprecated since pandas 1.0)

# Clustering and map generation
import folium
from sklearn.cluster import KMeans

## 1: Get the table JSON from Wikipedia and parse it into a DataFrame
MediaWiki, the software behind Wikipedia, has a <a href="https://www.mediawiki.org/wiki/API:Main_page">helpful description of its API</a> that is easy to follow.

Read-only requests like ours do not require authentication. The main reference I used was the <a href="https://www.mediawiki.org/wiki/API:Parsing_wikitext">API: Parsing wikitext page</a>.

In [26]:
# Build the url and request parameters
URL = "https://en.wikipedia.org/w/api.php"

# Specifying prop=wikitext returns the page's wikitext, the markup you see when editing Wikipedia source.
PARAMS = {
    "action" : "parse",
    "page"   : "List_of_postal_codes_of_Canada:_M",
    "prop"   : "wikitext",
    "format" : "json"
}

M = requests.get(url=URL, params=PARAMS).json()
wiki_text = M['parse']['wikitext']['*'] # The wikitext source
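
The lookup above assumes the request succeeded and the response has the expected shape. Below is a minimal sketch of a more defensive extraction. It assumes the same layout used above (`parse`, then `wikitext`, then `*`) and the `error` object with `code` and `info` fields that the MediaWiki API returns on failure. The name `extract_wikitext` is a hypothetical helper, and both sample payloads are hand-written stand-ins rather than real API output.

```python
def extract_wikitext(payload):
    """Return the wikitext string from a parsed action=parse JSON response.

    `payload` is the dict produced by `response.json()`.
    """
    # The MediaWiki API reports failures in a top-level "error" object.
    if "error" in payload:
        err = payload["error"]
        raise ValueError(f"API error {err.get('code')}: {err.get('info')}")
    try:
        return payload["parse"]["wikitext"]["*"]
    except KeyError as missing:
        raise ValueError(f"Unexpected response shape, missing key {missing}") from None


# Hand-written stand-ins for real responses (not actual API output).
ok = {"parse": {"title": "Example", "wikitext": {"*": '{| class="wikitable"\n|}'}}}
bad = {"error": {"code": "missingtitle", "info": "The page you specified doesn't exist."}}

print(extract_wikitext(ok))
```

In the notebook this would be called as `extract_wikitext(requests.get(url=URL, params=PARAMS).json())`. Calling `raise_for_status()` on the response first would also surface HTTP-level failures.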

### Find and parse the table.
By either looking at the JSON, or navigating to <a href="https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&action=edit&section=1">[edit source]</a> on the Wikipedia page, we can see that the table is delimited by <code>{| .... |}</code>. Our strategy for finding the table, splitting it into rows, and parsing each row is the following.
1. Use the <code>wiki_text.find</code> method to locate the table (as the text between {| and |}) and trim the wikitext down into a new string called <code>table</code>.
2. Looking more closely at <code>table</code>, we see that the entries in the table are split up by <code>\n|-\n|</code>. We can use the <code>split</code> method to cut <code>table</code> into its <code>rows</code>. 
3. Each row has a very reliable format (see the output to the cell below) making it easy to further drop any rows missing a borough, then <code>split</code> the rows into their fields.  
4. Once we have the fields for each row, we build the DataFrame from them.

In [57]:
table_start = wiki_text.find("{|")+2 # skip the leading {|
table_end = wiki_text.find("|}", table_start) # search from table_start so an earlier |} cannot match; slicing stops just before this index.
table = wiki_text[table_start:table_end] # slice the string to get just those characters defining the table. 
rows = table.split("\n|-\n|") # split the table into its rows, discarding the newlines and |- to give just the interesting stuff.

# Get the column names, which we'll use to form our DataFrame later on.
cols = rows[0] 
cols = cols[cols.find("!"):] # There's actually one line before the columns, class="wikitable sortable", which we discard.

# Now just keep the actual records. 
rows = rows[1:] 

print(cols)
print(rows[0]) # take a look at the format of rows

! Postcode !! Borough !! Neighbourhood
 M1A || Not assigned || Not assigned
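
To see the find-and-split steps in isolation, here is a small sketch that runs them on a hand-written sample shaped like the page's wikitext. The sample string is an assumption for illustration, not real page content, and the names `sample`, `pieces`, `header`, and `body` are hypothetical.

```python
# Hand-written sample in the same shape as the page's table (not real data).
sample = (
    "Intro text.\n"
    '{| class="wikitable sortable"\n'
    "! Postcode !! Borough !! Neighbourhood\n"
    "|-\n"
    "| M1A || Not assigned || Not assigned\n"
    "|-\n"
    "| M3A || [[North York]] || [[Parkwoods]]\n"
    "|}\n"
    "Trailing text."
)

start = sample.find("{|") + 2        # skip the opening {|
end = sample.find("|}", start)       # look for the closing |} after the opening
table = sample[start:end]

pieces = table.split("\n|-\n|")      # row separator used by the wikitable
header = pieces[0][pieces[0].find("!"):]   # drop the class="..." line
body = [p.strip() for p in pieces[1:]]     # trim the leading space and final newline

print(header)
print(body)
```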


  
  
Per the instructions for this assignment, we want to throw away every entry that has "Not assigned" in borough, which is the first column after the postal code.

The <code>find</code> method returns the index where it first discovers a substring, or -1 if the substring isn't found, so we keep only those rows for which <code>find</code> returns -1. No row has a neighborhood name without a borough name, so it is enough to drop the rows containing <code>Not assigned || Not assigned</code>.

In [58]:
rows = [r for r in rows if r.find("Not assigned || Not assigned") == -1] 
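
The `find(...) == -1` test is equivalent to Python's `not in` operator, which some readers find easier to scan. A short sketch on hand-written rows (illustrative only, not taken from the page) compares the two:

```python
# Hand-written rows in the same format as the parsed table (illustrative only).
rows_demo = [
    " M1A || Not assigned || Not assigned",
    " M3A || North York || Parkwoods",
    " M7A || Queen's Park || Not assigned",
]
marker = "Not assigned || Not assigned"

kept_find = [r for r in rows_demo if r.find(marker) == -1]
kept_in = [r for r in rows_demo if marker not in r]

# Both filters keep the same rows; only the fully unassigned row is dropped.
assert kept_find == kept_in
print(kept_in)
```

A row with a borough but no neighbourhood survives the filter, which is what the later fallback step relies on.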

Now we split the header into column names and process the rows into a list of records.

In [59]:
cols = cols[2:].split(" !! ") # Start slicing at index 2 since cols starts like "! Postcode..."

records = []
for r in rows:
    rec = [field.strip() for field in r.split(' || ')] # strip the leading space and any trailing newline from each field
    
    # Some of the records have [[...]] around place names, because Wikipedia has a page on that place and they want to give a link. 
    # This makes Wikipedia great, but is not what we want. Get rid of these as we go.
    for i in range(3):
        rec[i] = rec[i][2:-2] if (rec[i].find("[[") > -1) else rec[i]
    records.append(rec)
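
The slice `[2:-2]` assumes a field is exactly one plain link. As a hedged alternative, the sketch below uses a regular expression that also handles piped links (`[[Target|Label]]`) and fields with several links. `strip_wikilinks` is a hypothetical helper, the sample strings are hand-written, and keeping the label rather than the target is a design choice you may want to reverse.

```python
import re

# Matches [[Target]] or [[Target|Label]]; group 1 is the target, group 2 the optional label.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def strip_wikilinks(text):
    """Replace each wikilink with its visible text (label if present, else target)."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), text).strip()


# Hand-written examples (illustrative only).
print(strip_wikilinks("[[North York]]"))
print(strip_wikilinks(" [[Downtown Toronto|Downtown]] "))
print(strip_wikilinks("[[Regent Park]] / [[Harbourfront]]"))
print(strip_wikilinks("Not assigned"))
```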

### Create the DataFrame
To close out this section, we're going to make the DataFrame from our parsed rows. My preferred method for this is to first create a dictionary, which is then easy to cast to a DataFrame. 

In [96]:
cols = {c : [] for c in cols} # Set up a dict with keys we got from the table columns.

for rec in records:
    cols['Postcode'].append(rec[0])
    cols['Borough'].append(rec[1])
    if rec[2] == "Not assigned": # For the assignment, we should re-use borough name for neighborhood if "Not assigned"
        cols['Neighbourhood'].append(rec[1]) 
    else:
        cols['Neighbourhood'].append(rec[2])
        
M = pd.DataFrame.from_dict(cols)
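
The same dictionary-of-columns idea can be written without hard-coding the column names. This is a sketch in plain Python, assuming each record has exactly three fields in the order postcode, borough, neighbourhood. `build_columns` is a hypothetical helper and the sample records are hand-written.

```python
def build_columns(names, records, missing="Not assigned"):
    """Build a {column name: list of values} dict from 3-field records.

    When the neighbourhood equals `missing`, the borough is reused in its place.
    """
    columns = {name: [] for name in names}
    for postcode, borough, neighbourhood in records:
        if neighbourhood == missing:
            neighbourhood = borough
        for name, value in zip(names, (postcode, borough, neighbourhood)):
            columns[name].append(value)
    return columns


names = ["Postcode", "Borough", "Neighbourhood"]
# Hand-written records (illustrative only).
sample_records = [
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]
columns = build_columns(names, sample_records)
print(columns)
```

The resulting dictionary has one equal-length list per column, so it can be passed to `pd.DataFrame.from_dict` in the same way as `cols`.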

In [97]:
M.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park\|Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |


In [93]:
len(cols["Postcode"])

210

In [94]:
len(cols["Borough"])

210

In [95]:
len(cols["Neighbourhood"])

210