<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

<h2>Part 1</h2>

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

Make sure the Wikipedia page is reachable

In [2]:
wikiPage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
wikiPage

<Response [200]>
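
A status of 200 shows the request succeeded this time. As a minimal sketch of a stricter check, the hypothetical helper `ensure_ok` below (not part of the original notebook) relies on `Response.raise_for_status()` so that a 4xx/5xx reply stops the run early instead of surfacing later as a missing table. The demonstration builds `Response` objects by hand so it runs without network access.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged, raising requests.HTTPError on 4xx/5xx.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    response.raise_for_status()
    return response


# Offline demonstration with hand-built Response objects (no request is sent).
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

passed = ensure_ok(ok) is ok
try:
    ensure_ok(missing)
    failed = False
except requests.HTTPError:
    failed = True
```

In the notebook this could wrap the real call, for example `wikiPage = ensure_ok(requests.get(url, timeout=10))`, where the 10-second timeout is an arbitrary choice.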

Scrape the page contents (look for the table tag)

In [3]:
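# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(wikiPage.text, "html.parser")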
# Get the table having the class wikitable
hoodTable = soup.find("table", attrs={"class": "wikitable sortable"})
hoodRows = hoodTable.tbody.find_all("tr")
# Get the headings for the table
headings = []
for th in hoodRows[0].find_all("th"):
    headings.append(th.text.replace('\n', ' ').replace(' ', '').strip())
# Get table rows
rows = []
for row in hoodRows[1:]:
    data = []
    for td in row.find_all("td"):
        data.append(td.text.replace('\n', ' ').strip())
    rows.append(data)
# Create dataframe
df = pd.DataFrame(data=rows, columns=headings)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
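
The same row-by-row extraction can be tried offline on a small inline table. This is only a sketch: `sample_html` is a made-up snippet that mimics the layout of the Wikipedia table, and `table_to_frame` is a hypothetical helper name, not something defined in the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up snippet that mimics the layout of the Wikipedia table (assumption).
sample_html = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postal Code </th><th>Borough </th><th>Neighbourhood </th></tr>
<tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
<tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</tbody>
</table>
"""


def table_to_frame(html):
    """Turn the first table with class 'wikitable' into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    trs = table.find_all("tr")
    # Header cells: trim whitespace and drop inner spaces ("Postal Code" -> "PostalCode")
    headings = [th.get_text(strip=True).replace(" ", "") for th in trs[0].find_all("th")]
    # Data cells: one list of trimmed strings per table row
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in trs[1:]]
    return pd.DataFrame(rows, columns=headings)


sample_df = table_to_frame(sample_html)
```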


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [4]:
# Ignore cells with a borough that is Not assigned.
df = df[df["Borough"] != "Not assigned"]
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


More than one neighborhood can exist in one postal code area.

When a postal code and borough appear in more than one row, those rows will be combined into one row with the neighborhoods separated by a comma.

In [5]:
# Identify duplicates in ["PostalCode","Borough"]
dfDups = df[df.duplicated(subset=["PostalCode","Borough"],keep=False)]
rows = []

for code, boro, hood in dfDups.values:
    # Find all the neighbourhoods for a specific postal code and borough
    idx = (dfDups["PostalCode"]==code) & (dfDups["Borough"]==boro)
    hoods = dfDups.loc[idx, "Neighbourhood"]
    # If result is not empty, merge the names and add the row to the list
    if not hoods.empty:
        rows.append({"PostalCode":code, "Borough":boro, "Neighbourhood":hoods.str.cat(sep=", ")})
    # Remove all processed rows
    dfDups = dfDups.loc[~idx]
# Drop the duplicated rows from the original dataframe and append the merged rows
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
df = df.drop_duplicates(subset=["PostalCode","Borough"],keep=False)
df = pd.concat([df, pd.DataFrame(rows, columns=df.columns)], ignore_index=True)
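
The same merge can also be expressed with `groupby`. The sketch below swaps in that approach on made-up sample rows. It is not the cell's loop, and unlike the loop it keeps merged rows in order of first appearance instead of appending them at the end.

```python
import pandas as pd

# Made-up rows for illustration; M5A appears twice on purpose.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Regent Park", "Harbourfront"],
})

# One row per (PostalCode, Borough); neighbourhood names joined with ", ".
merged = (
    sample.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
```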

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [6]:
df.loc[df["Neighbourhood"] == "Not assigned", "Neighbourhood"] = df["Borough"]
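
To see the effect of this rule in isolation, here is a small sketch on made-up rows. It uses `Series.mask`, which is an alternative spelling of the same replacement rather than the assignment used in the cell.

```python
import pandas as pd

# Made-up rows for illustration: the first has a borough but no neighbourhood.
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

unassigned = sample["Neighbourhood"] == "Not assigned"
# mask() replaces entries where the condition is True with the aligned Borough value.
sample["Neighbourhood"] = sample["Neighbourhood"].mask(unassigned, sample["Borough"])
```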

Take a look at the shape of our data

In [7]:
print('The dataframe has {} boroughs and at least {} neighborhoods.'.format(
        len(df["Borough"].unique()),
        df.shape[0]
    )
)

The dataframe has 10 boroughs and at least 103 neighborhoods.


<h2>Part 2</h2>

Given that the Geocoder package is very unreliable, I import the data from the CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [8]:
dfLatLon = pd.read_csv("http://cocl.us/Geospatial_data")
dfLatLon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join the existing dataframe with the latitude/longitude table on postal code

In [9]:
# Remove spaces, just as we did with the main dataframe
dfLatLon.columns = dfLatLon.columns.str.replace(" ", "")
# Join on postal code
df = df.merge(dfLatLon, how='left', on="PostalCode")
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
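
A left join silently leaves `NaN` coordinates for any postal code missing from the CSV. As a hedged sketch on made-up frames, the `indicator` and `validate` parameters of `DataFrame.merge` can surface such rows and guard against duplicated keys. The code `M9Z` and the borough `Nowhere` are invented for the example.

```python
import pandas as pd

# Made-up frames for illustration; M9Z is an invented code with no coordinates.
hoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a "_merge" column telling where each row came from.
checked = hoods.merge(coords, how="left", on="PostalCode",
                      validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
```

An empty `unmatched` list would mean every postal code received coordinates.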


<h2>Part 3</h2>