# Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract its table of postal codes and load it into a pandas DataFrame. The page is fetched with requests and parsed with the BeautifulSoup library.

In [61]:
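from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the page and stop early on an HTTP error
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res.raise_for_status()
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # the first table on the page holds the postal codes
# read_html returns a list of DataFrames; StringIO avoids the deprecated literal-HTML input
df = pd.read_html(StringIO(str(table)))
print(df[0].to_json(orient='records'))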

[{"Postcode":"M1A","Borough":"Not assigned","Neighbourhood":"Not assigned"},{"Postcode":"M2A","Borough":"Not assigned","Neighbourhood":"Not assigned"},{"Postcode":"M3A","Borough":"North York","Neighbourhood":"Parkwoods"},{"Postcode":"M4A","Borough":"North York","Neighbourhood":"Victoria Village"},{"Postcode":"M5A","Borough":"Downtown Toronto","Neighbourhood":"Harbourfront"},{"Postcode":"M6A","Borough":"North York","Neighbourhood":"Lawrence Heights"},{"Postcode":"M6A","Borough":"North York","Neighbourhood":"Lawrence Manor"},{"Postcode":"M7A","Borough":"Downtown Toronto","Neighbourhood":"Queen's Park"},{"Postcode":"M8A","Borough":"Not assigned","Neighbourhood":"Not assigned"},{"Postcode":"M9A","Borough":"Etobicoke","Neighbourhood":"Islington Avenue"},{"Postcode":"M1B","Borough":"Scarborough","Neighbourhood":"Rouge"},{"Postcode":"M1B","Borough":"Scarborough","Neighbourhood":"Malvern"},{"Postcode":"M2B","Borough":"Not assigned","Neighbourhood":"Not assigned"},{"Postcode":"M3B","Borough":"N

In [62]:
df = df[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [63]:
# Remove rows whose borough is "Not assigned"
indexNames = df[df['Borough'] == "Not assigned"].index
df.drop(indexNames, axis=0, inplace=True)

# Reset the index after dropping rows
df.reset_index(drop=True, inplace=True)
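
The same filtering step can also be written as a boolean mask instead of collecting index labels and dropping them. The following is a minimal sketch on a made-up three-row frame; the `toy` data and the `assigned` name are illustrative assumptions, not part of the scraped table.

```python
import pandas as pd

# Hypothetical rows that mimic the scraped columns (values are invented)
toy = pd.DataFrame({
    "Postcode": ["P1", "P2", "P3"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighbourhood": ["Not assigned", "Hood 1", "Not assigned"],
})

# Keep only rows with an assigned borough, then renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Because the mask returns a new frame, this form does not rely on `inplace=True`.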

In [64]:
# If a row has a borough but a "Not assigned" neighbourhood, use the borough name as the neighbourhood
mask = df['Neighbourhood'] == "Not assigned"
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']

In [86]:
# Combine neighbourhoods that share a postcode into one comma-separated string
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
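
To see what the grouping step does in isolation, here is a small sketch on invented data. The `toy` frame and the `merged` name are assumptions for illustration; `agg` is used here in place of `apply`, and both join the strings within each group.

```python
import pandas as pd

# Hypothetical rows: postcode P1 appears twice with different neighbourhoods
toy = pd.DataFrame({
    "Postcode": ["P1", "P1", "P2"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighbourhood": ["Hood 1", "Hood 2", "Hood 3"],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ", "
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

The two P1 rows collapse into a single row whose Neighbourhood cell lists both names, which is the same pattern as "Rouge, Malvern" for M1B.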

In [90]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [91]:
df.shape

(103, 3)
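
Beyond checking the shape, it can be useful to confirm that the cleaning steps left one row per postcode and no placeholder values. The helper below is a hypothetical sketch (the name `check_clean` and the toy frames are assumptions, not part of the original notebook).

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> bool:
    """Return True if postcodes are unique and no cell equals 'Not assigned'."""
    unique_postcodes = frame["Postcode"].is_unique
    no_placeholder = not (frame == "Not assigned").any().any()
    return bool(unique_postcodes and no_placeholder)

# Hypothetical frames: one clean, one with a leftover placeholder
clean_toy = pd.DataFrame({
    "Postcode": ["P1", "P2"],
    "Borough": ["Borough A", "Borough B"],
    "Neighbourhood": ["Hood 1, Hood 2", "Hood 3"],
})
dirty_toy = clean_toy.assign(Neighbourhood=["Not assigned", "Hood 3"])

print(check_clean(clean_toy), check_clean(dirty_toy))
```

Calling `check_clean(df)` on the notebook's final frame would apply the same two checks to the real data.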