This notebook creates a dataframe of neighbourhood information for the Canadian postal codes beginning with M, extracted with the BeautifulSoup library from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, and grouped by postal code.

In [1]:
pip install beautifulsoup4 lxml html5lib

Note: you may need to restart the kernel to use updated packages.


In [2]:
from bs4 import BeautifulSoup
import pandas as pd 

In [3]:
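# Open the saved page in binary mode so BeautifulSoup detects the encoding,
# close the file when done, and name the parser explicitly (lxml is installed above).
with open("./ListPostalCodesCanada.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")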

Here we want to extract the table that lists the postal codes, so we use find() and assume that this table is the first one in the HTML page.

In [4]:
table_tag = soup.find("table")
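
Taking the first table only works while the page layout stays the same. As a minimal sketch, assuming the data table carries the CSS class "wikitable" (an assumption about the saved page, not checked here), the table can be selected by class instead. The inline HTML and the names demo_soup, demo_table and demo_header are hypothetical stand-ins for the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page: a layout table followed by the data table.
html = """
<table class="infobox"><tr><td>navigation</td></tr></table>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# class_ matches any table whose class attribute includes "wikitable" (assumed class name).
demo_table = demo_soup.find("table", class_="wikitable")
if demo_table is None:
    raise ValueError("no table with class 'wikitable' found")

# Header texts of the selected table.
demo_header = [th.get_text(strip=True) for th in demo_table.find_all("th")]
print(demo_header)
```

Failing loudly when the table is missing makes a layout change visible immediately rather than producing an empty dataframe later.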

The column labels of the dataframe are taken from the header row of the table.

In [5]:
labels = []
first_tr_tag = table_tag.tr
for th_tag in first_tr_tag.find_all("th"):
    labels.append(th_tag.string.rstrip())

For each row, we append the values of its three cells to the postcode, borough and neighbourhood lists respectively, and then create the dataframe from these lists.

In [6]:
postCodeList = []
BoroughList = []
NeighbourhoodList = []
for tr_tag in first_tr_tag.find_next_siblings("tr"):
    postCodeTag = tr_tag.td
    postCodeList.append(postCodeTag.string)
    BoroughTag = postCodeTag.find_next_sibling("td")
    BoroughList.append(BoroughTag.string)
    NeighbourhoodTag = BoroughTag.find_next_sibling("td")
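    # Keep the last non-empty string of the cell in its own variable
    # instead of overwriting the tag object.
    neighbourhood = None
    for s in NeighbourhoodTag.stripped_strings:
        neighbourhood = s
    NeighbourhoodList.append(neighbourhood)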
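
The row-by-row extraction above can also be sketched more compactly with get_text(strip=True), which returns the text of a cell with surrounding whitespace removed whether or not the cell contains a link. This is an alternative sketch, not the method used in this notebook. The inline HTML and the helper parse_rows are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the page: plain cells, linked cells, trailing newlines.
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a>
</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")


def parse_rows(table):
    """Return one tuple of stripped cell texts per data row; rows without <td> are skipped."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it yields no entries
            rows.append(tuple(cells))
    return rows


demo_rows = parse_rows(demo_table)
print(demo_rows)
```

The resulting list of tuples can be passed straight to pd.DataFrame together with the column labels.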

In [7]:
df = pd.DataFrame(list(zip(postCodeList, BoroughList, NeighbourhoodList)), columns=labels)

In [8]:
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)

In [9]:
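# Drop rows without a borough; copy so that later assignments do not act on a view of the original frame.
df = df[df.Borough != "Not assigned"].copy()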

In [10]:
# Where the neighbourhood is missing, use the borough name instead.
not_assigned = df.Neighbourhood == "Not assigned"
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

We assume that rows sharing a postcode also share a borough, so only the values in the 'Neighbourhood' column need to be aggregated.

In [11]:
df['Neighbourhood'] = df.groupby(['PostalCode'])['Neighbourhood'].transform(lambda x: ','.join(x))
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)
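
The assumption that each postcode maps to a single borough can be checked before aggregating. The following is a minimal sketch on a small hypothetical frame (demo, boroughs_per_code and demo_grouped are illustrative names). It also shows groupby with agg as one possible alternative to transform followed by drop_duplicates.

```python
import pandas as pd

# Hypothetical sample with one postcode split over two rows.
demo = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Number of distinct boroughs per postcode; any value above 1 breaks the assumption.
boroughs_per_code = demo.groupby("PostalCode")["Borough"].nunique()
assumption_holds = bool((boroughs_per_code == 1).all())

# One row per postcode: keep the first borough and join the neighbourhoods.
# sort=False keeps the postcodes in order of first appearance.
demo_grouped = (
    demo.groupby("PostalCode", sort=False)
        .agg({"Borough": "first", "Neighbourhood": ",".join})
        .reset_index()
)
print(assumption_holds)
print(demo_grouped)
```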

In [12]:
df_location = pd.read_csv('./Geospatial_Coordinates.csv')

In [13]:
df_location.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [14]:
df.set_index('PostalCode', inplace=True)
df_location.set_index('PostalCode', inplace=True)
pd_concat = pd.concat([df, df_location], axis=1, join='inner')
pd_concat.reset_index(inplace=True)
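
The index-based inner join above can also be written as a merge on the shared column, which avoids setting and resetting the index. This is a sketch on hypothetical frames (left, right and merged are illustrative names, and "M0X" is a made-up code with no coordinates); validate="one_to_one" makes pandas raise an error if a postcode appears more than once on either side.

```python
import pandas as pd

# Hypothetical neighbourhood frame; "M0X" has no matching coordinates.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Nowhere"],
})
# Hypothetical coordinates frame, listed in a different order.
right = pd.DataFrame({
    "PostalCode": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# Inner join keeps only postcodes present in both frames, in the order of the left frame.
merged = left.merge(right, on="PostalCode", how="inner", validate="one_to_one")
print(merged)
```

Comparing len(merged) with len(left) shows how many postcodes were dropped for lack of coordinates.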

In [15]:
pd_concat.head(11)

,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
