In this notebook I scrape Wikipedia's table of Toronto postal codes, which divides the city into boroughs and neighborhoods, and format the result as a pandas dataframe.

The Wikipedia page containing this information can be found here: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
from bs4 import BeautifulSoup
import requests

First, get the Wikipedia page using requests.get, parse it with BeautifulSoup, and print it with prettify for readability.

In [122]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipage=requests.get(wikipedia_link).text
soup = BeautifulSoup(wikipage,'lxml')
print(soup.prettify())
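
A failed request (for example a 404 or a timeout) would otherwise go unnoticed until the parsing step. The following is a minimal sketch of a more defensive fetch; the helper name fetch_html and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

The returned text could then be passed to BeautifulSoup in the same way as wikipage above.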

The relevant portion of the page is the table with class "wikitable sortable", so next extract it as code_table:

In [8]:
code_table = soup.find('table',{'class':'wikitable sortable'})

For each row in the table, collect the text of its cells and append it to a list of all rows. A row with no td cells (such as the header row) gives an empty list and is skipped.

In [56]:
rows = code_table.find_all('tr')
newtable = []
for row in rows:
    cells = row.find_all('td')
    # rstrip() removes the trailing "\n" from the Neighborhood in each row
    table_entry = [cell.text.rstrip() for cell in cells]
    if table_entry:  # skip rows without <td> cells
        newtable.append(table_entry)
print(newtable)

[['M1A', 'Not assigned', 'Not assigned'], ['M2A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Harbourfront'], ['M5A', 'Downtown Toronto', 'Regent Park'], ['M6A', 'North York', 'Lawrence Heights'], ['M6A', 'North York', 'Lawrence Manor'], ['M7A', "Queen's Park", 'Not assigned'], ['M8A', 'Not assigned', 'Not assigned'], ['M9A', 'Etobicoke', 'Islington Avenue'], ['M1B', 'Scarborough', 'Rouge'], ['M1B', 'Scarborough', 'Malvern'], ['M2B', 'Not assigned', 'Not assigned'], ['M3B', 'North York', 'Don Mills North'], ['M4B', 'East York', 'Woodbine Gardens'], ['M4B', 'East York', 'Parkview Hill'], ['M5B', 'Downtown Toronto', 'Ryerson'], ['M5B', 'Downtown Toronto', 'Garden District'], ['M6B', 'North York', 'Glencairn'], ['M7B', 'Not assigned', 'Not assigned'], ['M8B', 'Not assigned', 'Not assigned'], ['M9B', 'Etobicoke', 'Cloverdale'], ['M9B', 'Etobicoke', 'Islington'], ['M9B', 'Etobicoke', 'Martin Gro
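
To see the row extraction in isolation, here is a small sketch that runs the same logic on an inline HTML fragment instead of the live page. The fragment is a made-up stand-in for the Wikipedia table, and it uses Python's built-in "html.parser" rather than lxml so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

sample_rows = []
for tr in sample_table.find_all("tr"):
    # The header row holds <th> cells only, so this list is empty for it.
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:
        sample_rows.append(cells)

print(sample_rows)
```

The header row is dropped because it contains no td cells, and the newline after each neighborhood name is stripped.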

Next, make this into a dataframe with the correct column names.

In [57]:
import pandas as pd
postcodes = pd.DataFrame(newtable)
postcodes.columns=["Postal Code", "Borough", "Neighborhood"]
postcodes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Remove all rows for which Borough is "Not assigned".

In [58]:
# .copy() makes the filtered frame independent, so later assignments do not act on a view
postcodes = postcodes[postcodes.Borough != "Not assigned"].copy()
postcodes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


For any remaining row where "Neighborhood" is not assigned, give "Neighborhood" the value of "Borough".

In [124]:
postcodes.loc[postcodes['Neighborhood'] == "Not assigned", 'Neighborhood'] = postcodes['Borough']
postcodes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
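
The assignment above relies on pandas aligning the right-hand Series with the selected rows by index. A minimal sketch on a two-row toy frame (the name toy_fill is illustrative, and the values are taken from the scraped rows shown earlier):

```python
import pandas as pd

toy_fill = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Harbourfront", "Not assigned"],
    }
)

unassigned = toy_fill["Neighborhood"] == "Not assigned"
# Only the masked rows are written; the Borough values are matched by index.
toy_fill.loc[unassigned, "Neighborhood"] = toy_fill["Borough"]

print(toy_fill)
```

Only the M7A row changes: its Neighborhood becomes "Queen's Park", while Harbourfront is left untouched.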


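Next, group together rows that share a postal code and join their values with ", ". Applying set keeps only unique values, so a borough repeated across rows appears once. Note that set does not guarantee any particular order for the joined neighborhoods.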

In [123]:
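# select the columns with a list (double brackets); a bare tuple of names is not accepted by recent pandas versions
postcodes_grouped = postcodes.groupby('Postal Code')[['Borough', 'Neighborhood']].agg(lambda col: ', '.join(set(col)))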
postcodes_grouped.reset_index(inplace=True)
postcodes_grouped.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Morningside, Guildwood, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
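
Because set has no defined order, the joined neighborhood string can differ between runs. If a reproducible order matters, one option is to sort the unique values before joining. This is a sketch of that variant on toy data, not the method used in the notebook; join_unique and toy_codes are illustrative names.

```python
import pandas as pd

toy_codes = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1B", "M3A"],
        "Borough": ["Scarborough", "Scarborough", "North York"],
        "Neighborhood": ["Rouge", "Malvern", "Parkwoods"],
    }
)


def join_unique(col):
    # sorted(set(...)) removes duplicates and fixes an alphabetical order.
    return ", ".join(sorted(set(col)))


toy_grouped = (
    toy_codes.groupby("Postal Code")[["Borough", "Neighborhood"]]
    .agg(join_unique)
    .reset_index()
)

print(toy_grouped)
```

Here the two M1B rows collapse into one row with Borough "Scarborough" and Neighborhood "Malvern, Rouge".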


Check the shape of the grouped dataframe, which now has one row per postal code.

In [106]:
postcodes_grouped.shape

(103, 3)

Next, load the latitude and longitude of each postal code from Geospatial_Coordinates.csv.

In [125]:
gscoords = pd.read_csv("Geospatial_Coordinates.csv")
gscoords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


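Finally, merge the coordinates into the grouped dataframe on "Postal Code". A left join keeps every postal code from the grouped dataframe.

In [121]:
toronto = postcodes_grouped.merge(gscoords, how='left', on='Postal Code')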
toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside, Guildwood, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
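
A left join silently leaves NaN coordinates for any postal code missing from the CSV. The sketch below shows one way to check for that on toy data, using the validate and indicator parameters of DataFrame.merge. The code "M9Z" and its borough are made up to force a mismatch; the coordinates are the first two rows shown above.

```python
import pandas as pd

toy_left = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C", "M9Z"],  # "M9Z" is a hypothetical code
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
toy_coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

toy_merged = toy_left.merge(
    toy_coords,
    how="left",
    on="Postal Code",
    validate="one_to_one",  # raises MergeError if a postal code is duplicated
    indicator=True,         # adds a "_merge" column describing each row's origin
)

# Rows marked "left_only" found no matching coordinates.
missing = toy_merged.loc[toy_merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty list from the same check on the real data would indicate that every postal code received coordinates.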
