Scrape the list of postal codes of Canada

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import lxml

In [77]:
source = requests.get( 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' ).text
soup = BeautifulSoup( source, 'lxml' )

In [78]:
# Find the first table on the Wikipedia page and collect all of its <td> cells.
# Every three consecutive cells form one table row: postcode, borough, neighborhood.
table_info = soup.find( 'table' )
rows = table_info.find_all( 'td' )

postcode = []
borough = []
neighborhood = []

for i in range( 0, len( rows ), 3 ):
    postcode.append( rows[i].text.strip() )
    borough.append( rows[i + 1].text.strip() )
    neighborhood.append( rows[i + 2].text.strip() )

In [79]:
# Build the dataframe from the lists of values and set column names as Postcode, Borough, and Neighborhood
df = pd.DataFrame( { 'Postcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood } )
df.head()

,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


* Rule 1: Ignore rows whose borough is 'Not assigned'.
* Rule 2: If a row has a borough but its neighborhood is 'Not assigned', the neighborhood is set to the borough. For example, for the 9th cell in the table on the Wikipedia page (M7A), both the Borough and the Neighborhood columns will be Queen's Park.

In [80]:
df.drop( df[df['Borough'] == 'Not assigned'].index, inplace = True )
print( "Before:\n{}".format( df.loc[8] ) )
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = df.Borough
print( "After:\n{}".format( df.loc[8] ) )

Before:
Postcode                 M7A
Borough         Queen's Park
Neighborhood    Not assigned
Name: 8, dtype: object
After:
Postcode                 M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 8, dtype: object
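
The two rules can also be tried in isolation on a small hand-made frame. The sketch below is only an illustration: the `toy` rows are made up to mirror the three scraped columns, and it uses a boolean filter plus `Series.mask` instead of the in-place `drop` used in the cell above.

```python
import pandas as pd

# Hypothetical toy rows mirroring the three scraped columns (not the real data).
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: keep only rows whose borough is assigned.
cleaned = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighborhood is unassigned, fall back to the borough.
unassigned = cleaned['Neighborhood'] == 'Not assigned'
cleaned['Neighborhood'] = cleaned['Neighborhood'].mask(unassigned, cleaned['Borough'])

print(cleaned)
```

Filtering into a copy leaves the original frame untouched, which makes it easier to compare the data before and after the rules are applied.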


In [81]:
df.head( 12 )

,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by commas.

In [82]:
df_group = df.groupby( ['Postcode', 'Borough'] )['Neighborhood'].apply( ', '.join ).reset_index()
df_group.head()

,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
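
To see the combining step on its own, here is a minimal sketch on made-up rows (the `toy` frame is an assumption, not the scraped data). It groups by postcode and borough and joins the neighborhood names; `sort=False` is an optional choice here that keeps groups in order of first appearance rather than sorting them by key.

```python
import pandas as pd

# Hypothetical toy rows: M5A appears twice with two different neighborhoods.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park', "Queen's Park"],
})

# Join all neighborhoods that share a postcode and borough into one string.
merged = (
    toy.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(merged)
```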


Print the shape (number of rows and columns) of the dataframe

In [85]:
df_group.shape

(103, 3)
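
Beyond the row count, a quick sanity check can confirm the cleaning steps had the intended effect. The helper below is a hypothetical sketch (`check_grouped` is not part of the notebook); it assumes each postal code belongs to exactly one borough, so that postcodes should be unique after grouping.

```python
import pandas as pd

def check_grouped(frame):
    """Return True if postcodes are unique and no 'Not assigned' value remains."""
    unique_codes = frame['Postcode'].is_unique
    no_placeholder = not frame.isin(['Not assigned']).any().any()
    return unique_codes and no_placeholder

# Hypothetical frames for illustration only.
good = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront, Regent Park', "Queen's Park"],
})
bad = pd.DataFrame({
    'Postcode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

print(check_grouped(good))
print(check_grouped(bad))
```

In the notebook, the same check could be applied as `check_grouped( df_group )`.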