Install the BeautifulSoup package

In [2]:
!conda install --yes beautifulsoup4

Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0

beautifulsoup4 100% |################################| Time: 0:00:00  37.33 MB/s


In [3]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

Download the Wikipedia page and load the table rows into a Python list

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" 
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

In [5]:
# class="wikitable sortable"

postal_codes = []

for tr in soup.table.find_all('tr')[1:]:
    tds = tr.find_all('td')
    try:
        # print ("Postcode: %s, Borough: %s, Neighbourhood: %s" % (tds[0].text, tds[1].text, tds[2].text))
        postal_codes.append([tds[0].text, tds[1].text, tds[2].text.rstrip()])
    except IndexError:
        # Skip rows that do not have three <td> cells
        continue
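
`soup.table` returns the first `<table>` on the page, which happens to be the postal code table here. A slightly more defensive sketch, assuming the table keeps the `wikitable sortable` class noted in the comment above, selects it by class instead. The inline HTML and the `parse_rows` helper are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: an unrelated table followed by the data table.
SAMPLE_HTML = """
<table class="infobox"><tr><td>unrelated</td></tr></table>
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_rows(html):
    """Return [postcode, borough, neighbourhood] lists from the wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of the tag's CSS classes, so "wikitable"
    # also finds class="wikitable sortable".
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 3:  # the header row holds <th> cells only
            continue
        rows.append([td.get_text(strip=True) for td in tds[:3]])
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Checking the cell count explicitly replaces the `try`/`except` around the index lookups, and `get_text(strip=True)` removes the trailing newline from every cell rather than only the last one.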

Create dataframe from the parsed array of values

In [6]:
postal_codes_df = pd.DataFrame(data=postal_codes, columns=['PostalCode', 'Borough', 'Neighborhood'])
postal_codes_df.head()

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Check the size of the newly created dataframe

In [7]:
postal_codes_df.shape

(288, 3)

Delete the rows whose Borough is 'Not assigned'. For the remaining rows, if the Neighborhood is 'Not assigned', use the Borough name as the Neighborhood

In [8]:
postal_codes_df.drop(postal_codes_df[postal_codes_df['Borough'] == 'Not assigned'].index, axis=0, inplace=True)
# Boolean mask of rows that still lack a neighborhood; those rows take the borough name
unassigned = postal_codes_df['Neighborhood'] == 'Not assigned'
postal_codes_df.loc[unassigned, 'Neighborhood'] = postal_codes_df.loc[unassigned, 'Borough']
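
The effect of the two cleaning rules is easier to see on a few rows. The following is a minimal sketch on made-up data, not the scraped table; the names `df`, `cleaned`, and `missing` are illustrative only.

```python
import pandas as pd

# Toy data (made up) covering the three cases the cleaning step handles.
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],  # dropped: no borough
        ["M3A", "North York", "Parkwoods"],       # kept unchanged
        ["M7A", "Queen's Park", "Not assigned"],  # neighborhood filled in
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep only rows with an assigned borough. copy() makes the result an
# independent frame, so the .loc assignment below raises no warning.
cleaned = df[df["Borough"] != "Not assigned"].copy()

# Where the neighborhood is missing, fall back to the borough name.
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

Filtering with a boolean mask and `copy()` is an alternative to `drop(..., inplace=True)`; both leave the original index labels in place, which is why the cleaned frame can have gaps in its index.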

Sanity check: no rows should still have 'Not assigned' as the Neighborhood

In [9]:
postal_codes_df[postal_codes_df['Neighborhood']=='Not assigned']

,PostalCode,Borough,Neighborhood


In [10]:
postal_codes_df.head(10)

,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


Count the rows per PostalCode with a group by to see which postal codes cover more than one neighborhood

In [11]:
grouped = postal_codes_df.groupby('PostalCode').count()
grouped

PostalCode,Borough,Neighborhood
M1B,2,2
M1C,3,3
M1E,3,3
M1G,1,1
M1H,1,1
M1J,1,1
M1K,3,3
M1L,3,3
M1M,3,3
M1N,2,2
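
The number of distinct postal codes is also the number of rows to expect once duplicate postal codes are combined. A small sketch on made-up rows, assuming the same column names as above, shows two ways to get at these counts without building the full grouped frame.

```python
import pandas as pd

# Made-up rows: M5A appears twice, M7A once.
df = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
    }
)

# Rows per postal code, equivalent to one column of groupby(...).count().
counts = df["PostalCode"].value_counts()

# Number of distinct postal codes, i.e. the row count after combining.
n_unique = df["PostalCode"].nunique()

print(counts)
print(n_unique)
```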


Combine the rows that share a PostalCode into a single row, joining their neighborhoods into one comma-separated string

In [12]:
postal_codes_df = postal_codes_df.groupby('PostalCode', as_index=False).agg({'Borough' : 'first', 'Neighborhood' : ', '.join})
postal_codes_df = postal_codes_df[['PostalCode', 'Borough', 'Neighborhood']]
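
To make the aggregation concrete, here is the same `groupby` / `agg` pattern applied to three made-up rows. The frame `df` and the result name `combined` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up rows: two neighborhoods share M5A, one belongs to M7A.
df = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Queen's Park"],
    }
)

# One row per postal code: keep the first borough seen and join the
# neighborhood names with ", ".
combined = df.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)

print(combined)
```

Taking `'first'` for Borough assumes every row of a postal code carries the same borough, which holds for the rows shown earlier in this notebook.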

In [13]:
#postal_codes_df[postal_codes_df['PostalCode']=='M9V']
postal_codes_df.shape

(103, 3)