<h1 style='text-align: center'>Postal Codes of Canada : M</h1>

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Request the page, take the response text, and parse it into a BeautifulSoup object.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
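resp = requests.get(url, timeout=30)
resp.raise_for_status()  # stop early if the request failed
txt = resp.text
soup = BeautifulSoup(txt, 'html.parser')  # name the parser explicitly to avoid a warning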

Find the first &lt;table&gt; tag and extract all of its &lt;td&gt; cells.

In [3]:
tds = soup.find('table').find_all('td')

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [4]:
tds = [td for td in tds if 'Not assigned' not in td.text]

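The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. In each cell, the &lt;b&gt; tag holds the postal code and the &lt;span&gt; tag holds the borough together with its neighborhoods.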

In [5]:
postalcode = [td.find('b').text for td in tds]
boroughs = [td.find('span').text for td in tds]
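
The extraction steps above can be tried offline on a small snippet. This is a minimal sketch: `SAMPLE_HTML` is a hypothetical miniature of the table, written to match the tags the cells above rely on, and the live page's markup may differ or change over time. The `sample_*` names are illustrative and are not used elsewhere in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (an assumption,
# not a copy of the live page).
SAMPLE_HTML = """
<table>
  <tr>
    <td><p><b>M1A</b><br><span><i>Not assigned</i></span></p></td>
    <td><p><b>M3A</b><br><span>North York<br>(Parkwoods)</span></p></td>
    <td><p><b>M7A</b><br><span>Queen's Park / Ontario Provincial Government</span></p></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Same steps as above: collect the cells, then drop the unassigned ones.
sample_tds = sample_soup.find('table').find_all('td')
sample_tds = [td for td in sample_tds if 'Not assigned' not in td.text]

# <b> holds the postal code; <span> holds borough plus neighborhoods.
sample_codes = [td.find('b').text for td in sample_tds]
sample_cells = [td.find('span').text for td in sample_tds]

print(sample_codes)
print(sample_cells)
```

Note that the borough and the parenthesized neighborhood come back as one string, which is why a split step is needed afterwards.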

In [6]:
df = pd.DataFrame(
    {'PostalCode': postalcode,
     'Borough': boroughs}
)

Each value holds the borough followed by its neighborhoods in parentheses, so splitting on '(' separates the two.

In [7]:
# Column 0 is the borough; column 1 is the neighborhood text, still ending in ')'
temp = df['Borough'].str.split('(', expand=True)
df['Borough'] = temp[0]
df['Neighborhood'] = temp[1]

In [8]:
df.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods)
1,M4A,North York,Victoria Village)
2,M5A,Downtown Toronto,Regent Park / Harbourfront)
3,M6A,North York,Lawrence Manor / Lawrence Heights)
4,M7A,Queen's Park / Ontario Provincial Government,


Within a cell, neighborhoods are separated by ' / '. Replace that separator with a comma.

In [9]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)

If a cell has a borough but no neighborhood, the neighborhood is set to the borough.

In [10]:
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])

In [11]:
# Drop the trailing ')' left over from splitting on '('
condition = df['Neighborhood'].str.endswith(')')
df.loc[condition, 'Neighborhood'] = df.loc[condition, 'Neighborhood'].str[:-1]
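
The cleanup cells above can also be gathered into one function and checked on a few sample strings. This is a sketch, not a replacement for the cells: `split_cells` is a hypothetical helper name, the inputs are taken from the rows shown in the output above, and it removes the trailing ')' with a regular expression instead of the boolean mask used in the notebook.

```python
import pandas as pd

def split_cells(cells):
    """Split 'Borough(Neighborhood / ...)' strings into two clean columns."""
    raw = pd.Series(cells, dtype='object')
    # n=1 splits at the first '(' only, giving columns 0 and 1
    # (assumes at least one cell contains '(').
    parts = raw.str.split('(', n=1, expand=True)
    borough = parts[0]
    hood = (parts[1]
            .str.replace(r'\)$', '', regex=True)      # drop the trailing ')'
            .str.replace(' / ', ', ', regex=False))   # ' / ' becomes ', '
    # Cells without parentheses have no neighborhood; reuse the borough.
    return pd.DataFrame({'Borough': borough,
                         'Neighborhood': hood.fillna(borough)})

sample = split_cells([
    'North York(Parkwoods)',
    'Downtown Toronto(Regent Park / Harbourfront)',
    "Queen's Park / Ontario Provincial Government",
])
print(sample)
```

As in the notebook, the ' / ' inside a borough name is left untouched, because the replacement runs on the neighborhood text before the missing values are filled.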

In [12]:
df.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park / Ontario Provincial Government,Queen's Park / Ontario Provincial Government


In [13]:
df.set_index('PostalCode').loc['M5A']

Borough                  Downtown Toronto
Neighborhood    Regent Park, Harbourfront
Name: M5A, dtype: object

In [14]:
df.shape

(103, 3)