##Import Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

##Scrape the Wikipedia Page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url=url)
# report whether the page was fetched successfully
print('Scraping... Done!' if response.status_code == 200 else 'Something went wrong!')

Scraping... Done!
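
A slightly more defensive variant of the request above might set a timeout and let `requests` raise on HTTP errors instead of comparing the status code by hand. This is only a minimal sketch: the helper name `fetch_page` and the 10-second default timeout are assumptions, not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the raw HTML bytes of `url`, raising on network or HTTP errors.

    `fetch_page` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

The returned bytes could then be handed to `BeautifulSoup` in the same way as `response.content`.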


##Extract Table

* First we locate the neighborhood table in the parsed `html` soup, then we extract the text of each cell.

In [3]:
# find the html table to scrape
soup = BeautifulSoup(response.content, 'html.parser')
html_table = soup.find_all('tbody')[0]
# get_text() drops the <td> tags (and any nested markup) in one step
data = [d.get_text().strip() for d in html_table.find_all('td')]
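
Flattening every `<td>` into one list works only while each row has exactly three cells. As an alternative sketch, the table can be walked row by row, which keeps the cells of one row together. The inline `sample_html` below is made-up data that imitates the page layout, and `parse_rows` is a hypothetical helper, not something the notebook defines.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates the structure of the Wikipedia table.
sample_html = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td><a href="#">Parkwoods</a></td></tr>
  </tbody>
</table>
"""


def parse_rows(html):
    """Return (header, rows); each row is a list of cell strings."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    header, rows = [], []
    for tr in table.find_all('tr'):
        # header cells use <th>, data cells use <td>
        ths = [th.get_text().strip() for th in tr.find_all('th')]
        tds = [td.get_text().strip() for td in tr.find_all('td')]
        if ths:
            header = ths
        elif tds:
            rows.append(tds)
    return header, rows


header, rows = parse_rows(sample_html)
print(header)
print(rows)
```

Because `get_text()` returns only the text content, the link wrapped around `Parkwoods` in the sample disappears without any string replacement.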

* The cell values repeat in groups of three: the first is the `postal code`, the second is the `borough`, and the third is the `neighborhood(s)`.
* We build each column by taking every third value, starting at offsets 0, 1, and 2 respectively.
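
The stride slicing described above can be seen on a toy list. The values and the names `cells`, `codes`, `areas`, and `hoods` are illustrative only; the length check is an added safeguard, not part of the notebook.

```python
# Toy flat list standing in for the scraped `data`; values are illustrative.
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods']

# The slicing only lines up if the list splits evenly into triples.
assert len(cells) % 3 == 0

codes = cells[0::3]    # items 0, 3, ...
areas = cells[1::3]    # items 1, 4, ...
hoods = cells[2::3]    # items 2, 5, ...

# Equivalent view: regroup the flat list into (code, borough, neighborhood) rows.
rows = list(zip(codes, areas, hoods))
print(rows)
```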

In [4]:
# split the flat cell list into one list per column
postal_codes = data[::3]
boroughs = data[1::3]
neighborhoods = data[2::3]

In [5]:
# extract the first three header cells and normalise them to snake_case names
columns = [th.get_text().strip() for th in html_table.find_all('th')][:3]
columns = [column.replace(' ', '_').lower() for column in columns]

In [6]:
# generate a dataframe, pairing each column name with its list of values
df = pd.DataFrame(dict(zip(columns, [postal_codes, boroughs, neighborhoods])))
df.head()

| | postal_code | borough | neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


##Preprocessing

* Drop all rows whose `borough` is `Not assigned`.

In [7]:
df = df.loc[df.borough != 'Not assigned']
df.head()

| | postal_code | borough | neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
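
The filter keeps the original row labels, which is why the output above starts at index 2. If a fresh 0-based index is preferred, `reset_index(drop=True)` can be chained on. A minimal sketch on a hand-made frame follows; the rows are copied from the outputs above and `toy` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    'postal_code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'neighbourhood': ['Not assigned', 'Not assigned',
                      'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True for rows whose borough is assigned.
assigned = toy['borough'] != 'Not assigned'

# Keep the matching rows and renumber them from 0.
toy = toy.loc[assigned].reset_index(drop=True)
print(toy)
```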


* Check whether any `neighbourhood` still has the value `Not assigned`.

In [8]:
# check if there is any missing neighborhood
df.loc[df.neighbourhood == 'Not assigned']

| | postal_code | borough | neighbourhood |
|---|---|---|---|
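
The empty result above is the visual confirmation that no such row remains. The same check can also be expressed as a single boolean or a count, which is easier to assert on in a script. The sketch below runs on made-up rows (one deliberately left as `Not assigned`); `toy` and `missing` are illustrative names.

```python
import pandas as pd

# Made-up rows; the second one deliberately has a placeholder neighbourhood.
toy = pd.DataFrame({
    'borough': ['North York', 'Downtown Toronto'],
    'neighbourhood': ['Parkwoods', 'Not assigned'],
})

# True where the neighbourhood is still a placeholder.
missing = toy['neighbourhood'].eq('Not assigned')

print(missing.any())   # is there at least one such row?
print(missing.sum())   # how many rows are affected?
```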


In [10]:
print("""
    The data frame has {} rows and {} columns.
""".format(df.shape[0], df.shape[1]))


    The data frame has 103 rows and 3 columns.

