##Import Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

##Scrape the Wikipedia Page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url=url)
print('Scraping... Done!') if response.status_code == 200 else print('Something went wrong!')

Scraping... Done!


##Extract Table

* First, we locate the neighborhood table in the parsed HTML (the `soup`), then we extract the value of each cell.

In [3]:
# find the HTML table to scrape
soup = BeautifulSoup(response.content, 'html.parser')
html_table = soup.find_all('table', class_="wikitable sortable")[0]
# strip the <td> wrappers so only each cell's inner markup remains
data = [str(d).replace('<td>', '').replace('</td>', '').strip()
        for d in html_table.find_all('td')]
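
* As an alternative to stripping the `<td>` tags by hand, the cell text can be read with `get_text`, which also decodes HTML entities such as `&amp;`. The snippet below is a minimal sketch on a made-up one-row table (`sample_html` is an assumption for illustration, not the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the Wikipedia page
sample_html = """
<table class="wikitable sortable">
  <tr><td>M5P</td><td>Central Toronto</td><td>Forest Hill North &amp; West</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find_all("table", class_="wikitable sortable")[0]

# get_text returns the decoded text of each cell; strip=True trims whitespace
sample_cells = [td.get_text(strip=True) for td in sample_table.find_all("td")]
print(sample_cells)
```

* With this approach the `&amp;` in the source arrives as a plain `&`, so no later cleanup of that entity should be needed.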

* The values follow a repeating pattern in groups of three: the first is the `postal code`, the second is the `borough`, and the third is the `neighborhood(s)`.
* We build each array by taking every third value, starting at offsets 0, 1, and 2 respectively.
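
* The stride slicing described above can be sketched on a small list. The values in `cells` are a hand-written sample, and the variable names are illustrative only.

```python
# Hypothetical flat list of cell values, three per table row
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M1B", "Scarborough", "Malvern, Rouge",
    "M1C", "Scarborough", "Rouge Hill, Port Union, Highland Creek",
]

# The slicing only lines up if the list holds complete groups of three
assert len(cells) % 3 == 0

codes = cells[0::3]   # every third value starting at index 0
areas = cells[1::3]   # every third value starting at index 1
hoods = cells[2::3]   # every third value starting at index 2

# zip re-assembles the three columns into rows
rows = list(zip(codes, areas, hoods))
print(rows[1])
```

* If the table ever gains or loses a column, the length check fails early instead of silently shifting values between columns.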

In [4]:
# get table cells as data points
postal_codes = data[::3]
boroughs = data[1::3]
neighborhoods = data[2::3]

In [5]:
# generate a dataframe
df = pd.DataFrame({'Postal Code': postal_codes,
                   'Borough': boroughs,
                   'Neighborhood': neighborhoods})
df.sort_values('Postal Code', inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
9,M1B,Scarborough,"Malvern, Rouge"
18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
27,M1E,Scarborough,"Guildwood, Morningside, West Hill"
36,M1G,Scarborough,Woburn


##Preprocessing

* Drop all rows whose `borough` has the value `Not assigned`.

In [6]:
df = df.loc[df.Borough != 'Not assigned']
df.reset_index(drop=True, inplace=True)
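
* The effect of the filter and the index reset can be checked on a small frame. This is a sketch with made-up rows; `sample` and `kept` are illustrative names.

```python
import pandas as pd

# Hypothetical rows: two assigned boroughs, two unassigned
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M1B", "M2A", "M1C"],
    "Borough": ["Not assigned", "Scarborough", "Not assigned", "Scarborough"],
})

# Keep only assigned boroughs, then renumber the rows from 0
kept = sample.loc[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept)
```

* Without `drop=True`, the old index would be kept as an extra column rather than discarded.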

In [7]:
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    display(df)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


* One of the neighborhoods, `Forest Hill North &amp; West, Forest Hill Road Park`, still contains the HTML entity `&amp;` because the cell values were taken from the raw HTML.
* Let's replace it with a plain `&`.

In [9]:
# assign the result back; an in-place replace on a selected column may not update df
df['Neighborhood'] = df['Neighborhood'].replace(
    'Forest Hill North &amp; West, Forest Hill Road Park',
    'Forest Hill North & West, Forest Hill Road Park')
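
* Rather than fixing one known string, every entity can be decoded in one pass with the standard library's `html.unescape`. The sketch below uses a hand-written list (`raw_names` is an assumption for illustration).

```python
import html

# Hypothetical neighborhood values, one still carrying an HTML entity
raw_names = [
    "Forest Hill North &amp; West, Forest Hill Road Park",
    "Woburn",
]

# html.unescape turns entities such as &amp; back into their characters
clean_names = [html.unescape(name) for name in raw_names]
print(clean_names[0])
```

* On the dataframe, the same idea would presumably be applied with `df['Neighborhood'].map(html.unescape)`, assuming every value in the column is a string.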

* Check whether any `neighborhood` has the value `Not assigned`.

In [10]:
# check if there is any missing neighborhood
df.loc[df.Neighborhood == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood


In [11]:
print("""
    The data frame has {} rows and {} columns.
""".format(df.shape[0], df.shape[1]))


    The data frame has 103 rows and 3 columns.



* Save the dataframe to a `csv` file.

In [12]:
df.to_csv('toronto_neighborhood_data.csv', index=False)
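
* The `index=False` argument matters when the file is read back: if the index is written, `read_csv` loads it as an extra column named `Unnamed: 0`. The sketch below shows the difference using in-memory buffers and a made-up two-row frame.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the Toronto data
sample = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Write with the index (the default), then read it back
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Write without the index, then read it back
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```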