# Web Scraping: Toronto Postal Codes
by: Diardano Raihan (Indonesia)
<hr>

Let's start the project by scraping the following Wikipedia page.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Objective:
- Extract the table of Toronto postal codes from the HTML page and transform it into a pandas dataframe.

Let's get started by importing some basic libraries.

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False

## Import the Document

Our document is an HTML page. Let's see what the page looks like:

In [27]:
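with open('datasets/toronto_postal_codes.html', encoding='utf8') as file:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(file, 'html.parser')
soup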

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3b58f6b1-f524-437f-93e3-4c996b611ad7","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":979555370,"wgRevisionId":979555370,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communications in Ontario","P

Now we will build our dataframe with three columns: __PostalCode__, __Borough__, and __Neighbourhood__.

Here is how we can scrape the column names from the HTML page:

In [28]:
# Find all the table headers
soup.find_all('th')

[<th>Postal Code
 </th>,
 <th>Borough
 </th>,
 <th>Neighbourhood
 </th>,
 <th class="navbox-title" style="font-size:110%"><a href="/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">Canadian postal codes</a>
 </th>]

Notice that the list has 4 elements. __We only need the first three elements (stripped of their tags and white space)__ to use as the column names of our dataframe.

In [29]:
# Initiate an empty list
columns = []

# Loop over the first 3 elements of the list
for i in range(3):
    column = list(soup.find_all('th')[i].stripped_strings)[0]
    column = column.replace(" ", "")
    columns.append(column)

print(columns)

# Create an empty dataframe
toronto_df = pd.DataFrame(columns=columns)
toronto_df.head()

['PostalCode', 'Borough', 'Neighbourhood']


Unnamed: 0,PostalCode,Borough,Neighbourhood


Congratulations, we just finished creating an empty dataframe.
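
To see what `stripped_strings` and the space removal do in isolation, here is a minimal sketch on a hand-written HTML fragment. The fragment and the names `sample_html`, `sample_soup`, and `sample_columns` are made up for illustration; they only mimic the header cells shown above and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the header row of the Wikipedia table
sample_html = """
<table>
  <tr><th>Postal Code
  </th><th>Borough
  </th><th>Neighbourhood
  </th></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# stripped_strings yields each text node with surrounding whitespace removed;
# replace() then drops the inner space in "Postal Code"
sample_columns = [
    next(th.stripped_strings).replace(" ", "")
    for th in sample_soup.find_all('th')[:3]
]

print(sample_columns)
```

Slicing with `[:3]` plays the same role as `range(3)` in the cell above: it keeps only the three header cells that belong to the postal code table.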

Now we will scrape the table data from the page by finding all the table row tags (`<tr>`).

In [30]:
# Get the first <table> tag and collect all of its <tr> tags in a list
tableData = soup.table.find_all('tr')

# Skip the first row of the table since it contains only the table header tags
tableData = tableData[1:]

print('The original postal code table contains {} rows'.format(len(tableData)))

The original postal code table contains 180 rows


Let's take a look at the first item in this list.

In [31]:
# Each element of the list is a tag object.
# For each element, we can extract the text content, stripped of tags and surrounding white space
list(tableData[0].stripped_strings)

['M1A', 'Not assigned', 'Not assigned']

As a reminder, the rules to follow are:

- Only process the cells that have an assigned borough. __Ignore cells with a borough that is `Not assigned`.__

- If a cell has a borough but a `Not assigned` neighbourhood, then the neighbourhood will be the same as the borough.
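
The two rules above can be sketched as a small helper that works on one row at a time. The function name `clean_row` and the sample rows are hypothetical (the last row is invented to exercise the second rule); this is only an illustration under those assumptions, not the notebook's own implementation.

```python
def clean_row(cells):
    """Return (postal_code, borough, neighbourhood), or None if the row is skipped."""
    postal_code, borough, neighbourhood = cells

    # Rule 1: ignore rows whose borough is not assigned
    if borough == 'Not assigned':
        return None

    # Rule 2: an unassigned neighbourhood takes the borough's name
    if neighbourhood == 'Not assigned':
        neighbourhood = borough

    return postal_code, borough, neighbourhood


sample_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M9Z', 'Example Borough', 'Not assigned'],  # made-up row for illustration
]

cleaned = [row for row in map(clean_row, sample_rows) if row is not None]
print(cleaned)
```

Keeping the rules in one function makes them easy to check on a handful of rows before running them over the whole table.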


In [32]:
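# Collect the cleaned rows in a list first; DataFrame.append was removed in pandas 2.0
rows = []

for row in tableData:
    data = list(row.stripped_strings)

    # Skip postal codes without an assigned borough
    if data[1] == 'Not assigned':
        continue

    postalCode = data[0]
    borough = data[1]

    # Fall back to the borough name when the neighbourhood is not assigned
    if data[2] != 'Not assigned':
        neighborhood = data[2]
    else:
        neighborhood = borough

    rows.append({'PostalCode': postalCode,
                 'Borough': borough,
                 'Neighbourhood': neighborhood})

# Build the dataframe in one step from the collected rows
toronto_df = pd.DataFrame(rows, columns=columns)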

In [33]:
# Save the dataframe to a CSV file without the index column
toronto_df.to_csv('datasets/toronto_postal_codes.csv', index=False)

# Show the first 5 rows of the dataframe
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Finally, we can use the `.shape` attribute to print the number of rows and columns of our dataframe.

In [34]:
print('The shape of our dataframe is {} with the following details:\n- {} rows\n- {} columns\n- {} unique postal codes\n- {} unique boroughs'.format(toronto_df.shape, toronto_df.shape[0], toronto_df.shape[1],
                                   len(toronto_df.PostalCode.unique()), len(toronto_df.Borough.unique())))

The shape of our dataframe is (103, 3) with the following details:
- 103 rows
- 3 columns
- 103 unique postal codes
- 10 unique boroughs
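
As a side note, the same summary can be computed with `Series.nunique()` instead of `len(... .unique())`. The sketch below uses a tiny hand-built dataframe, `sample_df`, whose three rows are copied from the output shown earlier; the variable names are assumptions for illustration only.

```python
import pandas as pd

# Three rows taken from the head of the dataframe above
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# .shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = sample_df.shape

# nunique() counts the distinct values in a column
n_codes = sample_df['PostalCode'].nunique()
n_boroughs = sample_df['Borough'].nunique()

print(n_rows, n_cols, n_codes, n_boroughs)
```

If the number of rows equals the number of unique postal codes, as it does in the full dataframe above, every postal code appears exactly once.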
