### Part 1 - Web Scraping
Import statements

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

Get HTML of website and use BeautifulSoup to parse

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

Find table containing neighborhood data

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', { 'class' : 'wikitable sortable' })

Function to extract data into a list of lists

In [4]:
def tableDataText(table):       
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row
    return rows

Run above function on our HTML data

In [5]:
list_table = tableDataText(table)

Convert list of lists to pandas data frame, clean, and ensure dataframe meets assignment specifications

In [6]:
df = pd.DataFrame(list_table[1:], columns=list_table[0])
df = df[df['Borough'] != 'Not assigned']
print("Columns: {}".format(list(df.columns)))
if len(df.loc[df['Borough']=='Not assigned']) == 0:
    print("No unassigned boroughs")
if len(df['Postal Code'].unique()) == len(df):
    print("One neighborhood per postal code")
if len(df.loc[df['Neighbourhood'] == 'Not assigned']) == 0:
    print("No boroughs with unassigned neighbourhoods")

Columns: ['Postal Code', 'Borough', 'Neighbourhood']
No unassigned boroughs
One neighborhood per postal code
No boroughs with unassigned neighbourhoods


In [7]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [8]:
df.shape

(103, 3)

### Part 2 - Adding Coordinates
Reading csv containing coordinates and postal codes into notebook, joining with above df

In [9]:
coordinates = pd.read_csv('Geospatial_Coordinates.csv')
df.join(coordinates.set_index('Postal Code'), on ='Postal Code')
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
