#### Scraping Toronto neighbourhoods and postal codes from Wikipedia with BeautifulSoup4 and pandas


Import pandas and BeautifulSoup to scrape an HTML table into a pandas DataFrame

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests
#import geocoder

Define the URL, fetch the page with requests, and parse the HTML into a BeautifulSoup object using the 'lxml' parser

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = requests.get(url)
req.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
page = bs(req.content, 'lxml')
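
To make the parsing step concrete without a network call, here is a minimal sketch of what `find('table')` returns and how its rows map onto a DataFrame. The inline HTML, the names `demo_html`, `demo_table` and `demo`, and the use of the standard-library `'html.parser'` backend are illustration-only assumptions; the notebook itself parses the live page with `'lxml'`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy HTML standing in for the Wikipedia page (made-up sample, not real page content)
demo_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml
soup = BeautifulSoup(demo_html, 'html.parser')

# find() returns the first matching tag, here the only <table>
demo_table = soup.find('table')

# Collect the text of every header and data cell, one list per table row
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in demo_table.find_all('tr')
]

# The first row holds the column headers, the remaining rows hold the data
demo = pd.DataFrame(rows[1:], columns=rows[0])
print(demo)
```

Building the frame by hand like this is mainly useful for seeing the structure; on the real page `pd.read_html` does the same row and header extraction in one call.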

#### Step-by-step walkthrough
- Find the table in the bs4 object
- Read the table with pandas straight into a DataFrame, using the first row as column headers
- Drop all rows whose 'Borough' is 'Not assigned'
- Join all Neighbourhoods that share a Postcode into one comma-separated string
- Reset the index so the result is an integer-indexed DataFrame again
- Verify the 'M5A' case
- Replace every Neighbourhood equal to 'Not assigned' with the Borough value of that row
- Verify this change to the DataFrame
- Show the first rows of the resulting DataFrame
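
The cleaning steps above can be sketched on a handful of toy rows, which makes each transformation easy to follow. The rows in `toy` are a made-up sample shaped like the scraped table, and the names `toy`, `assigned`, `joined` and `mask` are illustrative only.

```python
import pandas as pd

# Toy rows mimicking the raw table: an unassigned postcode, a postcode
# with two neighbourhoods, and a borough without a named neighbourhood
toy = pd.DataFrame(
    [
        ('M1A', 'Not assigned', 'Not assigned'),
        ('M5A', 'Downtown Toronto', 'Harbourfront'),
        ('M5A', 'Downtown Toronto', 'Regent Park'),
        ('M7A', "Queen's Park", 'Not assigned'),
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

# Drop rows whose Borough is 'Not assigned'
assigned = toy[toy['Borough'] != 'Not assigned']

# Join the neighbourhoods of each postcode, then restore an integer index
joined = (
    assigned.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)

# Fall back to the Borough name where no neighbourhood was assigned
mask = joined['Neighbourhood'] == 'Not assigned'
joined.loc[mask, 'Neighbourhood'] = joined.loc[mask, 'Borough']

print(joined)
```

Assigning through `.loc` with a boolean mask writes to `joined` itself, whereas indexing the column first and then assigning can end up modifying a temporary copy.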

In [3]:
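from io import StringIO

# The first <table> on the page holds the postal codes
table = page.find('table')
# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)), header=0)[0]
# Keep only rows with an assigned Borough
df = df[df.Borough != 'Not assigned']
# Join all Neighbourhoods that share a Postcode and Borough, then restore an integer index
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
print('verify that the \'M5A\'-case is correct : \n{}\n '.format(df[df.Postcode == 'M5A']))
# Use .loc instead of chained indexing, which may assign to a temporary copy
mask = df.Neighbourhood == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
print('verify the Neighbourhood \'Not assigned\' method:\n{}\n'.format(df[df.Borough == 'Queen\'s Park']))
df.head(12)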

verify that the 'M5A'-case is correct : 
   Postcode           Borough              Neighbourhood
53      M5A  Downtown Toronto  Harbourfront, Regent Park
 
verify the Neighbourhood 'Not assigned' method:
   Postcode       Borough Neighbourhood
85      M7A  Queen's Park  Queen's Park



|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [4]:
print('Toronto summary : \n There are {} unique Postcodes and \n {} Boroughs in the final DataFrame'
      .format(df.Postcode.nunique(), df.Borough.nunique()))

Toronto summary : 
 There are 103 unique Postcodes and 
 11 Boroughs in the final DataFrame


In [5]:
df.shape

(103, 3)
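
The summary and shape checks above can also be expressed as explicit validations, so that a change in the page layout fails loudly instead of producing a subtly wrong table. This is only a sketch: `validate_postcodes` is a hypothetical helper, not part of pandas, and `sample` is a made-up two-row frame standing in for the real `df`.

```python
import pandas as pd

def validate_postcodes(frame):
    """Check the invariants the cleaning steps should guarantee; return the shape."""
    # After the groupby there should be exactly one row per postcode
    if not frame['Postcode'].is_unique:
        raise ValueError('duplicate Postcode rows remain')
    # No placeholder values should survive in either text column
    for column in ('Borough', 'Neighbourhood'):
        if (frame[column] == 'Not assigned').any():
            raise ValueError("'Not assigned' left in column {}".format(column))
    return frame.shape

# Made-up sample shaped like the cleaned DataFrame
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront, Regent Park', "Queen's Park"],
})

print(validate_postcodes(sample))
```

Calling `validate_postcodes(df)` on the scraped frame would then either return its shape or raise with a message naming the violated invariant.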