In [21]:
import numpy as np
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly

# Scrape neighbourhood information from Wikipedia

The first step is to request the URL that holds the information and read all of its content into a variable.

In [22]:
wiki = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
text = wiki.read()

Decode the downloaded bytes as UTF-8 to get a string.

In [23]:
f = text.decode("UTF-8")

As there is only one table in the webpage, it is easy to extract: look for the _<table_ and _</table>_ tags in the content and select everything between them.

In [24]:
tab = f[f.find("<table"):f.find("</table>")+8]
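To see what that slice does, here is a minimal sketch on a hand-written HTML string. The `sample` markup is made up for illustration and is not the Wikipedia page. The `+ 8` in the cell above is the length of `</table>`, which is what keeps the closing tag inside the slice.

```python
# Toy HTML standing in for the downloaded page (hypothetical content).
sample = (
    "<html><body><p>intro</p>"
    "<table class='t'><tr><td>M1B</td></tr></table>"
    "<p>end</p></body></html>"
)

start = sample.find("<table")                     # index of the opening tag
end = sample.find("</table>") + len("</table>")   # one past the closing tag
table_html = sample[start:end]

print(table_html)
```

Note that `str.find` returns the first match, so with several tables only the first one is extracted, and it returns -1 when a tag is missing, in which case the slice quietly yields the wrong text. The shortcut only suits pages known to contain a table.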

Pandas can read a table directly from HTML, so just call the reader.

In [25]:
from io import StringIO  # recent pandas versions warn when given a raw HTML string
dat = pd.read_html(StringIO(tab), header = 0)[0]

I have not used Beautiful Soup because it looks like overkill for this specific task.

Now keep only the rows whose Borough has an assigned name, and for neighbourhoods with no assigned name use the Borough name instead.

In [26]:
dat = dat[dat.Borough != "Not assigned"].copy()
# Use .loc instead of chained assignment, which may fail to update dat
unnamed = dat.Neighbourhood == "Not assigned"
dat.loc[unnamed, "Neighbourhood"] = dat.loc[unnamed, "Borough"]
dat.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
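
The same fallback can also be written as a single expression with `Series.where`, which keeps a value where the condition holds and takes the replacement elsewhere. This is only a sketch on toy rows: the `toy` frame and its values are chosen for illustration and are not read from the page.

```python
import pandas as pd

# Toy rows mimicking the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows without a Borough; .copy() gives an independent frame to modify.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Keep the neighbourhood where it has a name, otherwise use the Borough.
has_name = toy["Neighbourhood"] != "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].where(has_name, toy["Borough"])

print(toy)
```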


Now let us define a function (*concatenateN*) that takes a series of neighbourhood names and joins them with commas, taking care that there is no comma after the last name. Let us also define a function (*selectB*) that takes the list of Boroughs associated with a postcode and verifies that the postcode is completely contained within a single Borough. If it is not, it raises an exception; if it is, it returns the name of the Borough.

In [27]:
def concatenateN(x):
    cad = ""
    for i in range(len(x)-1):
        cad = cad + x.iloc[i] + ", "
    cad += x.iloc[-1]
    return cad

def selectB(x):
    ref = x.iloc[0]
    for i in range(1, len(x)):
        if ref != x.iloc[i]:
            print(x)  # show the conflicting Boroughs once before failing
            raise Exception("Postcode comprises two Boroughs")
    return ref
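
For comparison, both helpers can be written more compactly with built-ins. This is a sketch, not a replacement for the functions above: the names `concatenate_names` and `single_borough` are hypothetical, and both accept any iterable of strings (a plain list here, but a pandas Series works the same way when iterated).

```python
def concatenate_names(names):
    """Join names with ', ' so that no comma follows the last one."""
    return ", ".join(names)


def single_borough(boroughs):
    """Return the Borough if every entry agrees, otherwise raise."""
    unique = set(boroughs)
    if len(unique) != 1:
        raise ValueError(f"Postcode does not map to exactly one Borough: {sorted(unique)}")
    return unique.pop()


print(concatenate_names(["Rouge", "Malvern"]))          # Rouge, Malvern
print(single_borough(["Scarborough", "Scarborough"]))   # Scarborough
```

One difference in behaviour: an empty input makes `single_borough` raise `ValueError`, whereas indexing the first element as in *selectB* would raise `IndexError`.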

Let us now group all the data by **Postcode** to compose the combined neighbourhood names for each postcode.

In [28]:
pdd = dat.groupby(["Postcode"]).agg({"Borough": selectB,
                                 "Neighbourhood": concatenateN})
pdd.head()

Postcode,Neighbourhood,Borough
M1B,"Rouge, Malvern",Scarborough
M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough
M1E,"Guildwood, Morningside, West Hill",Scarborough
M1G,Woburn,Scarborough
M1H,Cedarbrae,Scarborough
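
To check the grouping logic in isolation, here is a small sketch on three toy rows (the `toy` frame is illustrative, not the scraped data). It assumes built-in aggregators are acceptable: `"first"` picks the Borough and `", ".join` builds the name list, while a separate `nunique` test stands in for the consistency check, since `"first"` alone would not detect a postcode spanning two Boroughs.

```python
import pandas as pd

# Toy rows: one postcode with two neighbourhoods, one with a single one.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Every postcode must belong to exactly one Borough.
assert (toy.groupby("Postcode")["Borough"].nunique() == 1).all()

grouped = toy.groupby("Postcode").agg({
    "Borough": "first",            # safe only because of the check above
    "Neighbourhood": ", ".join,    # comma-separated names, no trailing comma
})

print(grouped)
```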


And the final size of my dataframe is:

In [29]:
pdd.shape

(103, 2)

And it's done - 103 x 2 as expected

In [49]:
import base64
import pandas as pd
from IPython.display import HTML

def create_download_link( df, title = "Download CSV file (Toronto)", filename = "data.csv"):
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

a = create_download_link(pdd,filename='Toronto.csv')

a


Now we can download the file to use in other exercises.
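
The download link works because the whole CSV is embedded in the link itself as a base64 `data:` URI. The sketch below shows that round trip with the standard library only; `csv_text` is a toy string, not the real file.

```python
import base64

csv_text = "Postcode,Borough\nM1B,Scarborough\n"  # toy CSV for illustration

# Encode the text as base64 and place it in a data URI, as the link does.
payload = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
href = f"data:text/csv;base64,{payload}"

# Decoding the part after the first comma recovers the original CSV text.
recovered = base64.b64decode(href.split(",", 1)[1]).decode("utf-8")

print(recovered == csv_text)
```

Because the data travels inside the link, this approach fits small tables like this one; the link grows with the size of the CSV.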