## Grabbing the Toronto data from Wikipedia

Importing the required libraries

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Fetching the Wikipedia page and parsing its HTML

In [2]:
# Download the page; the variable holds the HTML text, not the URL
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_html, 'lxml')

In [3]:
table = soup.find('table', attrs={'class' : 'wikitable sortable'})

Parsing the table rows into a list of lists

In [4]:
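data = []
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    # A header row contains only <th> cells, so it yields an empty list here
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    # Keep only the non-empty cell values
    data.append([ele for ele in cols if ele])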
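A small offline sketch of the same parsing loop may help show why the first entry of `data` can end up empty: a header row holds `<th>` cells, so `find_all('td')` returns nothing for it. The HTML fragment below is made up for illustration and is not the real page content, and the built-in `html.parser` backend is used instead of `lxml` only to keep the example's dependencies small.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (illustration only)
html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'wikitable sortable'})

data = []
for row in table.find('tbody').find_all('tr'):
    # Only <td> cells are collected, so the <th> header row gives []
    cells = [cell.text.strip() for cell in row.find_all('td')]
    data.append([text for text in cells if text])

print(data[0])  # the header row, parsed as an empty list
print(data[2])  # a regular data row
```

Because the first entry is empty in this situation, overwriting it with explicit column names loses nothing.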

Setting the column headers in the first row

In [5]:
data[0] = ['Postcode', 'Borough', 'Neighbourhood']

Dropping the rows whose `Borough` is `Not assigned`

In [6]:
for row in data[1:]:
    if row[1] == 'Not assigned':
        data.remove(row)

Assigning the `Borough` value to rows whose `Neighbourhood` is `Not assigned`

In [7]:
for row in data[1:]:
    if row[2] == 'Not assigned':
        row[2] = row[1]
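The two cleaning steps above can also be sketched as a single pass that builds a new list. The removal loop is safe as written because `data[1:]` creates a copy, so the iteration is not disturbed when items are removed from `data` itself; building a new list simply avoids having to rely on that. The helper name `clean_rows` and the sample rows are hypothetical, and the sketch assumes every row has exactly three cells.

```python
def clean_rows(rows):
    """Drop rows without a borough and fill in missing neighbourhoods."""
    cleaned = []
    for postcode, borough, neighbourhood in rows:
        if borough == 'Not assigned':
            continue  # no borough: skip the row entirely
        if neighbourhood == 'Not assigned':
            neighbourhood = borough  # fall back to the borough name
        cleaned.append([postcode, borough, neighbourhood])
    return cleaned


# Sample rows for illustration only
sample = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],
]
cleaned = clean_rows(sample)
print(cleaned)
```

Under the same three-cell assumption, this could be applied to the notebook's list as `data[1:] = clean_rows(data[1:])`.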

Converting the list into a `DataFrame`

In [8]:
df = pd.DataFrame(data[1:],columns=data[0])
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


Merging the `Neighbourhood` values that share the same `Postcode` and `Borough`

In [9]:
df.set_index(['Postcode','Borough'],inplace=True)
df = df.groupby(level=['Postcode','Borough'], sort=False).agg( ','.join)
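To make explicit what the `groupby(...).agg(','.join)` call does, the same merge can be sketched in plain Python. This is an illustration of the idea rather than how pandas implements it; the helper name `merge_neighbourhoods` is hypothetical, and the sample rows are taken from the `df.head()` output shown earlier.

```python
def merge_neighbourhoods(rows):
    """Join the neighbourhoods that share a (postcode, borough) pair."""
    groups = {}  # dicts keep insertion order, mirroring sort=False
    for postcode, borough, neighbourhood in rows:
        groups.setdefault((postcode, borough), []).append(neighbourhood)
    return [
        [postcode, borough, ','.join(names)]
        for (postcode, borough), names in groups.items()
    ]


rows = [
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
    ['M5A', 'Downtown Toronto', 'Regent Park'],
]
merged = merge_neighbourhoods(rows)
print(merged)
```

Each distinct key keeps the position of its first appearance, which is why the merged table preserves the original postcode order.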

In [10]:
df.reset_index()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway,Montgomery Road,Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern |
| 101 | M8Y | Etobicoke | Humber Bay,King's Mill Park,Kingsway Park Sout... |


The `Neighbourhood` values that share a `Postcode` and `Borough` are now combined into a single comma-separated entry per row

In [11]:
df = df.reset_index()
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


Saving the `DataFrame` to a CSV file

In [12]:
df.to_csv('toronto_data.csv')

Checking the size of the `DataFrame`

In [13]:
df.shape

(103, 3)