<H1> DataFrame Construction -<br> Segmenting and Clustering Neighborhoods in the city of Toronto, Canada</H1>

Import necessary libraries

In [1]:
from bs4 import BeautifulSoup
import requests
from lxml import html
import pandas as pd
import re

Retrieve the Wikipedia page on postal codes as HTML and parse it using BeautifulSoup

In [2]:
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').content
soup = BeautifulSoup(wiki_html, 'lxml')

Retrieve the table from the HTML and collect all rows except the first into a list.

In [3]:
table = soup.find('table')
rows = table.find_all('tr')
# Since the first row is the header, exclude it
rows = rows[1:]

Create a pandas DataFrame with the columns labeled in the exercise and populate the values accordingly. <br>Also, I noticed that '\n' is appended to all values within the 'Neighborhood' column. Hence, I removed it using the split function from the re library.

In [4]:
toronto_df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'], index=range(len(rows)))
for i, row in enumerate(rows):
    columns = row.find_all('td')
    for j, column in enumerate(columns):
        toronto_df.iloc[i,j] = column.get_text()
        
toronto_df['Neighborhood'] = toronto_df['Neighborhood'].apply(lambda x:re.split("\n",x)[0])

In [5]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
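
As an alternative to filling the frame cell by cell, the rows can be collected into a list of lists and handed to the DataFrame constructor in one step. The sketch below is only an illustration: `sample_html` is a made-up stand-in for the Wikipedia table (the live page may be laid out differently), and it uses Python's built-in `html.parser` so that it runs without a network call. `get_text(strip=True)` drops the trailing '\n', which avoids the separate clean-up pass.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table; the real page may differ.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Skip the header row, then strip whitespace (including '\n') from every cell.
records = [
    [cell.get_text(strip=True) for cell in tr.find_all('td')]
    for tr in sample_soup.find('table').find_all('tr')[1:]
]

# Build the frame in one call instead of assigning cell by cell.
sample_df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```

Building the frame from a list of records also avoids pre-allocating an empty frame whose size has to match the number of rows.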


Remove all rows where 'Borough' is 'Not assigned'.<br>
Also, replace the Neighborhood name with the Borough name where Neighborhood is 'Not assigned'. (Notice the row at index 6.)

In [6]:
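# Keep only rows with an assigned borough and renumber the index
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)

# Assign through .loc: changes made to the rows yielded by iterrows() are not guaranteed to reach the DataFrame
unassigned = toronto_df['Neighborhood'].str.contains('Not assigned')
toronto_df.loc[unassigned, 'Neighborhood'] = toronto_df.loc[unassigned, 'Borough']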

toronto_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Combine all the Neighborhood names which belong to the same PostalCode.<br> 
This was achieved by grouping the rows on the columns 'PostalCode' and 'Borough' and concatenating the Neighborhood values.<br><br>
Also, I have pickled the dataframe to be used in the next notebook.

In [7]:
toronto_df2 = pd.DataFrame(columns=toronto_df.columns)
for idx, (i, j) in enumerate(toronto_df.groupby(['PostalCode','Borough'])):
    pcode = i[0]
    borou = i[1]
    neigh = j['Neighborhood'].str.cat(sep=', ')
    toronto_df2.loc[idx, 'PostalCode'] = pcode
    toronto_df2.loc[idx, 'Borough'] = borou
    toronto_df2.loc[idx, 'Neighborhood'] = neigh

toronto_df2.to_pickle('toronto_df.pkl')
toronto_df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
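
The same grouping can probably be written more compactly by letting pandas do the aggregation. The sketch below is not the code used in this notebook; it applies `groupby` followed by `agg(', '.join)` to a small toy frame (`toy_df`, with rows copied from the output shown earlier) to illustrate the idea.

```python
import pandas as pd

# Toy rows copied from the earlier output; not the full Toronto table.
toy_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B', 'M1B', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'Scarborough', 'Scarborough', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park',
                     'Rouge', 'Malvern', "Queen's Park"],
})

# Group on both key columns and join each group's neighborhoods with ', '.
# reset_index() turns the group keys back into ordinary columns.
grouped = (
    toy_df
    .groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(grouped)
```

As with the loop above, `groupby` sorts the result by the group keys, so the postal codes come out in alphabetical order rather than in their original order.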


In [9]:
toronto_df2.shape

(103, 3)
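
For reference, a pickled DataFrame can be loaded back with `pd.read_pickle`. The sketch below round-trips a one-row demo frame through a temporary file; `demo_df` and `demo_df.pkl` are illustrative names, chosen so that the example does not touch the 'toronto_df.pkl' file written above.

```python
import os
import tempfile

import pandas as pd

# One-row demo frame; a stand-in for the real table.
demo_df = pd.DataFrame({
    'PostalCode': ['M1B'],
    'Borough': ['Scarborough'],
    'Neighborhood': ['Rouge, Malvern'],
})

# Write to a temporary directory, then read the pickle back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_df.pkl')
    demo_df.to_pickle(demo_path)
    restored = pd.read_pickle(demo_path)

# The restored frame should match the original exactly.
print(restored.equals(demo_df))
```

In the next notebook the equivalent call would presumably be `pd.read_pickle('toronto_df.pkl')`, assuming the file sits in the working directory.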