<h1>Segmenting and Clustering Neighborhoods in Toronto: Part 1</h1>

Prepare the data frame

In [102]:
import pandas as pd  # library for data analysis

# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighborhood']

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

First parse the HTML content with BeautifulSoup

In [103]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read(), "html.parser")

Retrieve every tr row of the table that holds the Toronto neighborhood data and build the dataframe from those rows

In [104]:
rows = []  # collect plain dicts first; DataFrame.append was removed in pandas 2.0
for row in soup.select("table[class='wikitable sortable'] tr"):
    cells = row.select("td")
    if len(cells) < 3:  # skip the header row and any incomplete row
        continue
    values = []
    for cell in cells[:3]:
        # .string is None when the text is wrapped in a link, so fall back to the <a> tag
        text = cell.string if cell.string is not None else cell.select("a")[0].string
        values.append(text.strip())
    rows.append(dict(zip(column_names, values)))

# build the dataframe once from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
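
The row extraction above has to cope with cells whose text sits inside a link, where .string returns None. As a possible alternative, the sketch below uses get_text(strip=True), which returns the text of a cell whether or not it is wrapped in a tag. The sample_html string is a made-up two-row sample that only mimics the structure of the Wikipedia table; it is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure of the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td><a href="#">Queen's Park</a></td><td>Not assigned</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

parsed_rows = []
for tr in sample_soup.select("table.wikitable tr"):
    cells = tr.find_all("td")
    if len(cells) < 3:  # the header row contains only <th> cells
        continue
    # get_text(strip=True) returns the cell text even when it is inside a link
    parsed_rows.append([cell.get_text(strip=True) for cell in cells[:3]])

print(parsed_rows)
```

Note that get_text concatenates all text in a cell, so a cell that mixes plain text and links would yield the combined string rather than only the link text.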


Check the initial shape of the data frame

In [105]:
neighborhoods.shape

(288, 3)

Drop rows whose borough is Not assigned

In [106]:
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned']
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
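
The filter above is a boolean mask. A minimal sketch on a made-up three-row frame (toy_rows is illustrative data, not the scraped table) shows that the mask keeps the original index labels, which is why the output above starts at index 2 rather than 0.

```python
import pandas as pd

# Illustrative data only; the values imitate the scraped table
toy_rows = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# True for every row that has a real borough
has_borough = toy_rows['Borough'] != 'Not assigned'
filtered = toy_rows[has_borough]

# The surviving rows keep their original index labels
print(filtered.index.tolist())
```

If a fresh 0-based index is preferred, filtered.reset_index(drop=True) would provide one; the notebook does not need it because the later groupby rebuilds the index anyway.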


To be able to combine rows with the same Postcode, we first group by Postcode and Borough

In [121]:
neighborhoodsGroupByPostcode = neighborhoods.groupby(['Postcode', 'Borough'])
neighborhoodsGroupByPostcode.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
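
Calling head() on a grouped object returns the first rows of each group (up to five by default), not the first five rows overall, which is why the output above has more than five rows. The sketch below illustrates this on made-up data and also uses size() to count the rows per group; the variable names are hypothetical.

```python
import pandas as pd

# Illustrative data only
toy_rows = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

grouped = toy_rows.groupby(['Postcode', 'Borough'])

# Number of rows in each (Postcode, Borough) group
group_sizes = grouped.size()

# head(1) keeps the first row of every group, in original row order
first_per_group = grouped.head(1)

print(group_sizes)
print(first_per_group)
```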


Merge the rows with the same Postcode by joining the Neighborhoods

In [126]:
neighborhoods = neighborhoodsGroupByPostcode['Neighborhood'].apply(lambda x: ', '.join(x)).to_frame().reset_index()
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
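
The same merge can be sketched on a small made-up frame to make the behavior visible: groupby sorts the result by the group keys by default, and the join keeps the neighborhoods in their original order within each group. This variant passes ', '.join to agg directly instead of wrapping it in a lambda; the data and names are illustrative.

```python
import pandas as pd

# Illustrative data only, deliberately not sorted by Postcode
toy_rows = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M5A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', 'Harbourfront', 'Regent Park'],
})

# Join the neighborhood names of each (Postcode, Borough) group into one string
merged = (
    toy_rows.groupby(['Postcode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(merged)
```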


Set the Neighborhood to the Borough when the Neighborhood is Not assigned

In [138]:
neighborhoods.loc[neighborhoods['Neighborhood'] == 'Not assigned', 'Neighborhood'] = neighborhoods['Borough']
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
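
The first five rows shown above contain no unassigned neighborhood, so the effect of the assignment is not visible there. A minimal sketch on made-up data (toy_rows is hypothetical) shows what happens to a row such as M7A, whose neighborhood is Not assigned.

```python
import pandas as pd

# Illustrative data only
toy_rows = pd.DataFrame({
    'Postcode': ['M7A', 'M1B'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Rouge, Malvern'],
})

# Select the rows without a neighborhood and copy the borough name into them
unassigned = toy_rows['Neighborhood'] == 'Not assigned'
toy_rows.loc[unassigned, 'Neighborhood'] = toy_rows.loc[unassigned, 'Borough']

print(toy_rows)
```

Restricting the right-hand side to the same mask is optional, since pandas aligns the assigned values on the index, but it makes the intent explicit.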


The final shape of the data frame is:

In [136]:
neighborhoods.shape

(103, 3)
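
Beyond checking the shape, a few sanity checks can confirm that the cleaning steps had the intended effect. The helper below is a hypothetical sketch, not part of the original notebook: it reports duplicate postcodes and any remaining Not assigned values, and is shown here on made-up data.

```python
import pandas as pd

def find_problems(df):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    if not df['Postcode'].is_unique:
        problems.append('duplicate postcodes')
    if (df['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs')
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhoods')
    return problems

# Illustrative data only
clean = pd.DataFrame({
    'Postcode': ['M1B', 'M7A'],
    'Borough': ['Scarborough', "Queen's Park"],
    'Neighborhood': ['Rouge, Malvern', "Queen's Park"],
})
dirty = pd.DataFrame({
    'Postcode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

print(find_problems(clean))
print(find_problems(dirty))
```

In the notebook itself, the call would be find_problems(neighborhoods).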