# Segmenting and Clustering Neighborhoods in Toronto

In this notebook we explore, segment, and cluster the neighborhoods of the city of Toronto. The neighborhood data is not readily available in a structured format, but a Wikipedia page contains all the information we need. We will scrape that page, clean and wrangle the data, and load it into a pandas dataframe.

Let's start by importing the required libraries. _requests_ will be used to retrieve the Wikipedia page, and _BeautifulSoup_ to parse it.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Retrieve the page with the data and verify that the request succeeded (status code 200 = OK).

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
page.status_code

200
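
Checking the status code by eye works in a notebook, but a script would usually want to fail fast. The sketch below wraps the request in a hypothetical helper, `fetch_html` (the name and the 10-second default timeout are assumptions, not part of the original notebook). It calls `raise_for_status()`, so any 4xx or 5xx response raises an exception instead of passing through unnoticed.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper for illustration; the name and the default
    timeout are assumptions, not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html` with the Wikipedia URL used above would return the page source, ready to be passed to _BeautifulSoup_.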

Now we use the _BeautifulSoup_ library to parse the downloaded page, and verify that the page title is correct.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
print("Page Title: '{}'".format(soup.title.get_text()))
neigh_table = soup.table

Page Title: 'List of postal codes of Canada: M - Wikipedia'


Next, we extract the data, which is held in the table of postal codes. We loop over all the rows and columns of the table to collect the cell values. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [4]:
#columns = [col.get_text().strip('\n') for col in neigh_table.tbody.find_all('th')]
columns = ['PostalCode', 'Borough', 'Neighborhood']
rows = [[col.get_text().strip('\n') for col in row.find_all('td')] for row in neigh_table.tbody.find_all('tr')]
rows[0:5]

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village']]
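
The empty list at the start of the output comes from the header row, whose cells are `th` elements rather than `td`. The sketch below reproduces this on a small, made-up HTML fragment (a simplified stand-in for the Wikipedia table, not its actual markup) and shows how the header cells could be read separately.

```python
from bs4 import BeautifulSoup

# A simplified, made-up stand-in for the Wikipedia table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells are <th> elements, so they are collected separately.
header = [th.get_text().strip() for th in soup.table.find_all("th")]

# One list of cell texts per table row; only <td> cells are collected.
rows = [
    [cell.get_text().strip() for cell in row.find_all("td")]
    for row in soup.table.find_all("tr")
]

# The header row has no <td> cells, so it yields an empty list.
data_rows = [row for row in rows if row]

print(header)
print(rows[0])
print(data_rows)
```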

Then we create a new dataframe. The first row is empty because the header row contains no `td` cells, so we include only the non-empty rows.

In [5]:
df = pd.DataFrame([row for row in rows if len(row) > 0], columns = columns)
print("Number of neighborhoods: ", df.shape[0])
df.head()

Number of neighborhoods:  289


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


We need to do some processing to obtain the final dataframe. First, drop the rows whose borough is **Not assigned**.

In [6]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

print("Number of neighborhoods: ", df.shape[0])
df.head()

Number of neighborhoods:  212


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


If a row has a borough but a **Not assigned** neighborhood, the neighborhood is set to the name of the borough.

In [7]:
neighborhood_na_mask = df['Neighborhood'] == 'Not assigned'
df.loc[neighborhood_na_mask, 'Neighborhood'] = df.loc[neighborhood_na_mask, 'Borough']
df[neighborhood_na_mask]

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 6 | M7A | Queen's Park | Queen's Park |
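
The fill-in step above can also be written with `Series.mask`, which replaces values wherever a condition is true. The following is a minimal sketch on made-up toy data, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Series.mask replaces values where the condition is True with the
# index-aligned values from `other` (here, the Borough column).
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)

print(toy["Neighborhood"].tolist())
```

Rows whose neighborhood is already assigned are left untouched, since the condition is false for them.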


More than one neighborhood can exist in one postal code area. Rows that share a postal code are combined into a single row, with the neighborhoods separated by a comma.

In [8]:
neighborhoods = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighborhoods.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
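
To see the grouping step in isolation, here is a small sketch on made-up toy rows. Note that `groupby` sorts the group keys by default, which is why the combined dataframe above starts at M1B rather than M3A.

```python
import pandas as pd

# Toy data, made up for illustration: M5A appears twice.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Victoria Village"],
})

# Join the neighborhood names within each (PostalCode, Borough) group.
merged = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)
```

The two M5A rows collapse into one, and the result is ordered by postal code rather than by the original row order.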


Next, we display the list of boroughs and the number of neighborhoods in each borough.

In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
neighborhoods['Borough'].value_counts().sort_index()

The dataframe has 11 boroughs and 103 neighborhoods.


Central Toronto      9
Downtown Toronto    18
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Queen's Park         1
Scarborough         17
West Toronto         6
York                 5
Name: Borough, dtype: int64

In the last cell of the notebook, we use the `.shape` attribute to print the number of rows and columns of the dataframe.

In [10]:
neighborhoods.shape

(103, 3)