<h1>Exploring, segmenting, and clustering the neighborhoods in the city of Toronto</h1>


Import libraries

In [102]:
import pandas as pd

import requests

from bs4 import BeautifulSoup

Scrape the data from the Wikipedia page

In [103]:
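# Scrape the required data from Wikipedia
from io import StringIO

req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

soup = BeautifulSoup(req.content, 'lxml')

# Take the first table on the page
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames; wrapping the markup in StringIO
# avoids the literal-HTML deprecation warning in recent pandas versions
df = pd.read_html(StringIO(str(table)))

neighbours = df[0]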
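
To see what `pd.read_html` returns without hitting the network, here is a minimal sketch on a hand-written table. The HTML snippet and the names `html`, `tables`, and `sample` are made up for illustration and only mimic the shape of the Wikipedia table; the call assumes an HTML parser such as `lxml` is installed, as it is for the scraping cell above.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the postal code table, for illustration only
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html))
sample = tables[0]

print(len(tables))
print(sample)
```

Because the result is always a list, the first element has to be selected even when the markup contains a single table.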

The dataframe consists of three columns: Postal Code, Borough, and Neighbourhood

In [104]:
# Let's inspect the data table
neighbours

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [105]:
# Ignore cells with a borough that is Not assigned
neighbours = neighbours[neighbours.Borough != 'Not assigned']

In [106]:
# Reset the index and keep the result (reset_index returns a new DataFrame)
neighbours = neighbours.reset_index(drop=True)
neighbours

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
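
The filter-then-reindex step can be checked on a toy frame. This is a minimal sketch; the three rows and the names `sample` and `assigned` are illustrative, not taken from the scraped data.

```python
import pandas as pd

# Toy data with one unassigned borough (illustrative values)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the rows from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```

Note that `reset_index` returns a new DataFrame, so the result has to be assigned to a name to have any lasting effect.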


In [107]:
# If a row has a borough but a 'Not assigned' neighbourhood,
# use the borough name as the neighbourhood
neighbours = neighbours.assign(
    Neighbourhood=neighbours['Neighbourhood'].mask(
        neighbours['Neighbourhood'] == 'Not assigned', neighbours['Borough']))
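
One way to fill a 'Not assigned' neighbourhood from the borough column is `Series.mask`, sketched here on made-up rows. The borough value "Queen's Park" and the names `sample` and `missing` are assumptions for the example, not part of the scraped table.

```python
import pandas as pd

# Illustrative rows: the second one has a borough but no neighbourhood
sample = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the condition is True, take the value from the Borough column instead
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])

print(sample)
```

The replacement is row-aligned, so each unassigned neighbourhood receives the borough from its own row.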

In [108]:
# Check which unique boroughs exist
neighbours['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

If a cell has a borough but a Not assigned neighbourhood, the neighbourhood is the same as the borough; cell In [107] above handles this. The next cell lists the neighbourhoods of each borough as one comma-separated string.

In [109]:
# Join the neighbourhoods of each borough into one comma-separated string
neighbours.groupby(['Borough'])['Neighbourhood'].apply(', '.join).reset_index()


Unnamed: 0,Borough,Neighbourhood
0,Central Toronto,"Lawrence Park, Roselawn, Davisville North, For..."
1,Downtown Toronto,"Regent Park, Harbourfront, Queen's Park, Ontar..."
2,East Toronto,"The Beaches, The Danforth West, Riverdale, Ind..."
3,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig..."
4,Etobicoke,"Islington Avenue, Humber Valley Village, West ..."
5,Mississauga,Canada Post Gateway Processing Centre
6,North York,"Parkwoods, Victoria Village, Lawrence Manor, L..."
7,Scarborough,"Malvern, Rouge, Rouge Hill, Port Union, Highla..."
8,West Toronto,"Dufferin, Dovercourt Village, Little Portugal,..."
9,York,"Humewood-Cedarvale, Caledonia-Fairbanks, Del R..."
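
The grouping above can be reproduced on a toy frame to see how the join behaves. This is a minimal sketch with illustrative rows; `sample` and `combined` are names chosen for the example.

```python
import pandas as pd

# Illustrative rows: two neighbourhoods share the borough North York
sample = pd.DataFrame({
    "Borough": ["North York", "Etobicoke", "North York"],
    "Neighbourhood": ["Parkwoods", "Mimico NW", "Victoria Village"],
})

# One row per borough, with its neighbourhoods joined by ", "
combined = (
    sample.groupby("Borough")["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

print(combined)
```

`groupby` sorts the group keys by default, which is why the boroughs come out in alphabetical order rather than in their original order.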


In [110]:
# Check the size
neighbours.shape

(103, 3)