# Segmenting and Clustering Neighborhoods in Toronto

### In this notebook I scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M for the neighborhoods of Toronto.
Let's start by importing the needed libraries.

In [3]:
import pandas as pd
import numpy as np

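Fortunately, pandas can handle the scraping on its own, so we don't need to write any separate scraping code. `pd.read_html` returns a list of DataFrames, one per HTML table it finds on the page, and the postal code table is the first one. (pandas relies on an installed HTML parser such as lxml behind the scenes.)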

In [4]:
df=pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We need to fix the column names: Postcode should be Postalcode and Neighbourhood should be Neighborhood.

In [5]:
df.columns = ["Postalcode", "Borough", "Neighborhood"]  # rename the columns by position (Borough keeps its name)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
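Assigning a list to `df.columns` depends on the column order staying exactly as scraped. As a hedged alternative, the same rename can be written as a mapping from old names to new ones. The sketch below uses a hypothetical two-row stand-in (`raw`) rather than the real page, so it is only an illustration of the idea:

```python
import pandas as pd

# Hypothetical two-row stand-in for the scraped table (not the real data).
raw = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Map old names to new names; columns not listed (Borough) keep their name.
renamed = raw.rename(columns={
    "Postcode": "Postalcode",
    "Neighbourhood": "Neighborhood",
})

print(list(renamed.columns))
```

`rename` returns a new DataFrame and leaves `raw` untouched, so the result has to be assigned back if it is meant to replace the original.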


Next we drop all the rows where Borough is 'Not assigned'.

In [6]:
# keep only rows with an assigned borough, then renumber the index from 0
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


If a row has a Neighborhood of 'Not assigned', we replace the Neighborhood value with that row's Borough value.

In [8]:
df[df['Neighborhood'] == 'Not assigned']  # let's see which neighborhoods are not assigned

Unnamed: 0,Postalcode,Borough,Neighborhood
6,M9A,Queen's Park,Not assigned


In [9]:
# if the neighborhood is not assigned, replace it with the borough
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

In [10]:
df[df['Postalcode'] == 'M9A']  # check that it worked

Unnamed: 0,Postalcode,Borough,Neighborhood
6,M9A,Queen's Park,Queen's Park
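The same replacement can also be expressed without NumPy. As a small sketch, assuming a toy frame (`toy`) built from two of the rows shown above, `Series.mask` swaps in the borough wherever the condition holds and keeps the original value elsewhere:

```python
import pandas as pd

# Toy frame for illustration only; the rows mirror values shown above.
toy = pd.DataFrame({
    "Postalcode": ["M3A", "M9A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where the condition is True, take the value from Borough instead.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)

print(toy)
```

Either form works here; `mask` keeps the result a pandas Series aligned on the index, while `np.where` returns a plain array that pandas then assigns back by position.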


More than one neighborhood can exist in one postal code area, so many rows share the same Postalcode but have different Neighborhoods. We are going to combine these rows into one row per postal code, with the neighborhoods separated by commas.

In [11]:
# group by postal code and borough, joining the neighborhood names with commas
df = df.groupby(['Postalcode','Borough'])['Neighborhood'].apply(','.join).reset_index()
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
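To make the effect of the grouping easier to see, here is a minimal sketch on a three-row toy frame (`toy`, built from rows shown earlier in this notebook). It uses `agg` instead of `apply`, which is an assumption of this sketch and not a change to the cell above; both produce one joined string per group here:

```python
import pandas as pd

# Toy frame for illustration: M6A appears twice with different neighborhoods.
toy = pd.DataFrame({
    "Postalcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One row per (Postalcode, Borough); neighborhoods are joined with commas.
merged = (
    toy.groupby(["Postalcode", "Borough"])["Neighborhood"]
       .agg(",".join)
       .reset_index()
)

print(merged)
```

Note that `groupby` sorts the group keys by default, which is why the real output above starts at M1B rather than in the order the rows were scraped.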


In [12]:
df.shape

(103, 3)
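Beyond checking the shape, a few quick consistency checks can confirm that the cleaning steps did what they were meant to. The helper below, `summarize_cleaned`, is a hypothetical name introduced only for this sketch, and it is run on a toy frame rather than the scraped data:

```python
import pandas as pd

def summarize_cleaned(frame: pd.DataFrame) -> dict:
    """Return basic consistency checks for a cleaned postal-code table."""
    return {
        "rows": len(frame),
        # After grouping, each postal code should appear exactly once.
        "unique_postalcodes": bool(frame["Postalcode"].is_unique),
        # Both counts should be zero after the cleaning steps.
        "unassigned_boroughs": int((frame["Borough"] == "Not assigned").sum()),
        "unassigned_neighborhoods": int((frame["Neighborhood"] == "Not assigned").sum()),
    }

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

report = summarize_cleaned(toy)
print(report)
```

Calling `summarize_cleaned(df)` on the notebook's DataFrame would report the same four values for the real table.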