# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Converting wiki page to DataFrame

In [1]:
import pandas as pd
import numpy as np
import requests

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'  # Wikipedia page listing the postal codes
df_list = pd.read_html(url)  # read_html returns a list with one DataFrame per HTML table on the page
df = df_list[0]  # the postal code table is the first one

In [3]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


The table is now loaded into a DataFrame.

In [4]:
df.shape

(287, 3)

In [5]:
df.Borough.value_counts()

Not assigned        77
Etobicoke           44
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Drop the rows where Borough == 'Not assigned'.

In [6]:
df.drop(df[df['Borough']=="Not assigned"].index, axis=0, inplace=True)
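
The same filter can also be written as a boolean mask, which returns a new frame instead of modifying `df` in place. This is a minimal sketch on a toy frame: the rows are copied from the `df.head()` output above, and the names `toy` and `assigned` are illustrative only.

```python
import pandas as pd

# Toy frame built from the first five rows shown earlier (illustrative only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

# Keep only the rows with an assigned Borough; original index labels are preserved.
assigned = toy[toy["Borough"] != "Not assigned"]
print(assigned.shape)
```

Because the index labels are preserved, the filtered frame starts at label 2, just as in the notebook output after the drop.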

In [7]:
df.shape

(210, 3)

In [8]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [9]:
df.groupby('Borough').describe()

Unnamed: 0_level_0,Postcode,Postcode,Postcode,Postcode,Neighbourhood,Neighbourhood,Neighbourhood,Neighbourhood
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq
Borough,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Central Toronto,17,9,M4V,5,17,17,North Midtown,1
Downtown Toronto,37,19,M5V,7,37,36,St. James Town,2
East Toronto,7,5,M4K,2,7,7,The Beaches West,1
East York,6,5,M4B,2,6,6,Woodbine Heights,1
Etobicoke,44,11,M8Y,8,44,44,The Queensway West,1
Mississauga,1,1,M7R,1,1,1,Canada Post Gateway Processing Centre,1
North York,38,24,M6L,3,38,38,Hillcrest Village,1
Queen's Park,1,1,M9A,1,1,1,Not assigned,1
Scarborough,37,17,M1V,4,37,37,Port Union,1
West Toronto,13,6,M6K,3,13,13,Dovercourt Village,1


In [10]:
df.groupby('Postcode').describe()

Unnamed: 0_level_0,Borough,Borough,Borough,Borough,Neighbourhood,Neighbourhood,Neighbourhood,Neighbourhood
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq
Postcode,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
M1B,2,1,Scarborough,2,2,2,Malvern,1
M1C,3,1,Scarborough,3,3,3,Rouge Hill,1
M1E,3,1,Scarborough,3,3,3,West Hill,1
M1G,1,1,Scarborough,1,1,1,Woburn,1
M1H,1,1,Scarborough,1,1,1,Cedarbrae,1
M1J,1,1,Scarborough,1,1,1,Scarborough Village,1
M1K,3,1,Scarborough,3,3,3,Kennedy Park,1
M1L,3,1,Scarborough,3,3,3,Golden Mile,1
M1M,3,1,Scarborough,3,3,3,Scarborough Village West,1
M1N,2,1,Scarborough,2,2,2,Cliffside West,1


Check how many neighbourhoods are listed for Postcode M5A.

In [11]:
df.groupby('Postcode')['Neighbourhood'].unique()['M5A']

array(['Harbourfront'], dtype=object)

Only one neighbourhood is listed, although the assignment guidelines say there should be two.  
Check whether Neighbourhood == 'Regent Park' appears anywhere in the dataset.

In [12]:
df[df['Neighbourhood']=='Regent Park']

Unnamed: 0,Postcode,Borough,Neighbourhood


Even though the assignment guidelines state that "M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park.", the downloaded dataset contains no neighbourhood called 'Regent Park'. Presumably the Wikipedia page has been updated since the guidelines were written.
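
A check like this can be done for all postcodes at once by counting distinct neighbourhoods per postcode. The sketch below assumes a toy frame built from the rows shown in the `df.head()` output after the drop; the names `toy`, `counts`, and `has_regent_park` are illustrative.

```python
import pandas as pd

# Toy frame copied from the head() output after dropping unassigned rows.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Number of distinct neighbourhoods listed under each postcode.
counts = toy.groupby("Postcode")["Neighbourhood"].nunique()

# True if any row carries the neighbourhood name in question.
has_regent_park = toy["Neighbourhood"].eq("Regent Park").any()

print(counts)
print(has_regent_park)
```

On the full `df`, the same two expressions would show which postcodes are listed more than once and confirm whether a given neighbourhood name is present at all.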

In [13]:
s1 = df.groupby('Postcode')['Neighbourhood'].unique() # array of unique Neighbourhood values for each Postcode
s1.head()

Postcode
M1B                            [Rouge, Malvern]
M1C    [Highland Creek, Rouge Hill, Port Union]
M1E         [Guildwood, Morningside, West Hill]
M1G                                    [Woburn]
M1H                                 [Cedarbrae]
Name: Neighbourhood, dtype: object

In [14]:
s2 = df.groupby('Postcode')['Borough'].unique() # array of unique Borough values for each Postcode
s2.head()

Postcode
M1B    [Scarborough]
M1C    [Scarborough]
M1E    [Scarborough]
M1G    [Scarborough]
M1H    [Scarborough]
Name: Borough, dtype: object

In [15]:
df_combined = pd.concat([s2, s1], axis=1)

In [16]:
df_combined.reset_index(inplace=True) #reset_index will remove Postcode from index and make it a column

In [17]:
df_combined.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,[Scarborough],"[Rouge, Malvern]"
1,M1C,[Scarborough],"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,[Scarborough],"[Guildwood, Morningside, West Hill]"
3,M1G,[Scarborough],[Woburn]
4,M1H,[Scarborough],[Cedarbrae]


Borough and Neighbourhood are now combined for each unique Postcode.  
However, the values are still arrays, so they display with brackets that need to be removed.

In [19]:
#','.join(map(str,[10,"test",10.5]))

In [20]:
#df_combined.applymap(lambda x: ','.join(map(str,x)) if np.where(df_combined.values==x)[1] in [1,2] else x)

Apply the lambda function below to 'Borough' and 'Neighbourhood' to join each array into a single comma-separated string.

In [21]:
df_combined['Borough']=df_combined['Borough'].apply(lambda x: ','.join(map(str,x)))
df_combined['Neighbourhood']=df_combined['Neighbourhood'].apply(lambda x: ','.join(map(str,x)))
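
As an alternative, the grouping and the string join can be done in one step, so the intermediate arrays never appear. This is a sketch on a toy frame, not the notebook's method: it assumes each postcode belongs to exactly one borough (as the `s2` output suggests for the rows shown), and the names `toy` and `combined` are illustrative.

```python
import pandas as pd

# Toy frame with neighbourhood names taken from the s1 output above.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postcode: take the first Borough (assumed unique per postcode)
# and join all Neighbourhood names into one comma-separated string.
combined = (
    toy.groupby("Postcode", as_index=False)
       .agg({"Borough": "first", "Neighbourhood": ",".join})
)
print(combined)
```

Passing `as_index=False` keeps `Postcode` as a regular column, so no `reset_index` call is needed afterwards.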

In [22]:
df_combined.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."
102,M9W,Etobicoke,Northwest


In [23]:
df_combined.shape

(103, 3)