# This notebook is for segmenting and clustering neighborhoods in the city of Toronto, Canada.

In [1]:
import pandas as pd
import numpy as np
import json
import requests
from sklearn.cluster import KMeans

First, let us download the data table.

In [2]:
toronto_neigh = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

print('The data type is {}.'.format(type(toronto_neigh)))

The data type is <class 'list'>.


`pd.read_html` returns a list of DataFrames, one per table found on the page, so we take the first element to get the postal code table.

In [3]:
df_toronto = toronto_neigh[0]
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
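Taking element `[0]` relies on the postal code table being the first one on the page. As a minimal sketch of a more defensive selection, the hypothetical helper `pick_table` below (not part of this notebook) returns the first DataFrame that contains the expected column names. The `tables` list is toy data standing in for the result of `pd.read_html`.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")


# Toy stand-in for the list returned by pd.read_html (not the real page).
tables = [
    pd.DataFrame({"Key": ["a"], "Value": [1]}),
    pd.DataFrame(
        {
            "Postcode": ["M3A"],
            "Borough": ["North York"],
            "Neighbourhood": ["Parkwoods"],
        }
    ),
]

picked = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(picked)
```

Selecting by column names rather than by position should keep working if the page gains or reorders tables, as long as the column headers stay the same.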


The raw data is now in a DataFrame.

Let's look at the shape of the DataFrame before cleaning it.

In [4]:
df_toronto.shape

(287, 3)

Step #1: Keep only the rows whose borough is not 'Not assigned', then count how many rows each postcode has.

In [42]:
df_toronto_1 = df_toronto[df_toronto.Borough != 'Not assigned']
df_toronto_1.shape
df_toronto_1.Postcode.value_counts()

M8Y    8
M9V    8
M5V    7
M9B    5
M8Z    5
M4V    5
M9R    4
M1V    4
M6M    4
M9C    4
M5R    3
M5T    3
M8X    3
M5H    3
M3H    3
M1E    3
M6K    3
M2J    3
M1M    3
M5J    3
M6L    3
M1L    3
M1K    3
M1P    3
M8V    3
M1C    3
M1T    3
M5S    2
M8W    2
M3K    2
      ..
M6B    1
M7Y    1
M2H    1
M4S    1
M3L    1
M4A    1
M6G    1
M6E    1
M5W    1
M4J    1
M5G    1
M2R    1
M1G    1
M5A    1
M7A    1
M9N    1
M4M    1
M2P    1
M2K    1
M9P    1
M1W    1
M2N    1
M4W    1
M4Y    1
M4G    1
M4P    1
M3A    1
M9L    1
M7R    1
M1H    1
Name: Postcode, Length: 103, dtype: int64
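The counts above show that some postcodes appear on several rows, one per neighbourhood. A minimal sketch of the same filter and count on toy data (the rows below are invented for illustration, not taken from the real table):

```python
import pandas as pd

# Toy rows that mimic the raw table: one unassigned postcode, one repeated postcode.
raw = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M5A", "M5A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront", "Regent Park"],
    }
)

# Boolean mask keeps only rows with an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"]

# Count rows per postcode and keep the postcodes that occur more than once.
counts = assigned["Postcode"].value_counts()
repeated = counts[counts > 1]
print(repeated)
```

Any postcode left in `repeated` has several neighbourhood rows that still need to be combined.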

Step #2: Merge rows that share the same postcode and borough, joining their neighbourhood names into one comma-separated string.

In [25]:
df_toronto_2 = df_toronto_1.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x)).reset_index()
df_toronto_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [27]:
df_toronto_2.shape

(103, 3)
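To see what the group-and-join in Step #2 does, here is a small sketch on toy data (invented rows, assumed to have the same three columns). It uses `.agg(", ".join)`, which should behave like the `apply` with a lambda used above.

```python
import pandas as pd

# Toy rows: postcode M5A appears twice with different neighbourhoods.
assigned = pd.DataFrame(
    {
        "Postcode": ["M5A", "M3A", "M5A"],
        "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Harbourfront", "Parkwoods", "Regent Park"],
    }
)

# Group by postcode and borough, then join each group's neighbourhoods into one string.
merged = (
    assigned.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

After the merge each postcode occupies exactly one row, which is why the row count drops to the number of distinct postcodes.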

Step #3: For rows whose neighbourhood is 'Not assigned', use the borough name as the neighbourhood name.

In [29]:
df_toronto_2.loc[(df_toronto_2['Neighbourhood'] == 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood
93,M9A,Queen's Park,Not assigned


In [30]:
# Row 93 is the only match found above; set its neighbourhood to the borough name.
df_toronto_2.at[93, 'Neighbourhood'] = "Queen's Park"

In [33]:
df_toronto_2.loc[93]

Postcode                  M9A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 93, dtype: object
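The assignment above hard-codes row 93 and the borough name. As a sketch of a more general alternative, a boolean mask can copy the borough into every 'Not assigned' neighbourhood at once. The `frame` below is toy data, not the notebook's `df_toronto_2`.

```python
import pandas as pd

# Toy frame with one unassigned neighbourhood.
frame = pd.DataFrame(
    {
        "Postcode": ["M9A", "M3A"],
        "Borough": ["Queen's Park", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }
)

# Select the rows to fix, then copy the borough name into them.
mask = frame["Neighbourhood"] == "Not assigned"
frame.loc[mask, "Neighbourhood"] = frame.loc[mask, "Borough"]
print(frame)
```

This form does not depend on a specific index label, so it should still work if the source table changes and a different row, or more than one row, is unassigned.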

In [34]:
# Copy so that later changes to df_toronto_proc do not modify df_toronto_2.
df_toronto_proc = df_toronto_2.copy()

In [37]:
df_toronto_proc.shape

(103, 3)