# Segmenting and Clustering Neighborhoods in Toronto

## Step 1: Get the data from Wikipedia

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Load the postal-code table from Wikipedia into a DataFrame
#toronto_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
# Use a pinned older revision of the Wikipedia page instead of the live page above
toronto_data = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641')[0]

In [4]:
# Explore the data
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## Step 2: Clean the data
As the dataframe above shows, the table has three columns: <code>Postcode, Borough, and Neighbourhood</code>

The 'Borough' column contains the value 'Not assigned' in some rows. We ignore every row whose borough is 'Not assigned'.

In [5]:
# delete the rows with the values 'Not assigned' in the Borough column
df = toronto_data[toronto_data['Borough']!='Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


More than one neighborhood can exist in one postal code area. For example, in the table above, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [6]:
# Inspect the rows that share the postal code M5A
df[df['Postcode'] == 'M5A']

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park


In [7]:
# Collapse duplicate postal codes into one row, joining the neighbourhood names with commas
df_clean = df.groupby(['Postcode','Borough'],as_index=False, sort=False).agg(lambda x: ','.join(x))
df_clean.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned
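
The grouping step above can be tried on a small hand-made frame. This is only a sketch: the rows are illustrative, and the names `toy` and `combined` are assumptions that do not appear in the notebook.

```python
import pandas as pd

# Toy rows that mirror the structure of the scraped table (illustrative values).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Not assigned"],
})

# Group on the two key columns and join the neighbourhood names of each group.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(["Postcode", "Borough"], as_index=False, sort=False)["Neighbourhood"]
       .agg(",".join)
)
print(combined)
```

Selecting the `Neighbourhood` column before aggregating makes it explicit which column is being joined, which may be easier to read than applying a lambda to every non-key column.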


If a row has a borough but its neighborhood is 'Not assigned', the neighborhood is set to the borough name.

In [8]:
# Find the rows whose Neighbourhood is 'Not assigned'
df_clean[df_clean['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
4,M7A,Queen's Park,Not assigned


In [9]:
# Replace 'Not assigned' neighbourhoods with the borough name
df_clean.loc[(df_clean.Neighbourhood == "Not assigned"), 'Neighbourhood'] = df_clean['Borough']
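
The `.loc` assignment above relies on pandas aligning the right-hand `Borough` values to the selected rows by index. As an alternative sketch, `Series.mask` expresses the same replacement in one expression; the frame `toy` below is a hypothetical stand-in for `df_clean`.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_clean.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Series.mask replaces the values where the condition is True with the
# index-aligned values from the second argument (here, the Borough column).
toy["Neighbourhood"] = toy["Neighbourhood"].mask(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```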

In [10]:
df_clean.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [11]:
# print the shape of the dataframe
print(df_clean.shape)

(103, 3)
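
Besides the shape, it may be worth asserting the properties that the cleaning steps are meant to guarantee. The helper below is a minimal sketch; `check_clean` is a hypothetical name, not part of the notebook, and the toy frame stands in for `df_clean`.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned frame breaks an expected invariant."""
    # Each postal code should appear exactly once after grouping.
    assert frame["Postcode"].is_unique, "duplicate postal codes remain"
    # Rows with an unassigned borough should have been dropped.
    assert not (frame["Borough"] == "Not assigned").any(), "unassigned borough"
    # Unassigned neighbourhoods should have been replaced by the borough.
    assert not (frame["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood"

# Hypothetical stand-in for df_clean.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront,Regent Park", "Queen's Park"],
})
check_clean(toy)  # passes silently when all invariants hold
```

In the notebook itself the call would be `check_clean(df_clean)`.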


In [16]:
# Save to csv to use later
df_clean.to_csv('toronto_neighbourhoods.csv', index=False)
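
Because the combined neighbourhood names contain commas, it can be reassuring to confirm that the file reads back unchanged. The sketch below uses a temporary directory and a hypothetical file name, `toy_neighbourhoods.csv`, so that it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for df_clean, with a comma inside a field.
toy = pd.DataFrame({
    "Postcode": ["M5A"],
    "Borough": ["Downtown Toronto"],
    "Neighbourhood": ["Harbourfront,Regent Park"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_neighbourhoods.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# to_csv quotes fields that contain the delimiter, so the joined names survive.
print(reloaded.loc[0, "Neighbourhood"])
```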