# Segmenting and Clustering Neighborhoods in Toronto

This exercise prepares the data for segmenting and clustering neighborhoods in Toronto. We take the post code table from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), then filter and clean the data before segmentation.

In [10]:
# import libraries
import pandas as pd
import numpy as np

The code below reads the first table on the page into a dataframe.

In [11]:
# extract post code table into dataframe
pcode_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
pcode_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Data cleaning and manipulation steps:
- ignore rows whose borough is 'Not assigned'
- group neighbourhoods by post code, joining the names that share a code with commas
- set the neighbourhood to the borough name if the neighbourhood is 'Not assigned'

In [12]:
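# ignore rows with borough as 'Not assigned'
# (take a copy so the column assignments below act on an independent dataframe, not a slice)
pcode_df = pcode_df[pcode_df.Borough != 'Not assigned'].copy()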

# group neighbourhood by post code
pcode_df['Neighbourhood'] = pcode_df.groupby('Postcode')['Neighbourhood'].transform(lambda x: ', '.join(x))
pcode_df.drop_duplicates(inplace=True)

# set neighbourhood as borough if unassigned
pcode_df['Neighbourhood'] = np.where(pcode_df.Neighbourhood == 'Not assigned', pcode_df.Borough, pcode_df.Neighbourhood)
pcode_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront, Regent Park"
6,M6A,North York,"Lawrence Heights, Lawrence Manor"
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,"Rouge, Malvern"
14,M3B,North York,Don Mills North
15,M4B,East York,"Woodbine Gardens, Parkview Hill"
17,M5B,Downtown Toronto,"Ryerson, Garden District"
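
The three cleaning steps can also be sketched on a small hand-built table, which makes the effect of each step easier to inspect. This is only an illustration: the `sample` rows are hypothetical stand-ins for the Wikipedia table, and `groupby(...).agg(...)` is used here as an alternative to the `transform` plus `drop_duplicates` combination in the cell above.

```python
import pandas as pd

# hypothetical sample rows mimicking the layout of the post code table
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. ignore rows whose borough is unassigned
cleaned = sample[sample['Borough'] != 'Not assigned']

# 2. collapse to one row per post code, joining the neighbourhood names
cleaned = (cleaned.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
                  .agg(', '.join)
                  .reset_index())

# 3. fall back to the borough name where the neighbourhood is unassigned
unassigned = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[unassigned, 'Neighbourhood'] = cleaned.loc[unassigned, 'Borough']

print(cleaned)
```

Aggregating directly yields one row per post code with a fresh 0-based index, whereas the `transform` approach keeps the original row labels.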


In [13]:
pcode_df.shape

(103, 3)
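
Beyond the shape, it can help to confirm that the cleaning left the table in the expected state. The helper below is a minimal sketch: `check_cleaned` is a hypothetical name, not part of pandas, and the `demo` dataframe is a made-up stand-in for `pcode_df`.

```python
import pandas as pd

def check_cleaned(df):
    """Raise AssertionError if the cleaned table breaks an expected invariant."""
    # each post code should now appear exactly once
    assert df['Postcode'].is_unique, 'duplicate post codes remain'
    # no placeholder values should be left in either text column
    assert not (df['Borough'] == 'Not assigned').any(), 'unassigned boroughs remain'
    assert not (df['Neighbourhood'] == 'Not assigned').any(), 'unassigned neighbourhoods remain'
    return df.shape

# hypothetical two-row table standing in for pcode_df
demo = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront, Regent Park', "Queen's Park"],
})

print(check_cleaned(demo))
```

In the notebook, the same call would be `check_cleaned(pcode_df)`.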

The CSV file at http://cocl.us/Geospatial_data provides the latitude and longitude of each post code

In [14]:
# read latitudes and longitudes into a dataframe
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the latitude and longitude data into the prepared dataframe of post codes, boroughs and neighbourhoods.

In [39]:
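# inner join on post code; the key column is named differently in the two dataframes
merged_df = pd.merge(pcode_df, geo_df, how='inner', left_on='Postcode', right_on='Postal Code')
merged_df.drop(columns=['Postal Code'], inplace=True)
# rename the post code column to match the naming used in the geo data
merged_df.rename(columns={'Postcode': 'Postal Code'}, inplace=True)
merged_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude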
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [40]:
merged_df.shape

(103, 5)
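
The row count is unchanged after the merge, so every post code found coordinates. Because an inner join silently drops unmatched rows, one way to make that check explicit is a left join with `indicator=True`. The sketch below uses small made-up dataframes (`pcodes`, `coords`); the code `M9Z` and the borough `Nowhere` are hypothetical and included only to show what an unmatched row looks like.

```python
import pandas as pd

pcodes = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],   # 'M9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

checked = pd.merge(
    pcodes, coords,
    how='left',                 # keep every post code, matched or not
    left_on='Postcode', right_on='Postal Code',
    indicator=True,             # adds a '_merge' column: 'both' or 'left_only'
    validate='one_to_one',      # raises MergeError if a key is duplicated on either side
)

# post codes that found no coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that the inner join dropped nothing.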