### Toronto Neighborhood Clustering

For Coursera Capstone week 3 Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

### Data Wrangling and Cleaning

The data for clustering Toronto's neighborhoods is sourced from Wikipedia. The dataset is indexed by postal code and needs to be scrubbed of unassigned zones, i.e. postal codes listed with no borough (for example, a large company like Amazon may secure a unique postal code for large-volume shipping).

In [1]:
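import io

import pandas as pd
import requests

wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Download the page and parse its first HTML table into a dataframe.
# Wrapping the text in StringIO avoids the deprecation warning that
# recent pandas versions raise for literal HTML input.
html_content = requests.get(wiki_url).text
df = pd.read_html(io.StringIO(html_content))[0]
df.head()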

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [2]:
# Rename columns
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
# Drop rows whose borough is 'Not assigned'
unassigned_borough_indices = df[df['Borough'] == 'Not assigned'].index
df.drop(unassigned_borough_indices, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
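
The drop above can also be written as a boolean filter that returns a new frame instead of mutating `df` in place. A minimal sketch, assuming a hand-made `sample` frame (a hypothetical name) whose rows are copied from the first rows of the table shown earlier:

```python
import pandas as pd

# Illustrative sample mirroring the first rows of the scraped table
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose borough is assigned; original index labels are preserved
assigned = sample[sample['Borough'] != 'Not assigned']
print(assigned)
```

As with `drop`, the surviving rows keep their original index labels, which is why the output above starts at index 2.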


In [4]:
# Examine dataframe for 'Not assigned' neighborhoods
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
7,M7A,Queen's Park,Not assigned


In [5]:
# Name 'Not assigned' neighborhoods after borough name
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']

# Expect an empty dataframe after the conditional replacement
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
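
The replacement above works because pandas aligns the right-hand `Borough` values with the masked rows by index label. A roughly equivalent formulation, sketched here on a two-row sample (the `sample` frame is illustrative, with rows taken from the outputs above), uses `Series.mask`:

```python
import pandas as pd

# Two rows from the tables above, keeping their original index labels
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
}, index=[4, 7])

# Series.mask replaces values where the condition is True, taking the
# replacement from the row of the other Series with the same index label
sample['Neighborhood'] = sample['Neighborhood'].mask(
    sample['Neighborhood'] == 'Not assigned', sample['Borough']
)
print(sample)
```

Rows that already have a neighborhood name are left untouched; only the `Not assigned` entry takes its borough's name.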


In [6]:
# Merge neighborhoods that share a postal code into one comma-separated string
formatted_df = df.groupby(['PostalCode', 'Borough'], as_index=False).agg({'Neighborhood': ', '.join})
formatted_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


In [7]:
formatted_df.shape

(103, 3)
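
The grouping step can be checked on a small sample before trusting the full result. The sketch below assumes a three-row `sample` frame (illustrative, with rows taken from the outputs above) and verifies that each postal code ends up on exactly one row:

```python
import pandas as pd

# M6A appears twice, once per neighborhood, as in the scraped table
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor'],
})

# Join the neighborhoods of each (postal code, borough) pair into one string
merged = (
    sample.groupby(['PostalCode', 'Borough'], as_index=False)
          .agg({'Neighborhood': ', '.join})
)

# After merging, each postal code should appear exactly once
assert merged['PostalCode'].is_unique
print(merged)
```

The same `is_unique` check can be run on `formatted_df['PostalCode']` to confirm that its row count equals the number of distinct postal codes.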