# Segmenting and Clustering Neighborhoods in Toronto, Canada

First, let's read the data from the Wikipedia page and take a look at it.

In [85]:
import pandas as pd
# there are 3 tables in this link, only use the first one
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
# rename column to match what's on Coursera
df.rename(columns = {'Postal code': 'PostalCode'}, inplace = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Next, drop the rows where the Borough is __Not assigned__.

In [86]:
ind = df[df['Borough'] == 'Not assigned'].index
df.drop(ind, inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


Now let's check whether any PostalCode repeats, so that we can merge rows if necessary.

In [87]:
df['PostalCode'].value_counts().unique()

array([1])

Tabulating how many times each __PostalCode__ appears and then taking the unique counts returns only __1__. Every PostalCode therefore appears exactly once, so there are no repeated rows to merge.
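Had a PostalCode appeared more than once, the repeated rows could be collapsed into one row per code. The following is a minimal sketch on made-up data, not the Wikipedia table; the names `toy`, `has_repeats`, and `merged` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with one repeated postal code (not from the real table)
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# True if any postal code occurs more than once
has_repeats = toy['PostalCode'].duplicated().any()

# One row per (PostalCode, Borough); neighborhoods joined with commas
merged = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(merged)
```

Passing `sort=False` keeps the groups in order of first appearance, so the merged frame follows the original row order.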

The instructions also say that neighborhoods should be separated by commas, so let's replace the slashes with commas.

In [88]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False) # replace each " /" with ","
df['Neighborhood'] = df['Neighborhood'].str.strip() # remove stray white space; the result must be assigned back to take effect
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
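If the spacing around the slashes were inconsistent, a single regular expression could normalize the separator in one pass. This is a sketch on made-up strings, assuming a pandas version that accepts the `regex` argument of `Series.str.replace`; the names `raw` and `cleaned` are illustrative.

```python
import pandas as pd

# Hypothetical inputs with uneven spacing (not from the real table)
raw = pd.Series(['Regent Park / Harbourfront', ' Parkwoods ', 'A/B'])

# Replace a slash and any surrounding white space with ", ", then trim the ends
cleaned = raw.str.replace(r'\s*/\s*', ', ', regex=True).str.strip()
print(cleaned.tolist())
```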


The next thing to check is whether any row has a Borough but a __Not assigned__ Neighborhood. If so, we will use the Borough name as the Neighborhood name.

In [89]:
df[df['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood


No Neighborhood has the value __Not assigned__, so there is nothing to do.
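For reference, the fallback could be sketched as follows if such rows did exist. The data here is made up, and also treating missing values (NaN) as unassigned is an assumption, not something the instructions state.

```python
import pandas as pd

# Hypothetical toy data (not from the real table)
toy = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A', 'M9A'],
    'Borough': ["Queen's Park", 'North York', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Parkwoods', None],
})

# Rows whose neighborhood is the literal 'Not assigned' or is missing (assumption)
unassigned = toy['Neighborhood'].isna() | (toy['Neighborhood'] == 'Not assigned')

# Use the Borough name as the Neighborhood name for those rows
toy.loc[unassigned, 'Neighborhood'] = toy.loc[unassigned, 'Borough']
print(toy)
```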

In [90]:
df.shape

(103, 3)