# Segmenting and Clustering Neighborhoods in Toronto: Part 1

We will work through the following requirements:

1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
2. Only process the cells that have an assigned borough. Ignore cells whose borough is Not assigned.
3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.
4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
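The requirements above can be sketched on a small hand-made table. This is a minimal illustration, not the notebook's actual pipeline: the rows in `raw` are made-up sample data, and `clean_postal_codes` is a hypothetical helper name introduced only for this example.

```python
import pandas as pd

# Hypothetical toy rows in the layout the requirements describe (not scraped data)
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

def clean_postal_codes(frame):
    # Drop rows that have no assigned borough
    out = frame[frame["Borough"] != "Not assigned"].copy()
    # A "Not assigned" neighborhood falls back to the borough name
    out["Neighborhood"] = out["Neighborhood"].mask(
        out["Neighborhood"] == "Not assigned", out["Borough"]
    )
    # Collapse repeated postal codes into one row, joining neighborhoods with a comma
    return out.groupby(["PostalCode", "Borough"], as_index=False, sort=False)[
        "Neighborhood"
    ].agg(", ".join)

cleaned = clean_postal_codes(raw)
print(cleaned)
```

In this sketch the two M5A rows collapse into a single row, and the M7A row takes its borough name as its neighborhood.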

In [1]:
import pandas as pd
import numpy as np
import requests

In [2]:
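from io import StringIO

canada_data_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
candada_data_page = requests.get(canada_data_url)
candada_data_page.raise_for_status()  # stop early if the download failed

# Parse the first table on the page; read_html expects a file-like object for literal HTML
candada_data = pd.read_html(StringIO(candada_data_page.text), header=0)[0]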

In [3]:
# Keep only the rows that have an assigned borough
df = candada_data[candada_data.Borough != 'Not assigned'].reset_index(drop=True)
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [4]:
df.groupby(['Postal Code']).first()

| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough | Malvern, Rouge |
| M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| M1E | Scarborough | Guildwood, Morningside, West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| ... | ... | ... |
| M9N | York | Weston |
| M9P | Etobicoke | Westmount |
| M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


In [5]:
len(df['Postal Code'].unique())

103

The Wikipedia table already merges the neighbourhoods for each postal code (for example, M5A lists "Regent Park, Harbourfront" in a single row), so we do not need to handle this in our pipeline.
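One way to confirm that no postal code is repeated is to check for duplicates directly. The sketch below uses a small hand-made `sample` frame (hypothetical data in the same column layout) rather than the scraped table; the same two checks could be run against `df`.

```python
import pandas as pd

# Hypothetical sample with the same columns as the scraped table (not real output)
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# True when every postal code occurs exactly once
codes_unique = sample["Postal Code"].is_unique

# All rows whose postal code also appears in another row (empty if none repeat)
repeated = sample[sample["Postal Code"].duplicated(keep=False)]

print(codes_unique, len(repeated))
```

If `repeated` were not empty, the rows would still need to be combined per postal code before continuing.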

In [6]:
df[df['Borough'] == 'Not assigned']

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|


If a row has a borough but a Not assigned neighbourhood, the code below sets the neighbourhood to the borough name.

In [7]:
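# Where the neighbourhood is "Not assigned", use the borough name from the same row
df["Neighbourhood"] = df["Neighbourhood"].mask(df["Neighbourhood"] == "Not assigned", df["Borough"])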

Finally, check the shape of our data:

In [8]:
df.shape

(103, 3)