## Notebook created to explore and cluster the neighborhoods in Toronto.

# 1.1 Scraping the web page into a pandas DataFrame

About the data: the source is the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

1. It is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario.
2. The table is scraped from the page's HTML with `pd.read_html`, which returns a list of DataFrames; the first one is the postal code table.

In [40]:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df=pd.read_html(url, header=0)[0]
df

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Downtown Toronto,Queen's Park


In [41]:
# Printing the number of rows of the original dataframe
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
Postcode         287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


(287, 3)

# 1.2 Creating cleaned dataframe

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
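
The filtering rule above can also be written as a boolean mask. The following is a minimal sketch on a few rows copied from the scraped table; the names `sample` and `assigned` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows copied from the scraped table; `sample` is a hypothetical name
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Keep only rows whose borough is assigned.
# reset_index(drop=True) renumbers the rows without adding an 'index' column.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched, which makes it easier to re-run a cell without side effects.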

In [8]:
# Printing the columns of original dataframe
df.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [10]:
# Dropping the rows whose borough is "Not assigned"
df.drop(df[df['Borough']=="Not assigned"].index,axis=0, inplace=True)

In [19]:
# Reset the index after dropping rows; the old index is kept as an 'index' column
df1 = df.reset_index()
df1.head()

,index,Postcode,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M6A,North York,Lawrence Heights
4,6,M6A,North York,Lawrence Manor


In [20]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 4 columns):
index            210 non-null int64
Postcode         210 non-null object
Borough          210 non-null object
Neighbourhood    210 non-null object
dtypes: int64(1), object(3)
memory usage: 6.6+ KB


In [42]:
# Printing the number of rows
df1.shape

(210, 4)

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
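
As a small illustration of this combining step, the sketch below groups a few toy rows by postcode and joins the neighborhood names in their original row order. It is one possible approach, not necessarily the one used in this notebook; the names `rows` and `combined` are hypothetical.

```python
import pandas as pd

# Toy rows; M5A with Harbourfront and Regent Park follows the example in the text
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# One row per postcode: take the first borough and join the neighbourhoods.
# Joining the Series directly keeps the original row order.
combined = rows.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": lambda s: ",".join(s)}
)
print(combined)
```

Joining the values directly, rather than through `set`, keeps the neighborhood order stable between runs, at the cost of not removing duplicate names.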

In [25]:
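# Using GroupBy to combine rows that share a postcode into one row.
# Only the string columns are selected, since the integer 'index' column cannot be joined.
df2=df1.groupby("Postcode")[["Borough","Neighbourhood"]].agg(lambda x:','.join(set(x)))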

In [43]:
df2.head()

Postcode,Borough,Neighbourhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
M1E,Scarborough,"Guildwood,West Hill,Morningside"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
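
A minimal sketch of this rule on toy data, using `Series.where` to keep assigned neighborhoods and fall back to the borough otherwise. The name `cells` is illustrative only.

```python
import pandas as pd

# Toy rows; M7A has a borough but no assigned neighbourhood
cells = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# where() keeps values where the condition is True and
# substitutes the borough where it is False
has_name = cells["Neighbourhood"] != "Not assigned"
cells["Neighbourhood"] = cells["Neighbourhood"].where(has_name, cells["Borough"])
print(cells)
```

After this step the M7A row carries Queen's Park in both the Borough and the Neighbourhood columns, and rows with an assigned neighborhood are unchanged.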

In [28]:
# Assigning the neighborhood same as the borough if a cell has a borough but a Not assigned neighborhood
df2.loc[df2['Neighbourhood']=="Not assigned",'Neighbourhood']=df2.loc[df2['Neighbourhood']=="Not assigned",'Borough']

In [44]:
# Resetting the index so that Postcode becomes a regular column again
df3 = df2.reset_index()

In [45]:
df3.head()

,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Rouge Hill,Port Union,Highland Creek"
2,M1E,Scarborough,"Guildwood,West Hill,Morningside"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [46]:
# Print the number of rows in the cleaned dataframe
df3.shape

(103, 3)