# Segmenting and Clustering Neighborhoods in Toronto

#### Install and import all the dependencies that will be needed

In [1]:
!pip install lxml
!pip install pandas
import pandas as pd
import numpy as np



#### Download and transform the data into a *pandas* dataframe
The data is scraped from the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
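# read_html returns one dataframe per table on the page; the postal code table is the first
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
# Use a consistent column name for the postal code
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)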
print("\n", "The dataframe's shape: ", df.shape)
df.head()


 The dataframe's shape:  (287, 3)


,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
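
By default, `rename` ignores labels that are not present, so if the scraped table's header ever differs from `Postcode`, the rename step does nothing and later cells fail on a missing `PostalCode` column. A minimal sketch of how `errors='raise'` surfaces that case, using a hypothetical one-row frame (the `sample` data and its `Postal Code` header are assumptions for illustration, not the scraped table):

```python
import pandas as pd

# Hypothetical frame whose header differs from the expected 'Postcode'.
sample = pd.DataFrame({'Postal Code': ['M3A'], 'Borough': ['North York']})

# Default behaviour: the missing label is ignored and the columns are unchanged.
unchanged = sample.rename(columns={'Postcode': 'PostalCode'})

# With errors='raise', a label that is not found raises KeyError instead.
try:
    sample.rename(columns={'Postcode': 'PostalCode'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns), raised)
```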


#### Cleaning the data
We will process only the rows that have an assigned borough. In other words, we drop every row whose borough is <b>Not assigned</b>.

In [3]:
# Collect the index labels of the rows without an assigned borough, then drop them
del_row_index = df[df['Borough'] == 'Not assigned'].index
df.drop(del_row_index, inplace=True)
print("\n", "The dataframe's shape: ", df.shape)
df.head(10)


 The dataframe's shape:  (210, 3)


,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Not assigned
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
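
The same filter can also be written as a boolean mask, which avoids collecting the index labels first. A minimal sketch on a hypothetical three-row sample (the `sample` frame is made up for illustration and only mimics the scraped columns):

```python
import pandas as pd

# Hypothetical sample mimicking the scraped columns (not the real data).
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned.
assigned = sample[sample['Borough'] != 'Not assigned']
print(assigned.shape)
```

Like `drop`, boolean indexing keeps the original index labels, which is why the index in the output above has gaps.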


To make our analysis more accurate, if a row has a borough but a <b>Not assigned</b> neighborhood, we <b>set the neighborhood to the name of the borough</b>.

In [4]:
# Find the rows with a missing neighborhood and copy the borough name into each of them
na_row_index = df[df['Neighborhood'] == 'Not assigned'].index
for index in na_row_index:
    df.at[index, 'Neighborhood'] = df.at[index, 'Borough']
print("\n", "The dataframe's shape: ", df.shape)
df.head(10)


 The dataframe's shape:  (210, 3)


,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
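
The row-by-row loop works, but the same assignment can be done in one vectorized step with `.loc`. A minimal sketch on a hypothetical two-row sample (the `sample` frame and the `missing` name are introduced here for illustration only):

```python
import pandas as pd

# Hypothetical sample: one complete row and one with a missing neighborhood.
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ["Queen's Park", 'Not assigned'],
})

# Boolean mask of the rows that need a neighborhood.
missing = sample['Neighborhood'] == 'Not assigned'

# Copy the borough into the neighborhood for those rows only.
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```

The shape of the frame does not change, since only cell values are overwritten.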


To simplify our dataframe by reducing the number of rows, we will <b>merge neighborhoods that have the same postal code</b> into a single row, with their names separated by commas.

In [9]:
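# Keep the first borough of each postal code and join its neighborhood names with commas
aggregation_functions = {'Borough': 'first', 'Neighborhood': ', '.join}
df_cleaned = df.groupby('PostalCode').agg(aggregation_functions)
# Turn PostalCode back into a regular column
df_cleaned.reset_index(inplace=True)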
print("\n", "The final dataframe's shape: ", df_cleaned.shape)
df_cleaned


 The final dataframe's shape:  (103, 3)


,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
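
To see what the aggregation does to a single postal code, here is a minimal sketch on a hypothetical three-row sample (the `sample` and `merged` names are illustrative). It also checks that every postal code appears exactly once after merging:

```python
import pandas as pd

# Hypothetical sample: two neighborhoods share M1B, one belongs to M1G.
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# as_index=False keeps PostalCode as a column, so no reset_index is needed.
merged = (
    sample.groupby('PostalCode', as_index=False)
          .agg({'Borough': 'first', 'Neighborhood': ', '.join})
)

# After merging, each postal code should occur only once.
assert merged['PostalCode'].is_unique
print(merged)
```

Taking the `first` borough assumes that all rows of a postal code share the same borough; if that might not hold, it is worth checking before aggregating.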
