# Segmenting and Clustering Neighborhoods in Toronto Notebook

## First Assignment

In [137]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

## Data retrieval

Let's fetch the Wikipedia page https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050.  
Note that the function read_html always returns a list of DataFrame objects, one per table found on the page.  
We take the first DataFrame object, which contains the postcode information.

In [151]:
url_html='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'
df = pd.read_html(url_html)
df_postcodes=df[0]
print("imported dataframe has",df_postcodes['Postcode'].count(), "postcodes entries")
df_postcodes.head(10)

imported dataframe has 287 postcodes entries


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue
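
Taking `df[0]` relies on the postcode table being the first one on the page. A slightly more defensive approach is to pick the table by the columns it is expected to have. The sketch below assumes a hypothetical helper named `pick_table`, and uses made-up toy frames in place of the list that `read_html` would return:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required ones.

    Hypothetical helper for illustration; not part of pandas.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Toy stand-ins for the list of tables parsed from a page (made-up data).
tables = [
    pd.DataFrame({"Province": ["Ontario"]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postcodes = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postcodes.shape)
```

`read_html` also accepts a `match` argument that keeps only the tables containing a given text, which may be enough on its own for a page like this one.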


## Data cleaning

Let's clean the DataFrame. First, drop the rows whose borough is Not assigned.

In [157]:
df_postcodes = df_postcodes.loc[df_postcodes['Borough']!="Not assigned"]
df_postcodes.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
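
The filter above is a boolean mask: the comparison yields one True/False value per row, and `.loc` keeps the rows marked True. The kept rows retain their original index labels, which is why the index in the output starts at 2 and has gaps. A minimal sketch on a made-up three-row frame (the names `toy`, `mask` and `kept` are only for illustration):

```python
import pandas as pd

# Made-up miniature of the postcode table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
})

# One boolean per row: True where the borough is assigned.
mask = toy["Borough"] != "Not assigned"

# Keep the True rows; .copy() gives an independent frame to modify later.
kept = toy.loc[mask].copy()

print(mask.tolist())
print(kept.index.tolist())
```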


More than one neighborhood can exist in one postal code area. For example, in the table above M6A is listed twice and has two neighborhoods: Lawrence Heights and Lawrence Manor.
These two rows will be combined into one row, with the neighborhoods separated by a comma.
To do this we group with the groupby method on both 'Postcode' and 'Borough', so that both columns are kept as group keys. The 'apply' method then joins the neighbourhoods of each Postcode/Borough group into a single comma-separated string.

In [172]:
# join the neighbourhoods of each (Postcode, Borough) group with ', ';
# as_index is left at its default so the result is a Series with a MultiIndex
postcodes_grouped_serie = df_postcodes.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood'].apply(', '.join)
# convert the Series to a DataFrame and name its single column
df_postcodes_grouped = postcodes_grouped_serie.to_frame(name='Neighbourhood')
# move Postcode and Borough back into columns and get a numerical index
df_postcodes_grouped.reset_index(drop=False, inplace=True)
df_postcodes_grouped.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
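
The grouping step can be tried in isolation on the M6A and M7A rows shown earlier. This is only a sketch: it uses `agg`, a close relative of the `apply` call, and the names `toy` and `grouped` are illustrative.

```python
import pandas as pd

# Three rows taken from the table above.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

grouped = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)   # one joined string per (Postcode, Borough) group
    .reset_index()    # turn the group keys back into ordinary columns
)

print(grouped)
```

Passing `sort=False` keeps the groups in order of first appearance, which is why the combined table follows the row order of the Wikipedia page rather than alphabetical postcode order.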


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [177]:
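# where the neighbourhood is 'Not assigned', take the borough of the same row
df_postcodes_grouped['Neighbourhood'] = df_postcodes_grouped['Neighbourhood'].mask(df_postcodes_grouped['Neighbourhood'] == 'Not assigned', df_postcodes_grouped['Borough'])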
df_postcodes_grouped.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
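
None of the first ten rows has a Not assigned neighbourhood, so the output above does not show the rule in action. A small sketch on made-up data illustrates it; it uses `Series.mask`, which replaces the values where a condition is True with the value from the same row of another Series:

```python
import pandas as pd

# Made-up rows: the first one has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Borough": ["Etobicoke", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# True where the neighbourhood still needs a value.
missing = toy["Neighbourhood"] == "Not assigned"

# Row by row: take the borough where missing, keep the neighbourhood otherwise.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])

print(toy)
```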


Let's use the .shape attribute to print the number of rows and columns of the DataFrame.

In [176]:
df_postcodes_grouped.shape

(103, 3)
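
Beyond the shape, the cleaning rules themselves can be checked directly. The sketch below assumes a hypothetical helper named `check_clean` and runs it on a made-up two-row frame; the same call could be applied to the cleaned DataFrame.

```python
import pandas as pd

def check_clean(df):
    """Check the cleaning rules and return the shape as (rows, columns).

    Hypothetical helper for illustration; raises AssertionError on failure.
    """
    assert df["Postcode"].is_unique, "duplicate postcodes remain"
    assert not (df["Borough"] == "Not assigned").any(), "unassigned borough"
    assert not (df["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood"
    return df.shape

# Made-up cleaned frame with the same three columns.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M6A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Lawrence Heights, Lawrence Manor"],
})

shape = check_clean(toy)
print(shape)
```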