# **Neighborhood segmentation and clustering: get and clean data**

In this notebook we will explore Toronto's neighborhoods, using web scraping to parse data from Wikipedia.

First we import the pandas library, then use its read_html function to scrape the Wikipedia page and get Toronto's neighborhoods along with their postal codes.

In [2]:
# import pandas library
import pandas as pd
# read_html returns a list of DataFrames, one per HTML table on the page; take the first one
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

In [68]:
df.head()

,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


In [69]:
df.shape

(287, 3)

We notice that many postal codes have no borough assigned, so we will drop all rows in which the borough is "Not assigned".

In [72]:
# drop rows whose Borough is 'Not assigned'
df_1 = df[~df.Borough.str.contains("Not assigned", na=False)]
df_1.head()

,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor


In [73]:
df_1.shape # check the size of our dataframe

(210, 3)
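
The filtering step can be checked on a small hand-made frame. This is only a sketch: the rows are copied from the `df.head()` output above, and `sample` and `assigned` are illustrative names that are not part of the notebook. It also assumes the cell value is exactly "Not assigned", in which case a plain inequality can stand in for `str.contains`.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (rows copied from df.head() above)
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

# Keep only rows whose Borough is assigned (exact comparison, not substring match)
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned.shape)
```

Comparing `shape` before and after, as the notebook does, is a quick way to confirm how many rows were dropped.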

We notice that even when a borough is allocated to a specific postal code, some neighbourhoods are not assigned, so we will replace every unassigned neighbourhood with the name of the corresponding borough.

In [55]:
# replace unassigned neighbourhoods with the name of their borough
unassigned = df_1['Neighbourhood'] == 'Not assigned'
df_2 = df_1.assign(Neighbourhood=df_1['Neighbourhood'].mask(unassigned, df_1['Borough']))
df_2.shape # check the size of our dataframe

(210, 3)

Several postal codes appear in more than one row, once per neighbourhood. To avoid this repetition, we will group the rows by postal code and borough and join the neighbourhood names into a single comma-separated string.

In [74]:
# Group neighbourhoods by postal code and borough, starting from the cleaned df_2.
df_3=df_2.groupby(['Postcode','Borough'], as_index=False, sort=False).agg(lambda x:', '.join(x))
df_3.head(10)

,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [75]:
df_3.shape # check the size of our dataframe

(103, 3)
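
The replacement and grouping steps can be sketched end to end on a few rows. The sample below is hypothetical (it reuses postal codes seen in the output above), and the names `sample`, `filled` and `grouped` are illustrative only. It assumes the fallback is applied before grouping, so that no "Not assigned" string survives into the joined neighbourhood names.

```python
import pandas as pd

# Hypothetical sample covering the two cases handled in the notebook:
# a postal code with several neighbourhoods and one with none assigned
sample = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M7A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "Queen's Park",
                "Scarborough", "Scarborough"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Not assigned",
                      "Rouge", "Malvern"],
})

# Step 1: where the neighbourhood is unassigned, fall back to the borough name
unassigned = sample["Neighbourhood"] == "Not assigned"
filled = sample.assign(
    Neighbourhood=sample["Neighbourhood"].mask(unassigned, sample["Borough"])
)

# Step 2: one row per (Postcode, Borough), neighbourhood names joined with ", "
grouped = filled.groupby(["Postcode", "Borough"], as_index=False, sort=False).agg(
    {"Neighbourhood": ", ".join}
)
print(grouped)
```

Passing `sort=False` keeps the groups in order of first appearance, which is why the grouped table follows the row order of the source table.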

In [57]:
# export the dataframe as CSV file for the second part of the assignment
df_3.to_csv('Toronto.csv')
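
By default `to_csv` also writes the row index as an unnamed first column, so the file read back later will contain an extra `Unnamed: 0` column unless the reader passes `index_col=0` or the writer passes `index=False`. A minimal sketch of the difference, using in-memory buffers instead of a file; `sample` and the buffer names are illustrative and not part of the notebook:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for df_3
sample = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False writes only the three data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Either convention works for the second part of the assignment, as long as the code that reads the file matches the one used when writing it.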