# Segmenting & Clustering Neighborhoods in Toronto

### Introduction

This notebook walks through the steps to explore, segment, and cluster the neighborhoods in the city of Toronto. It fetches the raw HTML data from the web, scrapes it, and extracts the required details. Once the details are available, the data is cleaned and merged with the geographic coordinates of each postcode.

### 1. Import all required libraries

In [146]:
# Import the required libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

### 2. Get HTML data into dataframe

This step fetches the raw HTML from the website, parses it into a readable structure, and converts the postal code table to a dataframe.

In [149]:
# Initialize the path and get data
from io import StringIO

web_path = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_source = requests.get(web_path).text

#Use lxml parser
html_data = Soup(html_source,'lxml')

#Find the postal code table in the html source and parse it with pandas.
#str(table) is wrapped in StringIO because newer pandas versions deprecate passing raw HTML strings.
table = html_data.find('table', class_="wikitable sortable")
table_ls = pd.read_html(StringIO(str(table)), header=0)

#header=0 takes the column names (Postcode, Borough, Neighbourhood) from the first table row,
#so the parsed table can be used directly as the dataframe
table_df = table_ls[0]

table_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### 3. Pre-process the data

This step removes the rows whose Borough value is "Not assigned". Rows that share a postcode with another row are flagged in the _Is_Duplicate_ column, and the non-duplicate rows are moved to the final dataframe.

In [150]:
#Drop the rows that have Borough = "Not assigned"
table_df = table_df[table_df["Borough"] != "Not assigned"].reset_index(drop=True)

#Flag every row whose postcode appears more than once
table_df["Is_Duplicate"] = table_df.duplicated(subset = "Postcode",keep=False)

#Create dataframe with only the duplicate values.
#Postcode stays a regular column because the next step reads it by name.
table_df2 = table_df[table_df["Is_Duplicate"]].reset_index(drop=True)

#Move non-duplicates to final dataframe
neigh_df = table_df[~table_df["Is_Duplicate"]].reset_index(drop=True)
neigh_df.head()

| | Postcode | Borough | Neighbourhood | Is_Duplicate |
|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | False |
| 1 | M4A | North York | Victoria Village | False |
| 2 | M7A | Queen's Park | Not assigned | False |
| 3 | M9A | Etobicoke | Islington Avenue | False |
| 4 | M3B | North York | Don Mills North | False |
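
To see what the duplicate flag does, here is a minimal sketch on a hypothetical three-row dataframe (the rows are made up for illustration and are not taken from the scraped table). With `keep=False`, `duplicated` marks every row whose postcode occurs more than once, including the first occurrence.

```python
import pandas as pd

# Hypothetical sample rows; the values are illustrative only
sample_df = pd.DataFrame({
    "Postcode": ["M1X", "M1X", "M2Y"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighbourhood": ["Hood 1", "Hood 2", "Hood 3"],
})

# keep=False flags all rows that share a postcode, including the first occurrence
sample_df["Is_Duplicate"] = sample_df.duplicated(subset="Postcode", keep=False)

# Split into duplicated and unique postcodes, as in the step above
dup_df = sample_df[sample_df["Is_Duplicate"]].reset_index(drop=True)
unique_df = sample_df[~sample_df["Is_Duplicate"]].reset_index(drop=True)

print(sample_df["Is_Duplicate"].tolist())  # [True, True, False]
```

With the default `keep="first"`, only the second `M1X` row would be flagged, so the first neighbourhood of each shared postcode would end up among the non-duplicates.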


### 4. Process duplicates and combine values

This step loops through the duplicates dataframe and combines the _Neighbourhood_ values of rows that share a postcode into a single comma-separated value. The combined value, along with _Postcode_ and _Borough_, is appended to the final dataframe.

In [151]:
#Get the values of the first row
code = table_df2.iloc[0]["Postcode"]
hood = table_df2.iloc[0]["Neighbourhood"]
combined_rows = [] #One combined record per duplicated postcode

#Loop through the data frame and combine neighbourhood values.
#This assumes rows with the same postcode are adjacent in the duplicates dataframe.
for i in range(1,len(table_df2)):

    if (table_df2.iloc[i]["Postcode"] == code):
        hood = hood + ',' + str(table_df2.iloc[i]["Neighbourhood"]) #Combine values for same postcode.
    else:
        #Postcode changed: store the record built from the previous rows
        boro = table_df2.iloc[(i-1)]["Borough"]
        combined_rows.append({"Postcode":code,"Borough":boro,"Neighbourhood":hood,"Is_Duplicate":True})

        #Move to the next postcode values
        hood = table_df2.iloc[i]["Neighbourhood"]
        code = table_df2.iloc[i]["Postcode"]

#Store the last record, then append all combined records to the final dataframe.
#pd.concat is used because DataFrame.append was removed in pandas 2.0.
boro = table_df2.iloc[-1]["Borough"]
combined_rows.append({"Postcode":code,"Borough":boro,"Neighbourhood":hood,"Is_Duplicate":True})
neigh_df = pd.concat([neigh_df,pd.DataFrame(combined_rows)],ignore_index=True)

neigh_df.head()

| | Postcode | Borough | Neighbourhood | Is_Duplicate |
|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | False |
| 1 | M4A | North York | Victoria Village | False |
| 2 | M7A | Queen's Park | Not assigned | False |
| 3 | M9A | Etobicoke | Islington Avenue | False |
| 4 | M3B | North York | Don Mills North | False |
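
The same combination can also be expressed without an explicit loop. The sketch below is an alternative, not the approach used in this notebook: it assumes a dataframe shaped like the duplicates dataframe (the sample rows are hypothetical) and uses `groupby` with `agg` to join the neighbourhood names for each postcode.

```python
import pandas as pd

# Hypothetical duplicate rows; the values are illustrative only
dup_df = pd.DataFrame({
    "Postcode": ["M1X", "M1X", "M2Y", "M2Y", "M2Y"],
    "Borough": ["Borough A", "Borough A", "Borough B", "Borough B", "Borough B"],
    "Neighbourhood": ["Hood 1", "Hood 2", "Hood 3", "Hood 4", "Hood 5"],
})

# One row per postcode: keep the first borough and join the neighbourhood names.
# sort=False keeps the postcodes in order of first appearance.
combined_df = (
    dup_df.groupby("Postcode", sort=False, as_index=False)
    .agg({"Borough": "first", "Neighbourhood": ",".join})
)

print(combined_df["Neighbourhood"].tolist())
# ['Hood 1,Hood 2', 'Hood 3,Hood 4,Hood 5']
```

Unlike the loop, grouping does not depend on rows with the same postcode being adjacent.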


### 5. Final cleanup

The dataframe now holds the final result. Reset the index, drop the _Is_Duplicate_ column, and display the result.

In [152]:
#Reset the index so the rows are numbered consecutively
neigh_df.reset_index(drop=True,inplace = True)

#Drop Duplicate flag column
del neigh_df["Is_Duplicate"]

neigh_df

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M7A | Queen's Park | Not assigned |
| 3 | M9A | Etobicoke | Islington Avenue |
| 4 | M3B | North York | Don Mills North |
| 5 | M6B | North York | Glencairn |
| 6 | M4C | East York | Woodbine Heights |
| 7 | M5C | Downtown Toronto | St. James Town |
| 8 | M6C | York | Humewood-Cedarvale |
| 9 | M4E | East Toronto | The Beaches |


### 6. Display result size

In [153]:
neigh_df.shape

(103, 3)
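
Alongside the shape, it can be useful to confirm that each postcode now appears only once, since the earlier steps combine rows that share a postcode. A minimal sketch, assuming a hypothetical dataframe with the same three columns (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical final dataframe; the values are illustrative only
final_df = pd.DataFrame({
    "Postcode": ["M1X", "M2Y", "M3Z"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Neighbourhood": ["Hood 1,Hood 2", "Hood 3", "Hood 4"],
})

# shape is (rows, columns); is_unique is True when no postcode is repeated
n_rows, n_cols = final_df.shape
postcodes_unique = final_df["Postcode"].is_unique

print(n_rows, n_cols, postcodes_unique)  # 3 3 True
```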

### 7. Add location coordinates to the dataframe

Get the latitude and longitude for every postcode and merge them with the results dataframe from above. The coordinates are read from a CSV file.
The _Postcode_ column is the merge key on the results side and matches the _Postal Code_ column in the CSV data.

In [170]:
# Initialize path
geo_path = "http://cocl.us/Geospatial_data"
geo_df = pd.read_csv(geo_path) #Read csv to dataframe

#Merge the data
merge_df = neigh_df.merge(geo_df,left_on ="Postcode",right_on="Postal Code")
merge_df

| | Postcode | Borough | Neighbourhood | Postal Code | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | M3A | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | M4A | 43.725882 | -79.315572 |
| 2 | M7A | Queen's Park | Not assigned | M7A | 43.662301 | -79.389494 |
| 3 | M9A | Etobicoke | Islington Avenue | M9A | 43.667856 | -79.532242 |
| 4 | M3B | North York | Don Mills North | M3B | 43.745906 | -79.352188 |
| 5 | M6B | North York | Glencairn | M6B | 43.709577 | -79.445073 |
| 6 | M4C | East York | Woodbine Heights | M4C | 43.695344 | -79.318389 |
| 7 | M5C | Downtown Toronto | St. James Town | M5C | 43.651494 | -79.375418 |
| 8 | M6C | York | Humewood-Cedarvale | M6C | 43.693781 | -79.428191 |
| 9 | M4E | East Toronto | The Beaches | M4E | 43.676357 | -79.293031 |
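
The merged result carries the key twice, as _Postcode_ and _Postal Code_. Below is a minimal sketch of one way to tidy this, using hypothetical sample frames with made-up postcodes and coordinates. The `validate` argument is a standard `merge` parameter; the frame names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample frames; postcodes and coordinates are made up
hood_df = pd.DataFrame({
    "Postcode": ["M1X", "M2Y"],
    "Borough": ["Borough A", "Borough B"],
    "Neighbourhood": ["Hood 1,Hood 2", "Hood 3"],
})
coords_df = pd.DataFrame({
    "Postal Code": ["M2Y", "M1X", "M3Z"],
    "Latitude": [43.1, 43.2, 43.3],
    "Longitude": [-79.1, -79.2, -79.3],
})

# Inner join on the postcode; validate="one_to_one" raises a MergeError
# if a postcode is repeated on either side
merged_df = hood_df.merge(
    coords_df, left_on="Postcode", right_on="Postal Code", validate="one_to_one"
)

# Drop the redundant key column that came from the right-hand frame
merged_df = merged_df.drop(columns="Postal Code")

print(merged_df.columns.tolist())
# ['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']
```

Because the default join is an inner join, postcodes without a match on the other side (here `M3Z`) do not appear in the result.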
