# Toronto Neighborhood Segmentation and Clustering
Alif Sussardi

This week 3 project segments and clusters the neighborhoods of Toronto, using data from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The table on that page is scraped and loaded into a pandas dataframe.

### First, import all the packages needed for this project.

Note: some packages may be imported later, where relevant.

In [283]:
# Pandas for dataframe analysis
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# transform JSON data into a pandas dataframe
# (json_normalize is exposed at the top level since pandas 1.0; the pandas.io.json path is deprecated)
from pandas import json_normalize

#Numpy for array operations
import numpy as np 

# json for handling JSON data
import json 

# convert address into coordinate values
from geopy.geocoders import Nominatim 

# library to handle requests
import requests 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# KMeans from sklearn for clustering
from sklearn.cluster import KMeans

# to render map
import folium 

# Beautiful Soup library for parsing HTML
#!pip install beautifulsoup4    #to install bs4 package
from bs4 import BeautifulSoup

#package to process html and xml
import lxml

print('Libraries imported.')

Libraries imported.


## Get the Data, Wrangle It, and Add Coordinates

### Get the data

In [284]:
#declare data source url
data_source = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# download and parse the data from url
req = requests.get(data_source)
soup = BeautifulSoup(req.text, 'html.parser')
table = soup.find('table', attrs={'class':'wikitable sortable'})

### Get the rows

In [291]:
# Read the column names from the table header cells (<th>)
header = [th.get_text(strip=True) for th in table.findAll('th')]

In [292]:
# Get all the data rows of the table; each row is a <tr> element.

# findAll returns the rows as a list
rows = table.findAll('tr')

# the first row is the header, so keep everything from the second row on
rows = rows[1:]

# strip the leading and trailing HTML from each row
for i, row in enumerate(rows): 
    rows[i] = str(row).replace("\n</td></tr>","").replace("<tr>\n<td>","")
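
As a point of comparison, the cell text can also be read with Beautiful Soup's `get_text` instead of string replacement on the raw HTML. The sketch below is only an illustration under stated assumptions: `sample_html`, `sample_table`, `records` and `sample_df` are hypothetical names, and the inline HTML is a tiny stand-in for the downloaded page, not the real table.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real table is much longer.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Column names come from the <th> cells of the table.
columns = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# Each data row becomes a list of cell texts.
# Rows without <td> cells (the header row) are skipped.
records = []
for tr in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        records.append(cells)

sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

This variant does not depend on the exact whitespace between tags, which may make it less sensitive to small changes in the page markup.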
    

### Create New Dataframe from Rows

In [293]:
# Make dataframe from rows
df=pd.DataFrame(rows)

# split the raw html in df[0] on the cell separators and assign the pieces to the header columns
df[header] = df[0].str.split("\n</td>\n<td>", expand = True) 

# Remove the original html column
df.drop(columns=[0],inplace=True)

# Rename the 'Postal Code' column to 'PostalCode'
df.rename(columns = {'Postal Code':'PostalCode'}, inplace=True)
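
To see what the split step does in isolation, here is a minimal sketch on two hand-written rows. The frame `raw` and the list `names` are hypothetical and only mimic the shape of the cleaned row strings used in this notebook.

```python
import pandas as pd

# Hypothetical raw row strings, shaped like the cleaned rows in this notebook.
raw = pd.DataFrame({0: [
    "M3A\n</td>\n<td>North York\n</td>\n<td>Parkwoods",
    "M4A\n</td>\n<td>North York\n</td>\n<td>Victoria Village",
]})

names = ["Postal Code", "Borough", "Neighbourhood"]

# expand=True returns one column per piece,
# so the three pieces map onto the three names.
raw[names] = raw[0].str.split("\n</td>\n<td>", expand=True)

# Drop the raw HTML column and rename 'Postal Code' to 'PostalCode'.
raw = raw.drop(columns=[0]).rename(columns={"Postal Code": "PostalCode"})
print(raw)
```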

In [294]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Dealing with "Not assigned" Values and Duplicates

In [295]:
# delete rows whose Borough column is "Not assigned"
df = df.drop(df[df.Borough == "Not assigned"].index).reset_index(drop=True)

# where only the Neighbourhood is "Not assigned", use the Borough instead
not_assigned = df.Neighbourhood == "Not assigned"
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

# if a Neighbourhood cell is empty, fill it with the Borough
df['Neighbourhood'] = df['Neighbourhood'].fillna(df['Borough'])

# Remove duplicates
df = df.drop_duplicates()
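
The three cleaning rules can be checked on a small example. This is a sketch on made-up rows, assuming the same column names as the notebook. The frame `toy` and its values are hypothetical and chosen only so that each rule has something to act on.

```python
import pandas as pd

# Hypothetical toy rows covering the three cases handled in this section.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Not assigned"],
})

# 1. Drop rows whose Borough is "Not assigned".
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# 2. Where only the Neighbourhood is "Not assigned", copy the Borough into it.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

# 3. Remove exact duplicate rows.
toy = toy.drop_duplicates().reset_index(drop=True)
print(toy)
```

After these steps the toy frame holds one row for M3A and one for M7A, and no cell still contains "Not assigned".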


### Combine All Boroughs and Neighbourhoods with the Same PostalCode

In [296]:
# combine the rows that share a postal code
# index all unique postal codes
df_un = pd.DataFrame({'PostalCode':df.PostalCode.unique()})

# add a Borough column: the distinct boroughs for each PostalCode, joined into one string
df_un['Borough'] = [', '.join(df.loc[df['PostalCode'] == pc, 'Borough'].unique()) for pc in df_un['PostalCode']]

# same method for a Neighborhood column: the distinct neighbourhoods for each PostalCode
df_un['Neighborhood'] = [', '.join(df.loc[df['PostalCode'] == pc, 'Neighbourhood'].unique()) for pc in df_un['PostalCode']]
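
A possible alternative for this step is `groupby` with named aggregation, which combines rows per postal code in a single pass. This is a sketch rather than the notebook's own method. The frame `toy`, the helper `join_unique` and the sample rows are hypothetical, and the rows assume a postal code that is split across several lines.

```python
import pandas as pd

# Hypothetical rows in which one postal code is split across several lines.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})


def join_unique(values):
    """Join the distinct values in order of first appearance."""
    return ", ".join(values.unique())


# sort=False keeps postal codes in the order they first appear.
combined = (
    toy.groupby("PostalCode", sort=False)
       .agg(Borough=("Borough", join_unique),
            Neighborhood=("Neighbourhood", join_unique))
       .reset_index()
)
print(combined)
```

Each postal code ends up on one row, with repeated boroughs collapsed and the neighbourhoods joined by commas.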

In [297]:
df_un.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
