## Clustering Toronto Neighborhoods

### In this assignment, I explore, segment, and cluster the neighborhoods of Toronto, Canada.

#### First step: scrape the Wikipedia page to build a pandas dataframe with three columns: Postal code, Borough, and Neighborhood

In [1]:
# install the libraries for web scraping: Beautiful Soup and Requests
!pip install beautifulsoup4
!pip install requests



In [4]:
# import the libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# fetch the page and create the soup object from its HTML
# (naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning)
request = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(request.text, 'html.parser')

# using the soup object, iterate the table to get the data
data = []
columns = []
table = soup.find(class_='wikitable')
for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th','td']):
        section.append(td.text.rstrip())
    
    # first row is the header
    if index == 0:
        columns = section
    else:
        data.append(section)

# transform into Pandas DataFrame
df_can = pd.DataFrame(data=data, columns=columns)
df_can.head(10)

,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
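
The header/data split used in the loop above can be tried offline on a small inline table. The following is only a sketch: it swaps Beautiful Soup for the standard library's `html.parser` so it runs without a network connection or extra packages, and the `TableRows` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect the text of every <th>/<td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up fragment shaped like the Wikipedia table
sample_html = """
<table class="wikitable">
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)

# first row is the header, the rest are data rows
columns, data = parser.rows[0], parser.rows[1:]
print(columns)
print(data)
```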


In [5]:
# replace the slashes in the cells with commas (literal match, not a regular expression)
df_can['Neighborhood'] = df_can['Neighborhood'].str.replace(' /', ',', regex=False)
df_can.head(10)

,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"


In [6]:
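# remove rows whose borough is 'Not assigned'
# (.copy() makes the result an independent frame, so later column assignments do not warn)
df_can = df_can[df_can['Borough'] != 'Not assigned'].copy()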
df_can.head(10)

,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [12]:
# combine neighborhoods in the same postal code into one row, with the neighborhoods separated with a comma
df_can["Neighborhood"] = df_can.groupby("Postal code")["Neighborhood"].transform(lambda neigh: ', '.join(neigh))

# remove the duplicates
df_can = df_can.drop_duplicates()

# make the postal code the index if it isn't already
if df_can.index.name != 'Postal code':
    df_can = df_can.set_index('Postal code')
    
df_can.head(10)

Postal code,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Malvern, Rouge"
M3B,North York,Don Mills
M4B,East York,"Parkview Hill, Woodbine Gardens"
M5B,Downtown Toronto,"Garden District, Ryerson"
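
For reference, the combining step can also be written as a single `groupby(...).agg(...)`, which returns one row per postal code directly instead of transforming and then dropping duplicates. This is a sketch on made-up rows (the `toy` frame is not the scraped table), assuming a postal code always maps to one borough.

```python
import pandas as pd

# Toy data for illustration only: M5A appears twice, M3A once
toy = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# one row per (postal code, borough); neighborhoods joined with ", "
combined = (
    toy.groupby(["Postal code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
       .set_index("Postal code")
)
print(combined)
```

`sort=False` keeps the postal codes in order of first appearance, matching the row order of the original table.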


In [14]:
# If a cell has a borough but a 'Not assigned' neighborhood, use the borough name as the neighborhood
unassigned = df_can['Neighborhood'] == 'Not assigned'
df_can['Neighborhood'] = df_can['Neighborhood'].mask(unassigned, df_can['Borough'])
df_can.head(10)

Postal code,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Malvern, Rouge"
M3B,North York,Don Mills
M4B,East York,"Parkview Hill, Woodbine Gardens"
M5B,Downtown Toronto,"Garden District, Ryerson"
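
None of the first ten rows has a 'Not assigned' neighborhood, so the output above does not show the fallback taking effect. The sketch below applies the same rule to two made-up rows using `Series.mask`, which replaces values row by row wherever a condition holds; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only: the first row has no neighborhood assigned
toy = pd.DataFrame(
    {
        "Borough": ["Queen's Park", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    },
    index=pd.Index(["M7A", "M3A"], name="Postal code"),
)

# where the neighborhood is 'Not assigned', take the borough from the same row
unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```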


#### Get the latitude and longitude of each postal code by loading the CSV file into a dataframe.

#### The loading cell is hidden because it contains the API credentials used to load the file.

In [28]:
# The code was removed by Watson Studio for sharing.

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [29]:
# rename the columns and make the postcode the index if it isn't already
lat_long.columns = ["Postcode", "Latitude", "Longitude"]
if lat_long.index.name != 'Postcode':
    lat_long = lat_long.set_index('Postcode')
    
lat_long.head()

Postcode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [33]:
# left-join the coordinates onto the neighborhoods; both frames are indexed by postal code
df_can = df_can.join(lat_long)
df_can.head(10)

Postal code,Borough,Neighborhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
M3B,North York,Don Mills,43.745906,-79.352188
M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
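
Because `DataFrame.join` is a left join on the index by default, a postal code with no match in the coordinates file would be kept with `NaN` latitude and longitude rather than dropped. A quick check for such rows might look like the sketch below, shown on made-up frames (`left`, `right`, and their values are assumptions, not the notebook's data).

```python
import pandas as pd

# Toy data for illustration only: M9A has no entry in the coordinates frame
left = pd.DataFrame(
    {"Borough": ["North York", "Etobicoke"]},
    index=pd.Index(["M3A", "M9A"], name="Postal code"),
)
right = pd.DataFrame(
    {"Latitude": [43.753259], "Longitude": [-79.329656]},
    index=pd.Index(["M3A"], name="Postcode"),
)

# left join on the index: every row of `left` is kept
joined = left.join(right)

# rows that did not find coordinates
missing = joined[joined[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing)
```

An empty `missing` frame on the real data would indicate that every postal code received coordinates.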


In [32]:
df_can.to_csv('toronto_neigh.csv')
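
When the saved file is read back later, the postal code comes back as an ordinary column unless it is named as the index again. The sketch below shows the round trip on a made-up one-row frame, writing to an in-memory buffer instead of `toronto_neigh.csv` so it leaves no file behind.

```python
import io

import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {"Borough": ["North York"], "Latitude": [43.753259]},
    index=pd.Index(["M3A"], name="Postal code"),
)

# write to an in-memory buffer; to_csv includes the index as the first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# restore the postal code as the index when reading back
restored = pd.read_csv(buffer, index_col="Postal code")
print(restored)
```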