# Segmenting and Clustering Neighborhoods in Toronto

## First part: Download the data

Load the libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np 
import pandas as pd

Download the Wikipedia page and find the table with the data

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table')
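
`soup.find('table')` returns the first table on the page, which works only as long as the data table comes first. The sketch below shows one way to select the table by CSS class instead. It runs on a small inline HTML snippet rather than the live page, and it assumes the data table carries the `wikitable` class; `html_sketch`, `soup_sketch`, and `table_sketch` are illustrative names that are not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: a navigation table followed by the data table.
html_sketch = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# "html.parser" ships with Python, so this sketch does not need lxml.
soup_sketch = BeautifulSoup(html_sketch, "html.parser")

# Assumption: the data table is marked with the "wikitable" class.
table_sketch = soup_sketch.find("table", class_="wikitable")

first_cells = [td.get_text(strip=True) for td in table_sketch.find_all("td")]
print(first_cells)
```

If the class is not present on the real page, `find` returns `None`, so a check for that case would be needed before using the result.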

Get the names of the columns from the first row of the table

In [3]:
headline = table.tbody.tr.text
# Split the header row on newlines and drop the empty strings
headline = [name for name in headline.split('\n') if name]
print(headline)

['Postcode', 'Borough', 'Neighbourhood']
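
The header cleanup amounts to splitting the row text on newlines and discarding the empty pieces. A minimal sketch on a hypothetical string that stands in for `table.tbody.tr.text` (the exact whitespace of the real page is an assumption):

```python
# Hypothetical raw text of the header row, with newlines between the cells.
raw_header = "\nPostcode\nBorough\nNeighbourhood\n"

# Keep only the non-empty pieces produced by the split.
header_sketch = [name for name in raw_header.split("\n") if name]
print(header_sketch)
```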


Get all the rows in the table and store them in a NumPy array, skipping the rows whose borough is "Not assigned"

In [4]:
data_table = np.array(headline)

# Skip the header row, then collect the text of every cell in each remaining row
for element in table.find_all('tr')[1:]:
    columns = element.find_all('td')
    el = [column.get_text().strip('\n') for column in columns]
    # Keep the row only if its borough (second column) is assigned
    if el[1] != 'Not assigned':
        data_table = np.vstack((data_table, el))

Create a pandas dataframe from the numpy array

In [5]:
df=pd.DataFrame(data=data_table[1:,0:],
                  columns=data_table[0,0:])
df=df.rename(columns={'Postcode':'Postal Code'})
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
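
The filtering and DataFrame steps can also be sketched without the intermediate NumPy array, by collecting the rows in a plain Python list and passing it to `pd.DataFrame` once. This is an alternative under stated assumptions, not the notebook's method: `rows_sketch` is hypothetical data standing in for the scraped cell text, and the `M1A` row is assumed to be an unassigned postcode.

```python
import pandas as pd

# Hypothetical scraped rows: [postcode, borough, neighbourhood].
rows_sketch = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]
columns_sketch = ["Postcode", "Borough", "Neighbourhood"]

# Drop the rows whose borough (second column) is not assigned.
kept_rows = [row for row in rows_sketch if row[1] != "Not assigned"]

# Build the DataFrame in one step and rename the key column.
df_sketch = pd.DataFrame(kept_rows, columns=columns_sketch)
df_sketch = df_sketch.rename(columns={"Postcode": "Postal Code"})
print(df_sketch)
```

Appending to a list and constructing the DataFrame once avoids copying the whole array on every `np.vstack` call.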


Shape of the dataframe

In [6]:
df.shape

(211, 3)
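
The row count is not the number of distinct postal codes, because a code can appear once per neighbourhood (M5A is listed for both Harbourfront and Regent Park in the head of the table). A small sketch of that check, using only the five rows shown above; `sample` is an illustrative name:

```python
import pandas as pd

# The five rows shown in the head of the DataFrame.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M5A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights"],
})

n_rows = sample.shape[0]
n_codes = sample["Postal Code"].nunique()

# Postal codes that occur on more than one row.
repeated = sample.loc[sample["Postal Code"].duplicated(keep=False),
                      "Postal Code"].unique().tolist()
print(n_rows, n_codes, repeated)
```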

## Second part: Geolocation

Download the CSV file with the geospatial data

In [7]:
!wget -O Geospatial_Coordinates.csv https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv

--2019-05-20 18:11:26--  https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-20 18:11:27--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.24.197, 107.152.25.197
Connecting to ibm.box.com (ibm.box.com)|107.152.24.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-05-20 18:11:28--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjj

In [8]:
geo_data = pd.read_csv("Geospatial_Coordinates.csv", delimiter=",")
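
To illustrate what `pd.read_csv` produces here, the sketch below parses a few lines of CSV text from memory instead of the downloaded file. The column names and values are taken from the merged output in this notebook; the assumption is that the real file uses the same header. `csv_text` and `geo_sketch` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for Geospatial_Coordinates.csv.
csv_text = """Postal Code,Latitude,Longitude
M3A,43.753259,-79.329656
M4A,43.725882,-79.315572
M5A,43.65426,-79.360636
"""

# The comma is the default delimiter, so it does not need to be passed.
geo_sketch = pd.read_csv(io.StringIO(csv_text))

print(geo_sketch.dtypes)
# One row per postal code is what a merge on this key would rely on.
print(geo_sketch["Postal Code"].is_unique)
```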

Create a pandas DataFrame with the geospatial data

In [9]:
df_with_geo = pd.merge(df,geo_data, on='Postal Code')
df_with_geo.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M5A | Downtown Toronto | Regent Park | 43.65426 | -79.360636 |
| 4 | M6A | North York | Lawrence Heights | 43.718518 | -79.464763 |
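
`pd.merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without any warning. One way to check for that, sketched on toy data, is a left join with `indicator=True`. The `left` and `right` frames are illustrative, and `M0Z` is a made-up code that has no coordinates.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0Z"],  # "M0Z" is hypothetical
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Default inner join: the unmatched row disappears.
inner = pd.merge(left, right, on="Postal Code")

# Left join with an indicator column: unmatched rows are kept and flagged.
checked = pd.merge(left, right, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(len(inner), unmatched)
```

An empty `unmatched` list on the real data would indicate that the inner join lost no rows.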


Merge the geospatial data with the previous DataFrame