## Segmenting and Clustering Neighbourhoods in Toronto, CA

First import all the libraries and dependencies needed

In [1]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')

Libraries imported.


Then store the web address in a variable for easy use

In [2]:
data_html = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Use pandas.read_html to read the tables from the web page and keep only the first one, which contains the data needed for this project. Then check what the first ten entries look like

In [3]:
neighbourhoods = pd.read_html(data_html, header=0)[0]
neighbourhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Rename the Postcode column to Postal Code, then drop every row whose borough is Not assigned

In [4]:
neighbourhoods.columns=['Postal Code', 'Borough', 'Neighbourhood']
neighbourhoods = neighbourhoods[neighbourhoods.Borough!='Not assigned']
neighbourhoods.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
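
Later cells assign new values into this filtered frame, which can make pandas emit a `SettingWithCopyWarning`. The following is a minimal sketch of one way around that, assuming a small made-up frame (`toy` and `assigned` are hypothetical names, not the scraped Wikipedia table): take an explicit copy right after filtering.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped table
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York ', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned borough; .copy() makes the result independent of `toy`
assigned = toy[toy['Borough'] != 'Not assigned'].copy()

# Column assignments now change `assigned` only, never `toy`
assigned['Borough'] = assigned['Borough'].str.strip()
print(assigned)
```

The copy costs a little memory, but it makes clear which frame each later assignment is meant to change.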


Next, find the rows whose neighbourhood is Not assigned and use each row's borough as its neighbourhood instead. Then check the first ten entries to confirm the change has taken place

In [5]:
# Rows that have a borough but no assigned neighbourhood
not_assigned = neighbourhoods['Neighbourhood'] == 'Not assigned'
# Use each row's own borough as its neighbourhood
neighbourhoods.loc[not_assigned, 'Neighbourhood'] = neighbourhoods.loc[not_assigned, 'Borough']
neighbourhoods.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


Next, merge the rows that share the same Postal Code and Borough, joining their neighbourhoods with a comma and a space for easy reading. Then drop the duplicate rows this creates and check the first ten entries in the dataframe

In [6]:
neighbourhoods['Neighbourhood'] = neighbourhoods.groupby(['Postal Code','Borough'])['Neighbourhood'].transform(lambda x: ', '.join(x)) 
neighbourhoods.drop_duplicates(inplace=True)
neighbourhoods.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Harbourfront, Regent Park"
6,M6A,North York,"Lawrence Heights, Lawrence Manor"
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,"Rouge, Malvern"
14,M3B,North York,Don Mills North
15,M4B,East York,"Woodbine Gardens, Parkview Hill"
17,M5B,Downtown Toronto,"Ryerson, Garden District"
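
The same grouping can also be done in one step, so that no duplicate rows are created in the first place. The sketch below is an alternative to the `transform` approach, not the notebook's own method, and it runs on made-up rows (`toy` and `combined` are hypothetical names). It aggregates each group with `agg` and then restores the grouping keys as columns.

```python
import pandas as pd

# Hypothetical toy data: two rows share postal code M5A
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# Collapse each (Postal Code, Borough) group into one row, joining names with ', '.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

Unlike the `transform` version, the result has a fresh 0-based index rather than the original row labels.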


Finally, print the shape of the dataframe, which should be (103, 3)

In [7]:
neighbourhoods.shape

(103, 3)
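
The shape alone does not show whether the cleaning worked as intended. As a rough sketch, assuming a toy frame in the same layout (`toy`, `codes_unique` and `unassigned` are hypothetical names), two further checks are that every postal code appears once and that no Not assigned labels remain.

```python
import pandas as pd

# Hypothetical toy frame in the same layout as the cleaned table
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront, Regent Park'],
})

# True when no postal code appears more than once
codes_unique = toy['Postal Code'].is_unique

# Number of rows still labelled 'Not assigned' in either column
unassigned = int((toy[['Borough', 'Neighbourhood']] == 'Not assigned').any(axis=1).sum())

print(codes_unique, unassigned)
```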

Next, load the geospatial coordinates from a CSV file, because of reliability issues with Geocoder. Then merge the two dataframes together on the postal code and print the head

In [8]:
df = pd.read_csv('Geospatial_Coordinates.csv')
df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Join on the column the two dataframes share
full_data = pd.merge(neighbourhoods, df, on='Postal Code')
full_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
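
By default `pd.merge` performs an inner join, so any postal code missing from the coordinates file is dropped without warning. The sketch below shows one way to surface such rows, using made-up data (`hoods`, `coords` and the code `M9Z` are hypothetical): a left join with the `validate` and `indicator` options.

```python
import pandas as pd

# Hypothetical toy stand-ins for the neighbourhood table and the coordinates CSV
hoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighbourhood row;
# validate raises MergeError if a postal code repeats on either side
merged = pd.merge(hoods, coords, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# Rows marked 'left_only' have no coordinates in the CSV
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

If `missing` is empty, the inner join and the left join return the same rows, and the `_merge` column can be dropped.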
