# Neighbourhood Clustering

### Index

<div class="alert alert-block alert-info" style="margin-top: 20px">
<font size=3>
    
1. <a href="#item1">Importing and Cleaning Data</a>
    
2. <a href="#item2">Getting the geolocation of each neighbourhood</a>
    
3. <a href="#item3">Clustering the neighbourhoods</a>
    
</font>
</div>

<a id='item1'></a>

## Part 1: Importing and Cleaning Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.columns=['PostalCode', 'Borough', 'Neighborhood']

In [3]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


In [4]:
index = df[df['Borough']=="Not assigned"].index
df.drop(index, inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [5]:
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [6]:
df.shape

(103, 3)
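The filter, drop, and reset-index steps above can also be expressed as a single boolean-mask expression. The following is a minimal sketch on a small hand-written frame; the helper name `drop_unassigned` and the four sample rows are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def drop_unassigned(frame, column="Borough", marker="Not assigned"):
    """Return a copy without rows whose `column` equals `marker`, reindexed from 0."""
    return frame[frame[column] != marker].reset_index(drop=True)


# Small stand-in for the scraped table (values mirror the first rows shown above).
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A", "M8A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

cleaned = drop_unassigned(sample)
print(cleaned)
```

Returning a new frame instead of using `inplace=True` leaves the scraped table untouched, which makes it easier to re-run a cell without re-downloading the page.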

<a id='item2'></a>

## Part 2: Getting the geolocation of each neighbourhood

In [7]:
import geocoder

In [8]:
# This loop should work in principle, but geocoder.google kept returning None in practice,
# so the coordinates are loaded from a CSV file instead (see the next cell).
'''
latitude = []
longitude = []

for i in df.index:
    ll = None
    while (ll is None):
        g = geocoder.google('{}, Toronto, Ontario, Canada'.format(df.loc[i]['PostalCode']))
        ll = g.latlng
        
    latitude.append(ll[0])
    longitude.append(ll[1])
    print("Finished run: {}".format(i))

    # I interrupted the run from the keyboard because the package kept returning None
'''
print()
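The disabled loop above retries forever when the lookup keeps returning `None`. One way to avoid having to interrupt it by hand is to cap the number of attempts. The sketch below is only an illustration: `lookup_with_retries` and `flaky_lookup` are hypothetical names, and the flaky function stands in for a real geocoder call so the example runs without network access.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # caller decides how to handle a missing result


# Stand-in for a geocoder call: fails twice, then returns coordinates.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.806686, -79.194353]


coords = lookup_with_retries(flaky_lookup, "M1B, Toronto, Ontario, Canada")
missing = lookup_with_retries(lambda q: None, "M1B", max_attempts=2)
print(coords, missing)
```

A postal code that still has no result after the last attempt comes back as `None`, so it can be collected and filled in from another source such as the CSV file.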




In [9]:
latlngFrame = pd.read_csv("geoCood.csv")
latlngFrame.columns = ['PostalCode', 'Latitude', 'Longitude']
latlngFrame.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
dfFull = df.set_index('PostalCode').join(latlngFrame.set_index('PostalCode'))

In [11]:
dfFull.reset_index(inplace=True)

In [12]:
dfFull.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
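The index-based join above could also be written as a column merge that checks its own assumptions. This is a sketch on three hand-copied rows, assuming each postal code appears once in both frames; the frame names `boroughs` and `coords` are illustrative.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M1B"],
    "Borough": ["North York", "North York", "Scarborough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M3A", "M4A"],
    "Latitude": [43.806686, 43.753259, 43.725882],
    "Longitude": [-79.194353, -79.329656, -79.315572],
})

# how="left" keeps every row of `boroughs` in its original order;
# validate raises pandas.errors.MergeError if a postal code repeats on either side.
merged = boroughs.merge(coords, on="PostalCode", how="left", validate="one_to_one")

# Postal codes without a match in `coords` would show up as NaN coordinates.
unmatched = int(merged["Latitude"].isna().sum())
print(merged)
print("rows without coordinates:", unmatched)
```

Counting the missing coordinates right after the merge catches gaps in the CSV file before they reach the clustering step.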


<a id='item3'></a>

## Part 3: Clustering the neighbourhoods

In this section, I cluster the neighbourhoods by location (latitude and longitude). Through experimentation (e.g. varying the number of clusters, k), I concluded that the neighbourhoods in Toronto can roughly be divided into 4 major sectors. Larger values of k are possible, but the resulting map has too many clusters.
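The choice of k here was made by inspecting maps. A complementary check, not used in this notebook, is the elbow heuristic: fit k-means for several values of k and compare the inertia (within-cluster sum of squared distances). The sketch below uses synthetic points around four made-up centres as a stand-in for the real coordinates, so its numbers say nothing about Toronto itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for (latitude, longitude) pairs: four blobs around made-up centres.
centres = np.array([[43.65, -79.38], [43.77, -79.25], [43.75, -79.45], [43.65, -79.53]])
points = np.vstack([c + rng.normal(scale=0.02, size=(25, 2)) for c in centres])

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 4))
```

Inertia always falls as k grows, so the value to look for is the k after which the improvement levels off rather than the smallest inertia.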

In [13]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
olat = location.latitude
olng = location.longitude
print("Latitude: {}, Longitude: {}".format(olat, olng))

Latitude: 43.6534817, Longitude: -79.3839347


In [15]:
OMap = folium.Map(location=[olat, olng], zoom_start=10)


for lat, lng, borough, neighborhood in zip(dfFull['Latitude'], dfFull['Longitude'], dfFull['Borough'], dfFull['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(OMap)

OMap

In [16]:
k=4
dfg = dfFull.copy()
# keep only the Latitude and Longitude columns as clustering features
fulledit = dfFull.drop(['PostalCode', 'Borough', 'Neighborhood'], axis=1)
km = KMeans(n_clusters=k, random_state=0).fit(fulledit)
len(km.labels_)

103

In [17]:
dfg.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [18]:
dfFull.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [19]:
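dfg.insert(0, 'Labels', km.labels_)
dfg.drop(['PostalCode', 'Borough', 'Latitude', 'Longitude'], axis=1, inplace=True)
# Attach the cluster labels to the full frame by matching on the neighbourhood name.
# Caveat: a name shared by several postal codes (e.g. Don Mills) matches more than once,
# so those rows are duplicated with different labels (rows 7 and 8 in the output).
every = dfFull.join(dfg.set_index("Neighborhood"), on="Neighborhood")
every.drop(every[every['Labels'].isnull()].index, axis=0, inplace=True)
every.reset_index(drop=True, inplace=True)
every.head(10)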

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Labels
0,M3A,North York,Parkwoods,43.753259,-79.329656,2
1,M4A,North York,Victoria Village,43.725882,-79.315572,2
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,3
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,1
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2
7,M3B,North York,Don Mills,43.745906,-79.352188,3
8,M3B,North York,Don Mills,43.745906,-79.352188,0
9,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,0
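Rows 7 and 8 in the output repeat M3B with two different labels, because the join key (the neighbourhood name) is not unique. Since `km.labels_` is in the same row order as the frame that was fitted, the labels can instead be attached by position. The sketch below contrasts the two approaches on a made-up three-row frame; the label values are placeholders, not real clustering output.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "PostalCode": ["M3A", "M3B", "M3C"],
    "Neighborhood": ["Parkwoods", "Don Mills", "Don Mills"],
})
labels = np.array([2, 3, 0])  # made-up stand-in for km.labels_

# Joining on the non-unique name matches each "Don Mills" row twice.
lookup = frame.assign(Labels=labels)[["Neighborhood", "Labels"]].set_index("Neighborhood")
by_name = frame.join(lookup, on="Neighborhood")

# Assigning by position keeps exactly one label per row.
by_position = frame.assign(Labels=labels)

print(len(by_name), len(by_position))
```

The positional version relies on the frame not being reordered or filtered between fitting and labelling.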


In [20]:
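# create map centred on Toronto (olat, olng from In [14]);
# lat and lng would be the coordinates of the last marker drawn in In [15]
clustered_map = folium.Map(location=[olat, olng], zoom_start=10)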

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(every['Latitude'], every['Longitude'], every['Neighborhood'], every['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(clustered_map)
       
clustered_map