<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this peer-graded assignment, we use data from Wikipedia to explore, segment, and cluster neighborhoods in the city of Toronto, using postal code and borough information. We convert the postal codes into their equivalent latitude and longitude values, apply the k-means clustering algorithm to group the neighborhoods, and finally use the Folium library to visualize the neighborhoods in the city of Toronto and their emerging clusters.


In [2]:
# Import all necessary libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import urllib.request
from bs4 import BeautifulSoup # to parse HTML and XML
from IPython.display import display_html

In [3]:
# Scrape Wiki page for information about the city of Toronto - Old link works
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
#url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' # new link doesn't work
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
print(soup.title)
table = str(soup.table)
display_html(table, raw = True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


In [4]:
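# Read the HTML table into a pandas dataframe
from io import StringIO # newer pandas versions expect a file-like object rather than a raw HTML string
df_init = pd.read_html(StringIO(table))
df = df_init[0] # get the first table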
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
# Drop the rows where Borough is 'Not assigned'
df_new = df[~df['Borough'].str.contains('Not assigned')].reset_index(drop = True)

# Combine the neighbourhoods with same Postal code
df_combined = df_new.groupby(['Postal Code','Borough'], sort = False).agg(', '.join)
df_combined.reset_index(inplace = True)

# If Neighbourhood contains 'Not assigned', but the Borough has a name, replace the 'Not assigned' with the name of the Borough
df_combined.loc[df_combined['Neighbourhood'].str.contains('Not assigned'), 'Neighbourhood'] = df_combined['Borough']
df_combined.shape 

(103, 3)
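
The three cleaning steps in the cell above can be checked on a tiny hand-made table. This is a minimal sketch, not the notebook's data: the rows in `toy` are illustrative assumptions chosen so that each rule fires once, and the names `toy`, `kept`, `combined`, and `mask` are hypothetical.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. Drop rows whose borough is not assigned.
kept = toy[toy['Borough'] != 'Not assigned']

# 2. Collapse duplicate postal codes into one comma-separated row.
combined = (kept.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
                .agg(', '.join)
                .reset_index())

# 3. Fall back to the borough name where the neighbourhood is not assigned.
mask = combined['Neighbourhood'] == 'Not assigned'
combined.loc[mask, 'Neighbourhood'] = combined.loc[mask, 'Borough']

print(combined)
```

Selecting the `Neighbourhood` column before aggregating keeps the join restricted to the one text column it is meant for.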

In [6]:
df_combined

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [9]:
# get the latitude and the longitude coordinates of each neighborhood from a csv file
geo_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
# Write the csv file into the current folder
#geo_coordinates.to_csv('Toronto_Geospatial_data.csv')
geo_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
geo_coordinates.shape

(103, 3)

In [11]:
# Combine the dataframes, adding the latitude and longitude for each postal code
updated_geoCoor = pd.merge(df_combined, geo_coordinates, on = 'Postal Code')
updated_geoCoor.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
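
The merge above is an inner join, so a postal code missing from the coordinates file would silently disappear. One possible way to guard against that is sketched below on a few rows copied from the outputs above. The frames `left` and `right` are stand-ins for `df_combined` and `geo_coordinates`, and the use of `how='left'`, `indicator=True`, and `validate='one_to_one'` is an added assumption, not part of the original cell.

```python
import pandas as pd

# Small stand-ins for df_combined and geo_coordinates.
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Latitude': [43.753259, 43.725882, 43.65426],
    'Longitude': [-79.329656, -79.315572, -79.360636],
})

# how='left' keeps every postal code from the left frame, indicator=True adds
# a '_merge' column, and validate raises if a postal code is duplicated on
# either side.
checked = pd.merge(left, right, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Rows that found no coordinates are marked 'left_only'.
unmatched = checked[checked['_merge'] != 'both']
print(len(unmatched), 'postal codes without coordinates')
```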


In [34]:
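# Extract only the boroughs within the city of Toronto (names containing 'Toronto')
# .copy() gives an independent dataframe, so adding a column later does not act on a slice
city_Toronto = updated_geoCoor[updated_geoCoor['Borough'].str.contains('Toronto', regex = False)].copy()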

In [35]:
city_Toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


In [54]:
# Visualize all the neighbourhoods of the above dataframe using Folium
latitude = 43.651070
longitude = -79.347015
map_toronto = folium.Map(location = [latitude,longitude], zoom_start = 12)

for lat,lng,borough,neighbourhood in zip(city_Toronto['Latitude'],city_Toronto['Longitude'],city_Toronto['Borough'],city_Toronto['Neighbourhood']):
    nb_label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(nb_label, parse_html = True) # parse_html belongs to the popup, not the marker
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'blue',
        fill_opacity = 0.6).add_to(map_toronto)
map_toronto

In [37]:
# pre-process
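# keep only the numeric columns (Latitude, Longitude); axis must be passed by keyword in recent pandas
Toronto = city_Toronto.drop(['Postal Code', 'Borough', 'Neighbourhood'], axis = 1)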

In [38]:
from sklearn.preprocessing import StandardScaler

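# Note: Toronto has the columns [Latitude, Longitude], so [:,1:] keeps only Longitude
# and the clustering below runs on that single feature (see the one-column output)
X = Toronto.values[:,1:]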
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset

array([[ 0.80095905],
       [ 0.07116779],
       [ 0.33813758],
       [ 0.42713509],
       [ 2.51062327],
       [ 0.4805331 ],
       [ 0.12455821],
       [-0.76514129],
       [ 0.19574965],
       [-1.26322593],
       [ 0.26694361],
       [-0.69397767],
       [ 1.01459911],
       [ 0.27139197],
       [-0.90746094],
       [ 1.94059611],
       [ 0.3158882 ],
       [ 1.29948119],
       [ 0.08896375],
       [-0.62281153],
       [-2.40130265],
       [ 0.05337183],
       [-0.48047417],
       [-1.83233257],
       [-0.3381267 ],
       [-0.3381267 ],
       [-1.61893528],
       [ 0.08896375],
       [-0.19577165],
       [-2.33018961],
       [ 0.23134663],
       [-0.19577165],
       [-0.19577165],
       [-0.053409  ],
       [ 0.37373709],
       [ 0.44159795],
       [ 0.62293873],
       [ 0.25359348],
       [ 0.23134663],
       [ 1.78920524]])
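
The scaled array above has a single column because the slice `[:, 1:]` drops the first column. The sketch below illustrates this on five latitude/longitude pairs copied from the `city_Toronto` output, and reproduces by hand the default standardization formula, (x - mean) / population standard deviation. The helper `standardize` and the variable names are hypothetical, and only NumPy is used.

```python
import numpy as np

# Latitude/longitude pairs copied from the first rows of city_Toronto above.
coords = np.array([
    [43.654260, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
])

lon_only = coords[:, 1:]   # what the slice [:, 1:] selects: shape (5, 1)
both = coords              # latitude and longitude together: shape (5, 2)

def standardize(a):
    # Column-wise z-score using the population standard deviation (ddof=0).
    return (a - a.mean(axis=0)) / a.std(axis=0)

scaled = standardize(both)
print(lon_only.shape, both.shape)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Passing both columns would let the clustering account for north-south position as well as east-west position.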

In [44]:
# Cluster the neighborhoods: run k-means to group them into 5 clusters.
num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset) #k_means.fit(Toronto)
labels = k_means.labels_

print(labels)

[3 4 4 4 1 4 4 0 4 2 4 0 3 4 0 1 4 3 4 0 2 4 0 2 0 0 2 4 0 2 4 0 0 4 4 4 3
 4 4 1]
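
Because the cell above sets no random seed, the label numbers can differ from run to run. The following is a minimal sketch of a reproducible variant that also clusters on both coordinates. It uses ten rows copied from the `city_Toronto` output, and `random_state=0` and `n_clusters=3` are assumptions made for this small sample, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Coordinates copied from the city_Toronto rows shown earlier.
coords = np.array([
    [43.654260, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
    [43.644771, -79.373306],
    [43.657952, -79.387383],
    [43.669542, -79.422564],
    [43.650571, -79.384568],
    [43.669005, -79.442259],
])

# Standardize both columns so neither coordinate dominates the distance.
features = StandardScaler().fit_transform(coords)

# random_state is an added assumption; the original cell does not set one.
model = KMeans(n_clusters=3, init='k-means++', n_init=12, random_state=0)
labels = model.fit_predict(features)

# Refitting with the same seed gives the same labels.
labels_again = KMeans(n_clusters=3, init='k-means++', n_init=12,
                      random_state=0).fit_predict(features)

print(labels)
print(model.inertia_)
```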


In [52]:
# Add the k-means labels to the dataframe, one label per postal code row
# (drop any existing label column first so the cell can be re-run)
if ('K_mean Labels' in city_Toronto.columns):
    city_Toronto = city_Toronto.drop('K_mean Labels', axis = 1)
city_Toronto.insert(5, 'K_mean Labels', labels)
city_Toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,K_mean Labels
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,3
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,4
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,4
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,1
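
Once the labels are attached, a quick per-cluster summary helps when reading the map. The sketch below groups the five rows shown above by label. The frame `labelled` is a hand-copied stand-in for `city_Toronto`, and the summary column names (`n_rows`, `mean_lat`, `mean_lon`) are hypothetical.

```python
import pandas as pd

# Rows copied from the labelled output above.
labelled = pd.DataFrame({
    'Neighbourhood': ['Regent Park, Harbourfront',
                      "Queen's Park, Ontario Provincial Government",
                      'Garden District, Ryerson',
                      'St. James Town',
                      'The Beaches'],
    'Latitude': [43.65426, 43.662301, 43.657162, 43.651494, 43.676357],
    'Longitude': [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
    'K_mean Labels': [3, 4, 4, 4, 1],
})

# Count the rows in each cluster and average their coordinates.
summary = (labelled.groupby('K_mean Labels')
                   .agg(n_rows=('Neighbourhood', 'size'),
                        mean_lat=('Latitude', 'mean'),
                        mean_lon=('Longitude', 'mean')))
print(summary)
```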


In [53]:
# create map
latitude = 43.651070
longitude = -79.347015

map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 12)

# set color scheme for the clusters: one rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(city_Toronto['Latitude'], city_Toronto['Longitude'], city_Toronto['Neighbourhood'], city_Toronto['K_mean Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster], # labels run from 0 to num_clusters - 1, so index directly
        fill = True,
        fill_color = rainbow[cluster],
        fill_opacity = 0.7).add_to(map_clusters)
       
map_clusters