# IBM Applied Data Science Capstone
## Segmenting and Clustering Neighborhoods in Toronto

by Francisco Martellini
v1 - 2020-04-20

### Table of Contents

**Part 1 - Download and create the dataset**

**Part 2 - Create the dataframe with the neighbourhoods' latitude and longitude**

**Part 3 - Exploring the data and clustering the data!**

### Part 1 - Download and create the dataset

##### Importing the libraries

In [49]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')

Libraries imported.


##### Download the dataset to the local machine for working offline.

The data will be scraped from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [112]:
!wget https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

--2020-04-20 16:41:12--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52234 (51K) [text/html]
Saving to: “List_of_postal_codes_of_Canada:_M.1”


2020-04-20 16:41:13 (271 KB/s) - “List_of_postal_codes_of_Canada:_M.1” saved [52234/52234]

Data downloaded!
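
Running `wget` again keeps the earlier file and saves the new copy with a `.1` suffix, as the log above shows, while the next cell reads the name without the suffix. A minimal sketch of an alternative that always overwrites one fixed file, assuming the standard library's `urllib.request` in place of `wget`; the `download_page` helper and the `User-Agent` string are hypothetical names used only for illustration:

```python
from pathlib import Path
from urllib.request import Request, urlopen

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def download_page(url, dest):
    """Fetch url and write the response body to dest, replacing any older copy."""
    # hypothetical User-Agent value; any descriptive string would do
    request = Request(url, headers={"User-Agent": "toronto-notebook-example"})
    with urlopen(request) as response:
        body = response.read()
    dest = Path(dest)
    dest.write_bytes(body)  # write_bytes overwrites an existing file
    return dest


# Example call (needs network access):
# download_page(URL, "List_of_postal_codes_of_Canada:_M")
```

Because the destination name is fixed, the cell that loads the file always sees the most recent download.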


##### Loading the dataset in a dataframe with Pandas.

In [3]:
html = "List_of_postal_codes_of_Canada:_M" # local document to build the dataframe
df = pd.read_html(html, header = 0)[0] #creating the dataframe
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


##### Cleaning the dataset

1. Create a new dataframe to preserve the original data

In [4]:
df1 = df.copy() # independent copy, so changes to df1 never reach df

2. Check whether any row in the Borough column has the value "Not assigned". If so, remove those rows from the dataframe.

In [5]:
# keep only the rows with an assigned borough
if (df1['Borough'] == 'Not assigned').any():
    df1 = df1[df1.Borough != 'Not assigned']

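3. Check if a row has a Borough value but a "Not assigned" Neighborhood. If so, set the Neighborhood to the same value as the Borough.

In [6]:
unassigned = df1['Neighborhood'] == 'Not assigned'
if unassigned.any():
    # copy the borough name into the neighborhood instead of filtering rows out
    df1 = df1.assign(Neighborhood=df1['Neighborhood'].mask(unassigned, df1['Borough']))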

4. Check if some neighbourhoods share the same Postal code. If so, group the dataframe by Postal code and Borough, joining the Neighborhood values with a comma.

In [7]:
if df1['Postal code'].duplicated().any():
    df1 = df1.groupby(['Postal code','Borough']).agg(', '.join)

5. Reset the index to show the final dataframe

In [8]:
df1.reset_index(inplace=True)
df1

Unnamed: 0,index,Postal code,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Regent Park / Harbourfront
3,5,M6A,North York,Lawrence Manor / Lawrence Heights
4,6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...,...
98,160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,Business reply mail Processing CentrE
101,169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
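
The cleaning steps above can be sketched end to end on a small frame. This is a minimal illustration on made-up rows (the `raw` and `clean` names and the toy values are assumptions, not the scraped data), using `Series.mask` for step 3 and `groupby(..., as_index=False)` for steps 4 and 5:

```python
import pandas as pd

# Toy rows (made up for illustration): one unassigned borough,
# one postal code listed twice, one unassigned neighborhood.
raw = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park",
                     "Harbourfront", "Not assigned"],
})

# Step 2: drop rows without an assigned borough
clean = raw[raw["Borough"] != "Not assigned"]

# Step 3: where the neighborhood is unassigned, reuse the borough name
unassigned = clean["Neighborhood"] == "Not assigned"
clean = clean.assign(
    Neighborhood=clean["Neighborhood"].mask(unassigned, clean["Borough"])
)

# Steps 4 and 5: one row per postal code, neighborhoods joined with a comma;
# as_index=False keeps the keys as columns and gives a fresh 0..n-1 index
clean = clean.groupby(["Postal code", "Borough"], as_index=False).agg(
    {"Neighborhood": ", ".join}
)

print(clean)
```

With `as_index=False` no separate `reset_index` call is needed, so no extra `index` column is carried into later steps.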


##### Print the number of rows of the dataframe

In [9]:
df1.shape

(103, 4)

### Part 2 - Create the dataframe with the neighbourhoods' latitude and longitude

Download the dataset with the latitudes and longitudes of the Canadian postal codes.

In [10]:
loc_df = pd.read_csv('https://cocl.us/Geospatial_data')
loc_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


First we need to rename the column "Postal Code" in the location dataframe to "Postal code" so that it matches the neighbourhood dataframe, and then merge both into a new pandas dataframe.

In [14]:
loc_df.rename(columns={'Postal Code':'Postal code'},inplace=True)

df2 = pd.merge(df1,loc_df,on='Postal code')
df2

Unnamed: 0,index,Postal code,Borough,Neighborhood,Latitude,Longitude
0,2,M3A,North York,Parkwoods,43.753259,-79.329656
1,3,M4A,North York,Victoria Village,43.725882,-79.315572
2,4,M5A,Downtown Toronto,Regent Park / Harbourfront,43.654260,-79.360636
3,5,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...,...
98,160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North,43.653654,-79.506944
99,165,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,168,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...,43.636258,-79.498509
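
`pd.merge` performs an inner join by default, so a postal code missing from either table would be dropped without notice. A minimal sketch of how that could be checked, assuming a left join with the `indicator` and `validate` options; the frames `hoods` and `coords` and the postal code `M9Z` are hypothetical:

```python
import pandas as pd

# Toy frames: the first two coordinates are taken from the output above,
# "M9Z" is a made-up postal code with no coordinates.
hoods = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M9Z"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = hoods.merge(
    coords.rename(columns={"Postal Code": "Postal code"}),
    on="Postal code",
    how="left",             # keep every neighbourhood row
    validate="one_to_one",  # raise if a postal code repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the inner join in the notebook loses no rows.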


### Part 3 - Exploring the data and clustering the data!

Get all the rows of the dataframe whose Borough contains "Toronto".

In [41]:
df3 = df2[df2['Borough'].str.contains('Toronto',regex=False)].copy() # copy so a column can be added later without touching df2
df3

Unnamed: 0,index,Postal code,Borough,Neighborhood,Latitude,Longitude
2,4,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
4,6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
9,13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,30,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,31,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,40,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,41,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,49,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568
31,50,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259
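
A small sketch of the same substring filter on made-up borough values, assuming that missing boroughs should simply be excluded (`na=False`); the `boroughs` frame and its contents are hypothetical:

```python
import pandas as pd

# Toy borough column, including one missing value
boroughs = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "East Toronto",
                None, "Etobicoke", "West Toronto"],
})

# regex=False treats 'Toronto' as a plain substring;
# na=False makes missing boroughs count as "no match"
is_toronto = boroughs["Borough"].str.contains("Toronto", regex=False, na=False)

# .copy() gives an independent frame that can take new columns safely
toronto_only = boroughs[is_toronto].copy()
toronto_only["Cluster Labels"] = 0  # placeholder column, for illustration only

print(toronto_only)
```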


Create a map of these neighbourhoods with folium.

In [42]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map (lat/lng keep the Toronto latitude/longitude above from being overwritten)
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Using KMeans clustering:

In [46]:
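k = 4
# drop the text columns; the 'index' column left over from reset_index stays in the features with Latitude and Longitude
toronto_clustering = df3.drop(columns=['Postal code','Borough','Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# put the cluster label of each row in the first column
df3.insert(0, 'Cluster Labels', kmeans.labels_)

df3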

Unnamed: 0,Cluster Labels,index,Postal code,Borough,Neighborhood,Latitude,Longitude
2,1,4,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
4,1,6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
9,1,13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,1,22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,1,30,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,1,31,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,1,40,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,1,41,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,2,49,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568
31,2,50,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259
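
The feature matrix above still carries the `index` column, whose values span a much wider range than the coordinates and can therefore dominate the Euclidean distances that KMeans relies on. A minimal sketch on made-up toy coordinates (not the notebook's data) that compares clustering on latitude and longitude only with clustering that also includes `index`; the `toy` frame and all variable names are hypothetical:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: rows A and B lie close together, rows C and D lie close together,
# while the 'index' values pair the rows differently (A with C, B with D).
toy = pd.DataFrame({
    "index": [4, 30, 6, 31],
    "Neighborhood": ["A", "B", "C", "D"],
    "Latitude": [43.650, 43.651, 43.780, 43.781],
    "Longitude": [-79.380, -79.381, -79.190, -79.191],
})

# Cluster on the coordinates only
coords_model = KMeans(n_clusters=2, n_init=10, random_state=0)
coords_labels = coords_model.fit(toy[["Latitude", "Longitude"]]).labels_

# Cluster with the leftover 'index' column included
index_model = KMeans(n_clusters=2, n_init=10, random_state=0)
index_labels = index_model.fit(toy[["index", "Latitude", "Longitude"]]).labels_

print("coordinates only:", coords_labels)
print("with index      :", index_labels)
```

In this toy case the coordinate-only run groups the rows by location, while the run that includes `index` groups them by row number.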


Create a map of the clusters found.

In [51]:
# centre the map on the geocoded Toronto coordinates
map_clusters = folium.Map(location=[location.latitude, location.longitude], zoom_start=10)

# set color scheme for the clusters: one hex colour per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(df3['Latitude'], df3['Longitude'], df3['Neighborhood'], df3['Cluster Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
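
KMeans labels run from 0 to k-1, so each label can index directly into a list of k colours. A sketch of that mapping, assuming the standard library's `colorsys` in place of matplotlib's `rainbow` colormap; the `cluster_palette` helper and the label list are hypothetical:

```python
import colorsys


def cluster_palette(k):
    """Return k evenly spaced hex colours, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # spread the hues evenly around the colour wheel
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(4)
labels = [1, 1, 2, 0, 3]  # made-up cluster labels
marker_colours = [palette[label] for label in labels]
print(marker_colours)
```

Each entry of `marker_colours` could then be passed as the `color` and `fill_color` of the matching marker.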