# Segmenting and Clustering Neighborhoods in Toronto


## Part 1: Creating the dataset
First, we scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract the table of postal codes and load it into a pandas dataframe.

In [1]:
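import pandas as pd
import numpy as np
import requests
from io import StringIO
from bs4 import BeautifulSoup

# Download the page and parse the HTML
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(req.content, 'lxml')

# The postal code table is the first <table> on the page
table = soup.find_all('table')[0]

# read_html returns a list of dataframes; recent pandas versions expect a file-like object
neighborhood = pd.read_html(StringIO(str(table)))[0]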

neighborhood

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Next, we clean the data as follows:
- The dataframe has three columns: Postal Code, Borough, and Neighbourhood.
- We drop all rows whose Borough is Not assigned.
- Neighbourhoods that share a Postal Code are combined into one row, separated by commas.
- If a row has a Borough but a Not assigned Neighbourhood, the Neighbourhood is set to the name of the Borough.

In [2]:
# Dropping all rows where Borough is Not Assigned
neighborhood = neighborhood[neighborhood.Borough != 'Not assigned']

# Combining the Neighbourhoods with the same Postal Code, separated by comma
neighborhood = neighborhood.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
neighborhood.reset_index(inplace=True)

# Replacing the name of the Neighbourhoods which are Not assigned with the names of the corresponding Borough
neighborhood['Neighbourhood'] = np.where(neighborhood['Neighbourhood'] == 'Not assigned',neighborhood['Borough'], neighborhood['Neighbourhood'])

neighborhood

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
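
To check that the cleaning rules behave as intended, we can apply the same steps to a tiny frame. This is only a sketch: the rows below are made up for illustration (they are not taken from the Wikipedia table), and `clean_postal_codes` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, chosen to exercise each cleaning rule once
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

def clean_postal_codes(frame):
    """Apply the cleaning rules to a frame with the three columns used above."""
    # Rule: drop rows without an assigned borough
    out = frame[frame['Borough'] != 'Not assigned']
    # Rule: one row per postal code, neighbourhoods joined with a comma
    out = (out.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())
    # Rule: an unassigned neighbourhood takes the borough name
    out['Neighbourhood'] = np.where(out['Neighbourhood'] == 'Not assigned',
                                    out['Borough'], out['Neighbourhood'])
    return out

cleaned = clean_postal_codes(raw)
print(cleaned)
```

The unassigned M1A row is dropped, the two M5A rows collapse into one, and the M7A row receives its borough name as the neighbourhood.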


We check the size of the dataset using .shape:

In [3]:
neighborhood.shape

(103, 3)

---

## Part 2: Getting the latitude and longitude coordinates of each neighborhood
We read the coordinates from the provided CSV file, which lists one latitude/longitude pair per postal code.

In [4]:
lat_long = pd.read_csv('https://cocl.us/Geospatial_data')
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We merge the coordinates with the existing dataset on the Postal Code column.

In [5]:
neighborhood_latlong = pd.merge(neighborhood,lat_long,on='Postal Code')
neighborhood_latlong.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
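
The merge above is an inner join, so a postal code with no coordinates would be dropped silently. A possible way to check for this, sketched on toy data, is a left join with `indicator=True`. The `M9Z` row is a made-up entry used here only to show an unmatched code.

```python
import pandas as pd

# Toy frames; the M9Z row is hypothetical and has no coordinates on purpose
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every postal code; validate raises if a code is duplicated
merged = pd.merge(codes, coords, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# Rows marked 'left_only' found no match in the coordinates table
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

If `missing` is empty on the real data, the inner join used in the notebook loses nothing.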


---

## Part 3: Exploring and clustering the neighborhoods in Toronto
We keep only the boroughs whose name contains the word Toronto.

In [6]:
toronto = neighborhood_latlong[neighborhood_latlong['Borough'].str.contains('Toronto',regex=False)]
toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
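
As a small illustration of how this filter behaves, the sketch below runs `str.contains` on a made-up series. The values are hypothetical; the point is that the match is a case-sensitive substring test, and that `na=False` (an addition not used in the notebook) treats missing boroughs as non-matches.

```python
import pandas as pd

# Hypothetical borough values, including a missing entry and a lowercase variant
boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto',
                      None, 'toronto (lowercase)'])

# regex=False does a plain substring search; na=False maps missing values to False
mask = boroughs.str.contains('Toronto', regex=False, na=False)
print(boroughs[mask].tolist())
```

Only the two entries containing the exact substring `Toronto` are kept.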


Install/import all required libraries:

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# geopy is installed with pip because the conda install kept failing in Watson Studio
#!conda install -c conda-forge geopy --yes
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# folium is installed with pip because the conda install kept failing in Watson Studio
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you have to reinstall
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Create a map of Toronto with neighborhoods superimposed on top.

### Use the geopy library to get the latitude and longitude values of Toronto.
To create an instance of the geocoder, we need to specify a user_agent. We name our agent toronto_explorer, as shown below.

In [8]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
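
geopy's `geocode` returns `None` when no result is found, in which case reading `location.latitude` would fail. A minimal sketch of a guard is shown here. `coords_or_default` is a hypothetical helper, and the two stand-in geocoders replace the real network call so the example runs offline; with geopy one would pass `geolocator.geocode` instead.

```python
from types import SimpleNamespace

def coords_or_default(geocode, address, default):
    """Return (latitude, longitude) from a geocode callable, or `default` if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)

# Stand-in geocoders for illustration only (no network access)
def fake_hit(address):
    return SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)

def fake_miss(address):
    return None

# Fallback uses the coordinates printed above
toronto_default = (43.6534817, -79.3839347)

found = coords_or_default(fake_hit, 'Toronto, Canada', (0.0, 0.0))
fallback = coords_or_default(fake_miss, 'No such place', toronto_default)
print(found, fallback)
```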


In [9]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
# (the loop variable is named neighbourhood so it does not overwrite the neighborhood dataframe)
for lat, lng, borough, neighbourhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## Cluster the neighborhoods
Run k-means to group the neighborhoods into 5 clusters, using only latitude and longitude as features.

In [10]:
# set number of clusters
k=5
# keep only Latitude and Longitude as clustering features
toronto_clustering = toronto.drop(columns=['Postal Code','Borough','Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)

toronto.insert(0, 'Cluster Labels', kmeans.labels_)

# create map
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighbourhood'], toronto['Cluster Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
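
The choice of 5 clusters is fixed by hand above. One common way to sanity-check such a choice is to compare the k-means inertia (within-cluster sum of squares) for several values of k. The sketch below does this on made-up coordinates that form three tight groups; the points are hypothetical, and `n_init=10` is set explicitly as an assumption so the result does not depend on the scikit-learn default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (latitude, longitude) points forming three well-separated groups
points = np.array([
    [43.650, -79.380], [43.651, -79.381], [43.649, -79.379],
    [43.700, -79.300], [43.701, -79.301], [43.699, -79.299],
    [43.660, -79.450], [43.661, -79.451], [43.659, -79.449],
])

# Fit k-means for several k and record the inertia of each fit
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

# Cluster labels for the k that matches the number of groups in the toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points).labels_

for k, value in inertias.items():
    print(k, value)
```

On data like this, the inertia drops sharply up to k=3 and only slightly afterwards; the same loop could be run on `toronto_clustering` to see whether 5 is a reasonable value.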