<h1 align=center><font size = 6>Toronto Neighbourhoods clustering</font></h1>

In this project, we scrape the Toronto postal codes from an online resource and convert them into their equivalent latitude and longitude values. Using the Foursquare API, we then explore the neighbourhoods of Toronto, group them into clusters and find the most common venues in each one. Finally, we display the clusters on a Folium map.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Dataframe preparation</a>

2. <a href="#item2">Geolocation of postal codes</a>

3. <a href="#item3">Neighbourhood clustering</a>
 
</font>
</div>

# 1. Dataframe preparation

First we import the table from Wikipedia and confirm that it has loaded correctly using head():

In [1]:
import pandas as pd

table = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)
df = pd.DataFrame(data = table[0])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## 1.1 High level exploratory analysis

Now we display the shape of the table, then inspect its first and last 20 rows (head and tail) to get a feel for the data:

In [2]:
df.shape

(288, 3)

In [3]:
df.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [4]:
df.tail(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
268,M8Y,Etobicoke,Kingsway Park South East
269,M8Y,Etobicoke,Mimico NE
270,M8Y,Etobicoke,Old Mill South
271,M8Y,Etobicoke,The Queensway East
272,M8Y,Etobicoke,Royal York South East
273,M8Y,Etobicoke,Sunnylea
274,M9Y,Not assigned,Not assigned
275,M1Z,Not assigned,Not assigned
276,M2Z,Not assigned,Not assigned
277,M3Z,Not assigned,Not assigned


We can spot several rows where Borough, Neighbourhood or both are labelled "Not assigned".
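
As a minimal sketch of how these cases could be counted rather than spotted by eye, the snippet below rebuilds the first ten rows shown above as a toy dataframe (the name `toy` is ours, used only for illustration) and counts the "Not assigned" entries per column:

```python
import pandas as pd

# First ten rows of the scraped table, copied from the df.head(20) output above
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M5A', 'M6A', 'M6A', 'M7A', 'M8A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York',
                'Downtown Toronto', 'Downtown Toronto', 'North York', 'North York',
                "Queen's Park", 'Not assigned'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village',
                      'Harbourfront', 'Regent Park', 'Lawrence Heights', 'Lawrence Manor',
                      'Not assigned', 'Not assigned'],
})

# Count the "Not assigned" entries in each column
not_assigned = (toy[['Borough', 'Neighbourhood']] == 'Not assigned').sum()
print(not_assigned)
```

The same expression should work unchanged on the full `df`.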

## 1.2 Data cleansing

We now clean the rows where the data is unsatisfactory, following two rules:
* Ignore rows with a borough that is Not assigned
* If a row has a borough but a Not assigned neighbourhood, the neighbourhood takes the same name as the borough

First we drop the rows with a borough that is Not assigned:

In [5]:
df = df[df.Borough != 'Not assigned']

Then we check for rows with a borough but a Not assigned neighbourhood:

In [6]:
df[(df.Borough != 'Not assigned') & (df.Neighbourhood == 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


As only one row is affected, the fastest way is to replace it manually:

In [7]:
df.loc[df.Borough == "Queen's Park", 'Neighbourhood'] = "Queen's Park"

We check the shape of the dataframe again:

In [8]:
df.shape

(211, 3)

And print the cleaned result:

In [9]:
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
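
The two cleaning rules of this section can also be written without naming any borough explicitly. The following is a minimal sketch on a three-row toy dataframe (the variable names `toy`, `cleaned` and `missing` are ours, not part of the notebook):

```python
import pandas as pd

# Three rows taken from the scraped table, one for each case
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop rows whose borough is not assigned
cleaned = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 2: where the neighbourhood is not assigned, reuse the borough name
missing = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

This variant would keep working if the source table later gained more rows with an assigned borough and a missing neighbourhood.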


## 1.3 Data grouping

Finally, we group the data by postcode, so that all neighbourhoods sharing the same code are joined into a single comma-separated row:

In [10]:
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [11]:
df.shape

(103, 3)
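
To illustrate what the grouping does, here is a small sketch on three rows of the cleaned table, followed by a check that each postcode now appears only once (`toy` and `grouped` are illustrative names):

```python
import pandas as pd

# Three rows from the cleaned table: M5A appears twice, M6A once
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Join the neighbourhoods of each (postcode, borough) pair with a comma
grouped = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .agg(', '.join)
              .reset_index())

# After grouping, every postcode should occur exactly once
print(grouped)
print(grouped['Postcode'].is_unique)
```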

# 2. Geolocation of postal codes

As a preparatory step for using the Foursquare API, we assign coordinates to the postcodes prepared in the previous section, for which we rely on the Python geocoder package:

In [12]:
#!conda install -c conda-forge geocoder --yes
import geocoder

print('Geocoder installed successfully')

Geocoder installed successfully


As a precaution, we also create a copy of the dataframe and prepare a list of the postcodes it contains:

In [13]:
df_copy=df.copy()
postcodes = df['Postcode'].tolist()

We run geocoder with a retry loop along the lines recommended by its documentation, here capped at 10 attempts per postcode so that it cannot hang when the service keeps returning nothing:

for code in postcodes:
    # initialize your variable to None
    lat_lng_coords = None
    attempts = 0

    # retry until you get the coordinates or run out of attempts
    print('{}, Toronto, Ontario'.format(code))
    while lat_lng_coords is None and attempts < 10:
        g = geocoder.google('{}, Toronto, Ontario'.format(code))
        # g = geocoder.opencage('{}, Toronto, Ontario'.format(code), key='YOUR_OPENCAGE_KEY')

        lat_lng_coords = g.latlng
        attempts += 1

    # skip postcodes that could not be resolved
    if lat_lng_coords is None:
        continue

    latitude, longitude = lat_lng_coords

    df_copy.loc[df_copy.Postcode == code, 'Latitude'] = latitude
    df_copy.loc[df_copy.Postcode == code, 'Longitude'] = longitude

**Note:** As this package can be quite unreliable, we instead load the coordinates from a previously extracted .csv file and merge them into our dataframe:

In [14]:
df_copy=df.copy()

coordinates=pd.read_csv('https://cocl.us/Geospatial_data')

df_copy = pd.merge(df_copy, coordinates, how='inner',left_on = 'Postcode', right_on = 'Postal Code')
df_copy= df_copy.drop('Postal Code', axis=1)
df_copy.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
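
Because the merge above is an inner join, any postcode missing from the coordinates file would be dropped silently. A hedged sketch of how this could be checked with a left join and the `indicator` option of `merge` follows; the postcode `M9Z` is a made-up entry added only to trigger the unmatched case:

```python
import pandas as pd

# Toy inputs: 'M9Z' is hypothetical and has no coordinates
codes = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z']})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postcode; the _merge column records where each row matched
check = codes.merge(coords, how='left', left_on='Postcode',
                    right_on='Postal Code', indicator=True)

# Postcodes that found no coordinates
unmatched = check.loc[check['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty list would confirm that the inner join lost no rows.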


# 3. Neighbourhood clustering

Following the same approach used to cluster neighbourhoods in New York, we will: 
   * Create a map of Toronto with the neighbourhoods superimposed on top
   * Define Foursquare Credentials and Version to explore each neighbourhood
   * Explore the different neighbourhoods
   * Cluster them and examine them

In [19]:
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes
import folium

## 3.1 Baseline Map creation
First we define the map that will hold the data from our analysis, centred on the mean latitude and longitude of all the postcodes:

In [34]:
import numpy as np

latitude= df_copy['Latitude'].mean()
longitude= df_copy['Longitude'].mean()

map_toronto = folium.Map(location=[latitude,longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_copy['Latitude'], df_copy['Longitude'],df_copy['Borough'], df_copy['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## 3.2 Foursquare Credentials and Version definition

In [46]:
import os
import requests

# read the credentials from environment variables instead of hard-coding them
# (the variable names below are our own choice, not a Foursquare convention)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # maximum number of venues returned per request

print('Credentials loaded')

Credentials loaded


## 3.3 Neighbourhood exploration

We prepare a function that queries Foursquare once per neighbourhood and gathers all the venues retrieved into a single dataframe:

In [53]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
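
The function above indexes `categories[0]` directly, which would raise an error for a venue that comes back without a category. The sketch below shows one way the parsing step could be made tolerant of that case. It runs on a hand-written stand-in for the JSON response (its nesting is copied from the keys used in `getNearbyVenues`; the second venue is invented), and `extract_venues` is a hypothetical helper, not part of the notebook:

```python
# Hand-written stand-in for one response; 'Unnamed Spot' is an invented venue
mock_json = {
    'response': {'groups': [{'items': [
        {'venue': {'name': "Wendy's",
                   'location': {'lat': 43.807448, 'lng': -79.199056},
                   'categories': [{'name': 'Fast Food Restaurant'}]}},
        {'venue': {'name': 'Unnamed Spot',
                   'location': {'lat': 43.8070, 'lng': -79.1990},
                   'categories': []}},
    ]}]}
}

def extract_venues(payload, name, lat, lng):
    """Flatten one response into row tuples, skipping venues without a category."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # nothing to cluster on, so leave this venue out
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows

rows = extract_venues(mock_json, 'M1B', 43.806686, -79.194353)
print(rows)
```

Separating the parsing from the HTTP request also makes this step testable without network access or credentials.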

We apply it to Toronto, using the postcode as the name of each neighbourhood:

In [54]:
toronto_venues = getNearbyVenues(names=df_copy['Postcode'],
                                   latitudes=df_copy['Latitude'],
                                   longitudes=df_copy['Longitude']
                                  )

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


We check if Foursquare has fetched the venues correctly:

In [55]:
print(toronto_venues.shape)
toronto_venues.head()

(2251, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


And how many venues have been retrieved per neighbourhood:

In [56]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,1,1,1,1,1,1
M1C,2,2,2,2,2,2
M1E,7,7,7,7,7,7
M1G,3,3,3,3,3,3
M1H,8,8,8,8,8,8
M1J,3,3,3,3,3,3
M1K,5,5,5,5,5,5
M1L,10,10,10,10,10,10
M1M,2,2,2,2,2,2
M1N,4,4,4,4,4,4


In [57]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 279 unique categories.


## 3.4 Neighbourhood clusters data preparation

### 3.4.1 Data encoding
Before clustering all the venues retrieved above, we need to encode the data:

In [58]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We see that the data has been encoded correctly, and now check the shape:

In [60]:
toronto_onehot.shape

(2251, 280)

### 3.4.2 Grouped dataframe creation

We group the data per neighbourhood:

In [62]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
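
The mean of one-hot columns is simply the share of each category among the venues of a neighbourhood. The following sketch shows this on a six-venue toy table (the venue mix is made up for illustration, and `dtype=float` is passed so the result does not depend on the default dtype of `get_dummies`):

```python
import pandas as pd

# Toy venues table: two venues in M1C, four in M1E
venues = pd.DataFrame({
    'Neighbourhood': ['M1C', 'M1C', 'M1E', 'M1E', 'M1E', 'M1E'],
    'Venue Category': ['Bar', 'Moving Target', 'Pizza Place', 'Pizza Place', 'Bank', 'Bar'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Averaging the indicator columns gives the frequency of each category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so neighbourhoods with very different numbers of venues become comparable.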


Next we find the most frequent venues of each neighbourhood: we first define a function that sorts the categories in descending order of frequency, then list the top 10 venue categories per neighbourhood:

In [66]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
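
Note that when a neighbourhood has fewer distinct categories than `num_top_venues`, the remaining positions are filled with categories whose frequency is zero. A possible variant that leaves those out is sketched below; `top_nonzero_venues` is a hypothetical alternative, not the function used in the rest of this notebook:

```python
import pandas as pd

# Toy row shaped like one line of toronto_grouped (frequencies are made up)
row = pd.Series({'Neighbourhood': 'M1B', 'Bar': 0.0,
                 'Fast Food Restaurant': 1.0, 'Diner': 0.0})

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues categories, ignoring those with zero frequency."""
    freqs = row.iloc[1:].astype(float)
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    return freqs.index[:num_top_venues].tolist()

top = top_nonzero_venues(row, 3)
print(top)
```

Since the result can be shorter than `num_top_venues`, it would need padding before being written into a fixed-width table.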

In [70]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Drugstore,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Field
1,M1C,Bar,Moving Target,Yoga Studio,Drugstore,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
2,M1E,Mexican Restaurant,Breakfast Spot,Pizza Place,Medical Center,Electronics Store,Rental Car Location,Intersection,Doner Restaurant,Diner,Discount Store
3,M1G,Coffee Shop,Korean Restaurant,Yoga Studio,Drugstore,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
4,M1H,Bakery,Hakka Restaurant,Bank,Caribbean Restaurant,Lounge,Athletics & Sports,Thai Restaurant,Fried Chicken Joint,Doner Restaurant,Dive Bar
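
The column labels built above are correct for a top 10, but the same `try`/`except` approach would produce "21th" for longer rankings. As a small sketch, a helper that applies the usual English ordinal rules could look like this (`ordinal` is our own illustrative function):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```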


## 3.5 Venues clustering using k-means

Run k-means to group the neighbourhoods into 7 clusters.

In [100]:
# set number of clusters
kclusters = 7

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 4, 4, 1, 4, 4, 1, 4, 4, 4], dtype=int32)
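
The number of clusters is a free choice in k-means. One common heuristic, not used in this notebook, is to compare the within-cluster sum of squares (`inertia_`) across several values of k. The sketch below does so on a tiny made-up frequency matrix with two obvious groups; `n_init=10` is set explicitly so the behaviour does not depend on the scikit-learn version:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies for four neighbourhoods: two pairs of similar rows
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
])

# Fit k-means for k = 1, 2, 3 and record the within-cluster sum of squares
inertias = []
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Labels for k = 2: similar rows should share a cluster
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
print(inertias)
print(labels)
```

The inertia always decreases as k grows, so one would look for the point where the improvement levels off rather than for the minimum.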

Let's create a new dataframe that includes the cluster label, the top 10 venues and the coordinates of each neighbourhood:

In [101]:
# attach the coordinates of each postcode to the top 10 venues table
toronto_merged = pd.merge(neighbourhoods_venues_sorted, coordinates, how='inner', left_on='Neighbourhood', right_on='Postal Code')
toronto_merged = toronto_merged.drop('Postal Code', axis=1)

# add clustering labels as the first column; the inner merge keeps the row order
# of neighbourhoods_venues_sorted, so the labels line up with the rows
toronto_merged.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged

Unnamed: 0,Cluster Labels,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Latitude,Longitude
0,0,M1B,Fast Food Restaurant,Drugstore,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Field,43.806686,-79.194353
1,4,M1C,Bar,Moving Target,Yoga Studio,Drugstore,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,43.784535,-79.160497
2,4,M1E,Mexican Restaurant,Breakfast Spot,Pizza Place,Medical Center,Electronics Store,Rental Car Location,Intersection,Doner Restaurant,Diner,Discount Store,43.763573,-79.188711
3,1,M1G,Coffee Shop,Korean Restaurant,Yoga Studio,Drugstore,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,43.770992,-79.216917
4,4,M1H,Bakery,Hakka Restaurant,Bank,Caribbean Restaurant,Lounge,Athletics & Sports,Thai Restaurant,Fried Chicken Joint,Doner Restaurant,Dive Bar,43.773136,-79.239476
5,4,M1J,Pizza Place,Playground,Business Service,Yoga Studio,Donut Shop,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant,43.744734,-79.239476
6,1,M1K,Discount Store,Department Store,Coffee Shop,Chinese Restaurant,Drugstore,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,43.727929,-79.262029
7,4,M1L,Bakery,Bus Line,Fast Food Restaurant,Soccer Field,Metro Station,Bus Station,Intersection,Park,Costume Shop,Coworking Space,43.711112,-79.284577
8,4,M1M,American Restaurant,Motel,Dim Sum Restaurant,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Yoga Studio,43.716316,-79.239476
9,4,M1N,College Stadium,General Entertainment,Skating Rink,Café,Donut Shop,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant,43.692657,-79.264848
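
To examine the clusters, a simple first step is to count their members and list which neighbourhoods they contain. A minimal sketch on the first five rows of the table above (`sizes` and `members` are illustrative names):

```python
import pandas as pd

# First five rows of toronto_merged, reduced to three columns
merged = pd.DataFrame({
    'Cluster Labels': [0, 4, 4, 1, 4],
    'Neighbourhood': ['M1B', 'M1C', 'M1E', 'M1G', 'M1H'],
    '1st Most Common Venue': ['Fast Food Restaurant', 'Bar', 'Mexican Restaurant',
                              'Coffee Shop', 'Bakery'],
})

# Number of neighbourhoods in each cluster
sizes = merged['Cluster Labels'].value_counts().sort_index()

# Postcodes belonging to each cluster
members = merged.groupby('Cluster Labels')['Neighbourhood'].apply(list)

print(sizes)
print(members)
```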


Finally, we display the clusters on a map:

In [108]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters