# **Segmenting and Clustering Neighbourhoods in Toronto**


## Introduction

In this assignment we explore, segment, and cluster the neighbourhoods of Toronto. The neighbourhood data comes from a Wikipedia page, which we read into a pandas dataframe.

We convert postal codes into their latitude and longitude values and use the Foursquare API to explore the neighbourhoods. We find the most common venue categories in each neighbourhood and then group the neighbourhoods into clusters with the *k*-means clustering algorithm. Finally, we use the Folium library to visualize the neighbourhoods and the resulting clusters.

<a id='Q1'></a>

## **Question 1: Solution steps start here**

#### Let's install all the dependencies first

In [1]:
%%capture 
# Ignore command line output
!conda install -c anaconda lxml --yes
!conda install -c anaconda html5lib --yes
!conda install -c anaconda BeautifulSoup4 --yes
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

We import all the libraries needed for this program

In [None]:
import numpy as np
import pandas as pd
from lxml import html # For reading html pages
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library
import requests # library to handle requests
from sklearn.cluster import KMeans # import k-means from clustering stage
# Matplotlib and associated plotting modules
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

## Download and Explore Dataset

Let's read the web page with pandas. `pd.read_html` returns a list with one dataframe per HTML table found on the Wikipedia page. The first element of this list is the table we need.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_wiki = pd.read_html(url)
df_t = pd.DataFrame(toronto_wiki[0])
df_t.head()

Let's drop the rows with no borough, i.e. rows where the Borough is 'Not assigned'

In [None]:
df_t = df_t.drop(df_t[df_t['Borough'] == 'Not assigned'].index, axis=0).reset_index(drop = True)
df_t.head()
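The same filter can also be written as a boolean mask. Here is a minimal sketch on a small hand-typed table; the rows are for illustration only and are not read from the Wikipedia page.

```python
import pandas as pd

# Toy stand-in for the scraped table (hand-typed, illustrative rows).
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned, then renumber the index.
toy_assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(toy_assigned)
```

The mask keeps the rows we want instead of naming the rows to drop, which some readers find easier to follow.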

The Wikipedia page appears to have been updated to address the following requirement:

_More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table_

Let's verify whether this requirement is met already

In [None]:
if len(df_t['Postal Code'].unique()) != len(df_t['Postal Code']):
    print("Multiple entries for the same postal code")
else:
    print("Only unique entries for all the postal codes")
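If this check ever reports duplicates, the rows could be combined with a group-by. The sketch below uses hand-typed rows based on the M5A example from the requirement. The borough values and the third row are assumptions added for illustration.

```python
import pandas as pd

# Hand-typed rows; the boroughs are assumed for the sake of the example.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# One row per postal code, with its neighbourhoods joined by a comma.
combined = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```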

The Wikipedia page appears to have been updated to match the following requirement:

_If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough_

Let's verify whether this requirement is met already

In [None]:
if len(df_t[df_t['Neighbourhood'] == 'Not assigned']) == 0:
    print("Found no 'Not assigned' neighbourhood for any Boroughs")
else:
    print("Found 'Not assigned' neighbourhoods for Boroughs")
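If some neighbourhoods were still 'Not assigned', they could be filled in from the borough column. A minimal sketch on hand-typed rows (illustrative values, not read from the page):

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M4A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Victoria Village'],
})

# Where the neighbourhood is 'Not assigned', use the borough name instead.
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
print(toy)
```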

Let's also check whether any neighbourhood belongs to more than one borough. The assignment does not explicitly ask for this.

In [None]:
if len(df_t['Neighbourhood'].unique()) != len(df_t['Neighbourhood']):
    print("Multiple entries for the same Neighbourhood")
    print()
    print(df_t['Neighbourhood'].value_counts().loc[lambda x: x > 1])
    print()
    print(df_t.groupby('Neighbourhood').filter(lambda x: len(x) > 1))
else:
    print("Only unique entries for all the Neighbourhood")

In [None]:
df_t.head()

## **Answer 1: Here's the output of the `.shape` attribute, which gives the number of rows of the dataframe**

In [None]:
print("The shape of the dataframe is:", df_t.shape)
print("The number of rows of the dataframe is:", df_t.shape[0])


---


<a id='Q2'></a>

## **Question 2: Solution steps start here**

Let's first convert each postal code to a latitude and longitude, which is a prerequisite for using the Foursquare location data

**The geocoder lookup did not work, so we do not pursue it here**

In [None]:
%%capture 
# On execution we ignore the output
# !pip install geocoder

In [None]:
# Commented the following as the Geocode API did not work
# import geocoder # import geocoder
#
# Define a function to look up the Latitude and Longitude of a postal code area in Toronto, Ontario
#
# def pc_to_lat_long(postal_code):
#    # initialize your variable to None
#    lat_lng_coords = None
#    
#    # loop until you get the coordinates
#    while(lat_lng_coords is None):
#        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#        lat_lng_coords = g.latlng
#
#    return(lat_lng_coords[0], lat_lng_coords[1])
#   
# df_t['Latitude'], df_t['Longitude'] = zip(*df_t['Postal Code'].map(pc_to_lat_long))
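The commented-out loop retries forever when the service never answers. A bounded retry is one possible alternative. The sketch below is not tied to any geocoding library: `lookup_with_retries` and `fake_geocode` are hypothetical names, and the coordinates are placeholder values.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # wait before trying again
    return None  # give up instead of looping forever

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {'count': 0}

def fake_geocode(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else (43.65, -79.38)

coords = lookup_with_retries(fake_geocode, 'M5A, Toronto, Ontario')
print(coords)
```

Returning `None` after the last attempt lets the caller decide how to handle postal codes that could not be resolved.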

Let's load the Coursera-provided CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [None]:
gc_source = 'http://cocl.us/Geospatial_data'
gc_df = pd.read_csv(gc_source)
gc_df.head()

Let's create a copy of the Toronto dataframe and merge in the latitude and longitude from the above dataframe, based on postal code

In [None]:
# Create a copy
df_tgc = df_t.copy()

# Merge the latitude & longitude from the GC Data Frame for each Postal Code with the Toronto Postal Code Data Frame
df_tgc = pd.merge(df_tgc, gc_df, on='Postal Code')
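`pd.merge` performs an inner join by default, so a postal code without coordinates would silently disappear. One way to check for this is a left join with `indicator=True`, sketched below on hand-typed data. The coordinates and the `M9Z` row are made up for illustration.

```python
import pandas as pd

# Hand-typed frames; 'M9Z' is a hypothetical code with no coordinates.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Example Borough']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.75, 43.73],
                       'Longitude': [-79.33, -79.32]})

# A left join keeps every postal code; validate raises if a code is duplicated.
checked = pd.merge(codes, coords, on='Postal Code', how='left',
                   validate='one_to_one', indicator=True)

# Rows marked 'left_only' had no match in the coordinates table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```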

## **Answer 2: Here's the dataframe built from the Coursera-provided CSV file**

In [None]:
df_tgc.head(11)


---


<a id='Q3'></a>

## **Question 3: Solution starts here**

In [None]:
df_tgc.shape

Let's explore and cluster the neighbourhoods in Toronto. We work only with boroughs whose names contain the word Toronto and then replicate the analysis we did on the New York City data

In [None]:
tslice = df_tgc[df_tgc['Borough'].str.contains('Toronto')]
tslice = tslice.reset_index(drop=True)

In [None]:
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(tslice['Borough'].unique()),
        tslice.shape[0]
    )
)

Let's get the latitude and longitude of Toronto

In [None]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
lati = location.latitude
long = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(lati, long))

Let's create a map of Toronto using the above latitude and longitude values

In [None]:
map_toronto = folium.Map(location=[lati, long], zoom_start=12)

# add markers to map
for lat, lng, label in zip(tslice['Latitude'], tslice['Longitude'], tslice['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

**Here's a copy of the Folium map just in case you are not able to view the map shown above**

![](https://raw.githubusercontent.com/bala-viswanathan/Coursera_Capstone/master/Toronto-Neighbourhoods.png)

Let's collect the credentials needed to query Foursquare for location data

In [None]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Let's define a function that explores each neighbourhood and returns up to `LIMIT` venues within `radius` metres of its coordinates

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
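The function above assumes every response has at least one group and every venue has at least one category. The sketch below shows one way to read the same keys more defensively. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror the keys used above; it is not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category.
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written sample payload (illustrative values only).
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example Spot',
               'location': {'lat': 43.66, 'lng': -79.35},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'response': {}}))
```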

Let's call this function for every neighbourhood in the boroughs that contain the word Toronto

In [None]:
LIMIT = 100
toronto_venues = getNearbyVenues(names=tslice['Neighbourhood'],
                                   latitudes=tslice['Latitude'],
                                   longitudes=tslice['Longitude']
                                  )

In [None]:
toronto_venues.shape

In [None]:
toronto_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('Neighbourhood').Venue.count()

Let's count the unique venue categories in these neighbourhoods

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

## Analysis of the neighbourhoods

Let's use one hot encoding to convert the categories into columns to be used for processing further down

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
toronto_onehot.shape

Let's group rows by neighbourhood and take the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()
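To see why the mean of the one-hot columns is a frequency, here is a small sketch on made-up venues. The neighbourhood names `A` and `B` and the categories are invented for illustration.

```python
import pandas as pd

# Made-up venues: neighbourhood A has two cafes and a park, B has one park.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of each 0/1 column is the share of venues in that category.
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

In neighbourhood A, two of the three venues are cafes, so the `Cafe` column averages to about 0.67.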

Let's list the top five venues in each of the first five neighbourhoods. We limit this to five neighbourhoods to keep the output short

In [None]:
num_top_venues = 5
num_neighbourhood = 5

for hood in toronto_grouped['Neighbourhood'][0:num_neighbourhood]:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Let's write a function to sort the venues in descending order and create a new dataframe and show the top 10 venues for each neighbourhood.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
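As a quick illustration of what this function does to a single row, the sketch below applies the same steps to a made-up row with three categories.

```python
import pandas as pd

# Made-up row: the first entry is the neighbourhood name, the rest are frequencies.
row = pd.Series({'Neighbourhood': 'A', 'Cafe': 0.5, 'Park': 0.3, 'Gym': 0.2})

# Skip the name, sort the frequencies in descending order, keep the top two labels.
frequencies = row.iloc[1:].astype(float)
top_two = frequencies.sort_values(ascending=False).index[:2].tolist()
print(top_two)
```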

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()
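The column names above rely on a three-item suffix list with a fallback to 'th', which is enough for the ten columns used here. If more columns were ever needed, a small helper could build the ordinals directly. `ordinal` is a hypothetical helper and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th and 13th are the irregular cases
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```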

### Cluster Neighbourhoods
Run *k*-means to cluster the neighbourhoods into an optimal number of clusters

Let's try a range of cluster counts and compare the cost to identify the best k by using the elbow method

In [None]:
# Let's focus on the venues only
toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# Let's gather the cost in a list
cost = []
# set range of clusters to search the optimal solution for

rangeofK = range(1,11)

for i in rangeofK:
    kclusters = i

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

    # record the within-cluster sum of squared distances (inertia) for this k
    cost.append(kmeans.inertia_)
  
# plot the cost against K values 
plt.plot(rangeofK, cost, color='g', linewidth=3)
plt.xlabel("Value of K")
plt.ylabel("Squared Error (Cost)")
plt.show()
# print(cost)
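The cost plotted here is the inertia: the sum of squared distances from each point to its nearest cluster centre. The NumPy sketch below computes it by hand for four made-up points, with hand-picked centroids rather than fitted ones, to show how the cost falls when the number of clusters matches the data.

```python
import numpy as np

# Four made-up points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

def inertia(points, centroids):
    # Squared distance from every point to every centroid: shape (n_points, n_centroids).
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes its squared distance to the nearest centroid.
    return sq_dist.min(axis=1).sum()

# One centroid at the overall mean versus one centroid per group.
one_cluster = inertia(points, points.mean(axis=0, keepdims=True))
two_clusters = inertia(points, np.array([[0.0, 1.0], [10.0, 1.0]]))
print(one_cluster, two_clusters)
```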

The elbow plot is not very conclusive because the elbow is not prominent. Let's now use the silhouette score to identify the best k. The silhouette score needs at least two clusters, so the search starts at k = 2

In [None]:
# import dependencies
from sklearn.metrics import silhouette_score

sil = []
toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)
rangeofK = range(2,11)

for i in rangeofK:
    kclusters = i

    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, n_init=20, random_state=0).fit(toronto_grouped_clustering)
    sil.append(silhouette_score(toronto_grouped_clustering, kmeans.labels_, metric = 'euclidean'))
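For intuition, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below is a plain NumPy version on four made-up points. `silhouette_mean` is a hypothetical helper written for illustration and is not the scikit-learn implementation.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    labels = np.asarray(labels)
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = silhouette_mean(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_mean(points, [0, 1, 0, 1])   # pairs up distant points
print(good, bad)
```

A labelling that groups nearby points scores close to 1, while one that pairs distant points scores below 0.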

Let's plot the silhouette scores against K. We take the K with the highest score as the optimal number of clusters

In [None]:
# Find the optimal K
npsil = np.array(sil)
kclusters = rangeofK[npsil.argmax()] # K with the highest silhouette score

# plot the Silhouette score against K
plt.plot(rangeofK, sil, color='g', linewidth=3)
plt.plot(kclusters, npsil.max(), 'ro')
plt.xlabel("Value of K") 
plt.ylabel("Silhouette score")
plt.show()

print("Max silhouette score:", npsil.max(), "at k:", kclusters)

# run k-means clustering with the optimal K, using the same settings as the search above
kmeans = KMeans(n_clusters=kclusters, n_init=20, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = tslice

# join the cluster labels and top venues onto the rows that hold each neighbourhood's latitude/longitude
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

print(toronto_merged.head())

Let's find how the neighbourhoods have been clustered

In [None]:
toronto_merged.groupby('Cluster Labels').Borough.count()

## **Answer 3: Maps to visualize the neighborhoods and how they cluster together**

In [None]:
# create map
map_clusters = folium.Map(location=[lati, long], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Here's a copy of the Folium map just in case you are not able to view the map shown above**

![](https://raw.githubusercontent.com/bala-viswanathan/Coursera_Capstone/master/Toronto-Neighbourhoods-Clusters.png)

Let's examine each cluster and determine the venue categories that distinguish it

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]-5))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

### Naming clusters based on the characteristics of the top common venues in the neighbourhoods

Cluster | Recommended Name | Comments
--- | --- | ---
`0` | **Foodies Delight** | *One or more of the top 5 common venues in the neighbourhood serve food and/or beverages*
`1` | **Alloy** | *Mixed venues with a bias for food*