<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto. A Wikipedia page has all the neighborhood information we need, but it first has to be extracted into a structured format.

Once the data is in a structured format, we can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  Extract Toronto neighborhood data from Wikipedia and build a data frame

    
2.  Coordinates (lat, lng) for each Toronto neighborhood

    
3.  Analysis of neighborhoods using a clustering technique

    </font>
    </div>

First, let's import all the required libraries:

In [1]:
import numpy as np
import pandas as pd

import requests
import json # to handle json file from Foursquare

#!conda install -c conda-forge geopy --yes # uncomment if required
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#! pip install geocoder
import geocoder

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#! pip install folium # uncomment if required
import folium # map rendering library

## 1. Extract Toronto neighborhood data from Wikipedia and build a data frame

Toronto neighborhood data is available on this webpage: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. We need to parse the HTML and extract the table into a dataframe.

#### First, we extract the table from the Wikipedia page using pandas

In [2]:
# Extract html data
data_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_data = requests.get(data_url)

In [3]:
# Use pandas to attempt reading the table in html
wiki_df = pd.read_html(wiki_data.text)
print(len(wiki_df), type(wiki_df), type(wiki_df[0])) # check the size & type

3 <class 'list'> <class 'pandas.core.frame.DataFrame'>


Click **here** to show the code to loop through captured dataframes

<!--
for data in wiki_df:
    print(data.head())
    print('----------------------------------------')
-->

In [4]:
# From the above, we know that the neighborhood data is the first item in list
neigh_df = wiki_df[0]
print(neigh_df.head())
print('The shape of neighborhood data is {}'.format(neigh_df.shape))

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront
The shape of neighborhood data is (180, 3)


#### Clean up the data

The table data is cleaned according to the following rules:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood (named `Postal Code`, `Borough`, and `Neighbourhood` in the extracted table).
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, M5A has two neighborhoods: Harbourfront and Regent Park. Such neighborhoods are combined into one row, separated with a comma. The table extracted above already comes in this combined form, as the M5A row shows.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [5]:
# run through each row in the dataframe for cleanup exercise
for i, row in neigh_df.iterrows():
    
    # drop row if Borough is not assigned
    if row['Borough'] == 'Not assigned':
        neigh_df.drop(i, axis=0, inplace=True)
        
    # assign 'Borough' name to Neighborhood if neighborhood is not assigned
    elif row['Neighbourhood'] == 'Not assigned':
        neigh_df.at[i, 'Neighbourhood'] = row['Borough']

neigh_df.reset_index(drop=True, inplace=True)
print("Check if any 'Not assigned' left: ", any(neigh_df['Borough'] == 'Not assigned'), ', ', any(neigh_df['Neighbourhood'] == 'Not assigned'))

Check if any 'Not assigned' left:  False ,  False
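
The row-by-row loop above can also be written with boolean masks, which avoids dropping rows while iterating over them. The following is a minimal sketch rather than the code used in this notebook: the function name `clean_postal_codes` and the sample rows are assumptions made for illustration, and the `groupby` step covers the case where a postal code still appears on several rows.

```python
import pandas as pd


def clean_postal_codes(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the clean-up rules without an explicit row loop."""
    # Rule 1: keep only rows whose borough is assigned
    df = raw[raw['Borough'] != 'Not assigned'].copy()

    # Rule 2: an unassigned neighbourhood takes the name of its borough
    unnamed = df['Neighbourhood'] == 'Not assigned'
    df.loc[unnamed, 'Neighbourhood'] = df.loc[unnamed, 'Borough']

    # Rule 3: merge rows that share a postal code into one comma-separated row
    df = (df.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
            .agg(', '.join)
            .reset_index())
    return df


# Small illustrative sample; the rows are for demonstration only
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A', 'M9A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                      'Regent Park', 'Not assigned'],
})
cleaned = clean_postal_codes(sample)
print(cleaned)
```

On the real data, `clean_postal_codes(wiki_df[0])` should give the same rows as the loop, since that table has no repeated postal codes left to merge.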


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <font size=3><b> To show the shape of dataframe after cleanup </b></font>
</div>

In [6]:
neigh_df.shape

(103, 3)

## 2. Coordinates (lat, lng) for each Toronto neighborhood

In [7]:
# Import the coordinates file
coords_df = pd.read_csv("http://cocl.us/Geospatial_data")
coords_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Click **here** for a note on why I did not use geocoder

<!-- I tried to use the geocoder package but could not get it to work: it kept returning None, so the coordinates come from the CSV file instead.

# define the function to get lat & lng by postal code
def get_latlng(postal_code):
    # initialize variables
    lat_lng_coords = None

    # loop until you get the coordinates
    while (lat_lng_coords is None):        
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    
    return latitude, longitude
-->
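
A lookup loop that repeats until coordinates come back never terminates when the service keeps returning None. A bounded retry is one way around that. This is only a sketch under stated assumptions: `get_latlng_with_retry`, `max_attempts`, `pause_seconds` and the `fake_lookup` stand-in are hypothetical names, and `lookup` can be any callable that returns a `(lat, lng)` pair or None.

```python
import time
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]


def get_latlng_with_retry(lookup: Callable[[str], Optional[LatLng]],
                          postal_code: str,
                          max_attempts: int = 5,
                          pause_seconds: float = 0.0) -> Optional[LatLng]:
    """Call `lookup` until it returns coordinates or the attempts run out."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause_seconds)  # give the service a moment before retrying
    return None  # the caller decides what to do, e.g. fall back to the CSV file


# Hypothetical stand-in for a geocoding call: it fails twice, then succeeds
calls = {'n': 0}


def fake_lookup(query: str) -> Optional[LatLng]:
    calls['n'] += 1
    return (43.65426, -79.360636) if calls['n'] >= 3 else None


found = get_latlng_with_retry(fake_lookup, 'M5A')
missing = get_latlng_with_retry(lambda q: None, 'M5A', max_attempts=3)
print(found, missing)
```

Returning None after the last attempt makes the failure explicit, so the CSV fallback used in this notebook can be applied deliberately.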

In [8]:
# join Lat & lng into neighborhood dataframe
neigh_df = neigh_df.join(coords_df.set_index('Postal Code'), on='Postal Code')

# check if there is any missing value for Latitude & Longitude
print('Lat or Lng is missing? ', neigh_df[['Latitude', 'Longitude']].isnull().values.any())

Lat or Lng is missing?  False
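
The same join can be written with `merge`, which can also report which postal codes found no coordinates instead of only whether any value is missing. The sketch below assumes the column names used in this notebook. The helper name `add_coordinates` and the sample rows (including the made-up code `M0X`) are illustrative only.

```python
import pandas as pd


def add_coordinates(neighborhoods: pd.DataFrame, coords: pd.DataFrame) -> pd.DataFrame:
    """Left-join coordinates by postal code and report codes without a match."""
    merged = neighborhoods.merge(
        coords,
        on='Postal Code',
        how='left',
        validate='many_to_one',  # raises if a postal code appears twice in coords
        indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    )
    unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
    if unmatched:
        print('No coordinates found for:', unmatched)
    return merged.drop(columns='_merge')


# Illustrative frames; 'M0X' is a made-up code with no coordinates
hoods = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0X'],
                      'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Nowhere']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})
with_coords = add_coordinates(hoods, coords)
print(with_coords)
```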


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <font size=3><b> To show dataframe after incorporating coordinates </b></font>
</div>  

In [9]:
neigh_df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


## 3. Analysis of neighborhoods using a clustering technique

### 3.1 Retrieve nearby venues of interest for all neighborhoods

#### Define Foursquare Credentials and Version

In [10]:
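import os

# Read the credentials from the environment so the secrets stay out of the notebook
# (the two environment variable names below are an assumed convention, set them yourself)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value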

#### Keep only the boroughs with 'Toronto' in their name

In [11]:
# only keep data with Borough containing 'Toronto'
toronto_df = neigh_df[neigh_df['Borough'].str.contains('Toronto')].reset_index(drop=True)

#### Define a function to get nearby venues for each neighborhood

In [12]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
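
The function above indexes straight into the JSON response, so a neighborhood with an empty result or a venue without categories raises an error. The sketch below separates building the query from parsing the response so that each part can be checked offline. The helper names `build_explore_params` and `parse_venues`, the `'Unknown'` category label and the hand-written payload are assumptions for illustration. The payload only mirrors the fields that the function above reads.

```python
EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Query parameters carrying the same fields as the URL built above."""
    return {
        'client_id': client_id, 'client_secret': client_secret, 'v': version,
        'll': '{},{}'.format(lat, lng), 'radius': radius, 'limit': limit,
    }


def parse_venues(name, lat, lng, payload):
    """Flatten one explore response into rows, skipping malformed entries."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        if 'name' not in venue or 'lat' not in location or 'lng' not in location:
            continue
        # 'Unknown' is a placeholder label chosen for this sketch
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'],
                     location['lat'], location['lng'], category))
    return rows


# Hand-written payload shaped like the fields read above (not real API output)
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}
rows = parse_venues('Regent Park, Harbourfront', 43.65426, -79.360636, payload)
params = build_explore_params('ID', 'SECRET', '20180605', 43.65426, -79.360636)
print(rows)
print(params)
```

With `requests`, the parameters can be passed as `requests.get(EXPLORE_URL, params=params, timeout=10)`, which also takes care of URL encoding.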

#### Get nearby venues for all neighborhoods in Toronto

In [13]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighbourhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude'])

In [14]:
# Examine the result set
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.head()

There are 237 unique categories.


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Cooper Koo Family YMCA | 43.653249 | -79.358008 | Distribution Center |
| 3 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
| 4 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Impact Kitchen | 43.656369 | -79.35698 | Restaurant |


### 3.2 Analyze data and get ready for clustering

#### Convert venue categories to columns

In [15]:
# one hot encoding
# add a prefix to each category column to avoid a duplicate column name: there is in fact a venue category called 'Neighborhood'
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="Cat", prefix_sep="_")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#### Group rows by neighborhood and calculate the mean occurrence of each venue category

In [16]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
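
To see what the grouped table holds, here is a toy version of the two steps above on made-up data. Neighborhoods `A` and `B` and their venues are illustrative, not taken from the Foursquare results. Each value is the share of a neighborhood's venues that fall in that category, so every row sums to 1.

```python
import pandas as pd

# Made-up venues: four in neighborhood A, two in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bakery', 'Park', 'Park'],
})

# One indicator column per category, prefixed as in the notebook
onehot = pd.get_dummies(venues[['Venue Category']], prefix='Cat',
                        prefix_sep='_', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the indicators is the relative frequency of each category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Here `A` gets 0.25 for `Cat_Bakery`, 0.5 for `Cat_Cafe` and 0.25 for `Cat_Park`, while `B` gets 1.0 for `Cat_Park` only. These frequency vectors are what k-means compares.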

### 3.3 Cluster Neighborhoods

In [17]:
# set number of clusters
kclusters = 8

# run k-means on the venue category
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

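# create a new dataframe that contains neighborhood coords with cluster label for plotting
# k-means labels follow the row order of toronto_grouped (sorted by neighborhood name), not of
# toronto_df, so pair them with the names first and join on the name rather than by position
cluster_labels = pd.DataFrame({'Neighbourhood': toronto_grouped['Neighborhood'],
                               'Cluster Labels': kmeans.labels_})
toronto_cluster = toronto_df.merge(cluster_labels, on='Neighbourhood', how='inner')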
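
The value of `kclusters` is set by hand. One common way to sanity-check such a choice is to compare the mean silhouette score across several values of k. The sketch below is not part of the original analysis: `silhouette_by_k` is a hypothetical helper, and the data is synthetic (three well-separated blobs), so the best k is known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, random_state=0):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

scores = silhouette_by_k(points, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the notebook's data the call would be `silhouette_by_k(toronto_grouped_clustering, range(2, 11))`. With only a few dozen neighborhoods the scores can be noisy, so they are best read as a rough guide.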

### 3.4 Plot the Map

#### First let's find Toronto coordinates to center the map

In [18]:
# find the coordinates of Toronto
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <font size=3><b> To show map with clusters </b></font>
</div>  

#### kclusters = 8 was chosen as the value that gave us distinct clusters

In [19]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_cluster['Latitude'], toronto_cluster['Longitude'], toronto_cluster['Neighbourhood'], toronto_cluster['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=1).add_to(map_clusters)
       
map_clusters

## THE END.