# Clustering of Toronto Neighbourhood

This notebook contains the work for clustering Toronto neighbourhood

## Part I: Process and Clean Data

We first import the required libraries. The library 'requests' for handling html page requests, the library 'lxml' for processing the html page and the pandas library for handling data

In [191]:
import requests
import lxml.html as lh
import pandas as pd

Request the html page from wikipedia which lists the postal codes of cananda

In [192]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
doc = lh.fromstring(page.content)

Process the table which contains the postal code. The table is identified as the one where the 'class' attirbute is set to 'wikitable sortable' as there is only one table with that attribute. 

In [193]:
table_one_elements=doc.xpath('//table[@class="wikitable sortable"]/tbody/tr')
table_one_elements_len = len(table_one_elements)

Extract the table header

In [194]:
table_header=[]
for col in table_one_elements[0].iterchildren():
    table_header.append(col.text_content().strip()) 

Extract the table data in python dictionary format with the key being postal code. For each row, check that there are three columns and if the row does not contain 3 columns, do not process (as we dont know how to process it). For each processed row, check the column 2 (borough). If not assigned, skip the row. If entry for borough is defined for the row, then check if the postal code already exists in dictionary. If yes, then append the neighbourhood defined for the current row to the neighbourhood already definded for the postal code in the dictionary. Else, create a new entry for the postal code with the borough and neighbourhood defined for the current row. If the neighbourhood is not defined for the current row, then take borough as the neighbourhood 

In [195]:
table_data={}
for row in range(1,table_one_elements_len):
    row_content = table_one_elements[row]
    row_content_len = len(row_content)
  
  
    if row_content_len == 3:
        borough = row_content[1].text_content()
        if borough != 'Not assigned':
                
            postcode = row_content[0].text_content()            
            if postcode in table_data.keys():
      
                neighbourhood = row_content[2].text_content().strip()
                if neighbourhood == 'Not assigned':
                    neighbourhood = borough
                     
                old_neighbourhood = table_data[postcode][1]  
                new_neighbourhood = old_neighbourhood + ", " + neighbourhood
                attr = [borough, new_neighbourhood]
                    
            else:
                    
                neighbourhood = row_content[2].text_content().strip()    
                attr = [borough, neighbourhood]
                
            table_data[postcode] = attr      
                
    else:
        pass




Convert the dictionary into pandas data frame. Reset the index and rename the index colum to html table header[0] (extracted in earlier step). Print the data frame and check if the data frame looks good

In [196]:
df=pd.DataFrame.from_dict(table_data, orient='index', columns=table_header[1:]) 
df=df.reset_index().rename(columns={'index':table_header[0]})
df


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


Print the shape of the data frame

In [197]:
df.shape

(103, 3)

## Part 2: Get Latitude and Longitude for postal codes

Import required libraries

In [198]:
import numpy as np

Read the csv file given in assignment into a dataframe

In [199]:
path='http://cocl.us/Geospatial_data'
df_ll = pd.read_csv(path)
df_ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Add two new columns - one for latitude and one for longitude to our dataframe that contains postal code - neighbourhood details and initialize them with NaN

In [200]:
df['Latitude'] = np.NaN
df['Longitude'] = np.NaN

For each row in the dataframe containing the postal code - neigbourhood details, extract the postal code. Find the postal code in the second data frame which contains the postal code - latitude, longitude information. Extract the latitude and longitude corresponding to the postal code. While extracting the latitude, longitude information, ensure that there is one and only one entry for the corresponding postal code in the second data frame. If not, reset the latitude and longitude information to NaN. Update the latitude, longitude information in the first data frame against the row corresponding to the postal index

In [201]:
for index, row in df.iterrows():
    postal_code=row['Postcode']
    LL_info=df_ll[df_ll['Postal Code'] == postal_code]
    
    if LL_info.shape[0] == 1:
        latitude=LL_info.iloc[0]['Latitude']
        longitude=LL_info.iloc[0]['Longitude'] 
    else:
        latitude=np.NaN
        longitude=np.NaN
        
    df.at[index, 'Latitude']  = latitude
    df.at[index, 'Longitude'] = longitude


Print the data frame and check that the updates are as expected and that all rows has valid data for all columns

In [202]:
df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.654260,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Not assigned,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937


## Part 3: Clustering Neighbourhoods

Import required libraries

In [203]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



Print the boroughs

In [204]:
print(df['Borough'].unique())

['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'Central Toronto'
 'Mississauga']


To simplify the problem, we look at the neigbourhood of boroughs that contain 'Toronto'. Extract boroughs that contain 'Toronto' in a new data frame called toronto_data

In [205]:
toronto_data=df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
8,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


Get the latitude, longitude for the city of Toronto, Canada

In [206]:
address = 'Toronto, CAN'

geolocator = Nominatim(user_agent="can_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))


The geograpical coordinate of Toronto are 43.6913544, -79.5006666.


Visualize the toronto neighbourhood in a map

In [207]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define four square credentials

In [208]:
CLIENT_ID = 'XXX'
CLIENT_SECRET = 'YYY' 
VERSION = '20180605' 

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 0OUEGH5DZVMBOKZL5NEEXPMBK42XH2EEPXUHBVVSQPLS3QM5
CLIENT_SECRET:45VSVB5TJYWEKUN3JOSE10E5QLCUU32WVE5EMZQZPVSP1RTC


Get 100 nearby venues (within the radius of 500) for each neigbourhood to use as feature for clustering of neighbourhood

In [209]:
venues_list=[]
for lat, lng, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
     
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, 500, 100)
            
       
    results = requests.get(url).json()["response"]['groups'][0]['items']
    for item in results:
        venues_list.append((neighbourhood, lat, lng, 
                            item['venue']['name'], item['venue']['location']['lat'], item['venue']['location']['lng'], item['venue']['categories'][0]['name']))
                        
    # print(venues_list)
    
    toronto_venues = pd.DataFrame([item for item in venues_list])
    toronto_venues.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    


In [210]:
print("No of venues for Toronto neighbourhood:", toronto_venues.shape[0])
print("No of categories of venues for Toronto neighbourhood:", len(toronto_venues['Venue Category'].unique()))

No of venues for Toronto neighbourhood: 1688
No of categories of venues for Toronto neighbourhood: 235


Across all neigbourhoods of Toronto, we have 1688 venues which falls within 235 category. We want to find for each of the 38 neighbourhoods of Toronto, the top 5 category of venues that exists in the neighbourhood

In [211]:
toronto_venue_category_onehot = pd.get_dummies(toronto_venues[['Venue Category']])                        
toronto_venue_onehot = pd.merge(toronto_venues['Neighbourhood'], toronto_venue_category_onehot, left_index=True, right_index=True)
toronto_venues_grouped = toronto_venue_onehot.groupby('Neighbourhood').mean().reset_index()

In [212]:
print("No of entries for venues after one-hot:", toronto_venue_onehot.shape[0])
print("No of entries for venues after neighbourhood grouping for category:", toronto_venues_grouped.shape)

No of entries for venues after one-hot: 1688
No of entries for venues after neighbourhood grouping for category: (38, 236)


Run K-mens clustering with K=4

In [213]:
toronto_venues_grouped_clustering = toronto_venues_grouped.drop('Neighbourhood', 1)
kmeans = KMeans(n_clusters=4, random_state=0).fit(toronto_venues_grouped_clustering)
kmeans.labels_

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       1, 2, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 2, 0, 2, 2], dtype=int32)

We now have for each of the neighbourhood, the occurence count for each venue category. Find the top 5 for each neighbourhood. Also update the resultant data frame with cluster information

In [214]:
toronto_nbhood_cluster_top5 = []

for index, row in toronto_venues_grouped.iterrows():
    neighbourhood = row['Neighbourhood']
    row_transpose  = toronto_venues_grouped[toronto_venues_grouped['Neighbourhood'] == neighbourhood].T.reset_index()
    row_categories = row_transpose.iloc[1:]
    row_categories.columns=['Venue Category', 'Freq']
    row_categories.set_index('Venue Category', inplace=True)
    row_categories_sorted = row_categories.sort_values('Freq', ascending=False)
    row_categories_top5=row_categories_sorted.index.values[0:5]
    
    toronto_data_row = toronto_data[toronto_data['Neighbourhood'] == neighbourhood]
    neighbourhood_top5=[neighbourhood, toronto_data_row.iloc[0]['Latitude'], toronto_data_row.iloc[0]['Longitude'], kmeans.labels_[index]]
    for item in row_categories_top5:
        neighbourhood_top5.append(item)
        
    toronto_nbhood_cluster_top5.append(neighbourhood_top5)

# print(toronto_nbhood_cluster_top5)

Covert the list to pandas data frame and print to check if it looks good

In [215]:
df_nbhood_venue=pd.DataFrame.from_records(toronto_nbhood_cluster_top5) 
df_nbhood_venue.columns=['Neighbourhood', 'Latitude', 'Longitude','Cluster', '1st Venue Category', '2nd Venue Category', '3rd Venue Category', '4th Venue Category', '5th Venue Category']
df_nbhood_venue

Unnamed: 0,Neighbourhood,Latitude,Longitude,Cluster,1st Venue Category,2nd Venue Category,3rd Venue Category,4th Venue Category,5th Venue Category
0,"Adelaide, King, Richmond",43.650571,-79.384568,2,Venue Category_Coffee Shop,Venue Category_Café,Venue Category_Steakhouse,Venue Category_Bar,Venue Category_Thai Restaurant
1,Berczy Park,43.644771,-79.373306,2,Venue Category_Coffee Shop,Venue Category_Cocktail Bar,Venue Category_Farmers Market,Venue Category_Café,Venue Category_Cheese Shop
2,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191,2,Venue Category_Breakfast Spot,Venue Category_Café,Venue Category_Coffee Shop,Venue Category_Burrito Place,Venue Category_Bar
3,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,2,Venue Category_Light Rail Station,Venue Category_Yoga Studio,Venue Category_Park,Venue Category_Comic Shop,Venue Category_Recording Studio
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,2,Venue Category_Airport Service,Venue Category_Airport Lounge,Venue Category_Airport Terminal,Venue Category_Sculpture Garden,Venue Category_Airport Food Court
5,"Cabbagetown, St. James Town",43.667967,-79.367675,2,Venue Category_Coffee Shop,Venue Category_Restaurant,Venue Category_Pizza Place,Venue Category_Bakery,Venue Category_Pub
6,Central Bay Street,43.657952,-79.387383,2,Venue Category_Coffee Shop,Venue Category_Ice Cream Shop,Venue Category_Italian Restaurant,Venue Category_Café,Venue Category_Burger Joint
7,"Chinatown, Grange Park, Kensington Market",43.653206,-79.400049,2,Venue Category_Café,Venue Category_Vegetarian / Vegan Restaurant,Venue Category_Chinese Restaurant,Venue Category_Bakery,Venue Category_Mexican Restaurant
8,Christie,43.669542,-79.422564,2,Venue Category_Café,Venue Category_Grocery Store,Venue Category_Park,Venue Category_Baby Store,Venue Category_Athletics & Sports
9,Church and Wellesley,43.66586,-79.38316,2,Venue Category_Coffee Shop,Venue Category_Japanese Restaurant,Venue Category_Sushi Restaurant,Venue Category_Restaurant,Venue Category_Gay Bar


We see that the cluster 1 has more establishments of parks, playgrounds or nature trails. Cluster 2 includes neighbourhood that has high food establishments such as coffee, restaurants or bub. While we cannot be too sure of cluster 0 and 3 due to low sample points in the cluster, it looks cluster 0 is most likely to be a tourist place with high establishments of trails and monuments/Landmark. We can plot the neighbourhoods based on cluster to visualize the analysis

In [216]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)


colors = ['red', 'green', 'blue', 'purple']

markers_colors = []
for lat, lon, poi, cluster in zip(df_nbhood_venue['Latitude'], df_nbhood_venue['Longitude'], df_nbhood_venue['Neighbourhood'], df_nbhood_venue['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colors[cluster-1],
        fill=True,
        fill_color=colors[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
    

That concludes our analysis of clustering of neighbourhood of Toronto!