# Toronto Neighborhood Clustering

Prepared by [Ali Rifat Kaya](https://www.linkedin.com/in/alirifatkaya/)

For the best experience, please view the notebook [here](https://nbviewer.jupyter.org/github/alirifat/ibm-capstone/blob/master/toronto_clustering.ipynb).

* In this project, we will cluster the neighborhoods of Toronto that are located in boroughs whose names contain 'Toronto', such as Downtown Toronto, East Toronto, and West Toronto.


1. Import neighborhood data and coordinates data and generate a new data set by merging them.
2. Select only the boroughs that contain 'Toronto'.
3. Plot all boroughs in Toronto using geographical coordinates.
4. Plot only those boroughs that contain 'Toronto' using geographical coordinates.
5. Find the optimum distance between the boroughs to increase map coverage and prevent the boroughs from overlapping.
6. Collect venues for each neighborhood and convert the data into structured format.
7. Explore venues and perform clustering analysis using $k$-means.


## Table of Contents

* [Import Libraries & Read Data](#Import-Libraries-and-Read-Data)
* [Visualize the Neighborhoods](#Visualize-the-Neighborhoods)
* [Access to Foursquare API](#Access-to-Foursquare-API)
* [Explore Neighborhoods in Toronto](#Explore-Neighborhoods-in-Toronto)
* [Analyze Each Neighborhood](#Analyze-Each-Neighborhood)
* [Cluster Neighborhoods](#Cluster-Neighborhoods)
* [Examine Clusters](#Examine-Clusters)

## Import Libraries and Read Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
import folium
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
from geopy.distance import great_circle
from pandas import json_normalize

## Read Data

In [2]:
# Toronto neighborhood data
# Scraped from Wikipedia
df = pd.read_csv('toronto_neighborhood_data.csv')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [3]:
# Coordinates data
df_coordinates = pd.read_csv('Geospatial_Coordinates.csv')
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


* We will merge the two data frames on the postal code.

In [4]:
df = df.merge(df_coordinates, on='Postal Code', how='left')
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [5]:
def get_neighborhoods(string):
    return len(string.split(', '))

In [6]:
df['Number of Neighborhoods'] = df.Neighborhood.apply(get_neighborhoods)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Postal Code              103 non-null    object 
 1   Borough                  103 non-null    object 
 2   Neighborhood             103 non-null    object 
 3   Latitude                 103 non-null    float64
 4   Longitude                103 non-null    float64
 5   Number of Neighborhoods  103 non-null    int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 5.6+ KB


In [8]:
print("""

    The dataframe has {} boroughs and {} neighborhoods.

""".format(len(df.Borough.unique()),
           np.sum(df['Number of Neighborhoods'].values)))



    The dataframe has 10 boroughs and 217 neighborhoods.




In [9]:
# unique boroughs and total neighborhoods
df.groupby('Borough')['Number of Neighborhoods'].sum()

Borough
Central Toronto     18
Downtown Toronto    39
East Toronto         8
East York            7
Etobicoke           47
Mississauga          1
North York          38
Scarborough         38
West Toronto        13
York                 8
Name: Number of Neighborhoods, dtype: int64

* Select only those boroughs whose names contain 'Toronto'.

In [10]:
# chain reset_index so that df_toronto is a new frame rather than a modified slice of df
df_toronto = df.loc[df.Borough.str.contains('Toronto')].reset_index(drop=True)
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Number of Neighborhoods
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,1
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,2
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,2
3,M4M,East Toronto,Studio District,43.659526,-79.340923,1
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1


## Visualize the Neighborhoods

* Obtain latitude and longitude for Toronto, CA.

In [11]:
address = 'Toronto, Ontario, CA'
geolocator = Nominatim(user_agent="https")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("""

    The geographical coordinates of Toronto, CA are {}, {}.

""".format(latitude, longitude))



    The geographical coordinates of Toronto, CA are 43.6534817, -79.3839347.




* Map of Toronto with all boroughs marked

In [12]:
# 0.05 is adjustment to see all boroughs
map_toronto = folium.Map(location=[latitude + 0.05, longitude], zoom_start=11)

# add markers to map
for row in df.itertuples():
    label = 'Postal Code: {}, Borough: {}, Neighborhoods: {}'.format(
        row[1], row.Borough, row.Neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([row.Latitude, row.Longitude],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7,
                        parse_html=False).add_to(map_toronto)

In [13]:
display(map_toronto)

* Map of Toronto with only the boroughs that contain 'Toronto' marked

In [14]:
only_toronto_boroughs = folium.Map(location=[latitude + 0.01, longitude],
                                   zoom_start=12)

# add markers to map
for row_namedtuple in df_toronto.itertuples():
    label = 'Postal Code: {}, Borough: {}, Neighborhoods: {}'.format(
        row_namedtuple[1], row_namedtuple.Borough, row_namedtuple.Neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([row_namedtuple.Latitude, row_namedtuple.Longitude],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7,
                        parse_html=False).add_to(only_toronto_boroughs)

In [15]:
only_toronto_boroughs

* The boroughs are scattered around Downtown Toronto.
* A fixed radius would make the borough circles overlap. Instead, for each postal code we will calculate the distance to its closest neighbor and use half of that distance as the radius. This should increase the coverage and prevent overlap.
* To increase the variability of the data, we will perform the clustering analysis on all boroughs.
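The radius rule described above can be sketched without any third-party dependency. The snippet below is only an illustration, not the notebook's implementation: `haversine_m` and `half_nearest_distance` are hypothetical helper names, the Earth radius is an assumed mean value, and the three postal codes with their coordinates are taken from the tables above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371009  # assumed mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def half_nearest_distance(points):
    """Map each key to (nearest other key, half the distance to it)."""
    result = {}
    for key, coords in points.items():
        nearest = min((k for k in points if k != key),
                      key=lambda k: haversine_m(coords, points[k]))
        result[key] = (nearest, haversine_m(coords, points[nearest]) / 2)
    return result


# Three postal codes from the merged dataframe shown earlier
points = {
    'M1E': (43.763573, -79.188711),
    'M1G': (43.770992, -79.216917),
    'M1H': (43.773136, -79.239476),
}
radii = half_nearest_distance(points)
for code, (nearest, radius) in radii.items():
    print(code, nearest, round(radius))
```

Each radius is at most half the distance to any other point, so the sum of two radii never exceeds the distance between their centers. The circles can touch, but they cannot overlap.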

In [16]:
# all ordered pairs of postal codes
from itertools import permutations
df_distances = pd.DataFrame(
    [*permutations(df['Postal Code'].values, 2)],
    columns=['postal_code1', 'postal_code2'])
df_distances.head(3)

Unnamed: 0,postal_code1,postal_code2
0,M1B,M1C
1,M1B,M1E
2,M1B,M1G


In [17]:
# get the coordinates from df for each postal code column
temp = df.loc[:, ['Postal Code', 'Latitude', 'Longitude']]
df_distances = df_distances.merge(temp,
                                  left_on='postal_code1',
                                  right_on='Postal Code',
                                  validate='many_to_one')
df_distances = df_distances.merge(temp,
                                  left_on='postal_code2',
                                  right_on='Postal Code',
                                  validate='many_to_one')
df_distances.drop(['Postal Code_x', 'Postal Code_y'], axis=1, inplace=True)
df_distances.head(3)

Unnamed: 0,postal_code1,postal_code2,Latitude_x,Longitude_x,Latitude_y,Longitude_y
0,M1B,M1C,43.806686,-79.194353,43.784535,-79.160497
1,M1E,M1C,43.763573,-79.188711,43.784535,-79.160497
2,M1G,M1C,43.770992,-79.216917,43.784535,-79.160497


* `Latitude_x` and `Longitude_x` represent the coordinates of the postal codes in `postal_code1` column
* `Latitude_y` and `Longitude_y` represent the coordinates of the postal codes in `postal_code2` column

In [18]:
df_distances.shape

(10506, 6)
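The row count above follows from counting ordered pairs: 103 postal codes give 103 × 102 = 10506 permutations. Since distance is symmetric, unordered pairs would carry the same information in half the rows. The sketch below only checks this arithmetic, using integers as stand-ins for the postal codes.

```python
from itertools import combinations, permutations

n_codes = 103  # number of postal codes in the merged dataframe
codes = range(n_codes)  # integer stand-ins for the postal codes

ordered_pairs = sum(1 for _ in permutations(codes, 2))
unordered_pairs = sum(1 for _ in combinations(codes, 2))

print(ordered_pairs)    # n * (n - 1)
print(unordered_pairs)  # n * (n - 1) / 2
```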

In [19]:
df_distances['distance'] = df_distances.apply(lambda x: great_circle(
    (x.Latitude_x, x.Longitude_x), (x.Latitude_y, x.Longitude_y)).meters,
                                              axis=1)
df_distances.head(3)

Unnamed: 0,postal_code1,postal_code2,Latitude_x,Longitude_x,Latitude_y,Longitude_y,distance
0,M1B,M1C,43.806686,-79.194353,43.784535,-79.160497,3667.56356
1,M1E,M1C,43.763573,-79.188711,43.784535,-79.160497,3250.398612
2,M1G,M1C,43.770992,-79.216917,43.784535,-79.160497,4773.524176


* We calculated the distance in meters for every ordered pair of postal codes.
* For each postal code, we will keep only the minimum distance, i.e. the distance to its nearest neighbor.

In [20]:
df_min_distances = df_distances[[
    'postal_code1', 'postal_code2', 'distance'
]].loc[df_distances.groupby('postal_code1')['distance'].idxmin()]
df_min_distances.reset_index(drop=True, inplace=True)
df_min_distances.rename(columns={
    'postal_code1': 'Postal Code',
    'postal_code2': 'Nearest Postal Code',
    'distance': 'Distance'
},
                        inplace=True)
df_min_distances.head(3)

Unnamed: 0,Postal Code,Nearest Postal Code,Distance
0,M1B,M1X,3396.253382
1,M1C,M1E,3250.398612
2,M1E,M1G,2410.515595


In [21]:
df_min_distances = df_min_distances.merge(df, on='Postal Code')
df_min_distances = df_min_distances[[
    'Postal Code', 'Nearest Postal Code', 'Borough', 'Neighborhood',
    'Number of Neighborhoods', 'Latitude', 'Longitude', 'Distance'
]]
# take half of the distance for radius
df_min_distances['Distance'] = df_min_distances.Distance / 2
df_min_distances.sort_values('Postal Code', inplace=True)
df_min_distances.head()

Unnamed: 0,Postal Code,Nearest Postal Code,Borough,Neighborhood,Number of Neighborhoods,Latitude,Longitude,Distance
0,M1B,M1X,Scarborough,"Malvern, Rouge",2,43.806686,-79.194353,1698.126691
1,M1C,M1E,Scarborough,"Rouge Hill, Port Union, Highland Creek",3,43.784535,-79.160497,1625.199306
2,M1E,M1G,Scarborough,"Guildwood, Morningside, West Hill",3,43.763573,-79.188711,1205.257798
3,M1G,M1H,Scarborough,Woburn,1,43.770992,-79.216917,913.470885
4,M1H,M1G,Scarborough,Cedarbrae,1,43.773136,-79.239476,913.470885


In [22]:
df_min_distances.shape

(103, 8)

In [23]:
scaled_toronto_boroughs = folium.Map(location=[latitude + 0.07, longitude],
                                     zoom_start=10.5)

# add markers to map
for row_namedtuple in df_min_distances.itertuples():
    label = 'Postal Code: {}, Borough: {}, Neighborhoods: {}'.format(
        row_namedtuple[1], row_namedtuple.Borough, row_namedtuple.Neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([row_namedtuple.Latitude, row_namedtuple.Longitude],
                        radius=1,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7,
                        parse_html=False).add_to(scaled_toronto_boroughs)
    folium.Circle([row_namedtuple.Latitude, row_namedtuple.Longitude],
                  radius=row_namedtuple.Distance,
                  popup=label,
                  color='blue',
                  fill=True,
                  fill_color='#3186cc',
                  parse_html=False).add_to(scaled_toronto_boroughs)

In [24]:
scaled_toronto_boroughs

We eliminated the overlaps, especially in Downtown Toronto, and increased the coverage.

## Access to Foursquare API

In [1]:
CLIENT_ID = '*****'  # your Foursquare ID
CLIENT_SECRET = '*****'  # your Foursquare Secret
ACCESS_TOKEN = '*****'  # your FourSquare Access Token
VERSION = '20180605'  # Foursquare API version
LIMIT = 100  # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: *****
CLIENT_SECRET: *****


#### Let's explore the first neighborhood in our dataframe


Get the neighborhood's name.


In [26]:
df_min_distances.loc[0, 'Neighborhood']

'Malvern, Rouge'

Get the neighborhood's latitude and longitude values.


In [27]:
neighborhood_latitude = df_min_distances.loc[
    0, 'Latitude']  # neighborhood latitude value
neighborhood_longitude = df_min_distances.loc[
    0, 'Longitude']  # neighborhood longitude value

neighborhood_name = df_min_distances.loc[0, 'Neighborhood']  # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(
    neighborhood_name, neighborhood_latitude, neighborhood_longitude))

Latitude and longitude values of Malvern, Rouge are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in Malvern, Rouge within the designated radius


First, let's create the GET request URL. Name your URL **url**.


In [28]:
LIMIT = 100  # limit of number of venues returned by Foursquare API
radius = df_min_distances.loc[0, 'Distance']  # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, neighborhood_latitude,
    neighborhood_longitude, radius, LIMIT)
url  # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=*****&client_secret=*****&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=1698.1266912357644&limit=100'

Send the GET request and examine the results.


In [29]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '603ff8e13a222c39ea313ce7'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 36,
  'suggestedBounds': {'ne': {'lat': 43.821968315282014,
    'lng': -79.17321730976046},
   'sw': {'lat': 43.79140428471798, 'lng': -79.21548949023956}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4d669cba83865481c948fa53',
       'name': 'Images Salon & Spa',
       'location': {'address': '8130 Sheppard Ave E',
        'crossStreet': 'Morningside Ave',
        'lat': 43.80228301948931,
        'lng': -79.19856472801668,
        'labeledLatLngs'

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [30]:
# function that extracts the category of the venue
def get_category_type(row):
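    # rows flattened by json_normalize use the 'venue.categories' key
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']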
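To make the behavior of `get_category_type` concrete, here is a minimal stand-alone sketch. `first_category_name` is a hypothetical variant that works on plain dictionaries instead of pandas rows, and both sample rows are made up for illustration (only the venue name and category of the first one appear in the response above).

```python
def first_category_name(row):
    """Return the name of the first category, or None if there is none."""
    # Accept both the raw key and the flattened json_normalize key.
    categories = row.get('categories', row.get('venue.categories', []))
    return categories[0]['name'] if categories else None


# Hypothetical rows: one with a category, one with an empty category list
with_category = {'venue.name': 'Images Salon & Spa',
                 'venue.categories': [{'name': 'Spa'}]}
without_category = {'venue.name': 'Unnamed venue', 'venue.categories': []}

print(first_category_name(with_category))
print(first_category_name(without_category))
```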

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [31]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)  # flatten JSON

# filter columns
filtered_columns = [
    'venue.name', 'venue.categories', 'venue.location.lat',
    'venue.location.lng'
]
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type,
                                                        axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Images Salon & Spa,Spa,43.802283,-79.198565
1,African Rainforest Pavilion,Zoo Exhibit,43.817725,-79.183433
2,Penguin Exhibit,Zoo Exhibit,43.819435,-79.185959
3,Orangutan Exhibit,Zoo Exhibit,43.818413,-79.182548
4,Gorilla Exhibit,Zoo Exhibit,43.81908,-79.184235


And how many venues were returned by Foursquare?


In [32]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

36 venues were returned by Foursquare.


## Explore Neighborhoods in Toronto

In [33]:
def getNearbyVenues(names, latitudes, longitudes, radius):

    venues_list = []
    for name, lat, lng, rad in zip(names, latitudes, longitudes, radius):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, rad, LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue;
        # a venue without a category gets None instead of raising IndexError
        venues_list.append([
            (name, lat, lng, v['venue']['name'], v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name']
             if v['venue']['categories'] else None) for v in results
        ])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
        'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
        'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'
    ]

    return (nearby_venues)
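The two steps inside `getNearbyVenues`, building the request URL and flattening the response, can be tested offline if they are split into small helpers. The sketch below is one possible arrangement under stated assumptions: `build_explore_url` and `extract_venues` are hypothetical names, the credentials are placeholders, and the sample payload only imitates the shape of the response shown earlier.

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_URL + '?' + urlencode(params)


def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows, tolerating missing categories."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload shaped like the response shown earlier
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Images Salon & Spa',
               'location': {'lat': 43.802283, 'lng': -79.198565},
               'categories': [{'name': 'Spa'}]}},
    {'venue': {'name': 'Venue without category',
               'location': {'lat': 43.8, 'lng': -79.19},
               'categories': []}},
]}]}}

rows = extract_venues(sample, 'Malvern, Rouge', 43.806686, -79.194353)
url = build_explore_url('ID', 'SECRET', '20180605',
                        43.806686, -79.194353, 1698, 100)
print(rows)
print(url)
```

Keeping the parsing separate from the HTTP call also makes it easier to avoid printing or storing a URL that contains the client secret.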

In [34]:
toronto_venues = getNearbyVenues(names=df_min_distances['Neighborhood'],
                                 latitudes=df_min_distances['Latitude'],
                                 longitudes=df_min_distances['Longitude'],
                                 radius=df_min_distances['Distance'])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)
The Danforth West, 

In [35]:
print(toronto_venues.shape)
toronto_venues.head()

(3289, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Images Salon & Spa,43.802283,-79.198565,Spa
1,"Malvern, Rouge",43.806686,-79.194353,African Rainforest Pavilion,43.817725,-79.183433,Zoo Exhibit
2,"Malvern, Rouge",43.806686,-79.194353,Penguin Exhibit,43.819435,-79.185959,Zoo Exhibit
3,"Malvern, Rouge",43.806686,-79.194353,Orangutan Exhibit,43.818413,-79.182548,Zoo Exhibit
4,"Malvern, Rouge",43.806686,-79.194353,Gorilla Exhibit,43.81908,-79.184235,Zoo Exhibit


In [36]:
# save as a csv file
toronto_venues.to_csv('toronto_venues.csv', index=False)

In [37]:
toronto_venues = pd.read_csv('toronto_venues.csv')

In [38]:
toronto_venues = toronto_venues.loc[toronto_venues['Venue Category'] != 'Neighborhood']

In [39]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,54,54,54,54,54,54
"Alderwood, Long Branch",69,69,69,69,69,69
"Bathurst Manor, Wilson Heights, Downsview North",38,38,38,38,38,38
Bayview Village,16,16,16,16,16,16
"Bedford Park, Lawrence Manor East",54,54,54,54,54,54
...,...,...,...,...,...,...
"Willowdale, Willowdale West",37,37,37,37,37,37
Woburn,4,4,4,4,4,4
Woodbine Heights,9,9,9,9,9,9
York Mills West,32,32,32,32,32,32


In [40]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 321 unique categories.


## Analyze Each Neighborhood


In [41]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


And let's examine the new dataframe size.


In [42]:
toronto_onehot.shape

(3285, 322)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category


In [43]:
toronto_groupped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_groupped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.014493,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.018519,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.027027,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
94,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
95,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
96,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0


#### Let's confirm the new size


In [44]:
toronto_groupped.shape

(98, 322)
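Taking the mean of one-hot columns per neighborhood is the same as computing the share of each category among that neighborhood's venues. The dependency-free sketch below shows the equivalence on the five venues displayed earlier for Malvern, Rouge. Because it uses only those five rows, the resulting values are illustrative and are not the neighborhood's actual frequencies.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs from the first five rows of toronto_venues
venues = [
    ('Malvern, Rouge', 'Spa'),
    ('Malvern, Rouge', 'Zoo Exhibit'),
    ('Malvern, Rouge', 'Zoo Exhibit'),
    ('Malvern, Rouge', 'Zoo Exhibit'),
    ('Malvern, Rouge', 'Zoo Exhibit'),
]

# Count categories per neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Relative frequency = count / number of venues in the neighborhood
frequencies = {
    hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for hood, counter in counts.items()
}
print(frequencies)
```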

#### Let's print each neighborhood along with the top 5 most common venues


In [45]:
num_top_venues = 5

for hood in toronto_groupped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_groupped[toronto_groupped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                  venue  freq
0    Chinese Restaurant  0.19
1         Shopping Mall  0.04
2  Gym / Fitness Center  0.04
3                Bakery  0.04
4  Cantonese Restaurant  0.04


----Alderwood, Long Branch----
              venue  freq
0       Coffee Shop  0.09
1      Burger Joint  0.04
2  Department Store  0.04
3              Café  0.04
4     Grocery Store  0.04


----Bathurst Manor, Wilson Heights, Downsview North----
         venue  freq
0  Coffee Shop  0.08
1  Pizza Place  0.08
2         Park  0.08
3  Gas Station  0.05
4         Bank  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.12
1         Intersection  0.12
2                 Bank  0.12
3          Gas Station  0.12
4        Grocery Store  0.12


----Bedford Park, Lawrence Manor East----
                  venue  freq
0           Coffee Shop  0.07
1                Bakery  0.06
2    Italian Restaurant  0.06
3            Restaurant  0.04
4  Fast Food Restaurant  0.04


---

4    Breakfast Spot  0.04


----Parkview Hill, Woodbine Gardens----
                  venue  freq
0           Pizza Place  0.14
1  Fast Food Restaurant  0.07
2                  Bank  0.07
3        Breakfast Spot  0.07
4                  Café  0.07


----Parkwoods----
           venue  freq
0           Park  0.11
1       Pharmacy  0.07
2       Bus Stop  0.07
3  Shopping Mall  0.07
4            ATM  0.04


----Queen's Park, Ontario Provincial Government----
              venue  freq
0              Café  0.25
1  Sushi Restaurant  0.25
2   Bubble Tea Shop  0.25
3   Thai Restaurant  0.25
4               ATM  0.00


----Regent Park, Harbourfront----
         venue  freq
0  Coffee Shop  0.15
1      Theater  0.05
2       Bakery  0.05
3         Park  0.05
4          Pub  0.03


----Richmond, Adelaide, King----
              venue  freq
0       Coffee Shop  0.11
1        Steakhouse  0.06
2  Greek Restaurant  0.06
3  Sushi Restaurant  0.06
4        Taco Place  0.06


----Rosedale----
           v

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [46]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [47]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_groupped['Neighborhood']

for ind in np.arange(toronto_groupped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_groupped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Caribbean Restaurant,Shopping Mall,Bakery,Coffee Shop,Breakfast Spot,Gym / Fitness Center,Cantonese Restaurant,Pool,Bookstore
1,"Alderwood, Long Branch",Coffee Shop,Restaurant,Grocery Store,Pizza Place,Café,Department Store,Burger Joint,Sandwich Place,Bank,Pharmacy
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Park,Pizza Place,Gas Station,Bank,Fried Chicken Joint,Deli / Bodega,French Restaurant,Diner,Sandwich Place
3,Bayview Village,Gas Station,Bank,Japanese Restaurant,Grocery Store,Intersection,Park,Café,Skating Rink,Restaurant,Shopping Mall
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Bakery,Bagel Shop,Restaurant,Fast Food Restaurant,Café,Bank,Pizza Place,Sandwich Place
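The column labels above come from a three-element suffix list with a fallback to 'th', which is correct for the ten columns used here but would produce labels such as '21th' for longer lists. A more general approach could look like the sketch below, where `ordinal` is a hypothetical helper and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Same column layout as the dataframe above
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```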


## Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [48]:
# set number of clusters
kclusters = 5

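# pass the column by keyword; the positional axis argument is not accepted by recent pandas versions
toronto_grouped_clustering = toronto_groupped.drop(columns='Neighborhood')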

# run k-means clustering
kmeans = KMeans(init = "k-means++", n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 4, 1, 1, 4, 1, 4, 1, 1, 1])
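For readers who want to see what the clustering step does conceptually, here is a minimal pure-Python sketch of Lloyd's iterations, the assign-then-update loop at the core of _k_-means. It is not scikit-learn's implementation: `KMeans` chooses its starting centers with k-means++ and may run several initializations, whereas this sketch starts from centers supplied by the caller. The function names and the two-dimensional frequency vectors are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, centers, max_iter=100):
    """Run plain Lloyd iterations from the given starting centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center.
        labels = [
            min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


# Hypothetical frequency vectors: (coffee shop share, park share)
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centers = lloyd_kmeans(points, centers=[(1.0, 0.0), (0.0, 1.0)])
print(labels)
print(centers)
```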

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [49]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [50]:
toronto_merged = df_min_distances.merge(neighborhoods_venues_sorted, on='Neighborhood')
toronto_merged.head()

Unnamed: 0,Postal Code,Nearest Postal Code,Borough,Neighborhood,Number of Neighborhoods,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,M1X,Scarborough,"Malvern, Rouge",2,43.806686,-79.194353,1698.126691,1,Zoo Exhibit,Fast Food Restaurant,Pizza Place,Restaurant,Bank,Gas Station,Liquor Store,Food & Drink Shop,Other Great Outdoors,Paper / Office Supplies Store
1,M1C,M1E,Scarborough,"Rouge Hill, Port Union, Highland Creek",3,43.784535,-79.160497,1625.199306,1,Hotel,Grocery Store,Pharmacy,Breakfast Spot,Bank,Park,Burger Joint,Gym / Fitness Center,Gym,Italian Restaurant
2,M1E,M1G,Scarborough,"Guildwood, Morningside, West Hill",3,43.763573,-79.188711,1205.257798,1,Pizza Place,Park,Bank,Fast Food Restaurant,Coffee Shop,Sandwich Place,Sports Bar,Smoothie Shop,Liquor Store,Supermarket
3,M1G,M1H,Scarborough,Woburn,1,43.770992,-79.216917,913.470885,0,Coffee Shop,Park,Zoo Exhibit,Ethiopian Restaurant,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,M1H,M1G,Scarborough,Cedarbrae,1,43.773136,-79.239476,913.470885,1,Bakery,Coffee Shop,Indian Restaurant,Bus Line,Thai Restaurant,Gas Station,Caribbean Restaurant,Fried Chicken Joint,Athletics & Sports,Chinese Restaurant


Finally, let's visualize the resulting clusters


In [51]:
map_clusters = folium.Map(location=[latitude + 0.01, longitude],
                          zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
for row_namedtuple in toronto_merged.itertuples():
    label = 'Postal Code: {}, Borough: {}, Neighborhoods: {}, Cluster: {}'.format(
        row_namedtuple[1], row_namedtuple.Borough, row_namedtuple.Neighborhood,
        row_namedtuple[9])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([row_namedtuple.Latitude, row_namedtuple.Longitude],
                        radius=1,
                        color=rainbow[row_namedtuple[9]-1],
                        fill=True,
                        fill_color=rainbow[row_namedtuple[9]-1],
                        fill_opacity=0.7,
                        parse_html=False).add_to(map_clusters)
    folium.Circle([row_namedtuple.Latitude, row_namedtuple.Longitude],
                  radius=row_namedtuple.Distance,
                  popup=label,
                  color=rainbow[row_namedtuple[9]-1],
                  fill=True,
                  fill_color=rainbow[row_namedtuple[9]-1],
                  parse_html=False).add_to(map_clusters)

In [52]:
map_clusters
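
The color scheme in the map code above samples matplotlib's rainbow colormap at evenly spaced points, one per cluster, and converts each sample to a hex string. The sketch below isolates that step. The value `k = 5` and the name `palette` are assumptions made for illustration and do not come from the notebook. It also shows a side effect of indexing with `label - 1`: label 0 wraps around to the last color in the list because of Python's negative indexing.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # hypothetical number of clusters, for illustration only

# Sample the rainbow colormap at k evenly spaced points and convert to hex
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# With `label - 1` indexing, label 0 maps to palette[-1], the last color
for label in range(k):
    print(label, palette[label - 1])
```

Every cluster still receives its own color, so the map is unaffected; the offset only changes which color goes with which label.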

## Examine Clusters


Now you can examine each cluster and identify the venue categories that distinguish it from the others. Based on those defining categories, you can then assign a descriptive name to each cluster. I will leave this exercise to you.
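
One possible starting point for this exercise is to count how often each venue category appears among the top-ranked venues of each cluster. The sketch below is not part of the original analysis: the helper name `top_categories_per_cluster`, its parameters, and the small `sample` frame are assumptions made for illustration. Only the column names follow `toronto_merged`.

```python
import pandas as pd


def top_categories_per_cluster(df, label_col="Cluster Labels", n_ranks=3, top_n=3):
    """Return the most frequent venue categories per cluster.

    Counts the categories found in the first `n_ranks` "Most Common Venue"
    columns, then keeps the `top_n` most frequent ones for each cluster.
    """
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")][:n_ranks]
    # Reshape to long format: one row per (cluster, venue category) pair
    long = df.melt(id_vars=label_col, value_vars=venue_cols, value_name="Category")
    counts = (
        long.groupby([label_col, "Category"])
        .size()
        .rename("Count")
        .reset_index()
        .sort_values([label_col, "Count", "Category"], ascending=[True, False, True])
    )
    # Keep the top rows within each cluster, preserving the sorted order
    return counts.groupby(label_col).head(top_n).reset_index(drop=True)


# Hypothetical sample using the same column names as `toronto_merged`
sample = pd.DataFrame({
    "Cluster Labels": [0, 0, 2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
    "2nd Most Common Venue": ["Park", "Café", "Coffee Shop", "Gym"],
})
summary = top_categories_per_cluster(sample, n_ranks=2, top_n=1)
print(summary)
```

Calling `top_categories_per_cluster(toronto_merged)` on the real data would give a compact per-cluster summary to compare against the full tables that follow.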


#### Cluster 1


In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Nearest Postal Code,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,M1H,43.770992,-79.216917,913.470885,0,Coffee Shop,Park,Zoo Exhibit,Ethiopian Restaurant,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
56,M7A,43.657952,-79.387383,256.27661,0,Coffee Shop,Café,Sandwich Place,Restaurant,Bank,Fried Chicken Joint,Italian Restaurant,Deli / Bodega,Smoothie Shop,Bookstore
59,M5X,43.647177,-79.381576,75.166945,0,Coffee Shop,Restaurant,Gym,Deli / Bodega,Zoo Exhibit,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store


#### Cluster 2


In [54]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Nearest Postal Code,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1X,43.806686,-79.194353,1698.126691,1,Zoo Exhibit,Fast Food Restaurant,Pizza Place,Restaurant,Bank,Gas Station,Liquor Store,Food & Drink Shop,Other Great Outdoors,Paper / Office Supplies Store
1,M1E,43.784535,-79.160497,1625.199306,1,Hotel,Grocery Store,Pharmacy,Breakfast Spot,Bank,Park,Burger Joint,Gym / Fitness Center,Gym,Italian Restaurant
2,M1G,43.763573,-79.188711,1205.257798,1,Pizza Place,Park,Bank,Fast Food Restaurant,Coffee Shop,Sandwich Place,Sports Bar,Smoothie Shop,Liquor Store,Supermarket
4,M1G,43.773136,-79.239476,913.470885,1,Bakery,Coffee Shop,Indian Restaurant,Bus Line,Thai Restaurant,Gas Station,Caribbean Restaurant,Fried Chicken Joint,Athletics & Sports,Chinese Restaurant
5,M1K,43.744734,-79.239476,1301.44386,1,Sandwich Place,Pharmacy,Ice Cream Shop,Grocery Store,Bank,Fish & Chips Shop,Liquor Store,Japanese Restaurant,Beer Store,Indian Restaurant
6,M1M,43.727929,-79.262029,1112.691225,1,Chinese Restaurant,Coffee Shop,Fast Food Restaurant,Convenience Store,Pizza Place,Discount Store,Grocery Store,Burger Joint,Bus Line,Bus Station
7,M4B,43.711112,-79.284577,1052.358786,1,Coffee Shop,Bus Line,Convenience Store,Grocery Store,Bakery,Intersection,Ice Cream Shop,Fast Food Restaurant,Soccer Field,Beer Store
8,M1K,43.716316,-79.239476,1112.691225,1,Harbor / Marina,Pizza Place,Ice Cream Shop,Beach,Sports Bar,Restaurant,Grocery Store,Sandwich Place,Park,Eastern European Restaurant
9,M1L,43.692657,-79.264848,1296.758686,1,Diner,Gym,Café,Skating Rink,General Entertainment,Bus Stop,Dessert Shop,Park,Chinese Restaurant,Ice Cream Shop
10,M1R,43.75741,-79.273304,993.067332,1,Coffee Shop,Restaurant,Pharmacy,Electronics Store,Asian Restaurant,Intersection,Chinese Restaurant,Bakery,Fast Food Restaurant,Light Rail Station


#### Cluster 3


In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Nearest Postal Code,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,M2P,43.75749,-79.374714,1050.879937,2,Park,Coffee Shop,Pool,Zoo Exhibit,Escape Room,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant
48,M4W,43.689574,-79.38316,600.898562,2,Park,Gym,Zoo Exhibit,Escape Room,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
50,M4T,43.679563,-79.377529,600.898562,2,Park,Trail,Playground,Zoo Exhibit,Escape Room,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant


#### Cluster 4


In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Nearest Postal Code,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,M4R,43.711695,-79.416936,496.696233,3,Home Service,Garden,Zoo Exhibit,Event Space,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Escape Room


#### Cluster 5


In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Nearest Postal Code,Latitude,Longitude,Distance,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,M2H,43.778517,-79.346556,1558.768685,4,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Park,Japanese Restaurant,Gas Station,Pharmacy,Intersection,Bakery,Bank
21,M2K,43.789053,-79.408493,911.328822,4,Korean Restaurant,Café,Middle Eastern Restaurant,Pizza Place,Coffee Shop,Bank,Shopping Mall,Diner,Fried Chicken Joint,Japanese Restaurant
22,M2P,43.77012,-79.408493,1023.073557,4,Coffee Shop,Bubble Tea Shop,Ramen Restaurant,Korean Restaurant,Japanese Restaurant,Pizza Place,Fast Food Restaurant,Sandwich Place,Middle Eastern Restaurant,Restaurant
23,M2N,43.752758,-79.400049,1023.073557,4,Coffee Shop,Restaurant,Gym,Park,Optical Shop,Deli / Bodega,Business Service,Dog Run,French Restaurant,Bank
26,M3A,43.745906,-79.352188,992.96292,4,Restaurant,Gym,Japanese Restaurant,Café,Supermarket,Burger Joint,Coffee Shop,Bank,Pizza Place,Asian Restaurant
27,M4A,43.7259,-79.340923,1018.564812,4,Restaurant,Gym,Japanese Restaurant,Café,Supermarket,Burger Joint,Coffee Shop,Bank,Pizza Place,Asian Restaurant
29,M3N,43.76798,-79.487262,1399.854688,4,Coffee Shop,Japanese Restaurant,Gas Station,Restaurant,Bar,Caribbean Restaurant,Fast Food Restaurant,Theater,Furniture / Home Store,Bank
37,M4L,43.676357,-79.293031,994.487615,4,Pub,Pizza Place,Coffee Shop,Beach,Japanese Restaurant,Breakfast Spot,Restaurant,Bar,Park,Sandwich Place
38,M4H,43.70906,-79.363452,601.941971,4,Coffee Shop,Sporting Goods Shop,Shopping Mall,Bank,Sports Bar,Burger Joint,Brewery,Furniture / Home Store,Liquor Store,Bakery
40,M4K,43.685347,-79.338106,651.287678,4,Pizza Place,Coffee Shop,Café,Bookstore,Asian Restaurant,Fast Food Restaurant,Burger Joint,Beer Store,Beer Bar,Ethiopian Restaurant
