<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

# Introduction
Here, the Foursquare API was used to explore neighbourhoods in Toronto and to get the most common venue categories in each neighbourhood. These categories were then used to group the neighbourhoods into clusters with the k-means clustering algorithm. The Folium library was used to visualize the neighbourhoods in Toronto and their emerging clusters.

### Following the data preprocessing completed in the previous notebooks, the Neighbourhood Exploration and Clustering starts <a href = "#Neighbourhood-Exploration-and-Clustering">here.</a>

Importing dependencies...

In [1]:
from bs4 import BeautifulSoup # library to aid webscraping
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

! pip install folium==0.5.0
import folium # plotting library

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values



Download the website's HTML document for scraping

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

data = requests.get(url).text

Create Soup object and extract table data

In [3]:
soup = BeautifulSoup(data, 'html5lib')
table = soup.table
tableData = table.find_all('td')

Below, I collected the strings of each table cell (`td`) into a list, and then gathered these lists into one list.

In [4]:
#collect the strings of each table cell into a list and append each list to a general list
data = []
for cell in tableData:
    rows = []
    for string in cell.stripped_strings:
        rows.append(repr(string)) #repr() adds the extra quotes seen in the output
    data.append(rows)
data[0:5]

[["'M1A'", "'Not assigned'"],
 ["'M2A'", "'Not assigned'"],
 ["'M3A'", "'North York'", "'('", "'Parkwoods'", "')'"],
 ["'M4A'", "'North York'", "'('", "'Victoria Village'", "')'"],
 ["'M5A'",
  "'Downtown Toronto'",
  "'('",
  "'Regent Park'",
  "'/'",
  "'Harbourfront'",
  "')'"]]

I noticed that the lists with borough "Not assigned" have length 2, so I retained only the lists with length > 2.

In [5]:
#retain list of all data without a borough = "Not assigned"
dataFiltered = []
for ind in range(len(data)):
    if len(data[ind]) > 2:
        dataFiltered.append(data[ind])

dataFiltered[0:5]

[["'M3A'", "'North York'", "'('", "'Parkwoods'", "')'"],
 ["'M4A'", "'North York'", "'('", "'Victoria Village'", "')'"],
 ["'M5A'",
  "'Downtown Toronto'",
  "'('",
  "'Regent Park'",
  "'/'",
  "'Harbourfront'",
  "')'"],
 ["'M6A'",
  "'North York'",
  "'('",
  "'Lawrence Manor'",
  "'/'",
  "'Lawrence Heights'",
  "')'"],
 ["'M7A'", '"Queen\'s Park"', "'(Ontario Provincial Government)'"]]

Below, I first removed the redundant quotes from each element.

Next, assuming that the third through last elements of each list were neighbourhood data, I concatenated them into a single string.

In [6]:
#delete apostrophe
for ind in range(len(dataFiltered)):
    for ind2 in range(len(dataFiltered[ind])):
        dataFiltered[ind][ind2] = dataFiltered[ind][ind2].replace("'","")

#make each list a length of 3 by concatenating all elements except the first two
for ind in range(len(dataFiltered)):
    dataFiltered[ind] = dataFiltered[ind][0:2] + [''.join(dataFiltered[ind][2:])]

dataFiltered[0:5]

[['M3A', 'North York', '(Parkwoods)'],
 ['M4A', 'North York', '(Victoria Village)'],
 ['M5A', 'Downtown Toronto', '(Regent Park/Harbourfront)'],
 ['M6A', 'North York', '(Lawrence Manor/Lawrence Heights)'],
 ['M7A', '"Queens Park"', '(Ontario Provincial Government)']]
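The filtering and joining steps above can also be sketched as a single pass over the raw rows. This is only a minimal sketch, not the notebook's code: `raw_rows` is a hand-copied sample of the cell strings shown earlier (as plain strings, without the extra `repr` quoting), and `tidy_row` is a hypothetical helper name.

```python
# Sample of raw table cells, copied from the output shown earlier
# (plain strings, without the extra repr() quoting).
raw_rows = [
    ["M1A", "Not assigned"],
    ["M3A", "North York", "(", "Parkwoods", ")"],
    ["M5A", "Downtown Toronto", "(", "Regent Park", "/", "Harbourfront", ")"],
]


def tidy_row(cells):
    """Collapse one raw row into [postal code, borough, neighbourhood]."""
    postal_code, borough, *fragments = cells
    # Everything after the first two cells is treated as neighbourhood text.
    return [postal_code, borough, "".join(fragments)]


# Rows with only two cells have no assigned borough, so they are skipped.
tidy_rows = [tidy_row(cells) for cells in raw_rows if len(cells) > 2]

for row in tidy_rows:
    print(row)
```

Working on plain strings avoids the later quote-stripping step, which also removes legitimate apostrophes such as the one in "Queen's Park".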

Converting to dataframe...

In [7]:
neighFrame = pd.DataFrame(dataFiltered)
neighFrame.columns = ["Postal Code", "Borough", "Neighbourhood"]
neighFrame

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,(Parkwoods)
1,M4A,North York,(Victoria Village)
2,M5A,Downtown Toronto,(Regent Park/Harbourfront)
3,M6A,North York,(Lawrence Manor/Lawrence Heights)
4,M7A,"""Queens Park""",(Ontario Provincial Government)
...,...,...,...
98,M8X,Etobicoke,(The Kingsway/ Montgomery Road /Old MillNorth)
99,M4Y,Downtown Toronto,(Church and Wellesley)
100,M7Y,East Toronto,Business reply mailProcessing Centre969 Easter...
101,M8Y,Etobicoke,"(Old Mill""South / Kings Mill Park /""Sunnylea/H..."


Cleaning Data...

In [8]:
#delete all double quotes
for column in neighFrame.columns:
    neighFrame[column] = neighFrame[column].str.replace('"', '', regex=False)

#select the rows whose "Neighbourhood" value starts with "(" and delete their brackets
colStartingWthBrack = neighFrame[neighFrame["Neighbourhood"].str.startswith("(")]
#remove ")" (regex=False treats the bracket as a literal character)
neighFrame.loc[colStartingWthBrack.index,"Neighbourhood"] = \
neighFrame.loc[colStartingWthBrack.index,"Neighbourhood"].str.replace(')', '', regex=False)
#remove "("
neighFrame.loc[colStartingWthBrack.index,"Neighbourhood"] = \
neighFrame.loc[colStartingWthBrack.index,"Neighbourhood"].str.replace('(', '', regex=False)

#delete "Business reply mail" from Neighbourhood column row 100
neighFrame.loc[100, "Neighbourhood"] = \
neighFrame.loc[100, "Neighbourhood"][(len("Business reply mail")):]

#change all "/" to ", "
neighFrame["Neighbourhood"] = neighFrame["Neighbourhood"].str.replace('/', ', ', regex=False)

#change all " ," to ","
neighFrame["Neighbourhood"] = neighFrame["Neighbourhood"].str.replace(' ,', ',', regex=False)

#change all "  " to " "
neighFrame["Neighbourhood"] = neighFrame["Neighbourhood"].str.replace('  ', ' ', regex=False)

In [9]:
neighFrame

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queens Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old MillNorth"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Processing Centre969 Eastern(Enclave of M4L)
101,M8Y,Etobicoke,"Old MillSouth, Kings Mill Park, Sunnylea, Humb..."
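The chain of replacements in the cleaning cell can be checked on individual strings. The sketch below mirrors those literal replacements in a plain function; `clean_neighbourhood` is a hypothetical name used only here, and the sample values are copied from the dataframe output shown before cleaning.

```python
def clean_neighbourhood(text):
    """Apply the same literal replacements as the cleaning cell to one string."""
    text = text.replace('"', "")
    # Only strip brackets when the value starts with "(", as in the notebook.
    if text.startswith("("):
        text = text.replace("(", "").replace(")", "")
    text = text.replace("/", ", ")   # separators become ", "
    text = text.replace(" ,", ",")   # no space before a comma
    text = text.replace("  ", " ")   # collapse double spaces
    return text


samples = [
    "(Parkwoods)",
    "(Regent Park/Harbourfront)",
    "(The Kingsway/ Montgomery Road /Old MillNorth)",
]
for raw in samples:
    print(clean_neighbourhood(raw))
```

The order of the replacements matters: the `/` substitution can introduce both a space before a comma and a double space, which the last two steps then tidy up.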


In [10]:
neighFrame.shape

(103, 3)

## Geospatial Coordinates Data Incorporation

In [11]:
path = r"C:\Users\ADESOYE\OneDrive\Coursera\Course 10 Capstone Project\Geospatial_Coordinates.csv"

In [12]:
#read file into pandas dataframe
geoData = pd.read_csv(path)
geoData

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [13]:
#Sorting both the geo data and the neighbourhood dataframe by Postal Code...
geoData.sort_values(by=['Postal Code'], inplace=True)
geoData.reset_index(drop=True, inplace=True) #so that rows line up by position after sorting
neighFrame.sort_values(by=['Postal Code'], inplace=True )
neighFrame.reset_index(drop = True, inplace= True)

In [14]:
#Defining latitude and longitude variables 
latitude, longitude= geoData["Latitude"],geoData["Longitude"]

#Inserting them in the main data frame...
neighFrame["Latitude"], neighFrame["Longitude"] = latitude, longitude

In [15]:
neighFrame

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
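Assigning the coordinate columns directly works only while both dataframes are in the same order and share the same index labels. A key-based join is one alternative. The sketch below is an illustration under assumptions rather than the notebook's method: `neigh_sample` and `geo_sample` are small stand-ins built from rows shown in the outputs above, deliberately put in different orders.

```python
import pandas as pd

# Small stand-ins for neighFrame and geoData, using rows shown in the outputs.
neigh_sample = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge Hill, Port Union, Highland Creek", "Malvern, Rouge"],
})
geo_sample = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Match rows on the shared key instead of relying on row order or index labels.
merged = neigh_sample.merge(geo_sample, on="Postal Code", how="left")
print(merged)
```

With `how="left"`, every neighbourhood row is kept, and a postal code missing from the coordinates file would show up as `NaN` rather than silently receiving another row's coordinates.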


# Neighbourhood Exploration and Clustering

 #### Exploration

Only the neighbourhoods in boroughs whose names end with "Toronto" were segmented and clustered.

In [16]:
dfForClust = neighFrame[neighFrame["Borough"].str.endswith("Toronto")].reset_index(drop=True)
dfForClust.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The DanforthWest, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The BeachesWest",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


Getting the coordinates for each unique borough...

In [17]:
city = 'Toronto, Canada'
boroughs  = dfForClust['Borough'].unique()
geolocator = Nominatim(user_agent="foursquare_agent") # one geocoder reused for every borough
clustLat = []
clustLng = []
for borough in boroughs:
    address = borough + ', ' + city
    location = geolocator.geocode(address)
    clustLat.append(location.latitude)
    clustLng.append(location.longitude)
clustLat

[43.6261221, 43.663461999999996, 43.6563221, 43.663461999999996]

In [18]:
from statistics import mean

mapToronto = folium.Map(location=[mean(clustLat), mean(clustLng)], zoom_start=10)

for lat, lng, label in zip(dfForClust['Latitude'], dfForClust['Longitude'], dfForClust['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mapToronto)  
    
mapToronto

Next, the Foursquare API is used to explore the neighbourhoods and segment them.

#### Define Foursquare Credentials and Version


In [19]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500
ACCESS_TOKEN = 'YOUR_FOURSQUARE_ACCESS_TOKEN'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET
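API credentials are often kept out of the notebook itself. A minimal sketch of one way to do that, assuming the values are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and chosen only for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    The variable names below are an assumption made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Usage with an explicit mapping, so the example runs without real credentials.
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
print("CLIENT_ID loaded:", bool(client_id))
```

Calling the helper with no argument reads from the real environment, and printing only whether a value was loaded keeps the secret out of the saved notebook output.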


Creating a function to get up to 100 venues for each neighbourhood within a radius of 500 meters

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
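The function above indexes straight into the response (`["response"]['groups'][0]['items']`) and into each venue's first category, so an empty result or a venue without a category would raise an error. The sketch below shows a more defensive way to flatten such a response. It is an illustration only: `extract_venues` is a hypothetical helper, the payload shape is inferred from the indexing used above, and `sample_payload` is hand-written, not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows of venue details."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload for illustration only; not real API output.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Pantheon",
                   "location": {"lat": 43.677621, "lng": -79.351434},
                   "categories": [{"name": "Greek Restaurant"}]}},
        {"venue": {"name": "Uncategorised Place",
                   "location": {"lat": 43.6776, "lng": -79.3520},
                   "categories": []}},
    ]}]}
}

rows = extract_venues(sample_payload, "The DanforthWest, Riverdale", 43.679557, -79.352188)
empty = extract_venues({"response": {}}, "Nowhere", 0.0, 0.0)
print(rows)
print(empty)
```

Separating the parsing from the HTTP request also makes this step testable without network access or credentials.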

Creating the dataframe torontoVenues with up to 100 venues for each neighbourhood within a 500 m radius.

In [21]:
torontoVenues = getNearbyVenues(dfForClust['Neighbourhood'], dfForClust['Latitude'], dfForClust['Longitude'], radius=500)

In [22]:
torontoVenues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,46,46,46,46,46,46
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
"CN Tower, King and Spadina, Railway Lands, HarbourfrontWest, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
CentralBay Street,62,62,62,62,62,62
Christie,15,15,15,15,15,15
Church and Wellesley,69,69,69,69,69,69
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,25,25,25,25,25,25
DavisvilleNorth,9,9,9,9,9,9
"Dufferin, Dovercourt Village",13,13,13,13,13,13


As seen above, several neighbourhoods have few venues within the 500 m radius (DavisvilleNorth, for example, has only 9). For a better model, only neighbourhoods with more than 15 venues are retained.

In [23]:
counts = torontoVenues.groupby('Neighbourhood').count()
retainedNeigh = counts[counts['Venue'] > 15]
retainedNeigh.index

Index(['Berczy Park', 'Brockton, Parkdale Village, Exhibition Place',
       'CN Tower, King and Spadina, Railway Lands, HarbourfrontWest, Bathurst Quay, South Niagara, Island airport',
       'CentralBay Street', 'Church and Wellesley',
       'Commerce Court, Victoria Hotel', 'Davisville',
       'First Canadian Place, Underground city', 'Garden District,Ryerson',
       'HarbourfrontEast, Union Station, Toronto Islands',
       'High Park, The JunctionSouth', 'India Bazaar, The BeachesWest',
       'Kensington Market, Chinatown, Grange Park', 'Little Portugal, Trinity',
       'Processing Centre969 Eastern(Enclave of M4L)',
       'Regent Park, Harbourfront', 'Richmond, Adelaide, King',
       'Runnymede, Swansea', 'St. James Town', 'St. James Town, Cabbagetown',
       'Stn A PO Boxes25 The Esplanade(Enclave of M5E)', 'Studio District',
       'The Annex, North Midtown, Yorkville', 'The DanforthWest, Riverdale',
       'Toronto Dominion Centre, Design Exchange',
       'University 
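The cut-off in `counts['Venue'] > 15` is strict, so a neighbourhood with exactly 15 venues is dropped. The short sketch below illustrates the difference between the strict and the inclusive comparison on a few counts copied from the grouped output shown earlier; the dictionary `venue_counts` is a hand-made sample, not the full table.

```python
# Venue counts copied from the grouped output shown earlier.
venue_counts = {
    "Berczy Park": 46,
    "Christie": 15,
    "Davisville": 25,
    "DavisvilleNorth": 9,
    "Dufferin, Dovercourt Village": 13,
}

# Strict comparison, as used in the notebook.
strict = sorted(name for name, n in venue_counts.items() if n > 15)
# Inclusive comparison, which would also keep a count of exactly 15.
inclusive = sorted(name for name, n in venue_counts.items() if n >= 15)

print(strict)
print(inclusive)
```

This matches the retained index above, where Christie (15 venues) does not appear.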

In [24]:
torontoVenues = torontoVenues[torontoVenues['Neighbourhood'].isin(retainedNeigh.index)].reset_index(drop=True)

In [25]:
torontoVenues

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"The DanforthWest, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
1,"The DanforthWest, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
2,"The DanforthWest, Riverdale",43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop
3,"The DanforthWest, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
4,"The DanforthWest, Riverdale",43.679557,-79.352188,Moksha Yoga Danforth,43.677622,-79.352116,Yoga Studio
...,...,...,...,...,...,...,...
1390,Processing Centre969 Eastern(Enclave of M4L),43.662744,-79.321558,Toronto Yoga Mamas,43.664824,-79.324335,Yoga Studio
1391,Processing Centre969 Eastern(Enclave of M4L),43.662744,-79.321558,TTC Stop #03049,43.664470,-79.325145,Light Rail Station
1392,Processing Centre969 Eastern(Enclave of M4L),43.662744,-79.321558,Greenwood Cigar & Variety,43.664538,-79.325379,Smoke Shop
1393,Processing Centre969 Eastern(Enclave of M4L),43.662744,-79.321558,ONE Academy,43.662253,-79.326911,Gym / Fitness Center


In [26]:
print(torontoVenues.shape)
torontoVenues.head()

(1395, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"The DanforthWest, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
1,"The DanforthWest, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
2,"The DanforthWest, Riverdale",43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop
3,"The DanforthWest, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
4,"The DanforthWest, Riverdale",43.679557,-79.352188,Moksha Yoga Danforth,43.677622,-79.352116,Yoga Studio


In [27]:
print('There are {} unique categories.'.format(len(torontoVenues['Venue Category'].unique())))

There are 207 unique categories.


Analysing each neighborhood...

In [28]:
# one hot encoding
torontoOnehot = pd.get_dummies(torontoVenues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
torontoOnehot['Neighbourhood'] = torontoVenues['Neighbourhood'] 

# move neighborhood column to the first column
fixedColumns = [torontoOnehot.columns[-1]] + list(torontoOnehot.columns[:-1])
torontoOnehot = torontoOnehot[fixedColumns]

torontoOnehot.head(50)

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"The DanforthWest, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
torontoOnehot.shape

(1395, 208)

#### Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category


In [30]:
torontoGrouped = torontoOnehot.groupby('Neighbourhood').mean().reset_index()
torontoGrouped

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CentralBay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016129,0.0,0.016129,0.0,0.016129,0.0,0.016129
4,Church and Wellesley,0.014493,0.014493,0.0,0.0,0.0,0.0,0.0,0.0,0.014493,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014493,0.014493
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0
6,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0
8,"Garden District,Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0
9,"HarbourfrontEast, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0


In [31]:
torontoGrouped.shape

(26, 208)
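Taking the mean of a one-hot (0/1) column within a neighbourhood gives the share of that neighbourhood's venues that fall in the category. The sketch below works this out by hand with the standard library; it uses only the first five rows of `torontoVenues` shown earlier, so the resulting shares are illustrative and are not the neighbourhood's actual values.

```python
from collections import Counter, defaultdict

# (neighbourhood, venue category) pairs for the first five rows of torontoVenues.
venues = [
    ("The DanforthWest, Riverdale", "Italian Restaurant"),
    ("The DanforthWest, Riverdale", "Greek Restaurant"),
    ("The DanforthWest, Riverdale", "Ice Cream Shop"),
    ("The DanforthWest, Riverdale", "Ice Cream Shop"),
    ("The DanforthWest, Riverdale", "Yoga Studio"),
]

counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# Mean of a 0/1 column == (rows with that category) / (all rows in the group).
frequencies = {
    neighbourhood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for neighbourhood, cats in counts.items()
}
print(frequencies["The DanforthWest, Riverdale"])
```

Because each venue contributes to exactly one category, the shares within a neighbourhood sum to 1, which puts neighbourhoods with very different venue counts on a comparable scale before clustering.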

Defining a function that sorts a row's venue categories by frequency in descending order and returns the most common ones.

In [32]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [33]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = torontoGrouped['Neighbourhood']

for ind in np.arange(torontoGrouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(torontoGrouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Sandwich Place,Bakery,Farmers Market,Beer Bar,Seafood Restaurant,Vegetarian / Vegan Restaurant,Japanese Restaurant,Italian Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Sandwich Place,Coffee Shop,Breakfast Spot,Café,Climbing Gym,Restaurant,Italian Restaurant,Stadium,Bar,Intersection
2,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Airport Lounge,Coffee Shop,Boat or Ferry,Rental Car Location,Bar,Harbor / Marina,Sculpture Garden,Airport Gate
3,CentralBay Street,Coffee Shop,Sandwich Place,Sushi Restaurant,Italian Restaurant,Japanese Restaurant,Café,Restaurant,Salad Place,Burger Joint,Pizza Place
4,Church and Wellesley,Sushi Restaurant,Japanese Restaurant,Restaurant,Coffee Shop,Gay Bar,Mediterranean Restaurant,Fast Food Restaurant,Indian Restaurant,Gym,Pizza Place
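The column labels above rely on a three-element suffix list plus a fallback to "th", which is fine for ten columns but would mislabel values such as 21. A small helper is one way to generalise this. It is a sketch only, and `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"          # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns[1], columns[4], columns[10])
```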


#### Clustering

In [34]:
# set number of clusters
kclusters = 4

torontoClustering = torontoGrouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(torontoClustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 3, 2, 0, 2, 2, 2, 2, 2])
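To make the clustering step less of a black box, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) that underlies k-means. It is not scikit-learn's implementation, which among other things uses k-means++ initialisation. Here the centroids simply start at the first `k` points, and the function name `lloyd_kmeans` and the toy vectors are made up for illustration.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy frequency vectors (two made-up categories), not taken from the notebook.
toy = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels, centroids = lloyd_kmeans(toy, k=2)
print(labels)
print(centroids)
```

In the notebook, each row of `torontoClustering` plays the role of one point, with one dimension per venue category.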

Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [35]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

torontoMerged = dfForClust[dfForClust['Neighbourhood'].isin(retainedNeigh.index)].reset_index(drop=True)

# join the sorted venues and cluster labels onto the rows taken from dfForClust, which carry latitude/longitude for each neighborhood
torontoMerged = torontoMerged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

torontoMerged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4K,East Toronto,"The DanforthWest, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Yoga Studio,Bank,Spa,Indian Restaurant,Pizza Place,Lounge
1,M4L,East Toronto,"India Bazaar, The BeachesWest",43.668999,-79.315572,1,Fast Food Restaurant,Sandwich Place,Pizza Place,Liquor Store,Italian Restaurant,Steakhouse,Pub,Food & Drink Shop,Restaurant,Burrito Place
2,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Coffee Shop,Gastropub,Café,Italian Restaurant,Bakery,Comfort Food Restaurant,Bank,Bar,Food,Fish Market
3,M4S,Central Toronto,Davisville,43.704324,-79.38879,2,Pizza Place,Sandwich Place,Coffee Shop,Gym,Dessert Shop,Sushi Restaurant,Pharmacy,Thai Restaurant,Brewery,Diner
4,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,0,Coffee Shop,Restaurant,Café,Pub,Italian Restaurant,Bakery,Pizza Place,Yoga Studio,Sandwich Place,Plaza


In [36]:
torontoMerged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4K,East Toronto,"The DanforthWest, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Yoga Studio,Bank,Spa,Indian Restaurant,Pizza Place,Lounge
1,M4L,East Toronto,"India Bazaar, The BeachesWest",43.668999,-79.315572,1,Fast Food Restaurant,Sandwich Place,Pizza Place,Liquor Store,Italian Restaurant,Steakhouse,Pub,Food & Drink Shop,Restaurant,Burrito Place
2,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Coffee Shop,Gastropub,Café,Italian Restaurant,Bakery,Comfort Food Restaurant,Bank,Bar,Food,Fish Market
3,M4S,Central Toronto,Davisville,43.704324,-79.38879,2,Pizza Place,Sandwich Place,Coffee Shop,Gym,Dessert Shop,Sushi Restaurant,Pharmacy,Thai Restaurant,Brewery,Diner
4,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,0,Coffee Shop,Restaurant,Café,Pub,Italian Restaurant,Bakery,Pizza Place,Yoga Studio,Sandwich Place,Plaza
5,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,0,Sushi Restaurant,Japanese Restaurant,Restaurant,Coffee Shop,Gay Bar,Mediterranean Restaurant,Fast Food Restaurant,Indian Restaurant,Gym,Pizza Place
6,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Park,Pub,Bakery,Restaurant,Café,Beer Store,Performing Arts Venue,Spa,Chocolate Shop
7,M5B,Downtown Toronto,"Garden District,Ryerson",43.657162,-79.378937,2,Coffee Shop,Sandwich Place,Clothing Store,Café,Hotel,Japanese Restaurant,Bank,Pizza Place,Cosmetics Shop,Middle Eastern Restaurant
8,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Italian Restaurant,Cocktail Bar,Café,Restaurant,Clothing Store,Gastropub,Gym,Farmers Market,Diner
9,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Sandwich Place,Bakery,Farmers Market,Beer Bar,Seafood Restaurant,Vegetarian / Vegan Restaurant,Japanese Restaurant,Italian Restaurant


Visualising the resulting clusters

In [37]:
# create map
map_clusters = folium.Map(location=[mean(latitude), mean(longitude)], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(torontoMerged['Latitude'], torontoMerged['Longitude'], torontoMerged['Neighbourhood'], torontoMerged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# THE END