
# Introduction: Business Problem 


In this project we will try to find an optimal location for a restaurant. Specifically, this report is targeted at stakeholders interested in opening a restaurant in Toronto, Canada. Toronto is the provincial capital of Ontario and the most populous city in Canada, with a population of 2,731,571 in 2016. As of 2016, the Toronto census metropolitan area (CMA), most of which lies within the Greater Toronto Area (GTA), had a population of 5,928,040, making it Canada's most populous CMA. Toronto is the fastest growing city in North America and is the anchor of an urban agglomeration known as the Golden Horseshoe in Southern Ontario, located on the northwestern shore of Lake Ontario.

Toronto encompasses a geographical area formerly administered by many separate municipalities. These municipalities have each developed a distinct history and identity over the years, and their names remain in common use among Torontonians. Former municipalities include East York, Etobicoke, Forest Hill, Mimico, North York, Parkdale, Scarborough, Swansea, Weston and York. Throughout the city there exist hundreds of small neighbourhoods and some larger neighbourhoods covering a few square kilometres.

With such a large population and geographical area, competition between businesses is intense. It is therefore challenging for a stakeholder or a new business to decide where to open in order to earn higher revenue while facing as little competition as possible. For someone who wants to open a new restaurant in the city, we will try to find the best-suited location, keeping in mind the existing competitors and which income group the restaurant is most likely to attract, based on the population of the neighbourhood. Since there are many restaurants in Toronto, we will try to detect locations that are not already crowded with restaurants. We would also prefer locations as close to the city center as possible, provided the other conditions are met.

We will use data science methods to generate a shortlist of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.
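
The two selection criteria above (few existing restaurants, short distance to the city center) can be sketched as a simple ranking. This is only an illustration and not the method used later in this notebook: the coordinates are taken from the tables further down, while the restaurant counts and the helper names `haversine_km` and `rank_candidates` are made up for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rank_candidates(candidates, center):
    """Sort by restaurant count first, then by distance to the city center."""
    def key(c):
        return (c["restaurants"], haversine_km(c["lat"], c["lon"], center[0], center[1]))
    return sorted(candidates, key=key)

# Toronto center as geocoded later in this notebook
toronto_center = (43.6534817, -79.3839347)

# Hypothetical restaurant counts; only the coordinates come from the notebook
candidates = [
    {"name": "Harbourfront", "lat": 43.640080, "lon": -79.380150, "restaurants": 30},
    {"name": "The Beaches", "lat": 43.671024, "lon": -79.296712, "restaurants": 8},
    {"name": "Christie", "lat": 43.664111, "lon": -79.418405, "restaurants": 8},
]

ranked = rank_candidates(candidates, toronto_center)
for c in ranked:
    print(c["name"], c["restaurants"])
```

Sorting on a tuple keeps the rule explicit: distance only breaks ties between areas with the same restaurant count. A weighted score would be an alternative if the two criteria should trade off against each other.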



# Description of the data and how it will be used to solve the problem.

The goal is to determine which location in the city is best suited for a new restaurant, keeping in mind the competitors and which income group of people the restaurant will attract most, based on the population of the neighbourhood.

For my analysis I chose the following data sources:

City of Toronto Neighbourhood Profiles use Census data to provide a portrait of the demographic, social and economic characteristics of the people and households in each City of Toronto neighbourhood. The profiles present selected highlights from the data, while the accompanying data files provide the full data set assembled for each neighbourhood.

Link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Foursquare data is used to get information about restaurants in Toronto.

Link: https://foursquare.com/explore?mode=url&ne=44.418088%2C-78.362732&q=Restaurant&sw=42.742978%2C-80.554504

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

#pd.set_option('display.expand_frame_repr', False)

In [2]:
#import the library used to query a website
from urllib.request import urlopen

wiki = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"
page = urlopen(wiki)

#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XsLpAApAAD4AAGABjbkAAADO","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":955414546,"wgRevisionId":945633050,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Toronto","Ontario

In [3]:
Table1=soup.find('table', class_='wikitable sortable')
Table1

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Harbourfront</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Lawrence_Heights" title="Lawrence Heights">Lawrence Heights</a>
</td></tr>
<tr>
<td>M6A</td>
<td><a href="/wiki/North

In [4]:
# define the dataframe columns
column_names = ['Postal_Code','Borough', 'Neighborhood'] 

# instantiate the dataframe
pd_Nebr = pd.DataFrame(columns=column_names)

In [5]:
A=[]
B=[]
C=[]

for row in Table1.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==3: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))

        
#Adding Data to our DataFrame
pd_Nebr['Postal_Code']=A
pd_Nebr['Borough']=B
pd_Nebr['Neighborhood']=C

pd_Nebr

Unnamed: 0,Postal_Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


In [6]:
# Remove rows whose borough is "Not assigned" and renumber the index
pd_Nebr = pd_Nebr[~pd_Nebr['Borough'].str.contains("Not assigned", na=False)].reset_index(drop=True)
pd_Nebr


Unnamed: 0,Postal_Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


In [7]:
# Combine neighbourhoods that share a postal code into one comma-separated row
Nebr2 = pd_Nebr.groupby('Postal_Code').agg({'Borough': 'first',
                                            'Neighborhood': ', '.join}).reset_index()

# Keep only boroughs whose name contains "Toronto" and renumber the index
Nebr3 = Nebr2[Nebr2['Borough'].str.contains("Toronto", na=False)].reset_index(drop=True)
Nebr3

Unnamed: 0,Postal_Code,Borough,Neighborhood
0,M4E,East Toronto,The Beaches
1,M4K,East Toronto,"The Danforth West\n, Riverdale"
2,M4L,East Toronto,"The Beaches West\n, India Bazaar"
3,M4M,East Toronto,Studio District\n
4,M4N,Central Toronto,Lawrence Park
5,M4P,Central Toronto,Davisville North\n
6,M4R,Central Toronto,North Toronto West\n
7,M4S,Central Toronto,Davisville\n
8,M4T,Central Toronto,"Moore Park, Summerhill East\n"
9,M4V,Central Toronto,"Deer Park, Forest Hill SE\n, Rathnelly, South ..."
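
The `\n` fragments in the joined names above appear to be trailing newlines carried over from the scraped table cells. A minimal sketch of one way to remove them, assuming it is acceptable to strip whitespace before grouping; the three sample rows reuse names from the table above, and the variable names `raw` and `grouped` are only for this example.

```python
import pandas as pd

# Sample rows with the same trailing newlines seen in the scraped data
raw = pd.DataFrame({
    "Postal_Code": ["M4K", "M4K", "M4M"],
    "Borough": ["East Toronto", "East Toronto", "East Toronto"],
    "Neighborhood": ["The Danforth West\n", "Riverdale", "Studio District\n"],
})

# Strip leading/trailing whitespace (including newlines) from every name
raw["Neighborhood"] = raw["Neighborhood"].str.strip()

# Same grouping as in the cell above, now on the cleaned names
grouped = (
    raw.groupby("Postal_Code")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)
print(grouped)
```

Cleaning the names this early would also keep stray line breaks out of the geocoding queries and the printed neighbourhood lists later on.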


In [8]:
# Same borough filter as above, but keeping one row per neighbourhood (not grouped by postal code)
Nebr_ungrp = pd_Nebr[pd_Nebr['Borough'].str.contains("Toronto", na=False)].reset_index(drop=True)
Nebr_ungrp

Unnamed: 0,Postal_Code,Borough,Neighborhood
0,M5A,Downtown Toronto,Harbourfront
1,M7A,Downtown Toronto,Queen's Park
2,M5B,Downtown Toronto,Ryerson
3,M5B,Downtown Toronto,Garden District
4,M5C,Downtown Toronto,St. James Town
5,M4E,East Toronto,The Beaches
6,M5E,Downtown Toronto,Berczy Park
7,M5G,Downtown Toronto,Central Bay Street
8,M6G,Downtown Toronto,Christie
9,M5H,Downtown Toronto,Adelaide


In [9]:
#!conda install -c conda-forge geopy --yes
import time
from geopy.geocoders import Nominatim
from geopy.util import get_version
get_version()


'1.22.0'

In [10]:
geolocator = Nominatim(user_agent="ES1234")

for row_index, row in Nebr_ungrp.iterrows():
    # Build one query string, e.g. "Harbourfront, Toronto, Ontario, Canada"
    query = '{}, Toronto, Ontario, Canada'.format(str(row['Neighborhood']).strip())

    location = geolocator.geocode(query)
    time.sleep(1)  # pause between requests to stay within Nominatim's rate limit

    # Rows that cannot be geocoded keep NaN coordinates and are dropped later
    if location is not None:
        Nebr_ungrp.loc[row_index, 'Latitude'] = location.latitude
        Nebr_ungrp.loc[row_index, 'Longitude'] = location.longitude

print(Nebr_ungrp)

   Postal_Code           Borough  \
0          M5A  Downtown Toronto   
1          M7A  Downtown Toronto   
2          M5B  Downtown Toronto   
3          M5B  Downtown Toronto   
4          M5C  Downtown Toronto   
5          M4E      East Toronto   
6          M5E  Downtown Toronto   
7          M5G  Downtown Toronto   
8          M6G  Downtown Toronto   
9          M5H  Downtown Toronto   
10         M5H  Downtown Toronto   
11         M5H  Downtown Toronto   
12         M6H      West Toronto   
13         M6H      West Toronto   
14         M5J  Downtown Toronto   
15         M5J  Downtown Toronto   
16         M5J  Downtown Toronto   
17         M6J      West Toronto   
18         M6J      West Toronto   
19         M4K      East Toronto   
20         M4K      East Toronto   
21         M5K  Downtown Toronto   
22         M5K  Downtown Toronto   
23         M6K      West Toronto   
24         M6K      West Toronto   
25         M6K      West Toronto   
26         M4L      East Tor
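
Some lookups can fail for transient reasons (timeouts, service errors), which would leave rows without coordinates. A small retry wrapper is one possible safeguard. This is a sketch under assumptions: `geocode_with_retry` is a hypothetical helper, not part of geopy, and it accepts any callable so it can be tried without network access.

```python
import time

def geocode_with_retry(geocode, query, attempts=3, pause_s=1.0):
    """Call geocode(query) up to `attempts` times, pausing between failed tries.

    `geocode` is any callable that returns a location object or None.
    Returns the first non-None result, or None if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            location = geocode(query)
        except Exception:
            # Deliberately broad in this sketch: any error counts as a failed try
            location = None
        if location is not None:
            return location
        if attempt < attempts - 1:
            time.sleep(pause_s)
    return None

# Stand-in geocoder for demonstration: always answers with fixed coordinates
def fake_geocode(query):
    return (43.6534817, -79.3839347)

print(geocode_with_retry(fake_geocode, "Toronto, Ontario, Canada", pause_s=0))
```

In the loop above the call would look like `geocode_with_retry(geolocator.geocode, query)`. Retrying does not help when a neighbourhood name is simply unknown to the geocoder, so a `None` result still has to be handled.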

In [13]:
import json 

import requests 
from pandas.io.json import json_normalize 

import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means clustering, used later to group the neighbourhoods
from sklearn.cluster import KMeans

# folium renders the interactive maps
!conda install -c conda-forge folium=0.5.0 --yes  # run once; skip if folium is already installed
import folium

In [14]:
print('We have {} boroughs and {} neighborhoods.'.format(
        len(Nebr_ungrp['Borough'].unique()),
        Nebr_ungrp.shape[0]
    )
)

Nebr_ungrp.dropna(inplace =True)
Nebr_ungrp.index = pd.RangeIndex(len(Nebr_ungrp.index))

address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="ES1234")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

We have 4 boroughs and 74 neighborhoods.
The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [15]:
Nebr_ungrp

Unnamed: 0,Postal_Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.640080,-79.380150
1,M7A,Downtown Toronto,Queen's Park,43.659659,-79.390340
2,M5B,Downtown Toronto,Ryerson,43.658469,-79.378993
3,M5B,Downtown Toronto,Garden District,43.656500,-79.377114
4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704
5,M4E,East Toronto,The Beaches,43.671024,-79.296712
6,M5E,Downtown Toronto,Berczy Park,43.647984,-79.375396
7,M6G,Downtown Toronto,Christie,43.664111,-79.418405
8,M5H,Downtown Toronto,Adelaide,43.650486,-79.379498
9,M5H,Downtown Toronto,King,43.648949,-79.377754


In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(Nebr_ungrp['Latitude'], Nebr_ungrp['Longitude'], Nebr_ungrp['Borough'], Nebr_ungrp['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [17]:
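import os

# Read the credentials from environment variables instead of hard-coding them (the variable names are this notebook's own choice)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version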

print('Successfully Logged-In')

Successfully Logged-In


In [18]:
# Coordinates of the first neighbourhood in the table
neighborhood_latitude = float(Nebr_ungrp.loc[0, 'Latitude'])
neighborhood_longitude = float(Nebr_ungrp.loc[0, 'Longitude'])

In [19]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
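
The same request URL can also be assembled from a dictionary of parameters, which avoids the stray `?&` and takes care of escaping. A minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper name, and the credentials in the usage line are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return a venues/explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude as one value
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; the coordinates are those of Harbourfront
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.64008, -79.38015)
print(url)
```

`urlencode` percent-encodes the comma inside `ll`, which is equivalent to the literal comma for the receiving server.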

In [20]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ec2ee8171c428001b240d4e'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Harbourfront',
  'headerFullLocation': 'Harbourfront, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 124,
  'suggestedBounds': {'ne': {'lat': 43.6445801045, 'lng': -79.37394296546611},
   'sw': {'lat': 43.635580095499996, 'lng': -79.38635603453389}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e49413e81dc766f3e3d6312',
       'name': 'Harbour Square Park',
       'location': {'address': '25 Queens Quay West',
        'lat': 43.63925269130776,
        'lng': -79.37839547902978,
        'labeledLatLngs': [{'label': 'display',
          'lat

In [21]:
def get_category_type(row):
    """Return the name of a venue's first category, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Harbour Square Park,Park,43.639253,-79.378395
1,Lake Ontario,Lake,43.638945,-79.379665
2,Harbourfront,Neighborhood,43.639526,-79.380688
3,Miku,Japanese Restaurant,43.641374,-79.377531
4,Natrel Pond/Rink,Skating Rink,43.638431,-79.382528


In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
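
The list comprehension in `getNearbyVenues` assumes every venue has at least one category, so `categories[0]` would raise an `IndexError` for a venue without one. A more defensive variant of the row extraction could look like the sketch below. `extract_venue_rows` is a hypothetical helper, and the second item in the sample payload is invented to show the missing-category case.

```python
def extract_venue_rows(neighborhood, lat, lng, items):
    """Turn the 'items' list of an explore response into flat tuples.

    Missing fields become None instead of raising an error.
    """
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood, lat, lng,
            venue.get("name"),
            location.get("lat"), location.get("lng"),
            category,
        ))
    return rows

# First item mirrors the response shown earlier; the second one is made up
sample_items = [
    {"venue": {"name": "Harbour Square Park",
               "location": {"lat": 43.63925269130776, "lng": -79.37839547902978},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.64, "lng": -79.38},
               "categories": []}},
]

rows = extract_venue_rows("Harbourfront", 43.64008, -79.38015, sample_items)
for row in rows:
    print(row)
```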

In [25]:
toronto_venues = getNearbyVenues(names=Nebr_ungrp['Neighborhood'],
                                   latitudes=Nebr_ungrp['Latitude'],
                                   longitudes=Nebr_ungrp['Longitude']
                                  )

Harbourfront
Queen's Park
Ryerson

Garden District

St. James Town
The Beaches
Berczy Park
Christie

Adelaide

King

Richmond

Dovercourt Village
Dufferin

Harbourfront East

Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West

Riverdale
Design Exchange
Toronto Dominion Centre
Brockton

Exhibition Place
Parkdale Village
The Beaches West

India Bazaar
Commerce Court
Studio District

Lawrence Park
Roselawn

Davisville North

Forest Hill North
High Park
The Junction South

The Annex
Yorkville
Parkdale
Roncesvalles
Davisville

Harbord

University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East

Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE

Rathnelly
South Hill
Summerhill West

CN Tower
Bathurst Quay

Harbourfront West

King and Spadina
South Niagara
Rosedale
Cabbagetown
St. James Town
First Canadian Place
Underground city
Church and Wellesley


In [26]:
print(toronto_venues.shape)
toronto_venues.head()

(3358, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.64008,-79.38015,Harbour Square Park,43.639253,-79.378395,Park
1,Harbourfront,43.64008,-79.38015,Lake Ontario,43.638945,-79.379665,Lake
2,Harbourfront,43.64008,-79.38015,Harbourfront,43.639526,-79.380688,Neighborhood
3,Harbourfront,43.64008,-79.38015,Miku,43.641374,-79.377531,Japanese Restaurant
4,Harbourfront,43.64008,-79.38015,Natrel Pond/Rink,43.638431,-79.382528,Skating Rink


In [27]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,100,100,100,100,100,100
Bathurst Quay,24,24,24,24,24,24
Berczy Park,100,100,100,100,100,100
Brockton,17,17,17,17,17,17
CN Tower,59,59,59,59,59,59
Cabbagetown,47,47,47,47,47,47
Chinatown,58,58,58,58,58,58
Christie,58,58,58,58,58,58
Church and Wellesley,74,74,74,74,74,74
Commerce Court,100,100,100,100,100,100
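
Since the goal is to avoid areas already crowded with restaurants, it can help to count restaurant venues separately from the total. The sketch below is an illustration and not a step the notebook performs: it rebuilds the five Harbourfront rows shown earlier and treats any category whose name contains "Restaurant" as a restaurant, which is an assumption (cafés or pizza places, for instance, would not be counted).

```python
import pandas as pd

# The five Harbourfront venues from the head() output above
venues = pd.DataFrame({
    "Neighborhood": ["Harbourfront"] * 5,
    "Venue": ["Harbour Square Park", "Lake Ontario", "Harbourfront", "Miku", "Natrel Pond/Rink"],
    "Venue Category": ["Park", "Lake", "Neighborhood", "Japanese Restaurant", "Skating Rink"],
})

# Flag categories whose name mentions "Restaurant"
is_restaurant = venues["Venue Category"].str.contains("Restaurant", case=False)

# Per neighbourhood: number of restaurants, number of venues, and their ratio
summary = (
    venues.assign(is_restaurant=is_restaurant)
          .groupby("Neighborhood")["is_restaurant"]
          .agg(restaurants="sum", venues="count")
          .reset_index()
)
summary["restaurant_share"] = summary["restaurants"] / summary["venues"]
print(summary)
```

Note that the venue counts are capped by `LIMIT = 100`, so neighbourhoods that reach 100 venues may be denser than the numbers suggest.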


In [28]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 284 unique categories.


In [29]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Service,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
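
One detail is worth checking here. Foursquare returns a venue category that is itself called `Neighborhood` (the third Harbourfront venue above), so assigning `toronto_onehot['Neighborhood']` appears to overwrite that category column instead of adding a new one. This would explain why the table starts with `Yoga Studio` and why its width equals the number of categories. A minimal sketch of one way to avoid the collision, with the replacement column label `Neighborhood (venue)` chosen only for this example:

```python
import pandas as pd

# Three venues from the Harbourfront sample, including the clashing category
venues = pd.DataFrame({
    "Neighborhood": ["Harbourfront", "Harbourfront", "Harbourfront"],
    "Venue Category": ["Park", "Neighborhood", "Japanese Restaurant"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Rename the category column that would otherwise collide with the name column
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (venue)"})

# insert() puts the neighbourhood name first and raises if the label already exists
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(list(onehot.columns))
```

Using `insert` instead of reordering by position also removes the need for the `fixed_columns` step, since the name column is placed first directly.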


In [30]:
toronto_onehot.shape

(3358, 284)

In [31]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Service,American Restaurant,Antique Shop,Aquarium,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Adelaide,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.040000,0.000000,0.000000,...,0.00,0.010000,0.000000,0.000000,0.000000,0.00,0.010000,0.000000,0.000000,0.000000
1,Bathurst Quay,0.000000,0.000000,0.000000,0.000000,0.041667,0.041667,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000
2,Berczy Park,0.010000,0.000000,0.000000,0.000000,0.000000,0.000000,0.010000,0.010000,0.000000,...,0.00,0.010000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000
3,Brockton,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.000000,0.117647,0.00,0.000000,0.000000,0.000000,0.000000
4,CN Tower,0.016949,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.033898,...,0.00,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.016949,0.000000,0.000000
5,Cabbagetown,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000
6,Chinatown,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.034483,0.017241,0.000000,0.034483,0.00,0.017241,0.000000,0.000000,0.000000
7,Christie,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.017241,0.000000,0.000000,...,0.00,0.000000,0.000000,0.017241,0.017241,0.00,0.017241,0.000000,0.000000,0.000000
8,Church and Wellesley,0.027027,0.000000,0.013514,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.00,0.000000,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.013514,0.000000
9,Commerce Court,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.040000,0.000000,0.000000,...,0.00,0.010000,0.000000,0.000000,0.000000,0.00,0.010000,0.000000,0.000000,0.000000


In [32]:
toronto_grouped.shape

(64, 284)

In [33]:
Top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(Top_venues))
    print('\n')

----Adelaide
----
                 venue  freq
0                 Café  0.06
1          Coffee Shop  0.06
2                  Gym  0.04
3  American Restaurant  0.04
4           Restaurant  0.04


----Bathurst Quay
----
                  venue  freq
0           Coffee Shop  0.17
1                  Café  0.12
2                  Park  0.08
3                Tunnel  0.04
4  Caribbean Restaurant  0.04


----Berczy Park----
                venue  freq
0         Coffee Shop  0.08
1                Café  0.06
2          Restaurant  0.05
3  Seafood Restaurant  0.04
4  Italian Restaurant  0.04


----Brockton
----
                   venue  freq
0                    Bar  0.18
1                   Park  0.12
2  Vietnamese Restaurant  0.12
3              Gastropub  0.06
4      Korean Restaurant  0.06


----CN Tower----
              venue  freq
0             Hotel  0.08
1       Coffee Shop  0.07
2       Pizza Place  0.07
3  Baseball Stadium  0.05
4               Bar  0.05


----Cabbagetown----
          

In [34]:
def return_most_common_venues(row, Top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:Top_venues]

In [35]:
Top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(Top_venues):
    try:
        columns.append('{}{} Popular Venues'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Popular Venues'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], Top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
0,Adelaide,Coffee Shop,Café,Gastropub,Gym,Restaurant,American Restaurant,Japanese Restaurant,Cosmetics Shop,Seafood Restaurant,Bookstore
1,Bathurst Quay,Coffee Shop,Café,Park,Dance Studio,Ramen Restaurant,Caribbean Restaurant,Garden,Sculpture Garden,Sushi Restaurant,Grocery Store
2,Berczy Park,Coffee Shop,Café,Restaurant,Japanese Restaurant,Seafood Restaurant,Italian Restaurant,Beer Bar,Breakfast Spot,Cocktail Bar,Gastropub
3,Brockton,Bar,Park,Vietnamese Restaurant,Gastropub,Grocery Store,Korean Restaurant,Bakery,Jazz Club,Dive Bar,Coffee Shop
4,CN Tower,Hotel,Pizza Place,Coffee Shop,Bar,Baseball Stadium,Ice Cream Shop,Park,Scenic Lookout,Aquarium,Gym
5,Cabbagetown,Restaurant,Coffee Shop,Café,Indian Restaurant,Bakery,Beer Store,Pizza Place,Gastropub,Japanese Restaurant,Pub
6,Chinatown,Dessert Shop,Café,Mexican Restaurant,Vegetarian / Vegan Restaurant,Bar,Vietnamese Restaurant,Coffee Shop,Bakery,Clothing Store,Belgian Restaurant
7,Christie,Korean Restaurant,Coffee Shop,Indian Restaurant,Dessert Shop,Karaoke Bar,Sandwich Place,Japanese Restaurant,Ice Cream Shop,Grocery Store,Cocktail Bar
8,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Café,Restaurant,Yoga Studio,Grocery Store,Hotel,Gay Bar,Gastropub
9,Commerce Court,Coffee Shop,Restaurant,Café,Italian Restaurant,Hotel,Japanese Restaurant,American Restaurant,Gym,Deli / Bodega,Seafood Restaurant


In [36]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
#print(toronto_grouped_clustering)
#print(toronto_grouped)
# run k-means clustering
kmeans = KMeans(init = "k-means++", n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# inspect the first 63 cluster labels (kmeans.labels_ holds one label per row of toronto_grouped)
labels = kmeans.labels_[0:63] 
print(labels)

[4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4
 4 3 4 4 2 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 0 4 4 4 4]
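
Before interpreting the clusters it is useful to look at how many neighbourhoods fall into each one. The sketch below simply counts the labels printed above (copied as text); it is a quick sanity check, not part of the original analysis.

```python
from collections import Counter

# Labels copied from the printed output above, without the brackets
labels_text = """
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4
4 3 4 4 2 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 0 4 4 4 4
"""
printed_labels = [int(token) for token in labels_text.split()]

# Count how many neighbourhoods each cluster received
sizes = Counter(printed_labels)
for cluster, size in sorted(sizes.items()):
    print("cluster {}: {} neighbourhoods".format(cluster, size))
```

Almost all neighbourhoods land in cluster 4, with only a handful in the other four. A split this uneven suggests the clusters mostly isolate a few outliers, so a different number of clusters or a different feature set may be worth trying.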


In [37]:
# Attach the cluster labels to the per-neighbourhood table, which has the same
# row order as toronto_grouped (one row per unique neighbourhood name)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join on the neighbourhood name so that every row of Nebr_ungrp (including
# repeated names) receives its cluster label and top venues
toronto_merged = Nebr_ungrp.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!
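
The cluster labels come back in the row order of `toronto_grouped`, which has one row per unique neighbourhood name (64), while `Nebr_ungrp` has one row per geocoded entry (65, because St. James Town is listed twice). Labels are therefore safer to attach by neighbourhood name than by position. A minimal sketch with a miniature of the two tables; the second postal code and both cluster labels are invented for the example.

```python
import pandas as pd

# One row per geocoded entry; the same name can appear more than once
places = pd.DataFrame({
    "Postal_Code": ["M5C", "M4X", "M5A"],  # "M4X" is a made-up value for this sketch
    "Neighborhood": ["St. James Town", "St. James Town", "Harbourfront"],
})

# One row per unique neighbourhood name, with hypothetical cluster labels
grouped = pd.DataFrame({
    "Neighborhood": ["Harbourfront", "St. James Town"],
    "Cluster Labels": [4, 2],
})

# A left merge on the name gives every entry its label, however often the name repeats
merged = places.merge(grouped, on="Neighborhood", how="left")
print(merged)
```

Both St. James Town rows receive the same label, and the result keeps the row count and order of the left table.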

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]