# Capstone Project - The Battle of the Neighborhoods 

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project, we will try to find the best location or neighborhood for a new hotel. Specifically, this report will be targeted to potential investors and other relevant stakeholders who are interested in opening a hotel in Toronto, Canada.


Toronto already has many hotels, so we are looking for areas that are not saturated with them and that also have the most tourist spots. We also want restaurants nearby, but not too many: our hotel will have its own restaurant, so we need a balance between giving customers a choice and promoting our own restaurant service. Finally, we will check whether the location is near the airport, on the basis that this makes the hotel easier to reach for tourists who are new to Toronto. This is not the most important factor, however; it will serve as a tie-breaker between locations that meet the other criteria.

We will use statistical tools to find suitable locations and investigate the criteria above. This should help maximise business for the hotel, and for each candidate we will explain why it is a suitable choice for stakeholders.

## Data <a name="data"></a>

We are looking for neighborhoods with:
* Fewer hotels in the vicinity
* More tourist attractions nearby
* Proximity to the airport or city center

We will scrape the list of postcodes from the Wikipedia page into a dataframe.
The unique postcodes define our neighborhoods, and we will get their geographical location using the geocoder package.
The number, type and location of tourist attractions in each neighborhood will be obtained using the Foursquare API.


We import the packages we need as we go.

In [5]:
import pandas as pd
import numpy as np

We scrape the table of Toronto postal codes from the Wikipedia page and inspect the resulting dataframe before preprocessing.

In [6]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

df=pd.read_html(url, header=0, flavor='html5lib')[0]
df.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


We get rid of the "Not assigned" values in the "Borough" column, and we check whether there are any "Not assigned" values in the "Neighbourhood" column.

In [7]:
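# Keep only the rows that have an assigned borough
df = df[df["Borough"] != "Not assigned"]
# Select rows with an unassigned neighbourhood (not displayed, as this is not the last expression in the cell)
df[df["Neighbourhood"] == "Not assigned"] 
df = df.reset_index(drop=True)
df.shape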

(103, 3)

There are no invalid values in our dataframe now, so we can go on to get the coordinates of the neighborhoods. We have 103 neighborhoods altogether.

We tried to use the geocoder package, but it took too long to collect the required data, so we will instead use a dataset from IBM that has the geographical coordinates for all postcodes in Toronto, Canada.
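The cell above selects the rows with an unassigned neighbourhood but never displays or asserts the result. A minimal sketch of making that check explicit, on a toy frame (the three rows are illustrative stand-ins that reuse the column names of the scraped table):

```python
import pandas as pd

# Toy stand-in for the scraped table; the rows are illustrative only.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Drop rows without a borough, then count any leftover unassigned neighbourhoods.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
n_unassigned = int((toy["Neighbourhood"] == "Not assigned").sum())
print(toy.shape, n_unassigned)
```

Counting the leftover rows turns the check into a number that can be printed or asserted, instead of an expression whose value is silently discarded.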

In [8]:
geodf=pd.read_csv("http://cocl.us/Geospatial_data")
geodf.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We merge our dataframes on the "Postal Code" column.

In [9]:
df_merged = pd.merge(df, geodf, on='Postal Code')
df_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Now we have Postal Code, Borough, Neighborhood, Latitude and Longitude in the same dataframe. 
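`pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped without warning. A hedged sketch of how this could be checked on toy data (the code `M9Z` and its borough are made up for illustration):

```python
import pandas as pd

# Toy frames; "M9Z" is a made-up postal code with no coordinates.
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Hypothetical"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# how="left" keeps every postal code; indicator=True adds a "_merge" column
# that records whether a match was found; validate guards against duplicates.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner join used in the notebook loses nothing.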

## Methodology <a name="methodology"></a>

From here, we want to find the top venues near each neighborhood and maximise the number of venues that we think would attract tourists or hotel customers.
We will trim the original list of neighborhoods by requiring a minimum number of attractions within a 1km radius of the neighborhood.

Then we will apply one-hot encoding so that we can see the relative frequency of each venue category in a neighborhood and gain insights.
This also lets us use a clustering model to group the shortlisted neighborhoods.
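To make the step from one-hot encoding to per-neighborhood frequencies concrete, here is a small sketch on made-up data (the neighborhoods `A` and `B` and their venues are illustrative only):

```python
import pandas as pd

# Toy venue list: neighbourhood A has 4 venues, B has 2.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Gym", "Beach", "Gym", "Gym"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so every neighborhood becomes a vector of category shares that a clustering algorithm can compare directly.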

We will visualise the clusters on a map and heuristically analyse each cluster's position relative to the airport and the city center.
We will then use the map and the insights from the one-hot encoding to give a breakdown of the positives and negatives of each neighborhood for a potential hotel, followed by a personal conclusion.

## Analysis <a name="analysis"></a>

Importing the relevant modules we need in our analysis.

In [10]:
import requests
from sklearn.cluster import KMeans
# !pip install folium
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

Now that our relevant modules are all imported, we can store our Foursquare credentials in variables that are easy to reuse.
We also set the number of top venues returned to 50 and the search radius to 1 kilometer.

In [64]:
CLIENT_ID = '########################' # your Foursquare ID
CLIENT_SECRET = '########################' # your Foursquare Secret
ACCESS_TOKEN = "########################" # your FourSquare Access Token
LIMIT = 50 #top 50 venues returned
radius = 1000 # 1km

We create a function that uses the Foursquare API to retrieve all the nearby venues as a dataframe.

We looked up the following IDs for the categories of venues we are interested in near our hotel.
* 4d4b7105d754a06377d81259 - Outdoors and Recreation
* 4d4b7105d754a06376d81259 - Nightlife spot
* 4d4b7105d754a06373d81259 - Event
* 4d4b7104d754a06370d81259 - Arts and Entertainment

In [18]:
Cat = ["4d4b7105d754a06377d81259","4d4b7105d754a06376d81259","4d4b7105d754a06373d81259","4d4b7104d754a06370d81259"]

The function returns a dataframe with the name, location and specific category of each venue near the input neighborhood. We pass in the category IDs listed above, which are the categories most likely to contain tourist sights, so that the function returns tourist attractions near each neighborhood.

In [65]:
def getNearbyVenues(names, latitudes, longitudes, radius, cat):
    """Return one row per venue found within `radius` metres of each neighbourhood."""
    venues_list = []
    version = '20180604'  # Foursquare API version date
    for name, lat, lng in zip(names, latitudes, longitudes):
        # Explore endpoint, restricted to the comma-separated category IDs in `cat`
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, CLIENT_SECRET, version, lat, lng, radius, LIMIT, ','.join(cat))
        results = requests.get(url).json()['response']['groups'][0]['items']
        # Keep the venue name, coordinates and primary category
        venues_list.append([(name, lat, lng,
                             v['venue']['name'],
                             v['venue']['location']['lat'],
                             v['venue']['location']['lng'],
                             v['venue']['categories'][0]['name']) for v in results])
    # Flatten the per-neighbourhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue',
                             'Venue Latitude', 'Venue Longitude', 'Venue Category']
    return nearby_venues
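The request URL in the function is assembled by positional string formatting, which is easy to get out of order. A minimal sketch of building the same query string from a dictionary with the standard library; `build_explore_url` and the placeholder credentials are hypothetical names introduced for illustration, and no request is sent:

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, radius, limit, category_ids,
                      client_id="YOUR_ID", client_secret="YOUR_SECRET",
                      version="20180604"):
    """Hypothetical helper: assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "categoryId": ",".join(category_ids),
    }
    # safe="," leaves the commas in ll and categoryId unescaped.
    return ("https://api.foursquare.com/v2/venues/explore?"
            + urlencode(params, safe=","))

url = build_explore_url(43.65426, -79.360636, 1000, 50,
                        ["4d4b7105d754a06377d81259", "4d4b7104d754a06370d81259"])
print(url)
```

Naming each parameter makes it harder to swap two values by accident, and `",".join(...)` accepts any number of category IDs.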

In [66]:
venues = getNearbyVenues(df_merged['Neighbourhood'], df_merged['Latitude'], df_merged['Longitude'],radius,Cat)

In [67]:
venues.shape

(1587, 7)

We have found 1,587 tourist attraction venues in total across our neighborhoods. Next we see which neighborhoods have the most attractions nearby, so that we can shortlist from the original 103 locations. Note that we search within a 1km radius and return at most the top 50 hits. The cap is an arbitrary decision, on the logic that 50 attractions within 1km is plenty for potential hotel customers.

In [77]:
venue_count=venues.groupby(by="Neighbourhood")[["Venue"]].count().sort_values(by=['Venue'],ascending=False)
candidates = venue_count[venue_count["Venue"]>=20]
candidates.shape

(28, 1)

We require at least 20 attractions near the hotel, which leaves 28 candidate neighborhoods. We now explore these candidates to see which attractions are nearby and whether that gives any insight into the kind of customers the hotel would attract in each location. All of this assumes that customers choose our hotel solely because of the nearby attractions and services.
Later we will explore the categories that have a good balance of restaurants.
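The shortlisting rule is a count per neighborhood followed by a threshold. A toy sketch of the same pattern (the data and the threshold of 2 are illustrative; the notebook uses 20):

```python
import pandas as pd

# Toy venue list: A has 3 venues, B has 1, C has 2.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "C", "C"],
    "Venue": ["v1", "v2", "v3", "v4", "v5", "v6"],
})
min_venues = 2  # illustrative threshold

# Count venues per neighbourhood, largest first, then keep those at or above the threshold.
counts = (toy_venues.groupby("Neighbourhood")["Venue"]
          .count()
          .sort_values(ascending=False))
kept = counts[counts >= min_venues].index.tolist()
print(kept)
```

Because the request caps results at `LIMIT = 50`, counts in the real data cannot exceed 50, so the ranking cannot distinguish between neighborhoods that hit the cap.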

In [102]:
# Keep only the venues that belong to a shortlisted neighbourhood
cand = list(candidates.index.values)
df_candidates = venues[venues["Neighbourhood"].isin(cand)].reset_index(drop=True)
df_candidates.shape

(1084, 7)

We have trimmed the 1,587 venues to the 1,084 that belong to the neighborhoods we are going to analyse. We apply one-hot encoding to see the distribution of venues among the neighborhoods and to calculate the share of each venue category in each neighborhood.

In [104]:
df_filtered = pd.get_dummies(df_candidates[['Venue Category']], prefix="", prefix_sep="")
df_filtered.rename(columns={'Neighbourhood':'Neighbourhood (category)'}, inplace=True)
df_filtered['Neighbourhood'] = df_candidates['Neighbourhood'] 
fixed_columns = [df_filtered.columns[-1]] + list(df_filtered.columns[:-1])
df_filtered = df_filtered[fixed_columns]
df_filtered.head()

Unnamed: 0,Neighbourhood,Athletics & Sports,Baseball Field,Beach,Botanical Garden,Boxing Gym,Castle,Climbing Gym,Curling Ice,Cycle Studio,...,Skate Park,Skating Rink,Ski Area,Soccer Field,Sports Club,Summer Camp,Tennis Court,Track,Trail,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


df_grouped = df_filtered.groupby('Neighbourhood').mean().reset_index()

We find the 10 most common venue categories in each neighborhood so that the neighborhoods are easy to analyse.

In [107]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [154]:
topven = 10
subs = ['st', 'nd', 'rd']
columns = ['Neighbourhood']
for ind in np.arange(topven):
    try:
        # 1st, 2nd, 3rd use the suffix list; later ranks fall through to "th"
        columns.append('{}{} Most Common Venue'.format(ind+1, subs[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
dfs = pd.DataFrame(columns=columns)
dfs['Neighbourhood'] = df_grouped['Neighbourhood']
for ind in np.arange(df_grouped.shape[0]):
    dfs.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], topven)
dfs

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Gym,Park,Gym / Fitness Center,Harbor / Marina,Plaza,Yoga Studio,Gym Pool,Roof Deck,Beach,Fountain
1,"Brockton, Parkdale Village, Exhibition Place",Park,Gym / Fitness Center,Yoga Studio,Gym,Plaza,Boxing Gym,Climbing Gym,Farm,Harbor / Marina,Trail
2,"Business reply mail Processing Centre, South C...",Park,Gym / Fitness Center,Harbor / Marina,Yoga Studio,Skate Park,Beach,Dog Run,Garden,Gym,Martial Arts School
3,Central Bay Street,Gym,Gym / Fitness Center,Park,Plaza,Sculpture Garden,Yoga Studio,Garden,Pool,Skating Rink,Lake
4,Christie,Park,Martial Arts School,Gym,Rock Climbing Spot,Baseball Field,Gym / Fitness Center,Playground,Pool,Athletics & Sports,Skating Rink
5,Church and Wellesley,Park,Gym,Gym / Fitness Center,Yoga Studio,Plaza,Dog Run,Botanical Garden,Garden,Pilates Studio,Playground
6,"Commerce Court, Victoria Hotel",Gym,Park,Plaza,Gym / Fitness Center,Yoga Studio,Garden,Scenic Lookout,Fountain,Gym Pool,Playground
7,Davisville,Gym,Yoga Studio,Plaza,Tennis Court,Gym / Fitness Center,Park,Indoor Play Area,Trail,Pilates Studio,Martial Arts School
8,Davisville North,Gym,Yoga Studio,Park,Athletics & Sports,Gym / Fitness Center,Dog Run,Plaza,Gym Pool,Track,Pilates Studio
9,"First Canadian Place, Underground city",Gym,Park,Plaza,Gym / Fitness Center,Yoga Studio,Garden,Scenic Lookout,Baseball Field,Fountain,Gym Pool


Now we have a dataframe that gives the 10 most common venue categories near each of our neighborhoods.
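To show how the ranking works on a single row, here is a sketch on a toy frequency table. `top_categories` is a hypothetical variant of `return_most_common_venues` that drops the label column by name rather than by position:

```python
import pandas as pd

def top_categories(row, n):
    # Hypothetical helper: sort the category shares of one neighbourhood, largest first.
    freqs = row.drop("Neighbourhood").astype(float)
    return freqs.sort_values(ascending=False).index[:n].tolist()

# Toy frequency table; each row sums to 1 and has no tied values.
toy_grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Beach": [0.2, 0.0],
    "Gym": [0.3, 0.9],
    "Park": [0.5, 0.1],
})
top2 = {row["Neighbourhood"]: top_categories(row, 2)
        for _, row in toy_grouped.iterrows()}
print(top2)
```

One caveat when reading the real table: a neighborhood with fewer than 10 distinct categories still gets 10 entries, so the later ranks may be filled by categories whose frequency is 0.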

We keep this as a reference when we fit a clustering model to our neighborhood candidates. We notice a lot of gyms and sports-related venues. A hotel neighborhood does not need that many, so we could have left the "Outdoors and Recreation" category out of the Foursquare request. Instead we keep it and see whether the clustering separates the most sports- and gym-heavy neighborhoods from the well-balanced ones we need.

We run k-means with 4 clusters and then observe which cluster provides the best balance between the outdoor and recreation attractions and the other categories we queried from Foursquare: Arts and Entertainment, Nightlife Spot and Event.

In [155]:
k = 4
dfc = df_grouped.drop('Neighbourhood', axis=1)  # pass axis by keyword; the positional form fails on recent pandas
kmeans = KMeans(n_clusters=k, random_state=1234).fit(dfc)
kmeans.labels_

array([1, 0, 0, 3, 0, 2, 1, 1, 3, 1, 3, 3, 0, 3, 0, 3, 2, 2, 1, 1, 0, 1, 2,
       2, 2, 2, 1, 2])
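The notebook fixes the number of clusters at 4 without comparing alternatives. One common sanity check, which is not part of the original analysis, is to compare silhouette scores over a small range of k. The sketch below uses a synthetic matrix as a stand-in for `dfc`, so its scores say nothing about the real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 28-row frequency matrix: each row sums to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(6), size=28)

scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, random_state=1234,
                    n_init=10).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
    scores[n_clusters] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only 28 neighborhoods the scores are likely to be noisy, so this is a rough guide rather than a definitive way to choose k.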

In [156]:
dfs.insert(0, 'Cluster Labels', kmeans.labels_)
dfm = df_candidates
dfm = dfm.join(dfs.set_index('Neighbourhood'), on='Neighbourhood')


We visualise the clusters on a map with the folium package to see their location relative to the nearby airport and the city center. This gives a good indication and will support our decision when analysing each cluster.

In [157]:
toronto_coordinates =[43.6532, -79.3832]
map_clusters = folium.Map(location=toronto_coordinates, zoom_start=11)
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
carray = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in carray]
markers_colors = []
for lat, lon, poi, cluster in zip(dfm['Venue Latitude'], dfm['Venue Longitude'],
                                  dfm['Neighbourhood'], dfm['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True,
                        fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(map_clusters)
map_clusters

In [174]:
clusters=dfm.drop(["Venue","Venue Latitude","Venue Longitude","Venue Category"],axis=1)
cluster1=clusters[clusters["Cluster Labels"]==0].drop_duplicates().reset_index().drop("index",axis=1)
cluster2=clusters[clusters["Cluster Labels"]==1].drop_duplicates().reset_index().drop("index",axis=1)
cluster3=clusters[clusters["Cluster Labels"]==2].drop_duplicates().reset_index().drop("index",axis=1)
cluster4=clusters[clusters["Cluster Labels"]==3].drop_duplicates().reset_index().drop("index",axis=1)

In [180]:
cluster1

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Christie,43.669542,-79.422564,0,Park,Martial Arts School,Gym,Rock Climbing Spot,Baseball Field,Gym / Fitness Center,Playground,Pool,Athletics & Sports,Skating Rink
1,"Little Portugal, Trinity",43.647927,-79.41975,0,Park,Gym / Fitness Center,Yoga Studio,Gym,Playground,Skating Rink,Athletics & Sports,Martial Arts School,Boxing Gym,Dog Run
2,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191,0,Park,Gym / Fitness Center,Yoga Studio,Gym,Plaza,Boxing Gym,Climbing Gym,Farm,Harbor / Marina,Trail
3,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Park,Beach,Gym,Gym / Fitness Center,Yoga Studio,Harbor / Marina,Pool,Baseball Field,Boxing Gym,Garden
4,"St. James Town, Cabbagetown",43.667967,-79.367675,0,Park,Trail,Gym / Fitness Center,Plaza,Yoga Studio,Dog Run,Gym,Playground,Pool,Garden
5,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,0,Park,Gym / Fitness Center,Harbor / Marina,Yoga Studio,Skate Park,Beach,Dog Run,Garden,Gym,Martial Arts School


This cluster is near beaches, parks and gardens, while keeping a balance of outdoor and recreation venues.

In [181]:
cluster2

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,St. James Town,43.651494,-79.375418,1,Gym,Gym / Fitness Center,Park,Plaza,Yoga Studio,Garden,Gym Pool,Pool,Playground,Baseball Field
1,Berczy Park,43.644771,-79.373306,1,Gym,Park,Gym / Fitness Center,Harbor / Marina,Plaza,Yoga Studio,Gym Pool,Roof Deck,Beach,Fountain
2,"Richmond, Adelaide, King",43.650571,-79.384568,1,Gym,Park,Plaza,Gym / Fitness Center,Yoga Studio,Garden,Scenic Lookout,Roof Deck,Fountain,Lake
3,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576,1,Gym,Park,Plaza,Gym / Fitness Center,Garden,Scenic Lookout,Yoga Studio,Roof Deck,Baseball Field,Fountain
4,"Commerce Court, Victoria Hotel",43.648198,-79.379817,1,Gym,Park,Plaza,Gym / Fitness Center,Yoga Studio,Garden,Scenic Lookout,Fountain,Gym Pool,Playground
5,Davisville,43.704324,-79.38879,1,Gym,Yoga Studio,Plaza,Tennis Court,Gym / Fitness Center,Park,Indoor Play Area,Trail,Pilates Studio,Martial Arts School
6,Stn A PO Boxes,43.646435,-79.374846,1,Gym,Park,Gym / Fitness Center,Plaza,Yoga Studio,Gym Pool,Beach,Fountain,Garden,Harbor / Marina
7,"First Canadian Place, Underground city",43.648429,-79.38228,1,Gym,Park,Plaza,Gym / Fitness Center,Yoga Studio,Garden,Scenic Lookout,Baseball Field,Fountain,Gym Pool


This cluster is very similar to cluster 1 and also shows a good balance.

In [182]:
cluster3

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",43.65426,-79.360636,2,Park,Gym,Gym / Fitness Center,Yoga Studio,Playground,Trail,Athletics & Sports,Gym Pool,Dog Run,Baseball Field
1,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Gym,Gym / Fitness Center,Park,Yoga Studio,Sculpture Garden,Dog Run,Pilates Studio,Pool,Martial Arts School,Skating Rink
2,"The Danforth West, Riverdale",43.679557,-79.352188,2,Park,Gym,Gym / Fitness Center,Yoga Studio,Scenic Lookout,Playground,Trail,Skating Rink,Gym Pool,Dog Run
3,Studio District,43.659526,-79.340923,2,Gym,Park,Gym / Fitness Center,Yoga Studio,Playground,Baseball Field,Climbing Gym,Curling Ice,Cycle Studio,Gym Pool
4,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,2,Park,Gym,Yoga Studio,Gym / Fitness Center,Martial Arts School,Athletics & Sports,Cycle Studio,Ski Area,Skating Rink,Dog Run
5,"University of Toronto, Harbord",43.662696,-79.400049,2,Park,Gym,Yoga Studio,Gym / Fitness Center,Sculpture Garden,Martial Arts School,Soccer Field,Sports Club,Ski Area,Track
6,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049,2,Park,Gym,Gym / Fitness Center,Skating Rink,Athletics & Sports,Trail,Yoga Studio,Tennis Court,Ski Area,Gym Pool
7,Church and Wellesley,43.66586,-79.38316,2,Park,Gym,Gym / Fitness Center,Yoga Studio,Plaza,Dog Run,Botanical Garden,Garden,Pilates Studio,Playground


This cluster consists mainly of outdoor and recreation venues and is not well balanced against the other attractions.

In [183]:
cluster4

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Garden District, Ryerson",43.657162,-79.378937,3,Gym,Gym / Fitness Center,Park,Yoga Studio,Garden,Plaza,Skating Rink,Sculpture Garden,Dog Run,Lake
1,Central Bay Street,43.657952,-79.387383,3,Gym,Gym / Fitness Center,Park,Plaza,Sculpture Garden,Yoga Studio,Garden,Pool,Skating Rink,Lake
2,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,3,Gym,Park,Gym / Fitness Center,Harbor / Marina,Plaza,Scenic Lookout,Gym Pool,Yoga Studio,Roof Deck,Baseball Field
3,Davisville North,43.712751,-79.390197,3,Gym,Yoga Studio,Park,Athletics & Sports,Gym / Fitness Center,Dog Run,Plaza,Gym Pool,Track,Pilates Studio
4,"Parkdale, Roncesvalles",43.64896,-79.456325,3,Gym,Park,Playground,Baseball Field,Garden,Dog Run,Curling Ice,Tennis Court,Sports Club,Skating Rink
5,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049,3,Gym / Fitness Center,Gym,Yoga Studio,Park,Martial Arts School,Garden,Athletics & Sports,Playground,Skating Rink,Gym Pool


This cluster is very similar to cluster 3 and is not well balanced either.

From this examination we can discard clusters 3 and 4. We plot clusters 1 and 2 again and compare them. Because they are separate clusters we expect them to differ, and we want to pick the one that better matches the criteria from the business problem section.
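The balance of each cluster is judged above by reading the top-10 tables. A more direct, hedged alternative is to average the category frequencies within each cluster. The toy frame and cluster labels below are made up for illustration:

```python
import pandas as pd

# Toy frequency table and made-up cluster assignments.
toy_grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Gym": [0.6, 0.5, 0.2, 0.1],
    "Park": [0.3, 0.3, 0.5, 0.6],
    "Beach": [0.1, 0.2, 0.3, 0.3],
})
toy_labels = [0, 0, 1, 1]

# Mean category share per cluster, then the dominant category of each cluster.
profile = (toy_grouped.assign(Cluster=toy_labels)
           .drop(columns="Neighbourhood")
           .groupby("Cluster")
           .mean())
dominant = profile.idxmax(axis=1)
print(profile)
print(dominant)
```

On the real data the same pattern would apply to `df_grouped` with `kmeans.labels_`. A cluster whose profile is spread across several categories is more balanced than one dominated by a single column.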

In [185]:
a=dfm[dfm["Cluster Labels"]==0]
b=dfm[dfm["Cluster Labels"]==1]
Fcands=pd.concat([a,b],ignore_index=True)

In [189]:
toronto_coordinates =[43.6532, -79.3832]
map_clusters = folium.Map(location=toronto_coordinates, zoom_start=11)
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(2)]
carray = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in carray]
markers_colors = []
for lat, lon, poi, cluster in zip(Fcands['Venue Latitude'], Fcands['Venue Longitude'],
                                  Fcands['Neighbourhood'], Fcands['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True,
                        fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(map_clusters)
map_clusters

The first cluster is red and second cluster is purple in the map.

## Results and Discussion <a name="results"></a>

Clusters 1 and 2 are both in very good locations, close to the airport and to major parks and beaches. On closer inspection, the red cluster sits a little further from the city center, while a few neighborhoods in the purple cluster are close to Mount Hope Cemetery. We would generally expect customers to avoid cemeteries when going on holiday or a weekend away, so we eliminate the neighborhood concerned, which is Davisville.

Overall, the only difference between the clusters is that the red cluster's neighborhoods are a little further out and spread around the center, although they are still close to some major attractions according to the most-common-venues dataframe and the map. If we drop the Davisville neighborhood, we are left with one large purple cluster in the middle of the city center.
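Proximity to the city center and the airport is assessed visually here. It could also be quantified with a great-circle distance, as in the sketch below. The city-center coordinates are the ones used for the maps. The airport coordinates are an approximate position for Toronto Pearson and are an assumption, since the notebook does not say which airport is meant:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

CITY_CENTRE = (43.6532, -79.3832)      # toronto_coordinates from the map cells
PEARSON_AIRPORT = (43.6777, -79.6248)  # approximate; an assumption, not from the notebook

davisville = (43.704324, -79.38879)    # coordinates from the cluster 2 table
d_centre = haversine_km(*davisville, *CITY_CENTRE)
d_airport = haversine_km(*davisville, *PEARSON_AIRPORT)
print(round(d_centre, 1), round(d_airport, 1))
```

Applying the same function to every candidate neighborhood would give a distance column that can be sorted or thresholded.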

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Toronto close to the center with the maximum number of attractions, in order to help stakeholders narrow down the search for an optimal location for a new hotel. By calculating the distribution of possible tourist attractions from Foursquare data, we first trimmed the original 103 neighborhoods down to 28. We then clustered those locations to create major zones of interest, as seen on the map.

The final decision on the optimal hotel location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.