<H1>Identifying best Neighborhoods in Toronto to start a Restaurant

## Import Libraries

In [25]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import geocoder # import geocoder
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

## 1. Download and Explore Dataset

We will scrape the table from Wiki page using BeautifulSoap package and using html parser.

In [26]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response=requests.get(wiki_url)
soup = BeautifulSoup(response.text,'html.parser')

In [27]:
table =soup.find('table')

In [28]:
df =pd.read_html(str(table))
df = df[0]

In [29]:
df = df[df.Borough != 'Not assigned']

Converting the html table to dataframe and the below is the final data dimension after removing all non assigned neighourhoods.

In [30]:
dupl = df[df['Postal Code'].duplicated()]
df.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M4W,North York,Downsview
freq,1,24,4


In [7]:
!pip install geocoder



In [40]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


### Using geocoder to fetch the latitude and longitude of each pincode fetched

In [16]:
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
coords = [ get_latlng(neighborhood) for neighborhood in df["Postal Code"].tolist() ]
coords

[[43.75293455500008, -79.33564142299997],
 [43.72810248500008, -79.31188987099995],
 [43.65096410900003, -79.35304116399999],
 [43.723265465000054, -79.45121077799996],
 [43.66179000000005, -79.38938999999993],
 [43.66748067300006, -79.52895286499995],
 [43.80862623100006, -79.18991284599997],
 [43.74890000000005, -79.35721999999998],
 [43.70719267700008, -79.31152927299996],
 [43.65749059800004, -79.37752923699998],
 [43.70727872700007, -79.44750009299997],
 [43.65002250300006, -79.55408903099999],
 [43.78577865700004, -79.15736763799998],
 [43.72214339800007, -79.35202341799999],
 [43.68974004200004, -79.30850701899999],
 [43.65173364700007, -79.37555358799995],
 [43.69172991700003, -79.43001279899994],
 [43.637813150000056, -79.57648363299995],
 [43.76580607300008, -79.18528434099994],
 [43.67814827600006, -79.29534930999995],
 [43.645195888000046, -79.37385548899994],
 [43.68911756600005, -79.45065043699998],
 [43.77154467100007, -79.21813521299998],
 [43.70941386000004, -79.363099

In [38]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']
print(df.shape)

(103, 6)


### Glimpse of the final Dataset 

In [37]:
df.drop(['index'], axis=1)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.311890
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.661790,-79.389390
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653340,-79.509766
99,M4Y,Downtown Toronto,Church and Wellesley,43.666659,-79.381472
100,M7Y,East Toronto,Business reply mail Processing Centre,43.648700,-79.385450
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.632798,-79.493017


In [47]:
latitude=max(df['Latitude'])
longitude=min(df['Longitude'])

In [48]:
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
map_manhattan

In [62]:
df_toronto = df.drop(['index'], axis=1)

In [64]:
df_toronto.shape
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


## 2. Fetching Neighborhood Venues using Foursquare API

In [79]:
# define Foursquare Credentials and Version
CLIENT_ID = 'OILLBQOIB030SIAEJKNLM1Y1IEIFUNRBLVABMEDPZBAJZEQE' # your Foursquare ID
CLIENT_SECRET = '2BBYSH5HZQXTGWWZXNZACEYTKC5Z5SYWCCL513BB25AAK0YQ' # your Foursquare Secret
VERSION = '20200505' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: OILLBQOIB030SIAEJKNLM1Y1IEIFUNRBLVABMEDPZBAJZEQE
CLIENT_SECRET:2BBYSH5HZQXTGWWZXNZACEYTKC5Z5SYWCCL513BB25AAK0YQ


In [81]:
radius = 1000
LIMIT = 50

venues = []

for lat, long, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

In [83]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(3385, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Parkwoods,43.752935,-79.335641,Donalda Golf & Country Club,43.752816,-79.342741,Golf Course
1,Parkwoods,43.752935,-79.335641,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.752935,-79.335641,TD Canada Trust,43.753253,-79.347851,Bank
3,Parkwoods,43.752935,-79.335641,Variety Store,43.751974,-79.333114,Food & Drink Shop
4,Parkwoods,43.752935,-79.335641,TTC Stop #9075,43.757596,-79.338155,Train Station


## 3. Exploratory data Analysis

### Top 20 Venues

In [179]:
#venues_df.groupby(["VenueName"]).agg({"VenueName": [np.sum]})

venues_df[['VenueName']].groupby(['VenueName'])['VenueName'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(20).reset_index(drop=True)

Unnamed: 0,VenueName,count
0,Tim Hortons,79
1,Starbucks,56
2,Subway,52
3,Shoppers Drug Mart,36
4,TD Canada Trust,33
5,Pizza Pizza,25
6,RBC Royal Bank,24
7,Petro-Canada,23
8,LCBO,21
9,The Beer Store,19


### Bottom 30 Venues

In [180]:
venues_df[['VenueName']].groupby(['VenueName'])['VenueName'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .tail(20).reset_index(drop=True)

Unnamed: 0,VenueName,count
0,Her Father's Cider Bar + Kitchen,1
1,Henry VIII Ale House,1
2,Henderson Brewing,1
3,Heather Heights Park,1
4,Heart Sushi,1
5,Hearn Generating Station,1
6,Hancook,1
7,Hazel's Diner,1
8,Haven,1
9,Haute Coffee,1


### Neighbourhoods with highest number of venues

In [183]:
venues_df[['Neighborhood']].groupby(['Neighborhood'])['Neighborhood'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .head(10).reset_index(drop=True)

Unnamed: 0,Neighborhood,count
0,Downsview,68
1,Willowdale,68
2,Don Mills,67
3,Leaside,50
4,"Runnymede, Swansea",50
5,"Dufferin, Dovercourt Village",50
6,East Toronto,50
7,Stn A PO Boxes,50
8,"Fairview, Henry Farm, Oriole",50
9,"First Canadian Place, Underground city",50


### Neighbourhoods with Lowest number of venues


In [184]:
venues_df[['Neighborhood']].groupby(['Neighborhood'])['Neighborhood'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .tail(10).reset_index(drop=True)

Unnamed: 0,Neighborhood,count
0,"North Park, Maple Leaf Park, Upwood Park",11
1,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",11
2,Scarborough Village,11
3,Parkwoods,11
4,Lawrence Park,10
5,"Humberlea, Emery",9
6,"Malvern, Rouge",9
7,Woburn,7
8,Humber Summit,7
9,"Rouge Hill, Port Union, Highland Creek",5


In [189]:
venues_df[['Neighborhood','VenueName']].groupby(['Neighborhood','VenueName'])['Neighborhood'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .reset_index(drop=True)

Unnamed: 0,Neighborhood,VenueName,count
0,Downsview,Tim Hortons,5
1,"Clarks Corners, Tam O'Shanter, Sullivan",Tim Hortons,4
2,East Toronto,Starbucks,3
3,"The Kingsway, Montgomery Road, Old Mill North",Starbucks,3
4,"Summerhill West, Rathnelly, South Hill, Forest...",Starbucks,3
...,...,...,...
3289,"Fairview, Henry Farm, Oriole",The Beer Store,1
3290,"Fairview, Henry Farm, Oriole",The LEGO Store,1
3291,"Fairview, Henry Farm, Oriole",Tommy Hilfiger,1
3292,"Fairview, Henry Farm, Oriole",Zara,1


In [113]:
#venues_df.groupby(["Neighborhood"]).count()
#venues_df.groupby(["VenueCategory"]).count()
venues_dfnew = venues_df[venues_df["VenueCategory"].str.contains("Restaurant")].reset_index()

venues_df_ab = venues_dfnew[['Neighborhood','Latitude','Longitude']]
venues_df_ab.shape

(816, 3)

In [170]:
#df.groupby(['A','B']).B.agg('count').to_frame('c').reset_index()

df1 = venues_df_ab.groupby('Neighborhood').Neighborhood.agg('count').to_frame('Count').reset_index()['Count']
type(df1)
df1=df1.to_frame()
df1

Unnamed: 0,Count
0,22
1,4
2,5
3,2
4,12
...,...
87,21
88,7
89,2
90,7


In [158]:
toronto_grouped = venues_df_ab.groupby(['Neighborhood']).mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Agincourt,43.793930,-79.265694
1,"Alderwood, Long Branch",43.600895,-79.540387
2,"Bathurst Manor, Wilson Heights, Downsview North",43.757394,-79.442394
3,Bayview Village,43.780607,-79.376921
4,"Bedford Park, Lawrence Manor East",43.735447,-79.417944
...,...,...,...
87,Willowdale,43.769593,-79.415197
88,"Willowdale, Newtonbrook",43.791800,-79.406428
89,Woburn,43.771545,-79.218135
90,Woodbine Heights,43.689740,-79.308507


## 4. K means clustering and plotting the final segmented neighborhoods

In [166]:

# set number of clusters
kclusters = 3

#toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df1)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 0, 1, 1, 0, 1, 1, 1])

In [167]:
# add clustering labels
#toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_grouped['Cluster Labels'] = kmeans.labels_
toronto_merged = toronto_grouped

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
#toronto_merged = toronto_merged.join(toronto_grouped.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head()
#toronto_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
0,Agincourt,43.79393,-79.265694,2
1,"Alderwood, Long Branch",43.600895,-79.540387,0
2,"Bathurst Manor, Wilson Heights, Downsview North",43.757394,-79.442394,0
3,Bayview Village,43.780607,-79.376921,0
4,"Bedford Park, Lawrence Manor East",43.735447,-79.417944,1


In [168]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Analysis of clusters

In [195]:
clus_0 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]].reset_index(drop=True)
clus_1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]].reset_index(drop=True)
clus_2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]].reset_index(drop=True)

We will compare the neighbourhoods in the each cluster with total number of avenues. So we can get a better understanding of which neighborhood to invest in based on the number of other avenues in the same neighborhood.

In [200]:
count_venues = venues_df[['Neighborhood']].groupby(['Neighborhood'])['Neighborhood'] \
                             .count() \
                             .reset_index(name='count') \
                             .sort_values(['count'], ascending=False) \
                             .reset_index(drop=True)

### Count of total venues in a neighborhood at Clouster 0 

In [232]:
pd.merge(clus_0,count_venues,on='Neighborhood').sort_values('count',ascending=False).reset_index(drop=True)

Unnamed: 0,Neighborhood,count
0,Woodbine Heights,50
1,"Regent Park, Harbourfront",50
2,Leaside,50
3,"Runnymede, The Junction North",44
4,"Guildwood, Morningside, West Hill",39
5,"Alderwood, Long Branch",36
6,York Mills West,33
7,Glencairn,32
8,Forest Hill North & West,31
9,"New Toronto, Mimico South, Humber Bay Shores",30


### Count of total venues in a neighborhood at Clouster 1

In [230]:
pd.merge(clus_1,count_venues,on='Neighborhood').sort_values('count',ascending=False).reset_index(drop=True)

Unnamed: 0,Neighborhood,count
0,Downsview,68
1,"Garden District, Ryerson",50
2,"Toronto Dominion Centre, Design Exchange",50
3,"High Park, The Junction South",50
4,Berczy Park,50
5,"India Bazaar, The Beaches West",50
6,"Kensington Market, Chinatown, Grange Park",50
7,"Lawrence Manor, Lawrence Heights",50
8,"Moore Park, Summerhill East",50
9,"First Canadian Place, Underground city",50


### Count of total venues in a neighborhood at Clouster 2

In [231]:
pd.merge(clus_2,count_venues,on='Neighborhood').sort_values('count',ascending=False).reset_index(drop=True)

Unnamed: 0,Neighborhood,count
0,Willowdale,68
1,Don Mills,67
2,Davisville,50
3,"Little Portugal, Trinity",50
4,"Summerhill West, Rathnelly, South Hill, Forest...",50
5,"The Annex, North Midtown, Yorkville",50
6,Agincourt,47
