<H1>Battle of the Neighborhoods | Opening a new Coffee Shop in Toronto</H1>

<H2>Introduction</H2>

<H3>The Business Problem</H3>

My client owns several Coffee Shops in New York City. She has been told about the huge opportunities that Toronto poses, so she's seriously considering the idea of opening one of her Coffee Shops there but, where?

The brief: find the best spot to open her new Coffee Shop in Toronto, Canada.

Since Toronto is a very big city, we'll focus our efforts in Downtown Toronto.

<H3>The Approach</H3>

1) Divide Toronto into grids.

2) Analyze the density of restaurants per grid.

3) Identify the zones of opportunity with low competition.

<H2>The Data</H2>

<H3>Describing the Data</H3>

My  first approach to the analysis was to find if there was a correlation between restaurant density, categories and a restaurant density or, at least, a trend. However, after further analysis of the API, its limitations are evident, so I'll focus just on restaurant density to determine opportunity.

To do this, I'll be using previously analyzed data sets (Toronto neighborhoods).

For further reference, the data sets are these:

1) Toronto neighborhood list: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

3) Toronto neighborhood geo data: https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv

I've also created a simplified geojson file with Toronto's boroughs based on jasoicarter's Toronto Neighborhoods geojson file that you can find here: https://github.com/jasonicarter/toronto-geojson/blob/master/toronto_crs84.geojson


<H3>Foursquare's data</H3>

In order to get more insight on the restaurant scene in Toronto, I'll need Foursquare's data. For my analysis, I'll only need the Explore endpoint (https://developer.foursquare.com/docs/api/venues/explore). This will give me the name of the venue, the location and the category.

<H2>Analysis</H2>

First, we need to import the libraries needed for our analysis.

In [1]:
!pip install pyproj
!pip install shapely
!pip install beautifulsoup4
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes

Collecting pyproj
[?25l  Downloading https://files.pythonhosted.org/packages/e5/fd/eb99d24327e248a5e93cec65eedf22a751f70723384a832837eef1f80509/pyproj-2.2.2-cp36-cp36m-manylinux1_x86_64.whl (11.2MB)
[K     |████████████████████████████████| 11.2MB 14.3MB/s eta 0:00:01▌          | 7.5MB 14.3MB/s eta 0:00:01
[?25hInstalling collected packages: pyproj
Successfully installed pyproj-2.2.2
Collecting shapely
[?25l  Downloading https://files.pythonhosted.org/packages/38/b6/b53f19062afd49bb5abd049aeed36f13bf8d57ef8f3fa07a5203531a0252/Shapely-1.6.4.post2-cp36-cp36m-manylinux1_x86_64.whl (1.5MB)
[K     |████████████████████████████████| 1.5MB 12.5MB/s eta 0:00:01
[?25hInstalling collected packages: shapely
Successfully installed shapely-1.6.4.post2
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import urllib.request

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

%matplotlib inline

# import k-means from clustering stage
from sklearn.cluster import KMeans

# importing scrapper
from bs4 import BeautifulSoup
import io

import shapely.geometry

import pyproj

import math

In [3]:
import folium # map rendering library

<H2>Analyzing the data</H2>

First, we need to define the coordinates for Downtown Toronto. We need to create functions to transform latitude/longitude coordinates into cartesian coordinates.

In [4]:
#Defining Downtown Toronto Coordinates
toronto_downtown=[43.6547527,-79.4141868]

#Creating Functions to convert lat/lon coordinates to cartesian
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=17, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

#Creating Functions to convert cartesian coordinates to lat/lon
def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=17, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

#Defining a function to calculate distance using 2D coordinates
def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

#Checking that functions work
print('Coordinate transformation check')
print('-------------------------------')
print('toronto_downtown longitude={}, latitude={}'.format(toronto_downtown[1], toronto_downtown[0]))
x, y = lonlat_to_xy(toronto_downtown[1], toronto_downtown[0])
print('Downtown Toronto UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Downtown Toronto longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
toronto_downtown longitude=-79.4141868, latitude=43.6547527
Downtown Toronto UTM X=627876.8881059117, Y=4834749.928079093
Downtown Toronto longitude=-79.4141868, latitude=43.654752699999996


Then, we need to generate the neighborhoods that we'll use for our analysis. These clusters will be 600m apart in a radius of 6km from Downtown Toronto's center.

In [5]:
# City center in Cartesian coordinates
toronto_downtown_x, toronto_downtown_y = lonlat_to_xy(toronto_downtown[1], toronto_downtown[0])

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = toronto_downtown_x - 6000
x_step = 600
y_min = toronto_downtown_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(toronto_downtown_x, toronto_downtown_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

364 candidate neighborhood centers generated.


In order to see if the neighborhoods were correctly generated, we'll need to visualize them.

In [6]:
toronto_map = folium.Map(location=toronto_downtown, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
folium.Marker(toronto_downtown, popup='Toronto Downtown').add_to(toronto_map)
folium.Circle(toronto_downtown, radius=6000, color='red', fill=False).add_to(toronto_map)
for lat, lon in zip(latitudes, longitudes):
    folium.CircleMarker([lat, lon], popup=str([lat,lon]), radius=2, color='gray', fill=True, fill_color='gray', fill_opacity=.25).add_to(toronto_map) 
    folium.Circle([lat, lon], radius=300, color='lightblue', fill=False).add_to(toronto_map)
    
    #folium.Marker([lat, lon]).add_to(toronto_map)
            
toronto_map

Everything's looking great so far. Some clusters are falling into bodies of water, but that won't impact our analysis.

Now, we'll use Google Map's API to get a list of addresses within the radius if our analysis.

In [7]:
api_key='AIzaSyApzoK5rD19et6-zNM7Ltf--L72YR2oamM'

In [8]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

addr = get_address(api_key, toronto_downtown[0], toronto_downtown[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(toronto_downtown[0], toronto_downtown[1], addr))

Reverse geocoding check
-----------------------
Address of [43.6547527, -79.4141868] is: 6 Gore St, Toronto, ON M6J 2C6, Canada


In [9]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', Canada', '')
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


Then, we'll add that list into a dataframe we can work with.

In [10]:
toronto_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

toronto_locations.head(10)

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"Toronto Division, ON",43.603614,-79.437834,626076.888106,4829034.0,5992.495307
1,"Toronto Division, ON",43.603512,-79.430402,626676.888106,4829034.0,5840.3767
2,"Toronto Division, ON",43.60341,-79.422971,627276.888106,4829034.0,5747.173218
3,"Toronto Division, ON",43.603307,-79.415539,627876.888106,4829034.0,5715.767665
4,"Toronto Division, ON",43.603204,-79.408108,628476.888106,4829034.0,5747.173218
5,"Toronto Division, ON",43.6031,-79.400676,629076.888106,4829034.0,5840.3767
6,"Toronto Division, ON",43.602996,-79.393245,629676.888106,4829034.0,5992.495307
7,"Toronto Division, ON",43.608443,-79.448861,625176.888106,4829554.0,5855.766389
8,"Toronto Division, ON",43.608342,-79.441429,625776.888106,4829554.0,5604.462508
9,"Toronto Division, ON",43.60824,-79.433997,626376.888106,4829554.0,5408.326913


<H3>Analyzing Foursquare data</H3>

We're going to use just the Foursquare Explore Endpoints. We'll focus just on restaurants and we'll pay special attention to the coffee shop scene and all sub categories of coffee shops.

In [11]:
client_id = 'ZD5ZGYN134T5K14QLLIYY5S2GD3JUXFX0HW4Q1OESP502DME' # your Foursquare ID
client_secret = 'W0RU55LWC2Q5L0PUVOAQ5WCOBHCZTOPHS2R2FCA3ROYJ4OFE' # your Foursquare Secret
VERSION = '20180604'

ll=list(zip(latitudes,longitudes))
ll=pd.DataFrame(ll,columns=["latitudes","longitudes"])

#restaurants=pd.DataFrame()

In [12]:
#'Root' category for all food-related venues
food_category = '4d4b7105d754a06374d81259'

#All Coffee Shop Categories
coffee_shop_categories = ['4bf58dd8d48988d128941735','4bf58dd8d48988d143941735','4bf58dd8d48988d16d941735','4bf58dd8d48988d1e0931735','4bf58dd8d48988d147941735','54135bf5e4b08f3d2429dfe7',
                                '4bf58dd8d48988d1c7941735','56aa371be4b08b9a8d5734c1','4bf58dd8d48988d18d941735','4bf58dd8d48988d1f0941735','4bf58dd8d48988d1a1941735']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Canada', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=300, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

We're going to use the Foursquare API to get restaurants and coffee shops within each pre-defined segment.

In [13]:
def get_restaurants(lats, lons):
    restaurants = {}
    coffee_shops = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to meke sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, client_id, client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_coffee = is_restaurant(venue_categories, specific_filter=coffee_shop_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_coffee, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_coffee:
                    coffee_shops[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, coffee_shops, location_restaurants

In [14]:
restaurants = {}
coffee_shops = {}
location_restaurants = []
restaurants, coffee_shops, location_restaurants = get_restaurants(latitudes, longitudes)

Obtaining venues around candidate locations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [16]:
print('Total number of restaurants:', len(restaurants))
print('Total number of coffee shops:', len(coffee_shops))
print('Percentage of coffee shops: {:.2f}%'.format(len(coffee_shops) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 1768
Total number of coffee shops: 365
Percentage of coffee shops: 20.64%
Average number of restaurants in neighborhood: 4.315934065934066


Now we know there's a total of 1768 restaurants in the area. 20.64% of them are coffe shops. We'll now check if the restaurant data was correctly extracted.

In [17]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

List of all restaurants
-----------------------
('4b8f1a68f964a5202d4933e3', 'Song Cook Corean Chili', 43.604020706418524, -79.41682293879842, '681 Bloor Street West, Toronto ON M6G 1L3', 130, False, 627771.7636027196, 4829111.438520751)
('4c4c87ba46240f47d8adf1f4', 'The Mermaid Cafe', 43.619263971909525, -79.39131086652745, 'Canada', 295, True, 629797.9706221918, 4830843.956668356)
('4e2df8aed22daa76ed87f397', 'Comissary', 43.617689, -79.376198, 'Toronto ON', 185, True, 631020.7975555225, 4830692.772462229)
('4bfc4730f14fa5938dc1c9d4', 'Carousel Cafe', 43.6193992460532, -79.37368477304864, 'Centreville, Toronto ON', 342, False, 631219.8663491666, 4830886.685604181)
('51f4098f498ed200bf98f4a6', "Sister Sarah's Coffee Shop", 43.62086, -79.373371, 'Canada', 242, True, 631242.0052640723, 4831049.416688221)
('57ad219238fa6c46b613ee5e', 'krazy roll', 43.625837, -79.477113, 'Marine Parade Drive, Toronto ON', 207, False, 622861.5049794678, 4831443.421567294)
('4b527767f964a520a37e27e3', 'Eden

In [18]:
print('List of coffee shops')
print('---------------------------')
for r in list(coffee_shops.values())[:10]:
    print(r)
print('...')
print('Total:', len(coffee_shops))

List of coffee shops
---------------------------
('4c4c87ba46240f47d8adf1f4', 'The Mermaid Cafe', 43.619263971909525, -79.39131086652745, 'Canada', 295, True, 629797.9706221918, 4830843.956668356)
('4e2df8aed22daa76ed87f397', 'Comissary', 43.617689, -79.376198, 'Toronto ON', 185, True, 631020.7975555225, 4830692.772462229)
('51f4098f498ed200bf98f4a6', "Sister Sarah's Coffee Shop", 43.62086, -79.373371, 'Canada', 242, True, 631242.0052640723, 4831049.416688221)
('51c322e0498e0744523e518d', 'JUICEHunt-The Deck', 43.623642, -79.381329, 'Canada', 287, True, 630593.8943126665, 4831345.844521315)
('5d4847d2735c2d00078ae36a', 'TIM’s Diner', 43.623604, -79.381364, 'Toronto ON M5J 1X9', 291, True, 630591.1527653486, 4831341.5690871235)
('4ad4c05df964a52028f620e3', 'The Rectory Cafe', 43.627359999264485, -79.3563832180081, "102 Lakeshore Ave. (on Ward's Island), Toronto ON M5J 1X9", 276, True, 632598.3647684912, 4831798.317222183)
('51c3316a498e64769fa05603', 'JUICEHunt-Finishline', 43.627281, -

In [19]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

Restaurants around location
---------------------------
Restaurants around location 101: 
Restaurants around location 102: 
Restaurants around location 103: Harry's Char Broil & Dining Lounge, The Abbott
Restaurants around location 104: SCHOOL Restaurant, Mike's Liberty Grill, Vogue Supper Club, Joker Cafe, Uma Cafe, The Roastery Cafe
Restaurants around location 105: Local Liberty Village, Barista Espresso Bar, Thai Room Liberty Village
Restaurants around location 106: 
Restaurants around location 107: Parisco Cafe, Kitten Food Timez
Restaurants around location 108: touti gelati and café, Music Garden Cafe, Guirei Sushi, Iruka Sushi
Restaurants around location 109: 
Restaurants around location 110: Pearl Harbourfront, Edo Japan


Let's visualize them and see how they are distributed in the map.

In [44]:
map_toronto = folium.Map(location=toronto_downtown, zoom_start=13)
folium.Marker(toronto_downtown, popup='Downtown Toronto').add_to(map_toronto)
folium.TileLayer('cartodbpositron').add_to(map_toronto)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_coffee = res[6]
    color = 'purple' if is_coffee else 'lightblue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_toronto)
map_toronto

<H2>Expanding further our analysis</H2>

Now that we bave extracted and classified the restaurants and coffee shops in the area, it's time to add that data to our Toronto locations dataframe. This will help us have a better understanding of the density of restaurants in the area.

In [21]:
location_restaurants_count = [len(res) for res in location_restaurants]

toronto_locations['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

toronto_locations.head(10)

Average number of restaurants in every area with radius=300m: 4.315934065934066


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Restaurants in area
0,"Toronto Division, ON",43.603614,-79.437834,626076.888106,4829034.0,5992.495307,0
1,"Toronto Division, ON",43.603512,-79.430402,626676.888106,4829034.0,5840.3767,0
2,"Toronto Division, ON",43.60341,-79.422971,627276.888106,4829034.0,5747.173218,0
3,"Toronto Division, ON",43.603307,-79.415539,627876.888106,4829034.0,5715.767665,1
4,"Toronto Division, ON",43.603204,-79.408108,628476.888106,4829034.0,5747.173218,0
5,"Toronto Division, ON",43.6031,-79.400676,629076.888106,4829034.0,5840.3767,0
6,"Toronto Division, ON",43.602996,-79.393245,629676.888106,4829034.0,5992.495307,0
7,"Toronto Division, ON",43.608443,-79.448861,625176.888106,4829554.0,5855.766389,0
8,"Toronto Division, ON",43.608342,-79.441429,625776.888106,4829554.0,5604.462508,0
9,"Toronto Division, ON",43.60824,-79.433997,626376.888106,4829554.0,5408.326913,0


Then, we need to define how close each location center is to the nearest cofffee shop.

In [22]:
distances_to_coffee_shop = []

for area_x, area_y in zip(xs, ys):
    min_distance = 10000
    for res in coffee_shops.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_coffee_shop.append(min_distance)

toronto_locations['Distance to coffee shop'] = distances_to_coffee_shop

In [23]:
toronto_locations.head()

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Restaurants in area,Distance to coffee shop
0,"Toronto Division, ON",43.603614,-79.437834,626076.888106,4829034.0,5992.495307,0,3495.365046
1,"Toronto Division, ON",43.603512,-79.430402,626676.888106,4829034.0,5840.3767,0,3315.543956
2,"Toronto Division, ON",43.60341,-79.422971,627276.888106,4829034.0,5747.173218,0,3103.420619
3,"Toronto Division, ON",43.603307,-79.415539,627876.888106,4829034.0,5715.767665,1,2639.303036
4,"Toronto Division, ON",43.603204,-79.408108,628476.888106,4829034.0,5747.173218,0,2240.67434


In [24]:
print('Average distance to closest coffee shop from each area center:', toronto_locations['Distance to coffee shop'].mean())

Average distance to closest coffee shop from each area center: 617.5119110407228


On average, each coffee shop is around 600m from each area center. This doesn't tell us much, that's why we'll need another tool for our analysis. Enter the heatmaps!

In [25]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

coffee_latlons = [[res[2], res[3]] for res in coffee_shops.values()]

In [54]:
toronto_boroughs_url = 'https://github.com/efralopez/Coursera_Capstone/raw/master/toronto_crs84_simplified.json'
toronto_boroughs = requests.get(toronto_boroughs_url).json()

def boroughs_style(feature):
    return { 'color': 'blue', 'fill': False, 'opacity':0.1 }

In [55]:
from folium import plugins
from folium.plugins import HeatMap

toronto_map = folium.Map(location=toronto_downtown, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(toronto_map) #cartodbpositron cartodbdark_matter

HeatMap(restaurant_latlons).add_to(toronto_map)
folium.Marker(toronto_downtown).add_to(toronto_map)
folium.Circle(toronto_downtown, radius=1000, fill=False, color='white').add_to(toronto_map)
folium.Circle(toronto_downtown, radius=2000, fill=False, color='white').add_to(toronto_map)
folium.Circle(toronto_downtown, radius=3000, fill=False, color='white').add_to(toronto_map)
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map


It's fairly obvious that the heat is centered around the Financial District. This is mainly due to the high density of affluent people who either live or commute every day to the area.

Now let's see how the map looks with just coffe shops.

In [56]:
toronto_map = folium.Map(location=toronto_downtown, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(toronto_map) #cartodbpositron cartodbdark_matter
HeatMap(coffee_latlons).add_to(toronto_map)
folium.Marker(toronto_downtown).add_to(toronto_map)
folium.Circle(toronto_downtown, radius=1000, fill=False, color='white').add_to(toronto_map)
folium.Circle(toronto_downtown, radius=2000, fill=False, color='white').add_to(toronto_map)
folium.Circle(toronto_downtown, radius=3000, fill=False, color='white').add_to(toronto_map)
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map

Again, most of the coffee shops are located near the Financial District. However, we can see that there are still places where there are no coffee shops yet, so we may have a good chance to be in a high traffic area with low competition around.

Based on the analysis, we can conclude that, if we want have a chance in this market, we need to be where the people are. That's why we're going to focus on the Financial District.

<H2>Focusing our efforts on the Financial District</H2>

As we previously said, the Financial District looks like the most interesting place we can put our eyes on. This is what Wikipedia has to say about it:

_"It is the most densely built-up area of Toronto, home to banking companies, corporate headquarters, high-powered legal and accounting firms, insurance companies and stockbrokers. In turn, the presence of so many decision-makers has brought advertising agencies and marketing companies. The banks have built large office towers, much of whose space is leased to these companies._

_The bank towers, and much else in Toronto's core, are connected by a system of underground walkways, known as PATH, which is lined with retail establishments making the area one of Toronto's most important shopping districts. The vast majority of these stores are only open during weekdays during the business day when the financial district is populated. During the evenings and weekends, the walkways remain open but the area is almost deserted and most of the stores are closed._

_It is estimated 100,000 commuters enter and leave the financial district each working day. Transport links are centered on Union Station at the south end of the financial district, which is the hub of the GO Transit system that provides commuter rail and bus links to Toronto's suburbs."_

This time, we'll create smaller, 100m tighter segments. The radius of the area of study will be reduced to 1.5km as well.

In [57]:
fin_dis_x, fin_dis_y=lonlat_to_xy(-79.384056,43.655351)

roi_x_min = fin_dis_x -1500
roi_y_max = fin_dis_y + 1500
roi_width = 3000
roi_height = 3000
roi_center_x = fin_dis_x
roi_center_y = fin_dis_y
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_toronto = folium.Map(location=roi_center, zoom_start=15)
HeatMap(restaurant_latlons).add_to(map_toronto)
folium.TileLayer('cartodbpositron').add_to(map_toronto)
folium.Marker(toronto_downtown).add_to(map_toronto)
folium.Marker(roi_center).add_to(map_toronto)
folium.Circle(roi_center, radius=1500, color='white', fill=True, fill_opacity=0.4).add_to(map_toronto)
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(map_toronto)
map_toronto

This area should be enough to find our perfect spot.

We now create the tighter grid.

In [30]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 1500# - (int(51/k)*k*100 - 3000)/2

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 1501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')

821 candidate neighborhood centers generated.


Now, like we did before, we'll calculate the number of restaurants in vicinity and distance to the closest coffee shop. 

In [32]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_coffee_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, coffee_shops)
    roi_coffee_distances.append(distance)
print('done.')

Generating data on location candidates... done.


We'll add that into a dataframe for easier analysis.

In [33]:
toronto_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Restaurants nearby':roi_restaurant_counts,
                                 'Distance to cafeterias':roi_coffee_distances})

toronto_roi_locations.head(10)

Unnamed: 0,Latitude,Longitude,X,Y,Restaurants nearby,Distance to cafeterias
0,43.641859,-79.385038,630255.314065,4833363.0,12,115.906753
1,43.641842,-79.383798,630355.314065,4833363.0,10,34.658139
2,43.642717,-79.390594,629805.314065,4833450.0,6,144.651987
3,43.6427,-79.389355,629905.314065,4833450.0,6,163.949172
4,43.642682,-79.388115,630005.314065,4833450.0,5,64.567176
5,43.642665,-79.386876,630105.314065,4833450.0,6,38.190438
6,43.642647,-79.385636,630205.314065,4833450.0,13,136.923697
7,43.64263,-79.384397,630305.314065,4833450.0,14,62.511714
8,43.642612,-79.383158,630405.314065,4833450.0,12,57.661079
9,43.642595,-79.381918,630505.314065,4833450.0,13,150.80413


Now, we need to filter those locations to places with no more than two restaurants and no coffe shops in a radius of 250m. 

In [34]:
good_res_count = np.array((toronto_roi_locations['Restaurants nearby']<=2))
print('Locations with no more than two restaurants nearby:', good_res_count.sum())

good_ita_distance = np.array(toronto_roi_locations['Distance to cafeterias']>=250)
print('Locations with no cafeterias within 250m:', good_ita_distance.sum())

good_locations = np.logical_and(good_res_count, good_ita_distance)
print('Locations with both conditions met:', good_locations.sum())

toronto_good_locations = toronto_roi_locations[good_locations]

Locations with no more than two restaurants nearby: 109
Locations with no cafeterias within 250m: 124
Locations with both conditions met: 80


Let's see how that looks like on a map.

In [65]:
good_latitudes = toronto_good_locations['Latitude'].values
good_longitudes = toronto_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

toronto_map = folium.Map(location=roi_center, zoom_start=15)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
HeatMap(restaurant_latlons).add_to(toronto_map)
folium.Circle(roi_center, radius=1500, color='white', fill=True, fill_opacity=0.6).add_to(toronto_map)
folium.Marker(roi_center).add_to(toronto_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='darkblue', fill=True, fill_color='darkblue', fill_opacity=1).add_to(toronto_map) 
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map

Great! It seems like there are some locations that meet our criteria.

Let's now see how it the heat map looks like with just those locations.

In [68]:
toronto_map = folium.Map(location=roi_center, zoom_start=14.5)
HeatMap(good_locations, radius=25).add_to(toronto_map)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='darkblue', fill=True, fill_color='darkblue', fill_opacity=1).add_to(toronto_map)
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map

Now, in order to get the addresses recommendations, we'll cluster those locations.

In [72]:
from sklearn.cluster import KMeans

number_of_clusters = 7

good_xys = toronto_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

toronto_map = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
HeatMap(restaurant_latlons).add_to(toronto_map)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(toronto_map)
#folium.Marker(berlin_center).add_to(toronto_map)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='cadetblue', fill=True, fill_opacity=0.15).add_to(toronto_map) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='dadrkblue', fill=True, fill_color='darkblue', fill_opacity=1).add_to(toronto_map)
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map

In [75]:
toronto_map = folium.Map(location=roi_center, zoom_start=14.5)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='lightblue', fill_opacity=0.07).add_to(toronto_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='darkblue', fill=True, fill_color='darkblue', fill_opacity=1).add_to(toronto_map)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='cadetblue', fill=False).add_to(toronto_map) 
folium.GeoJson(toronto_boroughs, style_function=boroughs_style, name='geojson').add_to(toronto_map)
toronto_map

It's now time to reverse geocode the candidates area centers for a final presentation.

In [39]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(api_key, lat, lon).replace(', Canada', '')
    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, fin_dis_x,fin_dis_y)
    print('{}{} => {:.1f}km from Financial Center'.format(addr, ' '*(50-len(addr)), d/1000))

Addresses of centers of areas recommended for further analysis

25 King's College Cir, Toronto, ON M5S 3K1         => 1.2km from Financial Center
155 Wellesley St E, Toronto, ON M4Y 1J4            => 1.4km from Financial Center
120 Pembroke St, Toronto, ON M5A 2N8               => 1.0km from Financial Center
241 Lake Shore Blvd E, Toronto, ON M5E             => 1.5km from Financial Center
37 Sullivan St, Toronto, ON M5T 1B8                => 1.0km from Financial Center
Queen's Park, 111 Wellesley St W, Toronto, ON M5S  => 1.2km from Financial Center
128 Shuter St, Toronto, ON M5A 1V8                 => 1.0km from Financial Center


In [95]:
import html
toronto_map = folium.Map(location=roi_center, zoom_start=14.5)
folium.TileLayer('cartodbpositron').add_to(toronto_map)
folium.Circle(roi_center, radius=30, color='orange', fill=True, fill_color='orange', fill_opacity=1).add_to(toronto_map)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup= html.escape(addr)).add_to(toronto_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='cadetblue', fill_opacity=0.05).add_to(toronto_map)
toronto_map

<H2>Results/Discussion</H2>

We have found a way to narrow down, from an almost shapeless, non related data, a list of locations that will definitely prove useful for any entrepreneur. This analysis started by selecting a city, then selcting a point where to start our analysis (in this case, Downtown Toronto). Then, we used Google Map's API to fetch location data to get a list of possible addresses where our business could be located.

Foursquare data proved to be invaluable. Combining both data sources was vital to understanding the distribution of restaurant businesses in Downtown Toronto. This led us to the conclussion that it was probably a good idea to focus were the action was: the Financial District.

By readjusting the area and location to the Financial District, we found some pockets of opportunities that our stakeholders can take advantage of.

It's important to notice that, since we had very limited access to Foursquare's API, we couldn't find more qualitative data for our analysis. Of special interest would've been not only to have access to the locations and categories of the restaurants, but their scores as well. This could've helped us determine if there was a correlation between certain aspects (such as location, distance to specific points of interest, etc)  to the perceived value of the restaurants in order to give better recommendations. Hence, further research is needed to improve this recommendations.

<H2>Conclussion</H2>

We have succesfully created an analysis based on freely available data. This kind of business research should become standard practice in the near future. It's important to note, however, that the quality of the results will be greatly improved by having better quality data sources, such as having an improved access to Foursquare's data.

This analysis should be taken as a very solid first step towards deeper research. In the end, the stakeholders should take into account other aspects when making a decision.