# Opening a Restaurant in Toronto: Choosing a Location

## Table of Contents
<font size = 3>
<ul>
    <li>
        <a href='#Introduction'>Introduction</a>
    </li>
    <li>
        <a href='#Data-Requirements'>Data Requirements</a>
    </li>
    <li>
        <a href="#Methodology">Methodology</a>
    </li>
    <ul>
        <li>
            <a href="#Global-configuration">Global configuration</a>
        </li>
        <li>
            <a href="#Acquiring-the-data-sets">Acquiring the data sets</a>
        </li>
        <ul>
            <li>
                <a href="#Postal-code-information">Postal code information</a>
            </li>
        </ul>
        <ul>
            <li>
                <a href="#Postal-code-location-data">Postal code location data</a>
            </li>
        </ul>
        <ul>
            <li>
                <a href="#Postal-code-location-data-merge">Postal code location data merge (pre-analysis)</a>
            </li>
        </ul>
        <ul>
            <li>
                <a href="#Income-data">Income data</a>
            </li>
        </ul>
        <ul>
            <li>
                <a href="#Venue-data">Venue data</a>
            </li>
        </ul>
        <li>
            <a href="#Merging-postal-codes-with-income">Merging postal codes with income</a>
        </li>
        <ul>
            <li>
                <a href="#Toronto-map-showing-postal-codes-without-valid-income-data">Toronto map showing postal codes without valid income data</a>
            </li>
        </ul>
        <li>
            <a href="#Merging-postal-codes-with-venue-data">Merging postal codes with venue data</a>
        </li>
        <ul>
            <li>
                <a href="#Filtering-venues">Filtering venues</a>
            </li>
            <li>
                <a href="#Determine-relative-revenue-share">Determine relative revenue share</a>
            </li>
        </ul>
    </ul>
    <li>
        <a href="#Results">Results</a>
    </li>    
    <ul>
        <li>
            <a href="#List-of-top-locations">List of top locations</a>
        </li>
        <li>
            <a href="#Map-of-top-locations">Map of top locations</a>
        </li>
    </ul>
    <li>
        <a href="#Discussion">Discussion</a>
    </li>
    <li>
        <a href="#Conclusion">Conclusion</a>
    </li>
</ul>
</font>

## Introduction

Many factors affect the success of a new restaurant in Toronto, and location is one of the most significant.  This project identifies the locations in the city where a new restaurant would garner the largest relative revenue, improving its probability of success.  The number of restaurants in an area, combined with the income for that area, provides a measure of income potential.  Locations are ranked by dividing the income for each area equally among all restaurants there, including the new one.

## Data Requirements

To perform location analysis, multiple data sets for the Toronto area will be required:
- **A method for subdividing the city of Toronto**

    Using postal codes provides a mechanism to subdivide the city in a manner that correlates well with other data sets.  Each postal code entry in the data set contains the latitude/longitude of the center of the postal code which can also be used to identify local venues.  This data set has already been provided as part of the course.
     
- **Income data**

    Canada 2016 census data will be used to provide the demographic data required to provide income values for each postal code.
    
- **Venue information**

    The Foursquare API will be used to acquire information about venues in the city of Toronto.  Queries for venues will be based on the location information associated with each postal code.  This venue list will be analyzed to determine the number of restaurants in the area which will allow computation of a revenue share for a new establishment.

# Methodology

## Global configuration

In [1]:
# Install external packages required for analysis
!conda install beautifulsoup4 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
# Import support libraries
from bs4 import BeautifulSoup
import csv
import folium
import json
import math
import pandas as pd
import requests
import time
import urllib.request

In [3]:
# Constants used for analysis
toronto_latitude = 43.651070
toronto_longitude = -79.347015
venue_radius_meters = 1000
maximum_venues_in_radius = 100
locations_to_identify = 10

In [4]:
# Foursquare data
# The file structure should be as follows:
# {"CLIENT_ID": "--Foursquare ID--", "CLIENT_SECRET": "--Foursquare Secret--", "VERSION": "20191009"}

with open("FoursquareCredentials.json", 'r') as infile:
    foursquare_data = json.load(infile)

# print(foursquare_data)
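
As a minimal sketch, the credentials could be checked for the three expected keys before any request is made.  The helper name `parse_foursquare_credentials` and the placeholder values are assumptions for illustration; they are not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "VERSION")


def parse_foursquare_credentials(text):
    """Parse the credentials JSON text and fail early if a key is missing."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError("Missing credential keys: {}".format(", ".join(missing)))
    return data


# Placeholder values only; real credentials belong in FoursquareCredentials.json
example_text = '{"CLIENT_ID": "id", "CLIENT_SECRET": "secret", "VERSION": "20191009"}'
credentials = parse_foursquare_credentials(example_text)
```

Failing at load time gives a clearer error than a `KeyError` raised later, in the middle of the venue queries.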

## Acquiring the data sets

### Postal code information
Download the postal code data and extract the table rows.

In [5]:
data_link="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
resp = urllib.request.urlopen(data_link)

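# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(resp.read(), 'html.parser')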
table = soup.table
rows = table.find_all('tr')

In [6]:
all_postal_codes = []
for row in rows:
    cols = row.find_all('td')
    if len(cols) == 3:        
        postal_code = cols[0].string.strip()

        if cols[1].find('a'):
            borough = cols[1].a.string
        else:
            borough = cols[1].string

        # Skip boroughs with name 'Not assigned'
        if borough == 'Not assigned':
            continue
            
        borough = borough.rstrip()

        if cols[2].find('a'):
            neighborhood = cols[2].a.string.rstrip()
        else:
            neighborhood = cols[2].string.rstrip()

        # If there is no neighborhood name, use borough name as neighborhood name
        if neighborhood == 'Not assigned':
            neighborhood = borough

        all_postal_codes.append([postal_code, borough, neighborhood])


Since multiple neighborhoods can exist in the same postcode, a single postcode entry needs to aggregate all of its neighborhoods (separated by commas).
It is assumed that each postcode covers only one borough.

In [7]:
postal_code_map = {}

for entry in all_postal_codes:
    postal_code = entry[0]
    if postal_code not in postal_code_map:
        # add borough and neighborhood to the new entry
        postal_code_map[postal_code] = [entry[1], entry[2]]
    else:
        # append this entry neighborhood to the existing record's neighborhood
        existing_entry = postal_code_map[postal_code]
        existing_entry[1] += ", "
        existing_entry[1] += entry[2]

In [8]:
# Convert the postal code map to a data frame
postal_code_data_frame = pd.DataFrame(columns=['Postcode', 'Borough', 'Neighborhood'])

data_frame_row = 0
for entry in postal_code_map:
    postal_data = postal_code_map[entry]
    postal_code_data_frame.loc[data_frame_row] = [entry, postal_data[0], postal_data[1]]
    data_frame_row += 1
    
# print("Postal code data frame dimensions: {}".format(postal_code_data_frame.shape))
# print(postal_code_data_frame.head())

### Postal code location data

Since the geolocation API is unreliable, use the fixed data set to join the lat/lon data for each postal code with the neighborhood data set.

In [9]:
location_data_source = "https://cocl.us/Geospatial_data"
location_data_frame = pd.read_csv(location_data_source)

### Postal code location data merge

Perform a data merge at this time to aggregate postal code data with associated lat/lon information.  This merged data set will be culled based on income data, and the aggregate lat/lon is required for the venue queries to be made for each postal code.

In [10]:
toronto_neighborhoods_data_frame = pd.DataFrame.merge(postal_code_data_frame, location_data_frame, left_on='Postcode', right_on='Postal Code', how='inner')
toronto_neighborhoods_data_frame.drop(['Postal Code'], axis=1, inplace=True)

# print("toronto_neighborhoods_data_frame")
# print(toronto_neighborhoods_data_frame.head())

### Income data

Gather data from the Canadian 2016 Census.  The main download page is: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E

The relevant data set is the one for Forward Sortation Areas.  Download the .CSV archive, which then needs to be parsed: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO=046

Once downloaded, the contents of the .ZIP file need to be extracted.  The archive includes a large .CSV file containing all of the data for Ontario, Canada.  To avoid loading the entire file into memory, the reference file 'Geo_starting_row_CSV.csv' provides index values into the actual data set for each postal code.

In [11]:
!if [ ! -e "Census_2016_Forward_Sortation_Area.zip" ]; then wget -t0 -c -O "Census_2016_Forward_Sortation_Area.zip" "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO=046"; fi

In [12]:
!if [ -e "Census_2016_Forward_Sortation_Area.zip" ]; then unzip -o "Census_2016_Forward_Sortation_Area.zip"; else echo "No file to unzip!"; fi


Archive:  Census_2016_Forward_Sortation_Area.zip
  inflating: 98-401-X2016046_English_CSV_data.csv  
  inflating: Geo_starting_row_CSV.csv  
  inflating: README_meta.txt         
  inflating: 98-401-X2016046_English_meta.txt  


It is possible that there is a discrepancy between the postal code lists in the current postal code data
and the older census data.  Read the 'Geo_starting_row_CSV.csv' and determine if there are any missing
postal codes in either data set.  Missing codes will be removed from the data set as there is no way to use that data.

In [13]:
post_codes = toronto_neighborhoods_data_frame['Postcode']
post_codes_row_count = post_codes.shape[0]
geo_index_data_frame = pd.read_csv("Geo_starting_row_CSV.csv")
codes_to_remove = []

# Collect the line ranges in a list; DataFrame.append was removed in pandas 2.0
census_line_records = []

for i in range(0, post_codes_row_count):
    current_code = post_codes.iloc[i]
    r = geo_index_data_frame.loc[geo_index_data_frame['Geo Name'] == current_code]
    if not r.empty:
        # The next index row marks where this postal code's data ends
        r1 = geo_index_data_frame.loc[r.iloc[0].name + 1]
        start_line = r.iloc[0]["Line Number"]
        end_line = r1["Line Number"]
        census_line_records.append({"Post_code": current_code, "start_line": start_line, "end_line": end_line})
    else:
        print("Unable to locate Postcode: {}".format(current_code))
        codes_to_remove.append(current_code)

# Build the line index data frame once, after all postal codes are processed
census_data_lines = pd.DataFrame(census_line_records, columns=['Post_code', 'start_line', 'end_line'])

# Remove postcodes that cannot be mapped
for code in codes_to_remove:
    toronto_neighborhoods_data_frame = toronto_neighborhoods_data_frame[toronto_neighborhoods_data_frame["Postcode"] != code]

Unable to locate Postcode: M7R
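
The discrepancy check above amounts to a set-membership test between the two code lists.  The following is a simplified sketch on plain lists rather than data frames; the function name `find_unmapped_codes` and the sample lists are assumptions for illustration.

```python
def find_unmapped_codes(postal_codes, census_geo_names):
    """Return postal codes that have no entry in the census index, in input order."""
    known = set(census_geo_names)
    return [code for code in postal_codes if code not in known]


# Illustrative values; M7R is the code reported as missing above
unmapped = find_unmapped_codes(["M3A", "M4A", "M7R"], ["M3A", "M4A", "M5A"])
print(unmapped)
```

Codes returned by such a check are the ones that would be dropped from the neighborhood data frame.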


### Venue data

Acquire data for the venues within a given distance from the lat/lon of each postal code.  The number of venues to consider and the distance from each postal code are determined by the constants at the top of the notebook.

In [14]:
# Function for acquiring data for a location
def getNearbyVenues(names, latitudes, longitudes, radius, max_venue_count):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            foursquare_data['CLIENT_ID'],
            foursquare_data['CLIENT_SECRET'],
            foursquare_data['VERSION'],
            lat,
            lng,
            radius,
            max_venue_count)
            
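        # make the GET request, retrying up to five times on failure
        results = []  # stays empty if every attempt fails
        retry_count = 5
        while retry_count > 0:
            retry_count -= 1
            try:
                results = requests.get(url).json()["response"]['groups'][0]['items']
                break
            except Exception as e:
                print("Failed: {}".format(e))
            # wait before the next attempt (only reached after a failure)
            time.sleep(5)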
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
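
As an alternative to formatting the query string by hand, the request URL could be assembled with `urllib.parse.urlencode`, which escapes each parameter.  This is a sketch only: the helper name `build_explore_url` and the placeholder credentials are assumptions, while the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(credentials, lat, lng, radius, limit):
    """Build the explore request URL with properly escaped query parameters."""
    params = {
        "client_id": credentials["CLIENT_ID"],
        "client_secret": credentials["CLIENT_SECRET"],
        "v": credentials["VERSION"],
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "{}?{}".format(EXPLORE_ENDPOINT, urlencode(params))


# Placeholder credentials; the coordinates are those of postal code M3A
example_url = build_explore_url(
    {"CLIENT_ID": "id", "CLIENT_SECRET": "secret", "VERSION": "20191009"},
    43.753259, -79.329656, 1000, 100)
```

Note that `urlencode` percent-encodes the comma in `ll`; it decodes back to the same latitude/longitude pair.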

In [15]:
venue_data = getNearbyVenues(names=toronto_neighborhoods_data_frame['Postcode'],
                             latitudes=toronto_neighborhoods_data_frame['Latitude'],
                             longitudes=toronto_neighborhoods_data_frame['Longitude'],
                             radius=venue_radius_meters,
                             max_venue_count=maximum_venues_in_radius)
print("Venue data retrieved")

Venue data retrieved
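
The retry logic used for the venue requests can be isolated into a small helper.  The sketch below is not the notebook's implementation: the name `call_with_retries`, its parameters, and the `flaky` test function are assumptions for illustration.

```python
import time


def call_with_retries(action, attempts=5, delay_seconds=5, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out; return None on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            print("Attempt {} failed: {}".format(attempt, error))
            if attempt < attempts:
                sleep(delay_seconds)  # pause only between attempts
    return None


# Hypothetical action that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return ["venue"]


# The sleep function is replaced here so the example runs without waiting
result = call_with_retries(flaky, sleep=lambda seconds: None)
```

Returning `None` after the final failure lets the caller decide whether to skip the postal code or stop the run.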


## Merging postal codes with income

The census data is too large to load all at once.  Helper functions are used to skip through the data file and extract all of the census data for a given postal code.  That census data is then processed further to extract the specific value required.  For this analysis, line 751 of the data set for each postal code contains the average income for that postal code.

In [16]:
# Function to retrieve all census data for a given postal code
def getCensusDataForPostalCode(code):
    all_data = []
    line = census_data_lines.loc[census_data_lines['Post_code'] == code]
    start_line = int(line['start_line'])
    end_line = int(line['end_line'])
#     print("{}\n{}\n{}\n".format(line, start_line, end_line))
    with open("98-401-X2016046_English_CSV_data.csv") as census_data_file:
        cur_line = 1
        while cur_line < start_line:
            census_data_file.readline()
            cur_line += 1
            
        while cur_line < end_line:
            all_data.append(census_data_file.readline())
            cur_line += 1

    return all_data
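
The skip-then-read pattern above can also be expressed with `itertools.islice`, which consumes the file lazily.  This is a sketch under the same line-numbering assumption (1-based, end line exclusive); the name `read_line_range` and the in-memory sample file are illustrative.

```python
import io
from itertools import islice


def read_line_range(lines, start_line, end_line):
    """Return lines start_line..end_line-1 (1-based), skipping everything before."""
    return list(islice(lines, start_line - 1, end_line - 1))


# In-memory stand-in for the census data file
sample_file = io.StringIO("line1\nline2\nline3\nline4\nline5\n")
selected = read_line_range(sample_file, 2, 4)
print(selected)
```

Because `islice` stops reading once the end line is reached, the remainder of a large file is never read.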

In [17]:
# Function to retrieve average income for a given postal code
# Data that cannot be parsed will be mapped to None
def getAverageIncomeForPostalCode(code):
    target_line = 751
    all_data = getCensusDataForPostalCode(code)

    income_line = all_data[target_line - 1]
    reader = csv.reader([income_line])
    for row in reader:
        try:
            return int(row[11])
        except (ValueError, IndexError):
            break

    return None
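
The parsing step can be illustrated in isolation.  In this sketch the column index 11 is taken from the function above, while the name `parse_income` and the sample lines (filler fields and the `x` marker for an unparseable value) are hypothetical.

```python
import csv

INCOME_COLUMN = 11  # column index used above for the income value


def parse_income(csv_line, column=INCOME_COLUMN):
    """Return the integer in the given column, or None if it cannot be parsed."""
    row = next(csv.reader([csv_line]), [])
    try:
        return int(row[column])
    except (ValueError, IndexError):
        return None


# Hypothetical lines: eleven filler fields followed by the value of interest
valid_line = ",".join(["f"] * 11 + ["86403"])
unparseable_line = ",".join(["f"] * 11 + ["x"])

print(parse_income(valid_line))
print(parse_income(unparseable_line))
```

Catching only `ValueError` and `IndexError` keeps unexpected errors, such as a missing file, visible instead of silently mapping them to `None`.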

In [18]:
income_data = []
for p in toronto_neighborhoods_data_frame['Postcode']:
    income = getAverageIncomeForPostalCode(p)
#     print ("{} income: {}".format(p, income))
    income_data.append(income)
    
print("Income data mapped")

Income data mapped


Add income data to city dataframe.

In [19]:
toronto_neighborhoods_data_frame['Income'] = income_data
# print(toronto_neighborhoods_data_frame.head())

### Toronto map showing postal codes without valid income data
On the following map the blue dots represent postal codes that have income data.  The red dots represent postal codes that do not have income data.  Postal codes without income data will not be included in the location analysis.

In [20]:
# create map of Toronto postal codes
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

# add markers to map
for lat, lng, postcode, income in zip(toronto_neighborhoods_data_frame['Latitude'], toronto_neighborhoods_data_frame['Longitude'], toronto_neighborhoods_data_frame['Postcode'], toronto_neighborhoods_data_frame['Income']):
    if not math.isnan(income):
        label = '{}, ${}'.format(postcode, income)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)
    else:
        label = '{}'.format(postcode)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='red',
            fill=True,
            fill_color='#802020',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)
    
map_toronto

## Merging postal codes with venue data

List all of the venue categories found in the area so they can be reviewed to determine which ones should be considered as revenue competition.

In [21]:
print(sorted(venue_data['Venue Category'].unique()))

['ATM', 'Accessories Store', 'Afghan Restaurant', 'Airport', 'Airport Lounge', 'American Restaurant', 'Amphitheater', 'Animal Shelter', 'Antique Shop', 'Aquarium', 'Art Gallery', 'Art Museum', 'Arts & Crafts Store', 'Asian Restaurant', 'Athletics & Sports', 'Auto Dealership', 'Automotive Shop', 'BBQ Joint', 'Baby Store', 'Badminton Court', 'Bagel Shop', 'Bakery', 'Bank', 'Bar', 'Baseball Field', 'Baseball Stadium', 'Basketball Stadium', 'Beach', 'Beach Bar', 'Beer Bar', 'Beer Store', 'Belgian Restaurant', 'Bike Shop', 'Bistro', 'Bookstore', 'Boutique', 'Bowling Alley', 'Brazilian Restaurant', 'Breakfast Spot', 'Brewery', 'Bridal Shop', 'Bridge', 'Bubble Tea Shop', 'Buffet', 'Burger Joint', 'Burrito Place', 'Bus Line', 'Bus Station', 'Bus Stop', 'Business Service', 'Butcher', 'Cafeteria', 'Café', 'Cajun / Creole Restaurant', 'Camera Store', 'Candy Store', 'Cantonese Restaurant', 'Caribbean Restaurant', 'Castle', 'Cemetery', 'Cheese Shop', 'Chinese Restaurant', 'Chiropractor', 'Chocolate

### Filtering venues

Venues whose category contains the string 'Restaurant', 'Steakhouse', or 'Bistro' will be considered competition for a restaurant at a given location.

In [22]:
# competing_venue_keywords = ["Restaurant", "Steakhouse", "Airport Food Court", "Bistro", "Burger Joint", "Poutine Place", "Burrito Place", "Fish & Chips Shop"]
competing_venue_keywords = ["Restaurant", "Steakhouse", "Bistro"]

competing_venue_counts = []
for p in toronto_neighborhoods_data_frame['Postcode']:
    venues = venue_data[venue_data['Postcode'] == p]
    competing_venues = 0
    
    for keyword in competing_venue_keywords:
        subset = venues[venues['Venue Category'].str.contains(keyword)]
        competing_venues += len(subset)
    
    competing_venue_counts.append(competing_venues)

toronto_neighborhoods_data_frame['CompetingVenueCount'] = competing_venue_counts
print("Venue counts mapped")
# print(toronto_neighborhoods_data_frame.head())

Venue counts mapped
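
The keyword loop above counts a venue once per matching keyword, so a category that matched two keywords would be counted twice.  Whether such categories occur in the Foursquare data is not established here.  As a hedged sketch, a single combined pattern counts each venue at most once; the name `count_competing` and the sample categories are assumptions for illustration.

```python
import re


def count_competing(categories, keywords):
    """Count categories that contain at least one keyword, each counted once."""
    pattern = re.compile("|".join(re.escape(keyword) for keyword in keywords))
    return sum(1 for category in categories if pattern.search(category))


keywords = ["Restaurant", "Steakhouse", "Bistro"]
# 'Bistro Restaurant' is a made-up category that matches two keywords
categories = ["Afghan Restaurant", "Bistro", "Bank", "Bistro Restaurant"]
competing = count_competing(categories, keywords)
print(competing)
```

For this sample the combined pattern yields 3, whereas summing per-keyword matches would yield 4.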


### Determine relative revenue share

This calculation uses the income for the postal code and divides it equally among all competing venues in the area.  The potential new restaurant is considered in this calculation.

In [23]:
# determine the relative revenue share by adding a restaurant
relative_revenue = []

for p in toronto_neighborhoods_data_frame['Postcode']:
    postcode_data = toronto_neighborhoods_data_frame[toronto_neighborhoods_data_frame['Postcode'] == p]

    income = float(postcode_data['Income'])
    
    if not math.isnan(income):
        competing_venues = int(postcode_data['CompetingVenueCount'])
        relative_revenue.append(income/(competing_venues + 1))
    else:
        relative_revenue.append(0.0)

toronto_neighborhoods_data_frame['RelativeRevenue'] = relative_revenue
print(toronto_neighborhoods_data_frame.head())

  Postcode           Borough                                  Neighborhood  \
0      M3A        North York                                     Parkwoods   
1      M4A        North York                              Victoria Village   
2      M5A  Downtown Toronto                    Regent Park / Harbourfront   
3      M6A        North York             Lawrence Manor / Lawrence Heights   
4      M7A  Downtown Toronto  Queen's Park / Ontario Provincial Government   

    Latitude  Longitude   Income  CompetingVenueCount  RelativeRevenue  
0  43.753259 -79.329656  86403.0                    3     21600.750000  
1  43.725882 -79.315572  70865.0                    2     23621.666667  
2  43.654260 -79.360636  79729.0                   19      3986.450000  
3  43.718518 -79.464763  72787.0                   12      5599.000000  
4  43.662301 -79.389494      NaN                   26         0.000000  
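
The calculation can be checked by hand against the rows above.  The following sketch restates it as a standalone function; the name `relative_revenue_share` is an assumption, and the input values are taken from the printed table.

```python
import math


def relative_revenue_share(income, competing_venues):
    """Split income equally among existing competitors plus the new restaurant."""
    if income is None or math.isnan(income):
        return 0.0  # postal codes without income data get no share
    return income / (competing_venues + 1)


# M3A: income 86403 with 3 competing venues, so 86403 / 4
print(relative_revenue_share(86403.0, 3))
# M7A: no income data available
print(relative_revenue_share(float("nan"), 26))
```

The first call reproduces the 21600.75 shown for M3A, and the second returns 0.0 as for M7A.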


# Results

## List of top locations

The following is the sorted list of the postal codes and relative revenue for the top locations.  The first item in the list represents the location with the largest relative revenue potential.

In [24]:
sorted_by_revenue = toronto_neighborhoods_data_frame.sort_values(by=['RelativeRevenue'], ascending=False)
top_n = sorted_by_revenue[['Postcode', 'RelativeRevenue']].head(locations_to_identify)
print(top_n)

   Postcode  RelativeRevenue
45      M2L        306301.00
61      M4N        203739.00
5       M9A        160481.00
95      M1X        105913.00
17      M9C         98891.00
91      M4W         89832.75
94      M9W         77220.00
57      M9M         73319.00
66      M2P         67243.50
64      M9N         65571.00
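
Selecting the top locations is a top-N selection by relative revenue.  As an illustrative sketch on a plain dictionary, `heapq.nlargest` does the same without sorting the whole collection; the name `top_locations` is an assumption, and the sample values are the first five rows printed earlier (M4A rounded).

```python
import heapq


def top_locations(revenue_by_postcode, count):
    """Return the count (postcode, revenue) pairs with the largest revenue."""
    return heapq.nlargest(count, revenue_by_postcode.items(), key=lambda item: item[1])


sample = {"M3A": 21600.75, "M4A": 23621.67, "M5A": 3986.45, "M6A": 5599.0, "M7A": 0.0}
best_two = top_locations(sample, 2)
print(best_two)
```

For the full data frame, the `sort_values` call above is the more natural choice because the sorted frame is reused for the map.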


## Map of top locations

The following map is a visualization of the analysis and will be addressed in the 'Discussion' section below. 

In [25]:
# Map of Toronto showing the top locations for restaurants
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)
counter = 0

# add markers to map
for lat, lng, postcode, income, relative_revenue in zip(sorted_by_revenue['Latitude'], sorted_by_revenue['Longitude'], sorted_by_revenue['Postcode'], sorted_by_revenue['Income'], sorted_by_revenue['RelativeRevenue']):
    # Advance the rank for every location so each popup shows its own rank
    counter += 1
    if counter <= locations_to_identify:
        label = '{}, #{}, ${}'.format(postcode, counter, relative_revenue)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='green',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)
    else:
        if not math.isnan(income):
            label = '{}, #{}, ${}'.format(postcode, counter, relative_revenue)
            label = folium.Popup(label, parse_html=True)
            folium.CircleMarker(
                [lat, lng],
                radius=5,
                popup=label,
                color='blue',
                fill=True,
                fill_color='#3186cc',
                fill_opacity=0.7,
                parse_html=False).add_to(map_toronto)
        else:
            label = '{}'.format(postcode)
            label = folium.Popup(label, parse_html=True)
            folium.CircleMarker(
                [lat, lng],
                radius=5,
                popup=label,
                color='red',
                fill=True,
                fill_color='#802020',
                fill_opacity=0.7,
                parse_html=False).add_to(map_toronto)
    
map_toronto

# Discussion

The above map shows locations of postal codes in green that represent the best opportunities for a restaurant based on the current analysis.  The locations in red represent locations with insufficient data to contribute to the analysis.  The blue locations represent the remaining postal codes. Clicking on a green or blue location will show the following information:
- Postal code
- Location rank (#1 being most preferred)
- Relative revenue based on competing venues in the area and income for that postal code

Analysis of the data has revealed that most of the top-ranked locations are areas in which no competing venue was identified.  This represents a potential bias, as other factors (such as zoning) may make those locations unsuitable for a restaurant.  Further refinements can include:
- Change the definition of a competing venue.
- Identify the boundaries of a postal code to better represent the actual area covered by a postal code.
- Identify other significant factors that would affect the performance of a restaurant at a location.

# Conclusion

The purpose of this project was to perform an analysis of the areas in Toronto that could be considered as the location of a new restaurant based on the income and competing venues in a given area.  By using census data for areas of Toronto to determine average income, the revenue potential for an area was determined.  Combining that data with Foursquare restaurant venue information, the relative revenue potential could be computed assuming an even distribution of consumer spending across all venues (including the new restaurant).

The final decision by the stakeholders will be made based on additional information about demographics, real estate availability/cost, parking availability, access to major roads, and other characteristics of the neighborhoods.