
<h1 align="center">  Restaurant and Coffee House Chains in TORONTO </h1>

<h4 align="center"> Author: Andrei Borodich </h4>

<h4 align="center"> Date: 12-MAR-2020 </h4>





## Introduction: Business Problem <a name="introduction"></a>

In this project we explore the West End and East End of the *City of Toronto* to find attractive locations for opening a few food venues that belong to different franchises but are run by the same management company. 
Specifically, this report covers business planning issues pertinent to fast-food chain companies. 
Management companies such as **Restaurant Brands International Inc. (RBI)** show interest in this type of analytics because it provides a scientific, data-grounded approach to estimating attractive locations for a particular brand. 

The case can be described as follows. 
RBI, a Canadian multinational fast-food holding company, owns the American fast-food restaurant chain *Burger King* and the Canadian coffee shop chain *Tim Hortons*. Relying on marketing data, the management promotes opening a few franchises in the "low venue density" neighborhoods of Toronto, which are predominantly located in two boroughs of the city, *Scarborough* (East End) and *Etobicoke* (West End).   
The task is to locate the neighborhoods that have a high affinity to fast-food venues, and then to identify which of them are more suitable for opening a new *fast-food restaurant (Burger King)* and which for opening a new *coffee shop (Tim Hortons)*. 

We use Foursquare location data to create datasets of popular neighborhood venues. We explore a given borough and segment it with pandas dataframe utilities and methods. The k-Means clustering method is run on the neighborhood-venue data and groups the neighborhoods by the similarity of their venue features. 
Then we use Foursquare location data to create further datasets that include only the targeted venue categories in operation, namely **fast-food restaurants (FFRs)** and **coffee shops (CSs)**. 
The frequency distribution over neighborhoods is calculated for each category. 
The results of the two analyses are merged; the franchise ratio is calculated in two ways, based on popularity and based on operation, and is used to propose a list of locations suitable for opening either a new FFR or a new CS.



## Data <a name="data"></a>

We begin with creating a representative dataset with the Toronto neighborhood geospatial data. 
The City of Toronto is divided into 6 districts (boroughs). 
They are Old Toronto (Downtown, Midtown and Islands), York, East York, North York, Scarborough and Etobicoke. 

Initially, we prepared two dataframes with the features of the boroughs and neighborhoods. 
The first dataset contains the names of the neighborhoods grouped under the same postal codes. 
It was obtained by querying a reliable website and fetching the HTML table from the Toronto web page into a pandas dataframe.
The second dataset contains the geographical coordinates of the grouped neighborhoods. 
It was obtained by retrieving Canada Post geocoded data for the postal codes.

With these two datasets, 
we create a combined dataframe of the Toronto neighborhoods with the following features: name, borough label, latitude, longitude and postal code. 
From that dataframe we derive two reduced dataframes for the two boroughs, *Scarborough* and *Etobicoke*. These dataframes, `scarbo_data` and `etobi_data`, are shown below.


<img src="dataset-image-01.png" align="center">

There are **17 combined neighborhoods** in the **Scarborough** borough. 

There are **11 combined neighborhoods** in the **Etobicoke** borough. 
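The merge-and-filter step described above can be sketched with pandas as follows. This is only an illustration: the column names, the placeholder postal codes and the coordinate values are assumptions, not the actual contents of the source tables.

```python
import pandas as pd

# Illustrative rows only; postal codes and coordinates are placeholders.
neighborhoods = pd.DataFrame({
    "PostalCode": ["X1A", "X1B", "X9A"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
    "Neighborhood": [
        "Rouge & Malvern",
        "Highland Creek & Rouge Hill & Port Union",
        "Humber Bay Shores & Mimico South & New Toronto",
    ],
})
coordinates = pd.DataFrame({
    "PostalCode": ["X1A", "X1B", "X9A"],
    "Latitude": [43.80, 43.78, 43.60],
    "Longitude": [-79.20, -79.16, -79.50],
})

# Combine names and coordinates on the shared postal code.
toronto_data = neighborhoods.merge(coordinates, on="PostalCode", how="inner")

# Reduce the combined dataframe to one dataframe per borough.
scarbo_data = toronto_data[toronto_data["Borough"] == "Scarborough"].reset_index(drop=True)
etobi_data = toronto_data[toronto_data["Borough"] == "Etobicoke"].reset_index(drop=True)

print(len(scarbo_data), len(etobi_data))
```

An inner merge keeps only the postal codes that appear in both tables, so a neighborhood without coordinates would be dropped rather than carried along with missing values.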

Next, we create datasets of the **most popular neighborhood venues**, one for Scarborough and one for Etobicoke.
We explore the neighborhoods of a given borough using the `Foursquare` API. By making calls we get the most popular venues, up to a limit of 100, in all categories, within a radius of 1250 meters around every neighborhood. We determined empirically that this radius is optimal for the "low venue density" neighborhoods of Toronto, giving wider circular coverage with less venue overlap. 
We collect the results and extract the venue features from the JSON structure. 
Because the search circles overlap, some venues are counted more than once; we check for them, retain one record per venue ID and discard the duplicates. 
In this way we create two datasets of neighborhood venues for the two boroughs, Scarborough and Etobicoke. 
These dataframes, `scarbo_venues` and `etobi_venues`, are shown below. 
We also visualize the venue locations on the borough maps. 


<img src="dataset-image-02a.png" align="center">
<img src="dataset-image-02b.png" align="center">

There are **527 popular venues** collected in **Scarborough**. 


There are **363 popular venues** collected in **Etobicoke**. 
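The duplicate-removal step can be sketched as below. The report does not state which record is kept when a venue falls inside two search circles; this sketch assumes the record closest to a neighborhood center is retained. The column names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical venue records; venue "b2" was returned for two neighborhoods
# because their 1250 m search circles overlap.
venues = pd.DataFrame({
    "Neighborhood": ["N0", "N0", "N1", "N1"],
    "VenueID": ["a1", "b2", "b2", "c3"],
    "VenueCategory": ["Park", "Coffee Shop", "Coffee Shop", "Pizza Place"],
    "Distance": [300, 900, 400, 150],  # meters from the neighborhood center
})

# Assumption: keep, for every venue ID, the record with the smallest distance.
deduped = (
    venues.sort_values("Distance")
          .drop_duplicates(subset="VenueID", keep="first")
          .sort_index()
          .reset_index(drop=True)
)

print(deduped)
```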


Next, we create datasets of the **targeted-category neighborhood venues**, FFRs and CSs, in Scarborough and Etobicoke.
Again, we use the `Foursquare` API. We retrieve only FFRs and CSs, each category with a limit of 100, within a radius of 1250 meters around every neighborhood. 
After extracting the venue features from the JSON structure and discarding the duplicate venues, we create two datasets for Scarborough, `df_FFRs` and `df_CSs`, and two datasets for Etobicoke, `df_FFRe` and `df_CSe`. We visualize the venue locations on the borough maps. 

<img src="dataset-image-03a.png" align="center">
<img src="dataset-image-03b.png" align="center">

There are **61 fast-food restaurants** and **46 coffee shops** collected in **Scarborough**.


There are **21 fast-food restaurants** and **38 coffee shops** collected in **Etobicoke**. 

## Methodology  <a name="methodology"></a>

### All Venue Categories &mdash; Cluster Analysis

We aim to identify similar neighborhoods, based on the similarity of their venue features.
In other words, we apply a clustering method to our two neighborhood-venue datasets, Scarborough and Etobicoke.
Clustering algorithms are unsupervised learning procedures that group unlabeled data based on the similarity of their features.
The **k-Means clustering** algorithm is one of them; it is well documented and implemented in most machine-learning libraries.

The k-Means clustering algorithm comes in several implementations that differ in computational design.
We chose the **Mini-Batch k-Means** method.
The main idea behind this algorithm is to use small random batches of data of a fixed size, so that they can be stored in memory.
In each iteration a new random batch is drawn from the dataset and used to update the clusters; this is repeated until convergence.
Every data object in the batch is assigned to one of the clusters, depending on the previous locations of the cluster centroids.
The locations of the cluster centroids are then updated using the new data objects from the batch.
The update is a gradient-descent-style step, which is considerably cheaper than a regular full-batch k-Means update.
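A minimal sketch of Mini-Batch k-Means using scikit-learn's `MiniBatchKMeans`. The feature matrix here is synthetic random data with 17 rows (the number of Scarborough neighborhoods), and the values of `batch_size`, `n_init` and `random_state` are assumptions made for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a neighborhood-by-venue-category feature matrix.
rng = np.random.default_rng(0)
X = rng.random((17, 20))

# Each iteration draws a random batch of `batch_size` rows and uses it
# to update the centroids.
model = MiniBatchKMeans(n_clusters=6, batch_size=8, n_init=3, random_state=0)
labels = model.fit_predict(X)

print(labels)
print(model.cluster_centers_.shape)
```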

We can evaluate the quality of the clusters using evaluation metrics. 
Two of them are used in this study. 
The **Cluster Inertia** is defined as the sum of squared distances of all points to the centroid of their cluster. 
Hence, *the lower the inertia value, the better (more compact) our clusters*. 
The **Davies-Bouldin score** is defined as the average, over clusters, of the ratio of intra-cluster distances to inter-cluster distances, where each cluster is compared with the cluster most similar to it.
Hence, *clusters which are farther apart and less dispersed will result in a better (lower) score*.
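Both metrics can be computed as in the sketch below, which runs on synthetic blob data (an assumption, chosen so that the clusters are well defined). The inertia is recomputed by hand from the fitted centroids to show what the number means, and the Davies-Bouldin score comes from `sklearn.metrics.davies_bouldin_score`.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=0)
model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit(X)

# Inertia: sum of squared distances of every point to its assigned centroid.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = float(((X - assigned_centroids) ** 2).sum())

# Davies-Bouldin: lower values mean more compact, better separated clusters.
db_score = davies_bouldin_score(X, model.labels_)

print(f"inertia (manual) = {manual_inertia:.2f}")
print(f"inertia (model)  = {model.inertia_:.2f}")
print(f"Davies-Bouldin   = {db_score:.2f}")
```

Inertia depends on the number of points and on the feature scale, so it is useful for comparing different cluster counts on the same dataset rather than for comparing two different datasets.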


### Targeted Venue Categories &mdash; Frequency Distribution

We aim to quantify the distributions of the targeted venue categories. 
Frequency distributions are mostly used for summarizing categorical variables.
A **frequency distribution** describes a variable by a pair of corresponding sets: the distinct values of the variable and their numbers of occurrences (frequencies).
The common way to visualize a frequency distribution is either a bar plot (showing frequencies of distinct values) or
a histogram (showing frequencies for intervals of values).
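As an illustration, a per-neighborhood frequency distribution of the two targeted categories can be built with `pandas.crosstab`. The rows below are hypothetical and the column names are assumptions.

```python
import pandas as pd

# Hypothetical venue records: one row per operating venue.
venues = pd.DataFrame({
    "Neighborhood": ["N0", "N0", "N0", "N1", "N1", "N2"],
    "Category": ["FFR", "CS", "CS", "FFR", "FFR", "CS"],
})

# Rows: distinct neighborhoods; columns: categories; cells: frequencies.
freq = pd.crosstab(venues["Neighborhood"], venues["Category"])

print(freq)
```

With matplotlib installed, `freq.plot.bar()` would draw one pair of bars per neighborhood.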


## Results <a name="results"></a>

### All Venue Categories

We run the `MiniBatchKMeans` estimator from the Python machine-learning library `scikit-learn`. The number of clusters for each borough dataset was selected using the "elbow method", and
the optimal value of 6 clusters was confirmed in further calculations for both *Scarborough* and *Etobicoke*.
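The elbow method can be sketched as a loop over candidate cluster counts that records the inertia of each fit. The data below are synthetic, and the range of `k` values and the estimator settings are assumptions rather than the ones used in this study.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a neighborhood-by-venue-category feature matrix.
rng = np.random.default_rng(0)
X = rng.random((17, 20))

# Fit one model per candidate number of clusters and store its inertia.
inertias = {}
for k in range(2, 11):
    model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    inertias[k] = model.inertia_

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, value in inertias.items():
    print(k, round(value, 2))
```

On real data the inertia curve is usually plotted against `k` and the bend is chosen by eye, so the choice remains a judgment call.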

First, the k-Means model segmented the **Scarborough** neighborhoods into 
a 9-member cluster, a 4-member cluster and four single-member clusters. 
The model was fitted to the dataset representing *527 venues* that belong to *132 different venue categories*.

Second, the k-Means model clustered the **Etobicoke** neighborhoods into 
a 5-member cluster, a 2-member cluster and four single-member clusters. 
The model was fitted to the dataset representing *363 venues* that belong to *110 different venue categories*.

The clustering model of the *Scarborough* data is characterized by 
*Inertia_score = 1346* and *Davies_Bouldin_score = 0.67*. 
For the *Etobicoke* data, the calculated scores are
*Inertia_score = 544* and *Davies_Bouldin_score = 1.08*. 
For both boroughs, we take these values of the evaluation metrics as an indication that the obtained clusters are meaningful and robust.


The **Scarborough** clustered neighborhoods are shown on the map below (6 clusters). Next, we examine the clusters using the top 10 venue categories in each neighborhood. 

<img src="scarbo_clusters.JPG" align="center">


**Cluster #0, red** (9 members:  2,4,6,8,9,10,12,14,15)   
This cluster attracts people mostly with its recreational venues, such as Parks, Skating Rinks and a Marina, which dominate in popularity.
Here, food venues are mostly represented by Ice Cream Shops and Coffee Shops and, to a lesser extent, by Pizza Places and some ethnic restaurants.
Only a few FFRs can be found in this cluster and their popularity is low.

**Cluster #1, indigo**   
In fact, this cluster comprises the borough center, with lots of coffee shops and different types of restaurants. 
CSs are rather popular. 
FFRs have average popularity.

**Cluster #2, blue**  (4 members:  3,5,11,16)   
CSs and FFRs are popular in this cluster.
Chinese Restaurants are less attractive.

**Cluster #3, cyan**   
Chinese Restaurants and Bubble Tea Shops are highly popular.
Neither FFRs nor CSs attract many people in this area.

**Cluster #4, green**   
Recreational venues, Zoo and Trails, are popular.
FFRs are well represented and popular.

**Cluster #5, orange**   
Chinese Restaurants lead in popularity.
Interestingly, neither CSs nor FFRs appear among the top 10 popular venues in this area.


Thus, based on the overall content of the top 10 lists included in the clusters, <br> we can summarize that 
**clusters #2 and #1** seem to have the highest affinity, in Scarborough, towards both CSs and FFRs. 
The recreational **cluster #0** demonstrates an "upper middle" affinity. 


The **Etobicoke** clustered neighborhoods are shown on the map below. Next, we examine the clusters using the top 10 venue categories in each neighborhood. 

<img src="etobi_clusters.JPG" align="center">


**Cluster #0, red**   
CS and Pizza Place are the most popular venues here.

**Cluster #1, indigo**    
CS and Hotel are the most popular venues here.

**Cluster #2, blue**  (2 members:  2,6)   
Unusually, Convenience Store and Pharmacy are the most popular venues here. 
In this cluster, CSs and FFRs have average popularity and compete with Cafes and Pizza Places. 

**Cluster #3, cyan**   
Restaurant (general category) and Coffee Shop are the most popular venues here.

**Cluster #4, green**    
Pizza Place and Grocery Store are the most popular venues here.

**Cluster #5, orange**  (5 members:  1,3,4,5,9)    
In fact, this cluster comprises the borough center.
Here, food venues and recreational places are in balance.
In most neighborhoods, CS and Park lead the popularity list as a pair.
FFRs and Pizza Places are popular as well.


Thus, based on the overall content of the top 10 lists included in the clusters, <br> we can summarize that 
**clusters #2 and #5** seem to have a high affinity, in Etobicoke, towards both CSs and FFRs.


### Targeted Venue Categories

We calculate the frequency distributions of FFRs and CSs in Scarborough and Etobicoke, using the corresponding datasets that include only the targeted venue categories. 
The two borough distributions are visualized with the bar charts below. Each pair of bars corresponds to one neighborhood, so the number of pairs equals the number of neighborhoods. 



<img src="frequency-dist-scarbo.png" width="680" align="center">

<div style="text-align: center"> The 17 neighborhoods are labeled as N0, N1, ..., N16. </div>


Looking at the Scarborough bar chart, we find two neighborhoods, *N10 and N13*, 
with the highest numbers of operating CSs and FFRs. 
In both places, operating FFRs outnumber CSs. 


<img src="frequency-dist-etobi.png" width="680" align="center">

<div style="text-align: center"> The 11 neighborhoods are labeled as N0, N1, ..., N10. </div>


Looking at the Etobicoke bar chart, we find two neighborhoods, *N1 and N4*, 
with the highest numbers of operating CSs and FFRs. 
In both places, contrary to the Scarborough case, the number of operating CSs exceeds that of FFRs. 

## Discussion <a name="discussion"></a>

It is important to underline that the FFR and CS counts in the targeted venue datasets 
substantially exceed those in the popular venue datasets, although all data were collected with `Foursquare` using the same calling procedure.
The counts of FFRs and CSs in the two boroughs are given in the tables below, together with the calculated CS/FFR ratios. 
Most of the CSs in operation are well rated and popular. That is not the case for the FFRs: only a fraction of the FFRs in operation appear in the popular venue lists. 

<h4 align="center">Franchise Ratio, Scarborough.</h4>

| popular <br> venue categories | popular <br> FFRs | popular <br> CSs | FFRs <br> in operation | CSs <br> in operation |
|  ---| ---:| ---:| ---:| ---:| 
| 132 | 22  |  33 |  61 |  46 |
|     | popular <br> CS/FFR ratio <br> = 1.50 | | operational <br> CS/FFR ratio <br> = 0.75 | | 


<h4 align="center">Franchise Ratio, Etobicoke.</h4>

| popular <br> venue categories | popular <br> FFRs | popular <br> CSs | FFRs <br> in operation | CSs <br> in operation |
|  ---| ---:| ---:| ---:| ---:| 
| 110 | 4   |  27 |  21 |  38 |
|     | popular <br> CS/FFR ratio <br> = 6.75 | | operational <br> CS/FFR ratio <br> = 1.81 | |



From the *Scarborough* table, we conclude that 
the CS/FFR ratio based on popularity is 1.50, which is remarkably larger than
the CS/FFR ratio based on operation, which is 0.75.

From the *Etobicoke* table, we conclude that 
the CS/FFR ratio based on popularity is 6.75, which is remarkably larger than
the CS/FFR ratio based on operation, which is 1.81. 
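The four ratios follow directly from the counts in the two tables, as the short check below shows. The dictionary keys are naming choices made for this sketch.

```python
# Venue counts taken from the two franchise-ratio tables.
counts = {
    "Scarborough": {"popular_FFR": 22, "popular_CS": 33, "operating_FFR": 61, "operating_CS": 46},
    "Etobicoke": {"popular_FFR": 4, "popular_CS": 27, "operating_FFR": 21, "operating_CS": 38},
}

# CS/FFR ratio, once for popular venues and once for venues in operation.
ratios = {
    borough: {
        "popular": round(c["popular_CS"] / c["popular_FFR"], 2),
        "operational": round(c["operating_CS"] / c["operating_FFR"], 2),
    }
    for borough, c in counts.items()
}

for borough, r in ratios.items():
    print(borough, r)
```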


Having the results of the cluster analysis 
we can identify the locations suitable for opening new FFRs and CSs. 

In **Scarborough**, in the 9-member `cluster#0` 
the *operational CS/FFR ratio* has its minimal value for N10 and its maximal value for N12.
In the 4-member `cluster#2`, 
the minimal and maximal operational franchise ratio values are found for N5 and N3, respectively. <br>
Hence, taking into account the neighborhood similarity, and given that a low ratio means relatively few CSs while a high ratio means relatively few FFRs, we recommend the following locations for opening new franchises: 
- *N10* and *N5* to run a Coffee Shop, 
- *N12* and *N3* to run a Fast Food Restaurant.

In **Etobicoke**, in the 5-member `cluster#5` 
the *operational CS/FFR ratio* has its minimal value for N4 and its maximal value for N5.
In the 2-member `cluster#2`, 
the minimal and maximal operational franchise ratio values are found for N6 and N2, respectively. <br>
Hence, taking into account the neighborhood similarity, we recommend the following locations for opening new franchises: 
- *N4* and *N6* to run a Coffee Shop,  
- *N5* and *N2* to run a Fast Food Restaurant.
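The selection rule used above (minimal ratio for a new CS, maximal ratio for a new FFR, within one cluster) can be sketched as follows. The neighborhood labels and venue counts are hypothetical and do not reproduce the per-neighborhood data of this study; the sketch also assumes every neighborhood has at least one FFR, so the ratio is always defined.

```python
import pandas as pd

# Hypothetical counts of operating venues for the members of one cluster.
cluster = pd.DataFrame(
    {"CS": [1, 4, 6], "FFR": [4, 4, 2]},
    index=["Nx", "Ny", "Nz"],
)

# Operational CS/FFR ratio per neighborhood.
cluster["ratio"] = cluster["CS"] / cluster["FFR"]

# Lowest ratio: CSs are under-represented, so a new CS is proposed there.
cs_site = cluster["ratio"].idxmin()
# Highest ratio: FFRs are under-represented, so a new FFR is proposed there.
ffr_site = cluster["ratio"].idxmax()

print("new CS:", cs_site, "| new FFR:", ffr_site)
```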


## Conclusion <a name="conclusion"></a>



(1) 
The Coffee Shop is the most popular venue category in Scarborough and Etobicoke. 
The majority of neighborhoods in both boroughs show a high affinity to Coffee Shops.

(2)
The Fast Food Restaurant is a venue category of average popularity in Scarborough and Etobicoke.
One of the main differences between the food venues in the two Toronto districts is that 
in Scarborough FFRs compete mostly with Chinese Restaurants, 
while in Etobicoke FFRs compete mostly with Pizza Places.

(3)
Based on the cluster analysis data and using the operational franchise ratio, <br>
the recommended locations to open new CSs in **Scarborough** are 
*Highland Creek & Rouge Hill & Port Union* (N10) and 
*Clarks Corners & Sullivan & Tam O'Shanter* (N5), 
while 
*Maryvale & Wexford* (N12) and 
*Rouge & Malvern* (N3) 
are the potential locations to run new FFRs. 

(4) 
Based on the cluster analysis data and using the operational franchise ratio, <br>
the recommended locations to open new CSs in **Etobicoke** are 
*Humber Bay Shores & Mimico South & New Toronto* (N4) and 
*Kingsview Village & Martin Grove Gardens & Richview Gardens & St. Phillips* (N6), 
while 
*Humber Bay & King's Mill Park & Kingsway Park South East* (N5) and 
*Bloordale Gardens & Eringate & Markland Wood & Old Burnhamthorpe* (N2) 
are the potential locations to run new FFRs.


## References <a name="references"></a>


[1] <a href="https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/">City of Toronto &mdash; Neighbourhood Profiles</a>


[2] <a href="http://geocoder.ca/">Geocode USA and Canada</a>


[3] <a href="https://enterprise.foursquare.com/products/places">Places Data by Foursquare</a>


[4] <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">k-means algorithm</a>

