# Capstone Project - Starbucks Site Selection in Toronto
### IBM Data Science Professional Certificate by Coursera

## Contents

1. [Introduction: Business Problem](#introduction)
2. [Data](#data)

## 1. Introduction: Business Problem <a name="introduction"></a>

In early 2021, **Starbucks** announced that it would close up to 300 coffee shops across Canada in response to changing customer behaviour and preferences. The company is closing stores in downtown core areas and shifting its focus to pick-up and convenience-led store formats. **Starbucks' stakeholders are looking for optimal sites for their next stores, with the aim of reducing costs and maximizing returns.**

**Toronto** is the capital of the Canadian province of Ontario and the most populous city in Canada. Unlike the grid-plan suburbs found on the outskirts of most North American cities, many suburban neighborhoods in Toronto encourage high population density by mixing single-detached housing with higher-density apartment blocks. This **diverse cityscape provides ample opportunities to open new-format Starbucks stores.**

Socioeconomic factors such as population, regional income level, consumer demographics (age), and the presence of competitors are important inputs to a good site selection strategy. The **objective** of this study is to identify the most promising neighborhoods in Toronto based on these factors. The study is **targeted** at the global market planning department of Starbucks and Starbucks' local representatives in Toronto, both of whom are looking for new sites. We will use data science techniques, including clustering, to explore the neighborhoods, cluster them based on the above factors, compare the clusters with existing Starbucks locations, and select the optimal neighborhood cluster(s) for new-format Starbucks stores.

## 2. Data <a name="data"></a>

### 2.1. Data Sources

Based on our business problem, the factors that will influence our decision are:
   * population
   * income
   * age groups
   * number of competitors (coffee shops)

The following data sources are needed to extract or generate the required information:

a) https://open.toronto.ca/dataset/neighbourhood-profiles/ : Population demographics (population and income of each neighborhood) are obtained from this dataset on the **Open Data Portal - City of Toronto**.

b) https://open.toronto.ca/dataset/wellbeing-toronto-demographics/ : Population by age group in each neighborhood is generated from this dataset on the **Open Data Portal - City of Toronto**.

c) https://opencagedata.com : The **OpenCage Geocoding API** provides the geographical coordinates of the neighborhoods.

d) https://developer.foursquare.com/ : The number of coffee shops, their locations, and the locations of Starbucks stores are extracted using the explore function of the **Foursquare API**.


### 2.2. Data Cleaning

### Creating Neighborhoods profile dataframe

#### Load population demographics data

In [1]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np # library to handle data in a vectorized manner

In [2]:
%cd "E:\Studies\Coursera\IBM Data Science Prof. Cert\Lab\Course 10 Applied Data Science Capstone project\Capstone Project"

E:\Studies\Coursera\IBM Data Science Prof. Cert\Lab\Course 10 Applied Data Science Capstone project\Capstone Project


In [3]:
#load the neighborhood profile dataset and create a dataframe
df_pop = pd.read_csv('neighbourhood_profiles.csv')
df_pop.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,Beechborough-Greenbrook,Bendale,Birchcliffe-Cliffside,Black Creek,Blake-Jones,Briar Hill-Belgravia,Bridle Path-Sunnybrook-York Mills,Broadview North,Brookhaven-Amesbury,Cabbagetown-South St. James Town,Caledonia-Fairbank,Casa Loma,Centennial Scarborough,Church-Yonge Corridor,Clairlea-Birchmount,Clanton Park,Cliffcrest,Corso Italia-Davenport,Danforth,Danforth East York,Don Valley Village,Dorset Park,Dovercourt-Wallace Emerson-Junction,Downsview-Roding-CFB,Dufferin Grove,East End-Danforth,Edenbridge-Humber Valley,Eglinton East,Elms-Old Rexdale,Englemount-Lawrence,Eringate-Centennial-West Deane,Etobicoke West Mall,Flemingdon Park,Forest Hill North,Forest Hill South,Glenfield-Jane Heights,Greenwood-Coxwell,Guildwood,Henry Farm,High Park North,High Park-Swansea,Highland Creek,Hillcrest Village,Humber Heights-Westmount,Humber Summit,Humbermede,Humewood-Cedarvale,Ionview,Islington-City Centre West,Junction Area,Keelesdale-Eglinton West,Kennedy Park,Kensington-Chinatown,Kingsview Village-The Westway,Kingsway South,Lambton Baby Point,L'Amoreaux,Lansing-Westgate,Lawrence Park North,Lawrence Park South,Leaside-Bennington,Little Portugal,Long Branch,Malvern,Maple Leaf,Markland Wood,Milliken,Mimico (includes Humber Bay Shores),Morningside,Moss Park,Mount Dennis,Mount Olive-Silverstone-Jamestown,Mount Pleasant East,Mount Pleasant West,New Toronto,Newtonbrook East,Newtonbrook West,Niagara,North Riverdale,North St. James Town,Oakridge,Oakwood Village,O'Connor-Parkview,Old East York,Palmerston-Little Italy,Parkwoods-Donalda,Pelmo Park-Humberlea,Playter Estates-Danforth,Pleasant View,Princess-Rosethorn,Regent Park,Rexdale-Kipling,Rockcliffe-Smythe,Roncesvalles,Rosedale-Moore Park,Rouge,Runnymede-Bloor West Village,Rustic,Scarborough Village,South Parkdale,South Riverdale,St.Andrew-Windfields,Steeles,Stonegate-Queensway,Tam O'Shanter-Sullivan,Taylor-Massey,The Beaches,Thistletown-Beaumond Heights,Thorncliffe Park,Trinity-Bellwoods,University,Victoria Village,Waterfront Communities-The Island,West Hill,West Humber-Clairville,Westminster-Branson,Weston,Weston-Pelham Park,Wexford/Maryvale,Willowdale East,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,42,34,76,52,49,39,112,127,122,24,69,108,41,57,30,71,109,96,133,75,120,33,123,92,66,59,47,126,93,26,83,62,9,138,5,32,11,13,44,102,101,25,65,140,53,88,87,134,48,8,21,22,106,125,14,90,110,124,78,6,15,114,117,38,105,103,56,84,19,132,29,12,130,17,135,73,115,2,99,104,18,50,36,82,68,74,121,107,54,58,80,45,23,67,46,10,72,4,111,86,98,131,89,28,139,85,70,40,116,16,118,61,63,3,55,81,79,43,77,136,1,35,113,91,119,51,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,Emerging Neighbourhood,No Designation,NIA,No Designation,No Designation,No Designation,NIA,NIA,Emerging Neighbourhood,No Designation,No Designation,NIA,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,Emerging Neighbourhood,NIA,NIA,No Designation,NIA,No Designation,No Designation,NIA,NIA,No Designation,NIA,No Designation,No Designation,Emerging Neighbourhood,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,Emerging Neighbourhood,No Designation,No Designation,No Designation,No Designation,NIA,No Designation,NIA,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,NIA,NIA,NIA,No Designation,No Designation,Emerging Neighbourhood,No Designation,No Designation,NIA,No Designation,NIA,NIA,No Designation,No Designation,NIA,No Designation,NIA,No Designation,Emerging Neighbourhood,NIA,NIA,No Designation,No Designation,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,27695,15873,25797,21396,13154,23236,6577,29960,22291,21737,7727,14257,9266,11499,17757,11669,9955,10968,13362,31340,26984,16472,15935,14133,9666,17180,27051,25003,36625,35052,11785,21381,15535,22776,9456,22372,18588,11848,21933,12806,10732,30491,14417,9917,15723,22162,23925,12494,16934,10948,12416,15545,14365,13641,43965,14366,11058,17123,17945,22000,9271,7985,43993,16164,14607,15179,16828,15559,10084,43794,10111,10554,26572,33964,17455,20506,13593,32954,16775,29658,11463,16097,23831,31180,11916,18615,13845,21210,18675,9233,13826,34805,10722,7804,15818,11051,10803,10529,22246,14974,20923,46496,10070,9941,16724,21849,27876,17812,24623,25051,27446,15683,21567,10360,21108,16556,7607,17510,65913,27392,33312,26274,17992,11098,27917,50434,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,26918,15434,19348,17671,13530,23185,6488,27876,21856,22057,7763,14302,8713,11563,17787,12053,9851,10487,13093,28349,24770,14612,15703,13743,9444,16712,26739,24363,34631,34659,11449,20839,14943,22829,9550,22086,18810,10927,22168,12474,10926,31390,14083,9816,11333,21292,21740,13097,17656,10583,12525,15853,14108,13091,38084,14027,10638,17058,18495,21723,9170,7921,44919,14642,14541,15070,17011,12050,9632,45086,10197,10436,27167,26541,17587,16306,13145,32788,15982,28593,10900,16423,23052,21274,12191,17832,13497,21073,18316,9118,13746,34617,8710,7653,16144,11197,10007,10488,22267,15050,20631,45912,9632,9951,16609,21251,25642,17958,25017,24691,27398,15594,21130,10138,19225,16802,7782,17182,43361,26547,34100,25446,18170,12010,27018,45041,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,2.90%,2.80%,33.30%,21.10%,-2.80%,0.20%,1.40%,7.50%,2.00%,-1.50%,-0.50%,-0.30%,6.30%,-0.60%,-0.20%,-3.20%,1.10%,4.60%,2.10%,10.60%,8.90%,12.70%,1.50%,2.80%,2.40%,2.80%,1.20%,2.60%,5.80%,1.10%,2.90%,2.60%,4.00%,-0.20%,-1.00%,1.30%,-1.20%,8.40%,-1.10%,2.70%,-1.80%,-2.90%,2.40%,1.00%,38.70%,4.10%,10.10%,-4.60%,-4.10%,3.40%,-0.90%,-1.90%,1.80%,4.20%,15.40%,2.40%,3.90%,0.40%,-3.00%,1.30%,1.10%,0.80%,-2.10%,10.40%,0.50%,0.70%,-1.10%,29.10%,4.70%,-2.90%,-0.80%,1.10%,-2.20%,28.00%,-0.80%,25.80%,3.40%,0.50%,5.00%,3.70%,5.20%,-2.00%,3.40%,46.60%,-2.30%,4.40%,2.60%,0.70%,2.00%,1.30%,0.60%,0.50%,23.10%,2.00%,-2.00%,-1.30%,8.00%,0.40%,-0.10%,-0.50%,1.40%,1.30%,4.50%,-0.10%,0.70%,2.80%,8.70%,-0.80%,-1.60%,1.50%,0.20%,0.60%,2.10%,2.20%,9.80%,-1.50%,-2.20%,1.90%,52.00%,3.20%,-2.30%,3.30%,-1.00%,-7.60%,3.30%,12.00%,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%


In [4]:
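# drop the columns that are not needed
df_pop.drop(df_pop.columns[[0,1,2,3]], axis=1, inplace=True)
df_pop.set_index('Characteristic', inplace=True)

# slice the rows of interest: population and the after-tax income total
# NOTE: the "Total - After-tax income groups ..." row looks like a head count
# (its values sit just below the population) rather than a dollar amount;
# it is kept here as the study's 'Income' column.
df_pop_sliced = df_pop.loc[["Population, 2016",
    "Total - After-tax income groups in 2015 for the population aged 15 years and over in private households - 100% data"]]
# drop the city-wide total without modifying the slice in place
df_pop_sliced = df_pop_sliced.drop(['City of Toronto'], axis=1)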

#transpose the dataframe and rename the columns
df_neigh = df_pop_sliced.T
df_neigh.columns = ['Population', 'Income']
df_neigh.index.name = 'Neighborhood'

print(df_neigh.shape)
df_neigh.head()

(140, 2)


Neighborhood,Population,Income
Agincourt North,29113,24995
Agincourt South-Malvern West,23757,20395
Alderwood,12054,10265
Annex,30526,26305
Banbury-Don Mills,27695,23390
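
The reshaping step above (one row per characteristic, one column per neighbourhood, turned into one row per neighbourhood) can be tried on a small stand-in table. This is a minimal sketch: the `toy` frame is hypothetical and only reuses a few values from the output above, and the names `wide` and `tidy` are illustrative.

```python
import pandas as pd

# Toy stand-in for the neighbourhood profile table: one row per
# characteristic, one column per neighbourhood (values taken from the
# output shown above; the real file has many more rows and columns).
toy = pd.DataFrame(
    {
        "Characteristic": ["Population, 2016", "Population, 2011"],
        "City of Toronto": ["2731571", "2615060"],
        "Agincourt North": ["29113", "30279"],
        "Alderwood": ["12054", "11904"],
    }
).set_index("Characteristic")

# Select the row of interest, drop the city-wide total, then transpose
# so that each neighbourhood becomes a row.
wide = toy.loc[["Population, 2016"]].drop(columns=["City of Toronto"])
tidy = wide.T.rename(columns={"Population, 2016": "Population"})
tidy.index.name = "Neighborhood"

# The values arrive as strings, so convert them before doing arithmetic.
tidy["Population"] = tidy["Population"].astype(int)

print(tidy)
```

Using `drop(columns=...)` on the result of `.loc` returns a new frame, which avoids modifying a slice in place.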


#### Load age group data

In [5]:
#load age group data
df_age = pd.read_excel('age_population.xlsx')

# drop the neighbourhood ID and rename the Neighbourhood column to match df_neigh
# ('Neighborhood' stays a regular column; it is the merge key in the next step)
df_age.drop(['NeighbourhoodID'], axis=1, inplace=True)
df_age.rename(columns={'Neighbourhood':'Neighborhood'}, inplace=True)

print(df_age.shape)
df_age.head()

(141, 26)


Unnamed: 0,Neighborhood,0 to 04 years,0 to 14 years,05 to 09 years,10 to 14 years,100 years and over,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,50 to 54 years,55 to 59 years,55 years and over,60 to 64 years,65 to 69 years,65 years and over,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,85 years and over,90 to 94 years,95 to 99 years,Total Population - All Age Groups - 100% data
0,West Humber-Clairville,1540.0,5060.0,1720.0,1790.0,5.0,2325.0,3120.0,2785.0,2345.0,2035.0,1980.0,2475.0,2195.0,8970.0,1795.0,1595.0,4980.0,1185.0,885.0,700.0,400.0,615.0,160.0,50.0,33320.0
1,Mount Olive-Silverstone-Jamestown,2190.0,7090.0,2500.0,2415.0,0.0,2585.0,2655.0,2400.0,2250.0,2185.0,2275.0,2190.0,1955.0,7040.0,1520.0,1285.0,3560.0,885.0,630.0,465.0,225.0,300.0,70.0,10.0,32950.0
2,Thistletown-Beaumond Heights,540.0,1730.0,600.0,595.0,5.0,650.0,760.0,680.0,715.0,665.0,610.0,770.0,660.0,3065.0,535.0,490.0,1880.0,375.0,335.0,320.0,225.0,350.0,100.0,20.0,10360.0
3,Rexdale-Kipling,560.0,1640.0,515.0,565.0,0.0,635.0,720.0,715.0,680.0,640.0,680.0,815.0,870.0,3255.0,650.0,520.0,1730.0,350.0,295.0,270.0,205.0,300.0,85.0,15.0,10530.0
4,Elms-Old Rexdale,540.0,1805.0,605.0,660.0,0.0,690.0,750.0,600.0,575.0,550.0,540.0,755.0,730.0,2535.0,525.0,415.0,1275.0,305.0,235.0,180.0,105.0,145.0,40.0,5.0,9460.0


#### Merge the two dataframes

In [6]:
df_census = pd.merge(df_neigh, df_age, on = 'Neighborhood', how = 'inner')

# The 45 to 49 years age group is missing; derive it as the total minus the other groups.
# NOTE: the list below starts at '10 to 14 years' and omits '50 to 54 years', so the
# remainder also includes ages 0 to 9 and 50 to 54. Using '0 to 14 years' and adding
# '50 to 54 years' would isolate the 45 to 49 band.
df_census['sum'] = df_census[['10 to 14 years', '15 to 19 years', '20 to 24 years', '25 to 29 years', 
                              '30 to 34 years', '35 to 39 years', '40 to 44 years', '55 years and over']].sum(axis=1)

df_census['45 to 49 years'] = df_census['Total Population - All Age Groups - 100% data'] - df_census['sum']

# replace blanks with NaN, then fill missing values with the column median
df_census = df_census.replace('', np.nan)
df_census['45 to 49 years'] = df_census['45 to 49 years'].fillna(df_census['45 to 49 years'].median())

# drop the helper column and the aggregate or unused age columns
df_census.drop(['sum', 'Total Population - All Age Groups - 100% data', '0 to 04 years','05 to 09 years', 
                '100 years and over', '55 years and over', '65 years and over', 
                '85 years and over', '95 to 99 years'], axis=1, inplace=True)

# add a "State" column holding the province (Ontario); it is used to build the geocoding queries
df_census["State"] = 'Ontario'

# convert population and income from comma-formatted strings to float
df_census['Population'] = df_census['Population'].str.replace(',', '').astype(float)
df_census['Income'] = df_census['Income'].str.replace(',', '').astype(float)

print(df_census.shape)
df_census.head()

(139, 22)


Unnamed: 0,Neighborhood,Population,Income,0 to 14 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,50 to 54 years,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,45 to 49 years,State
0,Agincourt North,29113.0,24995.0,3840.0,1240.0,1705.0,2000.0,2020.0,1775.0,1465.0,1665.0,2440.0,2230.0,2000.0,1905.0,1290.0,1065.0,865.0,565.0,270.0,6960.0,Ontario
1,Agincourt South-Malvern West,23757.0,20395.0,3075.0,935.0,1470.0,1890.0,2020.0,1645.0,1340.0,1360.0,1950.0,1755.0,1510.0,1320.0,875.0,760.0,595.0,365.0,155.0,5720.0,Ontario
2,Alderwood,12054.0,10265.0,1760.0,480.0,570.0,665.0,715.0,840.0,905.0,860.0,1035.0,1030.0,795.0,620.0,420.0,340.0,315.0,190.0,100.0,3175.0,Ontario
3,Annex,30526.0,26305.0,2360.0,675.0,1015.0,2735.0,4350.0,3295.0,2090.0,1750.0,1865.0,1780.0,1700.0,1725.0,1335.0,1030.0,750.0,570.0,340.0,5260.0,Ontario
4,Banbury-Don Mills,27695.0,23390.0,3605.0,1285.0,1370.0,1360.0,1400.0,1595.0,1625.0,1790.0,2225.0,1935.0,1620.0,1670.0,1360.0,1235.0,1085.0,955.0,520.0,6715.0,Ontario
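
A missing age band can be recovered by subtraction only when the bands being subtracted do not overlap and together cover every other age. The sketch below illustrates this on a single row copied from the age-group output above (West Humber-Clairville). It assumes that '0 to 14 years', the five-year bands from 15 to 44, '50 to 54 years' and '55 years and over' partition the total apart from the 45 to 49 band; the variable names are illustrative.

```python
import pandas as pd

TOTAL = "Total Population - All Age Groups - 100% data"

# One row copied from the age-group table shown earlier.
row = pd.DataFrame(
    {
        "Neighborhood": ["West Humber-Clairville"],
        "0 to 14 years": [5060.0],
        "15 to 19 years": [2325.0],
        "20 to 24 years": [3120.0],
        "25 to 29 years": [2785.0],
        "30 to 34 years": [2345.0],
        "35 to 39 years": [2035.0],
        "40 to 44 years": [1980.0],
        "50 to 54 years": [2475.0],
        "55 years and over": [8970.0],
        TOTAL: [33320.0],
    }
)

# Non-overlapping bands that cover every age except 45 to 49.
known_bands = [
    "0 to 14 years", "15 to 19 years", "20 to 24 years", "25 to 29 years",
    "30 to 34 years", "35 to 39 years", "40 to 44 years",
    "50 to 54 years", "55 years and over",
]

# Whatever is left of the total after removing the known bands is 45 to 49.
row["45 to 49 years"] = row[TOTAL] - row[known_bands].sum(axis=1)

print(row[["Neighborhood", "45 to 49 years"]])
```

For this row the remainder is of the same order as the neighbouring 40 to 44 and 50 to 54 bands, which is a quick sanity check on the chosen list of bands.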


In [7]:
np.where(pd.isnull(df_census))

(array([], dtype=int64), array([], dtype=int64))

In [8]:
# cap and floor the values at the 99th and 1st percentiles, respectively, to limit the effect of outliers
df_cap = df_census.loc[:, 'Population':'45 to 49 years'].copy()

for col in df_cap.columns:
    percentiles = df_cap[col].quantile([0.01, 0.99]).values
    df_cap[col] = np.clip(df_cap[col], percentiles[0], percentiles[1])

# merge back to the census data: combine_first keeps the caller's non-null values,
# so it is called on df_cap to let the capped values take priority; the remaining
# columns (Neighborhood, State) are filled in from df_census
df_census = df_cap.combine_first(df_census)

# move the Neighborhood, Population and Income columns to the front
cols_to_order = ['Neighborhood', 'Population', 'Income']
new_columns = cols_to_order + (df_census.columns.drop(cols_to_order).tolist())
df_census = df_census[new_columns]

print(df_census.shape)
df_census.head()

(139, 22)


Unnamed: 0,Neighborhood,Population,Income,0 to 14 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,45 to 49 years,50 to 54 years,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,State
0,Agincourt North,29113.0,24995.0,3840.0,1240.0,1705.0,2000.0,2020.0,1775.0,1465.0,1665.0,6960.0,2440.0,2230.0,2000.0,1905.0,1290.0,1065.0,865.0,565.0,270.0,Ontario
1,Agincourt South-Malvern West,23757.0,20395.0,3075.0,935.0,1470.0,1890.0,2020.0,1645.0,1340.0,1360.0,5720.0,1950.0,1755.0,1510.0,1320.0,875.0,760.0,595.0,365.0,155.0,Ontario
2,Alderwood,12054.0,10265.0,1760.0,480.0,570.0,665.0,715.0,840.0,905.0,860.0,3175.0,1035.0,1030.0,795.0,620.0,420.0,340.0,315.0,190.0,100.0,Ontario
3,Annex,30526.0,26305.0,2360.0,675.0,1015.0,2735.0,4350.0,3295.0,2090.0,1750.0,5260.0,1865.0,1780.0,1700.0,1725.0,1335.0,1030.0,750.0,570.0,340.0,Ontario
4,Banbury-Don Mills,27695.0,23390.0,3605.0,1285.0,1370.0,1360.0,1400.0,1595.0,1625.0,1790.0,6715.0,2225.0,1935.0,1620.0,1670.0,1360.0,1235.0,1085.0,955.0,520.0,Ontario
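
Percentile capping (sometimes called winsorizing) can be checked on a tiny frame where the effect is easy to see. This is a minimal sketch, not the notebook's exact cell: `cap_at_percentiles` is a hypothetical helper, the toy values are made up, and wider percentiles (20% and 80%) are used only so that five rows are enough to show the clipping.

```python
import pandas as pd


def cap_at_percentiles(df, columns, lower=0.01, upper=0.99):
    """Return a copy of df with each listed column clipped to its
    [lower, upper] quantile range. The input frame is left unchanged."""
    capped = df.copy()
    for col in columns:
        lo, hi = capped[col].quantile([lower, upper])
        capped[col] = capped[col].clip(lower=lo, upper=hi)
    return capped


# Made-up data with one obvious outlier.
toy = pd.DataFrame(
    {
        "Neighborhood": list("ABCDE"),
        "Population": [10.0, 20.0, 30.0, 40.0, 1000.0],
    }
)

capped = cap_at_percentiles(toy, ["Population"], lower=0.2, upper=0.8)
print(capped)
```

Because the helper assigns the clipped values straight back into a copy, there is no separate merge step in which the capped values could be lost.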


#### Get geographical coordinates using OpenCage API

In [9]:
#install and import OpenCageGeocode module
!pip install opencage

from opencage.geocoder import OpenCageGeocode

print("Libraries loaded!")

Libraries loaded!


As the simplest, though not the most efficient, approach, I iterate over each row to get the neighborhood and province, then use the API to get the corresponding coordinates. The latitudes and longitudes are collected in two separate lists, which are added as new columns at the end.

In [10]:
key = 'YOUR_OPENCAGE_API_KEY' # API key from https://opencagedata.com (keep real keys out of shared notebooks)

geocoder = OpenCageGeocode(key)

list_lat = [] # create empty lists
list_long = []

for index, row in df_census.iterrows(): # iterate over rows in dataframe
    
    neighborhood = row['Neighborhood']
    state = row['State']
    query = str(neighborhood)+','+str(state)
    
    results = geocoder.geocode(query)
    if results: # use the first (best) match
        list_lat.append(results[0]['geometry']['lat'])
        list_long.append(results[0]['geometry']['lng'])
    else: # no match: keep the row and mark the coordinates as missing
        list_lat.append(np.nan)
        list_long.append(np.nan)

# Create new columns from lists
df_census['Latitude'] = list_lat
df_census['Longitude'] = list_long

In [11]:
print(df_census.shape)
df_census.head() #check the last columns!

(139, 24)


Unnamed: 0,Neighborhood,Population,Income,0 to 14 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,45 to 49 years,50 to 54 years,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,State,Latitude,Longitude
0,Agincourt North,29113.0,24995.0,3840.0,1240.0,1705.0,2000.0,2020.0,1775.0,1465.0,1665.0,6960.0,2440.0,2230.0,2000.0,1905.0,1290.0,1065.0,865.0,565.0,270.0,Ontario,43.808038,-79.266439
1,Agincourt South-Malvern West,23757.0,20395.0,3075.0,935.0,1470.0,1890.0,2020.0,1645.0,1340.0,1360.0,5720.0,1950.0,1755.0,1510.0,1320.0,875.0,760.0,595.0,365.0,155.0,Ontario,43.788555,-79.265661
2,Alderwood,12054.0,10265.0,1760.0,480.0,570.0,665.0,715.0,840.0,905.0,860.0,3175.0,1035.0,1030.0,795.0,620.0,420.0,340.0,315.0,190.0,100.0,Ontario,43.601717,-79.545232
3,Annex,30526.0,26305.0,2360.0,675.0,1015.0,2735.0,4350.0,3295.0,2090.0,1750.0,5260.0,1865.0,1780.0,1700.0,1725.0,1335.0,1030.0,750.0,570.0,340.0,Ontario,43.670338,-79.407117
4,Banbury-Don Mills,27695.0,23390.0,3605.0,1285.0,1370.0,1360.0,1400.0,1595.0,1625.0,1790.0,6715.0,2225.0,1935.0,1620.0,1670.0,1360.0,1235.0,1085.0,955.0,520.0,Ontario,43.752339,-79.365716
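
The geocoding loop can be separated from the API client so that it is testable without network access. The sketch below is one possible arrangement under stated assumptions: `geocode_rows` and `fake_geocode` are hypothetical names, and the stub only imitates the `results[0]['geometry']['lat']` / `['lng']` shape used in the cell above.

```python
import math

import pandas as pd


def geocode_rows(df, geocode, name_col="Neighborhood", region_col="State"):
    """Return a copy of df with Latitude/Longitude columns.

    `geocode` is any callable that takes a query string and returns a list
    of results shaped like [{'geometry': {'lat': ..., 'lng': ...}}, ...].
    Rows with no match get NaN coordinates instead of raising an error.
    """
    lats, lngs = [], []
    for name, region in zip(df[name_col], df[region_col]):
        results = geocode(f"{name}, {region}")
        if results:
            geometry = results[0]["geometry"]
            lats.append(geometry["lat"])
            lngs.append(geometry["lng"])
        else:
            lats.append(math.nan)
            lngs.append(math.nan)
    return df.assign(Latitude=lats, Longitude=lngs)


def fake_geocode(query):
    """Offline stand-in for a geocoder; knows a single neighbourhood."""
    known = {
        "Alderwood, Ontario": [{"geometry": {"lat": 43.601717, "lng": -79.545232}}],
    }
    return known.get(query, [])


toy = pd.DataFrame(
    {"Neighborhood": ["Alderwood", "Nowhere"], "State": ["Ontario", "Ontario"]}
)
located = geocode_rows(toy, fake_geocode)
print(located)
```

With a real client, the same helper could be called as `geocode_rows(df_census, geocoder.geocode)`, assuming the client's `geocode` method returns a list in the shape described above.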


Now we have our neighborhoods profile dataframe. It contains the list of neighborhoods in Toronto along with their population, income, population per age group, and geographical coordinates.

Let's visualize the data.

In [12]:
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [13]:
#Use OpenCageGeocode library to get the latitude and longitude values of Toronto

query_toronto = 'Toronto, Ontario'

location_toronto = geocoder.geocode(query_toronto)

latitude_toronto = location_toronto[0]['geometry']['lat']
longitude_toronto = location_toronto[0]['geometry']['lng']

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_toronto, longitude_toronto))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [14]:
#Create a map of Toronto with neighborhoods superimposed on top
map_toronto = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=10)

#add markers to map
for lat, lng, neighborhood in zip(df_census['Latitude'], df_census['Longitude'], 
                                         df_census['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

### Foursquare API - extracting venues

Let's explore each neighborhood using the Foursquare API to get information on Starbucks and other coffee shops. Other coffee shops are direct competitors, so they are included.

In [15]:
#Define Foursquare credentials and Version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' #your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' #your Foursquare Secret (never print or share it)
VERSION = '20190425' #Foursquare API version
LIMIT = 100 #A default Foursquare API limit value


In [16]:
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

#Create a function to get nearby venues to the neighborhoods in Toronto
def getNearbyVenues(neighborhood, latitudes, longitudes, radius=500):
    
    venues_list=[]
    # 'name' is used for the loop variable so that it does not shadow the 'neighborhood' argument
    for name, lat, lng in zip(neighborhood, latitudes, longitudes):
            
        #create the API request URL
        url_foursquare = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request and keep the list of recommended items
        results_foursquare = requests.get(url_foursquare).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results_foursquare])

    #convert the list of per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood_Latitude', 
                             'Neighborhood_Longitude', 
                             'Venue', 
                             'Venue_Latitude', 
                             'Venue_Longitude', 
                             'Venue_Category']
    
    return nearby_venues

In [17]:
#Now apply the above function on each neighborhood and create a new dataframe
df_nearbyvenues = getNearbyVenues(neighborhood=df_census['Neighborhood'],
                                   latitudes=df_census['Latitude'],
                                   longitudes=df_census['Longitude']
                                  )

print(df_nearbyvenues.shape)
df_nearbyvenues.head()

(2652, 7)


Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Agincourt North,43.808038,-79.266439,Saravanaa Bhavan South Indian Restaurant,43.810117,-79.269275,Indian Restaurant
1,Agincourt North,43.808038,-79.266439,Menchie's,43.808338,-79.268288,Frozen Yogurt Shop
2,Agincourt North,43.808038,-79.266439,Booster Juice,43.809915,-79.269382,Juice Bar
3,Agincourt North,43.808038,-79.266439,Dollarama,43.808894,-79.269854,Discount Store
4,Agincourt North,43.808038,-79.266439,Congee Town 太皇名粥,43.809035,-79.267634,Chinese Restaurant
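
The flattening step inside `getNearbyVenues` can be exercised offline on a hand-written payload. The sketch below assumes each item has the nested `venue` / `location` / `categories` shape that the function above indexes into. `sample_items` and `items_to_rows` are hypothetical, the second venue is invented to show the empty-category case, and the real response contains many more fields.

```python
import pandas as pd

# Hand-written items imitating the structure indexed by getNearbyVenues.
sample_items = [
    {
        "venue": {
            "name": "Tim Hortons",
            "location": {"lat": 43.809993, "lng": -79.269032},
            "categories": [{"name": "Coffee Shop"}],
        }
    },
    {
        "venue": {
            "name": "Hypothetical Venue",  # invented entry without a category
            "location": {"lat": 43.8090, "lng": -79.2680},
            "categories": [],
        }
    },
]


def items_to_rows(neighborhood, items):
    """Flatten venue items into one dict per venue.

    A venue with an empty category list gets None instead of raising
    an IndexError on categories[0].
    """
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append(
            {
                "Neighborhood": neighborhood,
                "Venue": venue["name"],
                "Venue_Latitude": venue["location"]["lat"],
                "Venue_Longitude": venue["location"]["lng"],
                "Venue_Category": categories[0]["name"] if categories else None,
            }
        )
    return rows


df = pd.DataFrame(items_to_rows("Agincourt North", sample_items))
print(df)
```

Building rows as dictionaries lets pandas name the columns directly, so no separate `columns = [...]` assignment is needed.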


#### Starbucks Data

Filter to get neighborhoods with at least one coffee shop.

In [18]:
df_coffee_shops = df_nearbyvenues.query('Venue_Category == "Coffee Shop"')

print(df_coffee_shops.shape)
df_coffee_shops.head()

(172, 7)


Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
11,Agincourt North,43.808038,-79.266439,Tim Hortons,43.809993,-79.269032,Coffee Shop
52,Alderwood,43.601717,-79.545232,Tim Hortons,43.602396,-79.545048,Coffee Shop
86,Annex,43.670338,-79.407117,Tim Hortons,43.666719,-79.404263,Coffee Shop
87,Annex,43.670338,-79.407117,Second Cup (Miles Nadal JCC Fitness),43.666527,-79.403872,Coffee Shop
91,Annex,43.670338,-79.407117,First And Last Coffee Shop,43.67432,-79.40945,Coffee Shop


In [19]:
# Count the coffee shops in each neighborhood
df_coffee_sliced = df_coffee_shops.groupby(['Neighborhood']).size().reset_index()
df_coffee_sliced.columns= ['Neighborhood', 'Coffee Shop']

print(df_coffee_sliced.shape)
df_coffee_sliced.head()  

(73, 2)


Unnamed: 0,Neighborhood,Coffee Shop
0,Agincourt North,1
1,Alderwood,1
2,Annex,3
3,Bathurst Manor,3
4,Bay Street Corridor,1
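
Grouping only the coffee-shop rows leaves out every neighborhood with no coffee shop in range. If those neighborhoods should appear with a count of zero, the counts can be left-merged onto the full neighborhood list. This is a minimal sketch on toy frames; `census`, `coffee` and `with_counts` are illustrative names.

```python
import pandas as pd

# Full list of neighborhoods (toy subset) and the coffee-shop rows found for them.
census = pd.DataFrame(
    {"Neighborhood": ["Agincourt North", "Agincourt South-Malvern West", "Annex"]}
)
coffee = pd.DataFrame(
    {
        "Neighborhood": ["Agincourt North", "Annex", "Annex", "Annex"],
        "Venue_Category": ["Coffee Shop"] * 4,
    }
)

# Count coffee shops per neighborhood.
counts = coffee.groupby("Neighborhood").size().rename("Coffee Shop").reset_index()

# A left merge keeps every neighborhood; missing counts become 0.
with_counts = census.merge(counts, on="Neighborhood", how="left")
with_counts["Coffee Shop"] = with_counts["Coffee Shop"].fillna(0).astype(int)

print(with_counts)
```

Keeping the zero-count rows matters if the competitor count is later used as a clustering feature, since an inner merge would silently drop those neighborhoods.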


Filter to get neighborhoods with at least one Starbucks.

In [20]:
df_starbucks = df_nearbyvenues.query('Venue == "Starbucks"')

print(df_starbucks.shape)
df_starbucks.head()

(40, 7)


Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
217,Bedford Park-Nortown,43.731516,-79.420191,Starbucks,43.728673,-79.418513,Coffee Shop
219,Bedford Park-Nortown,43.731516,-79.420191,Starbucks,43.732604,-79.419136,Coffee Shop
266,Blake-Jones,43.67617,-79.337378,Starbucks,43.67985,-79.34037,Coffee Shop
386,Church-Yonge Corridor,43.670786,-79.385687,Starbucks,43.67034,-79.388262,Coffee Shop
417,Church-Yonge Corridor,43.670786,-79.385687,Starbucks,43.671082,-79.380756,Coffee Shop


Merge the Starbucks dataframe with the census dataframe to get the final dataframe, which holds the details of the neighborhoods that have a Starbucks. A neighborhood with several Starbucks stores appears once per store.

In [21]:
df_starbucks_final = pd.merge(df_census, df_starbucks, on = 'Neighborhood', how = 'inner')

print(df_starbucks_final.shape)
df_starbucks_final.head()

(40, 30)


Unnamed: 0,Neighborhood,Population,Income,0 to 14 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,45 to 49 years,50 to 54 years,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,State,Latitude,Longitude,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Bedford Park-Nortown,23236.0,18560.0,4555.0,1705.0,1725.0,1485.0,875.0,1095.0,1370.0,1495.0,6425.0,1835.0,1665.0,1410.0,1225.0,900.0,660.0,540.0,410.0,200.0,Ontario,43.731516,-79.420191,43.731516,-79.420191,Starbucks,43.728673,-79.418513,Coffee Shop
1,Bedford Park-Nortown,23236.0,18560.0,4555.0,1705.0,1725.0,1485.0,875.0,1095.0,1370.0,1495.0,6425.0,1835.0,1665.0,1410.0,1225.0,900.0,660.0,540.0,410.0,200.0,Ontario,43.731516,-79.420191,43.731516,-79.420191,Starbucks,43.732604,-79.419136,Coffee Shop
2,Blake-Jones,7727.0,6280.0,1405.0,395.0,450.0,435.0,495.0,615.0,670.0,635.0,2200.0,605.0,565.0,375.0,325.0,200.0,155.0,100.0,80.0,20.0,Ontario,43.67617,-79.337378,43.67617,-79.337378,Starbucks,43.67985,-79.34037,Coffee Shop
3,Church-Yonge Corridor,31340.0,29095.0,1260.0,270.0,1040.0,4020.0,5540.0,4485.0,2750.0,1980.0,5005.0,2135.0,1800.0,1435.0,1120.0,750.0,530.0,365.0,180.0,65.0,Ontario,43.670786,-79.385687,43.670786,-79.385687,Starbucks,43.67034,-79.388262,Coffee Shop
4,Church-Yonge Corridor,31340.0,29095.0,1260.0,270.0,1040.0,4020.0,5540.0,4485.0,2750.0,1980.0,5005.0,2135.0,1800.0,1435.0,1120.0,750.0,530.0,365.0,180.0,65.0,Ontario,43.670786,-79.385687,43.670786,-79.385687,Starbucks,43.671082,-79.380756,Coffee Shop


Visualize Starbucks and coffee shop locations on a map.

In [22]:
# create map
map_starbucks = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=11)

# add markers to the map

for lat, lon, poi, ven in zip(df_coffee_shops['Venue_Latitude'], df_coffee_shops['Venue_Longitude'], df_coffee_shops['Neighborhood'], df_coffee_shops['Venue']):
    label = folium.Popup('{}, Venue: {}'.format(poi, ven), parse_html=True)    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='orange',
        fill=True,
        fill_color='orange',
        fill_opacity=1).add_to(map_starbucks)

for lat, lon, poi, pop, inc in zip(df_starbucks_final['Venue_Latitude'], df_starbucks_final['Venue_Longitude'], df_starbucks_final['Neighborhood'], df_starbucks_final['Population'], df_starbucks_final['Income']):
    label = folium.Popup('Starbucks - {}, Pop: {}, Inc: {}'.format(poi, pop, inc), parse_html=True)    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#00704A', # Starbucks green ('starbucks green' is not a valid colour name)
        fill=True,
        fill_color='#00704A',
        fill_opacity=1).add_to(map_starbucks)
       
map_starbucks

As mentioned in the Introduction: Business Problem section, Starbucks stores (green circles on the map) are mainly concentrated in Toronto's downtown core. The visualization is consistent with the business problem and supports looking outward from the core for new locations.
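
One way to put a number on this visual impression is to measure how far each Starbucks is from the city-centre coordinates obtained earlier. The sketch below uses the haversine great-circle formula on three Starbucks locations copied from the output above. `haversine_km` is a hypothetical helper, and using the geocoded city centre as the reference point is an assumption, not part of the original analysis.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# City-centre coordinates printed earlier in the notebook.
centre = (43.6534817, -79.3839347)

# Three Starbucks venues copied from the Starbucks table above.
starbucks = {
    "Bedford Park-Nortown": (43.728673, -79.418513),
    "Blake-Jones": (43.67985, -79.34037),
    "Church-Yonge Corridor": (43.67034, -79.388262),
}

distances = {
    name: haversine_km(centre[0], centre[1], lat, lon)
    for name, (lat, lon) in starbucks.items()
}

for name, km in sorted(distances.items(), key=lambda pair: pair[1]):
    print(f"{name}: {km:.1f} km from the centre")
```

Applied to the full Starbucks table, a summary of these distances (for example the median) would give a simple figure to compare against the candidate neighborhoods.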