# WHERE TO START A NEW GYM BUSINESS IN THE CITY OF TORONTO

## INTRODUCTION

Toronto is the most populous city in Canada and one of the most populous in North America. Its business environment is correspondingly well developed, so any new investment in the city benefits from market-based insights that clarify that environment and support a strategy for reducing risk. In this report I look for neighborhoods in Toronto that are suitable for a new gym.

## DATA SECTION

A Wikipedia page that lists Toronto's postal codes with their boroughs and neighborhoods, which is the information needed to explore and cluster the neighborhoods:

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

A CSV file with the geographical coordinates of each postal code:

http://cocl.us/Geospatial_data

The Foursquare API, used to collect data about venues (including gyms) in the city of Toronto:

https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}

## METHODOLOGY

First, I scraped the Wikipedia page, cleaned and wrangled the data, and read it into a pandas dataframe. The main goal was to assess which neighborhoods in the city have no gym, on the assumption that a new gym there is more likely to attract customers. I collected venue data from the Foursquare API through its venues endpoint. I then grouped the neighborhoods without a gym into clusters using k-means, so that a prospective investor can choose among them, and used the folium library to visualize the neighborhoods on a map of Toronto.

### Scraping and building the initial dataframe

In [1]:
# Install the third-party packages used in this notebook
!pip install BeautifulSoup4
!conda install -c conda-forge geopy --yes
!pip install folium

# Scraping and data wrangling
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Geocoding, mapping, and clustering
from geopy.geocoders import Nominatim
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans


In [2]:
#url address of Wiki page
url ="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
# get the contents of the webpage in text format and store in a variable called web
web=requests.get(url).text

In [4]:
#Create a soup object using the class BeautifulSoup
soup = BeautifulSoup(web, "html.parser")

In [5]:
#find a html table in the web page
table = soup.find('table')

In [6]:
#Get all rows from the table
data = []
for row in table.find_all('tr'):
    data.append([t.text.strip() for t in row.find_all('td')])

The dataframe consists of three columns: Postal Code, Borough, and Neighborhood.

In [7]:
#Transform the data into a pandas dataframe
df = pd.DataFrame(data, columns=['Postal Code', 'Borough', 'Neighborhood'])
print(df.shape)
df.head()

(181, 3)


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


Only rows with an assigned borough are processed; rows whose borough is Not assigned are dropped.

In [8]:
#Dropping Not Assigned boroughs
#See the shape
df.drop(df[df['Borough']=="Not assigned"].index, inplace = True)
print(df.shape)
df.head()

(104, 3)


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |


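Each postal code appears in only one row: in the counts below, the number of unique postal codes equals the number of non-empty postal codes (103), so no postal code area is listed more than once. The size is 104 because it also counts the empty row at index 0.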

In [9]:
#Checking unique postal codes: no postal code area is listed more than once
df.agg(['count', 'size', 'nunique'])

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| count | 103 | 103 | 103 |
| size | 104 | 104 | 104 |
| nunique | 103 | 10 | 99 |


If a row has a borough but its neighborhood is Not assigned, the neighborhood is set to the borough name. The check below shows that no such rows remain.

In [10]:
#Checking Not assigned  neighborhood
#There aren't any
df[df["Neighborhood"]=="Not assigned"]

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|


In [11]:
#Printing the final shape of the dataframe
df.shape

(104, 3)
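
The cleaning rules used in this section (skip empty rows, drop rows whose borough is Not assigned, fall back to the borough name when the neighborhood is Not assigned) can be sketched in plain Python. The function name `clean_rows` and the sample rows are illustrative assumptions, not part of the notebook; the `M9Z` row is invented to exercise the fallback rule, since the scraped data contains no such case.

```python
def clean_rows(rows):
    """Apply the cleaning rules to [postal_code, borough, neighborhood] rows."""
    cleaned = []
    for row in rows:
        # Header rows contain no <td> cells, so they arrive as empty lists.
        if len(row) != 3:
            continue
        postal_code, borough, neighborhood = row
        # Rule 1: ignore postal codes without an assigned borough.
        if borough == "Not assigned":
            continue
        # Rule 2: an unassigned neighborhood takes the borough's name.
        if neighborhood == "Not assigned":
            neighborhood = borough
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned


sample = [
    [],                                       # header row (no <td> cells)
    ["M1A", "Not assigned", "Not assigned"],  # dropped
    ["M3A", "North York", "Parkwoods"],       # kept as-is
    ["M9Z", "Etobicoke", "Not assigned"],     # hypothetical row: falls back to borough
]
print(clean_rows(sample))
```

Filtering out the empty row at this stage would also keep it out of the shape and count figures reported above.
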

### Finding the latitude and longitude coordinates of each neighborhood

In [12]:
#converting csv file that has the geographical coordinates of each postal code to a pandas dataframe
latlon_df=pd.read_csv("http://cocl.us/Geospatial_data")
print(latlon_df.shape)
latlon_df.head()

(103, 3)


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [13]:
toronto_df=pd.merge(df,latlon_df,on="Postal Code")
toronto_df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
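
`pd.merge` performs an inner join by default, so a postal code that is present in only one of the two tables is dropped without warning (this is also why the empty row at index 0 no longer appears). A plain-Python sketch of the same key matching, with a report of the unmatched keys, might look as follows. `inner_join_report` is a hypothetical helper; the `M3A`, `M4A`, and `M1B` values come from the previews above, while `M9Z` is a made-up code.

```python
def inner_join_report(left, right):
    """Join two {postal_code: value} dicts and list the keys that fail to match."""
    matched = {code: (left[code], right[code]) for code in left if code in right}
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    return matched, only_left, only_right


neighborhoods = {
    "M3A": "Parkwoods",
    "M4A": "Victoria Village",
    "M9Z": "Hypothetical",  # made-up code with no coordinates
}
coordinates = {
    "M3A": (43.753259, -79.329656),
    "M4A": (43.725882, -79.315572),
    "M1B": (43.806686, -79.194353),
}

matched, only_left, only_right = inner_join_report(neighborhoods, coordinates)
print(len(matched), only_left, only_right)
```

In pandas, `pd.merge(df, latlon_df, on="Postal Code", how="outer", indicator=True)` exposes the same information in a `_merge` column.
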


Using the geopy library to get the latitude and longitude of Toronto, which are used to center the maps


In [14]:
#Finding the geographical coordinate of Toronto
address = 'Toronto'
geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [15]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

### Exploring and clustering the neighborhoods in Toronto

In [16]:
#Defining Foursquare credentials and version
#Credentials are read from environment variables so that they are neither stored nor printed in the notebook
#(the variable names below are an assumption; use whatever your environment defines)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN')  # not used by the requests below
VERSION = '20180604'
LIMIT = 100


In [17]:
#Function to find venues in Toronto using FourSquare API
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
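
String formatting works for this URL, but query parameters are easier to read, and are escaped automatically, when they are encoded from a dictionary. The sketch below builds the same `venues/explore` URL with the standard library only. `build_explore_url` is a hypothetical helper, and the credential values are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL with the same parameters the notebook passes."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude as one comma-separated value
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues to return
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180604",
                        43.753259, -79.329656)
# Decode the query string again to check what the server would receive.
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

With `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` performs the encoding for you.
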

In [18]:
#Creating the dataframe of venues
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [19]:
#Size of resulting dataframe
print(toronto_venues.shape)
toronto_venues.head()

(2118, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 2 | Parkwoods | 43.753259 | -79.329656 | Bella Vita Catering & Private Chef Service | 43.756651 | -79.331524 | BBQ Joint |
| 3 | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 4 | Victoria Village | 43.725882 | -79.315572 | Portugril | 43.725819 | -79.312785 | Portuguese Restaurant |
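
Each request uses `radius=500`, that is, venues within roughly 500 m of the postal code's coordinates. As a sanity check, the great-circle (haversine) distance between a neighborhood point and a returned venue can be computed directly. The function below is a standard haversine sketch, not part of the notebook, and assumes a mean Earth radius of 6,371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Parkwoods coordinates and the first venue in the table above (Brookbanks Park).
distance = haversine_m(43.753259, -79.329656, 43.751976, -79.33214)
print(f"{distance:.0f} m from the neighborhood point; within 500 m: {distance <= 500}")
```
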


In [20]:
#checking how many venues were returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()

| Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Agincourt | 4 | 4 | 4 | 4 | 4 | 4 |
| Alderwood, Long Branch | 8 | 8 | 8 | 8 | 8 | 8 |
| Bathurst Manor, Wilson Heights, Downsview North | 23 | 23 | 23 | 23 | 23 | 23 |
| Bayview Village | 4 | 4 | 4 | 4 | 4 | 4 |
| Bedford Park, Lawrence Manor East | 23 | 23 | 23 | 23 | 23 | 23 |
| ... | ... | ... | ... | ... | ... | ... |
| Willowdale, Willowdale East | 34 | 34 | 34 | 34 | 34 | 34 |
| Willowdale, Willowdale West | 6 | 6 | 6 | 6 | 6 | 6 |
| Woburn | 4 | 4 | 4 | 4 | 4 | 4 |
| Woodbine Heights | 6 | 6 | 6 | 6 | 6 | 6 |


In [21]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 268 unique categories.


In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
toronto_onehot = toronto_onehot[ ['Neighborhood'] + [ col for col in toronto_onehot.columns if col != 'Neighborhood' ] ]

toronto_onehot.head()

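| | Neighborhood | Accessories Store | Adult Boutique | Airport | Airport Food Court | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Train Station | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Parkwoods | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Parkwoods | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Victoria Village | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Victoria Village | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |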


In [23]:
#Examining Dataframe size
toronto_onehot.shape

(2118, 268)

In [24]:
#grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

| | Neighborhood | Accessories Store | Adult Boutique | Airport | Airport Food Court | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | Antique Shop | ... | Train Station | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Bayview Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Bedford Park, Lawrence Manor East | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043478 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90 | Willowdale, Willowdale East | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.029412 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 91 | Willowdale, Willowdale West | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 92 | Woburn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 93 | Woodbine Heights | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [25]:
#confirm the new size
toronto_grouped.shape

(95, 268)
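
Taking the mean of the one-hot columns per neighborhood yields, for each category, the share of that neighborhood's venues that fall in the category. The same quantity can be written out in plain Python, which makes the meaning of the values explicit. `category_shares` is an illustrative helper; the three Parkwoods venues are the rows shown earlier, and the "Example Village" rows are invented.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for neighborhood, cats in counts.items()
    }


venues = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Parkwoods", "BBQ Joint"),
    ("Example Village", "Gym"),          # hypothetical rows from here on
    ("Example Village", "Coffee Shop"),
    ("Example Village", "Coffee Shop"),
    ("Example Village", "Gym"),
]
shares = category_shares(venues)
print(shares["Example Village"]["Gym"])     # share of venues that are gyms
print(shares["Parkwoods"].get("Gym", 0.0))  # an absent category means a share of 0
```

A value of 0 in the `Gym` column of `toronto_grouped` therefore means that none of the venues returned for that neighborhood was categorized as a gym, which is weaker than proof that no gym exists there.
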

In [28]:
#Finding the neighborhoods with no gym
no_gym=toronto_grouped[(toronto_grouped["Gym"]==0) & (toronto_grouped["Yoga Studio"]==0) & (toronto_grouped["Gym / Fitness Center"]==0)]
no_gym=no_gym[['Neighborhood','Gym','Yoga Studio','Gym / Fitness Center']]
no_gym.head()

| | Neighborhood | Gym | Yoga Studio | Gym / Fitness Center |
|---|---|---|---|---|
| 0 | Agincourt | 0.0 | 0.0 | 0.0 |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | 0.0 | 0.0 | 0.0 |
| 3 | Bayview Village | 0.0 | 0.0 | 0.0 |
| 4 | Bedford Park, Lawrence Manor East | 0.0 | 0.0 | 0.0 |
| 5 | Berczy Park | 0.0 | 0.0 | 0.0 |


In [29]:
#new dataframe with the coordinates of the neighborhoods with no gym (one row per venue)
plot_gym=pd.merge(no_gym,toronto_venues,on="Neighborhood")
plot_gym=plot_gym[['Neighborhood','Neighborhood Latitude','Neighborhood Longitude']]
plot_gym.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude |
|---|---|---|---|
| 0 | Agincourt | 43.7942 | -79.262029 |
| 1 | Agincourt | 43.7942 | -79.262029 |
| 2 | Agincourt | 43.7942 | -79.262029 |
| 3 | Agincourt | 43.7942 | -79.262029 |
| 4 | Bathurst Manor, Wilson Heights, Downsview North | 43.754328 | -79.442259 |
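
Because `toronto_venues` holds one row per venue, the merge repeats each neighborhood once per venue (Agincourt appears four times above). If one point per neighborhood is wanted, for example so that clustering is not weighted by venue count, the exact duplicates can be removed first. A plain-Python sketch of that step follows; `unique_points` is a hypothetical helper, and in pandas `plot_gym.drop_duplicates()` would play the same role.

```python
def unique_points(rows):
    """Drop exact duplicate (name, lat, lng) rows while preserving first-seen order."""
    # dict.fromkeys keeps the first occurrence of each hashable row, in order.
    return list(dict.fromkeys(rows))


rows = [
    ("Agincourt", 43.7942, -79.262029),
    ("Agincourt", 43.7942, -79.262029),
    ("Agincourt", 43.7942, -79.262029),
    ("Agincourt", 43.7942, -79.262029),
    ("Bathurst Manor, Wilson Heights, Downsview North", 43.754328, -79.442259),
]
print(unique_points(rows))
```

Deduplicating on the full row rather than on the name alone keeps both points when the same neighborhood name is attached to two postal codes with different coordinates.
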


In [31]:
#visualizing neighborhoods with no gym
map_gym = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(plot_gym['Neighborhood Latitude'], plot_gym['Neighborhood Longitude'], plot_gym['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_gym)
    
map_gym 

### Using k-means clustering

In [32]:
# set number of clusters
k = 5
# keep only the coordinate columns for clustering
toronto_clustering = plot_gym.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 4, 4, 4, 2, 2, 2, 2, 2, 2], dtype=int32)
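
For readers unfamiliar with the algorithm: k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below is a bare-bones version of that loop (Lloyd's algorithm) in plain Python, initialized from the first `k` distinct points so that it is deterministic. It is only an illustration; scikit-learn's `KMeans` uses a more careful initialization (k-means++ by default) and will not necessarily produce the same labels. The sample points are two neighborhood coordinates from the notebook plus small hypothetical offsets around each.

```python
def kmeans(points, k, iterations=100):
    """Cluster 2-D points with Lloyd's algorithm; returns (labels, centroids)."""
    # Deterministic initialization: the first k distinct points.
    centroids = list(dict.fromkeys(points))[:k]
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids would not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


points = [
    (43.7942, -79.262029), (43.7950, -79.2610), (43.7935, -79.2630),
    (43.754328, -79.442259), (43.7550, -79.4410), (43.7538, -79.4430),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Note that the clustering here, as in the notebook, treats latitude and longitude as plain Euclidean coordinates, which is a reasonable approximation at city scale.
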

In [33]:
# add clustering labels
plot_gym.insert(0, 'Cluster Labels', kmeans.labels_)


In [34]:
plot_gym

| | Cluster Labels | Neighborhood | Neighborhood Latitude | Neighborhood Longitude |
|---|---|---|---|---|
| 0 | 4 | Agincourt | 43.794200 | -79.262029 |
| 1 | 4 | Agincourt | 43.794200 | -79.262029 |
| 2 | 4 | Agincourt | 43.794200 | -79.262029 |
| 3 | 4 | Agincourt | 43.794200 | -79.262029 |
| 4 | 2 | Bathurst Manor, Wilson Heights, Downsview North | 43.754328 | -79.442259 |
| ... | ... | ... | ... | ... |
| 753 | 4 | Woodbine Heights | 43.695344 | -79.318389 |
| 754 | 2 | York Mills West | 43.752758 | -79.400049 |
| 755 | 2 | York Mills West | 43.752758 | -79.400049 |
| 756 | 2 | York Mills West | 43.752758 | -79.400049 |


In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one color per cluster label)
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(plot_gym['Neighborhood Latitude'], plot_gym['Neighborhood Longitude'], plot_gym['Neighborhood'], plot_gym['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## RESULTS

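K-means grouped the neighborhoods for which Foursquare returned no gym, yoga studio, or fitness center into 5 geographic clusters across the city of Toronto. Each cluster contains multiple candidate neighborhoods.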

## RECOMMENDATIONS

Most people prefer a gym that is close to their home or workplace. The selection criterion used here is the absence of an existing gym in a neighborhood, but other factors should also be taken into account, such as the population density of each region and the presence of a Business Improvement Area (BIA). A BIA is an association of commercial property owners and tenants within a defined area who work in partnership with the City to create thriving, competitive, and safe business areas that attract shoppers, diners, tourists, and new businesses.

https://ckan0.cf.opendata.inter.prod-toronto.ca/dataset/business-improvement-areas

https://www.arcgis.com/apps/webappviewer/index.html?id=1535b9fca54f46b3954bca6aaf3ab3f5

## CONCLUSION

The results above give a first shortlist of candidate areas, but the choice should be reinforced with additional data before any investment decision is made.