# Capstone Project

This notebook will be used for my IBM Data Science Capstone Project!

Importing the required Libraries.

In [1]:
import pandas as pd
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
from geopy.geocoders import Nominatim
import requests

Reading the csv created from the previous toronto dataset.

In [2]:
tor_data=pd.read_csv('toronto_geo_data.csv').drop('Unnamed: 0', axis=1)
tor_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Gathering the latitude and longitude of toronto for the map visualization

In [3]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Toronto are 43.653963, -79.387207.


Creating a map of the postcodes in the toronto area.

In [4]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Borough'], tor_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Clinet ID and Client Secret for the foursquare API.

In [5]:
CLIENT_ID = 'YNC4DJD21CJ0M3BKMNQO5V021W3T1UO5MIAHAUWMFEBQJF0R' # your Foursquare ID
CLIENT_SECRET = 'NK5IXFBQH2WZKPIYCFVP2LLA2RVWTQVFSEIFMDEB3XDEY4Z1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT=100

Creating a function to gather all venues within 1000 meters of the given postcode.  
I chose 1000 meters so that it is far enough away from the center of the postcode to include a sufficient number of venues, but also not too far so as to include too many venues in multiple postcode areas.

In [6]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

This runs the above function and save the forusquare data as a csv for faster performace in later runs of code.

In [7]:
#tor_venues = getNearbyVenues(names=tor_data['Postcode'],
                                   #latitudes=tor_data['Latitude'],
                                   #longitudes=tor_data['Longitude']
                                  #)
#tor_venues.to_csv('tor_venues_all.csv')

Reads the csv file created from the foursquare function above. This allows me to more quickly run iterations of this full notebook as changes are made.

In [8]:
tor_venues=pd.read_csv('tor_venues_all.csv').drop('Unnamed: 0', axis=1)
tor_venues.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.802008,-79.19808,Fast Food Restaurant
1,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
2,M1B,43.806686,-79.194353,Staples Morningside,43.800285,-79.196607,Paper / Office Supplies Store
3,M1B,43.806686,-79.194353,Harvey's,43.80002,-79.198307,Restaurant
4,M1B,43.806686,-79.194353,Caribbean Wave,43.798558,-79.195777,Caribbean Restaurant


Now, I did a onehot encoding to the venue dataframe above so that I can easily count the numbers of each type of venues.

In [9]:
# one hot encoding
tor_onehot = pd.get_dummies(tor_venues[['Venue Category']], prefix="", prefix_sep="")

# add postcode column back to dataframe
tor_onehot['Postcode']=tor_venues["Postcode"]
# move neighborhood column to the first column
fixed_columns = [tor_onehot.columns[-1]] + list(tor_onehot.columns[:-1])
tor_onehot = tor_onehot[fixed_columns]

print(tor_onehot.shape)
tor_onehot.head()

(4913, 328)


Unnamed: 0,Postcode,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This below code filters the onehot dataframe above to include only restaurant venue categories.

In [10]:
# Group the dataframe by postcode and sum the restaurants in that venue category by postcode.
tor_rest_all=tor_onehot.groupby(by='Postcode').sum().reset_index()

# This filters the columns to those that inlcude only the word "Restaurant"
columns=tor_rest_all.columns
rest_col=[]
for col in columns:
    if 'Restaurant' in col:
        rest_col.append(col)
    if 'Postcode' in col:
        rest_col.append(col)
tor_rest_all=tor_rest_all[rest_col]

# This adds a column that totals the number of restaurants in that postcode area.
tor_rest_all['Total Restaurants']=tor_rest_all.sum(axis=1)

# This adds the postcode latitude and longitude columns back into the dataframe.
tor_data['isin']=tor_data['Postcode'].isin(tor_rest_all['Postcode'])
tor_data_isin=tor_data[tor_data['isin']==True]
tor_rest_all['Postcode Latitude']=tor_data_isin["Latitude"]
tor_rest_all["Postcode Longitude"]=tor_data_isin["Longitude"]

# This moves the latitude and longitude columns to the fron of the dataframe.
cols = tor_rest_all.columns.tolist()
cols.insert(1, cols.pop(cols.index('Postcode Latitude')))
cols.insert(2, cols.pop(cols.index('Postcode Longitude')))
tor_rest_all=tor_rest_all[cols]

tor_rest_all.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Cantonese Restaurant,...,Taiwanese Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Tibetan Restaurant,Turkish Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Total Restaurants
0,M1B,43.806686,-79.194353,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6
1,M1C,43.784535,-79.160497,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,M1E,43.763573,-79.188711,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
3,M1G,43.770992,-79.216917,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
4,M1H,43.773136,-79.239476,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,8


I want to now filter the dataframe further to include only the venues that are Fast Food Restaurants.

In [12]:
# Filters the dataframe to only include columns that contain the word "fast Food"
tor_fast=tor_onehot.drop(tor_onehot.columns[~tor_onehot.columns.str.contains('Fast Food')], axis=1)

# add postcode column back to dataframe
tor_fast['Postcode']=tor_venues["Postcode"]
# move neighborhood column to the first column
fixed_columns = [tor_fast.columns[-1]] + list(tor_fast.columns[:-1])
tor_fast = tor_fast[fixed_columns]

# Group the dataframe by postcode and sum the restauants up in that postcode.
tor_fast=tor_fast.groupby('Postcode').sum().reset_index()

# Add the postcode and postcode latitude and longitude back into this dataframe.
tor_data['isin']=tor_data['Postcode'].isin(tor_fast['Postcode'])
tor_data_isin=tor_data[tor_data['isin']==True]
tor_fast['Postcode Latitude']=tor_data_isin["Latitude"]
tor_fast["Postcode Longitude"]=tor_data_isin["Longitude"]

# Add the total restaurant column into this dataframe as well.
tor_fast['Total Restaurants']=tor_rest_all['Total Restaurants']

# Create acolumn that calculates the percent of the restaurants in that postcode that are of the fast food category.
tor_fast['Percent Fast Food']=(tor_fast['Fast Food Restaurant']/tor_fast['Total Restaurants']).round(2)
tor_fast=tor_fast.fillna(0)
tor_fast['Percent Fast Food'].astype(int)

# Move the latitude an longitude columns to the front of the dataframe.
cols = tor_fast.columns.tolist()
cols.insert(1, cols.pop(cols.index('Postcode Latitude')))
cols.insert(2, cols.pop(cols.index('Postcode Longitude')))
tor_fast=tor_fast[cols]

tor_fast.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food
0,M1B,43.806686,-79.194353,2,6,0.33
1,M1C,43.784535,-79.160497,0,1,0.0
2,M1E,43.763573,-79.188711,2,3,0.67
3,M1G,43.770992,-79.216917,1,3,0.33
4,M1H,43.773136,-79.239476,1,8,0.12


Now we will do a k-means clustering of the above dataframe so that we can easily see different categories of postcodes.  
This will allow us to determine which postcodes best suit your needs in terms of competition in the area.  
I chose 4 clusters because that results in the most clear-cut segmentation of the postcodes for analysis.

In [38]:
# set number of clusters
kclusters = 4

# Remove the unnecessary columns for the clustering.
tor_fast_clust = tor_fast.drop(['Postcode','Postcode Latitude', 'Postcode Longitude','Percent Fast Food'], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tor_fast_clust)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 2, 0, 0, 0, 0, 0])

Add the Cluster Labels back into the dataframe for later analysis.

In [39]:
tor_fast["Cluster Labels"]=kmeans.labels_

In [40]:
tor_fast.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food,Cluster Labels
0,M1B,43.806686,-79.194353,2,6,0.33,0
1,M1C,43.784535,-79.160497,0,1,0.0,0
2,M1E,43.763573,-79.188711,2,3,0.67,0
3,M1G,43.770992,-79.216917,1,3,0.33,0
4,M1H,43.773136,-79.239476,1,8,0.12,2


The below code will visualize the clusters on a map of Toronto so that we can see where each cluster is generally located within the city.

In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_fast['Postcode Latitude'], tor_fast['Postcode Longitude'], tor_fast['Postcode'], tor_fast['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

While the clusters appear to be relatively spread out, we can see that there are commonalities within each cluster.  
It appears as though Cluster 3 (Yellow) is located mostly within the area of Downtown Toronto.  
Cluster 1 (Purple) is generally located close to Downtown, butmostly just outside of it.  
Cluster 2 (Teal) is generally just a little further out from downtown compared to Cluster 1.  
Cluster 0 (Red) is generally the furtherst our from Downtown.  
  
Below, We will look at hte data within each cluster to determine how those general locations effect the total number of restaurants and the Fast Food Restaurants.  
We will also do a quick analysis of each cluster to help you make a more informed decison about where your new fast food restaurant might be most profitable.

# Cluster 0: Red
## Analysis of postcodes in this cluster
As we mentioned above, this cluster includes postcodes that are generally furthest away from Downtown. From this geographical perspective, we think the data makes sense.  
As you can tell from the data below, these postcodes have a small number of total restaurants and a high percetnage of Fast Food restaurants.  
  
This could suggest a number of things, including that these ara have a low popoulation and/or low tourist visitation.  
Based on this, I would not suggest postcodes in this area for your new fast food restaurant.

In [63]:
clust0=tor_fast.loc[tor_fast['Cluster Labels'] == 0]
print("Average percent of restaurants in this cluster that are fast food: ", (clust0['Percent Fast Food'].mean())*100)
clust0.sort_values('Percent Fast Food',ascending=False)

Average percent of restaurants in this cluster that are fast food:  11.547619047619047


Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food,Cluster Labels
2,M1E,43.763573,-79.188711,2,3,0.67,0
34,M4B,43.725882,-79.315572,1,2,0.5,0
100,M9V,43.688905,-79.554724,1,2,0.5,0
0,M1B,43.806686,-79.194353,2,6,0.33,0
7,M1L,43.711112,-79.284577,1,3,0.33,0
24,M3A,43.782736,-79.442259,1,3,0.33,0
79,M6M,43.713756,-79.490074,1,3,0.33,0
6,M1K,43.727929,-79.262029,2,6,0.33,0
5,M1J,43.744734,-79.239476,1,3,0.33,0
3,M1G,43.770992,-79.216917,1,3,0.33,0


# Cluster 1: Purple
## Analysis of postcodes in this cluster
As we mentioned above, this cluster is generally the second closest to the downtown tToronto area. From this geographical perspective, we think the data makes sense.  
As you can tell from the data below, these postcodes have a high number of total restaurants and a small percentage of Fast Food restaurants.  
  
This could suggest a number of things, including that these areas have a high population and a higher socio-economic demographic. It could also suggest that the cost to open and operate a restaurant in these areas arerelatively inexpensive as well, and thus could provide a good cost-benefit ratio.  
  
Based on this, I think these postcode areas could be an excellent choice for your new fast food restaurant. I would pay close attention to specific postcodes in the cluster that have a low percentage of restuarants that are fast food as it could result in lower competition. I also think this would be a good cluster to choose from because since most of the postcodes are not exactly in downtown, the cost of owning and operating such a restaurant should be lower than those located in the heart of downtown.

In [61]:
clust1=tor_fast.loc[tor_fast['Cluster Labels'] == 1]
print("Average percent of restaurants in this cluster that are fast food: ", (clust1['Percent Fast Food'].mean())*100)
clust1.sort_values('Percent Fast Food',ascending=False)

Average percent of restaurants in this cluster that are fast food:  3.0909090909090917


Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food,Cluster Labels
21,M2N,43.789053,-79.408493,5,38,0.13,1
40,M4K,43.685347,-79.338106,3,32,0.09,1
46,M4S,43.715383,-79.405678,2,36,0.06,1
51,M4Y,43.667967,-79.367675,1,32,0.03,1
84,M7A,43.651571,-79.48445,1,29,0.03,1
42,M4M,43.668999,-79.315572,0,29,0.0,1
64,M5R,43.696948,-79.411307,0,31,0.0,1
66,M5T,43.662696,-79.400049,0,30,0.0,1
69,M5X,43.646435,-79.374846,0,30,0.0,1
74,M6G,43.689026,-79.453512,0,35,0.0,1


# Cluster 2: Teal
## Analysis of postcodes in the cluster
As mentioned above, this cluster is further away from downtown, but not as far as cluster 0. From this geographical perspective, we think this again, makes sense.  
The data below indicates that this cluster includes postcodes with a medium number of restaurants and a medium percent of restaurants that are fast food.  
  
This could suggest a number of things, including that this is mostly a residential area where there are a good number of people that want quick meals reletively nearby, but is likely not a large tourist area.  
  
Based on this data, I would not suggest postcodes in this area as there is quite a bit of direct competition for fast food restaurants and not a lot of other options for people in the area.

In [60]:
clust2=tor_fast.loc[tor_fast['Cluster Labels'] == 2]
print("Average percent of restaurants in this cluster that are fast food: ", (clust2['Percent Fast Food'].mean())*100)
clust2.sort_values('Percent Fast Food',ascending=False)

Average percent of restaurants in this cluster that are fast food:  8.964285714285715


Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food,Cluster Labels
71,M6B,43.718518,-79.464763,3,9,0.33,2
15,M1W,43.799525,-79.318389,2,7,0.29,2
32,M3N,43.728496,-79.495697,2,7,0.29,2
14,M1V,43.815252,-79.284577,2,11,0.18,2
13,M1T,43.781638,-79.304302,2,12,0.17,2
70,M6A,43.648429,-79.38228,2,12,0.17,2
61,M5M,43.648198,-79.379817,2,13,0.15,2
17,M2J,43.803762,-79.363452,1,7,0.14,2
10,M1P,43.75741,-79.273304,2,14,0.14,2
4,M1H,43.773136,-79.239476,1,8,0.12,2


# Cluster 3: Yellow
## Analysis of postcodes in the cluster
As mentioned above, this cluster includes postcodes that are generally in or very near downtown Toronto.  
The data below indicates that there is a medium-high number of restaurants and a low percentage of fast food restaurants.  
  
This could suggest a number of things, inclduing that there is likely a large population and that it is likely a high tourist destination as well, but that cost of owning and operating a restaurant is costly.  
  
Based on this, this is a great cluster to choose your postcode location from, but I fear that since there are a lower number of restaurants in this cluster compared to cluster 1 that this cost in this area is high. More study would be necessary to determine that, but that is beyond the scope of this analysis.

In [64]:
clust3=tor_fast.loc[tor_fast['Cluster Labels'] == 3]
print("Average percent of restaurants in this cluster that are fast food: ", (clust3['Percent Fast Food'].mean())*100)
clust3.sort_values('Percent Fast Food',ascending=False)

Average percent of restaurants in this cluster that are fast food:  2.1904761904761902


Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Fast Food Restaurant,Total Restaurants,Percent Fast Food,Cluster Labels
44,M4P,43.72802,-79.38879,4,25,0.16,3
39,M4J,43.705369,-79.349372,3,27,0.11,3
41,M4L,43.679557,-79.352188,1,19,0.05,3
81,M6P,43.673185,-79.487262,1,22,0.05,3
52,M5A,43.66586,-79.38316,1,20,0.05,3
53,M5B,43.65426,-79.360636,1,25,0.04,3
12,M1S,43.7942,-79.262029,0,23,0.0,3
59,M5K,43.640816,-79.381752,0,26,0.0,3
82,M6R,43.661608,-79.464763,0,22,0.0,3
77,M6K,43.647927,-79.41975,0,24,0.0,3
