# Capstone Project: Open a Second Coffee Shop in Toronto 
Author: Elly Zhao

Date: 06/18/2020

This notebook is the capstone project for the IBM Data Science Professional Certificate.

# Introduction/Business Problem
**My audience is my client who wants to open a second coffee shop in the city of Toronto.**  

A client of mine owns a coffee shop in Toronto, and business is going so well that they now want to open a second shop in the city. Suppose the client's coffee shop is in the Little Portugal neighborhood of Toronto and has an average rating of 8.0 (out of 10.0). My task is to analyze Toronto's neighborhoods and determine the best location for the second shop.

Besides good management, location is critical to the success of the business. My approach is to pick an environment very similar to that of the first coffee shop. That way, the second shop is likely to attract similar customers, and the client can expect similar business performance, similar traffic, and even similar competitors. In short, I want to find the neighborhood most similar to my client's current neighborhood as the site for the second coffee shop.

To answer this question, I will first analyze the venue information for each neighborhood using Foursquare location data, focusing on food and drink venues. Next, I will build a machine learning model to segment the neighborhoods of Toronto, using the k-means clustering technique. Finally, I will examine the clusters and pick the most suitable neighborhood for my client's second coffee shop.


# Data

Data required for this project includes:

* Postal code, borough, and neighborhood data for Canada from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), which I will filter down to the Toronto neighborhoods only.
* Latitude and longitude of each postal code from [here](https://cocl.us/Geospatial_data).
* Nearby venue data extracted from the Foursquare database:
    * Number of coffee shops.
    * Rating of each coffee shop.


In [1]:
import numpy as np 
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json 
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 
import requests 
from pandas.io.json import json_normalize 
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes 
import folium 
print('Libraries imported.')


# 1 Prepare Data

## 1.1 Import Neighborhood and Postal Code Data
First, import the data from the web page. The table should have three columns: Postal Code, Borough, and Neighborhood. After dropping the rows with an unassigned borough, 103 rows are left. Since there are also 103 unique postal codes, no postal code is repeated, so there is no need to combine rows that share a postal code. Next, make sure no neighborhood is unassigned. After this preparation and checking, the final shape of the table is 103x3.
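The checks described above can be tried on a small, made-up table. The sample rows and the `has_duplicates` and `combined` names below are illustrative assumptions, not the scraped Wikipedia data. The sketch also shows how rows sharing a postal code could be combined if duplicates did appear.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as the Wikipedia table.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park', 'Harbourfront'],
})

# Drop rows whose borough is unassigned.
sample = sample[sample['Borough'] != 'Not assigned']

# If the row count and the number of unique postal codes differ,
# at least one postal code is repeated.
has_duplicates = sample.shape[0] != sample['Postal Code'].nunique()

# Combine neighborhoods that share a postal code into one comma-separated row.
combined = (sample.groupby(['Postal Code', 'Borough'])['Neighborhood']
                  .agg(', '.join)
                  .reset_index())
print(has_duplicates, combined.shape)
```

In the real table the two counts are both 103, so the combining step is not needed there.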

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)
print('There are {} tables in this url, we will use the first table.'.format(len(dfs)))
df = dfs[0]
# df.head()

There are 3 tables in this url, we will use the first table.


In [3]:
df=df[df['Borough']!='Not assigned']

In [4]:
print('size of this table:', df.shape)
print('number of unique postal code:', df['Postal Code'].nunique())

size of this table: (103, 3)
number of unique postal code: 103


In [5]:
print('there are {} neighborhoods not assigned.'.format(
    df[df['Neighborhood']=='Not assigned'].shape[0]))

there are 0 neighborhoods not assigned.


In [6]:
print('shape of this table is:', df.shape)

shape of this table is: (103, 3)


## 1.2 Get latitude and longitude
Get the latitude and longitude coordinates of the postal codes, and merge them into the table.

In [7]:
url = 'https://cocl.us/Geospatial_data'
coordinates = pd.read_csv(url)

In [8]:
df = pd.merge(df, coordinates, how='left', on='Postal Code')

In [9]:
print('shape of this table is:', df.shape)
df.head()

shape of this table is: (103, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
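A left merge keeps every postal code even when no coordinates are found for it, so it can be worth confirming that nothing came back empty. Below is a minimal sketch on made-up rows; the `codes` and `coords` frames and the `M9Z` entry are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical inputs: three postal codes, but coordinates for only two of them.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Neighborhood': ['Parkwoods', 'Victoria Village', 'Hypothetical']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# Same merge call as above: every row of the left table is kept.
merged = pd.merge(codes, coords, how='left', on='Postal Code')

# Rows that found no match carry NaN in the coordinate columns.
missing = merged[merged['Latitude'].isna()]
print('rows without coordinates:', missing['Postal Code'].tolist())
```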


# 2 Explore the neighborhoods of Toronto

## 2.1 Visualize Neighborhoods on Map
There are 10 boroughs and 103 neighborhoods in total. We will only work on the Toronto area, which includes 39 neighborhoods (postal codes).

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),  df.shape[0]))

The dataframe has 10 boroughs and 103 neighborhoods.


In [11]:
toronto_data = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
# toronto_data.head()

In [12]:
print('There are {} neighborhoods in Toronto area'.format(toronto_data.shape[0]))

There are 39 neighborhoods in Toronto area


#### Let's get coordinates of Toronto and visualize the neighborhoods on a map.

In [13]:
address = 'Toronto'

geolocator = Nominatim(user_agent="my_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [14]:
# create map using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Note:** GitHub sometimes fails to render folium maps. As the teaching staff suggested, please enter the GitHub URL at https://nbviewer.jupyter.org/ to view the map.

## 2.2 Explore neighborhoods using Foursquare API

#### Define Foursquare credentials and version

In [15]:
CLIENT_ID = '' 
CLIENT_SECRET = '' 
VERSION = '20180605' 

#### Write a function to extract up to 100 nearby venues for each neighborhood and return location-relevant information about each venue.

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
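The list comprehension inside `getNearbyVenues` assumes every venue has at least one category. A slightly more defensive version of that parsing step might look like the sketch below. `parse_explore_items` and the sample payload are hypothetical, and the payload only mimics the fields that the function above reads.

```python
def parse_explore_items(name, lat, lng, items):
    """Flatten explore-style items into tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            continue  # no category to report, so skip this venue
        location = venue.get('location', {})
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'),
                     categories[0]['name']))
    return rows

# Made-up payload: one venue with a category and one without.
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = parse_explore_items('Sample Neighborhood', 43.654, -79.36, sample_items)
print(rows)
```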

In [127]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                 latitudes=toronto_data['Latitude'],
                                 longitudes=toronto_data['Longitude'])
print('Total {} venues extracted'.format(toronto_venues.shape[0]))
# toronto_venues.head()

In [18]:
# drop the venues categorized as 'neighborhood' (means nothing to this project) 
print('There are {} venues categorized as \'neighborhood\', drop these values.'.format(
    len(toronto_venues[toronto_venues['Venue Category']=='Neighborhood'])))
toronto_venues = toronto_venues[toronto_venues['Venue Category']!='Neighborhood']

There are 4 venues categorized as 'neighborhood', drop these values.


In [19]:
print('There are {} venues with {} unique categories.'.format(toronto_venues.shape[0],len(toronto_venues['Venue Category'].unique())))

There are 1623 venues with 232 unique categories.


## 2.3 Analyze Each Neighborhood

#### First, encode venue category data (to numerical).

In [128]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column (to the last column)
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
# toronto_onehot.head()

#### Next, group rows by neighborhood, taking the mean frequency of occurrence of each category

In [21]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Butcher,Cafeteria,Café,Cajun / Creole Restaurant,Camera Store,Candy Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Auditorium,College Gym,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Convention Center,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hookah Bar,Hospital,Hostel,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts Dojo,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Restaurant,Roof Deck,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.017241,0.034483,0.0,0.0,0.0,0.017241,0.017241,0.0,0.034483,0.0,0.0,0.017241,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.034483,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.017241,0.051724,0.086207,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.017241,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.017241,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.017241,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.017241,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.017241,0.0,0.017241,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.0625,0.1875,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.030303,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.015152,0.0,0.015152,0.015152,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.015152,0.015152,0.0,0.0,0.0,0.060606,0.060606,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.015152,0.0,0.0,0.030303,0.0,0.075758,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.015152,0.0,0.0,0.015152,0.0,0.0,0.015152


In [22]:
toronto_grouped.shape

(39, 233)
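To see what the one-hot encoding and the grouped mean produce, it helps to run the same two steps on a tiny table. The venues below are invented, and `dtype=float` is an assumption added so the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Hypothetical venues for two neighborhoods.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Café', 'Park', 'Park', 'Gym'],
})

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is why the values in `toronto_grouped` can be read as frequencies.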

#### Find the top 10 most common venues in each neighborhood, i.e. sort the venue categories in descending order of frequency.

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [61]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Café,Seafood Restaurant,Beer Bar,Restaurant,Cheese Shop,Bakery,Clothing Store,Basketball Stadium
1,"Brockton, Parkdale Village, Exhibition Place",Café,Performing Arts Venue,Coffee Shop,Breakfast Spot,Yoga Studio,Bakery,Convenience Store,Pet Store,Climbing Gym,Restaurant
2,"Business reply mail Processing Centre, South C...",Yoga Studio,Auto Workshop,Garden Center,Gym / Fitness Center,Fast Food Restaurant,Farmers Market,Light Rail Station,Comic Shop,Pizza Place,Recording Studio
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Sculpture Garden,Harbor / Marina,Rental Car Location,Plane,Coffee Shop,Boat or Ferry,Bar,Airport Lounge
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Japanese Restaurant,Café,Burger Joint,Department Store,Salad Place,Thai Restaurant,Bubble Tea Shop
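The column labels above rely on a `try`/`except` fallback to switch from 'st', 'nd', 'rd' to 'th'. That is fine for ten columns, but it would label a 21st column as '21th'. A small helper makes the rule explicit; `ordinal` is a hypothetical function, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns[1], '...', columns[-1])
```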


# 3 Cluster Neighborhoods
In this section we will partition the 39 Toronto neighborhoods into clusters. This is a clustering problem, and the k-means algorithm will be used to build the machine learning model.
#### Recall the 'toronto_grouped' dataframe which contains encoded venue category information. 

In [129]:
# toronto_grouped.head()

In [62]:
# remove Neighborhood label, all other columns are numerical.
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# set number of clusters
kclusters = 4
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
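For readers unfamiliar with the estimator, here is a toy run of the same kind of `KMeans` call on a made-up frequency table. The numbers are invented for illustration, and `n_init=10` is set explicitly (an assumption) so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are hypothetical neighborhoods; columns are category frequencies
# (coffee shop, park, gym).
X = np.array([
    [0.60, 0.10, 0.30],
    [0.55, 0.15, 0.30],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = toy_kmeans.labels_
print(labels)
```

The first two rows end up in one cluster and the last two in the other. Which group gets label 0 is arbitrary, so only the grouping is meaningful, not the label numbers.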

#### Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [63]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data.copy() #this step is necessary, make sure we don't modify toronto_data table.

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() 

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Restaurant,Hotel,Spa
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Sushi Restaurant,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant,Café
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Cosmetics Shop,Café,Japanese Restaurant,Italian Restaurant,Bubble Tea Shop,Middle Eastern Restaurant,Bookstore,Bakery
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Hotel,Gym,Restaurant,Clothing Store,Italian Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Health Food Store,Asian Restaurant,Pizza Place,Pub,Trail,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store


In [64]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Note:** GitHub sometimes fails to render folium maps. As the teaching staff suggested, please enter the GitHub URL at https://nbviewer.jupyter.org/ to view the map.

# 4 Examine Clusters
Finally let's examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster.

#### Cluster 1

In [65]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, 
                   toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",0,Coffee Shop,Park,Pub,Bakery,Theater,Breakfast Spot,Café,Restaurant,Hotel,Spa
1,"Queen's Park, Ontario Provincial Government",0,Coffee Shop,Sushi Restaurant,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant,Café
2,"Garden District, Ryerson",0,Clothing Store,Coffee Shop,Cosmetics Shop,Café,Japanese Restaurant,Italian Restaurant,Bubble Tea Shop,Middle Eastern Restaurant,Bookstore,Bakery
3,St. James Town,0,Café,Coffee Shop,Cocktail Bar,American Restaurant,Gastropub,Hotel,Gym,Restaurant,Clothing Store,Italian Restaurant
5,Berczy Park,0,Coffee Shop,Cocktail Bar,Café,Seafood Restaurant,Beer Bar,Restaurant,Cheese Shop,Bakery,Clothing Store,Basketball Stadium
6,Central Bay Street,0,Coffee Shop,Sandwich Place,Italian Restaurant,Japanese Restaurant,Café,Burger Joint,Department Store,Salad Place,Thai Restaurant,Bubble Tea Shop
7,Christie,0,Grocery Store,Café,Park,Diner,Baby Store,Candy Store,Nightclub,Coffee Shop,Athletics & Sports,Restaurant
8,"Richmond, Adelaide, King",0,Coffee Shop,Café,Restaurant,Deli / Bodega,Hotel,Gym,Thai Restaurant,Bookstore,Sushi Restaurant,Cosmetics Shop
9,"Dufferin, Dovercourt Village",0,Bakery,Pharmacy,Park,Middle Eastern Restaurant,Café,Bar,Bank,Supermarket,Recording Studio,Brewery
10,"Harbourfront East, Union Station, Toronto Islands",0,Coffee Shop,Aquarium,Café,Hotel,Brewery,Fried Chicken Joint,Scenic Lookout,Restaurant,Sporting Goods Shop,Pizza Place


#### Cluster 2

In [66]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, 
                   toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park, Summerhill East",1,Gym,Trail,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


#### Cluster 3

In [67]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, 
                   toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,2,Garden,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


#### Cluster 4

In [68]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, 
                   toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,The Beaches,3,Health Food Store,Asian Restaurant,Pizza Place,Pub,Trail,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
18,Lawrence Park,3,Park,Bus Line,Swim School,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
21,"Forest Hill North & West, Forest Hill Road Park",3,Park,Jewelry Store,Trail,Sushi Restaurant,Yoga Studio,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
33,Rosedale,3,Park,Playground,Trail,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
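One way to summarize what distinguishes the clusters listed above is to average the category frequencies within each cluster and look at the largest values. The sketch below does this on invented data. The `grouped` frame, the `cluster_labels` list, and the `profile` name are assumptions standing in for `toronto_grouped` and `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical frequency table with four neighborhoods and three categories.
grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Coffee Shop': [0.30, 0.25, 0.00, 0.05],
    'Park':        [0.05, 0.00, 0.50, 0.40],
    'Trail':       [0.00, 0.05, 0.25, 0.30],
})
cluster_labels = [0, 0, 1, 1]  # made-up labels, one per neighborhood

# Mean frequency of each category within each cluster.
profile = (grouped.drop(columns='Neighborhood')
                  .assign(cluster=cluster_labels)
                  .groupby('cluster')
                  .mean())

# The category with the highest mean frequency in each cluster.
top_category = profile.idxmax(axis=1)
print(profile)
print(top_category)
```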


# 5 Further Segmenting

Cluster 0 (listed above as Cluster 1) contains Little Portugal, where my client's current coffee shop is located, and it also contains most of the neighborhoods in Toronto. From it we can infer what a typical Toronto neighborhood looks like: there are plenty of coffee shops and cafés, and restaurants, bars, and many other food and drink venues are also easy to find. Let's further segment this cluster to get more insight into these neighborhoods.

In [130]:
# take the neighborhoods in cluster 0 by excluding the members of the other three clusters
other_clusters = ['Moore Park, Summerhill East', 'Roselawn', 'The Beaches', 'Lawrence Park',
                  'Forest Hill North & West, Forest Hill Road Park', 'Rosedale']
cluster0 = toronto_grouped[~toronto_grouped['Neighborhood'].isin(other_clusters)].copy()
print('cluster 0 shape:', cluster0.shape)
# cluster0

In [73]:
# remove Neighborhood label, all other columns are numerical.
toronto_grouped_clustering = cluster0.drop('Neighborhood', axis=1)

# set number of clusters
kclusters = 4
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

In [131]:
# add clustering labels
cluster0.insert(0, 'Cluster Labels', kmeans.labels_)

In [132]:
cluster00=cluster0[cluster0['Cluster Labels']==0]
cluster00=cluster00['Neighborhood']
# cluster00

In [91]:
cluster00 = pd.merge(toronto_data, cluster00, how='right', on='Neighborhood')

In [92]:
cluster00.shape

(8, 5)

In [93]:
cluster00

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M6G,Downtown Toronto,Christie,43.669542,-79.422564
1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
2,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
5,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
6,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049
7,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049


After the further segmentation in this section, Little Portugal and 7 other neighborhoods are grouped into a smaller cluster, which means these neighborhoods are the most similar to Little Portugal. In the next section, let's analyze the coffee shops in this cluster and finally pick a neighborhood for my client.
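Cluster membership says which neighborhoods are grouped with Little Portugal, but not how close each one is. As a complement, and not something this notebook does, the neighborhoods could be ranked by the Euclidean distance between their category-frequency vectors. The sketch below uses invented numbers and hypothetical row names.

```python
import numpy as np
import pandas as pd

# Hypothetical category frequencies; 'Reference' stands in for the client's neighborhood.
freq = pd.DataFrame(
    {'Coffee Shop': [0.20, 0.18, 0.05],
     'Bar':         [0.10, 0.12, 0.02],
     'Park':        [0.02, 0.03, 0.40]},
    index=['Reference', 'Candidate 1', 'Candidate 2'],
)

reference = freq.loc['Reference']

# Euclidean distance from every row to the reference row; smaller means more similar.
distances = np.sqrt(((freq - reference) ** 2).sum(axis=1)).drop('Reference')
ranking = distances.sort_values()
print(ranking)
```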

# 6 Analyze Coffee Shops in the Cluster and Pick a Neighborhood  

#### Find coffee shops in each neighborhood

In [112]:
def searchVenues(search_query, names, latitudes, longitudes, radius=1000, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT, search_query)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
            v['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue ID' ]
    
    return nearby_venues

In [113]:
coffee_venues = searchVenues('coffee',
                             names=cluster00['Neighborhood'],
                             latitudes=cluster00['Latitude'],
                             longitudes=cluster00['Longitude'])
print('Total {} venues extracted'.format(coffee_venues.shape[0]))
coffee_venues.head()

Total 156 venues extracted


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue ID
0,Christie,43.669542,-79.422564,Coffee Pocket,43.663949,-79.41696,4cf2a5437e0da1cdf0a69897
1,Christie,43.669542,-79.422564,Canadian Barista & Coffee Academy,43.664251,-79.414429,59eea60b66f3cd46ec6503d8
2,Christie,43.669542,-79.422564,Krave Coffee,43.68074,-79.429417,55a192fb498eba3d2ffb5cc1
3,Christie,43.669542,-79.422564,Hub Coffee Shop,43.666932,-79.43151,528bc5a111d20301cc84ceaa
4,Christie,43.669542,-79.422564,Creeds Coffee Bar,43.6741,-79.410838,55d2781b498e82585d37691e


In [116]:
coffee_venues['Venue ID'].nunique()

130
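The 156 rows contain only 130 distinct venue IDs, so 26 rows are repeats, presumably because the 1000 m search circles of adjacent neighborhoods overlap. One possible way to avoid double counting, sketched here as an assumption and not something the notebook does, is to keep each venue only for the neighborhood whose centre is nearest. `haversine_m` and `assign_to_nearest` are hypothetical helpers in plain Python; rows are tuples in the same column order as `coffee_venues`, and the second sample row is an invented repeat for illustration.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * r * asin(sqrt(a))


def assign_to_nearest(rows):
    """Keep one row per venue ID: the one whose neighborhood centre is closest.

    Each row is (neighborhood, n_lat, n_lng, venue, v_lat, v_lng, venue_id).
    """
    best = {}
    for row in rows:
        _, n_lat, n_lng, _, v_lat, v_lng, venue_id = row
        dist = haversine_m(n_lat, n_lng, v_lat, v_lng)
        if venue_id not in best or dist < best[venue_id][0]:
            best[venue_id] = (dist, row)
    return [row for _, row in best.values()]


rows = [
    # first row is taken from the table above
    ('Christie', 43.669542, -79.422564,
     'Coffee Pocket', 43.663949, -79.41696, '4cf2a5437e0da1cdf0a69897'),
    # hypothetical repeat of the same venue under another neighborhood
    ('University of Toronto, Harbord', 43.662696, -79.400049,
     'Coffee Pocket', 43.663949, -79.41696, '4cf2a5437e0da1cdf0a69897'),
]
deduped = assign_to_nearest(rows)
print(len(deduped), deduped[0][0])
```

Whether nearest-centre assignment is the right rule depends on how the neighborhood boundaries actually run; it is only one reasonable choice.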

#### Count number of coffee shops in each neighborhood

In [119]:
venue_count=coffee_venues['Neighborhood'].value_counts().to_frame().reset_index()
venue_count.columns=['Neighborhood', 'count']
venue_count

Unnamed: 0,Neighborhood,count
0,"Kensington Market, Chinatown, Grange Park",50
1,"University of Toronto, Harbord",35
2,"Little Portugal, Trinity",28
3,Christie,14
4,"Dufferin, Dovercourt Village",11
5,Studio District,10
6,"High Park, The Junction South",5
7,"North Toronto West, Lawrence Park",3


By looking at the number of coffee shops in each neighborhood, we can tell that there is a lot of competition in the Kensington Market, Chinatown, Grange Park area. University of Toronto, Harbord might be a better choice because its coffee shop count is similar to Little Portugal's.  
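The comparison can be made explicit by ranking the other neighborhoods by how far their coffee shop count is from Little Portugal's. This is only a sketch using the counts from the table above; using the absolute difference in counts as the similarity measure is an assumption, and `rank_by_count_gap` is a hypothetical helper.

```python
# coffee shop counts copied from the table above
counts = {
    'Kensington Market, Chinatown, Grange Park': 50,
    'University of Toronto, Harbord': 35,
    'Little Portugal, Trinity': 28,
    'Christie': 14,
    'Dufferin, Dovercourt Village': 11,
    'Studio District': 10,
    'High Park, The Junction South': 5,
    'North Toronto West, Lawrence Park': 3,
}
target = 'Little Portugal, Trinity'


def rank_by_count_gap(counts, target):
    """Sort the other neighborhoods by |count - target count|, closest first."""
    base = counts[target]
    others = [n for n in counts if n != target]
    return sorted(others, key=lambda n: abs(counts[n] - base))


ranked = rank_by_count_gap(counts, target)
print(ranked[0])  # University of Toronto, Harbord
```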

# 7 Conclusion

In this project, I segmented Toronto neighborhoods based on the common venue types in each neighborhood. Of the 39 neighborhoods in Toronto, 33 are typical neighborhoods where you can easily find coffee shops, restaurants and bars. I then further segmented those 33 neighborhoods and found 7 that are more similar to our target neighborhood, Little Portugal. Among these 7 finalists, I compared the number of coffee shops to see what kind of competition my client would be facing. Finally, I picked the University of Toronto & Harbord area: its coffee shop count is similar to Little Portugal's, so it is not the most competitive area but still demonstrates a huge demand. An extra step would be to analyze the details of these coffee shops, including average ratings, price range, number of likes or dislikes, etc., and compare them with my client’s current coffee shop. (Unfortunately, I have reached my quota for premium calls, so those results cannot be displayed!)


## Extra analysis: find ratings of each coffee shop
#### Please note that getting the details of a venue is a premium call, and unfortunately I have reached my quota for premium calls, so results are not displayed below!

In [134]:
final_list=coffee_venues[coffee_venues['Neighborhood']=='University of Toronto, Harbord']

In [122]:
def getRating(venue_ids):
    
    rating_list=[]
    for venue_id in venue_ids:

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
            venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
            
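        try:
            results = requests.get(url).json()["response"]['venue']['rating']
            rating_list.append(results)
        except (KeyError, ValueError, requests.RequestException):
            # no rating in the response, or the request failed: record 0
            rating_list.append(0)

    return rating_list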

In [125]:
venue_id= '4dbae18d4df044e524bbb9de'
url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
results = requests.get(url).json()
results

{'meta': {'code': 429,
  'errorType': 'quota_exceeded',
  'errorDetail': 'Quota exceeded',
  'requestId': '5eecd4e1e826ac002182b210'},
 'response': {}}
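Because `getRating` records 0 for any failure, a quota error like the one above cannot be told apart from a venue that simply has no rating. A minimal sketch of a stricter check follows, assuming a successful response carries `meta.code` 200 and the rating sits under `response.venue.rating` as in `getRating`; `extract_rating` is a hypothetical helper, and the second payload is an invented example of a successful response.

```python
def extract_rating(payload):
    """Return the rating from a venue-details payload, or None if unavailable."""
    meta = payload.get('meta', {})
    if meta.get('code') != 200:
        # e.g. 429 / quota_exceeded: the request failed, so no rating is known
        return None
    return payload.get('response', {}).get('venue', {}).get('rating')


# payload shown above (quota exceeded)
quota_payload = {'meta': {'code': 429,
                          'errorType': 'quota_exceeded',
                          'errorDetail': 'Quota exceeded',
                          'requestId': '5eecd4e1e826ac002182b210'},
                 'response': {}}

# invented successful payload, for illustration only
ok_payload = {'meta': {'code': 200}, 'response': {'venue': {'rating': 8.0}}}

print(extract_rating(quota_payload), extract_rating(ok_payload))
```

Returning `None` instead of 0 keeps "no data" separate from a real rating value, which makes the later filtering step less ambiguous.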

In [135]:
coffee_ratings = getRating(final_list['Venue ID'])
# coffee_ratings

In [133]:
# ratings were fetched for final_list only, so attach them to that frame
final_list = final_list.copy()
final_list['Venue Rating'] = coffee_ratings

In [86]:
# drop those venues without any rating
final_list = final_list[final_list['Venue Rating']!=0] 