Background:
Our client is a startup craft beer brewer that would like to set up a distribution network for its craft beer in one area of Toronto. Because the supply of craft beer is limited, the client wants to find out where to set up its sales network to maximize profit.

Statement of Problem:
-Supply and the "Best Tasting Period" are limited, and the target selling price of the client's beers is 60% higher than that of branded beers, e.g. Heineken, Budweiser.
-The beers have special flavours such as herbs, sours, salty lemon, etc. (Asian flavours).

Based on the above, our client would like to find an area with many bars/pubs/restaurants (especially Asian restaurants, as our client thinks that exposure to Asian culture would be an advantage) where people are willing and able to spend money on the client's beers.

Audience: Our client (a startup craft beer brewer)

________________________________________________________________________________________
Data:
Analytic approach: we will explore the areas around Toronto and analyze the distribution of bars/pubs/restaurants and people's spending patterns (i.e. how frequently people go to bars/pubs/restaurants).

Data Source: the Foursquare developer API will provide all the location data we need to investigate this question.

Data required: Initially we will create a bar/pub/restaurant density measure for each area, as well as the ratio of Asian restaurants. Then we will analyze people's spending patterns (i.e. how frequently people go to bars/pubs/restaurants).
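
The two measures described above could be computed along the following lines. This is only a minimal sketch under stated assumptions: "density" is taken to mean the number of matching venues returned per neighbourhood, the keyword list mirrors the category filter used later in this notebook, and `neighbourhood_measures`, the sample rows and the category IDs are hypothetical placeholders, not real Foursquare values.

```python
from collections import defaultdict

# Keywords mirroring the category filter applied later in the notebook.
DRINK_FOOD_KEYWORDS = ("Restaurant", "Diner", "Bar", "Beer", "Pub", "Brewery", "Lounge")

def neighbourhood_measures(venues, asian_category_ids):
    """venues: iterable of (neighbourhood, category_name, category_id) tuples.

    Returns {neighbourhood: (venue_count, asian_ratio)} for matching venues only.
    """
    counts = defaultdict(int)
    asian = defaultdict(int)
    for neighbourhood, category_name, category_id in venues:
        # skip venues that are not bars, pubs or restaurants
        if not any(keyword in category_name for keyword in DRINK_FOOD_KEYWORDS):
            continue
        counts[neighbourhood] += 1
        if category_id in asian_category_ids:
            asian[neighbourhood] += 1
    return {n: (counts[n], asian[n] / counts[n]) for n in counts}

# Hypothetical sample data; the IDs are placeholders.
sample = [
    ("A", "Sushi Restaurant", "id_sushi"),
    ("A", "Pub", "id_pub"),
    ("A", "Park", "id_park"),
    ("B", "Thai Restaurant", "id_thai"),
]
measures = neighbourhood_measures(sample, {"id_sushi", "id_thai"})
print(measures)
```

Because every neighbourhood is queried with the same search radius, the raw count can serve as a rough density proxy; dividing by area would be needed if the radii differed.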


In [2]:
#import libraries and packages 
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import requests
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pickle
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

In [3]:
#Data Preparation - extract the relevant data for the city of Toronto, Ontario (i.e. location, neighbourhood, venue type, common visiting places, etc.)

# get the List of postal codes of Canada from the URL
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url).text
soup = BeautifulSoup(r,"lxml").find("table",class_="wikitable sortable")

table1=[]

for tr in soup.find_all("tr"):
    # collect the text of all <td> cells in each <tr> (the header row has no <td>, so it yields an empty row)
    row = [td.text.replace('\n','') for td in tr.find_all("td")]
    # put the row into the table
    table1.append(row)

# convert to dataframe
df=pd.DataFrame(table1)
df.columns = ["Postcode", "Borough", "Neighbourhood"]
df = df.drop([0], axis=0)
df.index = pd.RangeIndex(len(df.index))

#remove "Not assigned" records in Borough (copy so that later assignments modify df1 itself)
df1 = df[df.Borough != 'Not assigned'].copy()

# copy value from Borough to Neighbourhood (for those Neighbourhood = Not assigned) 
not_assigned = df1['Neighbourhood'] == 'Not assigned'
df1.loc[not_assigned, 'Neighbourhood'] = df1.loc[not_assigned, 'Borough']
    
# Generate the list of Toronto neighbourhoods, one row per postcode and borough
df2 = df1.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join)
df2 = pd.DataFrame(df2)
df2=df2.reset_index()

#download Latitude and Longitude data 
data = pd.read_csv('http://cocl.us/Geospatial_data')

# add Latitude and Longitude to df2 by matching Postcode against 'Postal Code'
# (a merge avoids relying on the row indexes of the two dataframes lining up)
df2 = df2.merge(data[['Postal Code', 'Latitude', 'Longitude']],
                left_on='Postcode', right_on='Postal Code', how='left').drop(columns='Postal Code')


In [4]:
#Get the geographical coordinates of Toronto
address = 'Toronto'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# prepare dataframe for nearby venues analysis
df3=df2.reset_index()

#ready to connect to FourSquare
CLIENT_ID = 'PYJBJMLRS1IGVZAYUO5FS0YSPJ4X2FQ0AC1LAXNWDUUPAXXD' # Foursquare ID
CLIENT_SECRET = 'JYGPSCVR3YAB425PFX141KZOZN31IH1POGC2FZRVOTYME210' # Foursquare Secret
VERSION = '20190628' # Foursquare API version (a date in YYYYMMDD format)

# define function to get nearby venues

def getNearbyVenues(names, latitudes, longitudes, radius=1500, LIMIT=150):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):           
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['venue']['id'],
            v['venue']['categories'][0]['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue ID',
                  'Category ID' ]
    
    return(nearby_venues)



In [6]:

#generate the List of nearby Venues and its Category around Toronto
toronto_venues = getNearbyVenues(names=df3['Neighbourhood'],
                                   latitudes=df3['Latitude'],
                                   longitudes=df3['Longitude']
                               )

toronto_venues["Likes count"] = np.nan
toronto_venues["Rating"] = np.nan
toronto_venues["Price"] = np.nan



toronto_venues.head(4)
toronto_venues.shape

(6782, 12)

In [7]:
# keep only venues whose category contains "Restaurant", "Diner", "Bar", "Beer", "Pub", "Brewery" or "Lounge"
toronto_venues = toronto_venues[toronto_venues['Venue Category'].str.contains("Restaurant|Diner|Bar|Beer|Pub|Brewery|Lounge")]
toronto_venues=toronto_venues.reset_index()

toronto_venues.head(4)
toronto_venues.shape

(2161, 13)

In [19]:
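# fetch the details of each venue; stop as soon as the API reports an error (e.g. quota exceeded)
venue_details = {}
for venue_id in toronto_venues["Venue ID"]:
    url = 'https://api.foursquare.com/v2/venues/' + venue_id + '?&client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
            
    # make the GET request
    results = requests.get(url).json()
    if results['meta']['code'] != 200:
        break
    venue_details[venue_id] = results['response']
results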


{'meta': {'code': 429,
  'errorType': 'quota_exceeded',
  'errorDetail': 'Quota exceeded',
  'requestId': '5d239fdaf19f44002531a0fc'},
 'response': {}}
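
The 429 response above means the loop cannot finish in one run. One way to cope, sketched here under assumptions, is to stop at the first non-200 code and keep both the details already collected and the IDs still to fetch, so the work can resume once the quota resets. `fetch_venue_details` and `make_fake_fetch` are hypothetical helpers; the fake fetcher stands in for the real HTTP call and only imitates the `meta`/`response` shape shown in the output.

```python
def fetch_venue_details(venue_ids, fetch):
    """Call fetch(venue_id) for each ID until the API reports a non-200 code.

    fetch is any callable returning the decoded JSON reply, i.e. a dict with
    'meta' and 'response' keys. Returns (details, remaining_ids).
    """
    venue_ids = list(venue_ids)
    details = {}
    for position, venue_id in enumerate(venue_ids):
        result = fetch(venue_id)
        if result["meta"]["code"] != 200:
            # e.g. 429 quota_exceeded: keep what we have and report what is left
            return details, venue_ids[position:]
        details[venue_id] = result["response"]
    return details, []

def make_fake_fetch(quota):
    """Build a stand-in fetcher that succeeds `quota` times, then returns 429."""
    calls = {"n": 0}
    def fake_fetch(venue_id):
        calls["n"] += 1
        if calls["n"] > quota:
            return {"meta": {"code": 429, "errorType": "quota_exceeded"}, "response": {}}
        return {"meta": {"code": 200}, "response": {"venue": {"id": venue_id}}}
    return fake_fetch

details, remaining = fetch_venue_details(["a", "b", "c", "d"], make_fake_fetch(quota=2))
print(sorted(details), remaining)
```

In the notebook, `fetch` would wrap `requests.get(url).json()`, and `remaining` could be saved to disk and retried later.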

In [6]:
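venues_id = '55a59bace4b013909087cb24'
# NOTE: the literal placeholder VENUE_ID is left in the URL path and the ID is passed as a
# venue_id query parameter instead; this is what produces the param_error shown below
url = 'https://api.foursquare.com/v2/venues/VENUE_ID?&client_id={}&client_secret={}&v={}&venue_id={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,
            venues_id)
            
# make the GET request
results = requests.get(url).json()#["response"]['groups'][0]['items']
results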




{'meta': {'code': 400,
  'errorType': 'param_error',
  'errorDetail': 'Value VENUE_ID is invalid for venue id',
  'requestId': '5d2140def19f4400258b1322'},
 'response': {}}
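
The `param_error` above indicates that the ID has to replace the `VENUE_ID` placeholder in the URL path. A minimal sketch of building such a URL follows; `venue_details_url` is a hypothetical helper, and the ID and credentials in the example are placeholders rather than real values.

```python
from urllib.parse import urlencode

def venue_details_url(venue_id, client_id, client_secret, version):
    """Build a venue-details URL with the venue ID placed in the path."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
    })
    return "https://api.foursquare.com/v2/venues/{}?{}".format(venue_id, query)

# Placeholder values for illustration only.
example_url = venue_details_url("abc123", "MY_ID", "MY_SECRET", "20190624")
print(example_url)
```

Using `urlencode` also escapes any characters in the parameters that are not URL-safe.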

In [1]:
# Category IDs corresponding to Asian restaurants and bars were taken from the Foursquare web site

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
asian_restaurant_categories= ['4bf58dd8d48988d142941735','56aa371be4b08b9a8d573568','52e81612bcbc57f1066b7a03',
                              '4bf58dd8d48988d145941735','52af3a5e3cf9994f4e043bea','52af3a723cf9994f4e043bec',
                              '52af3a7c3cf9994f4e043bed','58daa1558bbb0b01f18ec1d3','52af3a673cf9994f4e043beb',
                              '52af3a903cf9994f4e043bee','4bf58dd8d48988d1f5931735','52af3a9f3cf9994f4e043bef',
                              '52af3aaa3cf9994f4e043bf0','52af3ab53cf9994f4e043bf1','52af3abe3cf9994f4e043bf2',
                              '52af3ac83cf9994f4e043bf3','52af3ad23cf9994f4e043bf4','52af3add3cf9994f4e043bf5',
                              '52af3af23cf9994f4e043bf7','52af3ae63cf9994f4e043bf6','52af3afc3cf9994f4e043bf8',
                              '52af3b053cf9994f4e043bf9','52af3b213cf9994f4e043bfa','52af3b293cf9994f4e043bfb',
                              '52af3b343cf9994f4e043bfc','52af3b3b3cf9994f4e043bfd','52af3b463cf9994f4e043bfe',
                              '52af3b633cf9994f4e043c01','52af3b513cf9994f4e043bff','52af3b593cf9994f4e043c00',
                              '52af3b6e3cf9994f4e043c02','52af3b773cf9994f4e043c03','52af3b813cf9994f4e043c04',
                              '52af3b893cf9994f4e043c05','52af3b913cf9994f4e043c06','52af3b9a3cf9994f4e043c07',
                              '52af3ba23cf9994f4e043c08','4bf58dd8d48988d111941735','55a59bace4b013909087cb0c',
                              '55a59bace4b013909087cb30','55a59bace4b013909087cb21','55a59bace4b013909087cb06',
                              '55a59bace4b013909087cb1b','55a59bace4b013909087cb1e','55a59bace4b013909087cb18',
                              '55a59bace4b013909087cb24','55a59bace4b013909087cb15','55a59bace4b013909087cb27',
                              '55a59bace4b013909087cb12','4bf58dd8d48988d1d2941735','55a59bace4b013909087cb2d',
                              '55a59a31e4b013909087cb00','55a59af1e4b013909087cb03','55a59bace4b013909087cb2a',
                              '55a59bace4b013909087cb0f','55a59bace4b013909087cb33','55a59bace4b013909087cb09',
                              '55a59bace4b013909087cb36','4bf58dd8d48988d149941735','56aa371be4b08b9a8d573502']


    
    
def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    # drop the trailing country name (the venues are in Toronto)
    address = address.replace(', Canada', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20190624'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except Exception:
        # any request or parsing failure yields an empty result for this location
        venues = []
    return venues

In [None]:
# Let's now go over our neighbourhood locations and get nearby restaurants; we'll also maintain a dictionary of all found restaurants and all found Asian restaurants

def get_restaurants(lats, lons):
    restaurants = {}
    asian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_asian = is_restaurant(venue_categories, specific_filter=asian_restaurant_categories)
            if is_res:
                # projected x, y coordinates are omitted because no lonlat_to_xy helper is defined in this notebook
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_asian)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_asian:
                    asian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, asian_restaurants, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
asian_restaurants = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('asian_restaurants_350.pkl', 'rb') as f:
        asian_restaurants = pickle.load(f)
    with open('location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except (OSError, pickle.UnpicklingError, EOFError):
    # no usable cache on disk; fall through to the API calls below
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, asian_restaurants, location_restaurants = get_restaurants(df3['Latitude'], df3['Longitude'])
    
    # Let's persist this in the local file system
    with open('restaurants_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('asian_restaurants_350.pkl', 'wb') as f:
        pickle.dump(asian_restaurants, f)
    with open('location_restaurants_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
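
The load-or-fetch pattern in this cell can be factored into one small helper, which may be handy given the API quota. The following is only a sketch: `load_or_compute`, `expensive` and the file name are hypothetical, and `expensive` stands in for the Foursquare calls.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled value at path, or compute it, save it, and return it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        # no usable cache: compute the value once and store it for next time
        value = compute()
        with open(path, "wb") as f:
            pickle.dump(value, f)
        return value

calls = []

def expensive():
    """Stand-in for the API requests; records how often it is called."""
    calls.append(1)
    return {"venue_1": ("Some Pub", 43.65, -79.38)}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "restaurants_demo.pkl")
    first = load_or_compute(cache_path, expensive)   # computes and writes the cache
    second = load_or_compute(cache_path, expensive)  # reads the cache, no recomputation

print(first == second, len(calls))
```

Each cached object gets its own file here, so a missing or corrupt file only triggers a refetch of that one object.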
        
        


In [None]:
# one hot encoding of the venue categories
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
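
Taking the mean of one-hot columns per neighbourhood gives, for each category, the share of that neighbourhood's venues falling in it. The sketch below reproduces the same quantity with the standard library on made-up rows, purely to illustrate what the grouped table contains; `category_frequencies` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def category_frequencies(rows):
    """rows: iterable of (neighbourhood, category) pairs.

    Returns {neighbourhood: {category: share of that neighbourhood's venues}}.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in rows:
        counts[neighbourhood][category] += 1
    return {
        neighbourhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighbourhood, counter in counts.items()
    }

# Made-up sample: neighbourhood A has four venues, B has one.
rows = [
    ("A", "Pub"), ("A", "Pub"), ("A", "Sushi Restaurant"), ("A", "Bar"),
    ("B", "Bar"),
]
freq = category_frequencies(rows)
print(freq)
```

Unlike this dictionary, the pandas table also holds an explicit 0 for every category that does not occur in a neighbourhood, which is what makes the rows comparable for clustering.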

In [None]:
nightlife_category = '4d4b7105d754a06376d81259'
bar_categories= ['4bf58dd8d48988d116941735','52e81612bcbc57f1066b7a0d','56aa371ce4b08b9a8d57356c',
                 '4bf58dd8d48988d117941735','52e81612bcbc57f1066b7a0e','4bf58dd8d48988d11e941735',
                 '4bf58dd8d48988d118941735','4bf58dd8d48988d1d8941735','4bf58dd8d48988d119941735',
                 '4bf58dd8d48988d1d5941735','4bf58dd8d48988d120941735','4bf58dd8d48988d11b941735',
                 '4bf58dd8d48988d11c941735','4bf58dd8d48988d1d4941735','4bf58dd8d48988d11d941735',
                 '56aa371be4b08b9a8d57354d','4bf58dd8d48988d122941735','4bf58dd8d48988d123941735',
                 '50327c8591d4c4b30a586d5d','4bf58dd8d48988d121941735']

In [None]:
# define function to find out most common venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Look for the top 5 venues
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

# Generate table for Neighbourhood and its top 5 common venues
for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)


In [None]:
### set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df3

# join the cluster labels and top venues onto df3, which holds the latitude/longitude for each neighbourhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged = toronto_merged.dropna()
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters