# find the best location for Chinese restaurants using location data

# 1. description

The Foursquare location data shows how grocery stores, cafes, restaurants, and other venues are distributed across the neighborhoods of Toronto. The goal of this project is to find the best location (a particular venue in a particular neighborhood) for opening a Chinese restaurant, using the Foursquare location data to compare different neighborhoods.

# 2. methodology

First, get the geodata of Toronto.
Segment the neighborhoods by their popularity and cluster them using k-means clustering (k=5 should be enough for this project).
Find the cluster with the largest number of restaurants, then determine the neighborhood with the highest segmentation score within that cluster.
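
The last step can be sketched as a pair of group-by counts. This is a minimal illustration, assuming a venue table with one row per venue that already carries the `Cluster Labels` and `Venue Category` columns used later in this notebook; the rows below are hypothetical, not Foursquare results.

```python
import pandas as pd

# Hypothetical venue table (illustrative rows only).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C", "C"],
    "Cluster Labels": [0, 0, 0, 0, 1, 1, 1],
    "Venue Category": [
        "Chinese Restaurant", "Cafe", "Grocery Store", "Italian Restaurant",
        "Sushi Restaurant", "Chinese Restaurant", "Thai Restaurant",
    ],
})

# Keep only venues whose category name mentions "Restaurant".
is_restaurant = venues["Venue Category"].str.contains("Restaurant")

# Count restaurants per cluster and pick the cluster with the most.
per_cluster = venues[is_restaurant].groupby("Cluster Labels").size()
best_cluster = per_cluster.idxmax()

# Within that cluster, count restaurants per neighborhood.
in_best = is_restaurant & (venues["Cluster Labels"] == best_cluster)
per_neighborhood = venues[in_best].groupby("Neighborhood").size()

print(best_cluster)
print(per_neighborhood)
```

Matching on the substring "Restaurant" is a simplification; a real category list may need a more careful filter.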


In [65]:
#import libraries
import numpy as np 
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json 
from IPython.display import display_html
from sklearn.cluster import KMeans
import folium 

In [66]:
#get data
s = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(s,'xml')
table = str(soup.table)
display_html(table,raw=True)

Postal Code,Borough,Neighborhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


In [67]:
df_toronto = pd.read_html(table)
df_t = df_toronto[0]
headers = ['Postal Code','Borough','Neighborhood']
df_t.columns = headers
df_t.drop(0,axis=0,inplace=True)
location = pd.read_csv('http://cocl.us/Geospatial_data')
df_n = pd.merge(df_t,location,on='Postal Code')
df_n.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [68]:
#crawl data from foursquare
def foursquare_crawler(postal_code_list, neighborhood_list, lat_list, lng_list, LIMIT=500, radius=1000):
    # CLIENT_ID, CLIENT_SECRET and VERSION are read as globals and must be defined before this is called
    result_ds = []
    counter = 0
    for postal_code, neighborhood, lat, lng in zip(postal_code_list, neighborhood_list, lat_list, lng_list):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION,
            lat, lng, radius, LIMIT)

        # venues recommended around this neighborhood's coordinates
        results = requests.get(url).json()["response"]['groups'][0]['items']
        tmp_dict = {
            'Postal Code': postal_code,
            'Neighborhood(s)': neighborhood,
            'Latitude': lat,
            'Longitude': lng,
            'Crawling_result': results,
        }
        result_ds.append(tmp_dict)
        counter += 1
        print('{}.'.format(counter))
        print('Data obtained successfully for Postal Code {} (Neighborhoods {}).'.format(postal_code, neighborhood))
    return result_ds
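
The crawler indexes `["response"]['groups'][0]['items']` directly, so a failed or empty response raises a `KeyError` or `IndexError` and stops the whole crawl. A defensive variant might look like the sketch below. The helper name `extract_items` is hypothetical, and the payload shape is assumed only from the indexing used in the crawler above.

```python
def extract_items(payload):
    """Return the venue items from an explore-style payload, or [] if absent."""
    # Assumed shape: {"response": {"groups": [{"items": [...]}]}}
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])


# Hand-written payloads for illustration; not real API output.
ok_payload = {"response": {"groups": [{"items": [{"venue": {"name": "Example Cafe"}}]}]}}
empty_payload = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok_payload)))
print(extract_items(empty_payload))
```

Returning an empty list lets the loop record a neighborhood with no venues and move on to the next postal code.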

In [69]:
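CLIENT_ID = 'KKYL2MYXQ1AX0A2PKKYAUN0CPVR2ETX0DY5DXJO51LF03P1E'
CLIENT_SECRET = 'KB5YCPMOS55OJNBRDMODTUODHJJUPY5FVI5SZRNC1M3HLREC'
# VERSION is used by foursquare_crawler but was never defined; this date (YYYYMMDD) is an assumed value
VERSION = '20180605'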

In [70]:
def get_venue_dataset(foursquare_dataset):
    columns = ['Postal Code', 'Neighborhood',
               'Neighborhood Latitude', 'Neighborhood Longitude',
               'Venue', 'Venue Summary', 'Venue Category', 'Distance']
    rows = []

    for neigh_dict in foursquare_dataset:
        postal_code = neigh_dict['Postal Code']
        neigh = neigh_dict['Neighborhood(s)']
        lat = neigh_dict['Latitude']
        lng = neigh_dict['Longitude']
        print('Number of venues for Postal Code "{}" and Neighborhood(s) "{}" is:'.format(postal_code, neigh))
        print(len(neigh_dict['Crawling_result']))

        for venue_dict in neigh_dict['Crawling_result']:
            summary = venue_dict['reasons']['items'][0]['summary']
            name = venue_dict['venue']['name']
            dist = venue_dict['venue']['location']['distance']
            cat = venue_dict['venue']['categories'][0]['name']

            # collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append({'Postal Code': postal_code, 'Neighborhood': neigh,
                         'Neighborhood Latitude': lat, 'Neighborhood Longitude': lng,
                         'Venue': name, 'Venue Summary': summary,
                         'Venue Category': cat, 'Distance': dist})

    return pd.DataFrame(rows, columns=columns)

In [71]:
map_dt = folium.Map(location=[43.654260,-79.360636],zoom_start=14)
map_dt

In [72]:
# segmentation scores for the 10 districts of Toronto
def createBins(x,N):
    cut = [i/N for i in range(1,N)]
    df = pd.DataFrame(x)
    cut_points = df.quantile(cut).T.values
    cut_points = np.unique(cut_points)
    cut_points = np.insert(cut_points,0,values=float('-inf'))
    cut_points = np.append(cut_points,values=float('inf'))
    labels = ['{0}'.format(i) for i in range(1,len(cut_points))]
    bins = pd.cut(x,cut_points,labels=labels)
    bins = pd.DataFrame(bins).astype(int)
    return bins
seg = {'Downtown Toronto':['10'],'North York':['9'],'East York':['8'],'Central Toronto':['7'],'Scarborough':['6'],'York':['5'],'East Toronto':['4'],'Etobicoke':['3'],'West Toronto':['2'],'Mississauga':['1']}
seg_df = pd.DataFrame(seg)
seg_df

Unnamed: 0,Downtown Toronto,North York,East York,Central Toronto,Scarborough,York,East Toronto,Etobicoke,West Toronto,Mississauga
0,10,9,8,7,6,5,4,3,2,1
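
`createBins` is defined above but never called, and the segmentation scores are entered by hand. As a rough illustration of what quantile binning produces, the sketch below uses `pd.qcut`, which performs a similar quantile split. The venue counts and neighborhood names are hypothetical.

```python
import pandas as pd

# Hypothetical venue counts per neighborhood (illustrative values only).
venue_counts = pd.Series(
    [4, 12, 35, 7, 50, 21, 9, 16],
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
    name="venues",
)

# Split into N quantile bins. labels=False returns integer codes 0..N-1;
# duplicates="drop" merges identical cut points, much as np.unique does
# in createBins.
N = 4
codes = pd.qcut(venue_counts, q=N, labels=False, duplicates="drop")

# Shift to 1..N so that a higher score means a busier neighborhood.
popularity = codes + 1
print(popularity)
```

A score derived this way could replace the hand-entered values in `seg`, provided real venue counts are available for each district.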


In [73]:
#clustering
k=5
# only Latitude and Longitude remain as clustering features
cluster = df_n.drop(['Postal Code','Borough','Neighborhood'],axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(cluster)
kmeans.labels_
df_n.insert(0, 'Cluster Labels', kmeans.labels_)
df_n

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,1,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,4,M3B,North York,Don Mills,43.745906,-79.352188
8,4,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
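
The choice of k=5 is stated without a check. One common way to sanity-check it is to compare the k-means inertia (within-cluster sum of squared distances) across several values of k and look for the point where the improvement levels off. The sketch below uses synthetic coordinates, not the Toronto data, and is only meant to show the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) pairs forming three tight groups;
# illustrative values, not the Toronto data.
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],
    [43.75, -79.33], [43.76, -79.34], [43.74, -79.32],
    [43.70, -79.50], [43.71, -79.51], [43.69, -79.49],
])

# Fit k-means for several k and record the inertia of each fit.
inertia_by_k = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertia_by_k[k] = model.inertia_

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 5))
```

Running the same loop on the `cluster` frame from the cell above would show whether the inertia still drops noticeably beyond k=5.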


In [74]:
# find the best neighborhood
df_b = df_n[df_n['Borough'] == 'Downtown Toronto']
df_b

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
20,2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,2,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,2,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
36,2,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
42,2,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576


In [75]:
df_ny = df_n[df_n['Borough'] == 'North York']
df_ny

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
3,3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
7,4,M3B,North York,Don Mills,43.745906,-79.352188
10,3,M6B,North York,Glencairn,43.709577,-79.445073
13,4,M3C,North York,Don Mills,43.7259,-79.340923
27,3,M2H,North York,Hillcrest Village,43.803762,-79.363452
28,3,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259
33,3,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
34,3,M3J,North York,"Northwood Park, York University",43.76798,-79.487262


In [76]:
df_best = df_ny[df_ny['Cluster Labels']==4]
df_best

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
7,4,M3B,North York,Don Mills,43.745906,-79.352188
13,4,M3C,North York,Don Mills,43.7259,-79.340923


# 3. conclusion

Based on the analysis and comparison, the best neighborhood for opening a Chinese restaurant is Don Mills.