# Description of the problem
---
### Problem Statement

If someone is looking to open an Asian restaurant in Toronto, Canada, where would you recommend that they open it?

### Problem Description/Objective
The objective of this project is to find the most suitable location(s) to open a new Asian restaurant in Toronto, Canada. 
More specifically, this project tries to find the locations where Asian food is highly popular. Many other factors might matter when opening a restaurant, but this project considers only the number of Asian restaurants already present in an area when recommending a location. 

### Target Audience
Any person or organization that wants to open an Asian restaurant in Toronto, Canada.

  


# Description of the data
---

To solve this problem, I need location data for the existing Asian restaurants. Because this data is not directly available, I join the three data sets below.

     1. Neighborhood-Postcode data from Wikipedia.
     2. Postcode-coordinates (latitude and longitude) from the Geospatial data file.
     3. Venue-locations data from the Foursquare API.

<div class="alert alert-block alert-success">
 The variables of interest in the Neighborhood-Postcode data are:

    1.	Postcode – postal code of an area
    2.	Borough - a town or district which is an administrative unit.
    3.	Neighbourhood – a small area
    
</div>

<div class="alert alert-block alert-success">
The variables of interest in the Postcode-coordinates data are:

    1.	Postal Code – postal code of an area
    2.	Latitude 
    3.	Longitude
    
</div>

<div class="alert alert-block alert-success">
The variables of interest in the Venue-locations data are:

    1.	Neighborhood
    2.	Venue
    3.	Venue Latitude 
    4.	Venue Longitude 
    5.	Venue Category
    
</div>

In [1]:
#Import the required libraries 
import requests
import pandas as pd
import numpy as np
import folium
from sklearn.cluster import KMeans
from dotenv import load_dotenv
load_dotenv()
import os
import matplotlib.cm as cm
import matplotlib.colors as colors

In [2]:
#get the postal code data
postal_code_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
postal_code_df = pd.read_html(postal_code_url, header=0, na_values = ['Not assigned'])[0]
postal_code_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
#Drop the records with an unspecified Borough; where the Neighbourhood is unspecified, use the Borough name
postal_code_df = postal_code_df.dropna(subset=['Borough']).copy()
postal_code_df['Neighbourhood'] = postal_code_df['Neighbourhood'].fillna(postal_code_df['Borough'])

In [4]:
#Group all the neighborhoods that share a postal code into one comma-separated string.
neighborhood_df = postal_code_df.groupby(['Postcode','Borough'], as_index=False)['Neighbourhood'].agg(', '.join)
neighborhood_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
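
As a small illustration of the grouping step above, the sketch below applies the same `groupby` and `', '.join` pattern to a hand-typed frame. The three rows are typed in for the example (they mirror the first rows of the output above) and the names `toy_df` and `grouped` are hypothetical.

```python
import pandas as pd

# Hand-typed rows for illustration: one row per postal code / neighbourhood pair.
toy_df = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Collapse to one row per postal code, joining the neighbourhood names with ", ".
grouped = (
    toy_df.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)
print(grouped)
```

Each postal code ends up on a single row, which is what makes the later merge with the coordinates a one-to-one join.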


In [5]:
#get the location data
geo_url = 'http://cocl.us/Geospatial_data'
geo_coordinates_df = pd.read_csv(geo_url)
geo_coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [6]:
#To join the neighborhood data with the geo coordinates, standardize the postal code column name.
geo_coordinates_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
neighborhood_df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)

In [7]:
#merge the neighborhood data with geo coordinates
neighbohoods_coordinates_df = pd.merge(neighborhood_df, geo_coordinates_df, on='PostalCode')
neighbohoods_coordinates_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
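
The merge above uses the default inner join, so a postal code without coordinates would be dropped silently. One way to check for that, sketched here on made-up frames, is a left join with `indicator=True`. The postal code `M9Z` and the frame names are hypothetical and exist only to show an unmatched row.

```python
import pandas as pd

# Toy frames (hypothetical values): "M9Z" has no coordinates on purpose.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union", "Example"],
})
coordinates = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every neighbourhood row; `indicator=True` adds a `_merge`
# column saying whether a match was found; `validate` raises on duplicate keys.
checked = pd.merge(neighborhoods, coordinates, on="PostalCode",
                   how="left", indicator=True, validate="one_to_one")

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join loses no rows.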


In [8]:
#filter the boroughs belonging to Toronto
toronto_postal_codes_df = neighbohoods_coordinates_df[
    neighbohoods_coordinates_df['Borough'].str.contains('Toronto')
].reset_index(drop=True)
toronto_postal_codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [9]:
#approximate the coordinates of Toronto as the mean of the postal code coordinates
toronto_latitude = toronto_postal_codes_df['Latitude'].mean()
toronto_longitude = toronto_postal_codes_df['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(toronto_latitude, toronto_longitude))

The geographical coordinates of Toronto are 43.66713498717948, -79.38987324871795


In [10]:
#List the different boroughs of toronto
print(toronto_postal_codes_df.groupby('Borough').count()['Neighbourhood'])

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
West Toronto         6
Name: Neighbourhood, dtype: int64


In [11]:
#locate the different Boroughs in toronto
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_postal_codes_df['Latitude'], 
                                           toronto_postal_codes_df['Longitude'],
                                           toronto_postal_codes_df['Borough'], 
                                           toronto_postal_codes_df['Neighbourhood']):
    label_text = borough + ' - ' + neighborhood
    label = folium.Popup(label_text)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill_color='blue',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

In [12]:
#get the Variables required for Foursquare API call 
CLIENT_ID = os.getenv("client_id")
CLIENT_SECRET = os.getenv("client_secret") 
VERSION = '20180604' 
LIMIT = 100  
radius = 500 
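
`os.getenv` returns `None` when a variable is missing, so a misspelled or unset credential would only surface later as a failed API call. A minimal sketch of an early check follows. The helper name `require_env` and the variable `EXAMPLE_CLIENT_ID` are hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Example with a throwaway variable name (hypothetical, for illustration only).
os.environ["EXAMPLE_CLIENT_ID"] = "abc123"
print(require_env("EXAMPLE_CLIENT_ID"))
```

In the cell above, the same helper could wrap the `client_id` and `client_secret` lookups.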

In [13]:
#function to get the venues.
def getVenuesByCoordinates(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
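        # make the GET request and stop on an HTTP error before parsing the JSON
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        results = response.json()["response"]['groups'][0]['items']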
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one data frame with named columns
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])
    
    return nearby_venues
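
The function above assumes that every response contains at least one group and that every venue has at least one category. A more defensive way to read the same nesting is sketched below. The helper name `extract_venues` and the sample payload are made up for illustration, and the payload shape is only assumed from the keys the function already uses.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Assumes the nesting used in the function above:
    payload["response"]["groups"][0]["items"][i]["venue"].
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((venue.get("name"), location.get("lat"), location.get("lng"), category))
    return rows

# Made-up payload for illustration; not real API output.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Thai Restaurant"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}
print(extract_venues(sample))
```

Venues without a category come back with `None`, so they can be filtered out or counted separately before the category-based steps.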

In [14]:
#Get venues for all neighborhoods in our dataset
toronto_venues = getVenuesByCoordinates(names=toronto_postal_codes_df['Neighbourhood'],
                                latitudes=toronto_postal_codes_df['Latitude'],
                                longitudes=toronto_postal_codes_df['Longitude'])

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction Sout

In [15]:
#get the list of the different restaurant categories
toronto_venues[toronto_venues['Venue Category'].str.contains("Restaurant")]['Venue Category'].unique()

array(['Greek Restaurant', 'Italian Restaurant', 'Restaurant',
       'Caribbean Restaurant', 'American Restaurant', 'Sushi Restaurant',
       'Fast Food Restaurant', 'Seafood Restaurant',
       'Comfort Food Restaurant', 'Middle Eastern Restaurant',
       'Thai Restaurant', 'Latin American Restaurant',
       'Chinese Restaurant', 'Mexican Restaurant', 'Indian Restaurant',
       'Vietnamese Restaurant', 'Japanese Restaurant',
       'Taiwanese Restaurant', 'Theme Restaurant', 'Ramen Restaurant',
       'Ethiopian Restaurant', 'Mediterranean Restaurant',
       'Afghan Restaurant', 'French Restaurant',
       'Modern European Restaurant', 'New American Restaurant',
       'Vegetarian / Vegan Restaurant', 'German Restaurant',
       'Asian Restaurant', 'Eastern European Restaurant',
       'Portuguese Restaurant', 'Falafel Restaurant', 'Korean Restaurant',
       'Colombian Restaurant', 'Brazilian Restaurant',
       'Gluten-free Restaurant', 'Belgian Restaurant',
       'Dumpling R

In [16]:
#relabel the different Asian restaurant categories with one name - 'Asian Restaurant'
asian_categories = ['Sushi Restaurant', 'Thai Restaurant', 'Chinese Restaurant', 'Indian Restaurant',
                    'Japanese Restaurant', 'Taiwanese Restaurant', 'Afghan Restaurant', 'Filipino Restaurant']
toronto_venues.loc[toronto_venues['Venue Category'].isin(asian_categories), 'Venue Category'] = "Asian Restaurant"
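
The relabelling and the per-neighborhood count that follows can be pictured with plain Python. The sketch below uses the category names from the cell above, while the `(neighborhood, category)` pairs and the area names are made up for illustration.

```python
from collections import Counter

# Categories relabelled in the cell above, plus the generic label itself.
ASIAN_CATEGORIES = {
    "Asian Restaurant", "Sushi Restaurant", "Thai Restaurant", "Chinese Restaurant",
    "Indian Restaurant", "Japanese Restaurant", "Taiwanese Restaurant",
    "Afghan Restaurant", "Filipino Restaurant",
}

# Made-up (neighborhood, category) pairs for illustration.
venues = [
    ("Area A", "Thai Restaurant"),
    ("Area A", "Coffee Shop"),
    ("Area A", "Sushi Restaurant"),
    ("Area B", "Indian Restaurant"),
    ("Area B", "Greek Restaurant"),
]

# Count only the venues whose category is in the Asian set, per neighborhood.
asian_counts = Counter(area for area, category in venues if category in ASIAN_CATEGORIES)
print(asian_counts.most_common())
```

The category output above also lists 'Vietnamese Restaurant', 'Ramen Restaurant' and 'Korean Restaurant', which the relabelling cell does not include. Adding them to the set would change the counts used for clustering.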

In [17]:
#get only the asian restaurant venues and prepare the data for clustering
toronto_asian = toronto_venues[toronto_venues['Venue Category'] == "Asian Restaurant"]
t = toronto_asian[['Neighborhood','Venue']]
cluster_df = t.groupby(["Neighborhood"],as_index=False).count()
cluster_df

Unnamed: 0,Neighborhood,Venue
0,"Adelaide, King, Richmond",10
1,Berczy Park,2
2,"Cabbagetown, St. James Town",5
3,Central Bay Street,8
4,"Chinatown, Grange Park, Kensington Market",8
5,Church and Wellesley,12
6,"Commerce Court, Victoria Hotel",5
7,Davisville,4
8,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",1
9,"Design Exchange, Toronto Dominion Centre",5


In [18]:
# set 3 clusters and run the KMeans clustering on the venue counts
clusters = 3
clustering = cluster_df.drop(columns=["Neighborhood"])
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(clustering)
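
Because the only feature is the venue count, the clustering is one-dimensional. The sketch below shows the assign-and-update loop behind k-means on a list of numbers. It is an illustration only, not scikit-learn's implementation (which chooses its starting centers differently), and the counts and starting centers are made up.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain Lloyd iterations on a list of numbers; returns (labels, centers)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Made-up restaurant counts and starting centers, for illustration only.
counts = [1, 2, 1, 5, 4, 5, 10, 12, 11]
labels, centers = kmeans_1d(counts, centers=[1, 5, 12])
print(labels)
print(centers)
```

With a single feature, the clusters are simply contiguous ranges of counts, which is why the results below read as low, medium and high groups.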

In [19]:
#create a data frame with clusterlabel, neighborhood and venue count
merged = cluster_df.copy()
merged["Cluster Labels"] = kmeans.labels_
merged_df = merged[['Neighborhood','Venue','Cluster Labels']]

In [20]:
#get the count per cluster
merged_df.groupby(["Cluster Labels"],as_index=False).count()

Unnamed: 0,Cluster Labels,Neighborhood,Venue
0,0,10,10
1,1,5,5
2,2,10,10


In [21]:
#look at the cluster 0 values
merged_df[merged_df['Cluster Labels']==0]

Unnamed: 0,Neighborhood,Venue,Cluster Labels
1,Berczy Park,2,0
8,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",1,0
11,"Forest Hill North, Forest Hill West",1,0
14,"High Park, The Junction South",2,0
16,North Toronto West,1,0
17,Queen's Park,2,0
18,"Runnymede, Swansea",2,0
22,Studio District,1,0
23,"The Annex, North Midtown, Yorkville",1,0
24,"The Beaches West, India Bazaar",1,0


In [22]:
#look at the cluster 1 values
merged_df[merged_df['Cluster Labels']==1]

Unnamed: 0,Neighborhood,Venue,Cluster Labels
0,"Adelaide, King, Richmond",10,1
3,Central Bay Street,8,1
4,"Chinatown, Grange Park, Kensington Market",8,1
5,Church and Wellesley,12,1
10,"First Canadian Place, Underground city",8,1


In [23]:
#look at the cluster 2 values
merged_df[merged_df['Cluster Labels']==2]

Unnamed: 0,Neighborhood,Venue,Cluster Labels
2,"Cabbagetown, St. James Town",5,2
6,"Commerce Court, Victoria Hotel",5,2
7,Davisville,4,2
9,"Design Exchange, Toronto Dominion Centre",5,2
12,"Harbord, University of Toronto",4,2
13,"Harbourfront East, Toronto Islands, Union Station",4,2
15,"Little Portugal, Trinity",4,2
19,"Ryerson, Garden District",6,2
20,St. James Town,4,2
21,Stn A PO Boxes 25 The Esplanade,4,2
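
K-means label numbers are arbitrary, so it can help to pick the high-count cluster from summary statistics rather than from the label itself. The sketch below does that with the venue counts copied from the three cluster tables above. The variable names are hypothetical.

```python
from statistics import mean

# Venue counts per cluster, copied from the three tables above.
cluster_counts = {
    0: [2, 1, 1, 2, 1, 2, 2, 1, 1, 1],
    1: [10, 8, 8, 12, 8],
    2: [5, 5, 4, 5, 4, 4, 4, 6, 4, 4],
}

# (min, max, mean) of the venue count for each cluster label.
summary = {label: (min(c), max(c), mean(c)) for label, c in cluster_counts.items()}
busiest = max(summary, key=lambda label: summary[label][2])

for label, (low, high, avg) in summary.items():
    print(f"Cluster {label}: min={low}, max={high}, mean={avg:.1f}")
print("Highest mean count: cluster", busiest)
```

The ranges do not overlap (1 to 2, 4 to 6, and 8 to 12), so the three clusters can be read as low, medium and high concentrations of Asian restaurants.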


In [24]:
#prepare the data for the map
neighborhood_coords = toronto_postal_codes_df.rename(columns={'Neighbourhood': 'Neighborhood'})
clstr_merged = merged_df.join(neighborhood_coords.set_index("Neighborhood"), on="Neighborhood")
clstr_merged = clstr_merged.rename(columns={'Venue': 'Venue count'})
map_df = clstr_merged[['Neighborhood','Latitude','Longitude','Venue count','Cluster Labels']]
map_df.head()


Unnamed: 0,Neighborhood,Latitude,Longitude,Venue count,Cluster Labels
0,"Adelaide, King, Richmond",43.650571,-79.384568,10,1
1,Berczy Park,43.644771,-79.373306,2,0
2,"Cabbagetown, St. James Town",43.667967,-79.367675,5,2
3,Central Bay Street,43.657952,-79.387383,8,1
4,"Chinatown, Grange Park, Kensington Market",43.653206,-79.400049,8,1


In [25]:
# plot the map 
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=12)

# set one color per cluster label (index 0 is cluster 0, and so on)
cluster_colors = ['green', 'red', 'blue']

# add markers to the map
for lat, lon, poi, cluster in zip(map_df['Latitude'], map_df['Longitude'], map_df['Neighborhood'], map_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=cluster_colors[cluster],
        fill_color=cluster_colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Looking at the resulting clusters, Cluster 1 groups the neighborhoods with the most Asian restaurants (8 to 12 each), which this project treats as a sign of high popularity for Asian food, so the recommendation is to open an Asian restaurant in one of those locations.

>Adelaide, King, Richmond; Central Bay Street; Chinatown, Grange Park, Kensington Market; Church and Wellesley; First Canadian Place, Underground city