## Introduction

In this project, we will attempt to find the optimal location for a restaurant in the city of Boston, MA. Specifically, this report will be targeted at stakeholders interested in outcompeting existing restaurants through high-quality food.

We will be looking for neighborhoods with few existing restaurants, preferably ones whose restaurants are low-rated and therefore easier to outcompete.

The data consists of a list of Boston neighborhoods scraped from a Wikipedia page, the latitude and longitude of each neighborhood's center obtained by geocoding with geopy's Nominatim, and restaurant information acquired through the Foursquare API.


In [1]:
# Import all the necessary libraries
import json  # library to handle JSON files

import numpy as np
import pandas as pd
import requests  # library to handle HTTP requests

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means clustering
from sklearn.cluster import KMeans

import folium  # map rendering library

# Geocoding of neighborhood names
from geopy.geocoders import Nominatim

# HTML parsing for the scraped neighborhood list
from bs4 import BeautifulSoup


We begin by scraping a webpage for a list of the neighborhoods in Boston and parsing it into a Beautiful Soup object.



In [2]:
# Send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_Boston").text
# Parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')
# Create a list to store neighbourhood data
neighborhoodList = []
# Append the data into the list
for row in soup.find_all("div", class_="div-col"):
    row = str(row).split('\n')

    for item in row:
        #print(item)
        start = item.find('title="') + len('title="')
        end = item.find('">')
        substring = item[start:end]
        #print(substring)

        neighborhoodList.append(substring)

neighborhoodList

['lass="div-col" style="column-width: 30em;',
 'Allston',
 'Back Bay, Boston',
 'Bay Village, Boston',
 'Beacon Hill, Boston',
 'Brighton, Boston',
 'Charlestown, Boston',
 'Chinatown, Boston',
 'Dorchester, Boston',
 'Downtown Boston',
 'East Boston',
 'Fenway-Kenmore',
 'Hyde Park, Boston',
 'Jamaica Plain, Boston',
 'Mattapan',
 'Mission Hill, Boston',
 'North End, Boston',
 'Roslindale',
 'Roxbury, Boston',
 'South Boston',
 'South End, Boston',
 'West End, Boston',
 'West Roxbury, Boston',
 'arf District</li></ul',
 '']

We then clean the list by removing the parsing artifacts at its start and end, build a DataFrame, and use geopy's Nominatim geocoder to look up the latitude and longitude of each neighborhood.

In [3]:
del neighborhoodList[0]
del neighborhoodList[-1]
del neighborhoodList[-1]
neighborhoodList

['Allston',
 'Back Bay, Boston',
 'Bay Village, Boston',
 'Beacon Hill, Boston',
 'Brighton, Boston',
 'Charlestown, Boston',
 'Chinatown, Boston',
 'Dorchester, Boston',
 'Downtown Boston',
 'East Boston',
 'Fenway-Kenmore',
 'Hyde Park, Boston',
 'Jamaica Plain, Boston',
 'Mattapan',
 'Mission Hill, Boston',
 'North End, Boston',
 'Roslindale',
 'Roxbury, Boston',
 'South Boston',
 'South End, Boston',
 'West End, Boston',
 'West Roxbury, Boston']
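
The string slicing and manual deletions above could also be avoided by reading each link's `title` attribute directly. The following is a minimal sketch under that assumption, not the method used in this notebook; the HTML fragment and the helper name `extract_titles` are hypothetical and only mimic the shape of the scraped `div-col` block.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the scraped "div-col" list
sample_html = """
<div class="div-col" style="column-width: 30em;">
<ul><li><a href="/wiki/Allston" title="Allston">Allston</a></li>
<li><a href="/wiki/Back_Bay,_Boston" title="Back Bay, Boston">Back Bay</a></li>
<li><a href="/wiki/West_Roxbury,_Boston" title="West Roxbury, Boston">West Roxbury</a></li>
<li>Wharf District</li></ul>
</div>
"""

def extract_titles(html):
    """Return the title attribute of every link inside a div-col block."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for block in soup.find_all("div", class_="div-col"):
        # Only anchors that carry a title attribute are collected,
        # so plain-text items are skipped without manual deletion
        for link in block.find_all("a", title=True):
            titles.append(link["title"])
    return titles

names = extract_titles(sample_html)
print(names)
```

Because only tagged links are collected, no leading or trailing fragments need to be deleted afterwards.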

In [4]:
neighFrame = pd.DataFrame(columns=['Neighborhood','Latitude', 'Longitude'])

# Create the geocoder once and reuse it for every neighborhood
locator = Nominatim(user_agent='myGeocoder')
for i in neighborhoodList:
    location = locator.geocode(i)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        print("The address was invalid:", i)
        continue
    neighFrame.loc[len(neighFrame)] = [i, location.latitude, location.longitude]


In [5]:
neighFrame

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Allston,42.355434,-71.132127
1,"Back Bay, Boston",42.350549,-71.080311
2,"Bay Village, Boston",42.350011,-71.066948
3,"Beacon Hill, Boston",42.358708,-71.067829
4,"Brighton, Boston",42.350097,-71.156442
5,"Charlestown, Boston",42.377875,-71.061996
6,"Chinatown, Boston",42.351329,-71.062623
7,"Dorchester, Boston",42.29732,-71.074495
8,Downtown Boston,42.351871,-71.067565
9,East Boston,42.375097,-71.039217
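
The geocoding step can be sketched so that it runs without network access by passing the geocoder in as a function. This is an illustrative sketch, not the notebook's code: the helper `geocode_neighborhoods`, the `FakeLocation` class, and the name "Nowhere" are hypothetical, and the callable is only assumed to behave like `Nominatim(...).geocode` (returning an object with `latitude` and `longitude`, or `None`).

```python
import pandas as pd

def geocode_neighborhoods(names, geocode):
    """Build a Neighborhood/Latitude/Longitude frame from a geocode callable."""
    rows, failed = [], []
    for name in names:
        location = geocode(name)
        if location is None:
            failed.append(name)  # remember which names could not be resolved
            continue
        rows.append((name, location.latitude, location.longitude))
    frame = pd.DataFrame(rows, columns=["Neighborhood", "Latitude", "Longitude"])
    return frame, failed

# Stand-in geocoder so the sketch runs offline
# (coordinates copied from the table above)
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

known = {"Allston": FakeLocation(42.355434, -71.132127),
         "East Boston": FakeLocation(42.375097, -71.039217)}

frame, failed = geocode_neighborhoods(["Allston", "Nowhere", "East Boston"], known.get)
print(frame)
print(failed)
```

With a real geocoder, `known.get` would be replaced by the `geocode` method of a `Nominatim` instance; collecting the failed names makes it easy to see which neighborhoods are missing from the frame.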


With the frame created, we create a map of the Boston area using Folium and mark where the neighborhoods are on it.

In [6]:
address = 'Boston, Massachusetts'
geolocator = Nominatim(user_agent="myGeocoder")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map1 = folium.Map(location=[latitude, longitude], zoom_start=11)
# Adding markers to map
for lat, lng, neighborhood in zip(neighFrame['Latitude'], neighFrame['Longitude'], neighFrame['Neighborhood']):
    label = folium.Popup('{}'.format(neighborhood), parse_html=True)
    folium.CircleMarker([lat, lng],radius=5,popup=label,color='blue',fill=True,fill_color='#3186cc',fill_opacity=0.7).add_to(map1)
map1

Next, we create the list of all venues, which is then pared down to only restaurants.

In [19]:
import os

# Foursquare credentials are read from environment variables rather than hard-coded;
# the environment variable names below are placeholders chosen for this notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN'] # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 10000
radius = 5000
venues = []
for lat, lng, neighborhood in zip(neighFrame['Latitude'], neighFrame['Longitude'], neighFrame['Neighborhood']):
    # Create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
    # Make the GET request and keep the list of returned venues
    results = requests.get(url).json()["response"]['groups'][0]['items']

    # Keep only the relevant information for each nearby venue
    for item in results:
        venue = item['venue']
        venues.append((neighborhood, lat, lng, venue['name'],
                       venue['location']['lat'], venue['location']['lng'],
                       venue['categories'][0]['name'], venue['id']))

In [20]:
venuesFrame = pd.DataFrame(venues)
# Defining the column names
venuesFrame.columns = ['Neighborhood', 'Latitude', 'Longitude', 'Venue Name', 'Venue Latitude', 'Venue Longitude', 'Venue Category', 'Venue ID']
print(venuesFrame.shape)
venuesFrame

(2172, 8)


Unnamed: 0,Neighborhood,Latitude,Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category,Venue ID
0,Allston,42.355434,-71.132127,Lulu's Allston,42.355068,-71.134107,Comfort Food Restaurant,530647fd498e4ac184afea7c
1,Allston,42.355434,-71.132127,Shabu Zen,42.352678,-71.129113,Japanese Restaurant,4a89e36df964a520430920e3
2,Allston,42.355434,-71.132127,Whole Heart Provisions,42.353745,-71.137189,Vegetarian / Vegan Restaurant,5605cfde498e93568a705014
3,Allston,42.355434,-71.132127,Tous les Jours,42.351753,-71.131665,Bakery,5993ab8e356b497a3de157b8
4,Allston,42.355434,-71.132127,Lime Red Tea House,42.352090,-71.124268,Bubble Tea Shop,594ef01f35f9830368f7dc61
...,...,...,...,...,...,...,...,...
2167,"West Roxbury, Boston",42.279265,-71.149497,FoMu,42.314292,-71.114213,Ice Cream Shop,516d7e16498eabc14f3506fc
2168,"West Roxbury, Boston",42.279265,-71.149497,Animal Rescue League of Boston,42.265533,-71.185309,Animal Shelter,4c43258dd7fad13a48160ada
2169,"West Roxbury, Boston",42.279265,-71.149497,THE F.I.T.T. PIT,42.252679,-71.119017,Athletics & Sports,509fc743e4b083e9a60391dc
2170,"West Roxbury, Boston",42.279265,-71.149497,Shanti Taste of India Roslindale,42.287153,-71.127670,Indian Restaurant,50ef8d5c8acaed20c6c6c271
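
The request URL used above can also be assembled from a parameter dictionary, which keeps each query parameter visible and handles escaping. This is a minimal sketch using only the standard library; the helper name `build_explore_url` and the argument values passed to it are hypothetical, while the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes reserved characters such as the comma in "ll"
    return EXPLORE_URL + "?" + urlencode(params)

# Placeholder credentials and limit, for illustration only
url = build_explore_url("ID", "SECRET", "20180604", 42.355434, -71.132127, 5000, 100)
print(url)
```

No request is sent here; the sketch only builds the string that would be passed to `requests.get`.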


In [21]:


# Keywords that identify a venue category as a place to eat or drink
typeList = ["Restaurant", "Joint", "Bar", "Sandwich"]

# Keep each venue once if its category contains any of the keywords
isRestaurant = venuesFrame['Venue Category'].str.contains('|'.join(typeList))
restaurantFrame = venuesFrame.loc[isRestaurant, ['Neighborhood', 'Venue Name', 'Venue Category', 'Venue ID']].reset_index(drop=True)
restaurantFrame.columns = ['Neighborhood', 'Restaurant Name', 'Category', 'ID']

restaurantFrame

Unnamed: 0,Neighborhood,Restaurant Name,Category,ID
0,Allston,Lulu's Allston,Comfort Food Restaurant,530647fd498e4ac184afea7c
1,Allston,Shabu Zen,Japanese Restaurant,4a89e36df964a520430920e3
2,Allston,Whole Heart Provisions,Vegetarian / Vegan Restaurant,5605cfde498e93568a705014
3,Allston,Genki Ya,Sushi Restaurant,4a6f6fd0f964a5202ad61fe3
4,Allston,Raising Cane's Chicken Fingers,Fried Chicken Joint,4ad7572df964a520a30921e3
...,...,...,...,...
687,"West Roxbury, Boston",City Feed & Supply,Sandwich Place,49fe266df964a520776f1fe3
688,"West Roxbury, Boston",Chilacates,Mexican Restaurant,58b364593bd4ab7454103d28
689,"West Roxbury, Boston",Halfway Cafe,American Restaurant,4aef960ff964a52068d921e3
690,"West Roxbury, Boston",Five Guys,Burger Joint,4a99a005f964a520842f20e3
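
Plain substring matching on category names can pick up unintended categories, since a keyword such as "Bar" is also a substring of longer words. A possible refinement, sketched here on hypothetical data (the venue names and the "Barbershop" category are made up for illustration), is to match keywords as whole words with a regular expression.

```python
import re
import pandas as pd

# Hypothetical venues; only the keyword list comes from the notebook
venues = pd.DataFrame({
    "Venue Name": ["A", "B", "C", "D", "E"],
    "Venue Category": ["Sushi Restaurant", "Barbershop", "Wine Bar",
                       "Sandwich Place", "Ice Cream Shop"],
})
keywords = ["Restaurant", "Joint", "Bar", "Sandwich"]

# \b anchors each keyword at word boundaries; (?: ... ) avoids capture groups
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
mask = venues["Venue Category"].str.contains(pattern, regex=True)
restaurants = venues[mask]
print(restaurants)
```

Here "Barbershop" is excluded because "Bar" is not followed by a word boundary, while "Wine Bar" is kept.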


In [23]:
# One hot encoding
onehotFrame = pd.get_dummies(restaurantFrame[['Category']], prefix="", prefix_sep="")
# Adding neighborhood column back to dataframe
onehotFrame['Neighborhood'] = restaurantFrame['Neighborhood']
# Moving neighbourhood column to the first column
fixed_columns = [onehotFrame.columns[-1]] + list(onehotFrame.columns[:-1])
onehotFrame = onehotFrame[fixed_columns]
print(onehotFrame.shape)


(692, 47)


In [24]:
groupedFrame=onehotFrame.groupby(["Neighborhood"]).sum().reset_index()
print(groupedFrame.shape)
groupedFrame

(22, 47)


Unnamed: 0,Neighborhood,American Restaurant,Arepa Restaurant,Asian Restaurant,Australian Restaurant,BBQ Joint,Bar,Belgian Restaurant,Burger Joint,Caribbean Restaurant,...,Sandwich Place,Seafood Restaurant,Southern / Soul Food Restaurant,Sports Bar,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar
0,Allston,2,1,0,0,1,0,0,1,0,...,3,3,0,0,2,1,0,4,0,0
1,"Back Bay, Boston",4,1,1,0,0,1,0,1,0,...,2,4,0,0,0,1,0,0,0,1
2,"Bay Village, Boston",4,1,1,0,0,1,1,0,0,...,3,6,0,0,0,0,0,1,0,1
3,"Beacon Hill, Boston",2,0,1,0,0,0,1,0,0,...,4,5,0,0,0,0,0,1,0,1
4,"Brighton, Boston",2,1,1,0,0,2,0,2,0,...,2,1,0,1,3,1,0,2,0,0
5,"Charlestown, Boston",0,0,1,1,0,0,1,0,0,...,2,3,0,0,0,0,0,1,0,0
6,"Chinatown, Boston",2,0,1,0,0,0,1,0,0,...,4,6,0,0,0,0,0,1,0,1
7,"Dorchester, Boston",6,0,0,0,0,3,0,0,3,...,2,0,1,0,0,2,0,0,5,0
8,Downtown Boston,4,0,1,0,0,1,1,0,0,...,3,8,0,0,0,0,0,1,0,1
9,East Boston,1,0,0,1,0,0,1,0,0,...,2,5,0,0,0,0,0,0,0,0


In [25]:

# Every column after "Neighborhood" is a restaurant category
typeList2 = list(groupedFrame.columns[1:])
print(typeList2)

# Total restaurants per neighborhood, summed over the category columns only
groupedFrame["Restaurant Count"] = groupedFrame[typeList2].sum(axis=1)

restFrame2 = groupedFrame[["Neighborhood","Restaurant Count"]]
restFrame2

['American Restaurant', 'Arepa Restaurant', 'Asian Restaurant', 'Australian Restaurant', 'BBQ Joint', 'Bar', 'Belgian Restaurant', 'Burger Joint', 'Caribbean Restaurant', 'Chinese Restaurant', 'Cocktail Bar', 'Comfort Food Restaurant', 'Cuban Restaurant', 'Dive Bar', 'Ethiopian Restaurant', 'Falafel Restaurant', 'Fast Food Restaurant', 'French Restaurant', 'Fried Chicken Joint', 'Greek Restaurant', 'Hot Dog Joint', 'Hotpot Restaurant', 'Indian Restaurant', 'Italian Restaurant', 'Japanese Restaurant', 'Jewish Restaurant', 'Juice Bar', 'Korean Restaurant', 'Latin American Restaurant', 'Mediterranean Restaurant', 'Mexican Restaurant', 'Meze Restaurant', 'Middle Eastern Restaurant', 'New American Restaurant', 'Peruvian Restaurant', 'Restaurant', 'Sandwich Place', 'Seafood Restaurant', 'Southern / Soul Food Restaurant', 'Sports Bar', 'Sushi Restaurant', 'Tapas Restaurant', 'Thai Restaurant', 'Vegetarian / Vegan Restaurant', 'Vietnamese Restaurant', 'Wine Bar']


Unnamed: 0,Neighborhood,Restaurant Count
0,Allston,31
1,"Back Bay, Boston",31
2,"Bay Village, Boston",38
3,"Beacon Hill, Boston",27
4,"Brighton, Boston",34
5,"Charlestown, Boston",25
6,"Chinatown, Boston",30
7,"Dorchester, Boston",35
8,Downtown Boston,37
9,East Boston,22
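
The restaurant count per neighborhood is obtained above by one-hot encoding the categories, summing per neighborhood, and then summing across categories. The sketch below, on a small hypothetical frame, illustrates that this gives the same totals as counting rows per neighborhood directly; the one-hot route is mainly useful when the per-category breakdown is also wanted.

```python
import pandas as pd

# Hypothetical restaurants, for illustration only
restaurants = pd.DataFrame({
    "Neighborhood": ["Allston", "Allston", "Allston", "East Boston", "East Boston"],
    "Category": ["Sushi Restaurant", "Burger Joint", "Sushi Restaurant",
                 "Seafood Restaurant", "Bar"],
})

# Route taken above: one-hot encode, sum per neighborhood, then sum across categories
onehot = pd.get_dummies(restaurants["Category"])
onehot["Neighborhood"] = restaurants["Neighborhood"]
per_category = onehot.groupby("Neighborhood").sum()
count_via_onehot = per_category.sum(axis=1)

# Direct route: count rows per neighborhood
count_direct = restaurants.groupby("Neighborhood").size()

print(per_category)
print(count_direct)
```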


In [30]:
# Setting the number of clusters
kclusters = 8
clusterFrame = restFrame2.drop(columns=["Neighborhood"])
# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(clusterFrame)
# Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([2, 2, 5, 6, 4, 1, 2, 7, 0, 3])

In [31]:
# Creating a new dataframe that holds the restaurant count and its cluster label for each neighborhood
mergedFrame = restFrame2.copy()
# Add the clustering labels
mergedFrame["Count Cluster Labels"] = kmeans.labels_
mergedFrame.head(10)

Unnamed: 0,Neighborhood,Restaurant Count,Count Cluster Labels
0,Allston,31,2
1,"Back Bay, Boston",31,2
2,"Bay Village, Boston",38,5
3,"Beacon Hill, Boston",27,6
4,"Brighton, Boston",34,4
5,"Charlestown, Boston",25,1
6,"Chinatown, Boston",30,2
7,"Dorchester, Boston",35,7
8,Downtown Boston,37,0
9,East Boston,22,3
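
The label numbers assigned by k-means are arbitrary identifiers and carry no order, so cluster 0 is not necessarily the group with the fewest restaurants. One way to make the labels easier to read, sketched below, is to rank the clusters by their centroid. The counts are the ten values shown in the table above; the choice of three clusters and the explicit `n_init=10` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Restaurant counts for the ten neighborhoods shown above
counts = np.array([31, 31, 38, 27, 34, 25, 30, 35, 37, 22]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(counts)

# Sort cluster ids by centroid, then map each id to its rank (0 = fewest restaurants)
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_label = np.empty_like(order)
rank_of_label[order] = np.arange(len(order))
ranked_labels = rank_of_label[kmeans.labels_]
print(ranked_labels)
```

After this remapping, a lower label always means a cluster with a lower average restaurant count.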


In [32]:
# Adding latitude and longitude values by matching on the neighborhood name
mergedFrame = mergedFrame.merge(neighFrame, on='Neighborhood', how='left')

mergedFrame

Unnamed: 0,Neighborhood,Restaurant Count,Count Cluster Labels,Latitude,Longitude
0,Allston,31,2,42.355434,-71.132127
1,"Back Bay, Boston",31,2,42.350549,-71.080311
2,"Bay Village, Boston",38,5,42.350011,-71.066948
3,"Beacon Hill, Boston",27,6,42.358708,-71.067829
4,"Brighton, Boston",34,4,42.350097,-71.156442
5,"Charlestown, Boston",25,1,42.377875,-71.061996
6,"Chinatown, Boston",30,2,42.351329,-71.062623
7,"Dorchester, Boston",35,7,42.29732,-71.074495
8,Downtown Boston,37,0,42.351871,-71.067565
9,East Boston,22,3,42.375097,-71.039217


In [33]:
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# Setting color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Add markers to the map
for lat, lon, poi, cluster in zip(mergedFrame['Latitude'], mergedFrame['Longitude'], mergedFrame['Neighborhood'], mergedFrame['Count Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster],fill=True,fill_color=rainbow[cluster],fill_opacity=0.7).add_to(map_clusters)
map_clusters

We now have a map of Boston neighborhoods colored by restaurant-count cluster, which shows where the existing competition is lightest.

In theory, the next cell would fetch a rating for each restaurant. We would then average the ratings for each neighborhood and cluster the result. However, Foursquare does not allow enough calls per day for this, especially if several people use the same credentials on one day. The ratings used below are instead taken from a randomly generated list.
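
The per-neighborhood averaging described above can be sketched as follows. The ratings in this example are hypothetical, not Foursquare data; missing ratings are stored as NaN rather than 0 so that they are skipped by the mean instead of pulling the average down.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; NaN marks a venue for which no rating was returned
rated = pd.DataFrame({
    "Neighborhood": ["Allston", "Allston", "East Boston", "East Boston", "Mattapan"],
    "Restaurant Rating": [8.0, 6.0, 7.5, np.nan, np.nan],
})

# mean() ignores NaN values; a neighborhood with no ratings at all stays NaN
avg_rating = (rated.groupby("Neighborhood")["Restaurant Rating"]
                   .mean()
                   .rename("Average Rating")
                   .reset_index())
print(avg_rating)
```

A neighborhood whose average is NaN would need to be dropped or filled before clustering, since k-means cannot handle missing values.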


In [37]:
'''ratingList = []
for index, row in restaurantFrame.iterrows():
    venue_id = row['ID']
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&oauth_token={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION)
    result = requests.get(url).json()

    try:
        ratingList.append(float(result['response']['venue']['rating']))
    except (KeyError, TypeError, ValueError):
        # No rating was returned for this venue
        ratingList.append(np.nan)

ratingArray = np.asarray(ratingList)
restaurantFrame['Restaurant Rating'] = ratingArray
restaurantFrame["Restaurant Rating"] = pd.to_numeric(restaurantFrame["Restaurant Rating"], downcast="float")
restaurantFrame'''




In [42]:

# Setting the number of clusters
kclusters = 6
# Work on a copy so that adding a column does not trigger a SettingWithCopyWarning
restFrame2 = restFrame2.copy()
# Placeholder ratings from a randomly generated list (one value is needed per neighborhood)
restFrame2["Ratings"] = [5,4,6,6,4,7,3,7,8,5,9,3,4,7,6,3,8,6,3,6,8]

# Cluster on the ratings only
clusterFrame = restFrame2[["Ratings"]]
# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(clusterFrame)

# Adding the rating cluster labels and the ratings to the merged dataframe
mergedFrame["Rating Cluster Labels"] = kmeans.labels_
mergedFrame["Ratings"] = restFrame2['Ratings']
mergedFrame


Unnamed: 0,Neighborhood,Restaurant Count,Count Cluster Labels,Latitude,Longitude,Rating Cluster Labels,Ratings
0,Allston,31,2,42.355434,-71.132127,3,5
1,"Back Bay, Boston",31,2,42.350549,-71.080311,0,4
2,"Bay Village, Boston",38,5,42.350011,-71.066948,2,6
3,"Beacon Hill, Boston",27,6,42.358708,-71.067829,2,6
4,"Brighton, Boston",34,4,42.350097,-71.156442,0,4
5,"Charlestown, Boston",25,1,42.377875,-71.061996,5,7
6,"Chinatown, Boston",30,2,42.351329,-71.062623,4,3
7,"Dorchester, Boston",35,7,42.29732,-71.074495,5,7
8,Downtown Boston,37,0,42.351871,-71.067565,1,8
9,East Boston,22,3,42.375097,-71.039217,3,5


In [45]:
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# Setting color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Add markers to the map
for lat, lon, poi, cluster in zip(mergedFrame['Latitude'], mergedFrame['Longitude'], mergedFrame['Neighborhood'], mergedFrame['Rating Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster],fill=True,fill_color=rainbow[cluster],fill_opacity=0.7).add_to(map_clusters)
map_clusters

## Analysis

With the restaurant counts generated and the ratings assigned, the next step is to analyze the data to determine which neighborhoods would be best. We identify all neighborhoods with a restaurant count of at most 30 and a rating of at most 5.

In [44]:
resultFrame1= mergedFrame[mergedFrame['Restaurant Count'] <= 30]
resultFrame2 = resultFrame1[resultFrame1["Ratings"]<=5]
resultFrame2

Unnamed: 0,Neighborhood,Restaurant Count,Count Cluster Labels,Latitude,Longitude,Rating Cluster Labels,Ratings
6,"Chinatown, Boston",30,2,42.351329,-71.062623,4,3
9,East Boston,22,3,42.375097,-71.039217,3,5
15,"North End, Boston",22,3,42.365097,-71.054495,4,3
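
The two filtering steps above can also be written as a single boolean mask. The sketch below is illustrative and uses only the ten rows displayed in the merged table earlier, so the North End (row 15) does not appear in its result; the constant names `MAX_COUNT` and `MAX_RATING` are hypothetical.

```python
import pandas as pd

# The ten rows shown in the merged table above
sample = pd.DataFrame({
    "Neighborhood": ["Allston", "Back Bay, Boston", "Bay Village, Boston",
                     "Beacon Hill, Boston", "Brighton, Boston", "Charlestown, Boston",
                     "Chinatown, Boston", "Dorchester, Boston", "Downtown Boston",
                     "East Boston"],
    "Restaurant Count": [31, 31, 38, 27, 34, 25, 30, 35, 37, 22],
    "Ratings": [5, 4, 6, 6, 4, 7, 3, 7, 8, 5],
})

MAX_COUNT = 30   # threshold on the number of existing restaurants
MAX_RATING = 5   # threshold on the rating

# Combine both conditions with &, then list the least crowded neighborhoods first
mask = (sample["Restaurant Count"] <= MAX_COUNT) & (sample["Ratings"] <= MAX_RATING)
candidates = sample[mask].sort_values(["Restaurant Count", "Ratings"])
print(candidates)
```

Keeping the thresholds in named constants makes it straightforward to see how the shortlist changes when they are tightened or relaxed.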


## Results



The results indicate that the best neighborhoods in which to open a fine dining restaurant in Boston would be Chinatown, East Boston, and the North End, based on the relative lack of competition and the low ratings of the existing restaurants.

## Conclusion

The purpose of this program was to aid stakeholders in deciding where to locate a new fine dining restaurant in Boston.

The final decision was based on the number of restaurants in each neighborhood and the average rating of those restaurants.