# Finding the best place to open a restaurant
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report is aimed at stakeholders interested in opening a restaurant in **Montevideo**, Uruguay.

We will first analyze the neighborhoods to identify the median income and total population of each, and create a ranking based on these metrics.
Then we will determine which of the top-ranked neighborhoods has the fewest food venues, so that we can choose a location that is not already crowded with restaurants.
We will use our data science skills to produce a short list of the most promising neighborhoods based on these criteria.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* median income in the neighborhood
* population in the neighborhood
* number of food venues in the neighborhood

We decided to use a JSON file that contains the segmentation to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* Median income will be obtained using data from the National Statistics Institute **http://www.ine.gub.uy/encuesta-continua-de-hogares1**
* Population will be obtained using data from the National Statistics Institute **http://www.ine.gub.uy/web/guest/censos1**
* Number of food venues and their type and location in every neighborhood will be obtained using **Foursquare API**
* Segmentation of the neighborhoods within the city will be obtained using a JSON file found on GitHub **https://github.com/vierja/geojson_montevideo**

## Methodology <a name="methodology"></a>

In this project we will direct our efforts toward finding a neighborhood in Montevideo that has a high median income, a high population, and a low number of food venues. We will limit our analysis to a radius of 800 meters around each neighborhood center, as defined by the geospatial coordinates in the JSON file.
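To make the 800-meter limit concrete, here is a minimal sketch of how one could check whether a point falls inside that radius. It uses the haversine formula with a mean Earth radius; the function names and the sample coordinates are assumptions made for illustration and are not part of the notebook, which leaves the radius filtering to the Foursquare API.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(center, point, radius_m=800):
    """True if `point` (lat, lng) lies within `radius_m` meters of `center` (lat, lng)."""
    return haversine_m(center[0], center[1], point[0], point[1]) <= radius_m

# illustrative coordinates only (hypothetical center and nearby point)
example_center = (-34.9081, -56.1465)
example_point = (-34.9098, -56.1460)
print(within_radius(example_center, example_point))
```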

In the first step we will collect the required **data: population and median income by neighborhood**.

In the second step we will calculate and explore the number of '**food venues**' in the top neighborhoods identified in the first step.

In the third and final step we will focus on the best neighborhood and create **clusters of food venue locations**.
We will cluster those locations (using **k-means clustering**) to identify general zones that should serve as a starting point for the stakeholders' final 'street level' exploration and search for an optimal venue location.
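As a rough illustration of what k-means does with venue locations, the sketch below implements the plain assign-then-update loop in pure Python. It is not the scikit-learn implementation used later in the notebook; the function name, the fixed starting centroids, and the sample coordinates are assumptions, and Euclidean distance on raw degrees is only an approximation over such a small area.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on (lat, lng) pairs; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster where it was
        centroids = new_centroids
    return centroids, labels

# made-up venue coordinates forming two loose groups
sample_points = [(-34.9040, -56.1810), (-34.9045, -56.1815),
                 (-34.9070, -56.1755), (-34.9065, -56.1750)]
zones, assignment = kmeans(sample_points, [sample_points[0], sample_points[2]])
print(assignment)
```

Each resulting centroid can be read as the center of a general zone of venues.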

### Data Loading

Let's import the libraries that we will use:

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests # library to handle requests

from sklearn.cluster import KMeans # import k-means from clustering stage

import folium # map rendering library

print('Libraries imported.')


Let's load the population by neighborhood data:

In [2]:
url2 = 'https://raw.githubusercontent.com/agustinmaciel/Coursera_Capstone/master/poblacion.csv'
df_population = pd.read_csv(url2,encoding= 'unicode_escape',sep=';')
df_population.head()

Unnamed: 0,Barrio,CCZ,Municipio,Poblacion
0,Aguada,1,B,11590
1,Aires Puros,10,C,15315
2,Atahualpa,15,C,12600
3,Banados de Carrasco,9,F,18764
4,Barrio Sur,1,B,11590


Let's drop the fields that we don't need and rename the columns:

In [3]:
df_population.drop(['CCZ', 'Municipio'],axis=1,inplace=True)
df_population.rename(columns={'Barrio':'Neighborhood','Poblacion':'Population'},inplace=True)
df_population.head()

Unnamed: 0,Neighborhood,Population
0,Aguada,11590
1,Aires Puros,15315
2,Atahualpa,12600
3,Banados de Carrasco,18764
4,Barrio Sur,11590


Let's check the most populated neighborhoods:

In [4]:
df_population.sort_values(['Population'],ascending=False).head()

Unnamed: 0,Neighborhood,Population
21,Cordon,43000
44,Paso de la Arena,41668
57,Tres Ombues y Victoria,41556
48,Pocitos,39500
7,Buceo,34000


Let's check the least populated neighborhoods:

In [5]:
df_population.sort_values(['Population'],ascending=False).tail()

Unnamed: 0,Neighborhood,Population
0,Aguada,11590
56,Tres Cruces,9400
9,Carrasco,8700
42,Palermo,6300
43,Parque Rodo,2800


Let's load the income data by neighborhood:

In [6]:
url = 'https://raw.githubusercontent.com/agustinmaciel/Coursera_Capstone/master/ingresos.csv'
df_income = pd.read_csv(url,sep=';')
df_income.head()

Unnamed: 0,Neighborhood,Income
0,Ituzaingo,23968
1,Ituzaingo,0
2,Cordon,53662
3,Cordon,0
4,Pocitos,73500


Let's check the median income by neighborhood:

In [7]:
df_income_median = df_income.groupby('Neighborhood')['Income'].median().reset_index()
df_income_median.head()

Unnamed: 0,Neighborhood,Income
0,Aguada,26000.0
1,Aires Puros,23311.0
2,Atahualpa,35941.0
3,Banados de Carrasco,3000.0
4,Barrio Sur,30000.0
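To show what the grouped median computes, here is a small sketch in plain Python with made-up rows. The function name and the sample values are hypothetical. Note that rows with an income of 0, like those visible in the head of the income table, count toward the median and pull it down.

```python
from collections import defaultdict
from statistics import median

def median_by_group(rows):
    """Group (neighborhood, income) pairs and return the median income per neighborhood."""
    groups = defaultdict(list)
    for name, income in rows:
        groups[name].append(income)
    return {name: median(values) for name, values in groups.items()}

# hypothetical survey rows: (neighborhood, household income)
sample = [("A", 10000), ("A", 30000), ("A", 0), ("B", 20000), ("B", 40000)]
print(median_by_group(sample))
```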


Transform the income column to integer:

In [8]:
df_income_median['Income'] = df_income_median['Income'].astype(int)

Check the data type to confirm:

In [9]:
df_income_median.dtypes

Neighborhood    object
Income           int64
dtype: object

Let's merge the population data with the income data:

In [10]:
df_merged = df_population.join(df_income_median.set_index('Neighborhood'), on='Neighborhood')
df_merged.head()

Unnamed: 0,Neighborhood,Population,Income
0,Aguada,11590,26000
1,Aires Puros,15315,23311
2,Atahualpa,12600,35941
3,Banados de Carrasco,18764,3000
4,Barrio Sur,11590,30000
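Because the two tables are matched on the neighborhood name, a spelling difference would silently leave a row without income. A possible way to surface such rows, sketched here with toy data, is a left merge with `indicator=True`; the frames and the "Example Barrio" row are invented for illustration.

```python
import pandas as pd

# toy frames; "Example Barrio" has no matching income row
population = pd.DataFrame({"Neighborhood": ["Aguada", "Pocitos", "Example Barrio"],
                           "Population": [11590, 39500, 1000]})
income = pd.DataFrame({"Neighborhood": ["Aguada", "Pocitos"],
                       "Income": [26000, 40288]})

# the _merge column records whether each row was found in both tables
checked = population.merge(income, on="Neighborhood", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```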


Confirm the data types are integer for the numeric columns:

In [11]:
df_merged.dtypes

Neighborhood    object
Population       int64
Income           int64
dtype: object

Let's sort the top 10 values by income and population in descending order:

In [12]:
df_merged.sort_values(['Income','Population'],ascending=False).head(10)

Unnamed: 0,Neighborhood,Population,Income
48,Pocitos,39500,40288
52,Punta Carretas,21500,38193
2,Atahualpa,12600,35941
14,Centro,14100,34624
43,Parque Rodo,2800,34238
53,Punta Gorda,19000,32118
35,Malvin,21000,32000
56,Tres Cruces,9400,31014
21,Cordon,43000,30000
49,Pque. Batlle y V. Dolores,26000,30000


Let's create a ranking score by multiplying the population of each neighborhood by its median income and dividing by the total population of the city (taken here as 1,300,000).
Please note that this value does **not** have a valid statistical meaning; it is just a way to compare neighborhoods.

In [13]:
df_merged['Result'] = df_merged['Population'] * df_merged['Income'] / 1300000
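As a worked example of the score, the sketch below applies the same arithmetic to single values. The function name is ours; the constant and the Pocitos and Cordon figures are the ones used in this notebook.

```python
CITY_POPULATION = 1300000  # same constant as in the cell above

def ranking_score(population, median_income, city_population=CITY_POPULATION):
    """Population times median income, scaled by the city population."""
    return population * median_income / city_population

print(round(ranking_score(39500, 40288)))  # Pocitos
print(round(ranking_score(43000, 30000)))  # Cordon
```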

Let's sort by this artificial ranking score to find the top 10 neighborhoods by combined median income and population:

In [14]:
df_merged.sort_values(['Result'],ascending=False).head(10).round()

Unnamed: 0,Neighborhood,Population,Income,Result
48,Pocitos,39500,40288,1224.0
21,Cordon,43000,30000,992.0
7,Buceo,34000,29062,760.0
52,Punta Carretas,21500,38193,632.0
49,Pque. Batlle y V. Dolores,26000,30000,600.0
35,Malvin,21000,32000,517.0
53,Punta Gorda,19000,32118,469.0
26,La Blanqueada,19677,30000,454.0
40,Mercado Modelo y Bolivar,22545,25197,437.0
57,Tres Ombues y Victoria,41556,12000,384.0


Let's confirm that there are no null values in the population column:

In [15]:
df_merged['Population'].isnull().values.any()

False

Let's confirm that there are no null values in the income column:

In [16]:
df_merged['Income'].isnull().values.any()

False

#### **Now let's load and analyze the city's neighborhoods geospatial data:**

Let's download the neighborhoods JSON data from our repository on GitHub:

In [17]:
!wget -q -O 'barrios.geojson' https://raw.githubusercontent.com/agustinmaciel/Coursera_Capstone/master/barrios.geojson
print('Data downloaded!')

Data downloaded!


Load the json data into a variable:

In [18]:
with open('barrios.geojson') as json_data:
    montevideo_data = json.load(json_data)

Display the data:

In [19]:
montevideo_data

{'type': 'FeatureCollection',
 'crs': {'type': 'name',
  'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}},
 'features': [{'type': 'Feature',
   'properties': {'id_barrio': 32,
    'nombre': 'Manga y Toledo Chico',
    'codigo': 'MT'},
   'geometry': {'type': 'Point', 'coordinates': [-56.1859925, -34.7644335]}},
  {'type': 'Feature',
   'properties': {'id_barrio': 58,
    'nombre': 'Colon Sureste y Abayuba',
    'codigo': 'CY'},
   'geometry': {'type': 'Point', 'coordinates': [-56.2248196, -34.7781884]}},
  {'type': 'Feature',
   'properties': {'id_barrio': 61,
    'nombre': 'Villa Garcia y Manga Rur.',
    'codigo': 'VG'},
   'geometry': {'type': 'Point', 'coordinates': [-56.0882308, -34.7772547]}},
  {'type': 'Feature',
   'properties': {'id_barrio': 16,
    'nombre': 'Banados de Carrasco',
    'codigo': 'BC'},
   'geometry': {'type': 'Point', 'coordinates': [-56.1204446, -34.8324871]}},
  {'type': 'Feature',
   'properties': {'id_barrio': 62, 'nombre': 'Manga', 'codigo': 'MG'

The neighborhood data is stored in the "features" field, so let's store that list in a new variable:

In [20]:
neighborhoods_data = montevideo_data['features']

Let's check the data for a specific neighborhood to verify that it was loaded correctly:

In [21]:
neighborhoods_data[52]

{'type': 'Feature',
 'properties': {'id_barrio': 8, 'nombre': 'Pocitos', 'codigo': 'PO'},
 'geometry': {'type': 'Point', 'coordinates': [-56.1465, -34.9081]}}
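Note that GeoJSON stores point coordinates as [longitude, latitude], the reverse of the (latitude, longitude) order we use elsewhere. A small helper along these lines can make the swap explicit; the function name is hypothetical, and the feature is the Pocitos record shown above.

```python
def feature_to_row(feature):
    """Turn one GeoJSON point feature into a flat row with explicit lat/lon keys."""
    lon, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order: lon, lat
    return {"Neighborhood": feature["properties"]["nombre"],
            "Latitude": lat,
            "Longitude": lon}

pocitos = {"type": "Feature",
           "properties": {"id_barrio": 8, "nombre": "Pocitos", "codigo": "PO"},
           "geometry": {"type": "Point", "coordinates": [-56.1465, -34.9081]}}
print(feature_to_row(pocitos))
```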

Let's create a new data frame for the data:

In [22]:
# define the dataframe columns
column_names = ['Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Let's check the data frame; it should not contain any data yet, but the column labels should be there:

In [23]:
neighborhoods

Unnamed: 0,Neighborhood,Latitude,Longitude


Let's insert the spatial data that we extracted from the JSON file into the new data frame:

In [24]:
# collect one row per neighborhood; GeoJSON stores coordinates as [longitude, latitude]
rows = []
for data in neighborhoods_data:
    neighborhood_name = data['properties']['nombre']

    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

Let's check the data inside the data frame:

In [25]:
neighborhoods.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Manga y Toledo Chico,-34.764434,-56.185992
1,Colon Sureste y Abayuba,-34.778188,-56.22482
2,Villa Garcia y Manga Rur.,-34.777255,-56.088231
3,Banados de Carrasco,-34.832487,-56.120445
4,Manga,-34.777489,-56.185992


Let's check how many neighborhoods we have:

In [26]:
print('The dataframe has {} neighborhoods.'.format(
        len(neighborhoods['Neighborhood'].unique())))

The dataframe has 62 neighborhoods.


Let's get the geographical coordinates of the city that we are analyzing, **Montevideo** (the capital of Uruguay):

In [27]:
address = 'Montevideo, Uruguay'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Montevideo are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Montevideo are -34.9059039, -56.1913569.


### Let's plot our map of Montevideo with the markers for each neighborhood:

In [28]:
# create map of Montevideo using latitude and longitude values
map_montevideo = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_montevideo)  
    
map_montevideo

## Analysis <a name="analysis"></a>

#### Let's perform some basic exploratory data analysis and derive some additional information from our raw data.
Let's retrieve the food venues for each of the top 3 neighborhoods in our ranking. We will analyze these in more detail in order to narrow down our selected location.
First, let's recall the top 3 neighborhoods:

In [29]:
df_merged.sort_values(['Result'],ascending=False).round().head(3)

Unnamed: 0,Neighborhood,Population,Income,Result
48,Pocitos,39500,40288,1224.0
21,Cordon,43000,30000,992.0
7,Buceo,34000,29062,760.0


### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on the restaurants in each neighborhood.
We're interested in venues in the "food" category.

First, let's define the credentials:

In [30]:
CLIENT_ID = 'PLWZS02RSN2SS21KPQ0PSBDDVQ2N5LWLKRVVQ4KWKT1UXNBB' # your Foursquare ID
CLIENT_SECRET = 'BLPB5FIND45ATA1Y4LUYN2TDVVTJ1L4XFVT4BNBLTRZKMHCV' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PLWZS02RSN2SS21KPQ0PSBDDVQ2N5LWLKRVVQ4KWKT1UXNBB
CLIENT_SECRET:BLPB5FIND45ATA1Y4LUYN2TDVVTJ1L4XFVT4BNBLTRZKMHCV
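An alternative to writing the credentials directly in the notebook is to read them from environment variables, so they never appear in the code or its output. The sketch below is one way this could look; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the function name are our assumptions, not something Foursquare prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given mapping, or raise if missing."""
    try:
        # hypothetical variable names chosen for this sketch
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None
```

In the notebook this would be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()`, after setting both variables in the shell that launches Jupyter.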


Let's start with the first neighborhood, "Pocitos". We will look up its geospatial information:

In [31]:
neighborhoods_PO = neighborhoods.loc[neighborhoods['Neighborhood'] == 'Pocitos']
neighborhoods_PO

Unnamed: 0,Neighborhood,Latitude,Longitude
52,Pocitos,-34.9081,-56.1465


Let's store the latitude and longitude information in two variables, so we can pass it on into the Foursquare API request:

In [32]:
neighborhood_latitude = neighborhoods_PO.loc[52, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods_PO.loc[52, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods_PO.loc[52, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Pocitos are -34.9081, -56.1465.


Let's build the url for the Foursquare API request:

In [33]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 800 # define radius

categoryId = '4d4b7105d754a06374d81259' # "Food" category for the venues


# create URL
url_PO = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    categoryId,
    radius, 
    LIMIT)

url_PO # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=PLWZS02RSN2SS21KPQ0PSBDDVQ2N5LWLKRVVQ4KWKT1UXNBB&client_secret=BLPB5FIND45ATA1Y4LUYN2TDVVTJ1L4XFVT4BNBLTRZKMHCV&v=20180605&ll=-34.9081,-56.1465&categoryId=4d4b7105d754a06374d81259&radius=800&limit=100'
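Since the same URL is built once per neighborhood, the string formatting could be wrapped in a small helper. The sketch below uses `urllib.parse.urlencode` from the standard library and the same endpoint and parameter names as the cell above; the function name and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, radius=800, limit=100):
    """Assemble the explore request URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma between latitude and longitude unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

# placeholder credentials, for illustration only
example_url = build_explore_url('ID', 'SECRET', '20180605', -34.9081, -56.1465,
                                '4d4b7105d754a06374d81259')
print(example_url)
```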

Let's make the request to the Foursquare API to retrieve up to 100 food venues within an 800-meter radius of the latitude and longitude that we defined (in this case, the center of the neighborhood "Pocitos"):

In [34]:
results = requests.get(url_PO).json()
results

{'meta': {'code': 200, 'requestId': '5ee290b265fdfb234dcb4678'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Pocitos',
  'headerFullLocation': 'Pocitos, Montevideo',
  'headerLocationGranularity': 'neighborhood',
  'query': 'food',
  'totalResults': 55,
  'suggestedBounds': {'ne': {'lat': -34.90089999279999,
    'lng': -56.13773665619636},
   'sw': {'lat': -34.9153000072, 'lng': -56.15526334380365}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e2b8131483bb05f059ff00c',
       'name': 'Doña Inés',
       'location': {'address': 'Miguel Barreiro 3293',
        'crossStreet': 'Alejandro Chucarro',
        'lat': -34.909789565816894,
        'lng': -56.146014978364214,
       

Function that extracts the category of the venue:

In [35]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened Foursquare responses prefix the column with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's store the food venue information in a data frame; we want the name, category, latitude and longitude of each venue:

In [36]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Doña Inés,Café,-34.90979,-56.146015
1,Miyagi Shushi,Japanese Restaurant,-34.908886,-56.148351
2,La Redonda,Pizza Place,-34.909274,-56.148299
3,Bien Bar,BBQ Joint,-34.909867,-56.147407
4,Vegan Wraps y Licuados,Vegetarian / Vegan Restaurant,-34.90478,-56.148087
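The extraction above is repeated for each neighborhood, so it may help to see it as a single function. The sketch below walks the response with plain Python instead of `pd.json_normalize`; the function name is ours, and the sample dictionary is a trimmed-down imitation of the response structure shown earlier, not a real API reply.

```python
def venues_from_response(results):
    """Extract name, first category, latitude and longitude for each returned venue."""
    rows = []
    for item in results['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append({
            'name': venue['name'],
            'categories': categories[0]['name'] if categories else None,
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
        })
    return rows

# minimal imitation of a response, for illustration only
sample_response = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'categories': [{'name': 'Café'}],
               'location': {'lat': -34.9098, 'lng': -56.1460}}},
    {'venue': {'name': 'Uncategorized Place',
               'categories': [],
               'location': {'lat': -34.9089, 'lng': -56.1484}}},
]}]}}
print(venues_from_response(sample_response))
```

The returned list of dictionaries can be passed to `pd.DataFrame` to obtain the same four columns as above.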


Let's check how many food venues we have in this neighborhood; we will use this information to compare the neighborhoods and make a decision:

In [37]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

55 venues were returned by Foursquare.


Let's check all the venues to confirm that they are food places:

In [38]:
nearby_venues.head(54)

Unnamed: 0,name,categories,lat,lng
0,Doña Inés,Café,-34.90979,-56.146015
1,Miyagi Shushi,Japanese Restaurant,-34.908886,-56.148351
2,La Redonda,Pizza Place,-34.909274,-56.148299
3,Bien Bar,BBQ Joint,-34.909867,-56.147407
4,Vegan Wraps y Licuados,Vegetarian / Vegan Restaurant,-34.90478,-56.148087
5,Tandory,Restaurant,-34.907692,-56.152395
6,Tony's Pizza,Pizza Place,-34.906961,-56.151723
7,Montecito,BBQ Joint,-34.908825,-56.144343
8,Lokotas,Empanada Restaurant,-34.908826,-56.148432
9,Erevan,Middle Eastern Restaurant,-34.907465,-56.149652


#### Now let's repeat the steps for the next neighborhood in our list, "Cordon":

We will look up its geospatial information:

In [39]:
neighborhoods_CO = neighborhoods.loc[neighborhoods['Neighborhood'] == 'Cordon']
neighborhoods_CO

Unnamed: 0,Neighborhood,Latitude,Longitude
49,Cordon,-34.901533,-56.180649


Let's store the latitude and longitude information in two variables, so we can pass it on into the Foursquare API request:

In [40]:
neighborhood_latitude = neighborhoods_CO.loc[49, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods_CO.loc[49, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods_CO.loc[49, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Cordon are -34.9015327, -56.1806489.


Let's build the url for the Foursquare API request:

In [41]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 800 # define radius

categoryId = '4d4b7105d754a06374d81259'


# create URL
url_CO = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    categoryId,
    radius, 
    LIMIT)

url_CO # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=PLWZS02RSN2SS21KPQ0PSBDDVQ2N5LWLKRVVQ4KWKT1UXNBB&client_secret=BLPB5FIND45ATA1Y4LUYN2TDVVTJ1L4XFVT4BNBLTRZKMHCV&v=20180605&ll=-34.9015327,-56.1806489&categoryId=4d4b7105d754a06374d81259&radius=800&limit=100'

Let's make the request to the Foursquare API to retrieve up to 100 food venues within an 800-meter radius of the latitude and longitude that we defined (in this case, the center of the neighborhood "Cordon"):

In [42]:
results_CO = requests.get(url_CO).json()
results_CO

{'meta': {'code': 200, 'requestId': '5ee28fac867c843e1af45d78'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Cordón',
  'headerFullLocation': 'Cordón, Montevideo',
  'headerLocationGranularity': 'neighborhood',
  'query': 'food',
  'totalResults': 34,
  'suggestedBounds': {'ne': {'lat': -34.89433269279999,
    'lng': -56.17188625701718},
   'sw': {'lat': -34.9087327072, 'lng': -56.189411542982825}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '590393deb546187ea545ad51',
       'name': 'Donut City',
       'location': {'address': 'Carlos Roxlo 1381',
        'lat': -34.904526,
        'lng': -56.181488,
        'labeledLatLngs': [{'label': 'display',
          'lat': -34.9045

Function that extracts the category of the venue:

In [43]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened Foursquare responses prefix the column with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's store the food venue information in a data frame; we want the name, category, latitude and longitude of each venue:

In [44]:
venues_CO = results_CO['response']['groups'][0]['items']
    
nearby_venues_CO = pd.json_normalize(venues_CO) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues_CO = nearby_venues_CO.loc[:, filtered_columns]

# filter the category for each row
nearby_venues_CO['venue.categories'] = nearby_venues_CO.apply(get_category_type, axis=1)

# clean columns
nearby_venues_CO.columns = [col.split(".")[-1] for col in nearby_venues_CO.columns]

nearby_venues_CO.head()

Unnamed: 0,name,categories,lat,lng
0,Donut City,Donut Shop,-34.904526,-56.181488
1,Café Gourmand,French Restaurant,-34.906994,-56.175552
2,Don Koto,BBQ Joint,-34.901973,-56.178894
3,Bar Touring,Breakfast Spot,-34.902908,-56.181003
4,La Glorieta,Bakery,-34.9064,-56.184568


Let's check how many food venues we have in this neighborhood; we will use this information to compare the neighborhoods and make a decision:

In [45]:
print('{} venues were returned by Foursquare.'.format(nearby_venues_CO.shape[0]))

34 venues were returned by Foursquare.


#### Now let's repeat the steps for the next neighborhood in our list, "Buceo":

Let's look up its geospatial information:

In [46]:
neighborhoods_BU = neighborhoods.loc[neighborhoods['Neighborhood'] == 'Buceo']
neighborhoods_BU

Unnamed: 0,Neighborhood,Latitude,Longitude
43,Buceo,-34.8997,-56.132


Let's store the latitude and longitude information in two variables, so we can pass it on into the Foursquare API request:

In [47]:
neighborhood_latitude = neighborhoods_BU.loc[43, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods_BU.loc[43, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods_BU.loc[43, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Buceo are -34.8997, -56.132.


Let's build the url for the Foursquare API request:

In [48]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 800 # define radius

categoryId = '4d4b7105d754a06374d81259'


# create URL
url_BU = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    categoryId,
    radius, 
    LIMIT)

url_BU # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=PLWZS02RSN2SS21KPQ0PSBDDVQ2N5LWLKRVVQ4KWKT1UXNBB&client_secret=BLPB5FIND45ATA1Y4LUYN2TDVVTJ1L4XFVT4BNBLTRZKMHCV&v=20180605&ll=-34.8997,-56.132&categoryId=4d4b7105d754a06374d81259&radius=800&limit=100'

Let's make the request to the Foursquare API to retrieve up to 100 food venues within an 800-meter radius of the latitude and longitude that we defined (in this case, the center of the neighborhood "Buceo"):

In [49]:
results = requests.get(url_BU).json()
results

{'meta': {'code': 200, 'requestId': '5ee28e508e17c56501fb69d0'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Buceo',
  'headerFullLocation': 'Buceo, Montevideo',
  'headerLocationGranularity': 'neighborhood',
  'query': 'food',
  'totalResults': 33,
  'suggestedBounds': {'ne': {'lat': -34.8924999928,
    'lng': -56.123237552550805},
   'sw': {'lat': -34.90690000720001, 'lng': -56.14076244744919}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '561079e5498eb02133f73546',
       'name': 'El Fondito',
       'location': {'address': 'Pedro Bustamante 1222',
        'crossStreet': 'Saldaña Da Gama',
        'lat': -34.90232209207086,
        'lng': -56.13335541252805,
        'labe

Function that extracts the category of the venue:

In [50]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened Foursquare responses prefix the column with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's store the food venue information in a data frame; we want the name, category, latitude and longitude of each venue:

In [51]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,El Fondito,American Restaurant,-34.902322,-56.133355
1,Panini's,Italian Restaurant,-34.905504,-56.134766
2,La Vaca,Steakhouse,-34.905458,-56.135162
3,Autoría,Restaurant,-34.905223,-56.137254
4,Cafe Ramona,Café,-34.905617,-56.13664


Let's check how many food venues we have in this neighborhood; we will use this information to compare the neighborhoods and make a decision:

In [52]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

33 venues were returned by Foursquare.


#### Now we will return to the neighborhood ranking data and add the number of food venues that we pulled from the Foursquare API for each neighborhood, in order to compare them.

Let's recall the ranking information; we will store the top 3 neighborhoods in a new data frame:

In [53]:
df_sorted = df_merged.sort_values(['Result'],ascending=False).round().head(3)
df_sorted

Unnamed: 0,Neighborhood,Population,Income,Result
48,Pocitos,39500,40288,1224.0
21,Cordon,43000,30000,992.0
7,Buceo,34000,29062,760.0


Let's create a new data frame with the number of food venues in each neighborhood, as reported by the Foursquare API:

In [54]:
df_np = pd.DataFrame({"Neighborhood":['Pocitos','Cordon','Buceo'],"Restaurants":[54,34,33]})
df_np

Unnamed: 0,Neighborhood,Restaurants
0,Pocitos,54
1,Cordon,34
2,Buceo,33
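Instead of typing the counts by hand, they could be derived from the venue data frames themselves. The sketch below assumes one data frame per neighborhood kept in a dictionary, which differs from the cells above where `nearby_venues` is reused; the names and the toy frames are hypothetical.

```python
import pandas as pd

# hypothetical per-neighborhood venue frames (toy data)
venue_frames = {
    'Neighborhood A': pd.DataFrame({'name': ['v1', 'v2', 'v3']}),
    'Neighborhood B': pd.DataFrame({'name': ['v4']}),
}

# one row per neighborhood, with the number of venues taken from each frame
df_counts = pd.DataFrame({
    'Neighborhood': list(venue_frames),
    'Restaurants': [len(frame) for frame in venue_frames.values()],
})
print(df_counts)
```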


Let's join the data so we can compare it side by side:

In [55]:
df_joined = df_sorted.join(df_np.set_index('Neighborhood'), on='Neighborhood')
df_joined

Unnamed: 0,Neighborhood,Population,Income,Result,Restaurants
48,Pocitos,39500,40288,1224.0,54
21,Cordon,43000,30000,992.0,34
7,Buceo,34000,29062,760.0,33


#### From this table we have chosen the neighborhood "Cordon", as it offers great value in terms of population and median income, with much less competition than the top option "Pocitos" and a larger population than the next option "Buceo".
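One rough way to put numbers behind this choice is to divide each ranking score by the number of existing food venues. This ratio is only an illustrative heuristic that we introduce here, with no more statistical meaning than the ranking score itself; the figures are taken from the table above.

```python
candidates = [
    {'Neighborhood': 'Pocitos', 'Result': 1224, 'Restaurants': 54},
    {'Neighborhood': 'Cordon', 'Result': 992, 'Restaurants': 34},
    {'Neighborhood': 'Buceo', 'Result': 760, 'Restaurants': 33},
]

def score_per_venue(row):
    """Ranking score divided by the number of existing food venues (illustrative only)."""
    return row['Result'] / row['Restaurants']

for row in candidates:
    print(row['Neighborhood'], round(score_per_venue(row), 1))

best = max(candidates, key=score_per_venue)
print(best['Neighborhood'])
```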

#### Now let's use machine learning to analyze where the food venues are located in this neighborhood so we can choose a place that is not already too crowded.

First let's prepare the data for the clustering: we will one-hot encode the name of each food venue in this neighborhood (creating "dummy" variables) and keep each venue's category alongside:

In [56]:
# one hot encoding
onehot = pd.get_dummies(nearby_venues_CO[['name']], prefix="", prefix_sep="")

# add the categories column to the dataframe
onehot['categories'] = nearby_venues_CO['categories']

# move the categories column to the first position
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,categories,Bar Touring,Burger King,Cafe Central,Café Gourmand,Café Insurgente,Cantina Facultad De Derecho,Ciudad Bar,Don Koto,Donut City,...,McDonald's,Panadería Agualada,Panadería Alhambra,Panadería El Gaucho,Parrillita El Rastro,Pizza Subte,Sabores,Shakespeare Café,Subway,Subway 18 de Julio
0,Donut Shop,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,French Restaurant,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,BBQ Joint,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,Breakfast Spot,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Bakery,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's group the food venues by category and take the mean of each dummy column:

In [57]:
CO_grouped = onehot.groupby('categories').mean().reset_index()
CO_grouped

Unnamed: 0,categories,Bar Touring,Burger King,Cafe Central,Café Gourmand,Café Insurgente,Cantina Facultad De Derecho,Ciudad Bar,Don Koto,Donut City,...,McDonald's,Panadería Agualada,Panadería Alhambra,Panadería El Gaucho,Parrillita El Rastro,Pizza Subte,Sabores,Shakespeare Café,Subway,Subway 18 de Julio
0,BBQ Joint,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0
1,Bakery,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.0
2,Breakfast Spot,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Café,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0
4,Deli / Bodega,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Donut Shop,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Empanada Restaurant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Fast Food Restaurant,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Food Court,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9,French Restaurant,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we will create and run the clustering algorithm. In this case we will use "K-means" to cluster the food venues by category:

In [58]:
# set number of clusters
kclusters = 5

# drop the "categories" column so only numeric features remain
grouped_clustering = CO_grouped.drop(columns='categories')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 0, 0, 0, 0, 2, 0, 0, 3], dtype=int32)
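
The value `kclusters = 5` is fixed by hand. A common way to sanity-check such a choice is to compare silhouette scores for several candidate values. The sketch below does this on synthetic two-dimensional points, not on the notebook's data. The helper `best_k` is hypothetical, and the notebook itself does not perform this step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k(X, candidates):
    """Return the candidate k with the highest silhouette score, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# Synthetic data: two well-separated groups of points (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
        rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
    ]
)

k, scores = best_k(X, candidates=range(2, 6))
print(k, scores)
```

The silhouette score is only one heuristic; with very few rows, as in the grouped table here, any choice of cluster count should be read with caution.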

We merge the data to have the cluster labels inside the data frame together with the venues information:

In [59]:
# add clustering labels
CO_grouped.insert(0, 'Cluster Labels', kmeans.labels_)


# join the cluster labels (and grouped columns) onto each venue via its category
nearby_venues_CO = nearby_venues_CO.join(CO_grouped.set_index('categories'), on='categories')

nearby_venues_CO.head()

Unnamed: 0,name,categories,lat,lng,Cluster Labels,Bar Touring,Burger King,Cafe Central,Café Gourmand,Café Insurgente,...,McDonald's,Panadería Agualada,Panadería Alhambra,Panadería El Gaucho,Parrillita El Rastro,Pizza Subte,Sabores,Shakespeare Café,Subway,Subway 18 de Julio
0,Donut City,Donut Shop,-34.904526,-56.181488,0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Café Gourmand,French Restaurant,-34.906994,-56.175552,3,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Don Koto,BBQ Joint,-34.901973,-56.178894,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0
3,Bar Touring,Breakfast Spot,-34.902908,-56.181003,0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,La Glorieta,Bakery,-34.9064,-56.184568,0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.0


Let's plot the map to display the food venues by cluster:

In [60]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per venue, colored by its cluster label
for lat, lon, poi, cluster in zip(nearby_venues_CO['lat'], nearby_venues_CO['lng'], nearby_venues_CO['categories'], nearby_venues_CO['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We now have a map with the location of the food venues in this neighborhood. Most venues appear to be located near an avenue, so further analysis at the "street level" will be done by the stakeholders, taking this information as a starting point.
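
As a possible starting point for that street-level step, the sketch below counts how many venues lie within a given radius of a candidate point, using the haversine distance. It is not part of the original analysis: the helpers `haversine_m` and `venues_within`, the 300 meter radius, and the candidate point are assumptions made for illustration. The venue coordinates are the first five rows of the merged table above.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def venues_within(candidate, venues, radius_m):
    """Count venues whose distance to the candidate point is at most radius_m."""
    return sum(
        haversine_m(candidate[0], candidate[1], lat, lng) <= radius_m
        for lat, lng in venues
    )


# Coordinates of the first five venues shown in the merged table above.
venues = [
    (-34.904526, -56.181488),
    (-34.906994, -56.175552),
    (-34.901973, -56.178894),
    (-34.902908, -56.181003),
    (-34.9064, -56.184568),
]

# Hypothetical candidate location, chosen only to illustrate the call.
candidate = (-34.904526, -56.181488)
print(venues_within(candidate, venues, radius_m=300))
```

Running the same count over a grid of candidate points would highlight the parts of the neighborhood with few nearby venues.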

## Results and Discussion <a name="results"></a>

Our analysis shows that although "Pocitos" is the neighborhood with the highest median income in Montevideo, it is also the neighborhood with the most food venues in our selection. We therefore focused our attention on the second neighborhood, "Cordon", which combines closeness to the city center, a strong ranking for population and median income, and a relatively low number of food venues.
After narrowing our area of interest, we clustered the food venues in this neighborhood to create zones of interest which contain the greatest number of location candidates.

The result of all this is the last map, where we can see the locations of all the food venues. It will be used as a starting point by the stakeholders to select the optimal location based on their criteria.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas in Montevideo with a low number of food venues within a highly populated neighborhood with a strong median income, in order to aid stakeholders in narrowing down the search for an optimal location for a new restaurant.


By ranking the neighborhoods by median income and population we first identified a subset of strong candidates, and then leveraged data from Foursquare to identify how many food venues exist in these top neighborhoods.
Once we had the neighborhood that best satisfied our requirements, we used machine learning to cluster the food venue locations in order to create major zones of interest (containing the greatest number of potential locations). Those were plotted on a map to be used as starting points for the final "street level" exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on their criteria, taking into consideration additional factors such as the attractiveness of each location, proximity to major roads, real estate availability, and prices.