<H1>IBM Applied Data Science Capstone Project</H1>

## Introduction: Business Problem <a name="introduction"></a>
<br>
Mexican cuisine was declared an intangible heritage of humanity by UNESCO in 2010. Ask any Mexican what the best thing about Mexico is, and chances are the vast majority will answer that it’s the food. Mexican gastronomy is famous all over the world, yet it’s almost impossible to get real Mexican food anywhere other than Mexico, a few restaurants in some southern US states, and a very limited number of Mexican-owned restaurants in other countries.<br>
Tourists and other visitors who have just arrived in Mexico City, whether for work or leisure, almost always want to know the best place to get authentic Mexican food. It shouldn’t be a hard question, yet depending on who you ask, you’ll probably get a different answer every time. In a city of more than 20,000,000 inhabitants in 1,465 km², with thousands of authentic Mexican restaurants, the prospect can be daunting. That’s why, for this project, we will use the data provided by Foursquare to narrow the field to the best options for authentic Mexican cuisine, and then group those options into clusters.


## Data <a name="data"></a>
<br>
All of the data will come from the Foursquare API, where we will pay special attention to:
<br>
<ul>
  <li>Venue name and id
  <li>Venue Category (Only Mexican restaurants, museums, monuments, historical places of interest and hotels will be considered).
  <li>Venue rating (Unfortunately, due to usage limitations, we will have to settle for likes instead of rating)
  <li>Venue location
  <li>Venue coordinates
</ul>

<b>We start by installing and importing the needed libraries:</b>

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    pandas-1.0.3               |   py36h83

In [3]:
import pandas as pd # pd.json_normalize is used below, so the deprecated pandas.io.json import is not needed
import numpy as np
import requests
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Import finished.')

Import finished.


<b>We initialize the following variables, which are used throughout this notebook:</b>

In [38]:
#Foursquare variables
CLIENT_ID = 'GJW1NJFRTAVX5KWLHWJMSBTAH5QBAMCCQTCMDLLYOZQK0LRC' 
CLIENT_SECRET = 'WX2RJB0L1MKSO51YNPC5FXNY5FWMV1BDURVHXS32LNMWNL10' 
VERSION = '20180604' 
CATEGORY_ID= '4bf58dd8d48988d1c1941735' #Mexican Restaurant category id
#Mexico City, city center coordinates
LATITUDE = 19.432895
LONGITUDE = -99.133173
#Other parameters
RADIUS = 10000 #10km from city center
LIMIT = 50 #Top 50 Mexican restaurants
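
Hard-coding API credentials in a shared notebook exposes them to every reader. A minimal sketch of one alternative, assuming the keys are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names that are not part of the original notebook):

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret
```

With the variables set, `CLIENT_ID, CLIENT_SECRET = load_credentials()` could stand in for the two literal assignments.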

<b>Next, we build the search query URL to get only popular Mexican restaurants:</b>

In [39]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    LATITUDE, 
    LONGITUDE, 
    CATEGORY_ID,
    RADIUS, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=GJW1NJFRTAVX5KWLHWJMSBTAH5QBAMCCQTCMDLLYOZQK0LRC&client_secret=WX2RJB0L1MKSO51YNPC5FXNY5FWMV1BDURVHXS32LNMWNL10&v=20180604&ll=19.432895,-99.133173&categoryId=4bf58dd8d48988d1c1941735&radius=10000&limit=50'
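
The same query string can also be assembled from a dictionary, which keeps each parameter next to its name and lets the standard library handle URL encoding. This is a sketch rather than the notebook's method; `build_explore_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, radius, limit):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
```

Note that `urlencode` percent-encodes the comma inside `ll`, so the resulting string differs textually from the one above while carrying the same parameter values.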

<b>We execute the request and parse the JSON response:</b>

In [6]:
results = requests.get(url).json()
#print("Popular Mexican restaurants in a " + str(RADIUS) + " radius: " + str(results['response']['totalResults']))

<b>Next, we process the resulting JSON into a data frame for easier manipulation:</b>

In [7]:
# assign relevant part of JSON to venues
venues = results['response']['groups'][0]['items']
dataframe_filtered = pd.json_normalize(venues)

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
filtered_columns = ['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
# copy the selection so that columns added later do not trigger SettingWithCopyWarning
dataframe_filtered = dataframe_filtered.loc[:, filtered_columns].copy()
    
dataframe_filtered.columns = [col.split(".")[-1] for col in dataframe_filtered.columns]
   
dataframe_filtered

Unnamed: 0,id,name,categories,lat,lng
0,4be5968ebcef2d7fabf903e5,El Cardenal,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.433746,-99.135216
1,4f219909e4b053b178fc1035,Azul Histórico,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.43288,-99.136197
2,50ca02bf245f2d4aa8c2aeef,El Cardenal,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.434967,-99.146196
3,50303691e4b07d6b35566210,Limosneros,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.436125,-99.137972
4,4b058701f964a520ec7a22e3,Café de Tacuba,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.435682,-99.137591
5,4c2455a0b7b8a59346603ce8,La Casa de Toño,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.42488,-99.165188
6,5539583e498e5acdfa716200,Testal - Cocina Mexicana de Origen,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.433371,-99.142906
7,4d49c47b11a36ea827282b1c,Zéfiro,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.427241,-99.13807
8,4c151fbc7f7f2d7f62ede168,El Balcón del Zócalo,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.433852,-99.134142
9,4e7fa0949adf0c88db536e0c,Coox Hanal,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",19.428297,-99.137157


In [8]:
dataframe_filtered.shape

(50, 5)
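
To make the flattening step concrete, here is a plain-Python sketch of what happens to a single item of the response. The sample item is hand-written to match the fields used above; its id, name, and coordinates come from the first table row, while the category name is an assumption based on the `CATEGORY_ID` comment. `flatten_venue` is a hypothetical helper, not part of the notebook:

```python
def flatten_venue(item):
    """Turn one item of the explore response into a flat record."""
    venue = item['venue']
    categories = venue.get('categories', [])
    return {
        'id': venue['id'],
        'name': venue['name'],
        # first category name, or None when the venue has no categories
        'category': categories[0]['name'] if categories else None,
        'lat': venue['location']['lat'],
        'lng': venue['location']['lng'],
    }


# Hand-written sample item (category name is an assumption, see above)
sample_item = {
    'venue': {
        'id': '4be5968ebcef2d7fabf903e5',
        'name': 'El Cardenal',
        'categories': [{'id': '4bf58dd8d48988d1c1941735', 'name': 'Mexican Restaurant'}],
        'location': {'lat': 19.433746, 'lng': -99.135216},
    }
}

record = flatten_venue(sample_item)
print(record)
```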

<b>Let's visualize these restaurants on a map:</b>

In [40]:
venues_map = folium.Map(location=[LATITUDE, LONGITUDE], zoom_start=13) # generate map centred around the Zócalo (city center)

# add a red circle marker to represent the Zócalo
folium.features.CircleMarker(
    [LATITUDE, LONGITUDE],
    radius=10,
    color='red',
    popup='Zocalo, Mexico City Center',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.name):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup = label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

<b>We will now get the list of ids of the top 50 Mexican restaurants, so that we can then get the total number of likes for each one. We will only consider the very best ones.</b>
<br>
Originally, I wanted to use the rating value, since I think it is a better measure of a restaurant's quality. Unfortunately, the Foursquare API allows only a very small number of free calls to this "premium" data, so I will use likes instead.
<br>
<b>Each id uniquely identifies a venue in Foursquare. Looking up a venue by its id provides much more information than we have so far.</b>

In [12]:
idsList = dataframe_filtered['id'].tolist()
idsList

['4be5968ebcef2d7fabf903e5',
 '4f219909e4b053b178fc1035',
 '50ca02bf245f2d4aa8c2aeef',
 '50303691e4b07d6b35566210',
 '4b058701f964a520ec7a22e3',
 '4c2455a0b7b8a59346603ce8',
 '5539583e498e5acdfa716200',
 '4d49c47b11a36ea827282b1c',
 '4c151fbc7f7f2d7f62ede168',
 '4e7fa0949adf0c88db536e0c',
 '4cf43ba1e942548138d579c5',
 '4d20f74ad7b0b1f7e246199f',
 '4cb08738562d224b99061788',
 '4d3defb205b8721e2e619937',
 '4debc52ee4cdc079f4a66ffc',
 '53d429f3498e216603bee769',
 '50e4aebce4b003353b3fdfc1',
 '4c8aa0661797236a345e6388',
 '4cb8af7a035d236abc17cf4e',
 '4c51e46494790f4727f3cca1',
 '4ea5cd97be7bbf593ae78f5c',
 '4d673b506ddb59412a98f26f',
 '4f597585e4b0300584d4c999',
 '4bc2100c2a89ef3bcc23f388',
 '4b3029d1f964a5202cf724e3',
 '4bdba1d663c5c9b621192968',
 '4ff3f163e4b0d27dd64941ca',
 '4eb16da2b8f74bbb23d84fd9',
 '4bd76e2388559521a98b87a7',
 '4b7c5f66f964a520718f2fe3',
 '5422004c498ef9d49e61e0d1',
 '4e88c43ae5fa8735be44cabb',
 '4b647c67f964a5201fb72ae3',
 '4c954d6d38dd8cfaf89dd162',
 '4f75cf65e4b0

<b>We will now loop over these ids and get the number of likes for each venue:</b>

In [13]:
ratingsList = []  # holds like counts (ratings are premium data)

for venue_id in idsList:
    url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(url).json()
    try:
        ratingsList.append(result['response']['likes']['count'])
    except KeyError:
        # append a placeholder so the list stays aligned with idsList
        ratingsList.append(None)
        print("There was an error trying to get the likes for venue " + venue_id)
print(ratingsList)

[2590, 2593, 1872, 908, 3213, 5064, 257, 314, 1048, 770, 23, 224, 2818, 1554, 81, 102, 710, 1038, 128, 2043, 35, 88, 718, 1369, 678, 703, 202, 69, 951, 1419, 783, 646, 6637, 348, 627, 287, 372, 537, 977, 54, 167, 565, 953, 3381, 1069, 2421, 472, 124, 200, 866]
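
The likes lookup can be tested without network access by passing the fetching step in as a function. The sketch below assumes a hypothetical callable `fetch_json` that takes a venue id and returns the parsed JSON of the likes endpoint; `fake_fetch` is an offline stand-in with one canned response taken from the first venue above:

```python
def collect_likes(venue_ids, fetch_json):
    """Map each venue id to its like count, or None when the lookup fails."""
    likes = {}
    for venue_id in venue_ids:
        try:
            likes[venue_id] = fetch_json(venue_id)['response']['likes']['count']
        except (KeyError, TypeError):
            # response did not have the expected shape
            likes[venue_id] = None
    return likes


def fake_fetch(venue_id):
    """Offline stand-in for the API call (canned data for illustration)."""
    canned = {
        '4be5968ebcef2d7fabf903e5': {'response': {'likes': {'count': 2590}}},
    }
    return canned.get(venue_id, {})  # empty dict simulates a failed lookup


likes_by_id = collect_likes(['4be5968ebcef2d7fabf903e5', 'unknown-id'], fake_fetch)
print(likes_by_id)
```

In the notebook, `fetch_json` could wrap the `requests.get(url).json()` call from the cell above.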


<b>Adding likes to our working data frame:</b>

In [14]:
dataframe_filtered['Likes Count'] = ratingsList
del dataframe_filtered['categories']
dataframe_filtered

Unnamed: 0,id,name,lat,lng,Likes Count
0,4be5968ebcef2d7fabf903e5,El Cardenal,19.433746,-99.135216,2590
1,4f219909e4b053b178fc1035,Azul Histórico,19.43288,-99.136197,2593
2,50ca02bf245f2d4aa8c2aeef,El Cardenal,19.434967,-99.146196,1872
3,50303691e4b07d6b35566210,Limosneros,19.436125,-99.137972,908
4,4b058701f964a520ec7a22e3,Café de Tacuba,19.435682,-99.137591,3213
5,4c2455a0b7b8a59346603ce8,La Casa de Toño,19.42488,-99.165188,5064
6,5539583e498e5acdfa716200,Testal - Cocina Mexicana de Origen,19.433371,-99.142906,257
7,4d49c47b11a36ea827282b1c,Zéfiro,19.427241,-99.13807,314
8,4c151fbc7f7f2d7f62ede168,El Balcón del Zócalo,19.433852,-99.134142,1048
9,4e7fa0949adf0c88db536e0c,Coox Hanal,19.428297,-99.137157,770


## Methodology <a name="methodology"></a>

We will now process these restaurants and place them in clusters. We will pay special attention to those of top quality (i.e., those with the most likes, since we have no access to ratings) that are close to the city center. Why close to the city center? Because if you are a tourist with limited time, that's where most historical sites are, and you can just walk to them.
<br>
We have already collected the most liked Mexican restaurants and obtained their coordinates.
Next, we will create clusters based on how popular they are (top, middle, or bottom).
Finally, we will process and visualize these clusters to identify the best options for eating authentic Mexican food, and see whether any other interesting conclusions emerge.

## Analysis <a name="analysis"></a>

<b>We will now arrange these restaurants in bins, from the very best, through the average, to the bottom of the top 50.</b>
<br>
This requires some summary statistics and distribution information from the data frame, as follows:
    

In [15]:
likes = dataframe_filtered['Likes Count']
# descriptive names avoid shadowing the built-ins max() and min()
max_likes = likes.max()
avg_likes = likes.mean()
min_likes = likes.min()
perc1 = np.percentile(likes, 66)  # 66th percentile: lower bound of the top bin
perc2 = np.percentile(likes, 33)  # 33rd percentile: upper bound of the bottom bin

# These are our 3 bins, each a subset of the data frame:
top = dataframe_filtered[likes >= perc1]
middle = dataframe_filtered[(likes > perc2) & (likes < perc1)]
bottom = dataframe_filtered[likes <= perc2]


print("Maximum number of likes: " + str(max_likes))
print("Average number of likes: " + str(avg_likes))
print("Minimum number of likes: " + str(min_likes))
print("Percentile 1: " + str(perc1))
print("Percentile 2: " + str(perc2))

Maximum number of likes: 6637
Average number of likes: 1101.36
Minimum number of likes: 23
Percentile 1: 961.1600000000001
Percentile 2: 352.08000000000004
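
The two cut-offs can be checked by hand. The sketch below re-implements the linear-interpolation percentile (the default method of `np.percentile`) in plain Python and applies it to the 50 like counts printed earlier; the `percentile` helper is written for illustration only:

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


likes = [2590, 2593, 1872, 908, 3213, 5064, 257, 314, 1048, 770,
         23, 224, 2818, 1554, 81, 102, 710, 1038, 128, 2043,
         35, 88, 718, 1369, 678, 703, 202, 69, 951, 1419,
         783, 646, 6637, 348, 627, 287, 372, 537, 977, 54,
         167, 565, 953, 3381, 1069, 2421, 472, 124, 200, 866]

perc1 = percentile(likes, 66)  # about 961.16, between the sorted values 953 and 977
perc2 = percentile(likes, 33)  # about 352.08, between the sorted values 348 and 372
print(round(perc1, 2), round(perc2, 2))
```

With 50 values, the 66th percentile sits at sorted index 0.66 × 49 = 32.34 and the 33rd at 0.33 × 49 = 16.17, which is why both cut-offs fall between two observed like counts.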


<b>We will now add a new column to our data frame to indicate whether a restaurant is in the top, middle, or bottom bin:</b>

In [25]:
def conditions(dff):    
    if dff['Likes Count'] >= perc1:
        return 'Top'    
    elif (dff['Likes Count'] < perc1 and dff['Likes Count'] > perc2):
        return 'Middle'
    elif dff['Likes Count'] <= perc2:
        return 'Bottom'

dataframe_filtered['Bin']=dataframe_filtered.apply(conditions, axis=1)
dataframe_filtered

Unnamed: 0,id,name,lat,lng,Likes Count,Bin
0,4be5968ebcef2d7fabf903e5,El Cardenal,19.433746,-99.135216,2590,Top
1,4f219909e4b053b178fc1035,Azul Histórico,19.43288,-99.136197,2593,Top
2,50ca02bf245f2d4aa8c2aeef,El Cardenal,19.434967,-99.146196,1872,Top
3,50303691e4b07d6b35566210,Limosneros,19.436125,-99.137972,908,Middle
4,4b058701f964a520ec7a22e3,Café de Tacuba,19.435682,-99.137591,3213,Top
5,4c2455a0b7b8a59346603ce8,La Casa de Toño,19.42488,-99.165188,5064,Top
6,5539583e498e5acdfa716200,Testal - Cocina Mexicana de Origen,19.433371,-99.142906,257,Bottom
7,4d49c47b11a36ea827282b1c,Zéfiro,19.427241,-99.13807,314,Bottom
8,4c151fbc7f7f2d7f62ede168,El Balcón del Zócalo,19.433852,-99.134142,1048,Top
9,4e7fa0949adf0c88db536e0c,Coox Hanal,19.428297,-99.137157,770,Middle
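
The binning rule can be checked on a few rows in isolation. This sketch hard-codes the two cut-offs from the output above and uses three restaurants from the table; `assign_bin` is a hypothetical stand-alone version of the `conditions` function:

```python
PERC1 = 961.16  # 66th percentile of likes, taken from the output above
PERC2 = 352.08  # 33rd percentile of likes, taken from the output above


def assign_bin(likes, upper=PERC1, lower=PERC2):
    """Return 'Top', 'Middle', or 'Bottom' for a like count."""
    if likes >= upper:
        return 'Top'
    if likes > lower:
        return 'Middle'
    return 'Bottom'


sample = {'El Cardenal': 2590, 'Limosneros': 908, 'Zéfiro': 314}
bins = {name: assign_bin(count) for name, count in sample.items()}
print(bins)
```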


<b>We're now going to set up a new data frame with binary values for the top, middle, and bottom bins:</b>


In [26]:
binary_df = pd.get_dummies(dataframe_filtered[['Bin']], prefix="", prefix_sep="")
binary_df['Name'] = dataframe_filtered['name'] 

fixed_columns = [binary_df.columns[-1]] + list(binary_df.columns[:-1])
binary_df = binary_df[fixed_columns]

binary_df

Unnamed: 0,Name,Bottom,Middle,Top
0,El Cardenal,0,0,1
1,Azul Histórico,0,0,1
2,El Cardenal,0,0,1
3,Limosneros,0,1,0
4,Café de Tacuba,0,0,1
5,La Casa de Toño,0,0,1
6,Testal - Cocina Mexicana de Origen,1,0,0
7,Zéfiro,1,0,0
8,El Balcón del Zócalo,0,0,1
9,Coox Hanal,0,1,0


<b>We're now going to run the K-Means clustering algorithm:</b>

In [27]:
clustering_df = binary_df.drop('Name', axis=1)

k_clusters = 3
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(clustering_df)
kmeans.labels_[0:10]
#Adding labels to data frame
dataframe_filtered['Cluster Label'] = kmeans.labels_
dataframe_filtered

Unnamed: 0,id,name,lat,lng,Likes Count,Bin,Cluster Label
0,4be5968ebcef2d7fabf903e5,El Cardenal,19.433746,-99.135216,2590,Top,0
1,4f219909e4b053b178fc1035,Azul Histórico,19.43288,-99.136197,2593,Top,0
2,50ca02bf245f2d4aa8c2aeef,El Cardenal,19.434967,-99.146196,1872,Top,0
3,50303691e4b07d6b35566210,Limosneros,19.436125,-99.137972,908,Middle,1
4,4b058701f964a520ec7a22e3,Café de Tacuba,19.435682,-99.137591,3213,Top,0
5,4c2455a0b7b8a59346603ce8,La Casa de Toño,19.42488,-99.165188,5064,Top,0
6,5539583e498e5acdfa716200,Testal - Cocina Mexicana de Origen,19.433371,-99.142906,257,Bottom,2
7,4d49c47b11a36ea827282b1c,Zéfiro,19.427241,-99.13807,314,Bottom,2
8,4c151fbc7f7f2d7f62ede168,El Balcón del Zócalo,19.433852,-99.134142,1048,Top,0
9,4e7fa0949adf0c88db536e0c,Coox Hanal,19.428297,-99.137157,770,Middle,1
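
Because the only features passed to K-Means are the three one-hot bin columns, every restaurant sits on one of just three distinct points, and those points are equally far apart. With `k_clusters = 3`, the clusters would therefore be expected to coincide with the bins, which is what the ten rows displayed above suggest. A short check, not part of the original notebook:

```python
import math
from itertools import combinations

# One-hot encoding of the three bins, in the column order Bottom, Middle, Top
one_hot = {'Bottom': (1, 0, 0), 'Middle': (0, 1, 0), 'Top': (0, 0, 1)}


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance between every pair of distinct bins
pair_distances = {
    (a, b): euclidean(one_hot[a], one_hot[b])
    for a, b in combinations(one_hot, 2)
}

# (Bin, Cluster Label) pairs copied from the ten rows displayed above
rows = [('Top', 0), ('Top', 0), ('Top', 0), ('Middle', 1), ('Top', 0),
        ('Top', 0), ('Bottom', 2), ('Bottom', 2), ('Top', 0), ('Middle', 1)]

# Collect the set of cluster labels seen for each bin
labels_per_bin = {}
for bin_name, label in rows:
    labels_per_bin.setdefault(bin_name, set()).add(label)

print(pair_distances)
print(labels_per_bin)
```

Each bin maps to exactly one cluster label in these rows, so the cluster label carries the same information as the `Bin` column.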


<b>This is what our clusters look like:</b>

In [41]:
map_clusters = folium.Map(location=[LATITUDE, LONGITUDE], zoom_start=13)

# Cluster colors: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add a red-outlined, yellow-filled circle marker to represent the Zócalo (city center)
folium.features.CircleMarker(
    [LATITUDE, LONGITUDE],
    radius=10,
    color='red',
    popup='Zocalo, Mexico City Center',
    fill = True,
    fill_color = 'yellow',
    fill_opacity = 0.6
).add_to(map_clusters)

markers_colors = []
for lat, lon, poi, cluster in zip(dataframe_filtered['lat'], dataframe_filtered['lng'], dataframe_filtered['name'], dataframe_filtered['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<b>What are the top restaurants, in order, from cluster 0?</b>

In [36]:
(dataframe_filtered.loc[dataframe_filtered['Cluster Label']==0]).sort_values("Likes Count", ascending=False)

Unnamed: 0,id,name,lat,lng,Likes Count,Bin,Cluster Label
32,4b647c67f964a5201fb72ae3,La Casa de Toño,19.403356,-99.155597,6637,Top,0
5,4c2455a0b7b8a59346603ce8,La Casa de Toño,19.42488,-99.165188,5064,Top,0
43,4b7f2d23f964a520d21c30e3,La Polar,19.438798,-99.167759,3381,Top,0
4,4b058701f964a520ec7a22e3,Café de Tacuba,19.435682,-99.137591,3213,Top,0
12,4cb08738562d224b99061788,Taquería El Abanico,19.414927,-99.130358,2818,Top,0
1,4f219909e4b053b178fc1035,Azul Histórico,19.43288,-99.136197,2593,Top,0
0,4be5968ebcef2d7fabf903e5,El Cardenal,19.433746,-99.135216,2590,Top,0
45,4ba65b90f964a520ed4939e3,La Casa de Toño,19.439875,-99.177742,2421,Top,0
19,4c51e46494790f4727f3cca1,El Parnita,19.414036,-99.16273,2043,Top,0
2,50ca02bf245f2d4aa8c2aeef,El Cardenal,19.434967,-99.146196,1872,Top,0


## Results and Discussion <a name="results"></a>
<br>From this analysis, we can see that even though there are top-quality authentic Mexican restaurants pretty much all over the city, there is a notable concentration of them <b>very near</b> the city center.
<br>
This is a strong reason to book a room in a hotel near the city center. Not only are tourists there close to several great authentic Mexican restaurants, they are also a few blocks away from many of Mexico City's historic landmarks.
<br>
Also interesting: among the top restaurants, three are branches of "La Casa de Toño", a restaurant very dear to Mexicans. It started very small less than 20 years ago but, thanks to competitive prices, excellent service, and great food, it has become extremely big and popular without a drop in quality. Even visitors who wander away from the city center should be able to find a Casa de Toño easily and get a great meal.
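
How near "very near" is can be quantified with the great-circle distance from the city center. The sketch below applies the haversine formula to the ten cluster 0 restaurants listed in the Analysis section, using their coordinates as displayed there; the 6371 km Earth radius is the usual mean value, and the helper is an illustration rather than part of the original analysis:

```python
import math

CENTER = (19.432895, -99.133173)  # Zócalo, same coordinates as LATITUDE/LONGITUDE


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# (name, lat, lng) for the cluster 0 rows shown in the Analysis section
top_restaurants = [
    ('La Casa de Toño', 19.403356, -99.155597),
    ('La Casa de Toño', 19.42488, -99.165188),
    ('La Polar', 19.438798, -99.167759),
    ('Café de Tacuba', 19.435682, -99.137591),
    ('Taquería El Abanico', 19.414927, -99.130358),
    ('Azul Histórico', 19.43288, -99.136197),
    ('El Cardenal', 19.433746, -99.135216),
    ('La Casa de Toño', 19.439875, -99.177742),
    ('El Parnita', 19.414036, -99.16273),
    ('El Cardenal', 19.434967, -99.146196),
]

distances = [(name, haversine_km(CENTER[0], CENTER[1], lat, lng))
             for name, lat, lng in top_restaurants]

# Print the restaurants from nearest to farthest
for name, km in sorted(distances, key=lambda pair: pair[1]):
    print('{:<22} {:.2f} km'.format(name, km))
```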

## Conclusion <a name="conclusion"></a>

There is no shortage of great options for quality authentic Mexican food in Mexico City, as was to be expected before starting this project.
However, we have found that for tourists the city center, or Zócalo, is as good an option for eating as it is for sightseeing.
The stakeholders will make the final decision, but these clusters suggest that wherever they choose to book a room, even far from the city center, they are likely to have quality options for dinner nearby, even though we used only 50 restaurants in this project.
