# Capstone Project - New Restaurant in Town

## Project by Erick Daniel Rodriguez

### 1. Introduction

Mexico is facing a serious obesity problem. Until the late 20th century, dietary problems in Mexico were solely a question of undernutrition or malnutrition, generally because of poverty and distribution issues (obesity was associated with wealth and health). In recent years there has been a significant increase in the consumption of high-energy, high-sugar, high-fat, and high-salt foods featuring various types of sweeteners and animal products, and a decrease in the consumption of whole grains and vegetables. The main reason for this shift is the dominance of transnational food companies in the Mexican market.

For this project we will use data from Mexico City. We want to determine which borough would be the best place to open a new vegetarian/vegan restaurant, taking into account that most restaurants of this kind are more expensive than a "regular" restaurant in the city.

This will be a basic analysis; many other factors would need to be considered, but they fall outside the scope of this project.

### 2. Data

#### 2.1 Data description

The data that will be used for this project will be:

* Number of restaurants within the area of each borough
    - Foursquare API


* Population and job occupation division of each borough
    - Statistics data from 2017 of Mexico City (https://www.datatur.sectur.gob.mx/ITxEF_Docs/CDMX_ANUARIO_PDF.pdf)


* Borough coordinates
    - OpenStreetMap (https://www.openstreetmap.org/relation/1376330)

#### 2.2 Data preparation

The first step will be to import the libraries that will be used.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import tabula as tb # library for PDF to DataFrame conversion

import pandas as pd # library for data analysis

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means, used later in the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Create dataframes from the "Statistics data from 2017 of Mexico City" report using the Tabula library.

In [2]:
file_path = "https://www.datatur.sectur.gob.mx/ITxEF_Docs/CDMX_ANUARIO_PDF.pdf"
#Convert the file
occupation = tb.read_pdf(file_path, pages=331)

We have loaded from the report the employed population by delegation (borough) and its percentage distribution by occupational division as of March 15, 2015. Now we will look at the data and make the relevant adjustments.

In [3]:
occupation = occupation[0] # read_pdf returns a list of DataFrames; keep the first table
occupation

|    | Unnamed: 0 | Total | Unnamed: 1 | División ocupacional a/ | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|---|---|
| 0 |  |  |  | (Porcentaje) |  |  |
| 1 | Delegación |  | Funcionarios, | Trabajadores Trabajadores | Comerciantes y | No |
| 2 |  |  | profesionistas, | agropecuarios en la industria c/ | trabajadores en | especificado |
| 3 |  |  | técnicos y |  | servicios diversos d/ |  |
| 4 |  |  | administrativos b/ |  |  |  |
| 5 | Ciudad de México | 4 033 273 | 43.91 | 0.39 14.63 | 39.44 | 1.62 |
| 6 | Álvaro Obregón | 351 409 | 42.30 | 0.14 16.55 | 39.34 | 1.66 |
| 7 | Azcapotzalco | 180 813 | 48.17 | 0.02 14.19 | 35.99 | 1.63 |
| 8 | Benito Juárez | 223 843 | 69.81 | 0.01 5.02 | 24.04 | 1.12 |
| 9 | Coyoacán | 280 561 | 54.74 | 0.04 10.44 | 32.04 | 2.74 |


The first few rows of the dataframe contain fragments of the multi-line header and the city-wide total rather than borough data. These rows will be deleted, and the columns will then be given new names.

In [4]:
occupation.drop(occupation.head(6).index,inplace=True)
occupation

|    | Unnamed: 0 | Total | Unnamed: 1 | División ocupacional a/ | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|---|---|
| 6 | Álvaro Obregón | 351 409 | 42.3 | 0.14 16.55 | 39.34 | 1.66 |
| 7 | Azcapotzalco | 180 813 | 48.17 | 0.02 14.19 | 35.99 | 1.63 |
| 8 | Benito Juárez | 223 843 | 69.81 | 0.01 5.02 | 24.04 | 1.12 |
| 9 | Coyoacán | 280 561 | 54.74 | 0.04 10.44 | 32.04 | 2.74 |
| 10 | Cuajimalpa de Morelos | 91 063 | 41.6 | 0.26 16.85 | 39.82 | 1.46 |
| 11 | Cuauhtémoc | 269 664 | 51.26 | 0.06 8.13 | 38.54 | 2.02 |
| 12 | Gustavo A. Madero | 498 501 | 40.56 | 0.10 16.79 | 41.49 | 1.06 |
| 13 | Iztacalco | 175 194 | 46.45 | 0.15 14.20 | 38.1 | 1.09 |
| 14 | Iztapalapa | 786 218 | 34.8 | 0.10 18.50 | 45.3 | 1.31 |
| 15 | La Magdalena Contreras | 105 951 | 35.87 | 0.23 17.55 | 43.5 | 2.86 |


In [5]:
occupation.columns = ['Borough', 'Total', '%A', '%B', '%C', '%D']
occupation

|    | Borough | Total | %A | %B | %C | %D |
|---|---|---|---|---|---|---|
| 6 | Álvaro Obregón | 351 409 | 42.3 | 0.14 16.55 | 39.34 | 1.66 |
| 7 | Azcapotzalco | 180 813 | 48.17 | 0.02 14.19 | 35.99 | 1.63 |
| 8 | Benito Juárez | 223 843 | 69.81 | 0.01 5.02 | 24.04 | 1.12 |
| 9 | Coyoacán | 280 561 | 54.74 | 0.04 10.44 | 32.04 | 2.74 |
| 10 | Cuajimalpa de Morelos | 91 063 | 41.6 | 0.26 16.85 | 39.82 | 1.46 |
| 11 | Cuauhtémoc | 269 664 | 51.26 | 0.06 8.13 | 38.54 | 2.02 |
| 12 | Gustavo A. Madero | 498 501 | 40.56 | 0.10 16.79 | 41.49 | 1.06 |
| 13 | Iztacalco | 175 194 | 46.45 | 0.15 14.20 | 38.1 | 1.09 |
| 14 | Iztapalapa | 786 218 | 34.8 | 0.10 18.50 | 45.3 | 1.31 |
| 15 | La Magdalena Contreras | 105 951 | 35.87 | 0.23 17.55 | 43.5 | 2.86 |


There is a problem with column "%B": each cell holds two values separated by a space, so it must be split into two separate columns.

In [6]:
nums1, nums2 = list(), list()
for vals in occupation['%B'].values:
    nums = [float(i) for i in vals.split()]
    nums1.append(nums[0])
    nums2.append(nums[1])

occupation['%B'] = nums1
occupation['%B2'] = nums2 # Temporary name for the second half of the split column
occupation

|    | Borough | Total | %A | %B | %C | %D | %B2 |
|---|---|---|---|---|---|---|---|
| 6 | Álvaro Obregón | 351 409 | 42.3 | 0.14 | 39.34 | 1.66 | 16.55 |
| 7 | Azcapotzalco | 180 813 | 48.17 | 0.02 | 35.99 | 1.63 | 14.19 |
| 8 | Benito Juárez | 223 843 | 69.81 | 0.01 | 24.04 | 1.12 | 5.02 |
| 9 | Coyoacán | 280 561 | 54.74 | 0.04 | 32.04 | 2.74 | 10.44 |
| 10 | Cuajimalpa de Morelos | 91 063 | 41.6 | 0.26 | 39.82 | 1.46 | 16.85 |
| 11 | Cuauhtémoc | 269 664 | 51.26 | 0.06 | 38.54 | 2.02 | 8.13 |
| 12 | Gustavo A. Madero | 498 501 | 40.56 | 0.1 | 41.49 | 1.06 | 16.79 |
| 13 | Iztacalco | 175 194 | 46.45 | 0.15 | 38.1 | 1.09 | 14.2 |
| 14 | Iztapalapa | 786 218 | 34.8 | 0.1 | 45.3 | 1.31 | 18.5 |
| 15 | La Magdalena Contreras | 105 951 | 35.87 | 0.23 | 43.5 | 2.86 | 17.55 |
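The loop above assumes that every cell holds exactly two numbers. A slightly more defensive variant is sketched below; the helper name `split_pair` and the three sample cells (copied from the table) are only illustrative, and the idea is simply to fail loudly if the PDF extraction ever yields a cell with a different shape.

```python
def split_pair(cell):
    """Split a cell such as '0.14 16.55' into two floats.

    Raises ValueError when the cell does not contain exactly two values,
    instead of silently dropping or misaligning data.
    """
    parts = cell.split()
    if len(parts) != 2:
        raise ValueError("expected two values, got {!r}".format(cell))
    return float(parts[0]), float(parts[1])


# Sample cells taken from the '%B' column shown above
cells = ["0.14 16.55", "0.02 14.19", "0.01 5.02"]

# zip(*...) turns the list of pairs into two parallel columns
first_values, second_values = (list(col) for col in zip(*map(split_pair, cells)))
```

The two resulting lists could then be assigned to the dataframe columns in the same way as `nums1` and `nums2`.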


The column is now correctly split. However, the new column was appended at the end of the dataframe, so we have to move it to the correct position.

In [7]:
occupation = occupation[['Borough', 'Total', '%A', '%B', '%B2', '%C', '%D']]
occupation = occupation.round(decimals=2) # round() returns a new DataFrame, so keep the result
occupation.columns = ['Borough', 'Total', '%A', '%B', '%C', '%D', '%E']
# set_caption() returns a Styler and does not modify the DataFrame itself
occupation.style.set_caption("Employed population by delegation and its percentage distribution according to occupational division")
occupation

|    | Borough | Total | %A | %B | %C | %D | %E |
|---|---|---|---|---|---|---|---|
| 6 | Álvaro Obregón | 351 409 | 42.3 | 0.14 | 16.55 | 39.34 | 1.66 |
| 7 | Azcapotzalco | 180 813 | 48.17 | 0.02 | 14.19 | 35.99 | 1.63 |
| 8 | Benito Juárez | 223 843 | 69.81 | 0.01 | 5.02 | 24.04 | 1.12 |
| 9 | Coyoacán | 280 561 | 54.74 | 0.04 | 10.44 | 32.04 | 2.74 |
| 10 | Cuajimalpa de Morelos | 91 063 | 41.6 | 0.26 | 16.85 | 39.82 | 1.46 |
| 11 | Cuauhtémoc | 269 664 | 51.26 | 0.06 | 8.13 | 38.54 | 2.02 |
| 12 | Gustavo A. Madero | 498 501 | 40.56 | 0.1 | 16.79 | 41.49 | 1.06 |
| 13 | Iztacalco | 175 194 | 46.45 | 0.15 | 14.2 | 38.1 | 1.09 |
| 14 | Iztapalapa | 786 218 | 34.8 | 0.1 | 18.5 | 45.3 | 1.31 |
| 15 | La Magdalena Contreras | 105 951 | 35.87 | 0.23 | 17.55 | 43.5 | 2.86 |
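Since the five percentage columns describe a full distribution, one quick way to confirm that the split and reordering worked is to check that each row adds up to roughly 100. The sketch below uses four rows copied from the table; the function names and the tolerance of 0.05 are assumptions chosen for illustration, not part of the original analysis.

```python
# Rows copied from the table above: (borough, %A, %B, %C, %D, %E)
rows = [
    ("Álvaro Obregón", 42.30, 0.14, 16.55, 39.34, 1.66),
    ("Azcapotzalco", 48.17, 0.02, 14.19, 35.99, 1.63),
    ("Benito Juárez", 69.81, 0.01, 5.02, 24.04, 1.12),
    ("Coyoacán", 54.74, 0.04, 10.44, 32.04, 2.74),
]

TOLERANCE = 0.05  # assumed allowance for rounding in the published figures


def share_total(row):
    """Sum the percentage shares of one borough (everything after the name)."""
    return round(sum(row[1:]), 2)


def off_balance(rows, tolerance=TOLERANCE):
    """Return the boroughs whose shares do not add up to about 100 percent."""
    return [r[0] for r in rows if abs(share_total(r) - 100.0) > tolerance]


totals = {r[0]: share_total(r) for r in rows}
suspicious = off_balance(rows)  # an empty list means every row is consistent
```

A row in which a value was lost or misplaced during the split would show up in `suspicious`.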


We now have the correct order. The dataframe is divided by borough.
The meaning of each column is the following:
* Borough = Name of the borough

* Total = Total employed population

* %A = Officers, directors and managers; professionals and technicians; as well as auxiliary workers in administrative activities.

* %B = Agricultural workers.

* %C = Craft workers; as well as industrial machinery operators, assemblers, drivers and transport drivers.

* %D = Merchants, sales employees, and sales agents; workers in personal services and surveillance; as well as workers in elementary and support activities.

* %E = Not specified
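The `Total` values appear in the tables with a space as thousands separator (for example `351 409`), which suggests they are still stored as text. A minimal sketch of a conversion to integers follows; `parse_total` is a hypothetical helper and the three sample values are copied from the table.

```python
def parse_total(text):
    """Convert a count such as '351 409' into the integer 351409."""
    # split() without arguments splits on any whitespace, including the
    # non-breaking spaces that PDF extraction sometimes leaves behind
    digits = "".join(text.split())
    if not digits.isdigit():
        raise ValueError("not a whole number: {!r}".format(text))
    return int(digits)


# Sample values taken from the 'Total' column
sample = {
    "Álvaro Obregón": "351 409",
    "Iztapalapa": "786 218",
    "Cuajimalpa de Morelos": "91 063",
}

employed = {borough: parse_total(value) for borough, value in sample.items()}
largest = max(employed, key=employed.get)  # borough with the most employed people
```

On the dataframe itself, something like `occupation['Total'].map(parse_total)` should yield a numeric column that can be sorted and compared.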


Now we will get the latitude and longitude of each borough, retrieved through OpenStreetMap geocoding (Nominatim).

In [20]:
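# Get latitude and longitude for each borough
from geopy.exc import GeocoderTimedOut # raised when Nominatim does not answer in time

address = occupation['Borough']
geolocator = Nominatim(user_agent="mexico_city-explorer")
location = []
empty = []

def getcoords(add, retries=3):
    try:
        coords = geolocator.geocode(add, timeout=10)
        location.append([add, coords.latitude, coords.longitude])
        print("the coords are {}".format(location[-1]))

    except GeocoderTimedOut:
        # Retry a limited number of times instead of recursing forever
        if retries > 0:
            return getcoords(add, retries - 1)
        empty.append([add])
        print("Couldn't find coords of {}".format(empty[-1]))

    except Exception:
        # geocode() returns None when nothing matches, so coords.latitude fails
        empty.append([add])
        print("Couldn't find coords of {}".format(empty[-1]))

for add in address:
    getcoords(add)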

the coords are ['Álvaro Obregón', 19.318148049999998, -99.2778443631872]
the coords are ['Azcapotzalco', 19.4858148, -99.18420573027606]
the coords are ['Benito Juárez', 20.8169666, -98.17826806649418]
the coords are ['Coyoacán', 19.32804005, -99.15106340693589]
the coords are ['Cuajimalpa de Morelos', 19.3187067, -99.32320297716439]
the coords are ['Cuauhtémoc', 19.4416128, -99.1518637]
the coords are ['Gustavo A. Madero', 19.518545449999998, -99.1436399464875]
the coords are ['Iztacalco', 19.39897535, -99.09531197032297]
the coords are ['Iztapalapa', 19.3428293, -99.04689193846701]
the coords are ['La Magdalena Contreras', 19.27547005, -99.26333858358939]
the coords are ['Miguel Hidalgo', 19.429614049999998, -99.19863845640572]
the coords are ['Milpa Alta', 19.138028, -99.05892017210884]
the coords are ['Tláhuac', 19.26950425, -99.00409684032508]
the coords are ['Tlalpan', 19.200877, -99.21701240427146]
the coords are ['Venustiano Carranza', 16.30898425, -92.6379347298267]
the coords
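Borough names such as "Benito Juárez" are shared by several places in Mexico, so a lookup may be more reliable when the query also names the city. The sketch below combines that idea with a bounded retry. The helpers `qualify` and `geocode_with_retry` are hypothetical, and the geocoder is passed in as a plain callable so the logic can be tried without network access; the fake geocoder and its coordinates are placeholders, not real data.

```python
import time
from types import SimpleNamespace


def qualify(borough, city="Ciudad de México", country="México"):
    """Build a query that pins the borough name to its city."""
    return "{}, {}, {}".format(borough, city, country)


def geocode_with_retry(query, geocode, retries=3, pause=0.0,
                       retry_on=(TimeoutError,)):
    """Call geocode(query), retrying a bounded number of times on timeouts.

    Returns (latitude, longitude), or None when nothing was found or
    every attempt timed out.
    """
    for _ in range(retries):
        try:
            result = geocode(query)
        except retry_on:
            time.sleep(pause)  # wait before the next attempt
            continue
        if result is None:
            return None
        return (result.latitude, result.longitude)
    return None


# Stand-in geocoder: fails twice, then answers with placeholder coordinates
calls = []

def fake_geocode(query):
    calls.append(query)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return SimpleNamespace(latitude=19.0, longitude=-99.0)  # made-up values


coords = geocode_with_retry(qualify("Benito Juárez"), fake_geocode)
```

With geopy, the callable could be `lambda q: geolocator.geocode(q, timeout=10)` and `retry_on` could be `(GeocoderTimedOut,)`; whether the qualified query returns the intended borough would still need to be checked against the results.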

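Now we transform the obtained borough latitude and longitude values into a dataframe. Note that the coordinates returned for Benito Juárez (20.82, -98.18) and Venustiano Carranza (16.31, -92.64) lie far outside Mexico City, most likely because the bare borough names matched places with the same name elsewhere in Mexico; these two entries should be corrected before they are used for any analysis.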

In [21]:
loc=pd.DataFrame(location, columns=['Borough','Latitude','Longitude'])
loc

|    | Borough | Latitude | Longitude |
|---|---|---|---|
| 0 | Álvaro Obregón | 19.318148 | -99.277844 |
| 1 | Azcapotzalco | 19.485815 | -99.184206 |
| 2 | Benito Juárez | 20.816967 | -98.178268 |
| 3 | Coyoacán | 19.32804 | -99.151063 |
| 4 | Cuajimalpa de Morelos | 19.318707 | -99.323203 |
| 5 | Cuauhtémoc | 19.441613 | -99.151864 |
| 6 | Gustavo A. Madero | 19.518545 | -99.14364 |
| 7 | Iztacalco | 19.398975 | -99.095312 |
| 8 | Iztapalapa | 19.342829 | -99.046892 |
| 9 | La Magdalena Contreras | 19.27547 | -99.263339 |


We can also obtain the latitude and longitude of Mexico City as a whole.

In [22]:
address = 'Mexico City'

geolocator = Nominatim(user_agent="mexico_city-explorer")
location = geolocator.geocode(address)
latitude_CDMX = location.latitude
longitude_CDMX = location.longitude
print('The geographical coordinates of Mexico City are {}, {}.'.format(latitude_CDMX, longitude_CDMX))

The geographical coordinates of Mexico City are 19.4326296, -99.1331785.
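With the city center known, a simple distance check can flag borough coordinates that the geocoder placed somewhere else. The sketch below applies the haversine formula to four of the coordinate pairs printed earlier; the 60 km threshold is an assumed, deliberately generous radius, and the names `haversine_km` and `far_away` are only illustrative.

```python
from math import asin, cos, radians, sin, sqrt

CITY_CENTER = (19.4326296, -99.1331785)  # Mexico City, as printed above
MAX_KM = 60.0                            # assumed generous radius for the city
EARTH_RADIUS_KM = 6371.0                 # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Coordinates copied from the geocoding output shown earlier
boroughs = {
    "Benito Juárez": (20.8169666, -98.17826806649418),
    "Coyoacán": (19.32804005, -99.15106340693589),
    "Milpa Alta": (19.138028, -99.05892017210884),
    "Venustiano Carranza": (16.30898425, -92.6379347298267),
}

# Boroughs whose coordinates are implausibly far from the city center
far_away = sorted(
    name for name, point in boroughs.items()
    if haversine_km(CITY_CENTER, point) > MAX_KM
)
```

Under these assumptions `far_away` contains Benito Juárez and Venustiano Carranza, while Coyoacán and Milpa Alta pass the check.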


We now create a map of Mexico City with the folium library to see where each borough is located. Note that some labels may be displayed with odd characters, because some borough names contain accented letters.

In [23]:
# Creates map of Mexico City using latitude and longitude values
map_CDMX = folium.Map(location=[latitude_CDMX, longitude_CDMX], zoom_start=10)

# Add markers to map
for lat, lng, borough in zip(loc['Latitude'], loc['Longitude'], loc['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_CDMX)  
    
map_CDMX

This ends the first section of the capstone project. In the next part we will begin to use the Foursquare API in order to start the clustering analysis.