# Store locations
*__Finding the optimal location for a low-cost supermarket in Madrid__*

## Table of Contents

* [Introduction](#introduction)
    * [Background](#background)
    * [Business problem](#business-problem)
* [Data](#data)
    * [Wikipedia](#wikipedia)
    * [Geospatial data](#geospatial-data)
    * [Census data](#census-data)
* [Methodology](#methodology)
    * [Data acquisition](#data-acquisition)
        * [Scraping Wikipedia](#scrapping-wikipedia)
        * [Requesting Foursquare API for geospatial data](#requesting-foursquare)
        * [Loading census data file](#loading-census)
    * [Neighbourhood segmentation](#neighbourhood-segmentation)
    * [Census analysis](#census-analysis)
* [Results and discussion](#results-and-discussion)
* [Conclusion](#conclusion)

## Introduction <a class="anchor" id="introduction"></a>

### Background <a class="anchor" id="background"></a>

Opening a physical store is not without risks. One of the most obvious risks to evaluate beforehand is whether the store will attract enough customers to be profitable. It is therefore essential to pick a location that is convenient for customers and where there is enough demand for a new store.

Identifying the target customers is also key, taking into account the behaviour of the different consumer groups and the stores they usually shop at.

### Business problem <a class="anchor" id="business-problem"></a>

In this project, we tackle the problem of a low-cost supermarket chain deciding in which area of Madrid (Spain) to open its new store in order to maximise revenue. It is important to note that the city of Madrid consists of 21 districts and 128 neighbourhoods, with great differences between them.

The goal is to identify the optimal neighbourhood for opening a store, taking several factors into consideration: the type of neighbourhood (a residential area would be ideal), the number of people living there (the higher the population, the higher the food demand), their average income (working-class residents are preferred) and the stores that are already available (avoiding areas with a high density of supermarkets).

Using data science, geospatial analysis and machine learning techniques, this project aims to solve this problem and recommend the best neighbourhood for opening the low-cost supermarket.

## Data <a class="anchor" id="data"></a>

The following sections describe the data that is needed for answering this business question.

### Wikipedia <a class="anchor" id="wikipedia"></a>

The first data we need is the list of neighbourhoods in Madrid. Even though this information could have been obtained directly from a CSV file published by Madrid city council, we chose to scrape Wikipedia for learning purposes.

The Wikipedia page “List of neighborhoods of Madrid” shows a table with the name of each neighbourhood for each of the 21 districts. In our project, we work directly with the neighbourhoods and ignore the districts, since this allows a more granular analysis of the areas.

<p align="center">
  <img width="500" height="500" src="img/wikipedia_madrid.png">
</p>

<p style="text-align: center;">
    Source: 
    <a href="https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid">https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid</a>
</p>

### Geospatial data <a class="anchor" id="geospatial-data"></a>

Since the plan is to target residential areas, we need to analyse the types of food venue present in each neighbourhood. With the Foursquare API we can explore the different food venues: a high density of bars and restaurants with very few supermarkets most likely indicates a business or recreational area where people don’t usually shop at supermarkets. On the other hand, a large proportion of supermarkets relative to the other food venues might indicate a residential area where people normally do their food shopping.
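
The heuristic above can be expressed as a simple ratio. The following is a minimal sketch only: the function name `supermarket_share`, the category labels and the sample list are assumptions made for illustration, not values returned by Foursquare.

```python
from collections import Counter

def supermarket_share(categories, supermarket_labels=("Supermarket", "Grocery Store")):
    """Return the fraction of food venues that are supermarkets (0.0 for an empty list)."""
    if not categories:
        return 0.0
    counts = Counter(categories)
    supermarkets = sum(counts[label] for label in supermarket_labels)
    return supermarkets / len(categories)

# Hypothetical venue categories for one neighbourhood (illustrative only)
sample = ["Restaurant", "Bar", "Supermarket", "Restaurant",
          "Grocery Store", "Café", "Bar", "Restaurant"]
share = supermarket_share(sample)
print(f"Supermarket share: {share:.2f}")
```

A low share would point towards a business or recreational area, while a high share would point towards a residential one. Any threshold separating the two would have to be chosen from the actual data.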

Before we can use the Foursquare API we need to convert each neighbourhood name into a pair of latitude and longitude coordinates. We can obtain these with the _geocoder_ Python library.
We can then query the Foursquare API with an HTTP GET request to the explore endpoint, specifying the geographical coordinates, venue categories and radius.

<p align="center">
  <img width="500" height="500" src="img/foursquare_api.png">
</p>

<p style="text-align: center;">
    Source: 
    <a href="https://developer.foursquare.com/docs/api-reference/venues/explore/">https://developer.foursquare.com/docs/api-reference/venues/explore/</a>
</p>
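
The GET request described above could be assembled roughly as follows. This is a sketch under stated assumptions: it targets the legacy v2 `venues/explore` endpoint, the helper name `build_explore_url` is hypothetical, the credentials are placeholders, and the `radius`, `limit`, `section` and version values are illustrative defaults rather than the ones used in this project.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 "explore" endpoint
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      radius=500, limit=50, section="food", version="20200101"):
    """Build the GET URL for exploring venues around a pair of coordinates."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (YYYYMMDD), assumed value
        "ll": f"{lat},{lng}",            # latitude,longitude
        "radius": radius,                # search radius in metres, assumed value
        "limit": limit,                  # maximum number of results, assumed value
        "section": section,              # restrict results to food venues
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

# Sample coordinates in central Madrid
url = build_explore_url(40.41517, -3.71273, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL can be passed to `requests.get(url).json()`. Building it separately makes it easy to inspect the parameters before sending any request.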

### Census data <a class="anchor" id="census-data"></a>

Finally, we need data from the census of Madrid. We can obtain this data from Excel files available on the Madrid city council website. In particular, we are interested in the population and the average income of each neighbourhood.

<p align="center">
  <img width="500" height="500" src="img/census_population.png">
</p>

<p style="text-align: center;">
    Source: 
    <a href="http://www-2.munimadrid.es/TSE6/control/seleccionDatosBarrio">http://www-2.munimadrid.es/TSE6/control/seleccionDatosBarrio</a>
</p>

<p align="center">
  <img width="500" height="500" src="img/census_income.png">
</p>

<p style="text-align: center;">
    Source: 
    <a href="https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=d029ed1e80d38610VgnVCM2000001f4a900aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default">https://datos.madrid.es/portal/site/egob</a>
</p>

## Methodology <a class="anchor" id="methodology"></a>

The following sections describe the different stages of this project, starting with the acquisition of the data (venues and census), continuing with the clustering of the neighbourhoods and finishing with an analysis of the census for the corresponding group of neighbourhoods.

### Libraries import

In [56]:
# for HTTP requests
import requests  

# for HTML scraping
from bs4 import BeautifulSoup 

# for table analysis
import pandas as pd
import numpy as np

# for finding geographical coordinates
!pip install geocoder
import geocoder

# for converting an address into latitude and longitude values
from geopy.geocoders import Nominatim

# for rendering maps
import folium

# for removing Spanish accents from map labels in Folium (encoding incompatibility)
import unicodedata



### Data acquisition <a class="anchor" id="data-acquisition"></a>

The following sub-sections describe the methodology used to obtain the different datasets, either by loading an external file or by scraping a website.

#### Scraping Wikipedia <a class="anchor" id="scrapping-wikipedia"></a>

__URL for Wikipedia article__

In [4]:
# URL of the Wikipedia page from which to scrape tabular data.
wiki_url = "https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid"

__Request & Response__

In [5]:
# If the request was successful, the response status should be 200.
response = requests.get(wiki_url)
response

<Response [200]>

__Wrangling HTML With BeautifulSoup__

In [6]:
# Parse response content to html
soup = BeautifulSoup(response.content, 'html.parser')
#soup

__Viewing HTML content__

In [14]:
# Title of Wikipedia page
title = soup.title.string
print(f'Page title: {title}') 

# Find the right table to scrape
wiki_table = soup.find('table', {"class": 'wikitable sortable'})

# Get the 1st row of the table i.e. the header
row0 = wiki_table.find_all("tr")[0]

# Show the column names (we're only interested in "Name" i.e. the 4th column)
header = [th.text.rstrip() for th in row0.find_all('th')]
print(f'Column names: {header}') 

Page title: List of neighborhoods of Madrid - Wikipedia
Column names: ['District name (number)', 'District location', 'Number', 'Name', 'Image']


__Scraping the table contents__

In [43]:
# Placeholder for list of neighbourhoods
madrid_neighbourhoods = []

# Iterate through the rows of the table
# Note: each district cell spans several rows, so only the first row of a
# district has 5 cells; the remaining rows (sub-rows) have 3
for row in wiki_table.find_all("tr"):
    cells = row.find_all('td')

    # Parse 1st neighbourhood of the district (name is the 4th cell)
    if len(cells) == 5:
        madrid_neighbourhoods.append(cells[3].find(string=True).replace('\n', '').strip())

    # Parse sub-rows (rest of neighbourhoods of the district; name is the 2nd cell)
    elif len(cells) == 3:
        madrid_neighbourhoods.append(cells[1].find(string=True).replace('\n', '').strip())
    
print(f'Number of neighbourhoods: {len(madrid_neighbourhoods)}')
madrid_neighbourhoods

Number of neighbourhoods: 128


['Palacio',
 'Embajadores',
 'Cortes',
 'Justicia',
 'Universidad',
 'Sol',
 'Imperial',
 'Acacias',
 'Chopera',
 'Legazpi',
 'Delicias',
 'Palos de Moguer',
 'Atocha',
 'Pacífico',
 'Adelfas',
 'Estrella',
 'Ibiza',
 'Jerónimos',
 'Niño Jesús',
 'Recoletos',
 'Goya',
 'Fuente del Berro',
 'La Guindalera',
 'Lista',
 'Castellana',
 'El Viso',
 'Prosperidad',
 'Ciudad Jardín',
 'Hispanoamérica',
 'Nueva España',
 'Castilla',
 'Bellas Vistas',
 'Cuatro Caminos',
 'Castillejos',
 'Almenara',
 'Valdeacederas',
 'Berruguete',
 'Gaztambide',
 'Arapiles',
 'Trafalgar',
 'Almagro',
 'Ríos Rosas',
 'Vallehermoso',
 'El Pardo',
 'Fuentelarreina',
 'Peñagrande',
 'Pilar',
 'La Paz',
 'Valverde',
 'Mirasierra',
 'El Goloso',
 'Casa de Campo',
 'Argüelles',
 'Ciudad Universitaria',
 'Valdezarza',
 'Valdemarín',
 'El Plantío',
 'Aravaca',
 'Los Cármenes',
 'Puerta del Ángel',
 'Lucero',
 'Aluche',
 'Campamento',
 'Cuatro Vientos',
 'Las Águilas',
 'Comillas',
 'Opañel',
 'San Isidro',
 'Vista Alegre
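
The 5-cell versus 3-cell distinction used in the scraping loop comes from cells that span several rows. The sketch below reproduces it on a toy table. It is only an illustration: the HTML snippet is a simplified stand-in for the Wikipedia table, and it uses the standard library `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The class name `RowCollector` is hypothetical.

```python
from html.parser import HTMLParser

# Toy table mimicking the layout: the district cells span two rows,
# so only the first row of the district contains them.
HTML = """
<table>
  <tr><th>District</th><th>Location</th><th>Number</th><th>Name</th><th>Image</th></tr>
  <tr><td rowspan="2">Centro</td><td rowspan="2">map</td><td>11</td><td>Palacio</td><td>img</td></tr>
  <tr><td>12</td><td>Embajadores</td><td>img</td></tr>
</table>
"""

class RowCollector(HTMLParser):
    """Collect the text of the <td> cells of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag == "td":
            self._in_td = True
            self.rows[-1].append("")   # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data.strip()

parser = RowCollector()
parser.feed(HTML)

names = []
for cells in parser.rows:
    if len(cells) == 5:      # first row of a district: name is the 4th cell
        names.append(cells[3])
    elif len(cells) == 3:    # sub-row: district cells are absent, name is the 2nd cell
        names.append(cells[1])
print(names)
```

The header row contains only `<th>` cells, so it yields an empty list of `<td>` cells and is skipped by both branches.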

__Creating a dataframe__

In [48]:
df_madrid = pd.DataFrame(madrid_neighbourhoods, columns=['Neighbourhood'])
print(f'Shape: {df_madrid.shape}')
df_madrid.head()

Shape: (128, 1)


  Neighbourhood
0       Palacio
1   Embajadores
2        Cortes
3      Justicia
4   Universidad


__Getting geographical coordinates for the neighbourhoods__

In [49]:
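# Function that retrieves the geographical coordinates for a given neighbourhood
def get_coordinates(row, max_attempts=5):
    # initialize variable to None
    lat_lng_coords = None

    # retry a bounded number of times so that a failed lookup cannot loop forever
    for _ in range(max_attempts):
        g = geocoder.arcgis(f'{row["Neighbourhood"]}, Madrid')
        lat_lng_coords = g.latlng
        if lat_lng_coords:
            break

    # give up with missing values, which a null check on the dataframe will reveal
    if not lat_lng_coords:
        return pd.Series([np.nan, np.nan])

    # return pair lat,long
    return pd.Series([lat_lng_coords[0], lat_lng_coords[1]])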

In [53]:
# Fill coordinates for each row
df_madrid[['Latitude','Longitude']] = df_madrid.apply(get_coordinates, axis=1)
df_madrid.head()

  Neighbourhood  Latitude  Longitude
0       Palacio  40.41517   -3.71273
1   Embajadores  40.40803   -3.70067
2        Cortes  40.41589   -3.69636
3      Justicia  40.42479   -3.69308
4   Universidad  40.42565   -3.70726


In [58]:
# Make sure we found the coordinates for all the neighbourhoods
df_madrid.isnull().sum()

Neighbourhood    0
Latitude         0
Longitude        0
dtype: int64


__Visualizing the neighbourhoods of Madrid__

In [57]:
address = 'Madrid, Spain'
geolocator = Nominatim(user_agent="madrid_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Madrid are {latitude}, {longitude}.')

The geographical coordinates of Madrid are 40.4167047, -3.7035825.


In [79]:
# Spanish accents are not correctly rendered in the map labels so we decided
# to remove them for a better understanding of the map
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii.decode("utf-8")
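
To illustrate how the accent removal works, the sketch below repeats the same function in a self-contained form and applies it to a few neighbourhood names from the list scraped earlier. NFKD normalisation splits each accented character into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops the marks.

```python
import unicodedata

def remove_accents(input_str):
    # NFKD splits an accented character into base letter + combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Encoding to ASCII with errors='ignore' drops the combining marks
    return nfkd_form.encode('ASCII', 'ignore').decode('ascii')

for name in ['Pacífico', 'Niño Jesús', 'Argüelles']:
    print(name, '->', remove_accents(name))
```

Note that this also turns `ñ` into `n`, so the labels are an approximation of the original Spanish names.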

In [89]:
# create map of Madrid using latitude and longitude values
map_madrid = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_madrid['Latitude'], df_madrid['Longitude'], df_madrid['Neighbourhood']):
    label = folium.Popup(remove_accents(label), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_madrid)  
    
map_madrid

#### Requesting Foursquare API for geospatial data <a class="anchor" id="requesting-foursquare"></a>

#### Loading census data file<a class="anchor" id="loading-census"></a>

### Neighbourhood segmentation <a class="anchor" id="neighbourhood-segmentation"></a>

### Census analysis <a class="anchor" id="census-analysis"></a>

## Results and discussion <a class="anchor" id="results-and-discussion"></a>

## Conclusion <a class="anchor" id="conclusion"></a>