# Capstone Project

## Business Proposition

In Spain, wealth is concentrated mainly in two cities: Madrid (the capital) and Barcelona. They have the highest population density and concentrate most of the country's industrial fabric. They are also known for the stress of city life, the traffic jams, and the pollution.
The COVID-19 pandemic has led many traditional companies to allow their staff to work from home.

This situation has now lasted almost a year, and many people are considering making the change permanent. Many companies have accepted this change, even those that thought in-person work was indispensable.
In many cases people are moving to remote destinations, requiring only a good internet connection to telework.

People who move tend to avoid mainstream tourist locations and instead opt for moving (albeit temporarily) to lesser-known destinations that are cheap and at the same time close to nature (mountains, beaches, etc.).

### Objectives

The purpose of this project is to help decide which destinations are most attractive for people moving away from the main cities for semi-permanent telework.
This analysis could be used by construction companies, or by new companies providing services to this new generation of workers moving away from cities.


### Project definition and scope

The analysis will try to find a balance between working far from big cities and having access to the conveniences and amenities of modern life. We therefore discard deep rural zones, where the internet connection may be a problem and access to supermarkets, pharmacies and leisure is not always available.

The project will analyse which Spanish cities, excluding the main ones, offer the best balance between 'rural' life and access to 'city conveniences'.

## Data

### Description of the data

We will use the following site to extract information about the cities of Spain:
https://códigospostales.es/listado-de-codigos-postales-de-espana/

A CSV file (listado-codigos-postales-con-LatyLon.csv) is available on that site. It contains a list of provinces, cities and postal codes, together with their latitude and longitude.

Below we describe the process to collect the data and transform it into a pandas dataframe that will later be used to conduct the analysis and cluster the different locations.

## Data downloading and preprocessing

In [2]:
import csv
import xml
import requests
import urllib.request
import numpy as np
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# requests is already imported above; json_normalize is exposed at the top level since pandas 1.0
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

### Let's download the data and store it in the local folder for later processing

In [6]:
url= "https://xn--cdigospostales-lob.es/wp-content/uploads/2018/09/listado-codigos-postales-con-LatyLon.csv"
urllib.request.urlretrieve(url, 'listado-codigos-postales-con-LatyLon.csv')

('listado-codigos-postales-con-LatyLon.csv',
 <http.client.HTTPMessage at 0x10f0fa730>)

In [7]:
cities = pd.read_csv('listado-codigos-postales-con-LatyLon.csv', delimiter=';') 
cities.head()

Unnamed: 0,provincia,poblacion,codigopostalid,lat,lon
0,Araba/Álava,Alegría-Dulantzi,240,-2.712437,42.939812
1,Araba/Álava,Alegría-Dulantzi,1193,-2.712437,42.939812
2,Araba/Álava,Amurrio,1450,-3.000073,43.054278
3,Araba/Álava,Amurrio,1468,-3.000073,43.054278
4,Araba/Álava,Amurrio,1470,-3.000073,43.054278


### Latitude/Longitude incorrect!

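Note that the `lat` and `lon` columns in the CSV are swapped: the `lat` column actually holds the longitude and the `lon` column holds the latitude (for example, Alegría-Dulantzi appears with lat = -2.712437 and lon = 42.939812). The first step is therefore to rename the columns so that the labels match the values.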

In [8]:
cities = cities.rename(columns={'lat':'longitude', 'lon':'latitude'})
cities.head()

Unnamed: 0,provincia,poblacion,codigopostalid,longitude,latitude
0,Araba/Álava,Alegría-Dulantzi,240,-2.712437,42.939812
1,Araba/Álava,Alegría-Dulantzi,1193,-2.712437,42.939812
2,Araba/Álava,Amurrio,1450,-3.000073,43.054278
3,Araba/Álava,Amurrio,1468,-3.000073,43.054278
4,Araba/Álava,Amurrio,1470,-3.000073,43.054278
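
A quick way to confirm that the columns are labelled correctly after the rename is to compare the values with a rough bounding box for Spain. The sketch below is not part of the original notebook: the bounds are approximate assumptions (chosen to include the Canary Islands), the helper name `coordinates_look_valid` is hypothetical, and the two sample rows are copied from the output above.

```python
import pandas as pd

# Approximate bounding box for Spain, Canary Islands included (assumed values)
LAT_RANGE = (27.0, 44.0)
LON_RANGE = (-19.0, 5.0)

def coordinates_look_valid(df):
    """Return True if every row falls inside the rough bounding box."""
    lat_ok = df['latitude'].between(*LAT_RANGE)
    lon_ok = df['longitude'].between(*LON_RANGE)
    return bool((lat_ok & lon_ok).all())

# Two rows as they appear in the raw CSV
sample = pd.DataFrame({
    'poblacion': ['Alegría-Dulantzi', 'Amurrio'],
    'lat': [-2.712437, -3.000073],   # these values are really longitudes
    'lon': [42.939812, 43.054278],   # these values are really latitudes
})

# The rename used in the notebook versus trusting the original labels
renamed = sample.rename(columns={'lat': 'longitude', 'lon': 'latitude'})
naive = sample.rename(columns={'lat': 'latitude', 'lon': 'longitude'})

print(coordinates_look_valid(renamed))
print(coordinates_look_valid(naive))
```

With the rename applied the check passes, while trusting the original labels places the towns outside the box.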


In [9]:
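# Note: shape[0] counts rows (one per postal code), so a city with several
# postal codes is counted more than once in the second figure
print('The dataframe has {} provinces and {} cities.'.format(
        cities['provincia'].nunique(),
        cities.shape[0]
    )
)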

The dataframe has 52 provinces and 14665 cities.


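#### Remove unused data

Each province contains many towns. In most provinces the capital has the same name as the province, so we keep only the rows where both names match, which drops the minor towns. We also exclude the big cities: Madrid, Barcelona and Valencia. Note that this filter also leaves out capitals whose name differs from that of their province (for example Bilbao or Oviedo).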


In [11]:
# keep only the rows where the town name matches the province name (the provincial capitals)
main_cities = cities[cities['provincia']==cities['poblacion']]
# a capital has several postal codes; keep a single row per city
main_cities = main_cities.drop_duplicates(['provincia','poblacion'], keep='last')

# exclude the big cities
indexNames = main_cities[main_cities['poblacion'].isin(['Madrid', 'Barcelona', 'Valencia'])].index
main_cities.drop(indexNames , inplace=True)
# the old row labels are kept in a new 'index' column
main_cities.reset_index(inplace = True)
main_cities.head()

Unnamed: 0,index,provincia,poblacion,codigopostalid,longitude,latitude
0,148,Albacete,Albacete,2512,-1.855747,38.995881
1,349,Alicante/Alacant,Alicante/Alacant,3699,-0.483183,38.345487
2,586,Almería,Almería,4160,-2.464132,36.838924
3,791,Ávila,Ávila,5197,-4.697713,40.65587
4,1099,Badajoz,Badajoz,6195,-6.970997,38.878743
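
Because the filter relies on the town and province names being identical, it can be useful to list the provinces that end up with no city at all. The following is a minimal sketch on toy data, assuming the same `provincia` and `poblacion` column names as the CSV; the helper name `provinces_without_match` and the toy rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data with the same column names as the CSV (values are illustrative)
toy = pd.DataFrame({
    'provincia': ['Albacete', 'Albacete', 'Bizkaia', 'Bizkaia'],
    'poblacion': ['Albacete', 'Hellín', 'Bilbao', 'Getxo'],
})

def provinces_without_match(df):
    """Provinces with no row where the town name equals the province name."""
    matched = df.loc[df['provincia'] == df['poblacion'], 'provincia'].unique()
    return sorted(set(df['provincia']) - set(matched))

print(provinces_without_match(toy))
```

Running the same helper on the full `cities` dataframe would show which provinces are not represented in `main_cities`.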


In [15]:
# create map of Spain using latitude and longitude values
map_spain = folium.Map(location=[36.976, -4.27], zoom_start=6)

# add markers to map
for lat, lng, province, city in zip(main_cities['latitude'], main_cities['longitude'], main_cities['provincia'], main_cities['poblacion']):
    label = '{}, {}'.format(city, province)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_spain)  
    
map_spain

In [28]:
CLIENT_ID = 'KQE1T3QCCUCSDMTK1VIGDJRHU4ZQHKGVZVV5RIWBUQEKVTZ2' # your Foursquare ID
CLIENT_SECRET = 'KQE1T3QCCUCSDMTK1VIGDJRHU4ZQHKGVZVV5RIWBUQEKVTZ2' # your Foursquare Secret
ACCESS_TOKEN = 'PPA4JQ0QQTS2JREWQ1PURHOKC13H4RZRJDBBEGDZZL1KQK5E' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KQE1T3QCCUCSDMTK1VIGDJRHU4ZQHKGVZVV5RIWBUQEKVTZ2
CLIENT_SECRET:KQE1T3QCCUCSDMTK1VIGDJRHU4ZQHKGVZVV5RIWBUQEKVTZ2


In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&oauth_token={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            ACCESS_TOKEN,
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
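
The function above indexes straight into the JSON response, so a city with no results or a venue without categories would raise an exception. The sketch below shows one possible, more defensive way to parse the same structure. The payload layout (`response`, `groups`, `items`, `venue`) is taken from the function above; the helper name `extract_venues` and the hand-written sample payload are hypothetical, and no request is sent.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # some venues may come without any category
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written payload imitating the structure used above (not real API output)
fake_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Teatro Circo',
                           'location': {'lat': 38.995807, 'lng': -1.854121},
                           'categories': [{'name': 'Theater'}]}},
                {'venue': {'name': 'Venue without category',
                           'location': {'lat': 38.99, 'lng': -1.85},
                           'categories': []}},
            ]
        }]
    }
}

print(extract_venues(fake_payload))
print(extract_venues({'response': {}}))  # empty response, returns an empty list
```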

In [30]:
# select the city names and coordinates to query
city = main_cities.loc[:, 'poblacion']
latitudes = main_cities.loc[:, 'latitude']
longitudes = main_cities.loc[:, 'longitude']

spain_venues = getNearbyVenues(city, latitudes, longitudes, radius=500)

Albacete
Alicante/Alacant
Almería
Ávila
Badajoz
Burgos
Cáceres
Cádiz
Ciudad Real
Córdoba
A Coruña
Cuenca
Girona
Granada
Guadalajara
Huelva
Huesca
Jaén
León
Lleida
Lugo
Málaga
Murcia
Ourense
Palencia
Pontevedra
Salamanca
Santa Cruz de Tenerife
Segovia
Sevilla
Soria
Tarragona
Teruel
Toledo
Valladolid
Zamora
Zaragoza
Ceuta
Melilla


In [32]:
print(spain_venues.shape)
spain_venues.head()

(2787, 7)


Unnamed: 0,City,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albacete,38.995881,-1.855747,Asador Concepción,38.994365,-1.855289,Spanish Restaurant
1,Albacete,38.995881,-1.855747,Gran Hotel Albacete,38.994349,-1.85396,Hotel
2,Albacete,38.995881,-1.855747,La Bodega de Serapio,38.99508,-1.856989,Winery
3,Albacete,38.995881,-1.855747,Piacere Gelato dil giorno,38.994309,-1.855284,Ice Cream Shop
4,Albacete,38.995881,-1.855747,Teatro Circo,38.995807,-1.854121,Theater
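
Before clustering, a natural first check is how many venues and how many distinct categories were returned for each city. The following is a small sketch of that check, assuming the column names of `spain_venues` shown above; the sample rows are the five Albacete venues from the output, and the variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the output above
sample_venues = pd.DataFrame({
    'City': ['Albacete'] * 5,
    'Venue': ['Asador Concepción', 'Gran Hotel Albacete', 'La Bodega de Serapio',
              'Piacere Gelato dil giorno', 'Teatro Circo'],
    'Venue Category': ['Spanish Restaurant', 'Hotel', 'Winery',
                       'Ice Cream Shop', 'Theater'],
})

# number of venues returned per city
venues_per_city = sample_venues.groupby('City')['Venue'].count()
# number of distinct venue categories per city
categories_per_city = sample_venues.groupby('City')['Venue Category'].nunique()

print(venues_per_city)
print(categories_per_city)
```

Applied to the full `spain_venues` dataframe, the same two lines would show which cities have too few venues to characterise reliably.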


## Conclusion

At this point we have the data ready to be analysed.