# Capstone Project: Battle of the Neighbourhoods

## Table of contents
* [Introduction](#Introduction)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results and Discussion](#Results)
* [Conclusion](#Conclusion)

## Introduction  <a name="Introduction"></a>

In this project, I analyse neighbourhoods in Madrid, Spain, to find candidate locations for a new hotel.

The ideal location is close to the city centre and within a fixed distance of Madrid's main attractions.
Additionally, it should not be too close to other hotels and should be in an area where real estate prices are not too high.

I will filter a grid of points by these criteria and then apply an unsupervised clustering algorithm to the remaining points to narrow them down to a few candidate locations, from which a final location may be chosen.


## Data  <a name="Data"></a>

The analysis uses the following data:
* Hotel location data, obtained from the Foursquare API, to find where hotels are located and how many hotels there are around each point.
* Madrid's main tourist attractions, obtained by web scraping popular tourist information websites.
* Madrid average real estate prices by district, obtained from www.statista.com.

To analyse the centre of Madrid, I first find the latitude and longitude of the city centre by geocoding the address of Madrid's main square, Plaza Mayor, with the geopy library.


In [86]:
from geopy.geocoders import Nominatim
address = 'Plaza Mayor, Madrid'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
madrid_center = location.latitude, location.longitude
madrid_center

(40.415392, -3.7073743182788528)

Now, I will create a square grid of 21x21 points around the centre of Madrid, extending 0.02 degrees in each direction, which gives a spacing of 0.002 degrees between neighbouring points.

Then I will save it as a list of 441 coordinates (points_list).

In [79]:
import pandas as pd
import numpy as np

a = np.linspace(madrid_center[0]-.02,madrid_center[0]+.02,21)
b = np.linspace(madrid_center[1]-.02,madrid_center[1]+.02,21)

# Flatten the 21 x 21 grid into a single list of [lat, long] pairs
points_list = [[lat, long] for lat in a for long in b]
points_list[0:10]

[[40.395391999999994, -3.727374318278853],
 [40.395391999999994, -3.7253743182788526],
 [40.395391999999994, -3.723374318278853],
 [40.395391999999994, -3.721374318278853],
 [40.395391999999994, -3.719374318278853],
 [40.395391999999994, -3.7173743182788526],
 [40.395391999999994, -3.715374318278853],
 [40.395391999999994, -3.713374318278853],
 [40.395391999999994, -3.711374318278853],
 [40.395391999999994, -3.7093743182788526]]

We can visualize this grid using the Folium library:

In [80]:
import folium

madrid_map = folium.Map(location=madrid_center, zoom_start=13)
for point in points_list:
    folium.CircleMarker(point,
        radius=1,
        color='red',
        fill = False
    ).add_to(madrid_map)
madrid_map


Now, I will use the Foursquare API to get nearby hotels (within 200m) for each of these points and save them to a Pandas DataFrame.

In [46]:
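import os
import requests

# Credentials are read from environment variables rather than hard-coded in the notebook.
# The variable names below are placeholders; use whatever names you export locally.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN']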
VERSION = '20180604'
LIMIT = 50
search_query = 'Hotel'
radius = 200

ID=[]
Name=[]
Address=[]
Latitude=[]
Longitude=[]
for loc in points_list:
    lat = loc[0]
    long = loc[1]
    try:
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, long, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
        results = requests.get(url).json()
        hotels = results['response']['venues']
        df_hot = pd.json_normalize(hotels)
        df_hot = df_hot[['id','name','location.address','location.lat','location.lng']]
        for i,n,a,lt,lg in zip(df_hot['id'],df_hot['name'],df_hot['location.address'],df_hot['location.lat'],df_hot['location.lng']):
            ID.append(i)
            Name.append(n)
            Address.append(a)
            Latitude.append(lt)
            Longitude.append(lg)
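    except (requests.RequestException, KeyError, ValueError):
        # Skip grid points where the request fails or no usable venues are returned
        continue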

df_hotels = pd.DataFrame(data={'ID':ID,'Name':Name,'Address':Address, 'Latitude':Latitude, 'Longitude':Longitude})
df_hotels.head()

Unnamed: 0,ID,Name,Address,Latitude,Longitude
0,4dc7ae6d1520cca629d31061,Restaurante Asador Las Brasas,abba Atocha hotel ***,40.398775,-3.701248
1,4dc7ae6d1520cca629d31061,Restaurante Asador Las Brasas,abba Atocha hotel ***,40.398775,-3.701248
2,4fc6dcade4b00c1861c70359,Hotel Praga madrid,Antonio lopez 62,40.396478,-3.692974
3,4fc6dcade4b00c1861c70359,Hotel Praga madrid,Antonio lopez 62,40.396478,-3.692974
4,4dc7ae6d1520cca629d31061,Restaurante Asador Las Brasas,abba Atocha hotel ***,40.398775,-3.701248
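
The long format string in the cell above is hard to read and does not URL-encode its values. As a sketch only (this is not the code that produced the results shown), the same request URL could be assembled with the standard library's `urllib.parse.urlencode`. The helper name `build_search_url` and the placeholder credentials `MY_ID`, `MY_SECRET` and `MY_TOKEN` are assumptions made for illustration; the parameter names are the ones already used in the query above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(lat, lng, client_id, client_secret, oauth_token,
                     version='20180604', query='Hotel', radius=200, limit=50):
    """Assemble the venue-search URL with the same parameters as the cell above."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'oauth_token': oauth_token,
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes each value, so the comma in 'll' becomes %2C
    return BASE_URL + '?' + urlencode(params)

# Placeholder credentials, for illustration only
url = build_search_url(40.415392, -3.707374, 'MY_ID', 'MY_SECRET', 'MY_TOKEN')
print(url)
```

Keeping the parameters in a dictionary also makes it easier to change the radius or query term in one place.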


Save the results as a csv file for future use.

In [48]:
df_hotels.to_csv('Madrid_Hotels.csv')

We can see we've picked up several copies of the same hotel, so let's remove the duplicates:

In [49]:
df_hotels.drop_duplicates(inplace=True)
len(df_hotels)

356
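
Duplicates are to be expected here: neighbouring grid points are closer together than the 400m diameter of each search circle, so the circles overlap and the same hotel can be returned for several points. A rough check of the grid spacing in metres, assuming a spherical Earth (an approximation, not the geodesic model geopy uses), could look like this:

```python
import math

EARTH_RADIUS_M = 6371008.8   # mean Earth radius in metres (spherical approximation)
SEARCH_RADIUS_M = 200        # radius used in the Foursquare query
step_deg = 0.04 / 20         # 21 points spanning 0.04 degrees -> 20 intervals
centre_lat = 40.415392       # latitude of Plaza Mayor from the geocoding step

metres_per_degree = math.pi * EARTH_RADIUS_M / 180
lat_step_m = step_deg * metres_per_degree
# A degree of longitude shrinks with the cosine of the latitude
lon_step_m = lat_step_m * math.cos(math.radians(centre_lat))

print(round(lat_step_m), round(lon_step_m))
print(lat_step_m < 2 * SEARCH_RADIUS_M, lon_step_m < 2 * SEARCH_RADIUS_M)
```

The spacing comes out at roughly 220m north-south and 170m east-west, both below the 400m search diameter, which is consistent with the repeated rows seen above.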

So we have 356 hotels in the centre of Madrid. Let's visualize where they are located using a heat map.

In [129]:
from folium import plugins
from folium.plugins import HeatMap

hotel_locations = [[hot[3],hot[4]] for hot in df_hotels.values]

madrid_map = folium.Map(location=madrid_center, zoom_start=13)
HeatMap(hotel_locations).add_to(madrid_map)
folium.Marker(madrid_center).add_to(madrid_map)
folium.Circle(madrid_center, radius=1000, fill=False, color='white').add_to(madrid_map)
folium.Circle(madrid_center, radius=2000, fill=False, color='white').add_to(madrid_map)
folium.Circle(madrid_center, radius=3000, fill=False, color='white').add_to(madrid_map)
madrid_map

So we can see there are fewer hotels to the South and South-West of the center of Madrid and also directly North of the center. 

Now, let's open the CSV file containing the top 10 tourist attractions in Madrid and visualize where they are on the map.

In [39]:
Madrid_attractions = pd.read_csv('Madrid_Attractions.csv')
Madrid_attractions

Unnamed: 0,Attraction,Latitude,Longitude
0,Gran Vía,40.42012,-3.70399
1,Palacio Real,40.417498,-3.708331
2,Museo del Prado,40.41378,-3.692127
3,El Rastro Market,40.408623,-3.707338
4,El Retiro Park,40.41526,-3.6845
5,Plaza Mayor,40.415524,-3.707488
6,Museo Nacional Reina Sofia,40.407913,-3.694557
7,Puerta de Alcala,40.419991,-3.688737
8,Puerta del Sol,40.416729,-3.703339
9,Templo de Debod,40.424023,-3.71777


In [131]:
madrid_map = folium.Map(location=madrid_center, zoom_start=13)
for att in Madrid_attractions.values:
    folium.CircleMarker([att[1],att[2]],
        radius=5,
        color='red',
        tooltip=att[0],
        fill = False
    ).add_to(madrid_map)
madrid_map

Finally, let's load the data on the average price (in euros) per square metre of real estate for each district in Madrid.

In [91]:
Madrid_prices = pd.read_csv('Madrid_Neighbourhoods_Prices.csv')
Madrid_prices.describe()

Unnamed: 0,price
count,21.0
mean,3062.57
std,1028.1
min,1604.0
25%,2196.0
50%,2821.0
75%,3659.0
max,5048.0


Now, let's find the latitude and longitude of each district and append them to the Madrid_prices DataFrame:

In [102]:
latitude = []
longitude = []
# Create the geocoder once and reuse it for every district
geolocator = Nominatim(user_agent="foursquare_agent")
for nbhd in Madrid_prices['district']:
    try:
        address = '{}, Madrid'.format(nbhd)
        location = geolocator.geocode(address)
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    except Exception:
        # geocode() returns None when nothing is found; record missing coordinates
        latitude.append(np.nan)
        longitude.append(np.nan)
Madrid_prices['Latitude'] = latitude
Madrid_prices['Longitude'] = longitude
Madrid_prices.head()

Unnamed: 0,district,price,Latitude,Longitude
0,Salamanca,5048,40.43,-3.68
1,Chamberí,4785,40.44,-3.7
2,Chamartín,4445,40.46,-3.68
3,Centro,4374,40.42,-3.71
4,Retiro,4256,40.41,-3.68


## Methodology  <a name="Methodology"></a>

The aim of this project was to find areas of low hotel density, low price but high accessibility in a region about 3km in radius from the centre of Madrid.

The first step was obtaining data needed for the analysis:
* The location and number of hotels near each point in our square grid surrounding the centre of Madrid, from the Foursquare API.
* The locations of the main tourist attractions in Madrid.
* Real estate prices for each of Madrid's districts.

The second step was choosing a metric and a threshold for each of these three factors, in order to identify the areas where a new hotel could be located.

The final step was clustering these areas to find cluster centres. These centres are the proposed locations for a new hotel and would be shown to interested stakeholders as starting points for their own final search.

## Analysis <a name="Analysis"></a>

First, let's calculate the distance from each point in the grid to each of the main tourist attractions in Madrid, using the distance function in the geopy library:

In [40]:
import geopy.distance

Distances_all = []
for loc in points_list:
    Distances=[]
    location = loc[0], loc[1]
    for att in Madrid_attractions.values:
        location2 = att[1], att[2]
        Distances.append(geopy.distance.distance(location,location2))
    Distances_all.append(Distances)
len(Distances_all)

441
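
`geopy.distance.distance` computes a geodesic distance on an ellipsoidal Earth model. As a sanity check on those values, the haversine formula on a sphere gives nearly the same result at this scale. The sketch below is a cross-check under that spherical assumption, not the method used in the notebook; `haversine_km` is a helper name introduced here. It uses the first point of the filtered table shown further down and El Rastro Market, for which geopy reports about 1.113 km (column D3).

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between two points on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

grid_point = (40.398775, -3.70979)    # grid point taken from the filtered table
el_rastro = (40.408623, -3.707338)    # El Rastro Market from the attractions table

d = haversine_km(*grid_point, *el_rastro)
print(round(d, 2))
```

The two methods agree to within a few metres here, so either would give the same ranking of points for distances of a few kilometres.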

Now, let's construct a DataFrame in which each row is a point in our grid:

In [59]:
madrid_points = pd.DataFrame(points_list, columns=['Latitude', 'Longitude'])
# One column per attraction (D0..D9) holding the distance from each grid point
for i in range(len(Madrid_attractions)):
    madrid_points['D{}'.format(i)] = [dists[i] for dists in Distances_all]
madrid_points.head()

Unnamed: 0,Latitude,Longitude,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9
0,40.4,-3.72,3.089440246083345 km,2.649045293610205 km,3.2847697134057623 km,1.918699571697774 km,3.9161967660398984 km,2.4998754163164105 km,2.772727192020974 km,3.936865737658203 km,2.814808699357332 km,3.0685308647433374 km
1,40.4,-3.72,3.000469489194311 km,2.5691819598835335 km,3.1473653473249916 km,1.7988883061510372 km,3.772662706232891 km,2.4100491774566875 km,2.6218661605178313 km,3.8101821804652856 km,2.713398957290193 km,3.0448664212165126 km
2,40.4,-3.72,2.918669922149547 km,2.4983182330122022 km,3.013262589650898 km,1.687673214031574 km,3.63139307431049 km,2.329150432673933 km,2.4734590147269415 km,3.6869642803997715 km,2.619076844951337 km,3.030541022010563 km
3,40.4,-3.72,2.8446602452636176 km,2.43723924123509 km,2.882922212701073 km,1.5868627113548244 km,3.492662652809569 km,2.2581388792757924 km,2.327975103682217 km,3.5675711494869184 km,2.532634373386414 km,3.0256873197735694 km
4,40.4,-3.72,2.7790628897859606 km,2.3866963191800505 km,2.756877904587115 km,1.4985580736697257 km,3.3567862806632855 km,2.1979730248898797 km,2.185998152224011 km,3.4523996225670137 km,2.4549040612894646 km,3.030350827202156 km


The first constraint I will place on the possible hotel locations is that the distance to each of the 10 main attractions must be less than 3km:



In [42]:
condition = np.where((madrid_points['D0']<3)&(madrid_points['D1']<3)&(madrid_points['D2']<3)&(madrid_points['D3']<3)
    &(madrid_points['D4']<3)&(madrid_points['D5']<3)&(madrid_points['D6']<3)&(madrid_points['D7']<3)
    &(madrid_points['D8']<3)&(madrid_points['D9']<3))
# reset_index returns a new DataFrame, so later column assignments do not act on a slice
madrid_final = madrid_points.loc[condition].reset_index(drop=True)
madrid_final

Unnamed: 0,Latitude,Longitude,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9
0,40.398775,-3.70979,2.4207916005246286 km,2.0827728896151836 km,2.2414727314175957 km,1.113178898940657 km,2.82122740923912 km,1.8700871220404958 km,1.6437101587604015 km,2.9569548928558564 km,2.067490357403817 km,2.8842623945440775 km
1,40.398775,-3.70779,2.392048923960362 km,2.0795912591168664 km,2.1316506520108778 km,1.0942185013693695 km,2.6942988830040755 km,1.8600269394697335 km,1.5137767593292346 km,2.8575623134141663 km,2.0291409599629224 km,2.9287803454450234 km
2,40.398775,-3.70579,2.3751217376740676 km,2.090238759114938 km,2.0301018546858307 km,1.1014127014905841 km,2.572316490055265 km,1.86542693278687 km,1.3924462703602236 km,2.7650283458219422 km,2.004482905654187 km,2.9823103233826807 km
3,40.400775,-3.71179,2.247830621622528 km,1.880074639082457 km,2.2070971609950005 km,0.9498775374001359 km,2.8201312589805463 km,1.6779820551692652 km,1.6638067554784435 km,2.8951581963609168 km,1.9112945947149043 km,2.6309428959433783 km
4,40.400775,-3.70979,2.2038092595015164 km,1.8611282255152526 km,2.081678109672188 km,0.8959733287273802 km,2.682432075800441 km,1.6493810690757191 km,1.5166918639517413 km,2.7832356506700933 km,1.8542675142672695 km,2.6688955614235432 km
...,...,...,...,...,...,...,...,...,...,...,...,...
251,40.432775,-3.69979,1.4497412899488018 km,1.8447043027507344 km,2.2072328341303793 km,2.7573449957989657 km,2.3380237406705757 km,2.023929543138688 km,2.7962377133563727 km,1.7014568383991928 km,1.8070675695030847 km,1.8089722552866743 km
252,40.432775,-3.69779,1.5005119582723705 km,1.9177485077838463 km,2.163314731965362 km,2.80164577034019 km,2.248266229078973 km,2.0849110980640324 km,2.7743452341151706 km,1.6141181861702318 km,1.8429688041993801 km,1.954243874015002 km
253,40.432775,-3.69579,1.5680984001446987 km,2.002547646452945 km,2.1320412777065187 km,2.855364521827591 km,2.16810085870767 km,2.1575522763104384 km,2.762726578192938 km,1.540587518215858 km,1.8934592271869632 km,2.1031795960669095 km
254,40.432775,-3.69379,1.6504360558743052 km,2.0976766137887517 km,2.1139737285640936 km,2.9179811541280376 km,2.0986271470309372 km,2.240719395635327 km,2.7615114254733384 km,1.4829202676244644 km,1.9574102012474859 km,2.255053547624944 km


So we now have 256 points remaining.
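
Requiring every one of the ten distances to be below 3km is the same as requiring the largest of them to be below 3km, so the filter keeps exactly those points whose farthest attraction is within 3km. The toy example below illustrates the equivalence with made-up distances; the labels and values are hypothetical and are not taken from the grid.

```python
MAX_KM = 3.0

# Toy rows: (label, distances in km to each attraction); values are made up
rows = [
    ('A', [2.4, 2.1, 2.9]),
    ('B', [1.2, 3.4, 0.8]),
    ('C', [0.5, 0.9, 1.1]),
]

# Chained condition, as in the cell above: every distance must pass
kept_all = [label for label, dists in rows if all(d < MAX_KM for d in dists)]
# Equivalent condition: only the farthest attraction matters
kept_max = [label for label, dists in rows if max(dists) < MAX_KM]

print(kept_all, kept_max)
```

Point B is dropped because a single attraction is farther than 3km away, even though the other two are close.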

For each of these I will now use the Foursquare API to find how many hotels there are within 200m:

In [60]:
# CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION, LIMIT, search_query and radius
# are reused from the hotel search cell above.

nearby_hotels = []
for lat, long in zip(madrid_final['Latitude'], madrid_final['Longitude']):
    try:
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, long, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
        results = requests.get(url).json()
        hotels = results['response']['venues']
        nearby_hotels.append(len(hotels))
    except (requests.RequestException, KeyError, ValueError):
        # Record a missing value so the list stays aligned with the rows of madrid_final
        nearby_hotels.append(np.nan)

madrid_final['Nearby_Hotels'] = nearby_hotels
madrid_final.head(15)

Unnamed: 0,Latitude,Longitude,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,Nearby_Hotels
0,40.4,-3.71,2.4207916005246286 km,2.0827728896151836 km,2.2414727314175957 km,1.113178898940657 km,2.82122740923912 km,1.8700871220404958 km,1.6437101587604015 km,2.9569548928558564 km,2.067490357403817 km,2.8842623945440775 km,0
1,40.4,-3.71,2.392048923960362 km,2.0795912591168664 km,2.1316506520108778 km,1.0942185013693695 km,2.6942988830040755 km,1.8600269394697335 km,1.5137767593292346 km,2.8575623134141663 km,2.0291409599629224 km,2.9287803454450234 km,0
2,40.4,-3.71,2.3751217376740676 km,2.090238759114938 km,2.0301018546858307 km,1.1014127014905841 km,2.572316490055265 km,1.86542693278687 km,1.3924462703602236 km,2.7650283458219422 km,2.004482905654187 km,2.9823103233826807 km,0
3,40.4,-3.71,2.247830621622528 km,1.880074639082457 km,2.2070971609950005 km,0.9498775374001359 km,2.8201312589805463 km,1.6779820551692652 km,1.6638067554784435 km,2.8951581963609168 km,1.9112945947149043 km,2.6309428959433783 km,1
4,40.4,-3.71,2.2038092595015164 km,1.8611282255152526 km,2.081678109672188 km,0.8959733287273802 km,2.682432075800441 km,1.6493810690757191 km,1.5166918639517413 km,2.7832356506700933 km,1.8542675142672695 km,2.6688955614235432 km,1
5,40.4,-3.71,2.172198293846635 km,1.857567107541361 km,1.962939749954735 km,0.8723051714025641 km,2.548604507595308 km,1.6379662368672965 km,1.3748102931743986 km,2.6774050735172445 km,1.8114108648699627 km,2.7169432474706463 km,0
6,40.4,-3.71,2.153544294913525 km,1.8694792082728662 km,1.8521673810203894 km,0.881312461543861 km,2.4192911364286602 km,1.6440955640599304 km,1.2399598041532285 km,2.5784167048530255 km,1.7837463240547964 km,2.774561550615805 km,0
7,40.4,-3.7,2.14818482896553 km,1.896572994962076 km,1.7508736072583595 km,0.9220380871910883 km,2.2952550711416264 km,1.6675756063669296 km,1.1146950845169872 km,2.4870876574679976 km,1.771985580117127 km,2.841168260093657 km,0
8,40.4,-3.7,2.156219032363876 km,1.938211908984457 km,1.6607936674910777 km,0.990577634562051 km,2.1773983679719517 km,1.7076908182235764 km,1.0026154741272704 km,2.404290927512849 km,1.7764445288514985 km,2.916147537727541 km,0
9,40.4,-3.7,2.1774986563809295 km,1.993484724500336 km,1.5838420563997635 km,1.081656647184457 km,2.066778419859502 km,1.7633062146924972 km,0.9086133878495127 km,2.330935916891129 km,1.797002435279896 km,2.9988714438676696 km,0


In [120]:
madrid_final['Nearby_Hotels'].mean()

5.4453125

So, on average, our points have about 5.45 hotels within a 200m distance.

With this in mind, the next constraint I will impose is that a candidate point must have fewer than 5 hotels within 200m, that is, a below-average hotel density.

In [62]:
candidates = madrid_final[madrid_final['Nearby_Hotels']<5]
candidates.head()

Unnamed: 0,Latitude,Longitude,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,Nearby_Hotels
0,40.4,-3.71,2.4207916005246286 km,2.0827728896151836 km,2.2414727314175957 km,1.113178898940657 km,2.82122740923912 km,1.8700871220404958 km,1.6437101587604015 km,2.9569548928558564 km,2.067490357403817 km,2.8842623945440775 km,0
1,40.4,-3.71,2.392048923960362 km,2.0795912591168664 km,2.1316506520108778 km,1.0942185013693695 km,2.6942988830040755 km,1.8600269394697335 km,1.5137767593292346 km,2.8575623134141663 km,2.0291409599629224 km,2.9287803454450234 km,0
2,40.4,-3.71,2.3751217376740676 km,2.090238759114938 km,2.0301018546858307 km,1.1014127014905841 km,2.572316490055265 km,1.86542693278687 km,1.3924462703602236 km,2.7650283458219422 km,2.004482905654187 km,2.9823103233826807 km,0
3,40.4,-3.71,2.247830621622528 km,1.880074639082457 km,2.2070971609950005 km,0.9498775374001359 km,2.8201312589805463 km,1.6779820551692652 km,1.6638067554784435 km,2.8951581963609168 km,1.9112945947149043 km,2.6309428959433783 km,1
4,40.4,-3.71,2.2038092595015164 km,1.8611282255152526 km,2.081678109672188 km,0.8959733287273802 km,2.682432075800441 km,1.6493810690757191 km,1.5166918639517413 km,2.7832356506700933 km,1.8542675142672695 km,2.6688955614235432 km,1


In [63]:
len(candidates)

171

So far we have narrowed down to 171 out of 441 points.

Let's narrow it down further by incorporating the district price data.

We will assign each point to the district whose centre is nearest, measured with geopy, and create a new column in our DataFrame with the average price per square metre for that district:

In [107]:
nbhd=[]
for loc in candidates[['Latitude','Longitude']].values:
    distances=[]
    location = loc[0], loc[1]
    for dist in Madrid_prices.values:
        location2 = dist[2], dist[3]
        distances.append(geopy.distance.distance(location,location2))
    min_dist = distances.index(min(distances))
    nbhd.append(Madrid_prices.loc[min_dist,'price'])
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
candidates = candidates.assign(price=nbhd)
candidates.head(5)


Unnamed: 0,Latitude,Longitude,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,Nearby_Hotels,district,price
0,40.4,-3.71,2.4207916005246286 km,2.0827728896151836 km,2.2414727314175957 km,1.113178898940657 km,2.82122740923912 km,1.8700871220404958 km,1.6437101587604015 km,2.9569548928558564 km,2.067490357403817 km,2.8842623945440775 km,0,3659,3659
1,40.4,-3.71,2.392048923960362 km,2.0795912591168664 km,2.1316506520108778 km,1.0942185013693695 km,2.6942988830040755 km,1.8600269394697335 km,1.5137767593292346 km,2.8575623134141663 km,2.0291409599629224 km,2.9287803454450234 km,0,3659,3659
2,40.4,-3.71,2.3751217376740676 km,2.090238759114938 km,2.0301018546858307 km,1.1014127014905841 km,2.572316490055265 km,1.86542693278687 km,1.3924462703602236 km,2.7650283458219422 km,2.004482905654187 km,2.9823103233826807 km,0,3659,3659
3,40.4,-3.71,2.247830621622528 km,1.880074639082457 km,2.2070971609950005 km,0.9498775374001359 km,2.8201312589805463 km,1.6779820551692652 km,1.6638067554784435 km,2.8951581963609168 km,1.9112945947149043 km,2.6309428959433783 km,1,3659,3659
4,40.4,-3.71,2.2038092595015164 km,1.8611282255152526 km,2.081678109672188 km,0.8959733287273802 km,2.682432075800441 km,1.6493810690757191 km,1.5166918639517413 km,2.7832356506700933 km,1.8542675142672695 km,2.6688955614235432 km,1,3659,3659
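
The nearest-centre assignment can be illustrated on its own with the five districts shown in the Madrid_prices preview above. This is a simplified sketch: it uses the rounded coordinates from that preview, only five of the 21 districts, and an equirectangular distance approximation instead of geopy. The helper names `approx_km` and `nearest_district` are introduced here for illustration.

```python
import math

# (district, price per square metre, latitude, longitude) from the preview above
districts = [
    ('Salamanca', 5048, 40.43, -3.68),
    ('Chamberí', 4785, 40.44, -3.70),
    ('Chamartín', 4445, 40.46, -3.68),
    ('Centro', 4374, 40.42, -3.71),
    ('Retiro', 4256, 40.41, -3.68),
]

def approx_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; adequate for ranking points a few km apart."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371.0088 * math.hypot(dx, dy)

def nearest_district(lat, lon):
    """Return the district tuple whose centre is closest to (lat, lon)."""
    return min(districts, key=lambda d: approx_km(lat, lon, d[2], d[3]))

# El Retiro Park, using its coordinates from the attractions table
name, price, _, _ = nearest_district(40.41526, -3.6845)
print(name, price)
```

Assigning by nearest centre is only an approximation of the real district boundaries, so points near a border may be given the price of a neighbouring district.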


In [112]:
candidates.describe()

Unnamed: 0,Latitude,Longitude,Nearby_Hotels,price
count,171.0,171.0,171.0,171.0
mean,40.42,-3.7,1.27,4358.34
std,0.01,0.01,1.38,440.25
min,40.4,-3.72,0.0,3659.0
25%,40.41,-3.71,0.0,4256.0
50%,40.41,-3.7,1.0,4374.0
75%,40.42,-3.69,2.0,4785.0
max,40.43,-3.68,4.0,5048.0


The final constraint I will then place on the data is that the price per square metre be less than €4500.

In [115]:
candidates_final = candidates[candidates['price']<4500]
len(candidates_final)

121

**This leaves us a total of 121 candidate points with low hotel density, good location and reasonable price.**

Let's visualize these points:

In [116]:
madrid_map = folium.Map(location=madrid_center, zoom_start=13)
for lat,long in zip(candidates_final['Latitude'], candidates_final['Longitude']):
    folium.CircleMarker([lat, long],
        radius=1,
        color='red',
        fill = False
    ).add_to(madrid_map)
madrid_map

Now, we will use the k-means clustering algorithm (scikit-learn's KMeans) to group these points and use the resulting cluster centres as the final candidate list.


In [117]:
from sklearn.cluster import KMeans

X = candidates_final[['Latitude', 'Longitude']].values

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

In [118]:
kmeans.cluster_centers_

array([[40.401775  , -3.70836143],
       [40.414025  , -3.68854   ],
       [40.41333056, -3.71401222],
       [40.42563214, -3.70607571],
       [40.40908269, -3.70502077],
       [40.405775  , -3.71519   ],
       [40.42048929, -3.71607571],
       [40.402775  , -3.70139   ],
       [40.422775  , -3.69612333],
       [40.40739038, -3.69671308]])
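
To show what the clustering step does conceptually, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration behind k-means. It is not scikit-learn's implementation: it uses the first k points as initial centres (scikit-learn defaults to k-means++ initialisation), and the toy points are made up for illustration.

```python
import math

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; the first k points are used as initial centres."""
    centres = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its members
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centre)  # keep an empty cluster's centre in place
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return centres

# Two well-separated toy groups (hypothetical values)
toy_points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
              (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centres = kmeans(toy_points, k=2)
print(centres)
```

Note that both this sketch and the KMeans call above measure Euclidean distance on raw coordinates. With latitude and longitude in degrees, a degree of longitude is treated as equal to a degree of latitude, although at Madrid's latitude it covers a shorter distance on the ground.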

Let's look at these cluster centers on our Madrid map and superimpose the Heat Map created above to see where the cluster centers are in relation to other hotels.

In [132]:
cluster_centers = kmeans.cluster_centers_
madrid_map = folium.Map(location=madrid_center, zoom_start=14)
for lat,long in cluster_centers:
    folium.CircleMarker([lat, long],
        radius=5,
        color='red',
        fill = False
    ).add_to(madrid_map)
HeatMap(hotel_locations).add_to(madrid_map)
madrid_map

Finally, let's get an address for each of these locations, so that we can present a list of addresses to interested parties as our final findings:

In [127]:
# Create the geocoder once and reuse it for every cluster centre
geolocator = Nominatim(user_agent="foursquare_agent")
for cluster in cluster_centers:
    location = geolocator.reverse("{}, {}".format(cluster[0],cluster[1]))
    print(" - "+location.address)

 - Farmacia - Paseo Doctor Vallejo Nájera 25, 25, Paseo de Juan Antonio Vallejo-Nájera Botas, Arganzuela, Imperial, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28005, España
 - Paseo del Marqués de Pontejos, Jerónimos, Retiro, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28009, España
 - Viaducto de Segovia, Calle de Bailén, Palacio, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28005, España
 - 39, Calle del Espíritu Santo, Universidad, Centro, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28004, España
 - 16, Calle de los Cabestreros, Embajadores, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28012, España
 - Colegio Público Joaquín Costa, Travesía de Gil Imón, Arganzuela, Imperial, Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28005, España
 - 26, Cuesta de San Vicent

Now, let's assume each hotel room is approximately 10 square metres and that corridors, reception and other shared space double that footprint, so a hotel of x rooms requires about 20x square metres.

Let's find the average real estate price per square metre for each of our cluster centres:

In [133]:
cluster_prices=[]
for loc in cluster_centers:
    distances=[]
    location = loc[0], loc[1]
    for dist in Madrid_prices.values:
        location2 = dist[2], dist[3]
        distances.append(geopy.distance.distance(location,location2))
    min_dist = distances.index(min(distances))
    cluster_prices.append(Madrid_prices.loc[min_dist,'price'])
cluster_prices

[3659, 4256, 4374, 4374, 4374, 4374, 4374, 3659, 4374, 3659]

In [153]:
avg_price = np.array(cluster_prices).mean()
print('€{}'.format(avg_price))

€4147.7


So, for interested investors, let's estimate how much they would have to invest in the real estate alone for different sizes of hotel:

In [148]:
no_of_rooms = np.linspace(10,100,10)

hotel_size = no_of_rooms*20

initial_investment = hotel_size * avg_price

investment_table = pd.DataFrame({'rooms': no_of_rooms , 'initial_investment':initial_investment})
investment_table


Unnamed: 0,rooms,initial_investment
0,10.0,829540.0
1,20.0,1659080.0
2,30.0,2488620.0
3,40.0,3318160.0
4,50.0,4147700.0
5,60.0,4977240.0
6,70.0,5806780.0
7,80.0,6636320.0
8,90.0,7465860.0
9,100.0,8295400.0
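
The figures in this table follow from a single multiplication: rooms × 20 square metres per room × average price per square metre. As a sketch, the calculation can be wrapped in a small function so the assumptions are easy to vary; `estimate_investment` is a helper name introduced here, and the default of 20 square metres per room is the assumption stated above.

```python
def estimate_investment(rooms, price_per_m2, m2_per_room=20):
    """Real-estate cost only: rooms x floor area per room x price per square metre."""
    return rooms * m2_per_room * price_per_m2

AVG_PRICE = 4147.7  # average price per square metre across the cluster centres

for rooms in (10, 50, 100):
    print(rooms, round(estimate_investment(rooms, AVG_PRICE)))
```

Changing `m2_per_room` or passing the price of a specific district instead of the average shows how sensitive the estimate is to those assumptions.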


## Results and Discussion <a name="Results"></a>

The result of the analysis is that, while there are many hotels in the centre of Madrid, there are still plenty of areas of low hotel density within 3km of the centre.

These areas were then clustered using KMeans to obtain a list of 10 possible locations for a new hotel. All of them have fewer than 5 hotels within 200m, are within 3km of all 10 major Madrid tourist attractions and are reasonably priced (less than €4500 per square metre).

These locations are not guaranteed to be suitable for a hotel (there might not be any real estate available there, for example), but they are useful starting points for anybody interested in opening a hotel in Madrid.

Finally, the initial investment required was estimated from the real estate price data and an assumed floor area of 20 square metres per room. For example, a 100-room hotel in one of these locations would require an initial investment of approximately €8.3 million to purchase the property.

## Conclusion <a name="Conclusion"></a>

The aim of this project was to identify some candidate locations for a new hotel in Madrid, based on three main ideas: not being too close to existing hotels, being close to main tourist attractions and the price of the real estate not being too high.

Using the data we acquired, we found a number of locations that satisfied all of these criteria. We then used a clustering algorithm to create 10 cluster centres based on these locations.

This list of 10 locations will be passed to interested parties, as a guide, so they can make informed decisions on which neighbourhoods to look for properties in. To make their final decision on a hotel location, however, they will take into account other factors which were not included in this study, such as availability of property, attractiveness of locations, available capital, etc.