# Capstone Project

### Introduction
London is the capital of the United Kingdom and attracts 30 million visitors every year, which supports a huge hospitality industry.
Imagine that you are an entrepreneur who owns a chain of upmarket hotels. You currently do not have a hotel in London, and you are trying to figure out the best place in London to open a hotel. We will use machine learning and data science techniques to solve this problem.
Factors you may wish to consider:
•	real estate price
•	proximity to tourist sites
•	whether or not there are already competitor hotels in the area

### Data
As mentioned above, we wish to consider real estate price, proximity to tourist sites, and whether or not there are already competitor hotels in the area.
From the web I found data on the average house price in each borough of London, which I will use as a proxy for the real estate price. I saved this data in a CSV file called `Prices by borough.csv`, which can be imported into the notebook.
I will use the Foursquare API to gather data regarding proximity to tourist sites and competitor hotels.


In [11]:
pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


In [12]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


In [5]:
price = pd.read_csv(r'C:\Users\Tommy\OneDrive\Tommy\Study\Extra-Curricular_Study\IBM_Data_Science\Course 8 - Python for Machine Learning\Prices by borough.csv')

In [6]:
price

,Borough,Price
0,Barking and Dagenham,303631
1,Barnet,516896
2,Bexley,341237
3,Brent,505388
4,Bromley,452429
5,Camden,779779
6,City of London,738263
7,City of Westminster,984511
8,Croydon,374089
9,Ealing,472015


In [8]:
price.shape

(33, 2)

### Get Geographical Coordinates

In [20]:
def get_latlng(neighborhood, max_tries=5):
    # the geocoder occasionally returns nothing, so retry a limited number of times
    for _ in range(max_tries):
        g = geocoder.arcgis('{}, London, UK'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    # give up instead of looping forever when no coordinates come back
    raise RuntimeError('Could not geocode {}'.format(neighborhood))

In [21]:
coords = [ get_latlng(x) for x in price["Borough"].tolist() ]
coords

[[51.54393218800004, 0.1331566440000529],
 [51.62729399999999, -0.25375949999999536],
 [51.622832082632826, -0.08065551264637581],
 [51.60977558622067, -0.19467855915694893],
 [51.431820424666206, -0.016565652066046606],
 [51.53236000000004, -0.1279599999999732],
 [51.52050000000003, -0.09742999999997437],
 [51.49728434722135, -0.1372930927058332],
 [51.593209478792595, -0.0833901391886536],
 [51.51406000000003, -0.30072999999993044],
 [51.54002082084554, -0.07750147872409442],
 [51.48454000000004, 0.002750000000048658],
 [51.54505000000006, -0.05531999999993786],
 [51.4826899378211, -0.2129099233942567],
 [51.589264499999985, -0.10640475000000119],
 [51.513180000000034, -0.10697999999996455],
 [51.54460488121069, -0.14410476132484631],
 [51.484225357256605, -0.0964798815487847],
 [51.471390726355075, -0.3513748870497065],
 [51.532790000000034, -0.1061399999999253],
 [51.510380000000055, -0.3314699999999675],
 [51.41087440213694, -0.2919463207868962],
 [51.49084000000005, -0.1110799999

In [22]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df_coords

,Latitude,Longitude
0,51.543932,0.133157
1,51.627294,-0.253759
2,51.622832,-0.080656
3,51.609776,-0.194679
4,51.43182,-0.016566
5,51.53236,-0.12796
6,51.5205,-0.09743
7,51.497284,-0.137293
8,51.593209,-0.08339
9,51.51406,-0.30073


### Merge the data frames

In [24]:
price['Latitude'] = df_coords['Latitude']
price['Longitude'] = df_coords['Longitude']
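
The column assignment above matches rows by index, so it relies on `price` and `df_coords` having the same length and the same default integer index. A minimal sketch of an equivalent merge with an explicit length check follows. It uses a few rows from the tables above as toy data, and the names `price_demo`, `coords_demo` and `merged_demo` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for `price` and `df_coords` (values copied from the tables above)
price_demo = pd.DataFrame({
    "Borough": ["Barking and Dagenham", "Barnet", "Camden"],
    "Price": [303631, 516896, 779779],
})
coords_demo = pd.DataFrame(
    [[51.543932, 0.133157], [51.627294, -0.253759], [51.53236, -0.12796]],
    columns=["Latitude", "Longitude"],
)

# Guard against silent misalignment before combining the frames
assert len(price_demo) == len(coords_demo), "expected one coordinate pair per borough"

# Concatenate side by side; rows are matched on the shared integer index
merged_demo = pd.concat([price_demo, coords_demo], axis=1)
print(merged_demo)
```

If the geocoding step ever drops or reorders a borough, the length check fails early instead of attaching coordinates to the wrong row.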

In [25]:
price

,Borough,Price,Latitude,Longitude
0,Barking and Dagenham,303631,51.543932,0.133157
1,Barnet,516896,51.627294,-0.253759
2,Bexley,341237,51.622832,-0.080656
3,Brent,505388,51.609776,-0.194679
4,Bromley,452429,51.43182,-0.016566
5,Camden,779779,51.53236,-0.12796
6,City of London,738263,51.5205,-0.09743
7,City of Westminster,984511,51.497284,-0.137293
8,Croydon,374089,51.593209,-0.08339
9,Ealing,472015,51.51406,-0.30073


In [28]:
london_map = folium.Map(location=[51.48, 0], zoom_start=11)

In [29]:
london_map

In [31]:
for lat, lng, neighborhood in zip(price['Latitude'], price['Longitude'], price['Borough']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(london_map)  

In [32]:
london_map

### Save the map to HTML

In [33]:
london_map.save('london_map.html')

Let's use the Foursquare API to explore the neighbourhoods

In [34]:
CLIENT_ID = 'RI10WYDTBTSQZNMPBVKQT2BPYX5VNKRIHUL3YTCY0XJGYH0I' # your Foursquare ID
CLIENT_SECRET = 'O4UU1GSVKPRJX1JL2F30NXWRZ2WDRPW3LS3WCJFZPF4BYKPM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: RI10WYDTBTSQZNMPBVKQT2BPYX5VNKRIHUL3YTCY0XJGYH0I
CLIENT_SECRET:O4UU1GSVKPRJX1JL2F30NXWRZ2WDRPW3LS3WCJFZPF4BYKPM
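
The cell above hard-codes the API keys and echoes them into the notebook output. One possible alternative, sketched here under assumptions, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are made up for this example and are not a Foursquare convention.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    # Assumption: these variable names are chosen for this sketch only
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET before running"
        )
    return client_id, client_secret
```

In the notebook this could be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`, with the values exported in the shell before the kernel starts.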


In [36]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(price['Latitude'], price['Longitude'], price['Borough']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

KeyError: 'groups'
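
The `KeyError: 'groups'` above means that at least one response came back without a `groups` entry, and the exception stops the loop before the remaining boroughs are queried. A minimal sketch of a more defensive way to pull the items out of a response follows. The helper name `extract_items` and the shape of the failed payload are assumptions for illustration; only the `response` → `groups[0]` → `items` path is taken from the cell above.

```python
def extract_items(payload):
    """Return the venue items from an explore-style response, or [] if they are missing."""
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])


# A successful payload in the shape the loop above expects
ok = {"response": {"groups": [{"items": [{"venue": {"name": "Capital Karts"}}]}]}}
# Hypothetical error payload: the exact fields of a real failure are not shown in the notebook
failed = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok)))      # 1
print(len(extract_items(failed)))  # 0
```

With a helper like this, a borough whose request fails contributes no venues rather than aborting the whole run, so it is worth logging which boroughs returned nothing.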

In [37]:


# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()



(246, 7)


,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Barking and Dagenham,51.543932,0.133157,Capital Karts,51.531792,0.118739,Go Kart Track
1,Barking and Dagenham,51.543932,0.133157,Mayesbrook Park,51.549842,0.108544,Park
2,Barking and Dagenham,51.543932,0.133157,wilko,51.541002,0.148898,Furniture / Home Store
3,Barking and Dagenham,51.543932,0.133157,Co-op Food,51.540093,0.127522,Grocery Store
4,Barking and Dagenham,51.543932,0.133157,Goodmayes Park,51.558503,0.116386,Park


In [38]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Barking and Dagenham,37,37,37,37,37,37
Barnet,42,42,42,42,42,42
Bexley,77,77,77,77,77,77
Brent,90,90,90,90,90,90
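
The counts above cover all venues per neighbourhood. The factors listed in the introduction (competitor hotels and tourist sites) need separate counts, and one way to get them is sketched below on toy data. The neighbourhood labels `A` and `B` are placeholders, and which categories count as tourist sites is an assumption made for this example.

```python
import pandas as pd

# Toy rows in the same layout as `venues_df` (only the columns needed here)
venues_demo = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "VenueCategory": ["Hotel", "Park", "History Museum", "Hotel", "Hotel"],
})

# Assumption: which categories qualify as tourist sites is a judgment call
TOURIST_CATEGORIES = {"History Museum", "Park", "Performing Arts Venue"}

# Flag each venue, then sum the flags per neighbourhood
flags = pd.DataFrame({
    "Neighborhood": venues_demo["Neighborhood"],
    "hotels": venues_demo["VenueCategory"].eq("Hotel"),
    "tourist_sites": venues_demo["VenueCategory"].isin(TOURIST_CATEGORIES),
})
summary = flags.groupby("Neighborhood").sum()
print(summary)
```

Joining a table like `summary` to the price data would put all three factors side by side for each borough.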


Let's find out how many unique categories can be curated from all the returned venues

In [39]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 82 unique categories.


In [40]:
venues_df['VenueCategory'].unique()[:50]

array(['Go Kart Track', 'Park', 'Furniture / Home Store', 'Grocery Store',
       'Supermarket', 'Movie Theater', 'Hotel', 'Pizza Place', 'Pub',
       'Bowling Alley', 'Fast Food Restaurant', 'Metro Station',
       'Soccer Field', 'Gym / Fitness Center', 'Rugby Pitch', 'Gym',
       'Chinese Restaurant', 'River', 'Recreation Center',
       'Cosmetics Shop', 'Warehouse Store', 'Skate Park',
       'History Museum', 'Food & Drink Shop', 'Bus Stop', 'Golf Course',
       'Café', 'Bakery', 'Farm', 'Indian Restaurant',
       'Italian Restaurant', 'Juice Bar', 'Sandwich Place', 'Restaurant',
       'Bookstore', 'Pharmacy', 'Sushi Restaurant',
       'Argentinian Restaurant', 'Stationery Store', 'Fish & Chips Shop',
       'Convenience Store', 'Salon / Barbershop', 'Campground',
       'Burger Joint', 'Athletics & Sports', 'Platform', 'Steakhouse',
       'Performing Arts Venue', 'Middle Eastern Restaurant',
       'Turkish Restaurant'], dtype=object)
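
Category names like these have to be turned into numbers before a clustering step such as the imported `KMeans` can use them. The notebook does not show that step, so the following is only a sketch of one common approach: one-hot encode the categories and average them per neighbourhood. The data is a toy example with placeholder neighbourhood labels.

```python
import pandas as pd

# Toy rows in the same layout as `venues_df` (only the columns needed here)
venues_demo = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "VenueCategory": ["Hotel", "Park", "Hotel", "Hotel"],
})

# One indicator column per category
onehot = pd.get_dummies(venues_demo["VenueCategory"], dtype=float)
onehot.insert(0, "Neighborhood", venues_demo["Neighborhood"])

# Mean of the indicators = share of each category within a neighbourhood
profile = onehot.groupby("Neighborhood").mean()
print(profile)
```

Each row of `profile` is a numeric description of a neighbourhood that could be passed to a clustering algorithm.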

### Observations

The overwhelming majority of London's tourist sites are concentrated near the Square Mile (the City of London). Therefore, although real estate in this area is relatively expensive, I would recommend opening the new hotel there.