# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Having recently worked in Germany, I have been engaged by a consortium of German investors in the restaurant business. They are investigating whether to open one, and perhaps even several, ***German-themed restaurants*** in ***Central London***. They have commissioned me to find one or more optimal locations for such restaurants in Central London.

London is one of the world's oldest, most popular and most visited cities. It has become very cosmopolitan, with restaurants serving the cuisines of many different nations, and there is a high density of restaurants and food stands in Central London. I will map the restaurants in the vicinity of the tube stops that I consider to make up Central London, and investigate how many German-themed restaurants lie within a radius of 350 metres of each selected tube stop. Ideally I would like to find the tube stops with few or no German-themed restaurants nearby. I will use data science techniques to collate this information, present the data graphically, and use machine-learning techniques such as clustering to determine good street addresses at which to set up German-themed restaurants.

As a caveat, this analysis is provided as a guide. It will still be incumbent on the consortium of investors to do further due diligence in the locations provided. They would need to perform further market research in the suggested locations to have a high confidence of success.

## Data <a name="data"></a>

Based on the definition of our business problem, factors that will influence our decision are:
* The number of existing restaurants in the tube-stop radius, i.e. any type of restaurant
* The number of German-themed restaurants in the tube-stop radius, if any

I have discussed this with the consortium of investors and we selected 14 tube stops that we consider to make up Central London. (This is subjective, but it is **a starting point that has been agreed with the investors**.) I have also chosen Waterloo as the centre point for Central London, which is another subjective decision. Each tube stop will be analysed to count the restaurants, and the German-themed restaurants, around it. The 350 metre radii around neighbouring tube stops can overlap. To prevent duplicate restaurants appearing, I store the restaurants in a dictionary keyed by their Foursquare Id.
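As a minimal sketch of the de-duplication idea, the snippet below stores venues from two overlapping radii in a dictionary keyed by venue Id. The venue Ids and names are made up for illustration; only the dictionary-keyed-by-Id approach comes from the analysis itself.

```python
# Hypothetical venue records: (tube_stop, venue_id, venue_name).
# The Ids and names are invented for illustration only.
venues_by_stop = {
    'Embankment': [('Embankment', 'id-001', 'Restaurant A'),
                   ('Embankment', 'id-002', 'Restaurant B')],
    'Charing Cross': [('Charing Cross', 'id-002', 'Restaurant B'),
                      ('Charing Cross', 'id-003', 'Restaurant C')],
}

unique_venues = {}
for stop, venues in venues_by_stop.items():
    for venue in venues:
        venue_id = venue[1]
        # Assigning to an existing key overwrites instead of duplicating,
        # so a venue seen from two stops is stored once (the last stop wins).
        unique_venues[venue_id] = venue

print(len(unique_venues))  # 3
```

One consequence of this design is that a venue inside two radii is attributed to whichever tube stop was processed last.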

We will need the following data sources to extract and generate the required information:
* List of London tube-stops and coordinates: **https://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations**. I have scraped the Web page and created a csv file with tube-stop name, latitude, and longitude. From this I curated the list of 14 tube-stops we are interested in analysing, and saved them in a csv file, **SubSelected.csv**.
* The number of restaurants and their type and location in every tube-stop radius will be obtained using the **Foursquare API**.

In [1]:
import pandas as pd
import requests

In [3]:
selectedSubDF = pd.read_csv(r'C:\work\Coursera\IBM Data Science Professional\Course 9 - Capstone Project\Week 5 - Project Part 2\My Submission\SubSelected.csv')
selectedSubDF

Unnamed: 0,Name,Latitude,Longitude
0,Waterloo,51.50322,-0.11328
1,Westminster,51.50121,-0.12489
2,Embankment,51.50717,-0.12195
3,Charing Cross,51.507108,-0.122963
4,St.James's Park,51.49971,-0.13394
5,Victoria,51.496629,-0.144009
6,Piccadilly Circus,51.51022,-0.13392
7,Green Park,51.50674,-0.14276
8,Oxford Circus,51.51517,-0.14119
9,Tottenham Court Road,51.516721,-0.130939


Now let's break out each column individually:

In [4]:
tube_name = selectedSubDF['Name']
latitude = selectedSubDF['Latitude']
longitude = selectedSubDF['Longitude']
centLonCentre = [latitude[0], longitude[0]]

Now let's visualise the 14 tube-stops against a map of Central London:

In [5]:
#!pip install folium
import folium

In [6]:
map_cent_lon = folium.Map(location = centLonCentre, zoom_start = 13)
folium.Marker(centLonCentre, popup='Waterloo').add_to(map_cent_lon)

for name, lat, lon in zip(tube_name, latitude, longitude):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html = True)
    folium.Circle([lat, lon], radius = 350, color='blue', fill = False).add_to(map_cent_lon)
    folium.Marker([lat, lon], popup = label).add_to(map_cent_lon)
    
map_cent_lon

### Foursquare
We now use the Foursquare API to get information on restaurants within a radius of 350 metres from each tube-stop.

We are only interested in venues in the 'food' category, and only those that are **proper restaurants**. We will exclude coffee shops, pizza places, bakeries etc. as these are not considered direct competitors for a restaurant by the investor consortium. So, we will include only venues that have 'restaurant' in the category name in our list, and we will ensure we detect and include all the subcategories of the specific 'German Restaurant' category, as we need information on all German-themed restaurants in the tube-stop radius.
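To make the request concrete, here is a rough sketch of how the query URL for one tube stop could be assembled with the standard library. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used by the notebook's own request code.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lon, category_id, client_id, client_secret,
                      radius=350, limit=100, version='20180724'):
    """Build the explore URL for one tube stop (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lon" readable instead of encoding it as %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Waterloo coordinates and the 'food' root category; credentials are placeholders
url = build_explore_url(51.50322, -0.11328, '4d4b7105d754a06374d81259',
                        'MY_ID', 'MY_SECRET')
print(url)
```

Building the query from a dictionary rather than a long format string makes it harder to put an argument in the wrong position.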

In [7]:
# Please enter your FourSquare Client Id and Client Secret below to execute this code
foursquare_client_id = ''
foursquare_client_secret = ''

In [8]:
# Note: I reused some functions in the example Jupyter Notebook given on Coursera, with changes 
# specific to my requirements. I believe these functions were originally written by Michael Hörz.

# Category Id's corresponding to German restaurants were taken from Foursquare web site:
# https://developer.foursquare.com/docs/resources/categories

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

# A list with 'German Restaurant' category and all the subcategories of German restaurants
german_rest_cat = ['4bf58dd8d48988d10d941735','56aa371ce4b08b9a8d573583','56aa371ce4b08b9a8d573572',
                   '56aa371ce4b08b9a8d57358e','56aa371ce4b08b9a8d57358b','56aa371ce4b08b9a8d573574',
                   '56aa371ce4b08b9a8d573592','56aa371ce4b08b9a8d573578','56aa371ce4b08b9a8d57357b',
                   '56aa371ce4b08b9a8d573587','56aa371ce4b08b9a8d57357f','56aa371ce4b08b9a8d573576']

# Determine if this listing is a restaurant, and if it is a German-themed specific restaurant
def check_is_restaurant(categories, specific_filter = None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
                
        if 'fast food' in category_name:
            restaurant = False
            
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
            
    return restaurant, specific

# Break out the categories in the JSON returned by Foursquare
def get_categories(categories):

    return [(cat['name'], cat['id']) for cat in categories]

# Format the address, and exclude the country as it is irrelevant for the analysis
def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', United Kingdom', '')
    
    return address

# Use Foursquare to get locations around tube-stops and curate the JSON data returned
def get_venues_near_location(tube_stop_name, lat, lon, category, client_id, client_secret, radius = 500, limit = 100):

    venues = []
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
        
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        venues = [(tube_stop_name, item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except (requests.RequestException, KeyError, IndexError, ValueError):
        venues = []
        
    return venues

In [9]:
# Main controlling function to get Foursquare restaurant details for each tube-stop
def get_restaurants(tube_stop_name, lats, lons):

    restaurants = {}
    german_restaurants = {}
    tube_stop_restaurants = []

    print('Obtaining venues around candidate tube-stops:', end = '')
    
    for tsname, lat, lon in zip(tube_stop_name, lats, lons):
    
        # Not too concerned with overlap or missing restaurants by not covering full area; proximity to tube-stop is most important factor 
        # Using dictionaries to remove any duplicates resulting from radius overlaps where they occur
        
        venues = get_venues_near_location(tsname, lat, lon, food_category, foursquare_client_id, foursquare_client_secret, radius = 350, limit = 100)

        area_restaurants = []
        
        for venue in venues:
            tube_stop_name = venue[0]
            venue_id = venue[1]
            venue_name = venue[2]
            venue_categories = venue[3]
            venue_latlon = venue[4]
            venue_address = venue[5]
            venue_distance = venue[6]
            
            # if venue_name == 'The Delaunay': 
            #     print('The Delaunay, categories: {0}'.format(venue_categories))
            
            is_restaurant, is_german = check_is_restaurant(venue_categories, specific_filter = german_rest_cat)
            
            if is_restaurant:
                # create tuple with restaurant details
                restaurant = (tube_stop_name, venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_german)
                
                if venue_distance <= 350:
                    area_restaurants.append(restaurant)
                    
                restaurants[venue_id] = restaurant # append to a dictionary with venue id as key
                
                if is_german:
                    german_restaurants[venue_id] = restaurant # append to a dictionary with venue id as key
                    
        tube_stop_restaurants.append(area_restaurants) # append to a list
        
        print(' .', end = '')
        
    print(' done.')
    
    return restaurants, german_restaurants, tube_stop_restaurants

# Call Foursquare with all selected tube stops to find restaurant details
restaurants, german_restaurants, tube_stop_restaurants = get_restaurants(tube_name, latitude, longitude)

Obtaining venues around candidate tube-stops: . . . . . . . . . . . . . . done.


In [10]:
import numpy as np

print('Total number of restaurants:', len(restaurants))
print('Total number of German-themed restaurants:', len(german_restaurants))
print('Percentage of German-themed restaurants: {:.2f}%'.format(len(german_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in tube-stop radius:', np.array([len(r) for r in tube_stop_restaurants]).mean())

Total number of restaurants: 438
Total number of German-themed restaurants: 1
Percentage of German-themed restaurants: 0.23%
Average number of restaurants in tube-stop radius: 35.9285714286
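The printed figures can be cross-checked with a little arithmetic. The sketch below only reuses the numbers shown in the output, together with the 14 selected tube stops. Reading the difference as overlap is my own interpretation, not a result from the notebook.

```python
unique_restaurants = 438        # len(restaurants), from the printed output
german = 1                      # len(german_restaurants)
stops = 14                      # number of selected tube stops
mean_per_stop = 35.9285714286   # printed average per tube-stop radius

share = german / unique_restaurants * 100
per_stop_total = round(mean_per_stop * stops)

print('{:.2f}%'.format(share))               # 0.23%
print(per_stop_total)                        # 503
print(per_stop_total - unique_restaurants)   # 65
```

The per-stop lists hold 503 entries in total while only 438 distinct restaurants were found, so at least 65 entries are restaurants counted from more than one overlapping radius.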


Now let's see the German-themed restaurants in Central London:

In [11]:
german_restaurants

{'4ca5bf7b7334236a03b01858': ('Charing Cross',
  '4ca5bf7b7334236a03b01858',
  'Herman ze German',
  51.50821519463621,
  -0.12415690292012567,
  '19 Villiers St, Charing Cross, Greater London, WC2N 6NE',
  148,
  True)}

**Note:** There are very few German-themed restaurants in Central London! I have been to **The Delaunay**, about 200 metres from **Charing Cross tube-stop**. This is a very good Viennese-style (Austrian) restaurant, but Foursquare has not tagged it with a category that our search picks up as German-themed. I would say that this style of restaurant would not be direct competition to a German-themed restaurant anyway.

In [12]:
map_cent_lon = folium.Map(location = centLonCentre, zoom_start = 13)
folium.Marker(centLonCentre, popup='Waterloo').add_to(map_cent_lon)

# put the tube-stops on the map
for name, lat, lon in zip(tube_name, latitude, longitude):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html = True)
    folium.Circle([lat, lon], radius = 350, color='blue', fill = False).add_to(map_cent_lon)
    folium.Marker([lat, lon], popup = label).add_to(map_cent_lon)

# put the restaurants on the map
for rest in restaurants.values():
    lat = rest[3]
    lon = rest[4]
    is_german = rest[7]
    color = 'red' if is_german else 'blue'
    
    folium.CircleMarker([lat, lon], radius = 3, color = color, fill = True, fill_color = color, fill_opacity = 1).add_to(map_cent_lon)
    
map_cent_lon

## Methodology <a name="methodology"></a>

We will now analyse the data we have collected. We will look at the 'restaurant density' across the different central London tube-stops. We will use heat maps to identify the promising areas, with a low number of restaurants in general (and no German-themed restaurants in the vicinity). We will focus on those areas.

The final step will involve **creating clusters** of locations that meet some basic requirements established in our discussion with the investor consortium: we will consider locations with **no more than thirty restaurants in a radius of 250 metres**. (We will ignore German-themed restaurants in the area because there is only one, so this has a negligible effect on the analysis.) We will map all such locations and also create clusters (using k-Means clustering) of those locations to identify areas which should be a **starting point** for the final analysis of a suitable street location. This final analysis will be performed by the investor consortium.

## Analysis <a name="analysis"></a>

In [13]:
tube_stop_restaurants_count = [len(res) for res in tube_stop_restaurants]
# tube_stop_restaurants_count

selectedSubDF['Restaurants in area'] = tube_stop_restaurants_count

print('Average number of restaurants in radius of 350 metres within each tube-stop: ', np.array(tube_stop_restaurants_count).mean())

selectedSubDF

Average number of restaurants in radius of 350 metres within each tube-stop:  35.9285714286


Unnamed: 0,Name,Latitude,Longitude,Restaurants in area
0,Waterloo,51.50322,-0.11328,26
1,Westminster,51.50121,-0.12489,1
2,Embankment,51.50717,-0.12195,10
3,Charing Cross,51.507108,-0.122963,10
4,St.James's Park,51.49971,-0.13394,11
5,Victoria,51.496629,-0.144009,23
6,Piccadilly Circus,51.51022,-0.13392,73
7,Green Park,51.50674,-0.14276,65
8,Oxford Circus,51.51517,-0.14119,56
9,Tottenham Court Road,51.516721,-0.130939,70


In [14]:
restaurant_latlons = [[res[3], res[4]] for res in restaurants.values()]

# latitude and longitude are at indices 3 and 4 of each restaurant tuple (index 2 is the name)
german_latlons = [[res[3], res[4]] for res in german_restaurants.values()]

In [15]:
from folium.plugins import HeatMap

map_cent_lon = folium.Map(location = centLonCentre, zoom_start = 13)
folium.TileLayer('cartodbpositron').add_to(map_cent_lon) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_cent_lon)
folium.Marker(centLonCentre).add_to(map_cent_lon)
folium.Circle(centLonCentre, radius = 1000, fill = False, color = 'black').add_to(map_cent_lon)
folium.Circle(centLonCentre, radius = 2000, fill = False, color = 'black').add_to(map_cent_lon)
folium.Circle(centLonCentre, radius = 3000, fill = False, color = 'black').add_to(map_cent_lon)
map_cent_lon

One can see that the Central London area is quite densely populated with restaurants. (The black concentric circles mark each kilometre from the centre at Waterloo.) There are some less densely populated areas around the fringes. The main point of interest is the general sparsity of German-themed restaurants. We could postulate that placing a German-themed restaurant anywhere in the Central London tube-stop area where there are currently no German-themed restaurants within 350 metres would be acceptable. (Only Charing Cross has a German-themed restaurant within 350 metres; **we could exclude this tube-stop from further analysis**.) However, if we look at the list of tube-stops and number of restaurants, there are some tube-stops with fewer restaurants, i.e. a lower density:

In [16]:
selectedSubDF.sort_values(by = 'Restaurants in area', ascending = False)

Unnamed: 0,Name,Latitude,Longitude,Restaurants in area
6,Piccadilly Circus,51.51022,-0.13392,73
11,Leicester Square,51.51148,-0.12849,73
9,Tottenham Court Road,51.516721,-0.130939,70
7,Green Park,51.50674,-0.14276,65
13,Covent Garden,51.51308,-0.12427,63
8,Oxford Circus,51.51517,-0.14119,56
0,Waterloo,51.50322,-0.11328,26
5,Victoria,51.496629,-0.144009,23
10,Baker Street,51.522236,-0.15708,12
4,St.James's Park,51.49971,-0.13394,11


Charing Cross is towards the bottom of the table. Based on the low density of restaurants at that tube-stop and the high market value of the area, I would **include Charing Cross for the final analysis**.

Let's look at all tube-stops where there are no more than 30 restaurants within 250 metres of the tube stop:

In [17]:
# put the tube-stop restaurant data in a DataFrame
# each restaurant tuple holds the tube stop at index 0, the name at 2 and the distance at 6
distDF = pd.DataFrame(
    [[rest[0], rest[2], rest[6]] for ts in tube_stop_restaurants for rest in ts],
    columns = ['Tube Stop', 'Restaurant', 'Distance'])
        
distDF.head()

Unnamed: 0,Tube Stop,Restaurant,Distance
0,Waterloo,Po Cha,254
1,Waterloo,Marie's Cafe,239
2,Waterloo,Wahaca,247
3,Waterloo,Brasserie Blanc,313
4,Waterloo,Mamuska,290


In [48]:
# count the restaurants within 250 metres of each tube stop, to find the stops with 30 or fewer
dist250DF = distDF[distDF['Distance'] <= 250]
dist250SumDF = dist250DF.groupby('Tube Stop').count().sort_values(by = 'Distance', ascending = False)
del dist250SumDF['Restaurant']
dist250SumDF.rename(columns = {'Distance': 'Total'}, inplace = True)
dist250SumDF

Unnamed: 0_level_0,Total
Tube Stop,Unnamed: 1_level_1
Piccadilly Circus,50
Leicester Square,37
Covent Garden,35
Green Park,34
Oxford Circus,26
Tottenham Court Road,26
Victoria,18
Waterloo,13
Baker Street,7
Charing Cross,5


From this, we are left with 10 tube-stops with 30 or fewer restaurants in a 250 metre radius from the stop. I have put the details in the file "FinalSelected.csv".
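The selection rule can also be applied programmatically rather than by reading the table. The following is a minimal sketch in plain Python, using made-up stop names and distances as stand-ins for the rows of the distance table. The threshold is lowered from 30 to 1 purely so that the toy data shows a stop being excluded.

```python
from collections import Counter

# Hypothetical (tube_stop, distance_in_metres) pairs standing in for the table rows
rows = [('Stop A', 120), ('Stop A', 240), ('Stop A', 310),
        ('Stop B', 90), ('Stop C', 300)]
all_stops = ['Stop A', 'Stop B', 'Stop C']

max_distance = 250     # metres, as in the analysis
max_restaurants = 1    # 30 in the analysis; lowered here for the toy data

# Count only the restaurants within the distance limit of each stop
counts = Counter(stop for stop, dist in rows if dist <= max_distance)

# A Counter returns 0 for missing keys, so stops with no restaurants in range
# (which a group-by count would silently drop) are still considered.
selected = [stop for stop in all_stops if counts[stop] <= max_restaurants]
print(selected)  # ['Stop B', 'Stop C']
```

Iterating over the full list of stops, rather than over the counted groups, matters here because the quietest stops are exactly the ones of interest.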

In [19]:
finalStopDF = pd.read_csv(r'C:\work\Coursera\IBM Data Science Professional\Course 9 - Capstone Project\Week 5 - Project Part 2\My Submission\FinalSelected.csv')
finalStopDF

Unnamed: 0,Name,Latitude,Longitude
0,Waterloo,51.50322,-0.11328
1,Westminster,51.50121,-0.12489
2,Embankment,51.50717,-0.12195
3,Charing Cross,51.507108,-0.122963
4,St.James's Park,51.49971,-0.13394
5,Victoria,51.496629,-0.144009
6,Oxford Circus,51.51517,-0.14119
7,Tottenham Court Road,51.516721,-0.130939
8,Baker Street,51.522236,-0.15708
9,Regent's Park,51.52344,-0.14713


Now use k-Means clustering to **determine 5 centres** to use as final locations for the investor consortium.

In [20]:
from sklearn.cluster import KMeans

finalCoords = finalStopDF[['Latitude', 'Longitude']].values  # create a list using "values"

no_of_clusters = 5
kmeans = KMeans(n_clusters = no_of_clusters, random_state = 0).fit(finalCoords)
# for coords in kmeans.cluster_centers_: print(coords)
cluster_centres = kmeans.cluster_centers_
cluster_centres

array([[ 51.522838  ,  -0.152105  ],
       [ 51.50322   ,  -0.11328   ],
       [ 51.51594537,  -0.13606456],
       [ 51.49816935,  -0.13897426],
       [ 51.50516267,  -0.12326767]])

Now plot the 5 centres from k-Means analysis. Note that the points are within 3.5 kilometres of each other:

In [21]:
map_cent_lon = folium.Map(location = centLonCentre, zoom_start = 14)
folium.TileLayer('cartodbpositron').add_to(map_cent_lon)
HeatMap(restaurant_latlons).add_to(map_cent_lon)
folium.Circle(centLonCentre, radius=3500, color='white', fill = True, fill_opacity = 0.4).add_to(map_cent_lon)
folium.Marker(centLonCentre).add_to(map_cent_lon)

for lat, lon in cluster_centres:
    folium.Circle([lat, lon], radius = 350, color = 'green', fill = False).add_to(map_cent_lon) 
    
for lat, lon in cluster_centres:
    folium.CircleMarker([lat, lon], radius = 2, color = 'blue', fill = True, fill_color = 'blue', fill_opacity = 1).add_to(map_cent_lon)
    
map_cent_lon
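
The statement that the five centres lie within 3.5 kilometres of each other can be checked with a great-circle distance calculation. This is a sketch using the haversine formula and an assumed mean Earth radius of 6371 km; the coordinates are copied from the k-Means output, and the helper name `haversine_km` is my own.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Cluster centres as printed by the k-Means cell
centres = [(51.522838, -0.152105),
           (51.50322, -0.11328),
           (51.51594537, -0.13606456),
           (51.49816935, -0.13897426),
           (51.50516267, -0.12326767)]

# Largest distance over every pair of centres
max_km = max(haversine_km(*a, *b) for a, b in combinations(centres, 2))
print(round(max_km, 2))
```

If the largest pairwise distance printed is below 3.5, every pair of centres satisfies the claim.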

In [22]:
# Please enter your Google Maps API Key below to execute this code
google_api_key = ""

In [23]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

Now, find the addresses for the 5 centre points from the clustering:

In [24]:
candidate_area_addresses = []

print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')

for lat, lon in cluster_centres:

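    addr = get_address(google_api_key, lat, lon)
    # get_address returns None on failure, so fall back to the raw coordinates
    addr = addr.replace(', UK', '') if addr else '({:.6f}, {:.6f})'.format(lat, lon)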
    
    candidate_area_addresses.append(addr)    
    print(addr)

Addresses of centers of areas recommended for further analysis

Heron House, 19 Marylebone Rd, Marylebone, London NW1 5LT
Waterloo Underground Station, York Rd, South Bank, London SE1 7ND
129 Oxford St, Soho, London W1D 2HU
1-2 Castle Ln, Westminster, London SW1E 6DR
A3211, Westminster, London SW1A 2EF


## Results and Discussion <a name="results"></a>

The analysis shows that there are surprisingly few German-themed restaurants in Central London: the search found only one in the area we investigated. By further analysing the Central London tube-stops, we found the 10 tube-stops with no more than 30 restaurants within 250 metres. Using k-Means clustering, we then derived 5 potential locations to act as starting points for a market research initiative.

## Conclusion <a name="conclusion"></a>

German culture of the kind exhibited at Oktoberfest, i.e. Dirndls and Lederhosen, Wurst and big mugs of beer, is gaining popularity in England, and especially in London. The analysis seems to point clearly to a market opportunity to establish more German-themed restaurants in Central London. The caveat is that the investor consortium will need to do more due diligence on the starting addresses I provide. At least basic, and preferably thorough, market research would be needed in these areas to establish reasonable confidence in the venture's success.