# Capstone Project - The Battle of Neighbourhoods (Week 2)

# Best London neighbourhood for a coffee shop

## Table of contents
* [Introduction & Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction & Business Problem <a name="introduction"></a>

The UK’s coffee consumption soared to 95 million cups a day in 2018, up from 70 million in 2008, an increase of 25 million cups a day over 10 years. This popularity has translated into a big increase in the number of coffee shops, or cafés, in London, the UK capital and one of the world's global cities. Most of the popular cafés belong to big chains, such as the USA's Starbucks or the UK's own Costa Coffee and Caffè Nero.

But Londoners are starting to tire of drinking the same chain coffee every day and are turning to independent and speciality coffee shops, which take great care over each cup and use single-origin, ethically sourced, organic beans roasted locally by artisan roasters.

London’s obsession with coffee is showing no signs of slowing. Across the city, cafés are constantly popping up, serving up perfectly executed flat whites, espressos and cold-drip Americanos to the masses. As the following graph from the BBC demonstrates, the number of cafés and other hospitality venues selling coffee has increased significantly since the early 2000s.

<img src="https://ichef.bbci.co.uk/news/624/cpsprodpb/1845B/production/_97791499_uk_coffee_shop_624.png">

##### Aim
The aim of this project is to find out which neighbourhood in London would be the best to open a new café. 

##### Target audience
The target audience is an entrepreneur, or group of entrepreneurs, looking to set up a new independent café in London. This project will help them find out which neighbourhoods have the most and the fewest cafés, so they can open their café in a less saturated neighbourhood.


## Data <a name="data"></a>

This section will describe the data that will be used to solve the problem. First we are going to load the libraries needed:

In [1]:
# library to handle data in a vectorized manner
import numpy as np

# library for data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# install the Geocoder library to get location data
!pip install geocoder
import geocoder

#library for processing XML
!pip install lxml
import lxml

#library to handle JSON files
import json 

#library to handle requests
import requests 

# transform JSON data into a pandas dataframe
# (pandas >= 1.0 exposes json_normalize at the top level; the pandas.io.json path is deprecated)
from pandas import json_normalize 

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
!pip install folium 
import folium 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



#### Dataset 1

London is divided into 32 London boroughs plus the City of London, which is the central part of London, or downtown. Each borough contains several neighbourhoods (although some neighbourhoods span several boroughs).

Our London neighbourhood and borough data will come from the Wikipedia page "List of areas of London": https://en.wikipedia.org/wiki/List_of_areas_of_London

The data is presented in a Wikipedia table and we transform it to a pandas data frame for our analysis:

In [0]:
#The Wikipedia table is extracted into a pandas dataframe
london_df=pd.read_html("https://en.wikipedia.org/wiki/List_of_areas_of_London")[1]

#Rename columns
london_df.columns = ['Neighbourhood', 'Borough', 'Post town', 'Postcode district', 'Dial code', 'OS grid ref']

# Remove Borough reference numbers with []
london_df['Borough']= london_df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))

# Remove Neighbourhood text in between parentheses
# (regex=True is required on recent pandas versions, where str.replace treats the pattern as a literal by default)
london_df['Neighbourhood']=london_df['Neighbourhood'].str.replace(r"\([^()]*\)","", regex=True)

#Remove all neighbourhoods whose post town is not LONDON, as those are neighbourhoods on the outskirts
indexPost=london_df[london_df['Post town'] != 'LONDON'].index
london_df.drop(indexPost,inplace= True)
london_df.reset_index(drop=True, inplace= True)

#Drop the Post town, postcode, dial code and OS grid ref columns as we don't need that data for our analysis
london_df.drop(columns=['Post town', 'Postcode district','Dial code','OS grid ref'], inplace= True)

Now we have our pandas data frame with all the London neighbourhoods and boroughs:

In [266]:
london_df.head(5)

Unnamed: 0,Neighbourhood,Borough,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",51.49245,0.12127
1,Acton,"Ealing, Hammersmith and Fulham",51.51324,-0.26746
2,Aldgate,City,51.513308,-0.077762
3,Aldwych,Westminster,51.513307,-0.117092
4,Anerley,Bromley,51.41233,-0.06539
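The two string clean-ups above (dropping footnote markers such as `[7]` from the borough and parenthetical notes from the neighbourhood) can be tried on plain strings before running them on the full table. This is a minimal sketch using only the standard library `re` module; the helper names and the sample strings are made up for illustration and are not taken from the Wikipedia table.

```python
import re

def clean_borough(text):
    """Remove footnote markers such as '[7]' from a borough name."""
    return re.sub(r"\[\d+\]", "", text).strip()

def clean_neighbourhood(text):
    """Remove parenthetical notes, then trim surrounding whitespace."""
    return re.sub(r"\([^()]*\)", "", text).strip()

# Hypothetical sample values, for illustration only
borough = clean_borough("Bexley, Greenwich[7]")
neighbourhood = clean_neighbourhood("Bromley (also Bromley-by-Bow)")
print(borough)
print(neighbourhood)
```

Unlike the notebook cell, this sketch also trims the whitespace that is left behind once a parenthetical note has been removed.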


#### Dataset 2

To obtain the coordinate data of the London neighbourhoods, the Geocoder package is used to get the latitude and longitude data for each neighbourhood that is needed for the Foursquare API. 

The Geocoder location data will be used to enrich the data frame of London neighbourhoods obtained from Wikipedia above.

In [0]:
# Defining a function to get the coordinates of a London neighbourhood
def get_latlng(neighbourhood, max_attempts=5):
    
    # Initialize the location (lat. and long.) to None
    lat_lng_coords = None
    attempts = 0
    
    # Retry until the neighbourhood is geocoded, giving up after max_attempts
    # so that a name ArcGIS cannot resolve does not loop forever
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(neighbourhood))
        lat_lng_coords = g.latlng
        attempts += 1
    
    # [None, None] becomes NaN when the coordinates are loaded into a dataframe
    return lat_lng_coords if lat_lng_coords is not None else [None, None]
# Geocoder ends here

In [0]:
#We call the get_latlng function that we defined earlier passing all the London neighbourhoods
london_neighbourhoods=london_df['Neighbourhood']

coordinates = [get_latlng(neighbourhood) for neighbourhood in london_neighbourhoods.tolist()]


In [269]:
# The obtained coordinates (latitude and longitude) are joined with the london dataframe
coordinates_df = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])

london_df['Latitude'] = coordinates_df['Latitude']
london_df['Longitude'] = coordinates_df['Longitude']

#Now the london_df data frame has Neighbourhood and Borough enriched with Latitude and Longitude data from Geocoder
london_df.head(5)

#Now we export the dataframe containing all london Neighbourhood with coordinates into a csv to speed up the process if we need to run it again
#export_csv = london_df.to_csv ('london_neighbourhood_df.csv', index = None, header=True)

Unnamed: 0,Neighbourhood,Borough,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",51.49245,0.12127
1,Acton,"Ealing, Hammersmith and Fulham",51.51324,-0.26746
2,Aldgate,City,51.513308,-0.077762
3,Aldwych,Westminster,51.513307,-0.117092
4,Anerley,Bromley,51.41233,-0.06539


#### Dataset 3

The Foursquare API will be used to search for venues of a specific kind (in our case cafés or coffee shops) around the coordinates of each London neighbourhood.

We will use the Foursquare explore endpoint with the parameter 'section' set to 'coffee'. This parameter accepts one of the following values: food, drinks, coffee, shops, arts, outdoors, sights, trending, nextVenues. Setting it to 'coffee' limits the results to venues that serve coffee; these may include cafés and coffee shops, but also restaurants or ice cream parlours.
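As an illustration of how such a request is put together, the sketch below builds the explore URL for one neighbourhood centre with the standard library instead of string formatting. The endpoint and parameter names are the ones used in this notebook; the function name `build_explore_url` and the credential placeholders are assumptions made for the example, and no request is actually sent.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20190901", section="coffee",
                      radius=500, limit=30):
    """Return the explore URL for one neighbourhood centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude of the search centre
        "section": section,     # restricts results to venues serving coffee
        "radius": radius,       # search radius in metres
        "limit": limit,         # maximum number of venues returned
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Aldgate coordinates from the table above; credentials are placeholders
url = build_explore_url(51.513308, -0.077762,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Letting `urlencode` assemble the query string takes care of escaping, for example the comma inside the `ll` value.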

## Methodology <a name="methodology"></a>

First of all we define the Foursquare credentials and version (this will be in a hidden cell below for privacy):

In [0]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20190901' # Foursquare API version

Let's create a function to explore the venues of all the neighbourhoods in London:

In [0]:
def getNearbyVenues(names, latitudes, longitudes, section='coffee', radius=500, LIMIT=30):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&section={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            section,
            radius, 
            LIMIT)
        
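        try:
            # make the GET request (the timeout stops one slow call from hanging the loop)
            results = requests.get(url, timeout=30).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except (requests.RequestException, ValueError, KeyError, IndexError):
            # network failure, invalid JSON, or a response without venue groups
            print('No venues')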
          
          
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return(nearby_venues)

Now we run the above function on each neighbourhood and create a new dataframe called london_venues.

In [111]:
london_venues = getNearbyVenues(names=london_df['Neighbourhood'],
                                   latitudes=london_df['Latitude'],
                                   longitudes=london_df['Longitude']
                                  )

Abbey Wood
Acton
Aldgate
Aldwych
Anerley
Angel
Archway
Arnos Grove
Balham
Bankside
Barbican
Barnes
Barnsbury
Battersea
Bayswater
Bedford Park
Belgravia
Bellingham
Belsize Park
Bermondsey
Bethnal Green
Blackfriars
Blackheath
Blackheath Royal Standard
Blackwall
Bloomsbury
Bounds Green
Bow
Bowes Park
Brent Cross
Brent Park
Brixton
Brockley
Bromley 
Brompton
Brondesbury
Brunswick Park
Burroughs, The
Camberwell
Cambridge Heath
Camden Town
Canary Wharf
Cann Hall
Canning Town
Canonbury
Castelnau
Catford
Chalk Farm
Charing Cross
Charlton
Chelsea
Childs Hill
Chinatown
Chinbrook
Chingford
Chiswick
Church End
Church End
Clapham
Clerkenwell
Colindale
Colliers Wood
Colney Hatch
Covent Garden
Cricklewood
Crofton Park
Crossness
Crouch End
Crystal Palace
Cubitt Town
Custom House
Dalston
Dartford
De Beauvoir Town
Denmark Hill
Deptford
Dollis Hill
Dulwich
Ealing
Earls Court
Earlsfield
East Dulwich
East Finchley
East Ham
East Sheen
Edmonton
Elephant and Castle
Eltham
Farringdon
Finchley
Finsbury
Finsbury
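To make the response handling in `getNearbyVenues` easier to follow, here is a small offline sketch of the flattening step. The payload is hand-built and contains only the fields that the function above reads; a real Foursquare response carries many more. The helper name `extract_venues` and the `'Unknown'` fallback are assumptions for this example.

```python
def extract_venues(neighbourhood, payload):
    """Flatten one explore response into (neighbourhood, name, lat, lng, category) rows."""
    try:
        items = payload["response"]["groups"][0]["items"]
    except (KeyError, IndexError):
        return []  # the response carries no venue groups
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append((
            neighbourhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),  # first listed category
        ))
    return rows

# Hand-built payload mimicking only the fields used above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Bean @ Work",
               "location": {"lat": 51.491172, "lng": 0.120649},
               "categories": [{"name": "Coffee Shop"}]}},
]}]}}

rows = extract_venues("Abbey Wood", sample)
empty = extract_venues("Nowhere", {"response": {}})
print(rows)
print(empty)
```

Returning an empty list for a response without groups plays the same role as the 'No venues' branch in the notebook function.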

In [271]:
#We save the results to a csv file just in case we need to re-run the script as the Foursquare API has limited calls
#export_df_csv = london_venues.to_csv ('london_venues_df.csv', index = None, header=True)
#london_venues=pd.read_csv('london_venues_df.csv')
london_venues.head(10)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbey Wood,51.49245,0.12127,Bean @ Work,51.491172,0.120649,Coffee Shop
1,Abbey Wood,51.49245,0.12127,Abbey Cafe,51.489754,0.120822,Café
2,Aldgate,51.513308,-0.077762,The Association,51.513733,-0.079132,Coffee Shop
3,Aldgate,51.513308,-0.077762,Kahaila Aldgate,51.514046,-0.077001,Coffee Shop
4,Aldgate,51.513308,-0.077762,Benk + Bo,51.515731,-0.075875,Bakery
5,Aldgate,51.513308,-0.077762,Notes Coffee Roaster & Wine Bar,51.514643,-0.080671,Coffee Shop
6,Aldgate,51.513308,-0.077762,Curators Coffee Studio,51.512085,-0.082568,Coffee Shop
7,Aldgate,51.513308,-0.077762,Tifinbox,51.516345,-0.077195,Indian Restaurant
8,Aldgate,51.513308,-0.077762,Planet Organic,51.516827,-0.078636,Organic Grocery
9,Aldgate,51.513308,-0.077762,Black Sheep Coffee,51.51399,-0.075459,Coffee Shop
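The `radius=500` argument means that each venue should lie within roughly 500 metres of its neighbourhood centre. A quick way to sanity-check a row is the haversine great-circle distance, sketched below for the first row of the table (Abbey Wood and Bean @ Work). The Earth radius constant is the usual mean-radius approximation, and the function name is chosen for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Neighbourhood centre (Abbey Wood) and venue (Bean @ Work) from the first row
distance = haversine_m(51.49245, 0.12127, 51.491172, 0.120649)
print(round(distance), "metres from the neighbourhood centre")
```

Because the search circles of adjacent neighbourhood centres can overlap, the same venue may be returned for more than one neighbourhood, which is worth keeping in mind when reading the per-neighbourhood counts.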


This data frame contains all venues that serve coffee, but for the scope of this project we only want to focus on cafés and coffee shops, so we are going to drop all the venues whose category is neither 'Café' nor 'Coffee Shop'.

In [273]:
# Get the indexes of the rows whose Venue Category is neither Café nor Coffee Shop, so they can be dropped
indexCafe=london_venues[ (london_venues['Venue Category'] != 'Café') & (london_venues['Venue Category'] != 'Coffee Shop')].index
london_cafes=london_venues.drop(indexCafe)
london_cafes.reset_index(drop=True, inplace= True)
london_cafes.head(5)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbey Wood,51.49245,0.12127,Bean @ Work,51.491172,0.120649,Coffee Shop
1,Abbey Wood,51.49245,0.12127,Abbey Cafe,51.489754,0.120822,Café
2,Aldgate,51.513308,-0.077762,The Association,51.513733,-0.079132,Coffee Shop
3,Aldgate,51.513308,-0.077762,Kahaila Aldgate,51.514046,-0.077001,Coffee Shop
4,Aldgate,51.513308,-0.077762,Notes Coffee Roaster & Wine Bar,51.514643,-0.080671,Coffee Shop


Now let's check how many cafés were returned for each neighbourhood:

In [0]:
cafes_Neighbourhood=pd.DataFrame(london_cafes.groupby('Neighbourhood').size())

In [0]:
#We reset the index
cafes_Neighbourhood.reset_index(inplace= True)
#Name the two columns
cafes_Neighbourhood.columns=['Neighbourhood','Count']

The 10 neighbourhoods with the most cafés in London are:

In [211]:
cafes_Neighbourhood.sort_values(by='Count',ascending=False).head(10)

Unnamed: 0,Neighbourhood,Count
184,Pentonville,29
144,Lisson Grove,26
178,Paddington,26
1,Aldgate,24
177,Oval,23
8,Barbican,23
126,Islington,22
200,Shepherd's Bush,22
22,Bloomsbury,22
13,Bedford Park,22


The 10 neighbourhoods with the fewest cafés in London are:

In [212]:
cafes_Neighbourhood.sort_values(by='Count',ascending=False).tail(10)

Unnamed: 0,Neighbourhood,Count
174,Oakleigh Park,1
145,Little Ilford,1
43,Catford,1
103,Hackney Marshes,1
240,Totteridge,1
155,Middle Park,1
66,Dartford,1
234,The Hyde,1
164,Neasden,1
122,Honor Oak,1
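The filter-then-count logic above can be illustrated without pandas. The sketch below uses `collections.Counter` on a handful of (neighbourhood, category) pairs; the first five pairs mirror rows shown earlier, while "Exampleton" and its category are invented to show what happens to a neighbourhood with no café.

```python
from collections import Counter

CAFE_CATEGORIES = {"Café", "Coffee Shop"}

# (neighbourhood, venue category) pairs; "Exampleton" is a made-up
# neighbourhood whose only venue is not a café
venues = [
    ("Abbey Wood", "Coffee Shop"),
    ("Abbey Wood", "Café"),
    ("Aldgate", "Coffee Shop"),
    ("Aldgate", "Bakery"),
    ("Aldgate", "Indian Restaurant"),
    ("Exampleton", "Ice Cream Shop"),
]

# Keep only cafés and coffee shops, then count them per neighbourhood
cafe_counts = Counter(name for name, category in venues
                      if category in CAFE_CATEGORIES)
print(cafe_counts.most_common())
```

A neighbourhood with no matching venue never appears in the result. The same holds for the `groupby(...).size()` call in the notebook, which is why the lowest count in the table is 1 rather than 0.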


Now we are going to cluster the neighbourhoods using the k-means algorithm:

In [227]:
# set number of clusters
kclusters = 3

london_grouped_clustering = cafes_Neighbourhood.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 2, 0, 1, 0, 1, 2, 2, 0], dtype=int32)
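Since the only feature is the café count, the clustering is one-dimensional. The following is a minimal pure-Python sketch of Lloyd's algorithm (the basic k-means iteration) on a toy list of counts. It is not scikit-learn's implementation: `KMeans` uses a different initialisation (k-means++ by default), so its label numbering can differ. The function name, the evenly spaced starting centroids and the toy counts are assumptions for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a list of numbers; requires k >= 2."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly between min and max
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

# Toy café counts (hypothetical), forming three obvious groups
counts = [1, 2, 3, 9, 10, 11, 20, 21, 22]
labels, centroids = kmeans_1d(counts, k=3)
print(labels)
print(centroids)
```

With a single feature, the clusters end up as contiguous ranges of counts, which is the pattern the per-cluster statistics in this notebook show.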

Let's create a new dataframe by adding the clusters to the dataframe with the different neighbourhoods and number of cafes:

In [0]:
# add clustering labels
cafes_Neighbourhood.insert(0, 'Cluster Labels', kmeans.labels_)

Now we will merge the dataframe holding the café counts and cluster labels with the dataframe holding the coordinates of each neighbourhood, so we can represent the clusters on a map:

In [229]:
london_merged = cafes_Neighbourhood

# merge london_df with london_merged to add latitude/longitude for each neighborhood
london_merged = london_merged.join(london_df.set_index('Neighbourhood'), on='Neighbourhood')

london_merged.head(5) 

Unnamed: 0,Cluster Labels,Neighbourhood,Count,Borough,Latitude,Longitude
0,0,Abbey Wood,2,"Bexley, Greenwich",51.49245,0.12127
1,2,Aldgate,24,City,51.513308,-0.077762
2,2,Aldwych,20,Westminster,51.513307,-0.117092
3,0,Angel,5,Islington,51.5005,-0.06051
4,1,Archway,12,Islington,51.565746,-0.134917


Now we will examine each cluster and look at the distribution of café counts that distinguishes it from the others.

**Cluster 0:**

In [0]:
cluster0=london_merged.loc[london_merged['Cluster Labels'] == 0]

In [284]:
cluster0['Count'].describe()

count    168.000000
mean       3.904762
std        1.720737
min        1.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        7.000000
Name: Count, dtype: float64

**Cluster 1:**

In [0]:
cluster1=london_merged.loc[london_merged['Cluster Labels'] == 1]

In [286]:
cluster1['Count'].describe()

count    70.000000
mean     10.457143
std       2.012178
min       8.000000
25%       9.000000
50%      10.000000
75%      12.000000
max      15.000000
Name: Count, dtype: float64

**Cluster 2:**

In [0]:
cluster2=london_merged.loc[london_merged['Cluster Labels'] == 2]

In [288]:
cluster2['Count'].describe()

count    42.000000
mean     20.119048
std       2.864436
min      16.000000
25%      18.000000
50%      20.000000
75%      21.750000
max      29.000000
Name: Count, dtype: float64
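The minimum and maximum of each cluster follow directly from the cluster means. In one dimension every value is assigned to the nearest centroid, so the cut-off between two adjacent clusters is the midpoint of their centroids. The short check below uses the three means reported above, assuming k-means has converged so that each centroid equals its cluster mean.

```python
# Cluster means as reported by describe() above
means = [3.904762, 10.457143, 20.119048]

# The cut-off between adjacent clusters is the midpoint of their centroids
boundaries = [(a + b) / 2 for a, b in zip(means, means[1:])]
print([round(b, 2) for b in boundaries])
```

The two cut-offs fall between 7 and 8 and between 15 and 16, which agrees with the observed ranges of 1 to 7, 8 to 15 and 16 to 29 cafés.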

Lastly we are going to visualize the resulting clusters on a London map. First we need to obtain the coordinates of London to start the map.

In [247]:
location = geocoder.arcgis('London, United Kingdom')
lnd_lat_lng = location.latlng
print('The geographical coordinates of London are {}, {}.'.format(lnd_lat_lng[0], lnd_lat_lng[1]))

The geographical coordinates of London are 51.50642000000005, -0.1272099999999341.


Now we create the map:

In [281]:
# create map
map_clusters = folium.Map(location=[lnd_lat_lng[0], lnd_lat_lng[1]], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloured by cluster label
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Neighbourhood'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results <a name="results"></a>



Throughout our analysis we selected 299 London neighbourhoods, each belonging to one or more of the 32 London boroughs or to the City of London. We removed the neighbourhoods on the outskirts (those whose post town is not London) to reduce the number of neighbourhoods to explore through the Foursquare API, as it allows only a limited number of free requests.

The Foursquare API was used to explore venues that belong to the 'coffee' section parameter. This helped us to limit the venues retrieved from the API to those that sell coffee. The number of venues obtained was: 2833.

Those venues included cafés and coffee shops, but also restaurants, ice cream parlours and any other venues serving coffee. Hence we restricted our results to those with a venue category of 'Café' or 'Coffee Shop'. The number of venues remaining was: 2215.

We obtained the number of cafés per neighbourhood. The 10 neighbourhoods with the most cafés ranged from Pentonville with 29 to Bedford Park with 22. Pentonville is a neighbourhood on the northern fringe of Central London, near the international transport hub of King's Cross St Pancras. It is also very close to the Angel area, famous for its restaurants and shops. Both factors help explain the high number of coffee shops in Pentonville.

On the other hand, the 10 neighbourhoods with the fewest cafés in London each had only 1. These include neighbourhoods such as Hackney Marshes, Totteridge and Neasden. They are mostly residential neighbourhoods in outer London, home to people who work in central London, which may explain the lack of coffee shops.

Our analysis used Machine Learning to cluster the London neighbourhoods based on the number of cafés in each one, in order to find the neighbourhoods with the most and the fewest cafés.

The K-means algorithm grouped the neighbourhoods into 3 clusters:

*   Cluster 0: those neighbourhoods with a minimum number of cafes of 1 and a maximum of 7.
*   Cluster 1: those neighbourhoods with a minimum number of cafes of 8 and a maximum of 15.
*   Cluster 2: those neighbourhoods with a minimum number of cafes of 16 and a maximum of 29.

We plotted the location of the three clusters on the map and that helped us to understand the meaning of each of the clusters.

Cluster 0 includes all those neighbourhoods with a low number of coffee shops. These tend to be on the outskirts of London and are mainly residential areas. People living here commute into central London for both work and leisure.

Cluster 1 includes all those neighbourhoods with a medium number of coffee shops. These tend to be closer to central London, and are a mixture of residential, work and leisure areas.

Cluster 2 includes all those neighbourhoods with a high number of coffee shops. These neighbourhoods are part of central London, and their main use is leisure (shopping and restaurant districts) and work (office buildings). These are also the most touristic neighbourhoods.

## Discussion <a name="discussion"></a>

Our Machine Learning analysis has helped us identify three clusters of neighbourhoods based on their number of coffee shops. This, together with the location of these clusters on the map, has helped us understand which would be the best neighbourhoods in London to open a new coffee shop.

These are the neighbourhoods we would recommend for a potential entrepreneur looking to open a coffee shop:


*   One of the best neighbourhoods is Elephant and Castle. This neighbourhood is part of Cluster 0 (low number of cafes) and according to our analysis has 6 cafes. In addition this neighbourhood is close to central London and is undergoing urban regeneration with a lot of investment.

In [276]:
cafes_Neighbourhood.loc[cafes_Neighbourhood['Neighbourhood']=='Elephant and Castle']

Unnamed: 0,Cluster Labels,Neighbourhood,Count
80,0,Elephant and Castle,6


*   Another great neighbourhood would be Mayfair. This neighbourhood is part of Cluster 1 (medium number of cafés) and according to our analysis has 13 cafes. Mayfair is an area famous for its high-end restaurants and hotels, and luxury shops. This may make this location very good for an upper-end speciality café.


In [277]:
cafes_Neighbourhood.loc[cafes_Neighbourhood['Neighbourhood']=='Mayfair']

Unnamed: 0,Cluster Labels,Neighbourhood,Count
153,1,Mayfair,13


In addition to these two neighbourhoods, we can also recommend areas of London based on the location of the cluster 0 neighbourhoods (those with 7 cafés or fewer) on the London map. We have identified 2 areas with a generally low number of cafés:


*   South London.
*   East London. 

Both of these areas of London have traditionally been less touristic and have fewer leisure and work districts; however, they have recently become trendier and are undergoing significant regeneration. A clear example is Stratford (which currently has only 4 cafés). This neighbourhood was substantially regenerated and developed for the London 2012 Olympic Games and has become a shopping and leisure destination. Hence these are areas with potential.


In [278]:
cafes_Neighbourhood.loc[cafes_Neighbourhood['Neighbourhood']=='Stratford']

Unnamed: 0,Cluster Labels,Neighbourhood,Count
225,0,Stratford,4


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify London neighbourhoods with a low number of coffee shops, in order to help stakeholders narrow down the search for an optimal location for a new coffee shop.

By counting the coffee shops per neighbourhood from Foursquare data, we first identified the neighbourhoods with high and low numbers of cafés. We then clustered the neighbourhoods into 3 groups based on their number of cafés (high, medium and low). Finally, the neighbourhoods and their clusters were plotted on a map to find the geographical distribution of the three clusters. This allowed us to see that, in general, neighbourhoods in central London have a higher number of cafés than those on the outskirts. However, some interesting outliers were found (such as Elephant and Castle and Mayfair), which would be good neighbourhoods for a new coffee shop.

The final decision on the optimal coffee shop location will be made by stakeholders based on the number of cafés already in the neighbourhood (the data presented in this project), but also taking into consideration additional factors such as the attractiveness of each location (proximity to parks or shopping centres), proximity to a tube station, noise levels and proximity to major roads, real estate availability, prices, and the social and economic dynamics of each neighbourhood.