# Uncovering the Popular Locations in London

## Introduction

London is a popular destination for both international and local travellers. It is home to the famous clock tower, Big Ben, and the iconic giant wheel near it, the London Eye. Hyde Park, possibly the largest and most famous park in London, is popular among travellers too. Beyond its history, London is where the tales of Sherlock Holmes begin, at 221B Baker Street. Holmes is arguably the most famous detective in fiction, feared by the criminals of his stories, and many visitors go to London mainly to visit his fictional home.

Besides visiting the famous places in London, visitors want to get the most out of their experience. Some will visit the popular venues in various neighbourhoods, while others will stay at the most famous spot for hours. Regardless of travelling style, it helps to have a list of popular venues in each area of London so that visitors can plan their trip well. In this regard, data science can help us uncover these popular spots.

## Business Problem

The goal of this project is to help visitors choose their destinations according to their preferences and the experiences that each neighbourhood has to offer. It can also help people who are thinking about moving to London, or about moving to a different neighbourhood within the city. Our findings should help these stakeholders make informed decisions about the kinds of cuisines, stores, and leisure activities that a neighbourhood offers.

## Data Description

To determine the popular locations in London, we need the geographical location of each area, a measure of how popular its venues are, and descriptive information about each venue. We draw on three data sources:

1. Wikipedia
Wikipedia provides a list of the areas of London together with their boroughs, post towns, and postcode districts (https://en.wikipedia.org/wiki/List_of_areas_of_London). The list also contains areas with other post towns, so we limit our analysis to rows whose post town is London. With the list of boroughs and postal codes, we can analyze the most popular venue categories in each area. The fields we use from this source are borough, post town, and postal code, for example Bexley, Greenwich (borough) - LONDON (town) - SE2 (post code).

borough: Name of the London borough(s) <br>
town: Name of the post town <br>
post_code: Postcode district(s) of the area.

2. ArcGIS API

The ArcGIS API provides geocoding and mapping services for analysing people and locations. In this project, we use it to obtain the latitude and longitude of each postcode district so that the locations can be plotted on an interactive map, for example 51.49245 deg latitude and 0.12127 deg longitude for SE2 (Bexley, Greenwich).

latitude: Latitude for Neighbourhood <br>
longitude: Longitude for Neighbourhood

3. Foursquare API

Once we have the locations from the previous two sources, we can get more specific information, including photos, from the Foursquare API. Based on the information collected, a clustering model is used to uncover patterns among neighbourhoods with similar venue categories. The key piece of information from this source is therefore the venue category of each venue, such as (for Bexley) "supermarket", "historic site", and "coffee shop".

Neighbourhood : Name of the Neighbourhood <br>
Neighbourhood Latitude : Latitude of the Neighbourhood <br>
Neighbourhood Longitude : Longitude of the Neighbourhood <br>
Venue : Name of the Venue <br>
Venue Latitude : Latitude of Venue <br>
Venue Longitude : Longitude of Venue <br>
Venue Category : Category of Venue

## Methodology

We start by installing folium and importing the required libraries.

In [1]:
pip install folium

Collecting folium
  Downloading folium-0.12.0-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 4.4 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

# import k-means for the clustering stage
from sklearn.cluster import KMeans

The analysis is divided into four main steps:
1. Gather the required data of London
2. Plot the map to show the neighbourhoods being considered
3. Build our model by clustering all of the similar neighbourhoods together 
4. Plot the new map with the clustered neighbourhoods to draw insights and discuss findings.

### 1. Gathering Data

The steps in this stage are:
- collect the list of areas from Wikipedia (https://en.wikipedia.org/wiki/List_of_areas_of_London)
- preprocess the column names: replace spaces with "_"
- select features: London borough, Post town, and Postcode district; drop the rest
- filter rows: keep only those whose post town is London
- geolocation: use the ArcGIS API to get the latitude and longitude of each postcode district

#### 1.1 London Neighborhoods

First, we begin by collecting the list of areas of London from the Wikipedia page.

In [3]:
url_london = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
wiki_london_url = requests.get(url_london)
wiki_london_data = pd.read_html(wiki_london_url.text)[1]
wiki_london_data

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,020,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",020,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,020,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,020,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",020,TQ478728
...,...,...,...,...,...,...
526,Woolwich,Greenwich,LONDON,SE18,020,TQ435795
527,Worcester Park,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4,020,TQ225655
528,Wormwood Scrubs,Hammersmith and Fulham,LONDON,W12,020,TQ225815
529,Yeading,Hillingdon,HAYES,UB4,020,TQ115825


Then, we process the data by replacing the spaces in the column titles with underscores. We only need some of the columns, so we drop the rest. The required columns are London borough, Post town, and Postcode district.

In [4]:
wiki_london_data.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
df1 = wiki_london_data.drop( [ wiki_london_data.columns[0], wiki_london_data.columns[4], wiki_london_data.columns[5] ], axis=1)
df1.columns = ['Borough','Town','Post_code']
# removing the [x] after the Borough names in df1
df1['Borough'] = df1['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df1

Unnamed: 0,Borough,Town,Post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
...,...,...,...
526,Greenwich,LONDON,SE18
527,"Sutton, Kingston upon Thames",WORCESTER PARK,KT4
528,Hammersmith and Fulham,LONDON,W12
529,Hillingdon,HAYES,UB4
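
One detail of the cleanup above is worth checking. The chained `rstrip` calls remove the footnote marker but not a space that precedes it, so `"Bexley, Greenwich [7]"` becomes `"Bexley, Greenwich "` with a trailing space. pandas treats that as a different value from `"Bexley, Greenwich"`, which may be why that borough appears twice in the grouped table later on. A minimal sketch of a regex-based alternative follows; `strip_footnote` is a hypothetical helper, not part of the original notebook.

```python
import re

# Hypothetical helper: removes a trailing footnote marker such as "[7]"
# together with any whitespace around it.
_FOOTNOTE = re.compile(r"\s*\[\d+\]\s*$")

def strip_footnote(name):
    return _FOOTNOTE.sub("", name).strip()

samples = ["Bexley, Greenwich [7]", "Croydon[8]", "Bexley"]
cleaned = [strip_footnote(s) for s in samples]
print(cleaned)
```

Under these assumptions it could be applied with `df1['Borough'].map(strip_footnote)` in place of the lambda.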


The dataframe df1 contains data from various post towns, but we're only interested in London. Therefore, we drop the irrelevant rows.

In [5]:
df1 = df1[df1['Town'].str.contains('LONDON')]
df1

Unnamed: 0,Borough,Town,Post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
...,...,...,...
521,Redbridge,LONDON,"IG8, E18"
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8
525,Barnet,LONDON,N12
526,Greenwich,LONDON,SE18
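
Note that `str.contains('LONDON')` keeps every row whose post town mentions London, including combined entries such as `"LONDON, WOODFORD GREEN"` (row 522). Whether such rows belong in the analysis is a judgment call. The sketch below, in plain Python, shows the difference between "mentions London" and "is only London"; the function names are hypothetical and only illustrate the distinction.

```python
def post_towns(town):
    """Split a 'Post town' cell such as 'LONDON, WOODFORD GREEN' into a list."""
    return [part.strip().upper() for part in town.split(",") if part.strip()]

def mentions_london(town):
    # Roughly what the str.contains('LONDON') filter keeps.
    return "LONDON" in post_towns(town)

def is_only_london(town):
    # Stricter variant: LONDON must be the sole post town.
    return post_towns(town) == ["LONDON"]

for town in ["LONDON", "LONDON, WOODFORD GREEN", "CROYDON"]:
    print(town, mentions_london(town), is_only_london(town))
```

The stricter variant could be used as `df1[df1['Town'].map(is_only_london)]` if combined post towns should be excluded.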


#### 1.2 Geopositions of London Neighborhoods

The latitudes and longitudes of the London neighborhoods are required to plot them on a geographical map. We use the ArcGIS package for this.

In [6]:
pip install arcgis

Note: you may need to restart the kernel to use updated packages.


In [33]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS

In [8]:
gis = GIS()

To help us extract the locations more easily, the following function will be used to extract the latitudes and longitudes from ArcGIS.

In [9]:
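def get_x_y_uk(addr):
    """Geocode a London address and return it as a 'lat,lng' string."""
    # take the first candidate returned by ArcGIS
    g = geocode(address='{}, London, England, GBR'.format(addr))[0]
    lng_coords = g['location']['x']  # x is the longitude
    lat_coords = g['location']['y']  # y is the latitude
    return str(lat_coords) + "," + str(lng_coords)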
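
The helper above raises an `IndexError` when the geocoder returns no match, and it packs both numbers into a string that has to be split again later. A minimal sketch of a variant that returns a tuple and tolerates empty results is shown below. `get_lat_lng` and `fake_geocode` are hypothetical names; the geocoder is passed in as an argument and is assumed to behave like `arcgis.geocoding.geocode` as used in this notebook (an `address` keyword, a list of dicts with a `'location'` entry).

```python
def get_lat_lng(addr, geocode_fn):
    """Return (latitude, longitude) for addr, or (None, None) if nothing matches."""
    results = geocode_fn(address="{}, London, England, GBR".format(addr))
    if not results:
        return (None, None)
    location = results[0]["location"]
    return (location["y"], location["x"])  # y is latitude, x is longitude

# Stand-in geocoder for illustration only; the coordinates are the SE2 values
# shown in the notebook output.
def fake_geocode(address):
    if address.startswith("SE2,"):
        return [{"location": {"x": 0.12127, "y": 51.49245}}]
    return []

print(get_lat_lng("SE2", fake_geocode))
print(get_lat_lng("ZZ99", fake_geocode))
```

In the notebook, the real `geocode` function would be passed as `geocode_fn`.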

Now, we will extract the latitudes and longitudes of the neighborhoods in London. After getting the latitudes and longitudes, we will append the data to the original dataframe to facilitate our analysis down the road.

In [10]:
geo_coordinates_uk = df1['Post_code']   
coordinates_latlng_uk = geo_coordinates_uk.apply(lambda x: get_x_y_uk(x))
lat_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[0])
lng_uk = coordinates_latlng_uk.apply(lambda x: x.split(',')[1])

In [11]:
london_merged = pd.concat([df1,lat_uk.astype(float), lng_uk.astype(float)], axis=1)
london_merged.columns= ['Borough','Town','Post_code','Latitude','Longitude']
london_merged

Unnamed: 0,Borough,Town,Post_code,Latitude,Longitude
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746
6,City,LONDON,EC3,51.51200,-0.08058
7,Westminster,LONDON,WC2,51.51651,-0.11968
9,Bromley,LONDON,SE20,51.41009,-0.05683
...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721
525,Barnet,LONDON,N12,51.61592,-0.17674
526,Greenwich,LONDON,SE18,51.48207,0.07143


### 2. Visualizing London

- Use folium package
- Use Foursquare API to get the venue and venue categories around each neighbourhood in London

We use the folium package imported earlier to visualize London.

In [12]:
# Location of London
london = geocode(address='London, England, GBR')[0]
london_lng_coords = london['location']['x']
london_lat_coords = london['location']['y']

# Creating the map of London
map_London = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# adding markers to map
for latitude, longitude, borough, town in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Borough'], london_merged['Town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(map_London) 
    
map_London

Next, we will get the venues and their categories from the Foursquare API around the neighborhoods in London. The following function is defined to help us gather the necessary data.

In [16]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own credentials; do not publish real keys
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # value passed as the limit query parameter

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return nearby_venues
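
The request URL in `getNearbyVenues` is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary so that each parameter is named and URL-encoded. The endpoint and parameter names are the ones used in the function above; `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the explore URL for one location from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials, for illustration only.
url = build_explore_url(51.49245, 0.12127, "MY_ID", "MY_SECRET")
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

With `requests`, the same dictionary can be passed directly as `requests.get(EXPLORE_ENDPOINT, params=params)`, which performs the encoding itself.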

In [17]:
venues_in_London = getNearbyVenues(london_merged['Borough'], london_merged['Latitude'], london_merged['Longitude'])
venues_in_London

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,"Bexley, Greenwich",51.49245,0.12127,Lesnes Abbey,Historic Site
1,"Bexley, Greenwich",51.49245,0.12127,Sainsbury's,Supermarket
2,"Bexley, Greenwich",51.49245,0.12127,Lidl,Supermarket
3,"Bexley, Greenwich",51.49245,0.12127,Abbey Wood Railway Station (ABW),Train Station
4,"Bexley, Greenwich",51.49245,0.12127,Bean @ Work,Coffee Shop
...,...,...,...,...,...
10405,Hammersmith and Fulham,51.50645,-0.23691,Mleczko Polish Deli,Deli / Bodega
10406,Hammersmith and Fulham,51.50645,-0.23691,Nut Case,Gourmet Shop
10407,Hammersmith and Fulham,51.50645,-0.23691,New Sweet'n'Sour Chinese Takeaway,Chinese Restaurant
10408,Hammersmith and Fulham,51.50645,-0.23691,The Vine Leaves Taverna,Greek Restaurant


### 3. Grouping by Venue Categories and K-Means Clustering

- Encode venues with dummy variables
- Group by neighborhood and calculate mean for each column
- Find most popular venues in each neighborhood
- Cluster with k-means (example: 5 clusters)

K-means needs numeric features, so we one-hot encode the venue categories as dummy variables. Then, we group the data by neighborhood and calculate the mean of each dummy variable, which gives the share of each venue category among the venues of that neighborhood.

In [20]:
London_venue_cat = pd.get_dummies(venues_in_London[['Venue Category']], prefix="", prefix_sep="")

# adding the Neighborhood column to the Dataframe
London_venue_cat['Neighbourhood'] = venues_in_London['Neighbourhood'] 

# moving neighborhood column to the first column
fixed_columns = [London_venue_cat.columns[-1]] + list(London_venue_cat.columns[:-1])
London_venue_cat = London_venue_cat[fixed_columns]

# group the data by neighborhood and calculate the mean values of each column
London_grouped = London_venue_cat.groupby('Neighbourhood').mean().reset_index()
London_grouped.head()

Unnamed: 0,Neighbourhood,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Warehouse Store,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Barnet,0.0,0.0,0.001859,0.0,0.0,0.0,0.007435,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Barnet, Brent, Camden",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bexley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bexley, Greenwich",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
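
To make the numbers in this table concrete: the mean of a 0/1 dummy column within a group is the fraction of that group's venues belonging to the category. The sketch below reproduces the calculation in plain Python on a small toy data set (the neighbourhood names and venues are illustrative, not taken from the Foursquare results).

```python
from collections import Counter, defaultdict

# Toy (neighbourhood, venue category) pairs; illustrative values only.
venues = [
    ("A", "Pub"), ("A", "Pub"), ("A", "Café"), ("A", "Hotel"),
    ("B", "Supermarket"), ("B", "Pub"),
]

# Count how often each category occurs in each neighbourhood.
counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# Share of each category among a neighbourhood's venues. This equals the
# mean of the 0/1 dummy column for that category within the group.
frequencies = {
    n: {cat: c / sum(ctr.values()) for cat, c in ctr.items()}
    for n, ctr in counts.items()
}
print(frequencies["A"]["Pub"])   # 0.5
print(frequencies["A"]["Café"])  # 0.25
```

Categories that never occur in a neighbourhood are simply absent here, whereas the dummy-variable table stores them as 0.0.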


To find the most common venue categories in each neighborhood, we write the following helper function. Since there are many categories, we look only at the top 5.

In [22]:
# return the specified number of most common venues
# row: the specified row in the Dataframe
# num_top_venues: number of top most common venues of interest
def return_most_common_venues(row, num_top_venues): 
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe for London
neighborhoods_venues_sorted_london = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_london['Neighbourhood'] = London_grouped['Neighbourhood']

for ind in np.arange(London_grouped.shape[0]):
    neighborhoods_venues_sorted_london.iloc[ind, 1:] = return_most_common_venues(London_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_london.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Barnet,Coffee Shop,Café,Grocery Store,Pub,Italian Restaurant
1,"Barnet, Brent, Camden",Clothing Store,Supermarket,Hardware Store,Gym / Fitness Center,Convenience Store
2,Bexley,Supermarket,Historic Site,Convenience Store,Coffee Shop,Train Station
3,"Bexley, Greenwich",Daycare,Convenience Store,Food Service,Golf Course,Massage Studio
4,"Bexley, Greenwich",Supermarket,Historic Site,Coffee Shop,Train Station,Platform
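
The column labels above are produced with a three-element `indicators` list and a fallback to "th". That works for the five columns used here, but it would label a 21st column as "21th" if `num_top_venues` were ever raised that far. A small sketch of a general ordinal helper follows; `ordinal` is a hypothetical function, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 6)
]
print(columns)
```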


#### K-Means Clustering Model

We group the neighbourhoods into 5 clusters with the k-means technique.

In [29]:
# set number of clusters
k_num_clusters = 5

London_grouped_clustering = London_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans_london = KMeans(n_clusters=k_num_clusters, random_state=0).fit(London_grouped_clustering)

# inserting the cluster labels to the Dataframe of most common venues
neighborhoods_venues_sorted_london.insert(0, 'Cluster Labels', kmeans_london.labels_ +1)

# merging data with latitudes and longitudes
london_data = london_merged
london_data = london_data.join(neighborhoods_venues_sorted_london.set_index('Neighbourhood'), on='Borough')
london_data

Unnamed: 0,Borough,Town,Post_code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127,4,Supermarket,Historic Site,Coffee Shop,Train Station,Platform
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",51.51324,-0.26746,2,Grocery Store,Indian Restaurant,Train Station,Breakfast Spot,Hotel
6,City,LONDON,EC3,51.51200,-0.08058,1,Hotel,Coffee Shop,Italian Restaurant,Gym / Fitness Center,Pub
7,Westminster,LONDON,WC2,51.51651,-0.11968,1,Hotel,Coffee Shop,Café,Pub,Sandwich Place
9,Bromley,LONDON,SE20,51.41009,-0.05683,1,Supermarket,Grocery Store,Hotel,Convenience Store,Fast Food Restaurant
...,...,...,...,...,...,...,...,...,...,...,...
521,Redbridge,LONDON,"IG8, E18",51.58977,0.03052,1,Pub,Grocery Store,Café,Pizza Place,Bakery
522,"Redbridge, Waltham Forest","LONDON, WOODFORD GREEN",IG8,51.50642,-0.12721,1,Theater,Hotel,Pub,Monument / Landmark,Garden
525,Barnet,LONDON,N12,51.61592,-0.17674,1,Coffee Shop,Café,Grocery Store,Pub,Italian Restaurant
526,Greenwich,LONDON,SE18,51.48207,0.07143,1,Pub,Grocery Store,Bus Stop,Indian Restaurant,Coffee Shop
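
For readers unfamiliar with k-means, the sketch below shows the two alternating steps of the basic algorithm (Lloyd's algorithm) in plain Python on toy 2-D points. It is an illustration only: scikit-learn's `KMeans` uses k-means++ initialization by default and a more refined implementation, so this is not what the notebook actually runs. The function names and the toy points are assumptions made for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    """Basic Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In the notebook, each "point" is a neighbourhood's vector of venue-category frequencies rather than a pair of coordinates.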


### 4. Visualizing the Clusters

After obtaining the clusters, we visualize them on the map using the folium package.

In [32]:
london_data_nonan = london_data.dropna(subset=['Cluster Labels'])
map_clusters_london = folium.Map(location=[london_lat_coords, london_lng_coords], zoom_start=12)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, k_num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(london_data_nonan['Latitude'], london_data_nonan['Longitude'], london_data_nonan['Borough'], london_data_nonan['Cluster Labels']):
    # cluster labels are already 1-based (1 was added when they were inserted)
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_london)
        
map_clusters_london
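
The colouring step only needs one distinct colour per cluster label. As an alternative to the matplotlib colormap used above, the sketch below builds a palette with the standard-library `colorsys` module instead. `cluster_palette` is a hypothetical helper, and the resulting colours differ from those of `cm.rainbow`.

```python
import colorsys

def cluster_palette(k):
    """Return k hex colours spaced evenly around the hue wheel."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette

palette = cluster_palette(5)
print(palette)
```

Because the cluster labels in this notebook start at 1, a marker's colour would be looked up as `palette[int(cluster) - 1]`.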

### Results and Discussions

London has become more multicultural over the years. The venue data show a wide variety of restaurants, bars, coffee shops, and breakfast places. In terms of shopping, London has many options, such as flower shops, fishing stores, and clothing stores. The most common transport venues are bus stops and train stations. To relax, visitors and residents can go to parks, zoos, gyms, and historic sites. Overall, London offers a multicultural and entertaining experience for all groups of people.

### Conclusions

The goal of this project was to explore London and look at how attractive it is to visitors and potential immigrants. We explored the city based on its postal codes and determined the most common venues in each neighbourhood. Finally, we used k-means to cluster similar neighbourhoods together. London has a wide variety of unique experiences to offer, and its cultural diversity allows newcomers to develop a sense of belonging in no time.