# Capstone Project - Opening a new business venue in a different city.


## Business Problem

An entrepreneur owns a shop in a specific area of a city. The shop has a specific theme, and the area where it is located was selected because of its character. The existing venues and amenities give the surroundings a distinctive vibe that the entrepreneur considers a good fit for the theme of the shop.

The entrepreneur wants to open a new location in a different city and would like to select an area with the same characteristics as the one in the original city.

For this specific project we will consider the following instance:

An entrepreneur owns an arty/concept coffee shop in SoHo, Manhattan, New York and would like to open a new location in London. The entrepreneur wants to select an area with the same characteristics and vibe as SoHo, an area known for its commercialization and eclectic mix of venues, ranging from restaurants and coffee shops to shopping boutiques and art galleries.


## Data

The following procedure is used to create the datasets used in the project:

* A list of all areas of London, containing the area name and OS Grid Ref, is scraped from Wikipedia (https://en.wikipedia.org/wiki/List_of_areas_of_London).
* The centre of each London area is located by converting its OS Grid Reference into geodetic coordinates (latitude and longitude).
* The centre of SoHo, New York is located and added to the dataset.
* A list of venues within a 500 m radius of every area centre is retrieved using the Foursquare API.
* Venue category hierarchies are scraped from the Foursquare documentation (https://developer.foursquare.com/docs/build-with-foursquare/categories/).
* For every venue retrieved using the Foursquare API, the number of category keywords available is expanded using the category hierarchies.
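
The grid-reference step above can be sketched as follows. This is a minimal illustration, not the code used to build the datasets: the function name `os_grid_to_easting_northing` is hypothetical, and it only covers the first half of the conversion (grid reference to easting/northing in metres on the British National Grid). Turning eastings and northings into latitude and longitude additionally requires a datum transformation from OSGB36 to WGS84, for which a projection library (for example pyproj, EPSG:27700 to EPSG:4326) is one option; that step is assumed here and not shown.

```python
def os_grid_to_easting_northing(grid_ref):
    """Convert an OS grid reference such as 'TQ 30 80' to (easting, northing).

    Values are in metres and point at the south-west corner of the
    referenced grid square.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    valid = (
        len(letters) == 2
        and letters.isalpha()
        and "I" not in letters          # the letter I is not used by the grid
        and len(digits) % 2 == 0
        and len(digits) <= 10
        and (digits == "" or digits.isdigit())
    )
    if not valid:
        raise ValueError(f"Invalid OS grid reference: {grid_ref!r}")

    # Each letter indexes a 5x5 grid laid out A..Z with 'I' skipped.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Index of the 100 km square, counted from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # The digits split evenly into easting and northing parts.
    half = len(digits) // 2
    scale = 10 ** (5 - half)
    e_offset = int(digits[:half]) * scale if half else 0
    n_offset = int(digits[half:]) * scale if half else 0

    return e100km * 100_000 + e_offset, n100km * 100_000 + n_offset


print(os_grid_to_easting_northing("TQ 30 80"))
```

A four-figure reference identifies a 1 km square, so the resulting area centre is only as precise as the reference scraped from the source page.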

The datasets created:

* A list of all the areas of London plus SoHo, New York, with centre coordinates.
* A list of venues in SoHo, New York, with venue coordinates and venue characteristics.
* A list of venues in all the areas of London, with venue coordinates and venue characteristics.

In [84]:
""" 
===================================
        Libraries
===================================
"""

from os import chdir; chdir("/home/ernestino/Python/Capstone/London")
import gc # gc.collect()

import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim 

from matplotlib import cm
import matplotlib.colors as colors
import folium


""" 
===================================
        Importing datasets
===================================
"""

area_radius_meters = 500

london_areas = pd.read_csv('London_Areas.csv')
london_areas['areaID'] = london_areas.index
london_areas = london_areas[['london_borough', 'location','areaID','latitude', 'longitude']]
london_areas.columns = ['borough','area','areaID','latitude','longitude']

newyork_areas = pd.read_csv('NewYork_Areas.csv')
newyork_areas = newyork_areas[newyork_areas['neighborhood'] == 'Soho'].copy()  # copy so the assignment below does not act on a view
newyork_areas['areaID'] = -1
newyork_areas = newyork_areas.iloc[:,[0,1,4,2,3]]
newyork_areas.columns = ['borough','area','areaID','latitude','longitude']

newyork_venues = pd.read_csv('NewYork_venues.csv')
london_venues = pd.read_csv('London_venues.csv')

newyork_venues['areaID'] = -1

all_venues = pd.concat([newyork_venues, london_venues])  # DataFrame.append was removed in pandas 2.0

del newyork_venues, london_venues

In [85]:
london_areas.head()

Unnamed: 0,borough,area,areaID,latitude,longitude
0,Bexley,Abbey Wood,0,51.485964,0.110225
1,Ealing,Acton,1,51.51008,-0.263398
2,Croydon,Addington,2,51.362403,-0.024762
3,Croydon,Addiscombe,3,51.381096,-0.067074
4,Bexley,Albany Park,4,51.434403,0.126554


In [86]:
newyork_areas.head()

Unnamed: 0,borough,area,areaID,latitude,longitude
122,Manhattan,Soho,-1,40.722184,-74.000657


London and New York venues combined:

In [87]:
all_venues.head()

Unnamed: 0,venueID,venue,latitude,longitude,categoryID,areaID,Category,Category_lv1,Tokens
0,4bc11de1abf49521cf98c093,Dance With Me SoHo,40.722578,-74.001363,4bf58dd8d48988d134941735,-1,Dance Studio,Arts & Entertainment,"dance studio,dance,studio,performing arts venu..."
1,4b6705a3f964a5207e352be3,Sam Brocato Salon,40.722371,-74.002562,4bf58dd8d48988d110951735,-1,Salon / Barbershop,Shop & Service,"barbershop,service,shop,salon,shop & service,s..."
2,4b96c70ff964a520dfe334e3,Hair Toto Group,40.718629,-73.999593,4bf58dd8d48988d110951735,-1,Salon / Barbershop,Shop & Service,"barbershop,service,shop,salon,shop & service,s..."
3,45e98bacf964a52080431fe3,MarieBelle,40.723101,-74.002477,4bf58dd8d48988d1d0941735,-1,Dessert Shop,Food,"dessert shop,dessert,shop,food"
4,52eddc12498e40bb655e0d7a,Ladurée,40.724314,-74.002453,4bf58dd8d48988d1d0941735,-1,Dessert Shop,Food,"dessert shop,dessert,shop,food"


## Methodology

The main objective is to find which area of London has the same characteristics and vibe as SoHo, New York.

A few assumptions are made:

* The characteristics of an area can be inferred from the characteristics of the venues present in that area.
* A theme is a specific mix of related venue characteristics (e.g. the theme “Sport & Fitness” refers to all the venues related to sport and fitness, such as gyms, fitness centres, spas, golf courses, etc.).
* An area is defined by a combination of themes (e.g. an area could be 0.8 Sport & Fitness and 0.2 Residential).
* Similarity between two areas is calculated based on the proportion of the same themes they share.

In the available dataset, every venue is defined by a series of tokens, created using the Foursquare categories hierarchies. These tokens are labels expressing the business and main qualities of the venue (e.g. food, drinks, restaurant, Italian).

The tokens are used to calculate 10 distinct themes using Latent Semantic Analysis (LSA). These themes are then used to calculate the similarity between SoHo, New York and every area of London.

The most similar areas are defined as the ones with a similarity of at least 0.9.

As some of the most similar areas are geographically neighbouring or even partially overlapping, they are clustered by proximity using the DBSCAN algorithm. The resulting clusters are groups of areas close to each other and can be treated as a single larger area where a shop could potentially be opened.

As a cluster can be composed of areas with different similarities, a 5-nearest-neighbours (KNN) classifier is used to calculate which cluster is the most similar to the New York area.



## Analysis

In the available dataset, every venue is defined by a series of tokens, created using the Foursquare categories hierarchies. These tokens are labels expressing the business and main qualities of the venue (e.g. food, drinks, restaurant, Italian).

A Tf-idf matrix is built from the tokens. Each row is an area (including SoHo, New York) and each column is a token; each entry weights how often the token occurs in the area against how common the token is across all areas.
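
To make the weighting concrete, here is a small pure-Python sketch on three made-up areas. The token lists and the helper name `tfidf` are hypothetical. The formula follows the smoothed idf that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row; the `max_df` and `min_df` filtering used in the notebook is not reproduced.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, matrix) with one L2-normalised row per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in documents for token in set(doc))
    vocab = sorted(df)

    rows = []
    for doc in documents:
        counts = Counter(doc)
        weights = [
            counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t in vocab
        ]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical token lists, one per area.
areas = [
    ["food", "restaurant", "food", "bar"],
    ["park", "gym", "food"],
    ["station", "train", "station"],
]

vocab, matrix = tfidf(areas)
for row in matrix:
    print([round(w, 3) for w in row])
```

In the second area, "food" and "park" both occur once, but "food" receives the lower weight because it also appears in another area.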

In [88]:
""" 
=============================================================
           building a Tf-idf matrix
=============================================================
"""

all_venues.set_index('areaID', inplace = True)

documents = []

for neigh in all_venues.index.unique():

    document = []
    
    if len(all_venues.loc[neigh].shape) > 1:
        
        for elem in all_venues.loc[neigh, 'Tokens']:
            document.extend(elem.split(','))
            
    else:
        document.extend(all_venues.loc[neigh, 'Tokens'].split(','))

    documents.append(document)


venue_tokens = pd.DataFrame({'areaID':all_venues.index.unique(), 'tokens':documents})
venue_tokens.set_index('areaID', inplace = True)
venue_tokens.sort_index(inplace=True)

del neigh, elem, document, documents, all_venues
gc.collect()


from sklearn.feature_extraction.text import TfidfVectorizer
def keep_my_tokens(doc): return doc


tfidf_maker = TfidfVectorizer(analyzer='word',
                              tokenizer=keep_my_tokens,
                              preprocessor=keep_my_tokens,
                              token_pattern=None,
                              max_df = 0.9,
                              min_df = 0.01) 
 
tfidf_maker.fit(venue_tokens['tokens'].values)

tfidf_matrix = tfidf_maker.transform(venue_tokens['tokens'].values)

In [89]:
print( 'tfidf_matrix contains' , tfidf_matrix.shape[1], 'tokens for', tfidf_matrix.shape[0], 'areas')

tfidf_matrix contains 432 tokens for 528 areas


Example of tokens:

In [90]:
tfidf_maker.get_feature_names_out()[1:10].tolist()  # get_feature_names() was removed in scikit-learn 1.2

['african restaurant',
 'alley',
 'american',
 'american restaurant',
 'antique',
 'antique shop',
 'area',
 'argentinian',
 'argentinian restaurant']

The 432 tokens are used to calculate 10 distinct themes using Latent Semantic Analysis (LSA). LSA reduces the number of features (tokens) used to calculate the similarity between areas: instead of individual tokens, groups of related tokens (themes) are created, reducing the feature space to 10 dimensions.
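
The dimensionality reduction can be illustrated with a plain NumPy singular value decomposition. This is a sketch on a made-up 4 x 5 matrix, using an exact SVD rather than the solver inside scikit-learn's `TruncatedSVD`; the matrix values and variable names are hypothetical. Keeping the first `n_themes` singular vectors gives each area a short vector of theme scores and each theme a vector of token loadings.

```python
import numpy as np

# Hypothetical area-by-token weight matrix: 4 areas (rows) x 5 tokens (columns).
X = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.1, 0.0, 0.8, 0.9, 0.2],
])

n_themes = 2

# Exact SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Area coordinates in theme space (one row per area, one column per theme).
area_themes = U[:, :n_themes] * s[:n_themes]

# Token loadings of each theme (one row per theme, one column per token).
theme_tokens = Vt[:n_themes]

# Rank-2 approximation of the original matrix.
X_approx = area_themes @ theme_tokens

print(area_themes.round(3))
print(np.linalg.norm(X - X_approx).round(3))
```

The signs of the components are arbitrary, which is one reason theme scores can be negative; only the relative directions matter for the similarity computed afterwards.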

In [91]:

""" 
===================================================
       Getting Area Theme using LSA
===================================================
"""


from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=10, n_iter=20)

area_themes_df = pd.DataFrame(lsa.fit_transform(tfidf_matrix))
area_themes_df.set_index(venue_tokens.index.values, inplace=True)

def print_theme_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = 'THEME '+ str(topic_idx)+': '
        message += ' / '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])                
                             
        print(message)
        print('')
    print()
    
    
print_theme_top_words(lsa, tfidf_maker.get_feature_names_out(), 15)  # get_feature_names() was removed in scikit-learn 1.2


THEME 0: food / shop / restaurant / shop & service / service / nightlife / nightlife spot / bar / outdoors / recreation / outdoors & recreation / store / drink / food & drink shop / pub

THEME 1: outdoors / recreation / outdoors & recreation / athletics / athletics & sports / sports / park / transport / travel / travel & transport / nightlife / nightlife spot / gym / gym / fitness center / fitness

THEME 2: station / transport / travel / travel & transport / train station / train / service / shop & service / platform / store / shop / bus / bus stop / stop / convenience store

THEME 3: transport / travel / travel & transport / restaurant / station / train / train station / food / nightlife / nightlife spot / bus / bar / platform / bus stop / stop

THEME 4: nightlife / nightlife spot / bar / pub / service / shop & service / store / shop / convenience / convenience store / construction / construction & landscaping / landscaping / arts & entertainment / entertainment

THEME 5: arts / enter

The groups created can be interpreted as:

In [92]:
themes = {0:'Shops, Food & Drinks',
          1:'Fitness, Outdoor & Travel',   
          2:'Commuting & Shops',         
          3:'Commuting & Food',
          4:'Nightlife & Entertainment',          
          5:'Shops, Art & Entertainment',
          6:'Transport',          
          7:'Outdoor, Recreation & Shops',              
          8:'Residential & Outdoor sport',
          9:'Residential & Ethnic food'
          }

Similarity between two areas is calculated based on the proportion of the same themes they share. Cosine similarity is used to calculate the similarity between SoHo, New York and all the areas of London.
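
As a small worked example of the measure, the sketch below computes cosine similarity by hand for three made-up theme vectors. The vectors and the helper name `cosine_sim` are hypothetical and truncated to three themes for readability. Cosine similarity compares the direction of two vectors and ignores their length, so an area with the same theme mix at a different scale still scores 1.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical theme scores (three themes only).
soho = [0.79, 0.29, -0.13]
area_a = [1.58, 0.58, -0.26]   # same mix as soho, twice the magnitude
area_b = [0.10, -0.20, 0.80]   # a very different mix

print(round(cosine_sim(soho, area_a), 3))
print(round(cosine_sim(soho, area_b), 3))
```

Because theme scores from LSA can be negative, the similarity ranges from -1 to 1 rather than from 0 to 1.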

In [93]:
from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(area_themes_df.values)

area_themes_df['similarity'] = cos_sim[0]

london_areas = london_areas.merge(area_themes_df.loc[0:, 'similarity'], left_index=True, right_index=True)

london_areas.head()

Unnamed: 0,borough,area,areaID,latitude,longitude,similarity
0,Bexley,Abbey Wood,0,51.485964,0.110225,0.615897
1,Ealing,Acton,1,51.51008,-0.263398,0.622542
2,Croydon,Addington,2,51.362403,-0.024762,0.12897
3,Croydon,Addiscombe,3,51.381096,-0.067074,0.650411
4,Bexley,Albany Park,4,51.434403,0.126554,0.44708


In [94]:
""" 
====================================================
    Finds the most similar areas (similarity >= 0.9)
====================================================
"""

top_similar_areas = london_areas[london_areas['similarity'] >= 0.9].copy()
top_similar_areas.sort_values('similarity', ascending=False, inplace=True)

top_similar_areas[['borough','area','similarity']]


Unnamed: 0,borough,area,similarity
400,Hackney,Shoreditch,0.973975
172,Camden,Fitzrovia,0.972802
299,Westminster,Marylebone (also St Marylebone),0.965995
269,Kingston upon Thames,Kingston upon Thames,0.962092
397,Hammersmith and Fulham,Shepherd's Bush,0.960121
406,Westminster,Soho,0.956681
449,Camden,Swiss Cottage,0.952429
348,Bromley,Orpington,0.952374
241,Islington,Holloway,0.947718
207,Hammersmith and Fulham,Hammersmith,0.947667


Many of the areas above belong to the same administrative zone (borough) and are located near each other.

In [96]:
top_similar_areas_by_borough = top_similar_areas.groupby('borough').agg({'similarity':[np.max, np.mean],
                                                                         'area':np.size})

top_similar_areas_by_borough.columns = ['similarity_max','similarity_mean','nr_areas']
top_similar_areas_by_borough.sort_values('similarity_max', ascending=False, inplace=True)
top_similar_areas_by_borough[0:15]

borough,similarity_max,similarity_mean,nr_areas
Hackney,0.973975,0.973975,1
Camden,0.972802,0.930782,6
Westminster,0.965995,0.928376,6
Kingston upon Thames,0.962092,0.962092,1
Hammersmith and Fulham,0.960121,0.949979,3
Bromley,0.952374,0.93702,2
Islington,0.947718,0.927855,2
Southwark,0.942715,0.942715,1
Haringey,0.940209,0.940209,1
Barnet,0.936639,0.936639,1


In [97]:
""" 
=============================================
       Create map for all areas
==============================================
"""


location = 'London, England, United Kingdom'
map_centre = Nominatim(user_agent="myapp").geocode(location)

map_london = folium.Map(location=[map_centre.latitude,
                                  map_centre.longitude],
                        tiles = 'Stamen Terrain', #'Stamen Toner',
                        zoom_start=11)

# set color scheme

target_colour_column = 'area'

x = len(top_similar_areas[target_colour_column].unique())
colors_array = cm.rainbow(np.linspace(0.1, 0.9, x))
rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow = pd.DataFrame({target_colour_column:top_similar_areas[target_colour_column].unique(), 'colour':rainbow})
rainbow.set_index(target_colour_column, inplace = True)


# add markers to the map
    
for name, row in top_similar_areas.iterrows():

    label = folium.Popup(str(str(row['area']) +' '+str(row['similarity'])), parse_html=True)
    folium.Circle(location=[row['latitude'], row['longitude']],
                  radius=area_radius_meters,
                  popup=label,
                  color=rainbow.loc[row[target_colour_column], 'colour'],
                  fill=True,
                  fill_color=rainbow.loc[row[target_colour_column], 'colour'],
                  fill_opacity=(row['similarity']-0.89)*1/(1-0.89)
                  ).add_to(map_london)


In [98]:
map_london

Map of London: every circle is an area (radius 500 m). Each area has its own colour, and the fill opacity encodes the similarity (the more solid the colour, the higher the similarity).
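
The opacity used on the map is a linear rescaling of the similarity. The sketch below restates the `fill_opacity` expression from the mapping cell as a function; the name `similarity_to_opacity` and the `floor` parameter are introduced here for illustration only. A floor of 0.89 sits just under the 0.9 selection threshold, so even the least similar selected area stays faintly visible.

```python
def similarity_to_opacity(similarity, floor=0.89):
    """Map similarities in [floor, 1] linearly onto opacities in [0, 1]."""
    return (similarity - floor) / (1.0 - floor)


for value in (0.9, 0.973975, 1.0):
    print(value, round(similarity_to_opacity(value), 3))
```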

As some of the most similar areas are geographically neighbouring or even partially overlapping, they are clustered by proximity using the DBSCAN algorithm. The resulting clusters are groups of areas close to each other and can be treated as a single larger area where a shop could potentially be opened.
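
With `min_samples=1` every point is a core point, so DBSCAN reduces to finding connected groups of points that lie within `eps` of each other. The sketch below shows that idea in pure Python with a hand-written haversine distance and a 1 km threshold. It is an illustration of the grouping logic under that assumption, not a replacement for scikit-learn's implementation; the function names are hypothetical, and the coordinates are those of Fitzrovia, Soho, Bloomsbury and Abbey Wood from the tables in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def group_by_proximity(points, eps_km=1.0):
    """Label points so that chains of neighbours within eps_km share a label."""
    labels = [-1] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # flood-fill through neighbours within eps_km
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and haversine_km(points[i], points[j]) <= eps_km:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


points = [
    (51.518021, -0.136242),  # Fitzrovia
    (51.517076, -0.133397),  # Soho (London)
    (51.521478, -0.127451),  # Bloomsbury
    (51.485964, 0.110225),   # Abbey Wood
]

print(group_by_proximity(points))
```

Because membership spreads along chains of neighbours, two areas in the same cluster can be more than 1 km apart as long as intermediate areas link them.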

In [100]:
""" 
====================================================
    Group areas using density clustering 
====================================================
"""


from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

top_similar_areas_coords_radians = np.radians(top_similar_areas[['latitude', 'longitude']].values)

kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian

DensityClustering = DBSCAN(eps=epsilon, 
                           min_samples=1,
                           algorithm='ball_tree', 
                           metric='haversine').fit(top_similar_areas_coords_radians)


top_similar_areas['cluster'] = DensityClustering.labels_


top_similar_areas_by_cluster = top_similar_areas.groupby('cluster').agg({'similarity':[np.max, np.mean],
                                                                         'area':np.size,
                                                                         'borough': lambda x: x.nunique()})

top_similar_areas_by_cluster.columns = ['similarity_max','similarity_mean','nr_areas', 'nr_boroughs']
top_similar_areas_by_cluster.sort_values('similarity_max', ascending=False, inplace=True)

top_similar_areas_by_cluster[0:15]

cluster,similarity_max,similarity_mean,nr_areas,nr_boroughs
0,0.973975,0.973975,1,1
1,0.972802,0.939781,6,2
2,0.962092,0.948802,2,2
3,0.960121,0.951135,2,1
4,0.952429,0.933862,2,2
5,0.952374,0.952374,1,1
6,0.947718,0.947718,1,1
7,0.947667,0.947667,1,1
8,0.942715,0.942715,1,1
9,0.940209,0.940209,1,1


The table above shows how clusters can be composed of several neighbouring areas belonging to different administrative zones (boroughs).

Example: details of cluster 1

In [101]:
top_similar_areas[top_similar_areas['cluster']==1]

Unnamed: 0,borough,area,areaID,latitude,longitude,similarity,cluster
172,Camden,Fitzrovia,172,51.518021,-0.136242,0.972802,1
299,Westminster,Marylebone (also St Marylebone),299,51.517305,-0.147803,0.965995,1
406,Westminster,Soho,406,51.517076,-0.133397,0.956681,1
53,Camden,Bloomsbury,53,51.521478,-0.127451,0.924837,1
113,Westminster,Covent Garden,113,51.5115,-0.122095,0.912547,1
300,Westminster,Mayfair,300,51.508317,-0.148168,0.905823,1


As a cluster can be composed of areas with different similarities, a 5-nearest-neighbours (KNN) classifier is used to calculate which cluster is the most similar to the New York area. KNN gives the probability of SoHo belonging to each of the top clusters identified.
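
The distance-weighted vote can be sketched by hand. In the snippet below, each of the `k` nearest areas votes for its own cluster with weight `1 / distance`, and the votes are normalised into probabilities. The function name is hypothetical, and the sketch assumes strictly positive distances. The inputs are cosine distances taken as `1 - similarity` for six of the most similar areas listed earlier, paired with the cluster labels reported in the cluster tables.

```python
from collections import defaultdict


def knn_class_probabilities(distances, labels, k=5):
    """Distance-weighted k-nearest-neighbour class probabilities."""
    nearest = sorted(zip(distances, labels))[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        votes[label] += 1.0 / distance  # closer neighbours weigh more
    total = sum(votes.values())
    return {label: weight / total for label, weight in votes.items()}


# Cosine distance to SoHo, New York = 1 - similarity.
similarities = [0.973975, 0.972802, 0.965995, 0.962092, 0.960121, 0.956681]
clusters = [0, 1, 1, 2, 3, 1]  # Shoreditch, Fitzrovia, Marylebone, Kingston, Shepherd's Bush, Soho
distances = [1.0 - s for s in similarities]

probabilities = knn_class_probabilities(distances, clusters, k=5)
for cluster, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(cluster, round(p, 3))
```

In this sketch only the five closest areas take part in the vote, so a cluster gains probability by contributing several near neighbours rather than a single very close one.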

In [103]:
""" 
=======================================================================
    Finds best clusters using KNN
    best is defined as a weighting of most similar and denser
=======================================================================
"""


from sklearn.metrics.pairwise import cosine_distances
from sklearn.neighbors import KNeighborsClassifier
    
newyork_themes = area_themes_df.drop(['similarity'], axis=1).loc[-1]
london_themes = area_themes_df.drop(['similarity'], axis=1).loc[top_similar_areas['areaID']]


cos_dist_london = cosine_distances(london_themes)
cos_dist_newyork = cosine_distances(london_themes,newyork_themes.values.reshape(1,-1))


find_best_clusters = KNeighborsClassifier(weights='distance',metric='precomputed',n_neighbors=5)
find_best_clusters.fit(cos_dist_london, top_similar_areas['cluster'])


best_clusters = pd.DataFrame({'cluster':list(find_best_clusters.classes_),
                              'probability':find_best_clusters.predict_proba(cos_dist_newyork.reshape(1,-1))[0]
                              }).query('probability>0').sort_values(['probability'], ascending = False)

best_clusters

Unnamed: 0,cluster,probability
1,1,0.42405
0,0,0.246223
2,2,0.169041
3,3,0.160686


In [104]:
best_clusters_areas = top_similar_areas.set_index('cluster').loc[best_clusters['cluster'].values, ['areaID', 'area']]
best_clusters_areas.reset_index(inplace=True)
best_clusters_areas.set_index('areaID', inplace=True)

best_clusters_areas

areaID,cluster,area
172,1,Fitzrovia
299,1,Marylebone (also St Marylebone)
406,1,Soho
53,1,Bloomsbury
113,1,Covent Garden
300,1,Mayfair
400,0,Shoreditch
269,2,Kingston upon Thames
212,2,Hampton Wick
397,3,Shepherd's Bush


The tables above show the most similar clusters and the areas that compose them.

As the similarity calculation was based on the proportion of themes (as defined above) composing an area, it is possible to visualise the theme composition of each cluster and compare it to the theme composition of SoHo.

In [105]:
""" 
=================================================
    Finds Themes for best clusters
=================================================
"""

# Areas in the best clusters


best_clusters_themes = best_clusters_areas.merge(area_themes_df,left_index=True,right_index=True)

best_clusters_themes = best_clusters_themes.groupby('cluster').mean(numeric_only=True)  # skip the text column 'area'

best_clusters_themes = best_clusters_themes.merge(best_clusters, left_on='cluster', right_on='cluster')

best_clusters_themes.drop(['cluster'], inplace=True, axis=1)
best_clusters_themes = best_clusters_themes.sort_values('probability', ascending=False)

# Adding New York for comparison
clusters_themes = pd.concat([pd.DataFrame(area_themes_df.loc[-1]).T, best_clusters_themes])  # DataFrame.append was removed in pandas 2.0

clusters_themes.columns = list(themes.values())+['similarity','probability']

# Ordering the table

temp = clusters_themes.drop(['similarity','probability'], axis=1).T.sort_values(-1, ascending=False)
temp2 = clusters_themes[['similarity','probability']].T

clusters_themes = pd.concat([temp, temp2])

clusters_themes

Unnamed: 0,-1,1,0,2,3
"Shops, Food & Drinks",0.791025,0.791164,0.824956,0.818866,0.777527
"Shops, Art & Entertainment",0.286002,0.256987,0.171053,0.190239,0.295624
Residential & Outdoor sport,0.050814,-0.051587,0.046084,0.098623,-0.011683
"Outdoor, Recreation & Shops",0.020506,0.078989,0.020517,0.156006,0.037745
Commuting & Shops,-0.078593,-0.101971,-0.123416,-0.078793,0.036479
Residential & Ethnic food,-0.086795,-0.065468,-0.095306,-0.173619,-0.106929
Transport,-0.097798,-0.068457,-0.104546,-0.046476,-0.055465
Nightlife & Entertainment,-0.112151,-0.033794,0.010567,-0.041894,-0.010793
"Fitness, Outdoor & Travel",-0.129454,-0.169312,-0.06458,-0.297501,-0.143002
Commuting & Food,-0.159602,0.020927,-0.090477,-0.058833,-0.19329


In the table above, SoHo, New York is identified by column -1; every other column identifies a cluster. SoHo is characterised as predominantly an area of Shops, Food & Drinks (0.79) and Shops, Art & Entertainment (0.29), and scores negatively on Fitness, Outdoor & Travel (-0.13) and Commuting & Food (-0.16). Although the mean similarity of cluster 1 (0.939781) is lower than that of cluster 0 (0.973975), KNN with k = 5 suggests that cluster 1 is a denser zone, with a higher probability for SoHo to belong to it (0.424050).

Map with clusters

In [106]:
""" 
========================================================
     Cluster centroids and size of the centroid circle
========================================================
"""


clusters_centroids = top_similar_areas.groupby('cluster')[['latitude', 'longitude']].mean()
clusters_centroids_radians = np.radians(clusters_centroids.values)


hav_dis = haversine_distances(top_similar_areas_coords_radians, clusters_centroids_radians,)*kms_per_radian*1000

cluster_radius = pd.DataFrame(hav_dis)
cluster_radius['cluster'] = top_similar_areas['cluster'].values

cluster_radius = pd.melt(cluster_radius, id_vars=['cluster'], var_name='cluster_to_join', value_name='radius')
cluster_radius = cluster_radius.query('cluster == cluster_to_join')
cluster_radius = cluster_radius.groupby('cluster')['radius'].max()+area_radius_meters


clusters_centroids = clusters_centroids.merge(cluster_radius, left_index=True, right_index=True)

In [108]:
""" 
=============================================
       Create map for best areas
==============================================
"""

location = 'London, England, United Kingdom'
map_centre = Nominatim(user_agent="myapp").geocode(location)

map_london = folium.Map(location=[map_centre.latitude,
                                  map_centre.longitude],
                        tiles = 'Stamen Terrain', #'Stamen Toner',
                        zoom_start=11)

map_dataframe = top_similar_areas.loc[best_clusters_areas.index.values]


# set color scheme
target_colour_column = 'cluster'
x = len(map_dataframe[target_colour_column].unique())
colors_array = cm.rainbow(np.linspace(0.1, 0.5, x))
rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow = pd.DataFrame({target_colour_column:map_dataframe[target_colour_column].unique(), 'colour':rainbow})
rainbow.set_index(target_colour_column, inplace = True)


# add markers to the map
    
for name, row in map_dataframe.iterrows():

    label = folium.Popup(str(str(row['cluster']) +' '+str(row['area']) +' '+str(row['similarity'])), parse_html=True)
    folium.Circle(location=[row['latitude'], row['longitude']],
                  radius=area_radius_meters,
                  popup=label,
                  color=rainbow.loc[row[target_colour_column], 'colour'],
                  fill=True,
                  fill_color=rainbow.loc[row[target_colour_column], 'colour'],
                  fill_opacity=(row['similarity']-0.89)*1/(1-0.89)
                  ).add_to(map_london)

for name, row in clusters_centroids.loc[map_dataframe['cluster'].unique()].iterrows():

    folium.Circle(location=[row['latitude'], row['longitude']],
                  radius=row['radius'],
                  color=rainbow.loc[name, 'colour']
                  ).add_to(map_london)

In [109]:
map_london

## Results and Discussion

Four main candidate zones are identified by selecting the London areas most similar to SoHo and grouping them by proximity.

In [110]:
best_clusters_areas

areaID,cluster,area
172,1,Fitzrovia
299,1,Marylebone (also St Marylebone)
406,1,Soho
53,1,Bloomsbury
113,1,Covent Garden
300,1,Mayfair
400,0,Shoreditch
269,2,Kingston upon Thames
212,2,Hampton Wick
397,3,Shepherd's Bush


* Cluster 0, a cluster composed only of the area of Shoreditch, a popular and fashionable part of London, particularly associated with the creative industries. Art galleries, bars, restaurants and media businesses are common in the area.
* Cluster 1, a large cluster composed of London areas belonging to the very central boroughs of Westminster and part of Camden. This cluster is dense (the boundaries of its areas overlap) and covers the largest area. The zone is famous for its reputation as a major entertainment district of London. It is filled with galleries, bars, restaurants and major theatres.
* Cluster 2, a cluster composed of two neighbouring areas, Kingston upon Thames and Hampton Wick. Although far from London's centre, Kingston is identified as a metropolitan area and is today a major retail centre, one of the biggest in the UK, receiving 18 million visitors a year.
* Cluster 3, a cluster composed of two areas hosting a major luxury retail centre and university campuses.

Several additional external factors could be taken into account to choose which cluster is optimal (population of the areas, percentage of tourists, proximity to tourist attractions, average cost of rental). For this project, however, only the previously defined internal metrics are used: the similarity of the cluster and the cluster density/area coverage.

In [114]:
print(best_clusters)

   cluster  probability
1        1     0.424050
0        0     0.246223
2        2     0.169041
3        3     0.160686


In [115]:
top_similar_areas.groupby('cluster')['similarity'].max().sort_values(ascending=False)[0:4]

cluster
0    0.973975
1    0.972802
2    0.962092
3    0.960121
Name: similarity, dtype: float64

In [116]:
cluster_radius[0:4]

cluster
0     500.000000
1    1676.553043
2     815.427496
3     960.810945
Name: radius, dtype: float64

Cluster 1 appears to be the optimal choice. Although its mean similarity is lower (0.94), it contains at least one area with a similarity almost identical to that of the only area composing cluster 0 (0.97). Furthermore, cluster 1 gives the client more options on where to open a venue, as it covers the largest area, with a radius of 1676 m compared to the 500 m radius of cluster 0.
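
The cluster radius quoted above can be re-derived by hand. The sketch below follows the same logic as the centroid cell: take the mean latitude and longitude of the areas in a cluster, find the largest haversine distance from that centroid to any member, and add the 500 m area radius. The helper name is hypothetical, and the coordinates are the rounded values of the cluster 1 areas shown earlier, so the result matches the reported figure only up to rounding.

```python
import math

EARTH_RADIUS_M = 6371008.8
AREA_RADIUS_M = 500


def haversine_m(p, q):
    """Great-circle distance in metres between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


cluster_1 = {
    "Fitzrovia": (51.518021, -0.136242),
    "Marylebone": (51.517305, -0.147803),
    "Soho": (51.517076, -0.133397),
    "Bloomsbury": (51.521478, -0.127451),
    "Covent Garden": (51.5115, -0.122095),
    "Mayfair": (51.508317, -0.148168),
}

# Centroid as the plain mean of latitudes and longitudes.
centroid = (
    sum(lat for lat, _ in cluster_1.values()) / len(cluster_1),
    sum(lon for _, lon in cluster_1.values()) / len(cluster_1),
)

# Farthest member from the centroid, plus the radius of a single area.
farthest = max(cluster_1, key=lambda name: haversine_m(centroid, cluster_1[name]))
radius = haversine_m(centroid, cluster_1[farthest]) + AREA_RADIUS_M

print(farthest, round(radius))
```

Averaging latitudes and longitudes is a reasonable approximation for a centroid only because the areas are a few kilometres apart at most.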

## Conclusion

The purpose of the project was to identify the area of London that could best reproduce the characteristics and atmosphere of Soho New York.

To compare London areas to SoHo, a measure of similarity was defined.

A few assumptions were made:

* The characteristics of an area can be inferred from the characteristics of the venues present in that area.
* A theme is a specific mix of related venue characteristics (e.g. the theme “Sport & Fitness” refers to all the venues related to sport and fitness, such as gyms, fitness centres, spas, golf courses, etc.).
* An area is defined by a combination of themes (e.g. an area could be 0.8 Sport & Fitness and 0.2 Residential).
* Similarity between two areas is calculated based on the proportion of the same themes they share.

Using geographical data and the venue dataset from the Foursquare API, the theme composition of every London area was derived, and similarities were calculated using cosine similarity.

Geographically closer areas were grouped into clusters.

A cluster composed of the central areas of London (Fitzrovia, Marylebone, Soho, Bloomsbury, Covent Garden, Mayfair) was selected as the optimal cluster: it combines areas that are among the most similar to SoHo with the largest coverage in metres, giving the client more options on where to open a venue.