# Problem statement

People move frequently from one city to another, perhaps one they have never been to before, be it for work or any other reason, and a lot of us would a place that most resembles the one they moved from, as it would be easier to adapt and live in. For this purpose, this project proposes an algorithm that measures the similarity between boroughs of different cities, based on their venues, and lists the most similar five boroughs in another city. Such an algorithm could be very interesting for those who move a lot because of their work, as well as adventurous individuals who love dynamic life style and move frequently between cities.

In [1]:
import re
import folium
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.cm as cm
import ipywidgets as widgets
import matplotlib.colors as colors
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize
from ipywidgets import interact, interact_manual
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

%matplotlib inline
pd.set_option('display.max_rows', 1000)

# Data

## Description

This project is an illustration of the idea and how it works, and therefore it won’t contain all the major cities in the world, however, two cities, London and Paris, are selected to demonstrate how the model works. To that end, the boroughs of city of London will be compared with those of Paris considering the type of venues each lie in each borough. The data required for this tasks is a list of London boroughs, a list of Paris boroughs, and lists of the venues in every borough. The data of the borough of both cities is extracted from Wikipedia pages, while the venues data is obtained using the Foursquare API.

In [2]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'W2K3FISCHVADQ5Q0EU4YYR50ASUUEHSVK0LLNSQ03P3YN320' # your Foursquare ID
CLIENT_SECRET = 'VZQYGDL05PJ5SSLQ0NDKHF0F4Y44C5AN0K2Y2CRBZ1K2U23O' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: W2K3FISCHVADQ5Q0EU4YYR50ASUUEHSVK0LLNSQ03P3YN320
CLIENT_SECRET:VZQYGDL05PJ5SSLQ0NDKHF0F4Y44C5AN0K2Y2CRBZ1K2U23O


## Loading and Preprocessing

In this section, data will be:
    - Loaded
    - Preprocessed
    - Visualized

### Loading and cleaning boroughs data

The data of the boroughs is extracted using the Pandas workframe, once the tables are extracted, they are stored in a dataframe for preprocessing to take place. The preprocessing includes correcting the columns headers, dropping the unnecessary columns, and cleaning the cells that contain alphanumeric values that are irrelevant.

#### London data

In [3]:
# extract tables from wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_London_boroughs'
wikitables_london = pd.read_html(url,  attrs={"class":"wikitable"})
print ("Extracted {num} wikitables".format(num=len(wikitables_london)))

Extracted 2 wikitables


In [4]:
# Renaming the headers
london = wikitables_london[0]
new_header = london.iloc[0]
london = london[1:]
london.columns = new_header
london.reset_index(inplace=True)

# Dropping unnecessary columns
london.drop(['index', 'Inner', 'Status', 'Co-ordinates', 'Local authority', 'Political control', 'Headquarters', 'Area (sq mi)',
         'Population (2013 est)[1]', 'Nr. in map'], axis=1, inplace=True)

london.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,Borough
0,Barking and Dagenham [note 1]
1,Barnet
2,Bexley
3,Brent
4,Bromley


In [5]:
# Cleaning cells with alphanumeric values, and preserving the boroughs names only
london['Borough'] = london['Borough'].apply(lambda x: re.sub(r'\W\w\w\w\w \d\W$', '', x))
london.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,Borough
0,Barking and Dagenham
1,Barnet
2,Bexley
3,Brent
4,Bromley


In [6]:
# Splitting the cells that contain more than one borough into several cells
result = []

for k, item in enumerate(london['Borough']):
    if len(item.split(' and ')) == 2:
        for i in range(2):
            result.append(item.split(' and ')[i])
        london.drop(k, axis=0, inplace=True)
        
            
for i, _ in enumerate(result):
    london.loc[1] = [result[i]]
    london.index = london.index + 1

# Sorting the dataframe alphabetically by the boroughs names
london = london.sort_index()
london.sort_values(by='Borough', axis=0, ascending=True, inplace=True)
london.reset_index(inplace=True)
london.drop('index', axis=1, inplace=True)
london.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':


Unnamed: 0,Borough
0,Barking
1,Bexley
2,Brent
3,Bromley
4,Camden


#### Paris data

In [7]:
# extract tables from wikipedia
url = 'https://en.wikipedia.org/wiki/Arrondissements_of_Paris'
wikitables_paris = pd.read_html(url,  attrs={"class":"wikitable sortable"}) 
print ("Extracted {num} wikitables".format(num=len(wikitables_paris)))

Extracted 1 wikitables


In [8]:
# Renaming the headers
paris = wikitables_paris[0]
new_header = paris.iloc[0]
paris = paris[1:]

# Dropping unnecessary columns
paris.columns = new_header
paris.drop(['Arrondissement (R for Right Bank, L for Left Bank)', 'Area (km2)', 'Population(March 1999 census)', 
        'Population(July 2005 estimate)', 'Density (2005)(inhabitants per km2)', 'Peak of population', 'Mayor'], axis=1, inplace=True)
paris.columns = ['Borough']
paris.head()

Unnamed: 0,Borough
1,Louvre
2,Bourse
3,Temple
4,Hôtel-de-Ville
5,Panthéon


In [9]:
# Sorting the dataframe alphabetically by the boroughs names
paris.sort_values(by='Borough', axis=0, ascending=True, inplace=True)
paris.reset_index(inplace=True)
paris.drop('index', axis=1, inplace=True)
paris.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,Borough
0,Batignolles-Monceau
1,Bourse
2,Butte-Montmartre
3,Buttes-Chaumont
4,Entrepôt


### Loading venues data from Foursquare API

To load the venues data of each borough, the geographical coordinates of the borough are necessary. For that purpose, the geopy library which could obtain the coordinates data based on the borough name is used. A function is defined (coordinates_finder) which takes the address as an input and spits out the latitude and the longitude of the address. This function is then used to find the coordinates of each borough, and the results are stored in a list, which is then joined with the dataframe that contains the boroughs names.

A second function is defined (getNearbyVenues) to extract a specific number of venues that are closest to the specified coordinates, as the retrieval of all the venues is not very practical. This function's parameters are the boroughs name, its latitude, longitude, and the radius which the extracted venues should be in. The results are then stored in the same dataframe that contains the boroughs data.

In [10]:
# Defining a function to compute the coordinates of a location using the geopy workframe
def coordinates_finder(address):
    geolocator = Nominatim(user_agent="my-application")
    location = geolocator.geocode(address, timeout=30)
    latitude = location.latitude
    longitude = location.longitude
    return [latitude, longitude]

In [11]:
# Extracting nearby venues data for each borough
def getNearbyVenues(names, latitudes, longitudes, radius=5000, LIMIT=100):
    
    i = 0
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
        i += 1

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### London boroughs venues data

In [12]:
# Obtain the coordinates of each borough
london_lat_list = []
london_long_list = []

for i in london['Borough'][:]:
    result = coordinates_finder(i)
    london_lat_list.append(result[0])
    london_long_list.append(result[1])

# Add the Latitude and Longitude columns to the boroughs dataframe
london['Latitude'] = london_lat_list
london['Longitude'] = london_long_list
london.head()

Unnamed: 0,Borough,Latitude,Longitude
0,Barking,51.538992,0.080424
1,Bexley,39.969238,-82.936864
2,Brent,30.471943,-87.246916
3,Bromley,51.402805,0.014814
4,Camden,39.94484,-75.119891


In [13]:
# Get the venues names, categories, and coordinates
london_venues = getNearbyVenues(names=london['Borough'],
                                   latitudes=london['Latitude'],
                                   longitudes=london['Longitude']
                                  )

london_venues.head()

Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Barking,51.538992,0.080424,Barking Park,51.545217,0.086134,Park
1,Barking,51.538992,0.080424,McDonald's,51.534031,0.053797,Fast Food Restaurant
2,Barking,51.538992,0.080424,Cristina's,51.536523,0.076672,Steakhouse
3,Barking,51.538992,0.080424,Eastbury Manor House,51.532973,0.099741,History Museum
4,Barking,51.538992,0.080424,Capital Karts,51.531792,0.118739,Go Kart Track


In [14]:
print('There are {} uniques categories.'.format(len(london_venues['Venue Category'].unique())))

There are 282 uniques categories.


#### Paris boroughs venues data

In [15]:
# Obtain the coordinates of each borough
paris_lat_list = []
paris_long_list = []

for i in paris['Borough'][:]:
    result = coordinates_finder(i)
    paris_lat_list.append(result[0])
    paris_long_list.append(result[1])

# Add the Latitude and Longitude columns to the boroughs dataframe
paris['Latitude'] = paris_lat_list
paris['Longitude'] = paris_long_list
paris.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':


Unnamed: 0,Borough,Latitude,Longitude
0,Batignolles-Monceau,48.880292,2.308593
1,Bourse,48.86863,2.341474
2,Butte-Montmartre,48.892126,2.348178
3,Buttes-Chaumont,48.878396,2.381201
4,Entrepôt,48.876008,2.360445


In [16]:
# Get the venues names, categories, and coordinates
paris_venues = getNearbyVenues(names=paris['Borough'],
                                   latitudes=paris['Latitude'],
                                   longitudes=paris['Longitude']
                                  )

paris_venues.head()

Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Batignolles-Monceau,48.880292,2.308593,Parc Monceau,48.87958,2.30918,Park
1,Batignolles-Monceau,48.880292,2.308593,Musée Jacquemart-André,48.875434,2.310541,Art Museum
2,Batignolles-Monceau,48.880292,2.308593,La Marée,48.877336,2.300488,Seafood Restaurant
3,Batignolles-Monceau,48.880292,2.308593,Hôtel Le Royal Monceau Raffles,48.875931,2.30027,Hotel
4,Batignolles-Monceau,48.880292,2.308593,Boulangerie-Pâtisserie Lohezic,48.883329,2.298937,Bakery


In [17]:
print('There are {} uniques categories.'.format(len(paris_venues['Venue Category'].unique())))

There are 165 uniques categories.


### Visualizing data

To visualize the boroughs locations on the map and how disperse they are, folium library of python is used, which enables drawing on maps.

In [18]:
# create map of New York using latitude and longitude values
london_latitude, london_longitude = coordinates_finder('London')
map_london = folium.Map(location=[london_latitude, london_longitude], zoom_start=11)

# add markers to map
for lat, lng, borough in zip(london['Latitude'], london['Longitude'], london['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)  

map_london

In [19]:
# create map of New York using latitude and longitude values
paris_latitude, paris_longitude = coordinates_finder('Paris')
map_paris = folium.Map(location=[paris_latitude, paris_longitude], zoom_start=13)

# add markers to map
for lat, lng, borough in zip(paris['Latitude'], paris['Longitude'], paris['Borough']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  

map_paris

# Methodology

One-hot encoding method is used to convert the categorical values of the venues column in the dataframe to a numerical value. A column is created for each venue, if such a venue lies in the borough, a value of 1 is awarded, otherwise a value of 0 is given. Later on, the venues that are missing from London borough and exist in Paris boroughs are added to the dataframe and parsed with zeros, and the same is done for Paris venues. Finally, a similarity measuring algorithms, that employs the Euclidean distance and Cosine similarity, are used to compute the similarity between a given borough of London and all the boroughs of Paris, the results sorted by the similarity index, and the top five are selected and displayed to the user. The results of the two algorithms are discussed in the next sections.

## One-hot encoding

### London

In [20]:
# one hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
london_onehot['Borough'] = london_venues['Borough'] 

# move borough column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

Unnamed: 0,Borough,ATM,African Restaurant,Airport,Airport Service,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Barking,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Barking,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Barking,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Barking,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Barking,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Grouping venues by borough
london_grouped = london_onehot.groupby('Borough').max().reset_index()
london_grouped.head()

Unnamed: 0,Borough,ATM,African Restaurant,Airport,Airport Service,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Barking,0,0,1,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,Bexley,1,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,1,1,0,0
2,Brent,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,Bromley,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Camden,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0


### Paris

In [22]:
# one hot encoding
paris_onehot = pd.get_dummies(paris_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
paris_onehot['Borough'] = paris_venues['Borough']

# move borough column to the first column
fixed_columns = [paris_onehot.columns[-1]] + list(paris_onehot.columns[:-1])
paris_onehot = paris_onehot[fixed_columns]

paris_onehot.head()

Unnamed: 0,Borough,Accessories Store,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Automotive Shop,BBQ Joint,Bakery,...,Trattoria/Osteria,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Batignolles-Monceau,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Batignolles-Monceau,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Batignolles-Monceau,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Batignolles-Monceau,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Batignolles-Monceau,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# Grouping venues by borough
paris_grouped = paris_onehot.groupby('Borough').max().reset_index()
paris_grouped.head()

Unnamed: 0,Borough,Accessories Store,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Automotive Shop,BBQ Joint,Bakery,...,Trattoria/Osteria,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Batignolles-Monceau,0,0,0,0,1,0,0,0,1,...,0,1,0,0,0,0,1,0,0,1
1,Bourse,0,0,1,0,1,0,0,0,1,...,0,1,1,0,0,0,1,1,0,1
2,Butte-Montmartre,0,0,1,0,1,1,0,0,1,...,0,1,1,0,1,0,1,1,0,1
3,Buttes-Chaumont,0,0,1,1,1,1,0,0,1,...,1,0,1,0,1,0,1,0,0,0
4,Entrepôt,0,0,1,1,1,1,0,0,1,...,1,0,1,0,1,0,1,1,0,0


### Deploying the algorithm

In [24]:
# Parsing the missing venues in London boroughs with zeros
london_titles = []
titles = list(london_grouped.columns)

for i in paris_grouped.columns:
    if i not in london_grouped.columns:
        london_titles.append(i)

for i in london_titles:
    titles.append(i)

london_missing_venues = {x: [0.0 for _ in range(london_grouped.shape[0])] for x in london_titles}
london_missing_venues = pd.DataFrame(london_missing_venues, index=london_grouped.index)
london_grouped = pd.concat([london_grouped, london_missing_venues], axis=1)
london_grouped.set_index(['Borough'], inplace=True)
london_grouped.head()

Unnamed: 0_level_0,ATM,African Restaurant,Airport,Airport Service,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Souvenir Shop,Speakeasy,Swiss Restaurant,Szechuan Restaurant,Tailor Shop,Tech Startup,Temple,Tourist Information Center,Trattoria/Osteria,Udon Restaurant
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Barking,0,0,1,1,1,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bexley,1,0,0,0,1,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Brent,0,0,0,0,1,0,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bromley,0,0,0,0,1,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Camden,0,0,0,0,1,0,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# Parsing the missing venues in Paris boroughs with zeros
paris_titles = []
titles = list(paris_grouped.columns)

for i in london_grouped.columns:
    if i not in paris_grouped.columns:
        paris_titles.append(i)

for i in paris_titles:
    titles.append(i)

paris_missing_venues = {x: [0.0 for _ in range(paris_grouped.shape[0])] for x in paris_titles}
paris_missing_venues = pd.DataFrame(paris_missing_venues, index=paris_grouped.index)
paris_grouped = pd.concat([paris_grouped, paris_missing_venues], axis=1)
paris_grouped.set_index(['Borough'], inplace=True)
paris_grouped.head()

Unnamed: 0_level_0,Accessories Store,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Automotive Shop,BBQ Joint,Bakery,Bar,...,Tennis Stadium,Thrift / Vintage Store,Town Hall,Turkish Restaurant,Video Game Store,Water Park,Waterfront,Whisky Bar,Yoga Studio,Zoo Exhibit
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Batignolles-Monceau,0,0,0,0,1,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bourse,0,0,1,0,1,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Butte-Montmartre,0,0,1,0,1,1,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Buttes-Chaumont,0,0,1,1,1,1,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Entrepôt,0,0,1,1,1,1,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Computing the similarity index using the euclidean algorithm
# NB: This cell's output will only appear when jupyter notebook is used 
london_boroughs_list = []
similarity_index_list = []

# Fetching london boroughs one by one
for i in london_grouped.index:
    london_boroughs_list.append(i)
    
@interact
def show_top_five_similar_boroughs(London=london_boroughs_list):
    if similarity_index_list != []:
        del similarity_index_list[:]
        
    for i in paris_grouped.index:
        similarity_index_list.append(euclidean_distances(london_grouped[ london_grouped.index == London], paris_grouped[ paris_grouped.index == i])[0][0])
        
    similarity_index = pd.DataFrame({'Euclidean similarity': similarity_index_list}, index=paris_grouped.index)
    similarity_index.sort_values(by='Euclidean similarity', axis=0, ascending=True, inplace=True)
    
    return similarity_index.head()

interactive(children=(Dropdown(description='London', options=('Barking', 'Bexley', 'Brent', 'Bromley', 'Camden…

In [27]:
# Computing the similarity index using the cosine similarity algorithm
# NB: This cell's output will only appear when jupyter notebook is used 
@interact
def show_top_five_similar_boroughs(London=london_boroughs_list):
    if similarity_index_list != []:
        del similarity_index_list[:]
        
    for i in paris_grouped.index:
        similarity_index_list.append(cosine_similarity(london_grouped[ london_grouped.index == London],
                                                       paris_grouped[ paris_grouped.index == i])[0][0])
        
    similarity_index = pd.DataFrame({'Cosine similarity': similarity_index_list}, index=paris_grouped.index)
    similarity_index.sort_values(by='Cosine similarity', axis=0, ascending=False, inplace=True)
    
    return similarity_index.head()

interactive(children=(Dropdown(description='London', options=('Barking', 'Bexley', 'Brent', 'Bromley', 'Camden…

The results of the first algorithm are sorted ascendingly because the Euclidean similarity algorithm measures, in essence, dissimilarity, and not similarity, while the results of the Cosine similarity algorithm are sorted in a descending manner as the algorithm measures similarity, and hence, the larger the output the more similar the result is to the input.

# Discussion

It’s noticeable that the results of the two algorithms are somewhat significantly different, as only two of the boroughs suggested are common between the two algorithms’ results. In most cases less than three boroughs are common between the suggestions of the two algorithms, and many times, there will be only one borough or perhaps none they both recommend. This is due to the fact that Euclidean algorithm measures the distance between the vectors that represent the boroughs to be compared, while the Cosine algorithm measures the angle between the two vectors.
Therefore, if the two vectors were to be in the same direction, but with different magnitudes the first algorithm would still capture the difference between the two, while the latter would consider them completely similar. In other words, the first algorithm takes magnitude of the vector into account while latter doesn’t, and since the magnitude of the vectors vary from one to another, the similarity index should account for it. This leads to prioritizing the results yielded by the Euclidean distance, as it is more suitable to the type of data we are dealing with here.


# Conclusion

In conclusion, the approach described here might useful in quantifying the similarity between cities or boroughs, however, it should be noted that only the venues in a specific borough are used to characterize the similarity, which might not be ideal as there are other parameters that should be considered such as the weather, how polluted it is, the language spoken and culture practiced by the locals, and others, and that could be the scope of the future works that would like to contribute to this problem.