# Clustering Capitals From Europe
In the first part of this notebook we cluster the European capitals based on the frequency of their venue categories. For that, we use the K-Means algorithm from sklearn.
In the second part, we choose a city elsewhere in the world and ask which European capitals are most similar to it, based on its most frequent venue categories.


# Part 1: Clustering Capitals From Europe

### 1.1 Importing libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import folium
import json
import matplotlib.pyplot as plt

### 1.2 Reading our Data

In [2]:
europe_venues= pd.read_csv('europe_venues_complete.csv')
europe_venues = europe_venues.drop('Unnamed: 0',axis=1)
europe_venues.head()

Unnamed: 0,Country,Capital,Capital Latitude,Capital Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albania,Tirana,41.327946,19.818532,Artigiano at Vila,41.322315,19.822859,Italian Restaurant
1,Albania,Tirana,41.327946,19.818532,à la Santé,41.32076,19.814221,French Restaurant
2,Albania,Tirana,41.327946,19.818532,Artigiano,41.319398,19.818842,Italian Restaurant
3,Albania,Tirana,41.327946,19.818532,SALT,41.32194,19.817314,Bistro
4,Albania,Tirana,41.327946,19.818532,Era Restaurant & Pizzeria,41.320253,19.814534,Pizza Place


### 1.3 Preparing our Data
First let's join some categories that mean almost the same thing. 

In [3]:
europe_venues = europe_venues.replace(['Coffee Shop','Café'],'Coffee')
europe_venues = europe_venues.replace(['Gym / Fitness Center'],'Gym')

The code below makes one column per Venue Category

In [4]:
europe_onehot = pd.get_dummies(europe_venues['Venue Category'],prefix="",prefix_sep="")
europe_onehot['Capital'] = europe_venues['Capital']
europe_onehot = europe_onehot[[europe_onehot.columns[-1]]+list(europe_onehot.columns[0:-1])]
europe_onehot.head()

Unnamed: 0,Capital,ATM,Accessories Store,Adult Boutique,African Restaurant,Agriturismo,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,...,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Tirana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Tirana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Tirana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Tirana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Tirana,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
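
As a small illustration of the one-hot step above, here is a sketch on a three-row toy table. The data is made up for the example and does not come from `europe_venues_complete.csv`; the name `toy_onehot` is hypothetical.

```python
import pandas as pd

# Toy venue list (hypothetical data, only for illustration)
toy = pd.DataFrame({
    "Capital": ["Tirana", "Tirana", "Rome"],
    "Venue Category": ["Bistro", "Pizza Place", "Bistro"],
})

# One indicator column per category; the empty prefix keeps the bare category name
toy_onehot = pd.get_dummies(toy["Venue Category"], prefix="", prefix_sep="")

# Put the Capital column first, as the notebook cell does by reordering columns
toy_onehot.insert(0, "Capital", toy["Capital"])
print(toy_onehot)
```

Each venue becomes one row with a single active indicator, so the table has as many rows as venues and one column per distinct category.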


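Next we remove a few very common categories and then group the rows by Capital, taking the mean of each one-hot column. The mean gives the relative frequency of each venue category in that capital.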

#### Removing some categories from our analysis

In [5]:
europe_onehot = europe_onehot.drop('Coffee',axis=1)
europe_onehot = europe_onehot.drop('Supermarket',axis=1)
europe_onehot = europe_onehot.drop('Park',axis=1)
europe_onehot = europe_onehot.drop('Gym',axis=1)
europe_onehot.shape

(25322, 474)

#### Grouping rows by Capital

In [7]:
europe_grouped = europe_onehot.groupby('Capital').mean().reset_index()
europe_grouped.head()

Unnamed: 0,Capital,ATM,Accessories Store,Adult Boutique,African Restaurant,Agriturismo,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,...,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Amsterdam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.001742,0.008711,0.0,0.0,0.0,0.001742,0.020906,0.001742,0.010453
1,Andorra la Vella,0.0,0.002545,0.0,0.0,0.0,0.002545,0.0,0.0,0.007634,...,0.0,0.0,0.002545,0.0,0.0,0.0,0.002545,0.0,0.0,0.0
2,Ankara,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.001675,0.001675,0.0,0.0,0.00335,0.00335,0.0,0.0
3,Astana,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.005764,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Athens,0.0,0.0,0.0,0.0,0.0,0.003311,0.0,0.0,0.0,...,0.0,0.006623,0.013245,0.0,0.0,0.0,0.001656,0.003311,0.0,0.0
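
To see why the mean of the one-hot columns is a frequency, here is a minimal sketch on toy data (the values and the names `toy_onehot` and `toy_grouped` are hypothetical).

```python
import pandas as pd

# Four toy venues: three in Tirana, one in Rome (hypothetical data)
toy_onehot = pd.DataFrame({
    "Capital": ["Tirana", "Tirana", "Tirana", "Rome"],
    "Bistro": [1, 0, 0, 1],
    "Pizza Place": [0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby("Capital").mean().reset_index()
print(toy_grouped)
```

In this toy table every row sums to 1 because no category was dropped. In the notebook the rows sum to less than 1, since 'Coffee', 'Supermarket', 'Park' and 'Gym' were removed after the one-hot encoding.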


The function below sorts a row by the frequency of each venue category and returns the names of the 'n' most frequent categories.

In [8]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
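
A quick usage sketch of the function on a hand-made row (the row values are hypothetical; the function body is copied from the cell above so the block runs on its own).

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the Capital name) and rank the remaining frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row in the same layout as europe_grouped: name first, then frequencies
toy_row = pd.Series({"Capital": "Rome", "Bar": 0.2, "Plaza": 0.5, "Museum": 0.3})

top2 = return_most_common_venues(toy_row, 2)
print(list(top2))  # ['Plaza', 'Museum']
```

The function returns category names, not frequencies, so ties between categories are resolved by whatever order the sort produces.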

The code below creates a new dataframe with the 'n' most popular venues of each Capital.

In [15]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Capital']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))


capitals_venues_sorted = pd.DataFrame(columns=columns)
capitals_venues_sorted['Capital'] = europe_grouped['Capital']

for ind in np.arange(europe_grouped.shape[0]):
    capitals_venues_sorted.iloc[ind, 1:] = return_most_common_venues(europe_grouped.iloc[ind, :], num_top_venues)

capitals_venues_sorted.head(10)

Unnamed: 0,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amsterdam,Bar,Marijuana Dispensary,Restaurant,Museum,Yoga Studio,Art Museum,Clothing Store,Plaza,Music Venue,Sandwich Place
1,Andorra la Vella,Restaurant,Ski Area,Plaza,Ski Chairlift,Bar,Pub,Clothing Store,Perfume Shop,Sporting Goods Shop,Ski Trail
2,Ankara,Pub,Bar,Theater,Dance Studio,Shopping Mall,Music Venue,Grocery Store,History Museum,Art Gallery,Restaurant
3,Astana,Shopping Mall,Restaurant,Electronics Store,Italian Restaurant,Asian Restaurant,Grocery Store,Hotel Bar,Plaza,Eastern European Restaurant,Bar
4,Athens,Bar,Theater,Cocktail Bar,Clothing Store,Plaza,Movie Theater,Greek Restaurant,Electronics Store,Historic Site,History Museum
5,Baku,Lounge,Restaurant,Pub,Movie Theater,Tea Room,Grocery Store,Shopping Mall,Concert Hall,Italian Restaurant,Big Box Store
6,Belgrade,Bar,Restaurant,Art Gallery,Theater,Museum,Cosmetics Shop,Jazz Club,Pizza Place,Plaza,Farmers Market
7,Berlin,Bar,Cocktail Bar,Wine Bar,Plaza,Clothing Store,Indie Movie Theater,History Museum,Organic Grocery,Art Museum,Art Gallery
8,Bern,Plaza,Bar,Swiss Restaurant,Grocery Store,Restaurant,Italian Restaurant,Shopping Mall,Movie Theater,Bakery,Scenic Lookout
9,Bratislava,Art Gallery,Pub,Theater,Wine Bar,Bar,Plaza,Clothing Store,Vegetarian / Vegan Restaurant,Brewery,Italian Restaurant


In [10]:
%matplotlib
def plot_category_frequency(data,t_title):
    count ={}
    names = []
    columns = len(data.columns)
    for index,row in data.iterrows():
        for num in range(1,columns):
            # Positional access; column 0 (Capital or Cluster Labels) is skipped
            category = row.iloc[num]
            if category not in count:
                count[category] = 1
                names.append(category)
            else:
                count[category] += 1
    frequency = pd.DataFrame(count,index=['count']).sort_values('count',ascending=False,axis=1).T
    frequency = frequency.iloc[:5,:]
    frequency.plot.bar(rot=0, fontsize= 7, title= t_title )
    plt.xlabel('Categories')
    plt.ylabel('Num of Countries')


Using matplotlib backend: Qt5Agg


In [11]:
plot_category_frequency(capitals_venues_sorted,'Top 5 most Frequent Venues Categories')

## 1.4 Clustering with K-Means

In [16]:
from sklearn.cluster import KMeans
k = 10
europe_clusters = europe_grouped.drop('Capital',axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(europe_clusters)
kmeans.labels_

array([9, 1, 2, 7, 8, 2, 9, 0, 5, 2, 9, 2, 0, 2, 5, 2, 5, 0, 0, 9, 0, 4,
       0, 2, 6, 0, 9, 5, 0, 1, 0, 5, 9, 3, 3, 7, 9, 9, 5, 2, 0, 7, 1, 0,
       0, 2, 0, 2, 9], dtype=int32)
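
To make the meaning of `labels_` concrete, here is a minimal sketch of the same `KMeans` call on a tiny, hand-made frequency matrix. The data, the choice of two clusters and the names `X`, `toy_kmeans` and `toy_labels` are assumptions for the example, not part of the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four hypothetical cities described by three category frequencies each.
# The first two are dominated by category 0, the last two by category 2.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly so the call behaves the same across sklearn versions
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Cities with similar frequency profiles receive the same label. The label numbers themselves are arbitrary, so only the grouping is meaningful.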

In [17]:
capitals_venues_sorted.insert(0,'Cluster Labels',kmeans.labels_)

europe_merged = europe_venues.drop_duplicates('Country').iloc[:,0:4]

europe_merged = europe_merged.join(capitals_venues_sorted.set_index('Capital'), on = 'Capital')

europe_merged

Unnamed: 0,Country,Capital,Capital Latitude,Capital Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Albania,Tirana,41.327946,19.818532,7,Bar,Shopping Mall,Italian Restaurant,Cocktail Bar,Plaza,Restaurant,Eastern European Restaurant,Nightclub,Lounge,Pizza Place
235,Andorra,Andorra la Vella,42.506939,1.521247,1,Restaurant,Ski Area,Plaza,Ski Chairlift,Bar,Pub,Clothing Store,Perfume Shop,Sporting Goods Shop,Ski Trail
628,Armenia,Yerevan,40.177612,44.512585,2,Restaurant,Pub,Plaza,Grocery Store,Shopping Mall,Clothing Store,Bakery,Theater,Museum,Italian Restaurant
1033,Austria,Vienna,48.208354,16.372504,0,Plaza,Austrian Restaurant,Restaurant,Cocktail Bar,Bar,Theater,History Museum,Concert Hall,Zoo Exhibit,Italian Restaurant
1602,Azerbaijan,Baku,40.375443,49.832675,2,Lounge,Restaurant,Pub,Movie Theater,Tea Room,Grocery Store,Shopping Mall,Concert Hall,Italian Restaurant,Big Box Store
2155,Belarus,Minsk,53.902334,27.561879,2,Shopping Mall,Big Box Store,Movie Theater,Theater,Grocery Store,Restaurant,Hookah Bar,Bookstore,Concert Hall,Bar
2716,Belgium,Brussels,50.846557,4.351697,9,Bar,Plaza,Sandwich Place,Italian Restaurant,Tea Room,Concert Hall,Wine Bar,Art Museum,Cocktail Bar,History Museum
3330,Bosnia and Herzegovina,Sarajevo,43.851977,18.386687,7,Shopping Mall,Restaurant,Eastern European Restaurant,Bar,History Museum,Pub,Grocery Store,Plaza,Italian Restaurant,Hookah Bar
3685,Bulgaria,Sofia,42.697863,23.322179,9,Bar,Bakery,Restaurant,Theater,Italian Restaurant,Cocktail Bar,Dance Studio,Dessert Shop,Art Gallery,Plaza
4253,Croatia,Zagreb,45.813177,15.977048,9,Bar,Plaza,Restaurant,Theater,Grocery Store,Dessert Shop,Museum,Trail,BBQ Joint,Pub


## 1.5 Visualizing the Results on a Map

In [18]:
import matplotlib.cm as cm
import matplotlib.colors as colors

adress = "Europe"
geolocator = Nominatim(user_agent = "europe_explorer",timeout=3)
local = geolocator.geocode(adress)
latitude = local.latitude
longitude = local.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=3)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, capital, cluster in zip(europe_merged['Capital Latitude'], europe_merged['Capital Longitude'], europe_merged['Capital'], europe_merged['Cluster Labels']):
    if( np.isnan(cluster)): cluster = -1
    label = folium.Popup(str(capital) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
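
The cell above looks up colors with `rainbow[int(cluster-1)]`, which still gives each of the 10 clusters its own color because a negative index wraps around to the end of the list. An alternative sketch, assuming you prefer direct indexing and a neutral color for capitals without a label, could look like this. The helper `color_for` and the grey fallback are hypothetical.

```python
import numpy as np
from matplotlib import cm, colors

k = 10  # number of clusters, as in the notebook

# One hex color per cluster, evenly spaced along the rainbow colormap
palette = [colors.to_hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

def color_for(cluster, fallback="#808080"):
    """Return the hex color for a cluster label, or grey when the label is missing."""
    if cluster is None or np.isnan(cluster):
        return fallback
    return palette[int(cluster)]

print(color_for(0), color_for(9), color_for(float("nan")))
```

With this mapping, cluster `i` always uses `palette[i]`, and a missing label cannot share a color with a real cluster.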

## 1.6 Inspecting each cluster

### Cluster 0

In [120]:
pd.set_option('display.max_columns', 30)
europe_merged.loc[europe_merged['Cluster Labels']== 0,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 0,'Cluster Labels':'5th Most Common Venue'],'Cluster 0 Top 5 Venues')

### Cluster 1

In [121]:
europe_merged.loc[europe_merged['Cluster Labels']== 1,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 1,'Cluster Labels':],'Cluster 1 Top Venues')

### Cluster 2

In [122]:
europe_merged.loc[europe_merged['Cluster Labels']== 2,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 2,'Cluster Labels':],'Cluster 2 Top Venues')

### Cluster 3

In [34]:
europe_merged.loc[europe_merged['Cluster Labels']== 3,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)

#plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 3,'Cluster Labels':],'Cluster 3 Top Venues')

Unnamed: 0,Country,Capital,Capital Latitude,Capital Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
11597,Italy,Rome,41.894802,12.485338,3,Plaza,Italian Restaurant,Art Museum,Cocktail Bar,Wine Bar
19744,San Marino,San Marino,43.945862,12.458306,3,Italian Restaurant,Plaza,Beach,Bar,Nightclub


### Cluster 4

In [53]:
europe_merged.loc[europe_merged['Cluster Labels']== 4,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
#plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 4,'Cluster Labels':],'Cluster 4 Top Venues')

Unnamed: 0,Country,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Luxembourg,Luxembourg,Italian Restaurant,Campground,French Restaurant,Bar,Shopping Mall,Restaurant,Bakery,Castle,Pizza Place,Scenic Lookout


### Cluster 5

In [54]:
europe_merged.loc[europe_merged['Cluster Labels']== 5,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
#plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 5,'Cluster Labels':],'Cluster 5 Top Venues')

Unnamed: 0,Country,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Denmark,Copenhagen,Grocery Store,Cocktail Bar,Theater,Music Venue,Scandinavian Restaurant,Plaza,Wine Bar,Beer Bar,Bar,Bakery
1,Finland,Helsinki,Grocery Store,Scandinavian Restaurant,Bar,Theater,Art Museum,Bakery,History Museum,Dance Studio,Wine Bar,Pizza Place
2,Iceland,Reykjavík,Bar,Grocery Store,Seafood Restaurant,Pool,Restaurant,Theater,Pizza Place,Art Museum,Scandinavian Restaurant,Burger Joint
3,Norway,Oslo,Grocery Store,Bar,History Museum,Theater,Bakery,Cocktail Bar,Ski Lodge,Music Venue,Movie Theater,Burger Joint
4,Sweden,Stockholm,Scandinavian Restaurant,Grocery Store,Plaza,Theater,Museum,Movie Theater,Art Gallery,History Museum,Bakery,Cocktail Bar
5,Switzerland,Bern,Plaza,Bar,Swiss Restaurant,Grocery Store,Restaurant,Italian Restaurant,Shopping Mall,Movie Theater,Bakery,Scenic Lookout


### Cluster 6

In [55]:

europe_merged.loc[europe_merged['Cluster Labels']== 6,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
#plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 6,'Cluster Labels':],'Cluster 6 Top Venues')

Unnamed: 0,Country,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Monaco,Monaco,Beach,French Restaurant,Italian Restaurant,Bar,Restaurant,Museum,Art Museum,Clothing Store,Garden,Mediterranean Restaurant


### Cluster 7

In [56]:
europe_merged.loc[europe_merged['Cluster Labels']== 7,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
#plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 7,'Cluster Labels':],'Cluster 7 Top Venues')

Unnamed: 0,Country,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Albania,Tirana,Bar,Shopping Mall,Italian Restaurant,Cocktail Bar,Plaza,Restaurant,Eastern European Restaurant,Nightclub,Lounge,Pizza Place
1,Bosnia and Herzegovina,Sarajevo,Shopping Mall,Restaurant,Eastern European Restaurant,Bar,History Museum,Pub,Grocery Store,Plaza,Italian Restaurant,Hookah Bar
2,Kazakhstan,Astana,Shopping Mall,Restaurant,Electronics Store,Italian Restaurant,Asian Restaurant,Grocery Store,Hotel Bar,Plaza,Eastern European Restaurant,Bar


### Cluster 8

In [57]:
europe_merged.loc[europe_merged['Cluster Labels']== 8,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)

Unnamed: 0,Country,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Greece,Athens,Bar,Theater,Cocktail Bar,Clothing Store,Plaza,Movie Theater,Greek Restaurant,Electronics Store,Historic Site,History Museum


### Cluster 9

In [59]:
europe_merged.loc[europe_merged['Cluster Labels']== 9,:].reset_index().drop(['index','Cluster Labels','Capital Latitude','Capital Longitude'],axis=1)
plot_category_frequency(europe_merged.loc[europe_merged['Cluster Labels']== 9,'Cluster Labels':],'Cluster 9 Top Venues')

# Part 2: Comparing a City with the European Capitals

### Let's compare the Capitals from Europe with Toronto, Canada.

## 2.1 Getting the Latitude and Longitude of Toronto

In [161]:
World = ['Canada']
City= ['Toronto']
Lat = []
Lon = []
for w, c in zip(World,City):
    adress = "{}, {}".format(c,w)
    geolocator = Nominatim(user_agent = "europe_explorer",timeout=3)
    local = geolocator.geocode(adress)
    latitude = local.latitude
    longitude = local.longitude
    print(latitude,longitude)
    Lat.append(latitude)
    Lon.append(longitude)

43.653963 -79.387207


In [162]:
df_cities = pd.DataFrame({'Country':World,'City':City,'Latitude':Lat,'Longitude':Lon})
df_cities.head()

Unnamed: 0,Country,City,Latitude,Longitude
0,Canada,Toronto,43.653963,-79.387207


## 2.2 Getting the Venues with the Foursquare API

In [163]:
with open('Foursquare_Developer.json') as fs:
    credentials = json.load(fs)
CLIENT_ID = credentials["Client ID"] 
CLIENT_SECRET = credentials["Client SECRET"] 
VERSION = '20180605'
RADIUS = 20000
LIMIT = 200

In [164]:
def getNearbyVenues(countries, cities, latitudes, longitudes, radius):
    
    venues_list=[]
    section = ['food','drinks','coffee','shops','arts','outdoors','sights'] 
    for country, city, lat, lng in zip(countries, cities, latitudes, longitudes):
        #print(country)
        for s in section:
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT,
                s)
            
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']
        
            # return only relevant information for each nearby venue
            venues_list.append([(
                country,
                city, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Country',
                             'Capital', 
                  'Capital Latitude', 
                  'Capital Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
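
The list comprehension in `getNearbyVenues` assumes every venue has at least one category. A more defensive variant of the parsing step could be sketched as below. The helper name `extract_venue_rows` is hypothetical, and the response layout is inferred only from the indexing used in the cell above; the payload is hand-written for the example and is not a real API response.

```python
def extract_venue_rows(response_json, country, city, lat, lng):
    """Turn one decoded response into row tuples, skipping venues without a category."""
    items = response_json["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category to one-hot encode later, so skip the venue
        rows.append((
            country, city, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows

# Hand-written payload with the same nesting the notebook code relies on
fake_response = {"response": {"groups": [{"items": [
    {"venue": {"name": "SALT", "location": {"lat": 41.32, "lng": 19.81},
               "categories": [{"name": "Bistro"}]}},
    {"venue": {"name": "No Category Place", "location": {"lat": 41.33, "lng": 19.82},
               "categories": []}},
]}]}}

rows = extract_venue_rows(fake_response, "Albania", "Tirana", 41.327946, 19.818532)
print(rows)
```

Each tuple has the same eight fields, in the same order, as the column names assigned to `nearby_venues`.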

Getting the venues

In [165]:
world_venues = getNearbyVenues(countries =df_cities['Country'],
                                cities = df_cities['City'], 
                               latitudes= df_cities['Latitude'],
                               longitudes=df_cities['Longitude'], radius = RADIUS)
world_venues.shape

(700, 8)

Removing possible duplicates

In [197]:
world_venues = world_venues.drop_duplicates()
world_venues.shape

(576, 8)

## 2.3 Preparing our Data

The same steps as in Part 1 are applied to prepare the data.

In [198]:
world_venues = world_venues.replace(['Coffee Shop','Café'],'Coffee')
world_venues = world_venues.replace(['Gym / Fitness Center'],'Gym')

In [199]:
world_onehot = pd.get_dummies(world_venues['Venue Category'],prefix="",prefix_sep="")
world_onehot['Capital'] = world_venues['Capital']
world_onehot = world_onehot[[world_onehot.columns[-1]]+list(world_onehot.columns[0:-1])]
world_onehot = world_onehot.drop('Coffee',axis=1)
world_onehot = world_onehot.drop('Supermarket',axis=1)
world_onehot = world_onehot.drop('Park',axis=1)
world_onehot = world_onehot.drop('Gym',axis=1)



In [200]:
world_grouped = world_onehot.groupby('Capital').mean().reset_index()
world_grouped.head()

Unnamed: 0,Capital,American Restaurant,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,...,Street Art,Sushi Restaurant,Taco Place,Tapas Restaurant,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Yoga Studio
0,Toronto,0.001736,0.001736,0.001736,0.020833,0.003472,0.005208,0.003472,0.003472,0.003472,0.010417,0.036458,0.001736,0.003472,0.001736,...,0.001736,0.003472,0.003472,0.003472,0.006944,0.006944,0.029514,0.001736,0.001736,0.006944,0.001736,0.010417,0.001736,0.006944,0.005208


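The code below aligns the new data with the columns used in Part 1, so that the model trained there can classify it. Categories that appear in Europe but not in Toronto are filled with 0.0, and categories that appear only in Toronto are dropped.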

In [201]:
test= pd.DataFrame(columns = europe_grouped.columns)
for c in europe_grouped.columns:
    if c in world_grouped.columns:
        test[c] = world_grouped[c]
    else:
        test[c] = 0.0
test.head()

Unnamed: 0,Capital,ATM,Accessories Store,Adult Boutique,African Restaurant,Agriturismo,American Restaurant,Amphitheater,Antique Shop,Apres Ski Bar,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vineyard,Volleyball Court,Warehouse Store,Water Park,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Toronto,0.0,0.0,0.0,0.0,0.0,0.001736,0.0,0.0,0.0,0.001736,0.0,0.0,0.020833,0.003472,...,0.0,0.0,0.010417,0.0,0.0,0.0,0.001736,0.006944,0.0,0.0,0.0,0.0,0.005208,0.0,0.0
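
The same alignment can also be expressed with `DataFrame.reindex`. This is a sketch on toy columns, not the notebook's method; `train_columns`, `new_city` and `aligned` are hypothetical names and values.

```python
import pandas as pd

# Columns the model was trained on (toy stand-in for europe_grouped.columns)
train_columns = ["Capital", "Bar", "Bistro", "Plaza"]

# New city with one known category and one category unseen during training
new_city = pd.DataFrame({"Capital": ["Toronto"], "Bar": [0.4], "Gastropub": [0.6]})

# Keep only the training columns, in training order; missing ones become 0.0
aligned = new_city.reindex(columns=train_columns, fill_value=0.0)
print(aligned)
```

Column order matters here, because the model sees only positions in the feature matrix and not the column names.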


In [202]:
world_grouped = test

In [203]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Capital']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))


world_venues_sorted = pd.DataFrame(columns=columns)
world_venues_sorted['Capital'] = world_grouped['Capital']

for ind in np.arange(world_grouped.shape[0]):
    world_venues_sorted.iloc[ind, 1:] = return_most_common_venues(world_grouped.iloc[ind, :], num_top_venues)

world_venues_sorted.head(10)

Unnamed: 0,Capital,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Toronto,Grocery Store,Bar,Theater,Movie Theater,Art Gallery,Music Venue,Clothing Store,Gastropub,Beer Bar,Italian Restaurant


## 2.4 Comparing the cities with the European Capitals 

In [204]:
world_clusters = world_grouped.drop('Capital',axis=1)

In [205]:
labels = kmeans.predict(world_clusters)
print(labels)
world_venues_sorted.insert(0,'Cluster Labels',labels)

[5]
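
`kmeans.predict` assigns each new row to the cluster whose centroid (a row of `kmeans.cluster_centers_`) is closest in Euclidean distance. The sketch below reproduces that rule with NumPy on made-up centroids, so the numbers are illustrative only.

```python
import numpy as np

# Two hypothetical centroids in a three-category frequency space
centroids = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
])

# One new city to classify (hypothetical frequencies)
new_point = np.array([[0.2, 0.2, 0.6]])

# Pairwise Euclidean distances, shape (n_points, n_centroids)
distances = np.linalg.norm(new_point[:, None, :] - centroids[None, :, :], axis=2)

# Index of the closest centroid for each point
nearest = distances.argmin(axis=1)
print(nearest)  # [1]
```

This also explains why the columns of the new data must match the training columns exactly: the distance is computed position by position.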


In [209]:
world_venues_sorted = world_venues_sorted.set_index('Capital')
world_venues_sorted.head()

Capital,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
Toronto,5,Grocery Store,Bar,Theater,Movie Theater,Art Gallery,Music Venue,Clothing Store,Gastropub,Beer Bar,Italian Restaurant


## 2.5 Results

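The model assigns Toronto to cluster 5, the cluster that in Part 1 contains Copenhagen, Helsinki, Reykjavík, Oslo, Stockholm and Bern. The map below shows the European capitals in that cluster.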

### MAP

In [210]:
adress = "Europe"
geolocator = Nominatim(user_agent = "europe_explorer",timeout=3)
local = geolocator.geocode(adress)
latitude = local.latitude
longitude = local.longitude
print(latitude,longitude)
cluster = europe_merged.loc[europe_merged['Cluster Labels']== world_venues_sorted.loc['Toronto','Cluster Labels']]
europe = folium.Map(location=[latitude,longitude], zoom_start = 3)
for lat,long,country,city in zip(cluster['Capital Latitude'],cluster['Capital Longitude'],cluster['Country'],cluster['Capital']):
    label = "{}, {}".format(city,country)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,long],
        radius = 5,
        popup = label,
        color= 'red',
        fill = True,
        fill_opacity=0.7,
        parse_html= False).add_to(europe)
europe

51.0 10.0
