# Introduction
This notebook compares major Italian cities with regard to their venue types. The analysis follows the same approach as the previous analyses in this course, which targeted neighbourhoods in two large cities.

In theory, the outcome of this analysis can be used to select one or more cities that have similar characteristics to other, known ones. For instance, it can be argued that if a particular business has succeeded in a given city, its potential for success is higher in a similar city than in a dissimilar one.

Unfortunately, this analysis is limited by the Foursquare API, which returns at most 100 results per query and therefore does not guarantee a representative sample for comparison purposes. We will accept this limitation, because the assignment explicitly requires using this API. As a consequence, the outcome should be read as illustrative rather than statistically significant.
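
To give a feel for why 100 venues per city is a thin sample, the sketch below computes the standard error of an observed category share under a simple binomial model. The 5% share and the assumption that venues form an independent random sample are illustrative assumptions, not properties of the Foursquare data.

```python
import math

def share_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n independent draws."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical example: a category that truly makes up 5% of a city's venues.
p = 0.05
for n in (100, 1000, 10000):
    se = share_standard_error(p, n)
    print(f"n={n:>5}: share = {p:.2%} +/- {1.96 * se:.2%} (approx. 95% interval)")
```

With n = 100 the half-width of the interval is about 4.3 percentage points, almost as large as the 5% share itself, so rare categories are estimated very coarsely.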

## Business problem
In order to expand a chain of Mexican restaurants in Italy, we will identify groups of similar cities among Italy's largest ones. We can then combine this with information about which existing branches succeeded and which did not, and choose the next city based on its similarity to cities where previous branches succeeded.

# Initial setup (libraries)

In [1]:
import pandas as pd
import numpy as np

In [2]:
import requests

In [3]:
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

## Foursquare API config

In [4]:
CLIENT_ID = 'O3R0SYR4X4XVKABB0Z5L2EDCNNX2XHCAA5BWAXK5S4QJSZ5P' # your Foursquare ID
CLIENT_SECRET = '3SYACT4SXOICXJNGT3G2OAF1G5LLK1X1BKO5VVZB2OI2BWX3' # your Foursquare Secret
VERSION = '20191201'

In [5]:
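# Test the Foursquare query
# The "explore" endpoint is assumed here, matching the cells below;
# `near` takes a place name instead of coordinates.
query = "near=Ilanz"
lat,lng = 47.377846, 8.540076
response = requests.get("https://api.foursquare.com/v2/venues/explore?client_id=" + CLIENT_ID + \
                        "&client_secret=" + CLIENT_SECRET + \
                        "&v=" + VERSION + \
                        "&" + query)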

In [6]:
lat,lng = 47.377846, 8.540076
radius = 10000
limit = 5000

In [7]:
url = "https://api.foursquare.com/v2/venues/explore?client_id=" + CLIENT_ID + \
                        "&client_secret=" + CLIENT_SECRET + \
                        "&v=" + VERSION + \
                        "&ll=" + f'{lat},{lng}' + \
                        "&radius=" + f'{radius}' + \
                        "&limit=" + f'{limit}'

response = requests.get(url)
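
As an alternative to concatenating strings, the request URL can be assembled from a dictionary of parameters, which also keeps credentials out of the notebook. This is a sketch only: the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions for illustration, and no request is sent.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, radius, limit, version="20191201"):
    """Build an explore URL; credentials come from (hypothetical) environment variables."""
    params = {
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in "ll" becomes %2C)
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

example_url = build_explore_url(47.377846, 8.540076, radius=10000, limit=100)
print(example_url)
```

The resulting string can be passed to `requests.get` in the same way as `url` above; `requests.get` also accepts such a dictionary directly through its `params` argument.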

In [146]:
url

'https://api.foursquare.com/v2/venues/explore?client_id=O3R0SYR4X4XVKABB0Z5L2EDCNNX2XHCAA5BWAXK5S4QJSZ5P&client_secret=3SYACT4SXOICXJNGT3G2OAF1G5LLK1X1BKO5VVZB2OI2BWX3&ll=45.46796000000006,9.18178000000006&v=20191201&radius=500&limit=103'

In [9]:
venues = response.json()['response']['groups'][0]['items']

In [10]:
nearby_venues = json_normalize(venues) # flatten JSON

In [11]:
nearby_venues.columns

Index(['referralId', 'reasons.count', 'reasons.items', 'venue.id',
       'venue.name', 'venue.location.address', 'venue.location.lat',
       'venue.location.lng', 'venue.location.labeledLatLngs',
       'venue.location.distance', 'venue.location.postalCode',
       'venue.location.cc', 'venue.location.city', 'venue.location.state',
       'venue.location.country', 'venue.location.formattedAddress',
       'venue.categories', 'venue.photos.count', 'venue.photos.groups',
       'venue.location.crossStreet', 'venue.location.neighborhood',
       'venue.venuePage.id'],
      dtype='object')

In [12]:
cleaned_venues = nearby_venues[['venue.name','venue.location.lat', 'venue.location.lng', 'venue.location.postalCode']]

In [13]:
cleaned_venues = cleaned_venues.join(nearby_venues['venue.categories'].str[0].apply(pd.Series)['name'])

In [14]:
cleaned_venues

Unnamed: 0,venue.name,venue.location.lat,venue.location.lng,venue.location.postalCode,name
0,Dachterrasse Hiltl,47.375686,8.539650,8001,Vegetarian / Vegan Restaurant
1,Läderach,47.376537,8.539653,8001,Chocolate Shop
2,Globus Delicatessa,47.375557,8.538357,,Gourmet Shop
3,Grande Café & Bar,47.375479,8.543395,8001,Bar
4,Lindenhof,47.373005,8.540883,8001,Pedestrian Plaza
...,...,...,...,...,...
95,Café Henrici,47.372516,8.543686,8001,Café
96,Raclette-Stube,47.374807,8.544690,8001,Swiss Restaurant
97,Old Fashion Bar,47.368625,8.540764,8001,Bar
98,Seebad Enge,47.361547,8.536754,8002,Pool


# Data
## Retrieve a list of major Italian cities

We will retrieve the list of all Italian cities with more than 50,000 inhabitants from the Wikipedia page:
https://en.wikipedia.org/wiki/List_of_cities_in_Italy

In [147]:
# Fill a list containing towns in this format:
# towns = ["Turin, Italy", "Milan, Italy"]

In [90]:
italian_cities = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_in_Italy")[0]["City"]
towns = italian_cities + ", Italy"

In [91]:
import geocoder  # geocoding client; the ArcGIS provider is used below

In [92]:
# Check if this provider works
geojson = geocoder.arcgis("Turin, Italy").json
geojson

{'address': 'Turin',
 'bbox': {'northeast': [45.14435000000004, 7.76193000000007],
  'southwest': [44.98035000000004, 7.59793000000007]},
 'confidence': 2,
 'lat': 45.06235000000004,
 'lng': 7.67993000000007,
 'ok': True,
 'quality': 'Locality',
 'raw': {'name': 'Turin',
  'extent': {'xmin': 7.59793000000007,
   'ymin': 44.98035000000004,
   'xmax': 7.76193000000007,
   'ymax': 45.14435000000004},
  'feature': {'geometry': {'x': 7.67993000000007, 'y': 45.06235000000004},
   'attributes': {'Score': 100, 'Addr_Type': 'Locality'}}},
 'score': 100,
 'status': 'OK'}

In [93]:
# The Google provider doesn't work; the ArcGIS one does, and it needs no while loop for retries
def getLatLng(address):
    print("Locating " + address + "... ", end="")
    g = geocoder.arcgis('{}'.format(address)).json
    lat = g["lat"]
    lng = g["lng"]
    print("done.")
    return lat, lng

In [94]:
# Test the geolocation function
lat1,lng1 = getLatLng("Turin, Italy")
lat1,lng1

Locating Turin, Italy... done.


(45.06235000000004, 7.67993000000007)

In [95]:
townsDf = pd.DataFrame(towns)

In [96]:
townsDf.columns = ["CityCountry"]
townsDf

Unnamed: 0,CityCountry
0,"Rome, Italy"
1,"Milan, Italy"
2,"Naples, Italy"
3,"Turin, Italy"
4,"Palermo, Italy"
...,...
139,"Battipaglia, Italy"
140,"Rho, Italy"
141,"Chieti, Italy"
142,"Scafati, Italy"


In [97]:
# Split the comma-separated list in 2 columns (I could have separated them at the beginning, 
# but it's good for learning). I will re-join them anyway further below...

townsDf1 = townsDf["CityCountry"].str.split(',', expand=True)

In [98]:
townsDf1.columns = ["City", "Country"]

In [99]:
townsDf1

Unnamed: 0,City,Country
0,Rome,Italy
1,Milan,Italy
2,Naples,Italy
3,Turin,Italy
4,Palermo,Italy
...,...,...
139,Battipaglia,Italy
140,Rho,Italy
141,Chieti,Italy
142,Scafati,Italy


## Get locations of towns in list

In [100]:
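# Geolocate cities with the ArcGIS service.
# Note: the key "lat","lng" creates a single column named ("lat", "lng") holding tuples; it is split further below.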
townsDf1["lat","lng"] = townsDf1.apply(lambda x: getLatLng(x["City"] + "," + x["Country"]), axis=1)

Locating Rome, Italy... done.
Locating Milan, Italy... done.
Locating Naples, Italy... done.
Locating Turin, Italy... done.
Locating Palermo, Italy... done.
Locating Genoa, Italy... done.
Locating Bologna, Italy... done.
Locating Florence, Italy... done.
Locating Bari, Italy... done.
Locating Catania, Italy... done.
Locating Venice, Italy... done.
Locating Verona, Italy... done.
Locating Messina, Italy... done.
Locating Padua, Italy... done.
Locating Trieste, Italy... done.
Locating Brescia, Italy... done.
Locating Taranto, Italy... done.
Locating Parma, Italy... done.
Locating Prato, Italy... done.
Locating Modena, Italy... done.
Locating Reggio Calabria, Italy... done.
Locating Reggio Emilia, Italy... done.
Locating Perugia, Italy... done.
Locating Livorno, Italy... done.
Locating Ravenna, Italy... done.
Locating Cagliari, Italy... done.
Locating Foggia, Italy... done.
Locating Rimini, Italy... done.
Locating Salerno, Italy... done.
Locating Ferrara, Italy... done.
Locating Sassari, 

In [101]:
townsDf1

Unnamed: 0,City,Country,"(lat, lng)"
0,Rome,Italy,"(41.90322000000003, 12.495650000000069)"
1,Milan,Italy,"(45.46796000000006, 9.18178000000006)"
2,Naples,Italy,"(40.840140000000076, 14.252260000000035)"
3,Turin,Italy,"(45.06235000000004, 7.67993000000007)"
4,Palermo,Italy,"(38.122070000000065, 13.361120000000028)"
...,...,...,...
139,Battipaglia,Italy,"(40.60963000000004, 14.980920000000026)"
140,Rho,Italy,"(45.53266000000008, 9.038920000000076)"
141,Chieti,Italy,"(42.346640000000036, 14.165270000000078)"
142,Scafati,Italy,"(40.74824000000007, 14.528350000000046)"


In [102]:
townsDf1[["lat", "lng"]] = pd.DataFrame(townsDf1[("lat", "lng")].tolist())
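
The two steps above (a tuple-valued column, then a split) can also be done in one pass. The sketch below uses a stand-in `fake_geocode` function so that it runs offline; in the notebook, `getLatLng` would take its place. The helper name and the two-city table are assumptions for illustration, with coordinates taken from the notebook output and rounded.

```python
import pandas as pd

# Stand-in for getLatLng so the sketch runs offline
_COORDS = {
    "Rome, Italy": (41.90322, 12.49565),
    "Milan, Italy": (45.46796, 9.18178),
}

def fake_geocode(address):
    return _COORDS[address]

cities = pd.DataFrame({"City": ["Rome", "Milan"], "Country": ["Italy", "Italy"]})

# result_type="expand" turns each returned (lat, lng) tuple into two columns
coords = cities.apply(
    lambda row: fake_geocode(f'{row["City"]}, {row["Country"]}'),
    axis=1,
    result_type="expand",
)
coords.columns = ["lat", "lng"]

# join aligns on the index, so each city keeps its own coordinates
located = cities.join(coords)
print(located)
```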

In [103]:
townsDf2 = townsDf1[["City", "Country", "lat", "lng"]]

In [104]:
townsDf2

Unnamed: 0,City,Country,lat,lng
0,Rome,Italy,41.90322,12.49565
1,Milan,Italy,45.46796,9.18178
2,Naples,Italy,40.84014,14.25226
3,Turin,Italy,45.06235,7.67993
4,Palermo,Italy,38.12207,13.36112
...,...,...,...,...
139,Battipaglia,Italy,40.60963,14.98092
140,Rho,Italy,45.53266,9.03892
141,Chieti,Italy,42.34664,14.16527
142,Scafati,Italy,40.74824,14.52835


## Get venues of each town (append all into a single dataframe)

In [105]:
CLIENT_ID = 'O3R0SYR4X4XVKABB0Z5L2EDCNNX2XHCAA5BWAXK5S4QJSZ5P' # your Foursquare ID
CLIENT_SECRET = '3SYACT4SXOICXJNGT3G2OAF1G5LLK1X1BKO5VVZB2OI2BWX3' # your Foursquare Secret
VERSION = '20191201'



print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: O3R0SYR4X4XVKABB0Z5L2EDCNNX2XHCAA5BWAXK5S4QJSZ5P
CLIENT_SECRET:3SYACT4SXOICXJNGT3G2OAF1G5LLK1X1BKO5VVZB2OI2BWX3


In [106]:
limit=103
radius=500

lat = townsDf2.loc[1, "lat"]
lng = townsDf2.loc[1, "lng"]

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, radius, limit)
url

'https://api.foursquare.com/v2/venues/explore?client_id=O3R0SYR4X4XVKABB0Z5L2EDCNNX2XHCAA5BWAXK5S4QJSZ5P&client_secret=3SYACT4SXOICXJNGT3G2OAF1G5LLK1X1BKO5VVZB2OI2BWX3&ll=45.46796000000006,9.18178000000006&v=20191201&radius=500&limit=103'

In [107]:
results = requests.get(url).json()

In [108]:
# function that extracts the category of the venue
def get_category_type(row):
    categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [109]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Piazza Castello,Plaza,45.468965,9.181312
1,Castello Sforzesco,Castle,45.469545,9.180424
2,Fontana del Castello Sforzesco,Fountain,45.469237,9.180917
3,Signorvino,Wine Bar,45.467095,9.183597
4,Giovanni Cova & C.,Bakery,45.468816,9.184121


In [110]:
# Function to collect information about the venues in each city.
# Iterates over the passed lists of city names, latitudes and longitudes;
# `limit` is read from the global variable defined above.

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
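
The list comprehension above assumes that every venue has at least one category and that the response always contains a `groups` entry. A more defensive variant might look like the sketch below; the helper name `extract_venue_rows` and the toy payload are assumptions for illustration, with only the field names taken from the code above.

```python
def extract_venue_rows(city, city_lat, city_lng, payload):
    """Turn one explore response (already parsed from JSON) into flat rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((
            city, city_lat, city_lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Toy payload mimicking the structure used above (values are made up)
toy_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 45.0, "lng": 9.0},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 45.1, "lng": 9.1},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Milan", 45.46796, 9.18178, toy_payload)
print(rows)
# An empty or error response yields no rows instead of raising
print(extract_venue_rows("Nowhere", 0.0, 0.0, {"response": {}}))
```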

In [111]:
venues = getNearbyVenues(names=townsDf2['City'],
                        latitudes=townsDf2['lat'],
                        longitudes=townsDf2['lng']
                        )



Rome
Milan
Naples
Turin
Palermo
Genoa
Bologna
Florence
Bari
Catania
Venice
Verona
Messina
Padua
Trieste
Brescia
Taranto
Parma
Prato
Modena
Reggio Calabria
Reggio Emilia
Perugia
Livorno
Ravenna
Cagliari
Foggia
Rimini
Salerno
Ferrara
Sassari
Latina
Giugliano in Campania
Monza
Syracuse
Bergamo
Pescara
Trento
Forlì
Vicenza
Terni
Bolzano
Novara
Piacenza
Ancona
Andria
Udine
Arezzo
Cesena
Lecce
Pesaro
Barletta
Alessandria
La Spezia
Pistoia
Pisa
Catanzaro
Guidonia Montecelio
Lucca
Brindisi
Torre del Greco
Treviso
Busto Arsizio
Como
Marsala
Grosseto
Sesto San Giovanni
Pozzuoli
Varese
Fiumicino
Casoria
Asti
Cinisello Balsamo
Caserta
Gela
Aprilia
Ragusa
Pavia
Cremona
Carpi
Quartu Sant'Elena
Lamezia Terme
Altamura
Imola
L’Aquila
Massa
Trapani
Viterbo
Cosenza
Potenza
Castellammare di Stabia
Afragola
Vittoria
Crotone
Pomezia
Vigevano
Carrara
Caltanissetta
Viareggio
Fano
Savona
Matera
Olbia
Legnano
Acerra
Marano di Napoli
Benevento
Molfetta
Agrigento
Faenza
Cerignola
Moncalieri
Foligno
Manfredonia
Ti

In [112]:
venues

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rome,41.90322,12.49565,The Romehello,41.903395,12.493861,Hostel
1,Rome,41.90322,12.49565,The St. Regis Rome,41.904072,12.494873,Hotel
2,Rome,41.90322,12.49565,Piazza della Repubblica,41.902422,12.496367,Plaza
3,Rome,41.90322,12.49565,Basilica di Santa Maria degli Angeli e dei Mar...,41.903002,12.496685,Church
4,Rome,41.90322,12.49565,la Feltrinelli International,41.903610,12.495308,Bookstore
...,...,...,...,...,...,...,...
3643,Scandicci,43.75807,11.18469,Bar Marisa,43.759733,11.181011,Café
3644,Scandicci,43.75807,11.18469,Farmacia Comunale n. 3,43.759583,11.180561,Pharmacy
3645,Scandicci,43.75807,11.18469,Conad,43.754290,11.183523,Supermarket
3646,Scandicci,43.75807,11.18469,Rosticceria Cinese Il Panda,43.754501,11.188424,Chinese Restaurant


# Methodology
The Foursquare API provides the category of each venue. We want to use this information to find similarities between cities. 

Therefore, we created a data set (DataFrame) which lists the frequency of each category for each city. 

This data was then used to cluster the cities with the k-means machine learning algorithm, using 5 clusters (a number chosen arbitrarily).
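
The number of clusters in k-means has to be chosen by the analyst. One common way to guide that choice is to compare silhouette scores across candidate values. The sketch below does this on a small synthetic data set, not on the notebook's city data; the blob positions and the candidate range are assumptions chosen so that the example is easy to follow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=37).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real venue frequencies the scores are usually much less clear-cut than on this toy data, so the result is a hint rather than a definitive answer.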

## Get categories of venues for each town
This section contains a few exploratory checks to verify that the data is consistent and contains a sufficient number of venue categories.

In [113]:
# Let's have a look at how many venues of each category there are in each city
venues.groupby(["Venue Category", "City"]).count().head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,City,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accessories Store,Milan,1,1,1,1,1
Airport,Guidonia Montecelio,1,1,1,1,1
American Restaurant,Bergamo,1,1,1,1,1
American Restaurant,Parma,2,2,2,2,2
Argentinian Restaurant,Gallarate,1,1,1,1,1
Argentinian Restaurant,Sanremo,1,1,1,1,1
Argentinian Restaurant,Turin,1,1,1,1,1
Art Gallery,Bergamo,1,1,1,1,1
Art Gallery,Catanzaro,1,1,1,1,1
Art Gallery,Como,1,1,1,1,1


In [118]:
venues.groupby(["City", "Venue Category"]).count()["Venue"].head(30)

City         Venue Category    
Acerra       Café                  1
             Flower Shop           1
             Gastropub             1
             Museum                1
Acireale     Café                  5
             Italian Restaurant    3
             Plaza                 2
             Pub                   2
             Restaurant            1
Afragola     Clothing Store        1
             Locksmith             1
             Mobile Phone Shop     1
             Shipping Store        1
Agrigento    Café                  2
             Food                  1
             Italian Restaurant    3
             Pub                   1
             Restaurant            1
             Theater               1
Alessandria  Bar                   2
             Breakfast Spot        1
             Café                  2
             Cosmetics Shop        1
             Dessert Shop          1
             Ice Cream Shop        1
             Italian Restaurant    2
      

## Get frequencies of categories for each town

In [119]:
# Get the one-hot encoding of venue categories
cities_onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

In [120]:
# Add back the city
cities_onehot["City"] = venues["City"]

In [121]:
cities_onehot

Unnamed: 0,Accessories Store,Airport,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,...,Vegetarian / Vegan Restaurant,Veneto Restaurant,Video Game Store,Video Store,Warehouse Store,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3643,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3644,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3645,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3646,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [122]:
# Move City column to first column
fixed_columns = [cities_onehot.columns[-1]] + list(cities_onehot.columns[:-1])

cities_onehot = cities_onehot[fixed_columns]

In [123]:
cities_onehot.head()

Unnamed: 0,Women's Store,Accessories Store,Airport,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,...,Umbrian Restaurant,Vegetarian / Vegan Restaurant,Veneto Restaurant,Video Game Store,Video Store,Warehouse Store,Wine Bar,Wine Shop,Winery,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [124]:
cities_grouped = cities_onehot.groupby("City").mean().reset_index()
cities_grouped

Unnamed: 0,City,Women's Store,Accessories Store,Airport,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Umbrian Restaurant,Vegetarian / Vegan Restaurant,Veneto Restaurant,Video Game Store,Video Store,Warehouse Store,Wine Bar,Wine Shop,Winery,Wings Joint
0,Acerra,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
1,Acireale,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
2,Afragola,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
3,Agrigento,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
4,Alessandria,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.052632,0.0,0.0,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,Viareggio,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
140,Vicenza,0.0,0.0,0.0,0.0,0.0,0.0,0.027397,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.054795,0.0,0.013699,0.0
141,Vigevano,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
142,Viterbo,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
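
The one-hot encoding followed by a per-city mean is equivalent to a row-normalised cross-tabulation. The sketch below shows this on a toy table with made-up venues; only the column names match the `venues` DataFrame above.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "City": ["Rome", "Rome", "Rome", "Rome", "Milan", "Milan"],
    "Venue Category": ["Bar", "Bar", "Plaza", "Hotel", "Bar", "Bakery"],
})

# Route used in the notebook: one-hot encode, then average per city
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot["City"] = toy_venues["City"]
via_onehot = onehot.groupby("City").mean()

# Equivalent shortcut: cross-tabulate and normalise each row to sum to 1
via_crosstab = pd.crosstab(
    toy_venues["City"], toy_venues["Venue Category"], normalize="index"
)

print(via_crosstab)
```

Each row sums to 1, so a value is the share of a city's retrieved venues that fall into that category.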


## Cluster towns based on frequencies of categories

In [125]:
# Run k-means to cluster the cities. Because the API caps each query at 100 results,
# the venues retrieved are not a statistically representative sample, so the
# resulting clusters carry little statistical weight.
# This exercise is carried out mainly to complete the Coursera assignment.

kClusters = 5

cities_grouped_clustering = cities_grouped.drop("City", axis=1)

In [126]:
from sklearn.cluster import KMeans
kMeans = KMeans(n_clusters=kClusters, random_state=37)
kMeans.fit(cities_grouped_clustering)
kMeans.labels_

array([3, 3, 4, 4, 4, 3, 4, 1, 4, 4, 4, 4, 1, 3, 2, 4, 1, 4, 1, 4, 1, 3,
       4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 2, 4, 3, 3, 3, 3, 4, 3, 4, 4, 4, 4,
       3, 0, 1, 3, 4, 1, 4, 4, 4, 1, 1, 4, 1, 4, 4, 4, 4, 2, 4, 4, 3, 1,
       4, 1, 4, 4, 4, 3, 3, 4, 4, 4, 1, 3, 4, 1, 1, 1, 3, 4, 1, 4, 4, 2,
       4, 4, 4, 1, 4, 4, 4, 4, 1, 4, 3, 1, 3, 1, 1, 4, 1, 4, 4, 1, 4, 1,
       3, 4, 1, 1, 2, 4, 4, 1, 4, 1, 4, 4, 4, 1, 1, 1, 4, 0, 4, 4, 1, 4,
       4, 4, 4, 1, 1, 4, 4, 4, 4, 1, 4, 3], dtype=int32)

In [127]:
cities_grouped.insert(0, "Cluster label", kMeans.labels_)

In [128]:
cities_grouped_geolocated = cities_grouped.join(townsDf2.set_index("City"), on="City")
cities_grouped_geolocated

Unnamed: 0,Cluster label,City,Women's Store,Accessories Store,Airport,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Video Store,Warehouse Store,Wine Bar,Wine Shop,Winery,Wings Joint,Country,lat,lng
0,3,Acerra,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,40.94717,14.37254
1,3,Acireale,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,37.61251,15.16561
2,4,Afragola,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,40.92312,14.31010
3,4,Agrigento,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,37.31087,13.57650
4,4,Alessandria,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.052632,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,44.90724,8.61156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,4,Viareggio,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,43.88138,10.23526
140,4,Vicenza,0.0,0.0,0.0,0.0,0.0,0.0,0.027397,0.0,...,0.000000,0.0,0.0,0.054795,0.0,0.013699,0.0,Italy,45.54800,11.54947
141,1,Vigevano,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,45.31706,8.85870
142,4,Viterbo,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,Italy,42.42094,12.10982


In [129]:
# Drop one-hot columns from the new dataframe, we don't need them anymore
columns = ["Cluster label", "City", "lat", "lng"]
cities_grouped_geolocated = cities_grouped_geolocated[columns]
cities_grouped_geolocated

Unnamed: 0,Cluster label,City,lat,lng
0,3,Acerra,40.94717,14.37254
1,3,Acireale,37.61251,15.16561
2,4,Afragola,40.92312,14.31010
3,4,Agrigento,37.31087,13.57650
4,4,Alessandria,44.90724,8.61156
...,...,...,...,...
139,4,Viareggio,43.88138,10.23526
140,4,Vicenza,45.54800,11.54947
141,1,Vigevano,45.31706,8.85870
142,4,Viterbo,42.42094,12.10982


# Results
In this section, we display the major Italian cities, clustered by their similarity in venue categories.

Similar cities are shown in the same color. No particular pattern is immediately evident, possibly because of the Foursquare API limit of 100 results mentioned above, which makes the sample too small to be significant.

## Analyze clusters to see similarities between cities

In [145]:
# create map
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors


lat_Rome = 41.9028
lng_Rome = 12.4964
map_clusters = folium.Map(location=[lat_Rome, lng_Rome], zoom_start=5.5)

# set color scheme for the clusters
x = np.arange(kClusters)
ys = [i + x + (i*x)**2 for i in range(kClusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cities_grouped_geolocated['lat'], cities_grouped_geolocated['lng'], cities_grouped_geolocated['City'], cities_grouped_geolocated['Cluster label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Discussion
Setting aside the Foursquare API limitation mentioned above, we can observe that a few clusters are small and do not span the whole country, for example:
* a small cluster including only one town in Puglia and one in Sardinia (light blue, cluster 2);
* one including Torre del Greco, in Campania, and Crotone, a town in Calabria (red, cluster 0).

It could be interesting to further analyze the specific reasons for this clustering.

We can now match this information against the client's list of restaurants and cross-check which clusters the well-performing ones fall into. This will narrow down the choice of a city for the chain's next restaurant.
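
As a sketch of that cross-check, the snippet below joins a table of cluster labels with a table of branch outcomes and lists the cities that share a cluster with the successful branches. The `branches` table, its `successful` column and all values in both tables are hypothetical; the notebook contains no such client data.

```python
import pandas as pd

# Hypothetical cluster assignments, shaped like cities_grouped_geolocated
clusters = pd.DataFrame({
    "City": ["Rome", "Milan", "Turin", "Bari", "Parma"],
    "Cluster label": [4, 4, 1, 1, 4],
})

# Hypothetical client data: existing branches, 1 = performing well, 0 = not
branches = pd.DataFrame({
    "City": ["Rome", "Turin", "Bari"],
    "successful": [1, 0, 0],
})

# Left join keeps every city; cities without a branch get NaN in "successful"
merged = clusters.merge(branches, on="City", how="left")

# Success rate per cluster, computed only over cities that have a branch
with_branch = merged.dropna(subset=["successful"])
success_rate = with_branch.groupby("Cluster label")["successful"].mean()
best_cluster = success_rate.idxmax()

# Candidate cities: in the best cluster and without a branch yet
candidates = merged[(merged["Cluster label"] == best_cluster) & merged["successful"].isna()]
print(candidates["City"].tolist())
```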

In a subsequent step, we could limit our search only to the cities in the chosen cluster, and further cluster them based on the same principles, or using other criteria.

# Conclusion
It can be argued that the methodology adopted in this work can be used to recognize "business similarities" between locations. 

This information can therefore help in choosing a city for the next restaurant location, reducing the risk of failure.

As an example, if we limit our search to the island of Sardinia, we immediately see that Sassari is "closer" to Cagliari than Olbia is. Therefore, if an existing restaurant of the chain in Cagliari is already doing well, we would suggest opening the next one in Sassari.
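
One way to make "closer" concrete is to measure the distance between the cities' category-frequency vectors. The sketch below uses Euclidean distance on made-up frequencies for three cities; the numbers and the four categories are hypothetical and do not come from the notebook's data.

```python
import numpy as np

# Hypothetical frequency vectors over four categories (each row sums to 1)
categories = ["Bar", "Italian Restaurant", "Hotel", "Beach"]
freq = {
    "Cagliari": np.array([0.30, 0.40, 0.20, 0.10]),
    "Sassari":  np.array([0.35, 0.40, 0.15, 0.10]),
    "Olbia":    np.array([0.10, 0.20, 0.30, 0.40]),
}

def distance(a, b):
    """Euclidean distance between two cities, the metric underlying k-means."""
    return float(np.linalg.norm(freq[a] - freq[b]))

d_sassari = distance("Cagliari", "Sassari")
d_olbia = distance("Cagliari", "Olbia")
print(f"Cagliari-Sassari: {d_sassari:.3f}")
print(f"Cagliari-Olbia:   {d_olbia:.3f}")
```

In this toy setting Sassari is the nearer of the two to Cagliari; with the real data, the same comparison could be run on the rows of `cities_grouped`.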

Choosing the right city will be only the first step in this decision; further steps may be necessary, such as:
* cross-checking with market data for existing restaurants of the chain
* further restricting the number of choices using the same algorithm with the same data type
* considering other types of information, e.g. statistical data about tourism
* identifying the presence of "marker" venues, such as similar restaurants, which could be good predictors for our restaurant's success.
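
The last point could be explored with the data already collected. The sketch below counts venues from a set of "marker" categories per city on a toy table; the choice of marker categories and the sample rows are assumptions for illustration.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "City": ["Rome", "Rome", "Rome", "Milan", "Milan", "Bari"],
    "Venue Category": [
        "Mexican Restaurant", "Hotel", "Burrito Place",
        "Mexican Restaurant", "Plaza", "Pharmacy",
    ],
})

# Hypothetical marker categories, i.e. venue types similar to our restaurant
marker_categories = {"Mexican Restaurant", "Burrito Place", "Taco Place"}

is_marker = toy_venues["Venue Category"].isin(marker_categories)
marker_counts = (
    toy_venues.assign(is_marker=is_marker)
    .groupby("City")["is_marker"]
    .sum()
    .sort_values(ascending=False)
)
print(marker_counts)
```

Whether a high count signals proven demand or a saturated market is a business question that this count alone cannot answer.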