The following code analyzes the city of Milan. It first finds geospatial points of interest and plots them on a folium map; then, through the Foursquare API, the venues are retrieved and clustered to observe their distribution across Milan. Finally, particular attention is given to two key aspects of Milan city life: fashion and nightlife. Through data analysis, these two districts are identified and located.

First, we import the libraries.

In [2]:
import numpy as np
import pandas as pd
import json
import urllib.request
from io import BytesIO
from zipfile import ZipFile
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.2 MB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.21.0-py_0

The following packages will be UPDATED:

  openssl                                 1.1.1f-h516909a_0 --> 1.1.1g-h51

The geospatial data shall be taken from the official site of the municipality of the city of Milan. The data is in json format and zipped. 

In [3]:
link_json = 'http://dati.comune.milano.it/dataset/5c6519f6-6d26-41c9-b53b-6106e08d1b90/resource/cc9a206d-aac1-42c7-a8cc-ba11e500e488/download/ds634_civici_coordinategeografiche_20200403_json.zip'
access_url = urllib.request.urlopen(link_json)

# it's a zipped json file, so we first unzip it and then read it
zf = ZipFile(BytesIO(access_url.read()))
zdata = zf.read('ds634_civici_coordinategeografiche_20200403.json')

# zdata is of type bytes, so we decode it to a string
s = str(zdata, 'utf-8')

# the string contains one JSON entry per line, so we split it into a list of lines (skipping empty ones)
lists = [line for line in s.split('\n') if line.strip()]
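
The unzip-and-split step above can be tried offline on a toy archive. The following is a minimal sketch using only the standard library; the helper name `read_json_lines_from_zip`, the member name `toy.json` and the two records are made up for illustration and are not part of the municipal dataset.

```python
import io
import json
import zipfile

def read_json_lines_from_zip(zip_bytes, member_name):
    """Return one dict per non-empty line of a JSON-lines file stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(member_name).decode('utf-8')
    # skipping blank lines avoids a json.loads error on a trailing newline
    return [json.loads(line) for line in text.split('\n') if line.strip()]

# build a tiny archive in memory (toy records, not real data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('toy.json', '{"CAP": "20121", "MUNICIPIO": "1"}\n{"CAP": null, "MUNICIPIO": "2"}\n')

records = read_json_lines_from_zip(buffer.getvalue(), 'toy.json')
print(len(records), records[0]['CAP'], records[1]['CAP'])
```

Note that a JSON `null` becomes Python's `None`, which is what a check for a missing postal code has to test for.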

Each datapoint is a building/house in Milan, so the dataset is very big. It is first read as a single string, then split into lines, each of which is parsed and inserted into a dataframe.

In [4]:
column_names = ['District', 'Postal Code', 'Road Name', 'Longitude', 'Latitude']

rows = []
for line in lists:
    # convert each line to a dict with json
    dic = json.loads(line)
    # from the dict extract the keys we want, but only keep the entries that have a postal code
    if dic['CAP'] is not None:
        rows.append({'District': dic['MUNICIPIO'],
                     'Postal Code': dic['CAP'],
                     'Road Name': dic['TIPO'] + ' ' + dic['DESCRITTIVO'],
                     'Longitude': dic['LONG_WGS84'],
                     'Latitude': dic['LAT_WGS84']})

# build the dataframe once: appending row by row is slow, and DataFrame.append was removed in pandas 2.0
df_milan = pd.DataFrame(rows, columns=column_names)
# make sure the coordinates are numeric before averaging (in case they were parsed as strings)
df_milan[['Longitude', 'Latitude']] = df_milan[['Longitude', 'Latitude']].apply(pd.to_numeric)

# average the coordinates of all the buildings on the same road
df_milan = df_milan.groupby(['District', 'Postal Code', 'Road Name'], as_index=False).mean()
df_milan.head()


Unnamed: 0,District,Postal Code,Road Name,Longitude,Latitude
0,1,20121,Bastioni DI PORTA NUOVA,9.189394,45.480053
1,1,20121,Bastioni DI PORTA VENEZIA,9.202396,45.475062
2,1,20121,Bastioni DI PORTA VOLTA,9.182029,45.479434
3,1,20121,Corso DI PORTA NUOVA,9.19171,45.475896
4,1,20121,Corso GIACOMO MATTEOTTI,9.1952,45.466907


The dataframe is reduced to district number 1 only, which is the city center. However, it is still a very big dataframe, so to ease the analysis and to ensure I have enough Foursquare calls, the dataframe is reduced to 100 elements. To ensure that the elements are evenly spread out in the city center, I shall create 100 clusters and pick one element from each cluster.

In [5]:
# We shall pick only district number 1, which is the city center
# (copy so that the column inserted below does not modify a view of df_milan)
df_centro = df_milan[df_milan['District'] == '1'].copy()

# To reduce the size of the dataset (so it can be passed to Foursquare) we cluster the data so it is distributed evenly
k = 100
# cluster on the coordinates only, so that the postal code values do not dominate the distances
df_cluster = df_centro[['Longitude', 'Latitude']]
kmeans = KMeans(init='k-means++', n_clusters=k, n_init=12)
kmeans.fit(df_cluster)
df_centro.insert(0, 'Cluster Labels', kmeans.labels_)

In [8]:
# from each of the 100 clusters we randomly select 1 element, so we are left with 100 evenly distributed points of interest
df_reduced = df_centro.groupby('Cluster Labels').apply(lambda x: x.sample(1)).reset_index(drop=True)
df_reduced = df_reduced.drop('District', axis=1)
df_reduced.head()

Unnamed: 0,Cluster Labels,Postal Code,Road Name,Longitude,Latitude
0,0,20121,Via LUIGI ALBERTINI,9.182543,45.474717
1,1,20145,Via GIOVANNI RANDACCIO,9.167483,45.476815
2,2,20154,Piazza ERCOLE LUIGI MORSELLI,9.174736,45.478654
3,3,20129,Corso VENEZIA,9.204959,45.47435
4,4,20123,Via PIETRO AZARIO,9.168378,45.460085
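
The "one element per cluster" selection can also be illustrated without pandas. This is a minimal standard-library sketch, not the notebook's method; the function name `one_per_cluster`, the `seed` parameter and the road names are hypothetical.

```python
import random
from collections import defaultdict

def one_per_cluster(items, labels, seed=None):
    """Pick one random item from each cluster label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)
    # one random representative per label, in label order
    return {label: rng.choice(members) for label, members in sorted(groups.items())}

roads = ['Road A', 'Road B', 'Road C', 'Road D', 'Road E']  # toy road names
labels = [0, 0, 1, 2, 2]                                    # toy cluster labels
picked = one_per_cluster(roads, labels, seed=0)
print(picked)
```

Fixing the seed makes the selection reproducible between runs; in pandas, `sample` accepts a `random_state` argument for the same purpose.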


Let's create a folium map to visualize these 100 points of interest in Milan

In [9]:
# let's visualize these 100 points on a map
address = 'Milan, Italy'
geolocator = Nominatim(user_agent='milan_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# now we can create a map centered around these coordinates:
map_mi = folium.Map(location=[latitude, longitude], zoom_start=12)

# adding markers to the map
for lat, lon, pc, nei, cluster in zip(df_reduced['Latitude'], df_reduced['Longitude'], df_reduced['Postal Code'], df_reduced['Road Name'], df_reduced['Cluster Labels']):
    label = folium.Popup(str(pc) + ' ' + str(nei) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon],radius=5,popup=label,color='blue',fill=True,fill_color='blue',fill_opacity=0.7).add_to(map_mi)


In [10]:
map_mi

We can see that, as a result of the clustering, the datapoints are spread fairly evenly across the city center. Now we can pass these 100 datapoints to Foursquare and find venues for each.

In [11]:
# let's find the venues for each point of interest in the city

# Foursquare credentials and query parameters
client_id = '3IY3C0UTVM2HTXALLM2TS15SIDCCLS3AEREPWW3ROYC0AKQE'
client_secret = 'ZIL15ZGLNYOH5WI3DGVYWAAB1YHEX4T0IM0FKAX2FKFRIFEQ'
version = '20180605'
radius = 500   # search radius around each point of interest
limit = 100    # maximum number of venues returned per request

def NearbyVenues(postal_code, roads, latitudes, longitudes):
    venues_list = []
    for pc, name, lat, lng in zip(postal_code, roads, latitudes, longitudes):
        # query the explore endpoint around the point of interest
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(client_id, client_secret, version, lat, lng, radius, limit)
        results = requests.get(url).json()
        results = results['response']['groups'][0]['items']
        # keep the name, coordinates and first category of each venue
        venues_list.append([(pc, name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal code', 'Road Name', 'Road Latitude', 'Road Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return (nearby_venues)

milan_venues = NearbyVenues(postal_code=df_reduced['Postal Code'], roads=df_reduced['Road Name'], latitudes=df_reduced['Latitude'], longitudes=df_reduced['Longitude'])

milan_venues.head()

Unnamed: 0,Postal code,Road Name,Road Latitude,Road Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,20121,Via LUIGI ALBERTINI,45.474717,9.182543,Garibaldi Crème,45.474355,9.183466,Ice Cream Shop
1,20121,Via LUIGI ALBERTINI,45.474717,9.182543,Temakinho Brera,45.474651,9.183356,Sushi Restaurant
2,20121,Via LUIGI ALBERTINI,45.474717,9.182543,La Prosciutteria,45.474152,9.183449,Sandwich Place
3,20121,Via LUIGI ALBERTINI,45.474717,9.182543,Piccolo Teatro Studio Melato,45.47257,9.182809,Theater
4,20121,Via LUIGI ALBERTINI,45.474717,9.182543,Sugarwax,45.47376,9.181654,Spa
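
The response handling above indexes straight into `response -> groups[0] -> items` and `categories[0]`, which raises an error if a request fails or a venue has no category. Below is a minimal, more defensive sketch of the same flattening. The function name `extract_venues` is hypothetical, and the sample payload only mimics the keys used in the cell above (the second venue is made up to show the empty-category case).

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # a venue without categories gets None instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'], venue['location']['lat'], venue['location']['lng'], category))
    return rows

sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Volt', 'location': {'lat': 45.456737, 'lng': 9.184066},
               'categories': [{'name': 'Nightclub'}]}},
    {'venue': {'name': 'Uncategorized place', 'location': {'lat': 45.46, 'lng': 9.19},
               'categories': []}},
]}]}}

venues = extract_venues(sample)
print(venues)
print(extract_venues({}))  # an empty or failed response yields an empty list
```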


To understand the data it can be grouped by venue category

In [35]:
# group the venues with one hot encoding and group by point of interest:

onehot = pd.get_dummies(milan_venues[['Venue Category']], prefix="", prefix_sep="")
onehot['Road Name'] = milan_venues['Road Name']
first_col = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[first_col]
milan_grouped = onehot.groupby('Road Name').mean().reset_index()
milan_grouped.head()

Unnamed: 0,Road Name,Abruzzo Restaurant,Accessories Store,African Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Store,Vietnamese Restaurant,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Bastioni DI PORTA VENEZIA,0.0,0.01087,0.032609,0.0,0.0,0.054348,0.01087,0.0,0.0,...,0.01087,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0
1,Corso DI PORTA VIGENTINA,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.114286,0.0,0.0,0.0,0.0,0.0
2,Corso VENEZIA,0.0,0.011236,0.044944,0.0,0.0,0.044944,0.0,0.0,0.0,...,0.011236,0.0,0.0,0.0,0.011236,0.0,0.0,0.0,0.0,0.0
3,Corso VITTORIO EMANUELE II,0.0,0.01,0.0,0.0,0.0,0.02,0.01,0.0,0.0,...,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0
4,Foro BUONAPARTE,0.0,0.018182,0.0,0.0,0.0,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.036364,0.0,0.0,0.0,0.0,0.0
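
Taking the mean of one-hot columns per road gives, for each road, the fraction of its venues that fall in each category (each row of the table sums to 1). A minimal standard-library sketch of the same computation, with a hypothetical helper `category_shares` and toy data:

```python
from collections import Counter, defaultdict

def category_shares(pairs):
    """Map each road to {category: fraction of that road's venues in the category}."""
    counts = defaultdict(Counter)
    for road, category in pairs:
        counts[road][category] += 1
    shares = {}
    for road, counter in counts.items():
        total = sum(counter.values())
        shares[road] = {category: n / total for category, n in counter.items()}
    return shares

# toy (road, venue category) pairs
pairs = [('Road A', 'Café'), ('Road A', 'Café'), ('Road A', 'Hotel'), ('Road A', 'Pub'),
         ('Road B', 'Boutique')]
shares = category_shares(pairs)
print(shares['Road A'])
```

A road with only a few venues produces large fractions from very little data, which is worth keeping in mind when these rows are later clustered.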


We can see the top 5 venue categories around each point of interest

In [38]:
# we create a dataframe with the top 5 categories for each road (point of interest)
def common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 5
indicators = ['st', 'nd', 'rd']
columns = ['Road Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
top_venues = pd.DataFrame(columns=columns)
top_venues['Road Name'] = milan_grouped['Road Name']
for ind in np.arange(milan_grouped.shape[0]):
    top_venues.iloc[ind, 1:] = common_venues(milan_grouped.iloc[ind, :], num_top_venues)

top_venues.head()

Unnamed: 0,Road Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Bastioni DI PORTA VENEZIA,Italian Restaurant,Hotel,Pizza Place,Art Gallery,African Restaurant
1,Corso DI PORTA VIGENTINA,Restaurant,Wine Bar,Pizza Place,Italian Restaurant,Bistro
2,Corso VENEZIA,Italian Restaurant,Pizza Place,Café,African Restaurant,Art Gallery
3,Corso VITTORIO EMANUELE II,Boutique,Plaza,Italian Restaurant,Sporting Goods Shop,Monument / Landmark
4,Foro BUONAPARTE,Italian Restaurant,Café,Plaza,Ice Cream Shop,Platform
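
The ranking and column-naming logic can be sketched in isolation. The helpers `ordinal` and `top_categories` below are hypothetical names, and the tie-breaking rule (alphabetical order among equal shares) is an assumption of this sketch rather than what `sort_values` guarantees.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

def top_categories(shares, n):
    """Return the n categories with the largest share, ties broken alphabetically."""
    ranked = sorted(shares.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

columns = [f'{ordinal(i)} Most Common Venue' for i in range(1, 6)]
print(columns)
print(top_categories({'Café': 0.2, 'Hotel': 0.5, 'Pub': 0.2, 'Plaza': 0.1}, 3))
```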


Again we can cluster these categories

In [39]:
# divide into 5 different clusters and fit
k = 5
milan_clusters = milan_grouped.drop('Road Name', axis=1)
kmeans = KMeans(init='k-means++', n_clusters=k, n_init=12)
kmeans.fit(milan_clusters)

# add the cluster label to the top-venues dataframe
top_venues.insert(0, 'Cluster Labels', kmeans.labels_)

# merge the points-of-interest dataframe with the top 5 dataframe on the road name
milan_merged = pd.merge(df_reduced, top_venues, on='Road Name')

milan_merged

Unnamed: 0,Cluster Labels_x,Postal Code,Road Name,Longitude,Latitude,Cluster Labels_y,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,20121,Via LUIGI ALBERTINI,9.182543,45.474717,3,Italian Restaurant,Ice Cream Shop,Cocktail Bar,Café,Japanese Restaurant
1,1,20145,Via GIOVANNI RANDACCIO,9.167483,45.476815,1,Italian Restaurant,Cocktail Bar,Pizza Place,Japanese Restaurant,Hotel
2,2,20154,Piazza ERCOLE LUIGI MORSELLI,9.174736,45.478654,1,Italian Restaurant,Cocktail Bar,Pizza Place,Chinese Restaurant,Tram Station
3,3,20129,Corso VENEZIA,9.204959,45.474350,3,Italian Restaurant,Pizza Place,Café,African Restaurant,Art Gallery
4,4,20123,Via PIETRO AZARIO,9.168378,45.460085,3,Italian Restaurant,Café,Pizza Place,Supermarket,Pub
...,...,...,...,...,...,...,...,...,...,...,...
95,95,20123,Via ZEBEDIA,9.187206,45.460736,2,Plaza,Italian Restaurant,Café,Hotel,Pizza Place
96,96,20123,Via NIRONE,9.178217,45.464816,2,Italian Restaurant,Café,Ice Cream Shop,Plaza,Sandwich Place
97,97,20123,Via BERNARDINO ZENALE,9.169980,45.464346,2,Italian Restaurant,Café,Ice Cream Shop,Plaza,Pastry Shop
98,98,20121,Via SOLFERINO,9.187849,45.477073,3,Italian Restaurant,Ice Cream Shop,Restaurant,Plaza,Japanese Restaurant
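
The `Cluster Labels_x` and `Cluster Labels_y` columns appear because both dataframes carry a `Cluster Labels` column: `_x` holds the geographic cluster from the first clustering and `_y` the venue-category cluster from the second. A small sketch with toy frames shows the default behaviour and how the `suffixes` argument of `pd.merge` could give clearer names; the suffix strings chosen here are only an illustration.

```python
import pandas as pd

# toy frames that both contain a 'Cluster Labels' column
geo = pd.DataFrame({'Cluster Labels': [0, 1], 'Road Name': ['Road A', 'Road B']})
venues = pd.DataFrame({'Cluster Labels': [3, 4], 'Road Name': ['Road A', 'Road B']})

# default suffixes for overlapping column names are '_x' (left) and '_y' (right)
default = pd.merge(geo, venues, on='Road Name')
# explicit suffixes make the origin of each column readable
named = pd.merge(geo, venues, on='Road Name', suffixes=(' (geo)', ' (venue)'))

print(list(default.columns))
print(list(named.columns))
```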


We can group the clusters to see what each cluster represents

In [40]:
# let's see what the 5 clusters represent
top_venues.insert(0, 'Cluster Labels_y', kmeans.labels_)
clusters = top_venues.groupby('Cluster Labels_y').agg(lambda x:x.value_counts().index[0])
clusters.drop(['Road Name'], axis=1, inplace=True)
clusters

Cluster Labels_y,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,Hotel,Hotel,Plaza,Italian Restaurant,Japanese Restaurant
1,1,Italian Restaurant,Cocktail Bar,Pizza Place,Ice Cream Shop,Hotel
2,2,Italian Restaurant,Plaza,Café,Hotel,Ice Cream Shop
3,3,Italian Restaurant,Ice Cream Shop,Café,Café,Pizza Place
4,4,Boutique,Plaza,Italian Restaurant,Women's Store,Monument / Landmark
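
The aggregation used above keeps, for every cluster and every column, the value that occurs most often. A minimal standard-library sketch of that idea for a single column, with a hypothetical helper name and toy rows:

```python
from collections import Counter, defaultdict

def most_frequent_per_cluster(rows):
    """rows: iterable of (cluster_label, value); return the most frequent value per label."""
    grouped = defaultdict(Counter)
    for label, value in rows:
        grouped[label][value] += 1
    return {label: counter.most_common(1)[0][0] for label, counter in grouped.items()}

# toy (cluster label, 1st most common venue) pairs
rows = [(0, 'Hotel'), (0, 'Hotel'), (0, 'Plaza'),
        (4, 'Boutique'), (4, 'Boutique'), (4, 'Plaza')]
summary = most_frequent_per_cluster(rows)
print(summary)
```

Because each column is summarized independently, the same category can show up in two positions of a cluster's row, as with "Hotel" in cluster 0 and "Café" in cluster 3 in the table above.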


We have three clear areas: the shopping area (cluster 4), the hotel area (cluster 0), and clusters 1, 2 and 3, which represent various combinations of restaurants, cocktail bars, etc.

These can be marked on a map

In [41]:
# let's visualize them on a map

address = 'Milan, Italy'
geolocator = Nominatim(user_agent='milan_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

clusters_milan = folium.Map(location=[latitude, longitude], zoom_start=11)

# colors for clusters
x = np.arange(5)
ys = [i + x + (i * x) ** 2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# adding markers to the map

for lat, lon, poi, cluster in zip(milan_merged['Latitude'], milan_merged['Longitude'], milan_merged['Road Name'], milan_merged['Cluster Labels_y']):
    label = folium.Popup(str(poi) + '\nCluster: ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon],radius=5,popup=label,color=rainbow[int(cluster)-1],fill=True,fill_color=rainbow[int(cluster)-1],fill_opacity=0.7).add_to(clusters_milan)

clusters_milan
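
The marker color is looked up with `rainbow[int(cluster)-1]`, so each cluster takes the color one position before its own label, and cluster 0 wraps around to the last color through Python's negative indexing. The sketch below makes that mapping explicit; the color names are only an approximate description of a five-step rainbow palette, assumed here for readability.

```python
# approximate names for a five-step rainbow palette (assumption, for illustration only)
palette = ['purple', 'blue', 'green', 'orange', 'red']

# same indexing as on the map: cluster - 1, where index -1 means the last element
cluster_colors = {cluster: palette[cluster - 1] for cluster in range(len(palette))}
print(cluster_colors)
# → {0: 'red', 1: 'purple', 2: 'blue', 3: 'green', 4: 'orange'}
```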

It is interesting to observe the distribution of the various categories.
It is possible to clearly observe Milan's famous fashion district in orange. The blue cluster represents the more high-end restaurants, whereas the purple cluster is where the cocktail bars and the nightlife are.

So let's dig deeper into these two key aspects of Milan, the high-end fashion and the nightlife.

In [42]:
df_boutique = milan_venues[milan_venues['Venue Category'] == 'Boutique']
df_clothing = milan_venues[milan_venues['Venue Category'] == 'Clothing Store']

df_boutique.head()

Unnamed: 0,Postal code,Road Name,Road Latitude,Road Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
31,20121,Via LUIGI ALBERTINI,45.474717,9.182543,Cos,45.472172,9.187444,Boutique
396,20135,Viale REGINA MARGHERITA,45.460149,9.206745,Etro,45.459495,9.20799,Boutique
481,20122,Largo AUGUSTO,45.46319,9.197318,Louis Vuitton,45.467176,9.196643,Boutique
494,20122,Largo AUGUSTO,45.46319,9.197318,Just Cavalli,45.467087,9.196686,Boutique
504,20122,Largo AUGUSTO,45.46319,9.197318,Louis Vuitton,45.465224,9.191796,Boutique


In [45]:
df_pubs = milan_venues[milan_venues['Venue Category'] == 'Pub']
df_clubs = milan_venues[milan_venues['Venue Category'] == 'Nightclub']
df_clubs.head()

Unnamed: 0,Postal code,Road Name,Road Latitude,Road Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
896,20122,Via SANTA CROCE,45.455196,9.183111,Volt,45.456737,9.184066,Nightclub
1339,20123,Via GIAN GIACOMO MORA,45.458909,9.17991,Volt,45.456737,9.184066,Nightclub
1406,20122,Via SAN SENATORE,45.457838,9.18975,Volt,45.456737,9.184066,Nightclub
1436,20122,Via SAN SENATORE,45.457838,9.18975,TOM - The Organic Market,45.456746,9.183933,Nightclub
2290,20121,Piazza SEMPIONE,45.475479,9.172686,Cavalli Club Milano,45.473403,9.173201,Nightclub


We can display these two categories on another map

In [46]:
# let's display the nightlife and fashion districts on a map
address = 'Milan, Italy'
geolocator = Nominatim(user_agent='milan_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
district_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# adding fashion boutiques (red), clothing stores (orange), pubs (blue) and clubs (purple)
layers = [(df_boutique, 'red'), (df_clothing, 'orange'), (df_pubs, 'blue'), (df_clubs, 'purple')]
for df_layer, color in layers:
    for lat, lon, shop, road in zip(df_layer['Venue Latitude'], df_layer['Venue Longitude'], df_layer['Venue'], df_layer['Road Name']):
        label = folium.Popup(str(shop) + '\n' + str(road), parse_html=True)
        folium.CircleMarker([lat, lon], radius=5, popup=label, color=color, fill=True, fill_color=color, fill_opacity=0.7).add_to(district_map)

district_map

We can clearly see a distinction on the map between these two aspects of Milan's city life. The boutiques and clothing shops, in red and orange respectively, are mainly distributed in the north-east part of the city center, known as the 'Quadrilatero della Moda' (literally 'fashion quadrilateral', Milan's fashion district), whereas the pubs and nightclubs (in blue and purple) are concentrated in the south-west of the map, in the famous 'Navigli' area.
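
The visual impression can be checked roughly with numbers. The sketch below uses only the five boutique rows and five nightclub rows printed earlier, so it is an illustration and not a result for the full dataset. It first removes venues that were returned for more than one point of interest (Volt appears three times), then compares the two centroids. All function names are hypothetical.

```python
def deduplicate(venues):
    """Drop repeated (name, lat, lng) tuples while keeping the original order."""
    seen, unique = set(), []
    for venue in venues:
        if venue not in seen:
            seen.add(venue)
            unique.append(venue)
    return unique

def centroid(venues):
    """Mean latitude and longitude of a list of (name, lat, lng) tuples."""
    return (sum(v[1] for v in venues) / len(venues),
            sum(v[2] for v in venues) / len(venues))

def relative_direction(a, b):
    """Compass quadrant of point a as seen from point b, both given as (lat, lng)."""
    north_south = 'north' if a[0] > b[0] else 'south'
    east_west = 'east' if a[1] > b[1] else 'west'
    return f'{north_south}-{east_west}'

# rows copied from the df_boutique and df_clubs previews
boutiques = [('Cos', 45.472172, 9.187444), ('Etro', 45.459495, 9.20799),
             ('Louis Vuitton', 45.467176, 9.196643), ('Just Cavalli', 45.467087, 9.196686),
             ('Louis Vuitton', 45.465224, 9.191796)]
clubs = [('Volt', 45.456737, 9.184066), ('Volt', 45.456737, 9.184066),
         ('Volt', 45.456737, 9.184066), ('TOM - The Organic Market', 45.456746, 9.183933),
         ('Cavalli Club Milano', 45.473403, 9.173201)]

unique_clubs = deduplicate(clubs)
direction = relative_direction(centroid(deduplicate(boutiques)), centroid(unique_clubs))
print(len(unique_clubs), direction)
```

On these rows the boutique centroid lies north-east of the nightclub centroid, which agrees with the reading of the map.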